Speech recognition is used in many areas of modern life, and most people come into contact with it every day. This article presents a number of speech recognition applications in more detail.

1. Mobile Applications

1.1 Siri

The voice recognition feature of Apple’s mobile operating system iOS is called Siri. The name Siri is Scandinavian, a short form of the Norse name Sigrid meaning "beautiful woman who leads you to victory", and comes from the intended name for the original developer's first child. Later, the name was also read as an acronym for "Speech Interpretation and Recognition Interface". Siri was originally intended to be a standalone application for iOS, Android and Windows Phone, created by Siri, Inc., which was acquired by Apple on April 28, 2010. It was integrated into iOS 5 for the iPhone 4S and above on October 14, 2011. A key feature was that its artificial intelligence was designed to adapt to each user's individual language use, so that results become individualized [1].

Siri's speech recognition engine is provided by Nuance Communications, a speech technology company. Siri consists of two major parts: the user interface and the processing. The interface is simply what is presented on the screen while the user interacts with the software. The iPhone or iPad records the speech, preprocesses it and sends it to remote servers. The actual speech recognition is performed on these servers, and the result is sent back to the device, where it is presented to the user. Siri therefore only works if the device has an active internet connection [2].


1.2 Google Now

Google Now is speech recognition software developed by Google. It is integrated into the Google Search mobile applications for Android (Google's mobile operating system) and iOS, as well as the Chrome web browser for desktop computers. It was first included in Android 4.1 ("Jelly Bean"), which launched on July 9, 2012. The service later became available for iOS on April 29, 2013 via an update to the Google Search app, and for Google Chrome on March 24, 2014.

To deliver the most relevant results, Google Now uses the user's web search history and even the location history of the device [3]. Furthermore, it recognizes actions that the user performs frequently in order to display more relevant information in the form of so-called "cards". The system takes advantage of Google's "Knowledge Graph" project, a system used to assemble more detailed search results by analyzing their meaning and connections. Like Apple’s Siri, Google Now uses a natural language user interface to interact with the user and delegates requests to a set of web servers where the actual speech recognition takes place [4].


1.3 Microsoft Cortana

Microsoft's speech recognition software is called Microsoft Cortana. It is named after an artificial intelligence character in Microsoft's "Halo" video game series and is voiced by Jen Taylor [5]. Cortana was introduced as a key component of upcoming versions of Windows Phone and Windows [6], including Windows 10 [7]. As of the beginning of 2015, it is available as a beta version to all users of Windows Phone 8.1 in the United States, the United Kingdom and China. Cortana replaces the previous Bing Search application and is deeply integrated into the operating system: pressing the search button on the device activates Cortana instead of the Bing Search app [8]. As with Apple's and Google's assistants, Cortana provides a natural language user interface to interact with the user and sends requests to a set of web services where the actual speech recognition takes place.


1.4 Samsung S Voice

S Voice is speech recognition software that is available only as a built-in application on the Samsung Galaxy S III and newer, and on several tablets such as the Galaxy Tab 3 7.0 and above. The application uses a natural language user interface to interact with the user and delegates requests to a set of web services where all relevant steps of speech recognition are processed. The device itself only captures the speech and pre-processes it [9]. Earlier versions of S Voice are based on the Vlingo software (see 1.6) [10], whereas the Galaxy S5 and newer Samsung devices use Nuance instead of Vlingo for S Voice.



1.5 SILVIA

Symbolically Isolated Linguistically Variable Intelligence Algorithms (SILVIA) is a core platform technology developed by Cognitive Code. The speech recognition technology can run in the cloud, in a mobile application, or on a server [11]. SILVIA can be used not only in smartphones but also in voice search and other voice-related applications. Unlike some similar technologies, SILVIA responds to the user in complete sentences rather than in single words or short phrases [12].

SILVIA was developed to recognize and interpret human input in the form of text, speech and other interactions. It consists of graphical user interface tools and an array of API scripts which can be embedded in other applications [13]. In contrast to other similar technologies, SILVIA can be used on different operating systems and computing platforms, which allows easy and seamless data transfer. In addition, SILVIA uses a non-command-based system: inputs are based on normal conversational language, whereas Siri and Google Now use pre-coded commands [14].


1.6 Further Applications

The following list shows further mobile speech recognition software:


2. Speech Recognition Software for Windows

2.1 Built-in speech recognition

Windows Speech Recognition by Microsoft is a speech recognition system built into Windows Vista, Windows 7, and Windows 8. It uses version 8.0 of Microsoft's speech recognition engine, which is also used in Kinect. It allows the user to control the computer with spoken commands and to dictate text without touching the computer [15].

Programs that do not expose commands to the user can still be controlled by asking the system to overlay numbers on the interface elements. Speaking a number activates the corresponding element, and even mouse clicks can be controlled through speech. Windows Speech Recognition is reported to achieve high recognition accuracy and offers the user a set of commands that assist in dictating text. Currently, the application supports several languages, including English, Spanish, German, French, Japanese and Chinese [16].


2.2 Third-party software

The following list shows speech recognition software running on Windows machines:


3. Speech Recognition Software for Linux

The following list shows speech recognition software running on Linux machines:

  • ViaVoice was offered by IBM and has since been acquired by Nuance Communications [24].
  • Speech uses Google's speech recognition engine [25].
  • FreeSpeech uses CMU Sphinx's tools (see 5.1) [26].
  • Vedics (Voice Enabled Desktop Interaction and Control System) is a speech assistant for GNOME Environments [27].
  • Xvoice requires ViaVoice (see above) [28].
  • GnomeVoiceControl is a dialogue system to control the GNOME Desktop [29].
  • NatI is a multi-language voice control system written in Python [30].
  • CVoiceControl is a KDE and X Window independent version of its predecessor KVoiceControl [31].
  • Sphinxkeys is a voice control system written in Python [32].
  • Open Mind Speech, a part of the Open Mind Initiative, aims to develop free (GPL) speech recognition tools and applications, as well as collect speech data [33].
  • PerlBox is a Perl-based voice control and speech output system [34].
  • VoxForge uses speech recognition engines like CMU Sphinx (see 5.1) or Julius (see 5.3) [35].
  • Simon uses CMU Sphinx's tools (see 5.1) and/or Julius (see 5.3) [36].
  • Speeral is a set of speech recognition tools developed at the University of Avignon [37].


4. Speech Recognition Software for Mac 

The following list shows speech recognition software running on Macintosh systems:


5. Open Source Acoustic Models

5.1 CMU Sphinx

CMU Sphinx is the general term for a group of speech recognition systems developed at Carnegie Mellon University. Sphinx is a continuous-speech, speaker-independent recognition system that comes in four software systems: Sphinx 1 to 4. All systems make use of Hidden Markov Models and N-gram language models [44].
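
The N-gram idea mentioned above can be illustrated with a small bigram model, which estimates the probability of a word given the word before it. The following is only a minimal sketch and not Sphinx's implementation: the function name `train_bigram_model`, the toy corpus and the sentence markers `<s>` and `</s>` are assumptions made for this example, and no smoothing is applied.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by maximum likelihood.

    `sentences` is a list of tokenized sentences (lists of words).
    Hypothetical helper for illustration only; no smoothing is applied.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Mark sentence start and end so that they can be predicted too.
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    # Relative frequency: count(prev, word) / count(prev as history).
    return {
        (prev, word): count / history_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


# Toy corpus, invented for this example.
corpus = [
    ["call", "my", "mother"],
    ["call", "my", "office"],
    ["open", "my", "calendar"],
]
model = train_bigram_model(corpus)
print(model[("call", "my")])    # "my" always follows "call" in the corpus
print(model[("my", "mother")])  # "mother" follows "my" in one of three cases
```

A recognizer can use such probabilities to prefer word sequences that are likely in the language over acoustically similar but unlikely ones. Real systems use higher orders, such as trigrams, together with smoothing and back-off for unseen word sequences.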


5.2 HTK

HTK (Hidden Markov Model Toolkit) is a software toolkit for handling Hidden Markov Models. It was developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED) with speech recognition in mind, but can be used in other fields where pattern recognition is needed, such as speech synthesis, character recognition and DNA sequencing [45].
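
A typical operation on a Hidden Markov Model is finding the most probable sequence of hidden states for a series of observations, which is commonly done with the Viterbi algorithm. The sketch below shows this for a tiny discrete HMM. It does not use HTK or mirror its API: the state names, observation symbols and all probabilities are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, probability of that path)."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that maximizes the path probability.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Start from the best final state and follow the predecessors back.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, probability


# Toy model (assumed values): is a frame silence or speech, given its energy?
states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.3, "high": 0.7},
}

path, probability = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
print(path)  # ['silence', 'speech', 'speech']
```

In speech recognition the hidden states usually model parts of phonemes and the observations are acoustic feature vectors rather than discrete symbols, but the decoding principle is the same.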


5.3 Julius

Julius is a large vocabulary continuous speech recognition (LVCSR) decoder for speech-related researchers and developers. It can perform almost real-time decoding on most current PCs in a 60k-word dictation task using a word N-gram and context-dependent HMMs. The main platform is Linux, along with other Unix workstations, and it also works on Windows. Julius is open source and distributed under a revised BSD-style license. To run the Julius recognizer, a language model and an acoustic model for the desired language are needed. Julius adopts acoustic models in HTK ASCII format, a pronunciation dictionary in HTK-like format, and word 3-gram language models in ARPA standard format. Julius is currently distributed only with Japanese models, but the VoxForge project is working on creating English acoustic models to be used with the Julius speech recognition engine [46].
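
The ARPA format mentioned above is a plain-text format that lists each N-gram together with its base-10 log probability and an optional back-off weight, grouped into sections by N-gram order. The following sketch reads such a file into a dictionary. It is not Julius's loader: the function name `parse_arpa` and the sample model with its values are assumptions made for this example.

```python
def parse_arpa(text):
    """Parse ARPA-format text into {order: {ngram: (log10 prob, backoff)}}.

    Hypothetical helper for illustration; it performs no validation
    against the counts declared in the \\data\\ header.
    """
    ngrams = {}
    order = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\3-grams:" starts a new N-gram order.
            order = int(line[1:line.index("-")])
            ngrams[order] = {}
            continue
        if order is None:
            continue  # ignore anything before the first section
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional and follows the words.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (log_prob, backoff)
    return ngrams


# Tiny sample model; the words and values are invented for this example.
SAMPLE_ARPA = r"""
\data\
ngram 1=3
ngram 2=2
ngram 3=1

\1-grams:
-0.4771 <s> -0.3010
-0.4771 call -0.3010
-0.4771 home

\2-grams:
-0.3010 <s> call -0.1761
-0.3010 call home

\3-grams:
-0.1761 <s> call home

\end\
"""

arpa_model = parse_arpa(SAMPLE_ARPA)
print(arpa_model[3][("<s>", "call", "home")])  # (-0.1761, None)
```

The highest-order section carries no back-off weights, which is why the 3-gram entry has `None` as its second value.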


5.4 Kaldi

Kaldi is a free toolkit for speech recognition, licensed under the Apache License v2.0 [47]. It is intended for use by speech recognition researchers and supports linear transforms, mutual information, MCE discriminative training, feature-space discriminative training, and deep neural networks [48].



[1] http://theweek.com/articles/476851/apples-siri-got-name (accessed January, 19 2015)

[2] http://www.pocket-lint.com/news/112346-what-is-siri-iphone-4s (accessed February, 9 2015)

[3] http://bgr.com/2012/11/15/google-now-wins-popular-science-award/ (accessed February, 9 2015)

[4] http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html (accessed January, 19 2015)

[5] http://www.usatoday.com/story/tech/2014/04/02/voice-of-cortana-jen-taylor/7223357/ (accessed January, 19 2015)

[6] http://www.zdnet.com/article/microsofts-cortana-alternative-to-siri-makes-a-video-debut/ (accessed January, 19 2015)

[7] http://www.theverge.com/2015/1/21/7866741/cortana-windows-10-announced-microsoft (accessed February, 11 2015)

[8] http://www.windowscentral.com/developers-leak-new-features-windows-phone-81-sdk (accessed February, 11 2015)

[9] http://www.samsung.com/us/support/supportOwnersHowToGuidePopup.do?howto_guide_seq=6973&prd_ia_cd=N0000004&map_seq=47859 (accessed February, 12 2015)

[10] http://www.gizmofusion.com/2012/05/samsungs-new-s-voice-is-just-a-skinned-version-of-vlingo-labs-beta-on-google-play/ (accessed February, 12 2015)

[11] http://www.forbes.com/sites/karstenstrauss/2012/07/09/riding-the-wave-of-artificial-intelligence/ (accessed February, 12 2015)

[12] http://dondodge.typepad.com/the_next_big_thing/2007/09/techcrunch40--1.html (accessed February, 12 2015)

[13] http://www.engadget.com/2007/09/17/cognitive-code-shows-off-silvia-artificial-intelligence-platform (accessed February, 12 2015)

[14] http://library.fora.tv/2008/02/13/SILVIA_Artificial_Intelligence_Platform (accessed February, 16 2015)

[15] http://windows.microsoft.com/en-us/windows/use-speech-recognition-operate-windows-programs#1TC=windows-7 (accessed February, 16 2015)

[16] http://www.microsoft.com/enable/products/windowsvista/speech.aspx (accessed February, 16 2015)

[17] http://www.nuance.com/dragon/index.htm (accessed February, 16 2015)

[18] http://www.brainasoft.com/braina/ (accessed February, 16 2015)

[19] http://freesr.org (accessed February, 16 2015)

[20] http://www.speechgear.info/products/interact-as (accessed February, 16 2015)

[21] http://www.digitalsyphon.com/technologies_sonicextract.asp?contentpage=technologies_dsdcc&bodyid=technologies&technologies=technologies (accessed February, 16 2015)

[22] http://www.nuance.com/products/speechmagic/index.htm (accessed February, 16 2015)

[23] http://voxcommando.com/home/ (accessed February, 16 2015)

[24] http://www-01.ibm.com/software/pervasive/viavoice.html (accessed February, 12 2015)

[25] https://github.com/andre-luiz-dos-santos/speech-app (accessed February, 12 2015)

[26] http://thenerdshow.com/freespeech.html (accessed February, 12 2015)

[27] http://vedics.sourceforge.net (accessed February, 12 2015)

[28] http://xvoice.sourceforge.net (accessed February, 12 2015)

[29] https://wiki.gnome.org/GnomeVoiceControl (accessed February, 12 2015)

[30] https://github.com/rcorcs/NatI (accessed February, 12 2015)

[31] http://www.kiecza.net/daniel/linux/ (accessed February, 12 2015)

[32] https://code.google.com/p/sphinxkeys/ (accessed February, 12 2015)

[33] http://freespeech.sourceforge.net (accessed February, 12 2015)

[34] http://perlbox.sourceforge.net (accessed February, 12 2015)

[35] http://www.voxforge.org (accessed February, 12 2015)

[36] https://simon.kde.org (accessed February, 12 2015)

[37] http://speeral.univ-avignon.fr/ (accessed February, 12 2015)

[38] http://www.nuance.com/for-individuals/by-product/dragon-for-mac/dragon-dictate/index.htm (accessed February, 9 2015)

[39] http://www.nuance.com/for-individuals/by-product/dragon-for-mac/dictate-medical/index.htm (accessed February, 9 2015)

[40] http://www.nuance.com/macspeech/ (accessed February, 9 2015)

[41] http://www.application-systems.de/ilisten/ilisten_start.html (accessed February, 9 2015)

[42] https://support.apple.com/kb/PH11447?locale=en_US (accessed January, 25 2015)

[43] https://www.google.de/patents/US5377303?dq=patent:5377303&hl=en&sa=X&ei=MTrzVN-nA4TKOcbngcgD&ved=0CCEQ6AEwAA (accessed February, 9 2015)

[44] Kai-Fu Lee, Hsiao-Wuen Hon, Raj Reddy. An Overview of the SPHINX Speech Recognition System. In IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 1, 1990.

[45] http://htk.eng.cam.ac.uk (accessed February, 9 2015)

[46] http://julius.sourceforge.jp/en_index.php (accessed February, 10 2015)

[47] http://kaldi.sourceforge.net/about.html (accessed February, 10 2015)

[48] Povey, Daniel and Ghoshal, Arnab and Boulianne, Gilles and Burget, Lukas and Glembek, Ondrej and Goel, Nagendra and Hannemann, Mirko and Motlicek, Petr and Qian, Yanmin and Schwarz, Petr and Silovsky, Jan and Stemmer, Georg and Vesely, Karel. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.