An in-depth understanding of how speech recognition works requires close engagement with the basic biological and anatomical aspects of speech production. The following therefore gives a brief overview of the important characteristics of articulation, with particular emphasis on the function of the lips.
1. Introduction to articulatory gestures
The vocal tract consists of various elements that are used to form sounds. These elements are called articulators. Figure 1 names the principal parts of the upper and lower surfaces of the vocal tract used for articulation.
The production of vowel sounds usually does not require contact between the articulators; instead, vowels are determined by the position of the tongue and lips, which modify the airstream. Vowel sounds are differentiated by the position and height of the tongue: if the highest point of the tongue is located in the front of the mouth, they are called front vowels (“heed”); if it is closer to the back of the mouth, they are classified as back vowels (“father”).
The lips also play an important role in forming different vowels. In back vowels, they generally move closer together, for example when pronouncing the word “good”. In this word the lips are also rounded, which leads to a further classification of vowels into rounded (“good”) and unrounded (“heed”) vowels.
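The two vowel distinctions described here (front versus back, rounded versus unrounded) can be written down as a small feature table. The following Python sketch is only an illustration: the `Vowel` class, its field names, and the `words_with` helper are hypothetical and not part of any phonetics library, and the entries cover only the three example words used in the text (with “father” marked as unrounded, which the text does not state explicitly).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vowel:
    """Hypothetical record of the two vowel features discussed in the text."""
    example_word: str
    backness: str  # "front" or "back": where the highest point of the tongue lies
    rounded: bool  # True if the lips are rounded


# Feature values for the example words used in the text.
VOWELS = [
    Vowel("heed", backness="front", rounded=False),
    Vowel("father", backness="back", rounded=False),
    Vowel("good", backness="back", rounded=True),
]


def words_with(backness=None, rounded=None):
    """Return example words matching the given feature values (None means any)."""
    return [
        v.example_word
        for v in VOWELS
        if (backness is None or v.backness == backness)
        and (rounded is None or v.rounded == rounded)
    ]


print(words_with(backness="back"))  # ['father', 'good']
print(words_with(rounded=True))     # ['good']
```

Of the two features, lip rounding is the one that can be seen from outside the mouth, which is why it is singled out here.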
Because vowels are produced without greatly obstructing the airstream from the lungs, the articulators play a more important role when a speaker forms consonants. To form a consonant, the airstream has to be obstructed, as you can verify by pronouncing the word “town”. Different consonants require different interactions, largely realized by the primary articulators: the lips and the tongue. Articulatory gestures can therefore be broadly classified as follows:
- Labial articulations, describing gestures involving the lips.
- Coronal articulations, that is to say, gestures using the tip or blade of the tongue.
- Dorsal articulations, indicating the utilization of the back of the tongue.
These terms can be illustrated with the word “topic”: the first consonant (“t”) is a coronal consonant because the tip or blade of the tongue must touch the alveolar ridge, just behind the upper front teeth. The second consonant (“p”) is a labial one, since it requires the lips to close. Finally, the third one (“c”) involves the back of the tongue touching the soft palate and can therefore be identified as a dorsal consonant.
2. Types of obstruction required for the English language
The types defined above give a rough description of the articulatory gestures for consonants, but they reach their limits when the particular kind of obstruction has to be described in detail. To cope with the possible interactions between articulators, phonetics therefore identifies eight types of obstruction for English. The following list names each type and the parts of the vocal tract involved, and gives example words:
1. Labial subtypes
- Bilabial (made with two lips)
-> example: pie, buy, my
- Labiodental (lower lip and upper front teeth)
-> example: fie, vie
2. Coronal subtypes
- Dental (tongue tip or blade and upper front teeth)
-> example: thigh, thy
- Alveolar (tongue tip or blade and the alveolar ridge)
-> example: tie, die, nigh
- Retroflex (tongue tip and the back of the alveolar ridge)
-> example: rye, row, ray
- Palato-Alveolar (tongue blade and the back of the alveolar ridge)
-> example: shy, she, show
3. Dorsal subtypes
- Palatal (front of the tongue and hard palate)
-> example: you
- Velar (back of the tongue and soft palate)
-> example: hack, hag, hang
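The listing above is in effect a two-level taxonomy, and it can be sketched as a lookup table. The Python below is a minimal illustration, not an established API: the names `PLACES`, `major_class`, and `places_in_class` are hypothetical, and all entries are copied from the list.

```python
# Place of articulation -> (major class, articulators involved, example words).
# All entries are copied from the listing in the text.
PLACES = {
    "bilabial": ("labial", "two lips", ["pie", "buy", "my"]),
    "labiodental": ("labial", "lower lip and upper front teeth", ["fie", "vie"]),
    "dental": ("coronal", "tongue tip or blade and upper front teeth",
               ["thigh", "thy"]),
    "alveolar": ("coronal", "tongue tip or blade and the alveolar ridge",
                 ["tie", "die", "nigh"]),
    "retroflex": ("coronal", "tongue tip and the back of the alveolar ridge",
                  ["rye", "row", "ray"]),
    "palato-alveolar": ("coronal",
                        "tongue blade and the back of the alveolar ridge",
                        ["shy", "she", "show"]),
    "palatal": ("dorsal", "front of the tongue and hard palate", ["you"]),
    "velar": ("dorsal", "back of the tongue and soft palate",
              ["hack", "hag", "hang"]),
}


def major_class(place):
    """Return 'labial', 'coronal' or 'dorsal' for a place of articulation."""
    return PLACES[place.lower()][0]


def places_in_class(cls):
    """Return all places of articulation that belong to one major class."""
    return [place for place, (c, _, _) in PLACES.items() if c == cls]


# The consonants of "topic", with the place of articulation assigned by hand.
for letter, place in [("t", "alveolar"), ("p", "bilabial"), ("c", "velar")]:
    print(letter, place, major_class(place))

print(places_in_class("labial"))  # ['bilabial', 'labiodental']
```

The table is keyed by place of articulation rather than by letter, because English spelling does not map one-to-one onto sounds; the places for the consonants of “topic” are therefore supplied by hand.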
In summary, these are only the key aspects of phonetics for the English language, but this short introduction already shows that every element of the vocal tract matters for a successful speech recognition application. The focus on the lips is mainly due to the fact that their movement can be observed externally.
For example, Bregler (1994) improved a speech recognition application by including visual information about the face, in particular by locating and tracking the movement of the lips.
Marlett, S. (2013). Phonology From the Ground Up: The Basics. Retrieved from http://arts-sciences.und.edu/summer-institute-of-linguistics/teaching-linguistics/_files/docs/marlett-2014-phonology-student.pdf
Ladefoged, P. (1993). A Course In Phonetics (3rd ed.). Heinle & Heinle Publishers Inc.
Bregler, C. (1994). “Eigenlips” for robust speech recognition. In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, pages II-669.