1 Introduction

Speech has fascinated people for thousands of years. The topic is too broad to explain in a few sentences, so the "Speech" section of our website is divided into five chapters: Composition of Language, Speech Production, Sense of Hearing, Grammar, and the Lips' Function in Speech Recognition.

First, let us take a closer look at the composition of language, which serves as an introduction to linguistics. Speech has been a catalyst of human development and has been refined over thousands of years into what it is today. On the one hand, it is a seemingly primitive communication tool that even a child can use to express feelings and needs. On the other hand, language is sophisticated: the study of language, called linguistics, covers a variety of subjects such as phonology, lexicology, and semantics. This complexity is one reason why speech recognition still poses many difficulties.

Understanding the structure of a sentence is relatively easy (Figure 1): phonemes (vowels, diphthongs, and consonants) build syllables; syllables build words; words build phrases; phrases build clauses; and clauses finally compose a sentence. A speech recognition system, however, is built from more than these blocks. Figure 2 shows the composition of speech recognition technology, following the book "Spoken Language Processing"[1].

Figure 1: Structure of a sentence

Figure 2: Composition of language


To make Figure 2 easier to understand, the most important definitions are introduced first. Because of space limits on this page, only the most important terms are explained in depth later; their functions and examples are presented in separate short sections.


1.1 Definition of terms

Phoneme: Smallest discrete unit of sound that distinguishes words in a certain language (Minimal Pair Principle).

  • Vowel: a sound pronounced with an open vocal tract.
  • Diphthong: a special class of vowel that combines two adjacent vowel sounds.
  • Consonant: a sound produced with a closure or constriction in the vocal tract.
  • Coarticulation: the process by which neighboring sounds influence each other[1].
  • Allophone: a perceivable variant of a phoneme, produced when coarticulation modifies it.

Syllable: Acoustic component perceived as a single unit

Word: Speaker-identifiable unit of meaning

  • Part of speech (POS): a category of words with similar grammatical properties.
  • Morpheme: Smallest linguistic unit with meaning
  • Word class: a group of words that share grammatical and lexical properties.

Phrase: Sub-message of one or more words

Syntax: the study of how sentences are formed according to grammatical rules.

  • Phrase schemata: generalized schemata, expressed in terms of POS, that describe phrase structure.
  • Clause: an inflectional phrase (IP) in terms of the phrase schemata; a kind of sentence.
  • Sentence: Self-contained message derived from a sequence of phrases and words.
  • Parse tree representation: a tree diagram that analyzes the phrase structure of a sentence.

Semantics: the branch of linguistics dealing with the study of meaning[1].

  • Semantic roles: also called case relations; they represent the essential factors of an event.
  • Lexical semantics: meaning templates used to derive related words from an existing word through a specific relation.
  • Logical form: a systematization of meaning-related words.


2 Principles and examples

2.1 Phonemes and Phones

So far, many terms have been introduced and only briefly explained. One distinction still needs to be made clear: the difference between phonemes and phones. Though the two words look alike, their meanings differ. A phoneme is the smallest unit of sound that distinguishes meaning. A phone is a "speech sound", that is, a unit of sound as it is actually produced. If you ask whether two sounds are the same or different without saying whether you mean sounds of substance (phones) or of form (phonemes), there is no answer. In the first case, the difference depends on what the physical segments sound like; in the second case, it depends on the phonological description[2].

One example:
The words "madder" and "matter" clearly have different phonemes. In American English, however, both words are pronounced almost the same, which means that their phones are the same, or at least very close in the acoustic domain.
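
The "madder"/"matter" case can be made concrete with a small sketch. The transcriptions below are simplified and written by hand for illustration (they are assumptions, not taken from a pronunciation dictionary), and the function names are hypothetical. At the phoneme level the two words differ in exactly one position, which makes them a minimal pair; at the phone level, assuming both /d/ and /t/ are realized as the same flap sound, the sequences coincide.

```python
def differing_positions(a, b):
    """Return the indices at which two equal-length symbol sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def is_minimal_pair(a, b):
    """Two words form a minimal pair if they differ in exactly one phoneme."""
    return len(a) == len(b) and len(differing_positions(a, b)) == 1


# Simplified, hand-written transcriptions (assumptions for illustration only).
PHONEMES = {
    "madder": ["m", "æ", "d", "ɚ"],
    "matter": ["m", "æ", "t", "ɚ"],
}
# Assumed American English realization: /d/ and /t/ both become the flap [ɾ].
PHONES = {
    "madder": ["m", "æ", "ɾ", "ɚ"],
    "matter": ["m", "æ", "ɾ", "ɚ"],
}

print(is_minimal_pair(PHONEMES["madder"], PHONEMES["matter"]))   # True
print(differing_positions(PHONES["madder"], PHONES["matter"]))   # []
```

The same comparison gives different answers depending on whether it is applied to phonemes or to phones, which is exactly the ambiguity described above.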


2.2 Vowels, diphthongs, and consonants

The most basic phonemes are vowels and consonants, which together form spoken language. A vowel is a speech sound made by allowing breath to flow out of the mouth without closing any part of the mouth or throat (although the lips may move to create the correct sound, as for the sound “o”). A consonant is a speech sound made by partially or completely blocking the flow of air through the mouth (using the lips, teeth, tongue, and palate). The letters of the English alphabet that represent consonants are all the letters that are not vowels, for example b, d, k, and s. A diphthong is a special type of vowel that combines two different vowel sounds into one[3].

There are many phonetic alphabets; the two most popular are the International Phonetic Alphabet (IPA) and ARPAbet. IPA is a standard set of symbols for transcribing the sounds of human speech, with a one-to-one mapping of symbols to sounds. ARPAbet is a phonetic alphabet designed specifically for American English; it was developed in the 1970s. It uses ASCII symbols to represent a subset of the IPA and is typically used in automatic speech recognition. Table 1 and Table 2 list the vowels (including diphthongs) and the consonants, respectively, in both ARPAbet and IPA notation.


Table 1: Vowels and diphthongs[4]


Table 2: Consonants[4]
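
The relation between the two alphabets can be sketched as a simple lookup table. The mapping below is a small hand-picked subset of commonly used ARPAbet-to-IPA correspondences, not a reproduction of Table 1 or Table 2, and the function name is hypothetical. The handling of trailing stress digits (as in "IY1") is an assumption based on how ARPAbet is often written in pronunciation dictionaries.

```python
# A small, hand-picked subset of ARPAbet-to-IPA correspondences (illustrative only).
ARPABET_TO_IPA = {
    # vowels and diphthongs
    "IY": "i", "IH": "ɪ", "EH": "ɛ", "AE": "æ", "AA": "ɑ", "UW": "u",
    "AY": "aɪ", "OY": "ɔɪ",
    # consonants
    "B": "b", "D": "d", "K": "k", "S": "s", "SH": "ʃ", "TH": "θ",
}


def arpabet_to_ipa(symbols):
    """Convert a list of ARPAbet symbols to an IPA string.

    Trailing stress digits (0, 1, 2) are dropped before the lookup.
    Symbols outside the subset above raise a KeyError.
    """
    return "".join(ARPABET_TO_IPA[s.rstrip("012")] for s in symbols)


print(arpabet_to_ipa(["S", "IY1"]))       # si
print(arpabet_to_ipa(["B", "AY1", "K"]))  # baɪk
```

Because ARPAbet uses only ASCII characters, transcriptions like these are easy to store and process in recognition systems, which is the practical motivation mentioned above.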


2.3 Formants

“Formant” has different meanings in different fields. In speech recognition, a formant is an acoustic resonance of the vocal tract. Vowels are well known to play a major role in speech understanding, even though their realization varies with many factors (whether the speaker is male or female, young or old, healthy or ill, or has an accent). Formants, as a primary acoustic characteristic of vowels, can be measured readily in the frequency spectrum of speech: a spectrogram reveals the vocal resonant frequencies, which speech recognition devices can then analyze[5][6][7][8]. The major resonances of the oral and pharyngeal cavities for vowels are called F1 and F2, the first and second formants, respectively. They are determined by tongue placement and oral tract shape, and they determine the characteristic timbre or quality of the vowel[1].

Because of this property, F1 and F2 together are used to describe and recognize different vowels. F1 reflects the resonance of the back, or pharyngeal, portion of the cavity, while F2 reflects the resonance of the forward part of the oral cavity. The resonance of the pharyngeal portion is far lower than that of the oral cavity, which is why F1 is always lower than F2. For example, in the vowel of "see", the tongue is pushed far forward in the mouth, creating an exceptionally long rear cavity and a correspondingly low F1. The forward part of the oral cavity is at the same time extremely short, which contributes to a higher F2. Typical values of F1 and F2 for American English vowels are listed in Table 3[1].


Table 3: Phoneme labels and typical formant values for vowels of English[1]
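
To illustrate how F1 and F2 together can identify a vowel, the following minimal sketch picks the vowel whose typical (F1, F2) pair lies closest to a measured pair. The formant values are rough, illustrative figures of the kind commonly reported for adult male speakers; they are assumptions for this example and are not copied from Table 3. The function name and the use of plain Euclidean distance in Hz are likewise simplifications; a practical system would have to normalize for the speaker variation discussed above.

```python
import math

# Illustrative (F1, F2) values in Hz, keyed by ARPAbet-style labels.
# These numbers are assumptions for the sketch, not the values of Table 3.
TYPICAL_FORMANTS = {
    "iy": (270, 2290),  # as in "see": low F1, high F2
    "ae": (660, 1720),
    "aa": (730, 1090),
    "uw": (300, 870),
}


def nearest_vowel(f1, f2):
    """Return the label whose typical (F1, F2) is closest to the measurement."""
    return min(
        TYPICAL_FORMANTS,
        key=lambda label: math.dist((f1, f2), TYPICAL_FORMANTS[label]),
    )


print(nearest_vowel(290, 2200))  # iy
print(nearest_vowel(700, 1100))  # aa
```

The first query has a low F1 and a high F2, the pattern described above for the vowel of "see", so it is matched to "iy".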


If the theoretical principles of formants are still confusing, sit back and listen to a famous talking-harmonica song, "I Found My Mama" by Mattie O'Neil & Salty Holmes. It helps to understand formants in a concrete way.

Of course, the harmonica cannot really talk. With the help of formants, however, the frequencies it produces can come closer to those of the human vocal tract (mostly affected by F2). That is why it sounds like "talking".


2.4 Syllable

As explained in Section 1.1, a syllable is an acoustic component perceived as a single unit. It has two main parts: the onset and the rhyme (also called rime; see Figure 3). The onset is the initial consonant or consonants before the vowel. The rhyme is composed of a nucleus and an optional coda. As its name indicates, the nucleus represents the vowel peak, that is, the core of the syllable. The coda consists of the trailing consonants right after the nucleus. Note that neither the onset nor the coda is required, so the smallest possible syllable consists of a nucleus only[9].
To summarize:
Onset = optional start of syllable -> Consonant
Rhyme = main part of syllable -> Nucleus + (Coda)
Nucleus = mandatory core of syllable -> Vowel
Coda = optional end of syllable -> Consonant
Figure 3: Structure of syllable
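
The onset/rhyme structure summarized above can be sketched as a small function that splits the phonemes of a single syllable. The vowel inventory below uses ARPAbet-style labels and is an assumption for this example, as are the function name and the dictionary layout of the result. The sketch assumes its input is already one syllable, so the vowel symbols must be adjacent.

```python
# ARPAbet-style vowel labels (an assumed inventory for this sketch).
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def split_syllable(phonemes):
    """Split one syllable into onset, nucleus, coda and rhyme."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not vowel_positions:
        raise ValueError("a syllable needs a nucleus (vowel)")
    first, last = vowel_positions[0], vowel_positions[-1]
    if last - first + 1 != len(vowel_positions):
        raise ValueError("vowels are not adjacent: more than one syllable?")
    onset = phonemes[:first]            # optional leading consonants
    nucleus = phonemes[first:last + 1]  # mandatory vowel core
    coda = phonemes[last + 1:]          # optional trailing consonants
    return {"onset": onset, "nucleus": nucleus, "coda": coda,
            "rhyme": nucleus + coda}


print(split_syllable(["K", "AE", "T"]))
# {'onset': ['K'], 'nucleus': ['AE'], 'coda': ['T'], 'rhyme': ['AE', 'T']}
print(split_syllable(["AY"]))
# {'onset': [], 'nucleus': ['AY'], 'coda': [], 'rhyme': ['AY']}
```

The second call shows the smallest possible syllable: a nucleus with neither onset nor coda.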


2.5 Morphology

Morphology is the study of the internal structure of words (verbs, nouns, adjectives, and so on). It is worth introducing because knowledge of the words of a language cannot be summarized in a finite list. We therefore need to know the internal pieces of words and the principles of word formation.

A morpheme is the smallest unit of meaning in a language. Free morphemes can stand alone as independent words. Bound morphemes, also called affixes, are always attached to other words. Affixes are divided into several types: a prefix attaches at the front of a word; a suffix attaches at the end of a word; a circumfix attaches around a word; and an infix attaches inside a word.

There are different kinds of morphological processes, in particular inflectional and derivational morphology. Inflectional morphology does not change the grammatical category of a word (see the left side of Table 4). In English, all inflectional affixes are suffixes (they attach to the end of a word), and they attach after any derivational affixes. Derivational morphology can change the grammatical category, but does not always do so. Nor does it always produce a regular, predictable change in meaning (see the right side of Table 4)[10].




Table 4: English inflectional morphology and derivational morphology[10]
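
The ordering rule described above (inflectional suffixes attach after derivational ones) suggests a naive way to peel a word apart: strip at most one inflectional suffix from the end first, then at most one derivational suffix. The sketch below does this with plain string matching. The suffix lists are small hand-picked examples and the function name is hypothetical; neither is taken from Table 4. Real morphological analyzers rely on a lexicon and spelling rules, which this toy version deliberately ignores.

```python
# Hand-picked example suffixes (assumptions for illustration only).
INFLECTIONAL_SUFFIXES = ("ing", "ed", "s")
DERIVATIONAL_SUFFIXES = ("ness", "ment", "able", "ize")


def _strip_one(stem, suffixes):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in suffixes:
        if suffix == "s" and stem.endswith("ss"):
            continue  # avoid splitting words such as "kindness" into "kindnes" + "s"
        if stem.endswith(suffix) and len(stem) > len(suffix) + 2:
            return stem[:-len(suffix)], suffix
    return stem, None


def analyze(word):
    """Return (stem, [(suffix, kind), ...]) with suffixes in left-to-right order."""
    parts = []
    # The outermost suffix, if any, is assumed to be inflectional.
    stem, suffix = _strip_one(word, INFLECTIONAL_SUFFIXES)
    if suffix:
        parts.insert(0, (suffix, "inflectional"))
    # A derivational suffix may sit between the stem and the inflection.
    stem, suffix = _strip_one(stem, DERIVATIONAL_SUFFIXES)
    if suffix:
        parts.insert(0, (suffix, "derivational"))
    return stem, parts


print(analyze("payments"))  # ('pay', [('ment', 'derivational'), ('s', 'inflectional')])
print(analyze("walked"))    # ('walk', [('ed', 'inflectional')])
print(analyze("kindness"))  # ('kind', [('ness', 'derivational')])
```

In "payments", the derivational suffix "ment" turns the verb "pay" into a noun, while the inflectional "s" only marks the plural and leaves the category unchanged.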


2.6 Part-of-speech (POS)

Lexical part-of-speech (POS) tags represent word-type categories: words with the same function are grouped and given a specific tag. A typical set of POS categories includes noun, verb, adjective, adverb, interjection, conjunction, determiner, preposition, and pronoun. How are newly created words added to these categories? Only some of the categories, such as noun, verb, and adjective, can be extended with new words. Even then, the new words are not entirely new: they are created by following the paradigmatic patterns of the existing POS. They are therefore predictable and easily adopted by readers and speakers. These open POS categories are listed in Table 4[1].

Table 4: Open POS categories[1]


In contrast to the open-class categories, the remaining categories are called closed POS. They develop very slowly and seldom accept new words; they seem to have developed mainly early in the history of the English language. The closed-category words, which are fairly stable over time, are sometimes called function words. The closed POS categories are shown in Table 5[1].


Table 5: Closed POS categories[1]
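
The open/closed distinction can be sketched as a simple filter over tagged words. The split of the nine categories into two sets below is an assumption for this example (the tables above are the authoritative lists), the tags are assigned by hand, and the function names are hypothetical.

```python
# Assumed split of the POS categories named in the text (illustrative only).
OPEN_CLASSES = {"noun", "verb", "adjective", "adverb", "interjection"}
CLOSED_CLASSES = {"conjunction", "determiner", "preposition", "pronoun"}


def content_words(tagged_tokens):
    """Keep the words whose tag belongs to an open class."""
    return [word for word, pos in tagged_tokens if pos in OPEN_CLASSES]


def function_words(tagged_tokens):
    """Keep the words whose tag belongs to a closed class."""
    return [word for word, pos in tagged_tokens if pos in CLOSED_CLASSES]


# Hand-tagged example sentence.
tagged = [
    ("the", "determiner"), ("cat", "noun"), ("sat", "verb"),
    ("on", "preposition"), ("the", "determiner"), ("mat", "noun"),
]

print(content_words(tagged))   # ['cat', 'sat', 'mat']
print(function_words(tagged))  # ['the', 'on', 'the']
```

The closed-class words form a short, stable list, which is why they can be enumerated in advance, whereas the open-class words keep growing.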



[1] Huang, X., Reddy, R. & Acero, A. (2001). Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall.

[2] Lyons, J. (1971). Introduction to Theoretical Linguistics. Cambridge University Press.

[3] NAMC - North American Montessori Center. Review of Vowels and Consonants.

[4] Brumberg, J. S., Wright, E. J., Andreasen, D. S., Guenther, F. H. & Kennedy, P. R. (2011). Classification of intended phoneme production from chronic intracortical microelectrode recordings in speech-motor cortex.

[5] Rosner, B. & Pickering, J. (1994). Vowel Perception and Production. Oxford University Press.

[6] Broad, D. & Wakita, H. (1977). Piecewise-planar representation of vowel formant frequencies. Journal of the Acoustical Society of America.

[7] Fant, G. (1960). Acoustic Theory of Speech Production. Mouton.

[8] Monahan, P. J. & Idsardi, W. J. (2010). Auditory Sensitivity to Formant Ratios: Toward an Account of Vowel Normalization. Lang Cogn Process.

[9] Tian, F. (2007). The theory and partition of syllables. Heilongjiang Science and Technology Information.

[10] Akmajian, Demers & Harnish. Linguistics (second edition). MIT Press.