1. Introduction

To communicate, humans encode a linguistic message with their vocal organs. Since the goal of automatic speech recognition is to recover the linguistic message from a given speech signal, the characteristics of this encoding are important for designing systems that decode it. This article covers human sound generation with the vocal organs and presents a mathematical model of this sound generation. Both sections support the understanding of the following chapters, Preprocessing and Feature Extraction.


2. Human speech production

2.1 Speech Signal Generation

During human evolution, the physiology of the vocal organs changed to enable extensive and efficient human communication. This change in physiology enabled the creation of new, distinctive and complex sounds such as /i/ or /u/, which were not possible with the standard mammalian physiology of the vocal organs [1]. The vocal organs, which enable the generation of distinctive sounds, are shown in Figure 1. They consist of the lungs, the pharynx and the vocal tract. The vocal tract comprises the oral cavity and the nasal cavity; these cavities are coupled when the velum is lowered and decoupled when it is raised. [2]

Figure 1: Schematic representation of the vocal organs [2]


To produce sounds, the lungs exhale air through the windpipe towards the glottis. If the vocal cords are relaxed, the air passes through unhindered and unvoiced sounds (e.g. /f/ or /th/) are produced. If the vocal cords are tensed, they vibrate at a frequency between 50 Hz and 500 Hz and chop the air stream into a quasi-periodic sequence of impulses. This sound wave passes through the pharynx into the vocal tract. The position and shape of the jaw, velum, tongue and lips determine the resonances and anti-resonances of the vocal tract and thereby articulate the specific sound from the incoming sound wave. The resonances of the vocal tract are called formants. The modified sound wave leaves the vocal tract through the nostrils and the lips and is transmitted by the surrounding air [1,2,3]. The resulting sound is not speaker independent, however, because the physiology of the vocal tract and the glottis varies from speaker to speaker. For example, the gender and age of the speaker affect the frequency at which the vocal cords vibrate [1]. This individual physiology of the vocal tract determines the individual timbre of the speaker [4]. Only the formant frequencies are largely speaker independent, and they carry most of the information about the linguistic message [1].

The position and shape of the jaw, velum, tongue and lips change rapidly over time to produce the sequence of sounds that corresponds to a specific word. A speech signal can therefore only be considered a slowly time-varying signal, with stationary characteristics over periods in the range of 5 ms to 100 ms. These short stationary periods correspond to a fixed configuration of the vocal tract. Longer periods (~200 ms and above) of a speech signal exhibit non-stationary characteristics. [2]
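To exploit this short-time stationarity, recognisers analyse the signal frame by frame. The following sketch illustrates such framing with NumPy; the 25 ms frame length and 10 ms frame shift are common textbook defaults, not values prescribed by [2]:

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames that are short enough
    to be treated as stationary (cf. the 5-100 ms range above)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    shift = int(sample_rate * shift_ms / 1000)      # samples between frame starts
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

# One second of signal at 16 kHz yields 98 frames of 400 samples each.
print(frame_signal(np.random.randn(16000), 16000).shape)  # (98, 400)
```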

2.2 Variation of the Speech Signal

As previously mentioned, the individual physiology of the speaker affects some characteristics of the speech signal. Beyond physiology, there are many more characteristics of an individual speaker that have an impact on the speech signal and therefore on the word error rate of an automatic speech recognition system. Some of these variations are inter-speaker dependent, which means they depend only on who the speaker is. Intra-speaker dependent characteristics, in contrast, depend on the state of the speaker, which can change rapidly. [4] Examples of inter- and intra-speaker dependent variations are:

Age:
The age of a speaker is an inter-speaker dependent variation because the physiology of the vocal tract varies with age. For example, children articulate sounds differently from adults. Furthermore, the composition of the linguistic message differs for children because they use a different vocabulary than adults.

Gender:
The gender of a speaker is an inter-speaker dependent variation because the physiology of the vocal tract differs between male and female speakers. For example, the fundamental frequency of a female speaker is typically higher than that of a male speaker. [4]

Accents:
The accent of a speaker is an inter-speaker dependent variation because the pronunciation of words varies across accents. Foreign accents in particular differ significantly from native pronunciation. [4]

Speaking Style:
The speaking style of a speaker is an intra-speaker dependent variation because the articulation of words varies with the speaking style, for example between read and spontaneous speech. During spontaneous speech, most speakers do not pronounce phonemes precisely and produce many false starts, repetitions and hesitations. [4]

State:
The state of a speaker is an intra-speaker dependent variation because it alters the articulation, the rate of speech and the loudness. The state of the speaker includes emotions such as sadness, happiness or anger [4], but also intoxication, stress and health conditions [5].


3. The Source Filter Model

3.1 Theory

The acoustic speech production of the human vocal organs can be modeled by the source filter model shown in Figure 2. The model was developed by Fant in 1960 and is the basis for several feature extraction algorithms in automatic speech recognition. It separates speech production into a signal source and a filter. Biologically, the source corresponds to the lungs and the glottis, which produce the voiced or unvoiced excitation signal. The filter is the technical representation of the configuration of the vocal tract, which articulates the final sound. [3]

Figure 2: Block diagram of the source filter model [2]


The source model for unvoiced sounds consists of a white noise generator. For voiced sounds, the source model consists of an impulse generator with a fixed period followed by a linear time-invariant filter, which simulates the vibration of the vocal cords. The output signal u_n of the source for a voiced sound is a quasi-periodic signal; one period of this signal is shown in Figure 3a. The corresponding idealised spectrum of the source signal (see Figure 3b) consists of the speaker-dependent fundamental frequency and its harmonics. [3]

Figure 3: a) One period of the signal u_n for a voiced sound in the time domain. b) Idealised spectrum of the signal u_n for a voiced sound. [2]
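As a minimal sketch of this source model (the 16 kHz sampling rate, the 100 Hz fundamental frequency and the one-pole smoothing filter are illustrative assumptions, not values taken from [3]), the voiced excitation can be generated as an impulse train shaped by a linear time-invariant filter, and the unvoiced excitation as white noise:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000   # sampling rate in Hz (assumed)
f0 = 100     # fundamental frequency in Hz (assumed, within the 50-500 Hz range)
n = fs // 2  # half a second of excitation

# Voiced source: quasi-periodic impulse train with a period of fs/f0 samples ...
impulses = np.zeros(n)
impulses[:: fs // f0] = 1.0

# ... smoothed by a simple one-pole LTI filter standing in for the
# vibration of the vocal cords (glottal pulse shape).
u_voiced = lfilter([1.0], [1.0, -0.95], impulses)

# Unvoiced source: white noise.
u_unvoiced = np.random.randn(n)
```

In the frequency domain, u_voiced shows the structure of Figure 3b: energy at the fundamental frequency f0 and at its harmonics.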


The filter itself is modeled by two sequential linear time-invariant filters. The first simulates the shape of the vocal tract. Physically, the vocal tract can be approximated as a tube resonator consisting of cylindrical slices with different diameters (see Figure 4). The second linear time-invariant filter simulates the emission of the sound wave through the lips, which under ideal conditions can be modeled as a high-pass filter. [3]

Figure 4: Tube resonator model for the vocal tract [2]
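The following sketch models the two filters under these assumptions; the formant frequencies and bandwidths are invented for illustration (roughly vowel-like) and are not taken from [2] or [3]. The vocal tract is approximated by an all-pole resonator, the lip radiation by a first-order high-pass differentiator:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000  # sampling rate in Hz (assumed)

# First filter, vocal tract: cascade of second-order resonator sections,
# one pole pair per formant (frequencies and bandwidths are illustrative).
formants = [700, 1200, 2600]  # Hz
bandwidths = [130, 70, 160]   # Hz
a = np.array([1.0])
for f, bw in zip(formants, bandwidths):
    r = np.exp(-np.pi * bw / fs)   # pole radius from the bandwidth
    theta = 2 * np.pi * f / fs     # pole angle from the formant frequency
    a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])

def vocal_tract_filter(u):
    """Apply the vocal tract resonances, then the lip radiation
    (first-order high-pass differentiator), to a source signal u."""
    resonated = lfilter([1.0], a, u)
    return lfilter([1.0, -1.0], [1.0], resonated)
```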


The simulated speech signal of the source filter model is the convolution of the source signal with the impulse response of the filter in the time domain, which corresponds to a multiplication of their spectra in the frequency domain.
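Written out with the document's notation (u_n for the source signal, f_n for the output, and h_n as an assumed symbol for the impulse response of the filter):

```latex
f_n = u_n * h_n
\;\Longleftrightarrow\;
F(\omega) = U(\omega)\, H(\omega)
\;\Longrightarrow\;
\log\lvert F(\omega)\rvert = \log\lvert U(\omega)\rvert + \log\lvert H(\omega)\rvert
```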

Figure 5 shows the individual log spectra of the source (Figure 5a), the filter (Figure 5b) and the resulting speech signal (Figure 5c) for the phone /a/. Because of the logarithmic representation, the log spectrum of the speech signal (Figure 5c) is the sum of the two preceding spectra. [3]

Figure 5: a) Non-idealised log spectrum of the source signal for the voiced sound /a/. b) Log spectrum of the filter transfer function for the voiced sound /a/. c) Log spectrum of the output signal f_n of the source filter model for the sound /a/. [2]
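Using the sketches above (u_voiced and vocal_tract_filter), this additivity can be checked numerically:

```python
import numpy as np

f = vocal_tract_filter(u_voiced)  # simulated speech signal
log_U = np.log(np.abs(np.fft.rfft(u_voiced)) + 1e-12)
log_F = np.log(np.abs(np.fft.rfft(f)) + 1e-12)
# log_F - log_U approximates the log magnitude response of the filter alone,
# i.e. the smooth formant envelope of Figure 5b (up to edge effects).
```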

3.2 Applications in Speech Recognition

The source filter model and the resulting characteristics of the speech signal are often used as the foundation for preprocessing and feature extraction methods. For example, harmonic decomposition uses the harmonic spectrum of the speech signal to separate speech from noise. Most feature extraction methods operate on the short stationary segments of the speech signal to extract descriptive features. Linear prediction feature extraction in particular builds directly on the source filter model: it approximates the transfer function of the filter to extract descriptive features.
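As a minimal sketch of that idea (the analysis order of 12 and the autocorrelation method are common textbook choices, not prescribed by the text above): the coefficients of an all-pole filter are estimated from one quasi-stationary frame, and the resulting transfer function approximates the vocal tract filter of that frame:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Estimate all-pole (LPC) coefficients for one quasi-stationary frame
    via the autocorrelation method (Yule-Walker equations)."""
    x = frame * np.hamming(len(frame))
    # Autocorrelation of the windowed frame up to lag `order`.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Solve the symmetric Toeplitz system R a = r for the predictor.
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    return np.concatenate(([1.0], -a))  # denominator A(z) of the filter 1/A(z)
```

The magnitude of 1/A(e^{jω}) then approximates the vocal tract transfer function of Figure 5b, and its peaks lie near the formant frequencies of the analysed frame.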


References

[1] W. Tecumseh Fitch, “The evolution of speech: a comparative review”, Trends in Cognitive Sciences, vol. 4, pp. 258-267, 2000.

[2] L. Rabiner and B.-H. Juang, Fundamentals of speech recognition, Englewood Cliffs: Prentice-Hall International, 1993.

[3] E. G. Schukat-Talamazzini, Automatische Spracherkennung, Braunschweig, Wiesbaden: Vieweg, 1995.

[4] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, R. Rose, V. Tyagi, C. Wellekens, "Automatic speech recognition and speech variability: A review", Speech Communication, vol. 49, pp. 763-786, 2007.

[5] B. Schuller, "Voice and speech analysis in search of states and traits", in Computer Analysis of Human Behaviour, A. A. Salah and T. Gevers (eds.), Berlin, New York, Tokyo: Springer, 2011, ch. 9, pp. 227-253.

