1. Motivation

Knowledge about the human sense of hearing is crucial to understanding the way we process speech information. Since technical speech recognition systems deal with the same kind of information, evolution's approach to this task is a good starting point. This article will give an insight into the physiology of the human ear. Starting with a description of the outer, middle and inner ear components, we will continue by addressing the way the ear transforms sound waves into nerve impulses. After an insight into the characteristics of human sound perception, the last section will provide an overview of technical applications of some of these characteristics in speech recognition.

2. Physiology of the Human Ear

2.1 Structure

Basically, the structure of the human ear can be divided into three parts: the outer, middle and inner ear (see figure 1). But the body's influence on arriving sound stimuli begins even earlier: effects like head shadowing or sound diffraction at body parts such as the shoulders influence the sound before it reaches the outer ear. The complex filter effect of head, torso and outer ear is described by the head-related transfer function (HRTF) and is an important factor in hearing functions like spatial hearing [1].

Figure 1: Schematic drawing of outer, middle and inner ear [2]


The outer ear has a strong influence on the frequency transfer characteristics of the human ear. Its roughly 2 cm long ear canal acts as a resonator for frequencies around 4000 Hz, leading to a high sensitivity in this range. Since important cues of human speech occur in that range, the outer ear characteristic helps us to understand speech. On the other hand, it also contributes to hearing impairment being especially common in the region around 4000 Hz (see figure 2).
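
This resonance can be made plausible with a back-of-the-envelope calculation: treated as a quarter-wave resonator closed at one end (the eardrum), the canal's first resonance falls close to the 4000 Hz quoted above. A minimal sketch in Python (the speed of sound is an assumed standard value, not a figure from this article):

    # Rough estimate of the ear canal resonance, modelling the canal as a
    # tube closed at one end (the eardrum): first resonance at f = c / (4 L).
    SPEED_OF_SOUND = 343.0   # m/s in air at room temperature (assumption)
    CANAL_LENGTH = 0.02      # m, the approximate length quoted above

    resonance_hz = SPEED_OF_SOUND / (4 * CANAL_LENGTH)
    print(f"First ear canal resonance: {resonance_hz:.0f} Hz")  # ~4300 Hz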

The middle ear serves as an interface between outer ear and inner ear. To excite the sensory cells in the inner ear, its surrounding fluid needs to be set into oscillation by the sound waves arriving in the ear canal. Since the air in the ear canal and the fluid in the inner ear have different acoustic impedances, an impedance matching is needed to avoid high energy losses through reflection. The ossicles, called malleus, incus and stapes, mechanically connect the ear drum with the oval window of the inner ear and thereby achieve the desired impedance matching.
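
The resulting pressure gain can be estimated from two textbook quantities that are assumptions here, not figures from this article: the effective area ratio between ear drum and oval window (roughly 17:1) and the lever ratio of the ossicle chain (roughly 1.3:1). A minimal sketch:

    import math

    # Assumed textbook values, not taken from this article:
    EARDRUM_AREA = 55e-6       # m^2, effective area of the ear drum
    OVAL_WINDOW_AREA = 3.2e-6  # m^2, area of the oval window footplate
    LEVER_RATIO = 1.3          # mechanical advantage of the ossicle chain

    pressure_gain = (EARDRUM_AREA / OVAL_WINDOW_AREA) * LEVER_RATIO
    gain_db = 20 * math.log10(pressure_gain)
    print(f"Pressure gain: ~{pressure_gain:.0f}x (~{gain_db:.0f} dB)")
    # ~22x (about 27 dB), offsetting a large part of the reflection loss
    # at the air/fluid boundary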

The most important part of the inner ear is the cochlea. The cochlea is shaped like a snail shell and embedded in the temporal bone. Inside the cochlea there are three channels filled with fluid: scala vestibuli, scala media and scala tympani. The basilar membrane separates the scala media from the scala tympani and carries the organ of Corti with its sensory cells. The pressure exerted on the oval window through the auditory ossicles runs as a travelling wave through the cochlea, deflecting the basilar membrane. The pressure in the scala tympani is compensated through the round window.

Figure 2: Threshold in quiet as a function of frequency with age as a parameter [2]

2.2 Function

After a sound wave has passed the outer ear canal, its air pressure oscillations lead to a displacement of the eardrum. Through the auditory ossicles, these oscillations are impedance matched and transferred to the oval window. From there they run as a travelling wave through the liquids of the cochlea channels, deflecting the basilar membrane (see section 2.1).

The cochlea is stiff near the oval window (basis) and more flexible at the peak of its snail-like shape (apex). Thus it exhibits different resonance characteristics along its structure: the basis is more resonant to high frequencies, while the apex is more resonant to lower frequencies. As the travelling wave runs through the cochlea, its different frequency components excite different regions of the cochlea (see figure 3). The organ of Corti containing the sensory hair cells is located on the basilar membrane inside the cochlea, so different hair cells are predominantly excited by different frequencies, and a frequency-place mapping occurs. In effect, the cochlea behaves similarly to a Fourier analyzer [3] (see Wavelet Based Features).
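
This frequency-place mapping is commonly modelled by Greenwood's empirical place-frequency function, a standard formula from the literature (it is not part of this article). A minimal sketch:

    # Greenwood's place-frequency function for the human cochlea:
    # f = A * (10^(a * x) - k), with x the relative distance from the apex.
    def greenwood_frequency(x: float) -> float:
        A, a, k = 165.4, 2.1, 0.88  # constants fitted for the human cochlea
        return A * (10.0 ** (a * x) - k)

    print(greenwood_frequency(0.0))  # ~20 Hz at the apex (low frequencies)
    print(greenwood_frequency(1.0))  # ~20.7 kHz at the basis (high frequencies)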

The sensory hair cells are responsible for the transformation of mechanical oscillations into neural impulses. They are organized as three rows of outer hair cells and one row of inner hair cells. The two types of hair cells serve different purposes:

The outer hair cells

  • Serve as a cochlear amplifier
  • Perform a dynamic adaptation
  • Increase frequency selectivity

The inner hair cells

  • Are specific to certain frequencies
  • Transform mechanical oscillations into electrical impulses

Nerve fibers transport the electrical impulses from the hair cells to the auditory cortex in the brain. It is important to note that the pre-filter function of the outer hair cells is essential to speech understanding.

Figure 3: Travelling wave in the cochlea with maxima at different places corresponding to the frequency [2]

2.3 Characteristics of Human Sound Perception

The physiological structure of the human ear results in some specific signal processing properties. These properties are the cornerstone of technical applications derived from human hearing. Most of the perceptual characteristics are based on the way the cochlea processes the acoustic stimuli.

The human audible range extends approximately from 20 Hz to 18 kHz. As mentioned in section 2.1, the ear is especially sensitive to frequencies around 4 kHz, resulting in the specific threshold in quiet curve that can be seen in figure 2.

Pitch perception is based on the frequency-place mapping that happens in the cochlea. Frequency response characteristics of the cochlea (see section 2.2), along with the filter-like properties of the outer hair cells, limit the number of inner hair cells responsible for analyzing a certain frequency. Every audible frequency has its maximum response area on the cochlea. Hair cells that are located in this area will respond best to excitation with that frequency.

Since the spatial resolution of the cochlea is not infinitely sharp, frequency analysis is limited to 24 critical bands. Information within each band is processed together; loudness, tone and direction of sound are analyzed in these bands. The width of one critical band is defined as one Bark and can be approximated as follows (see the sketch after the list):

  • approximately 100 Hz at frequencies below 500 Hz
  • approximately a minor third at frequencies above 500 Hz [4]
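
A minimal sketch of this two-part rule (a minor third corresponds to a frequency ratio of 2^(3/12) ≈ 1.19, i.e. a bandwidth of roughly 19 % of the centre frequency):

    # Piecewise approximation of the critical bandwidth described above.
    def critical_bandwidth_hz(f_center: float) -> float:
        if f_center < 500.0:
            return 100.0
        # a minor third spans a frequency ratio of 2**(3/12) ~= 1.19
        return (2 ** (3 / 12) - 1) * f_center

    print(critical_bandwidth_hz(200))   # 100 Hz
    print(critical_bandwidth_hz(4000))  # ~757 Hz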

Figure 4: Illustration of Mel scale, critical bands and frequency


The Mel scale is a measure for the perceived pitch and can be derived from the Bark scale [2]; 1 Bark equals approximately 100 mel (see figure 4). A tone that is perceived with double the pitch is assigned double the mel value. As shown in figure 5, pitch perception does not grow linearly with frequency: above 500 Hz, more than double the frequency is needed to achieve a doubling in perceived pitch.
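
A widely used analytic approximation of the Mel scale is O'Shaughnessy's formula, taken from the general literature rather than from this article. A minimal sketch illustrating the non-linear growth:

    import math

    def hz_to_mel(f_hz: float) -> float:
        # calibrated so that 1000 Hz corresponds to 1000 mel
        return 2595.0 * math.log10(1.0 + f_hz / 700.0)

    print(hz_to_mel(500))   # ~607 mel
    print(hz_to_mel(1000))  # ~1000 mel
    print(hz_to_mel(1355))  # ~1214 mel: doubling the pitch of a 500 Hz tone
                            # requires ~1355 Hz, more than double the frequency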

Figure 5: Mel scale over frequency


Unlike loudness, tone and direction of sound, pitch resolution is not limited to the critical bands. Up to 800 Hz, the temporal structure of acoustic signals can still be processed directly, which yields the increased pitch resolution at lower frequencies. Above 1600 Hz the ear can no longer follow the temporal structure of sound and is limited to the analysis of the critical bands again. In between, both effects overlap.

3. Technical Applications

Many technical applications are based on characteristics of the human sense of hearing. Music codecs, for example, use masking effects introduced by the critical bands to effectively reduce the data rate. Especially important for use in speech recognition systems are filters derived from the Mel scale (see section 2.3).

The Mel Frequency Cepstral Coefficients (MFCC) are a commonly used feature extraction method in speech recognition. They are based on the specific pitch and loudness perception of human hearing, imitating the critical band processing. Harmonic decomposition as a denoising technique in preprocessing also uses insights from the critical band analysis.
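
To illustrate the idea, below is a minimal, didactic sketch of the classic MFCC pipeline (not a production implementation); the frame size, filter count and coefficient count are common defaults, not values from this article:

    import numpy as np
    from scipy.fft import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(frame, sample_rate, n_filters=26, n_coeffs=13):
        # 1. Power spectrum of a single windowed frame
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)

        # 2. Triangular filters spaced evenly on the Mel scale, imitating
        #    the critical band processing of the cochlea (section 2.3)
        hz_edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0),
                                         n_filters + 2))
        energies = np.empty(n_filters)
        for i in range(n_filters):
            lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
            rising = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
            falling = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
            energies[i] = np.sum(spectrum * np.minimum(rising, falling))

        # 3. Log compression (loudness perception) and DCT decorrelation
        return dct(np.log(energies + 1e-10), norm='ortho')[:n_coeffs]

    # Example: one 25 ms frame of a 440 Hz tone sampled at 16 kHz
    t = np.arange(400) / 16000.0
    print(mfcc(np.sin(2 * np.pi * 440 * t), 16000))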

References

[1] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization. MIT Press, revised edition, October 1996.

[2] H. Fastl and E. Zwicker, Psychoacoustics - Facts and Models. Springer, 3rd edition, December 2006.

[3] B. C. Moore, An Introduction to the Psychology of Hearing. BRILL, 6th edition, April 2013.

[4] E. Terhardt, Akustische Kommunikation. Springer, 1998.

