1. Introduction

The most commonly used feature extraction method in automatic speech recognition (ASR) is Mel-Frequency Cepstral Coefficients (MFCC) [1]. This feature extraction method was first mentioned by Bridle and Brown in 1974 and further developed by Mermelstein in 1976. It is based on experiments on the human misperception of words [2].

To extract a feature vector containing all information about the linguistic message, MFCC mimics some parts of human speech production and speech perception. MFCC mimics the logarithmic perception of loudness and pitch of the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics. To represent the dynamic nature of speech, the MFCC feature vector also includes the change of the features over time [3,4].


2. Implementation

The standard implementation of computing the Mel-Frequency Cepstral Coefficients is shown in Figure 1 and the exact steps are described below [3].

Figure 1: Block diagram of the MFCC algorithm


The input for the computation of the MFCCs is a speech signal in the time domain with a duration on the order of 30 ms.
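Cutting a longer recording into such short frames can be sketched as follows. This is only an illustration: the function name `split_into_frames`, the 16 kHz sampling rate and the 10 ms step between frames are assumptions and are not taken from the references.

```python
import numpy as np

def split_into_frames(signal, sample_rate=16000, frame_ms=30.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping frames of about frame_ms milliseconds.

    sample_rate and hop_ms are illustrative assumptions, not values from the text.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        # Signal is shorter than one frame: return an empty set of frames.
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    starts = hop_len * np.arange(n_frames)
    # One row per frame; trailing samples that do not fill a frame are dropped.
    return np.stack([signal[s:s + frame_len] for s in starts])

frames = split_into_frames(np.arange(16000.0))
```

With these assumed settings each frame holds 480 samples, and every later processing step operates on one row of `frames` at a time.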

2.1 Fourier Transform

The first processing step is the computation of the frequency domain representation of the input signal. This is achieved by computing the Discrete Fourier Transform (DFT)

$$F_\tau(k) = \sum_{n=0}^{N-1} f_\tau(n)\, e^{-i 2 \pi k n / N}, \quad k = 0, \dots, N-1,$$

where $f_\tau(n)$ is the n-th sample of the speech frame at time frame τ and N is the number of sampling points within the frame. In implementations the Fast Fourier Transform (FFT), an efficient algorithm for computing the DFT, is used [3].
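The transform of a single frame can be sketched with NumPy's FFT routines. The Hamming window and the FFT length of 512 are assumptions made for this illustration; the text above does not prescribe a window.

```python
import numpy as np

def power_spectrum(frame, n_fft=512):
    """Return the power spectrum of one real-valued speech frame.

    The Hamming window and n_fft=512 are illustrative assumptions.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    # rfft returns only the n_fft // 2 + 1 non-redundant bins of a real signal
    # and zero-pads the frame if it is shorter than n_fft.
    spectrum = np.fft.rfft(windowed, n=n_fft)
    return np.abs(spectrum) ** 2

# A 30 ms frame of a 1 kHz sine sampled at 16 kHz (assumed rate).
t = np.arange(480) / 16000.0
ps = power_spectrum(np.sin(2 * np.pi * 1000.0 * t))
```

Because the input is real, only the lower half of the spectrum is kept; the upper half is its mirror image and carries no additional information.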

2.2 Mel-Frequency Spectrum

The second processing step is the computation of the mel-frequency spectrum. To this end, the spectrum is filtered with Nd different band-pass filters and the power of each frequency band is computed. This filtering mimics the human ear because the human auditory system uses the power over a frequency band as the signal for further processing. This processing step can be described by

$$M_\tau(j) = \sum_{k} d_j(k)\, \left| F_\tau(k) \right|^2, \quad j = 1, \dots, N_d,$$

where $F_\tau(k)$ is the spectrum of the frame τ computed in step 2.1 and $d_j(k)$ is the amplitude of the band-pass filter with the index j at the frequency k. A filter bank with a fixed set of band-pass filters cannot fully mimic the ear, because the ear can use any frequency as center frequency. For ASR, Nd band-pass filters that are equidistant on the mel scale are used. The mel scale is a non-linear scale that is adapted to the non-linear pitch perception of the human auditory system (for more information about the mel scale see Sense of Hearing). The number, the shape (triangular, trapezoidal, rectangular) and the center frequencies of the band-pass filters can be varied [3]. Figure 2 shows a typical filter bank with 25 triangular band-pass filters. Some research suggests that both too few and too many band-pass filters have a negative impact on the classification performance, and that overlapping rectangular filters achieve a better performance than triangular filters [5].

Figure 2: Filterbank with 25 triangular bandpass filters to compute the mel frequency spectrum. [4]
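A filter bank of this kind can be sketched as follows. The conversion formula mel = 2595 · log10(1 + f / 700) is one widely used approximation of the mel scale and is an assumption here, as are the function names, the 16 kHz sampling rate and the FFT length of 512.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common approximation of the mel scale (assumed, not from the text).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def triangular_mel_filterbank(n_filters=25, n_fft=512, sample_rate=16000):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    # n_filters + 2 points, equidistant on the mel scale: the outer two are
    # edges, the inner ones are the center frequencies of the filters.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        left, center, right = hz_points[j], hz_points[j + 1], hz_points[j + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: the smaller of both slopes, clipped at zero outside the band.
        bank[j] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrum(power_spec, bank):
    # Weighted sum of the power spectrum within each band.
    return bank @ power_spec

bank = triangular_mel_filterbank()
```

Each row of `bank` holds the weights of one band-pass filter, so applying the filter bank to a power spectrum is a single matrix-vector product.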

2.3 Logarithm

The third processing step computes the logarithm of the mel-frequency spectrum to mimic the human perception of loudness, because experiments showed that humans perceive loudness on a logarithmic scale [3].
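In code, this step is a single element-wise operation. The small floor value below is an assumption added for this sketch; it only prevents taking the logarithm of zero for silent bands.

```python
import numpy as np

def log_compress(band_energies, floor=1e-10):
    """Natural logarithm of the band energies.

    floor is an illustrative safeguard against log(0), not part of the method.
    """
    band_energies = np.asarray(band_energies, dtype=float)
    return np.log(np.maximum(band_energies, floor))

log_energies = log_compress([0.0, 1.0, np.e])
```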

2.4 Cepstral Coefficients

The fourth processing step tries to eliminate the speaker-dependent characteristics by computing the cepstral coefficients. From the source-filter model it is known that the speech signal is the convolution of the speaker-dependent source signal and the filter signal. In the frequency domain this convolution becomes a product, and the logarithm turns the product into a sum, so that the contributions of source and filter can be separated. To suppress the source signal, the cepstrum is computed. The cepstrum can be interpreted as the spectrum of a spectrum. Under ideal conditions, the speaker-dependent harmonics of the fundamental frequency are therefore transformed to a single higher-order cepstral coefficient (highlighted bar in Figure 3b). The inverse transformation of the lower-order cepstral coefficients shows the frequency response of the vocal tract (Figure 3c), and the inverse transformation of the higher-order cepstral coefficients shows the frequency spectrum of the source signal (Figure 3d). Therefore, the speaker-dependent harmonics are suppressed by taking only the lower-order cepstral coefficients for further processing. One common definition of the cepstrum of a signal is

$$C = \left| \mathcal{F}^{-1}\left\{ \log\left( \left| \mathcal{F}\{f\} \right|^2 \right) \right\} \right|^2,$$

where f is the input signal and $\mathcal{F}$ is the Fourier Transform [6]. The computation of the logarithm can be omitted here because the logarithm was already computed in the previous processing step 2.3. Instead of the Fourier Transform, the discrete cosine transform can be used, because the absolute value of the spectrum, or more precisely the periodic continuation of the signal, is real and symmetric. The cepstral coefficients are computed by

$$c_\tau(i) = \sum_{j=1}^{N_d} \log\left( M_\tau(j) \right) \cos\left( \frac{\pi i}{N_d} \left( j - \frac{1}{2} \right) \right), \quad i = 0, \dots, N_{mc} - 1,$$

where $M_\tau(j)$ is the mel-frequency spectrum from step 2.2, Nd is the number of band-pass filters and Nmc is the number of cepstral coefficients chosen for further processing [3]. Typically Nmc is in the range of thirteen to twenty.
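The cosine transform of the logarithmic filter-bank outputs can be sketched directly with NumPy. The unnormalized DCT-II used here is an assumption; implementations differ in scaling and indexing, and the function name is hypothetical.

```python
import numpy as np

def cepstral_coefficients(log_mel, n_coeffs=13):
    """Unnormalized DCT-II of one vector of log filter-bank outputs.

    Only the first n_coeffs (low-order) coefficients are returned; the
    scaling convention is an illustrative assumption.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    n_filters = log_mel.shape[-1]
    j = np.arange(n_filters)               # filter index, starting at 0
    i = np.arange(n_coeffs)[:, np.newaxis]  # coefficient index
    # The 0-based (j + 0.5) corresponds to (j - 1/2) for a 1-based filter index.
    basis = np.cos(np.pi * i * (j + 0.5) / n_filters)
    return basis @ log_mel

coeffs = cepstral_coefficients(np.ones(25))
```

For a flat log spectrum, as in this call, only the coefficient with index 0 is non-zero, which matches its role as a measure of overall power.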

Figure 3: a) Logarithmic power density spectrum of a speech signal. The highlighted frequency is the speaker-dependent fundamental frequency. b) Cepstral coefficients of a speech signal. The highlighted quefrency is the transformed fundamental frequency and the corresponding harmonics. c) Logarithmic power density spectrum of the inverse transformation of the low-pass filtered cepstral coefficients. Cutoff quefrency q = 20 ms. d) Logarithmic power density spectrum of the inverse transformation of the high-pass filtered cepstral coefficients. Cutoff quefrency q = 20 ms. [4]

2.5 Derivatives

All previous processing steps only use information from the current signal frame. To represent the dynamic nature of speech, the feature vector is extended by the first and second order derivatives of the cepstral coefficients with respect to time [6].

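The final feature vector is

$$x_\tau = \begin{pmatrix} c_\tau \\ \Delta c_\tau \\ \Delta\Delta c_\tau \end{pmatrix},$$

where $c_\tau$ contains the cepstral coefficients of the frame τ and $\Delta c_\tau$ and $\Delta\Delta c_\tau$ are their first and second order derivatives. A typical MFCC feature vector would be calculated from a window with 512 sample points and consist of 13 cepstral coefficients, 13 first order and 13 second order derivatives. This example would reduce the dimensionality from 512 to 39 dimensions.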
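Appending the derivatives can be sketched as follows. The simple finite differences computed by `np.gradient` are an assumption made for brevity; regression over several neighboring frames is also common in practice. The function name is hypothetical.

```python
import numpy as np

def add_derivatives(cepstra):
    """Append first and second order time derivatives to each frame.

    cepstra has shape (n_frames, n_coeffs) with at least two frames.
    Finite differences along the time axis are an illustrative choice.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    delta = np.gradient(cepstra, axis=0)    # first order derivative
    delta2 = np.gradient(delta, axis=0)     # second order derivative
    return np.concatenate([cepstra, delta, delta2], axis=1)

# Ten frames with 13 coefficients each, rising linearly over time.
features = add_derivatives(np.arange(130.0).reshape(10, 13))
```

With 13 cepstral coefficients per frame the result has 39 columns, which corresponds to the dimensionality of the typical feature vector described above.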


3. Limitations

Even though MFCC feature vectors are commonly used in ASR systems, the MFCC feature vectors have some limitations. Most of these limitations arise from the computation of the cepstral coefficients.

One critical assumption of the cepstral coefficients is that the fundamental frequency is much lower than the frequency components of the linguistic message. This assumption is needed because otherwise the fundamental frequency and its harmonics cannot be excluded while keeping all information about the linguistic message. However, many female speakers do not fulfill this assumption. Therefore, it is unknown whether the speaker-dependent characteristics can be suppressed for all speakers [6].

Another limitation of the cepstral coefficients is their lack of interpretability. Only the first two cepstral coefficients c0 and c1 have a meaningful interpretation: c0 is the power over all frequency bands and c1 is the balance between low and high frequency components within the signal frame. The other cepstral coefficients have no clear interpretation other than that they contain the finer details of the spectrum needed to discriminate the sounds. Due to this lack of interpretability, the reaction of MFCC features to accents or noise is unknown. As a consequence, the feature vector distributions for each speaker have to be merged, which yields greater variances and could reduce the separability of the classes [1].

Furthermore, cepstral coefficients apply an equal weight to high and low amplitudes of the log spectrum, even though it is known that high-energy amplitudes dominate the perception of speech. This equal weighting reduces the robustness of cepstral coefficients, because noise fills the valleys between formants and harmonics and deteriorates the performance of MFCCs [1].


4. Variations

To improve the MFCCs many variations and extensions have been proposed. This section will give a brief overview of some proposed variations and extensions.

One example of an extension of MFCC is cepstral mean normalization. This extension tries to reduce channel effects, such as different microphones or different locations, by subtracting the cepstral mean from the MFCC feature vector. Another extension normalizes the spectrum of the speech signal with the hearing threshold to exclude features that could not be perceived by the human auditory system [3].
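Cepstral mean normalization can be sketched in one line. Computing the mean over all frames of one recording is an assumption of this illustration, and the function name is hypothetical.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-coefficient mean over all frames.

    cepstra has shape (n_frames, n_coeffs); averaging over the whole
    recording is an illustrative assumption.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
cepstra = rng.normal(size=(50, 13))
channel_offset = np.linspace(-1.0, 1.0, 13)  # constant offset per coefficient
normalized = cepstral_mean_normalization(cepstra + channel_offset)
```

A constant offset per coefficient, which is how a fixed channel appears in the cepstral domain, is removed completely, so `normalized` equals the result obtained without the offset.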

Other methods substitute some parts of MFCC to improve the error rate of ASR. For example, the Bark Frequency Cepstral Coefficients (BFCC) use Nd equidistant band-pass filters on the Bark scale instead of the mel scale [5]. Other examples are the root-cepstrum coefficients and the µ-law coefficients, which substitute the discrete cosine transform with a different transformation to reduce the impact of noise on the feature vector [3].




[1] D. O’Shaughnessy, “Invited paper: Automatic speech recognition: History, methods and challenges”, Pattern Recognition, vol. 41, no. 10, pp. 2965–2979, 2008.

[2] P. Mermelstein, “Distance Measures for Speech Recognition – Psychological and Instrumental”, Pattern Recognition and Artificial Intelligence, pp. 374–388, 1976.

[3] H. Niemann, Klassifikation von Mustern, 2nd ed. Berlin, New York, Tokyo: Springer, 2003.

[4] E. Schukat-Talamazzini, Automatische Spracherkennung, Braunschweig, Wiesbaden: Vieweg Verlag, 1995.

[5] F. Zheng, G. Zhang, Z. Song, “Comparison of Different Implementations of MFCC”, Journal of Computer Science & Technology, vol. 16, pp. 582–589, 2001.

[6] K. Kido, Digital Fourier Analysis: Advanced Techniques, Berlin, New York, Tokyo: Springer, 2014.