The chapter “Feature extraction” describes feature extraction methods for speech signals. It introduces the motivation for extracting features from speech signals, describes various low-level features and presents various combinations of low-level features. Finally, the chapter describes the feature extraction component of the RWTH Aachen system, which combines the previously described methods.

1. Introduction

After the preprocessing step, feature extraction is the second component of automatic speech recognition (ASR) systems. This component derives descriptive features from the windowed and enhanced speech signal to enable the classification of sounds. Feature extraction is needed because the raw speech signal contains information besides the linguistic message and has a high dimensionality. Both characteristics make the raw signal unsuitable for direct classification of sounds and would result in a high word error rate. Therefore, the feature extraction algorithm derives a characteristic feature vector of lower dimensionality, which is used for the classification of sounds. [1,2]

A feature vector should emphasize the information that is important for the specific task and suppress all other information. As the goal of automatic speech recognition is to transcribe the linguistic message, the information about this message needs to be emphasized [3]. Speaker-dependent characteristics and the characteristics of the environment and the recording equipment should be suppressed, because they do not contain any information about the linguistic message. Including this non-linguistic information would introduce additional variability, which could have a negative impact on the separability of the phone classes. Furthermore, the feature extraction should reduce the dimensionality of the data in order to reduce the computation time and the number of required training samples [1].

2. Low Level Features

Until now, many different features that highlight different aspects of the speech signal have been proposed. These features can mostly be divided into linguistic and acoustic features. Acoustic features are only relevant for the classification of non-verbal vocal outbursts such as laughter or sighs. Linguistic features are more relevant to ASR systems because these systems try to transcribe the linguistic message [6]. Examples of linguistic features are intensity, Linear Predictive Coding (LPC) [4], Perceptual Linear Predictive coefficients (PLP) [5], Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), wavelet-based features and Non-Negative Matrix Factorization features. Many of these low-level features operate on speech signal frames in the range from 10 ms to 30 ms due to their quasi-stationary characteristics [7]. Furthermore, many of these features are biologically inspired and are extracted from the spectrum, because human speech production controls the spectrum of the signal and the ear acts as a spectrum analyzer (for more information see Speech Production and Sense of Hearing) [1].

The dedicated wiki articles give an extensive description of the corresponding feature extraction methods. A comparison of three of these low-level features is given in the article Comparison of different feature extraction algorithms. Further information regarding the other feature extraction methods, which are not described in this wiki, can be found in the corresponding references.
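
As an illustration of such frame-based processing, the following minimal sketch computes MFCC features with the librosa library; the file name, sampling rate and parameter values are only illustrative assumptions and are not taken from the methods referenced above.

```python
# Minimal sketch: frame-based MFCC extraction with librosa.
# File name and parameter values are illustrative assumptions.
import librosa

# Load a speech recording (resampled to 16 kHz here).
signal, sr = librosa.load("speech.wav", sr=16000)

# 25 ms windows with a 10 ms hop fall inside the quasi-stationary
# range of 10-30 ms mentioned above.
frame_length = int(0.025 * sr)   # 400 samples
hop_length = int(0.010 * sr)     # 160 samples

# 13 cepstral coefficients per frame; result has shape (n_mfcc, n_frames).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)
print(mfcc.shape)
```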

3. Combination of Features

The previously described low-level features all have different shortcomings. To compensate for these individual shortcomings, various combinations of low-level features have been proposed. This section presents an overview of such combinations.

3.1 Longer Time Frames

Many of the algorithms described in Section 2 use a quasi-stationary speech signal frame of 10 ms to 30 ms. These quasi-stationary signals correspond to a static configuration of the vocal tract. However, steady configurations of the vocal tract contain relatively little linguistic information, and the fundamental linguistic units are likely to span longer intervals. Furthermore, it is hard to distinguish the short-term stationary speech signal from the long-term stationary noise signal in a signal frame of 10 ms. Therefore, longer time frames of the speech signal seem more appropriate for ASR systems. Such longer time frames also correspond to the auditory system of mammals, which processes auditory signals in the order of 200 ms [1]. To obtain feature vectors spanning a longer time frame, n sequential low-level feature vectors within a sliding window are combined into a single feature vector and processed simultaneously [8], as sketched below.
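
The following small sketch shows this sliding-window stacking, assuming frame-wise feature vectors are already available as a NumPy array; the window size and feature dimensionality are illustrative assumptions.

```python
# Sketch: combine n sequential low-level feature vectors into a single
# long-span vector with a sliding window (shapes are illustrative).
import numpy as np

def stack_context(features, n=9):
    """features: (T, D) array of frame-wise feature vectors.
    Returns a (T, n*D) array in which each row concatenates the n frames
    around the current frame (edge frames are repeated as padding)."""
    half = n // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    T, D = features.shape
    return np.stack([padded[t:t + n].reshape(n * D) for t in range(T)])

frame_features = np.random.randn(100, 13)       # e.g. 100 frames of 13 MFCCs
long_span = stack_context(frame_features, n=9)  # shape (100, 117)
```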

3.2 Feature Stacking

Besides the use of longer time frames, feature stacking is used to obtain more robust features for ASR. Feature stacking combines the extracted features from different algorithms, as shown in Figure 1. These stacked features should result in a lower word error rate, because the error characteristics differ between feature extraction methods, so the combination can compensate for the limitations of each individual method [7]. Most commonly, the dimensionality of the stacked feature vector is reduced by Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) [8]. An example of this method is described in the section on the tandem architecture in the article "Artificial Neural Networks for Feature Extraction".

Figure 1: The features of different feature extraction methods are stacked to achieve a better word error rate [9]
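
The following sketch illustrates feature stacking with a subsequent PCA-based dimensionality reduction using scikit-learn; the two random arrays merely stand in for frame-synchronous MFCC and PLP streams, and all shapes are illustrative assumptions.

```python
# Sketch of feature stacking: two frame-synchronous feature streams are
# concatenated and then decorrelated/reduced with PCA.
import numpy as np
from sklearn.decomposition import PCA

mfcc_stream = np.random.randn(100, 13)   # stand-in for MFCC features
plp_stream = np.random.randn(100, 13)    # stand-in for PLP features

stacked = np.concatenate([mfcc_stream, plp_stream], axis=1)  # (100, 26)

# Reduce the stacked vector; LDA could be used instead when phone
# labels are available for the training frames.
pca = PCA(n_components=20)
reduced = pca.fit_transform(stacked)     # (100, 20)
```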

3.3 Artificial Neural Networks

Another approach to generating more robust features for ASR is to post-process the low-level features with an artificial neural network. Different architectures for this post-processing are described in the article "Artificial Neural Networks for Feature Extraction".

4. Modern Feature Extraction

Many modern feature extraction components for ASR systems use all of the previously described combinations of low-level features. For example, the feature extraction component of the RWTH Aachen system, which yielded the best word error rate for English and German in 2011, combines sequential low-level feature vectors into a single feature vector, stacks different feature streams and post-processes the low-level features with an artificial neural network using the tandem architecture with bottle-neck dimension reduction. As low-level features, this component uses Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive coefficients (PLP). The first 16 cepstral coefficients of the MFCCs are computed with 20 band-pass filters, cepstral mean normalization and variance normalization. Nine MFCC feature vectors within a sliding window are stacked and the dimensionality is reduced to 45 dimensions using LDA. The additional feature stream of PLP features is extracted and processed similarly to the MFCC features. Finally, the phone posterior probabilities are computed by a hierarchical bottle-neck feed-forward ANN. The probabilistic bottle-neck features are decorrelated by PCA and added to the feature vector. [10]
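
The following schematic sketch shows one stage of such a pipeline, namely stacking nine MFCC frames within a sliding window and reducing them to 45 dimensions with LDA, as described above; the random data and the phone labels are purely illustrative assumptions, and this is not the actual RWTH implementation.

```python
# Schematic sketch of one pipeline stage: nine MFCC frames are stacked
# within a sliding window and reduced to 45 dimensions with LDA.
# The random data and phone labels are illustrative stand-ins only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_frames, n_ceps, context = 1000, 16, 9
mfcc = np.random.randn(n_frames, n_ceps)                 # 16 cepstral coefficients
phone_labels = np.random.randint(0, 50, size=n_frames)   # stand-in phone labels

# Stack 9 sequential frames (edge frames repeated), giving 144 dimensions.
half = context // 2
padded = np.pad(mfcc, ((half, half), (0, 0)), mode="edge")
stacked = np.stack([padded[t:t + context].ravel() for t in range(n_frames)])

# Supervised reduction to 45 dimensions, as in the description above.
lda = LinearDiscriminantAnalysis(n_components=45)
reduced = lda.fit_transform(stacked, phone_labels)       # (1000, 45)
```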

References

[1] H. Bourlard, H. Hermansky, N. Morgan, “Towards increasing speech recognition error rates”, Speech Communication, vol. 18, pp. 205–231, 1995.

[2] D. O’Shaughnessy, “Invited paper: Automatic speech recognition: History, methods and challenges”, Pattern Recognition, vol. 41, no. 10, pp. 2965–2979, 2008.

[3] H. Niemann, Klassifikation von Mustern, 2nd ed., Berlin, New York, Tokyo: Springer, 2003.

[4] E. Schukat-Talamazzini, Automatische Spracherkennung, Braunschweig, Wiesbaden: Vieweg Verlag, 1995.

[5] H. Hermansky, B. A. Hanson, H. Wakita, “Perceptually based linear predictive analysis of speech”, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 10, pp. 509–512, 1985.

[6] B. Schuller, “Voice and speech analysis in search of states and traits”, in Computer Analysis of Human Behaviour, A. A. Salah, T. Gevers (eds.), Berlin, New York, Tokyo: Springer, 2011, ch. 9, pp. 227–253.

[7] L. Rabiner and B.-H. Juang, Fundamentals of speech recognition, Englewood Cliffs: Prentice-Hall International, 1993.

[8] C. Plahl, “Neural Network based Feature Extraction for Speech and Image Recognition”, Ph.D. dissertation, Dept. Computer Science, RWTH Aachen, Aachen, Germany, 2014.

[9] A. Hagen, A. Morris, “Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR”, Computer Speech and Language, vol. 19, no. 1, pp. 3–30, 2005.

[10] M. Sundermeyer, M. Nußbaum-Thom, S. Wiesler, C. Plahl, A. El-Desoky Mousa, S. Hahn, D. Nolden, R. Schlüter, H. Ney, “The RWTH 2010 Quaero ASR Evaluation System for English, French, and German”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2212–2215, 2011.

