1. Introduction

Modern feature extraction algorithms for speech signals incorporate an artificial neural network (ANN). These algorithms take short-time feature vectors such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction coefficients or Wavelet transform coefficients, computed for time frames on the order of 10 ms, and post-process them with an ANN. To capture longer temporal context, the n feature vectors within a sliding window of size n are stacked and used as the input of the ANN. The number of input neurons therefore depends on the window size and the dimensionality of the feature vector, while the trained target classes, the number of hidden layers and the number of output neurons can be varied [1]. Generally, two different architectures can be used to include ANNs in feature extraction: the hybrid architecture and the tandem architecture.
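
As an illustration, the stacking of n consecutive feature vectors into a single ANN input can be sketched as follows. The window size of 9 frames and the 13-dimensional MFCC vectors are assumptions chosen for this example, not values prescribed above.

```python
import numpy as np

def stack_frames(features, n):
    """Stack n consecutive feature vectors (e.g. MFCCs) into one input
    vector per centre frame, padding the edges by repeating the first
    and last frame.

    features: array of shape (num_frames, dim)
    returns:  array of shape (num_frames, n * dim)
    """
    half = n // 2
    padded = np.vstack([features[:1]] * half + [features] + [features[-1:]] * half)
    return np.hstack([padded[i:i + len(features)] for i in range(n)])

# toy example: 100 frames of 13-dimensional MFCCs, context window of 9 frames
mfcc = np.random.randn(100, 13)
ann_input = stack_frames(mfcc, 9)
print(ann_input.shape)  # (100, 117) -> 9 * 13 input neurons
```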

2. ANN Architectures

2.1 Hybrid Architecture

The flow chart of an automatic speech recognition (ASR) system using the hybrid architecture is shown in Figure 1. First, a low-level feature extraction method is used to compute n sequential feature vectors. These feature vectors are used as the input of the ANN, and the ANN computes the posterior probability of the trained classes. The posterior probability is written as

p(s_i | x_j),

where s_i is the i-th class and x_j is the j-th feature vector of the preceding feature extraction algorithm. The probability distribution is used as the input of a Hidden Markov Model (HMM) to transcribe the spoken word [1,2]. A complete description of the hybrid approach can be found in the article Context-Dependent Deep Neural Networks: Breakthrough of neural networks for speech recognition.

Figure 1: Block diagram of the hybrid ANN architecture
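
A minimal numpy sketch of the posterior computation described above follows. The single hidden layer, its size, the 40 target classes and the final rescaling by class priors (a common way to feed ANN posteriors into an HMM as scaled likelihoods) are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ann_posteriors(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network: for each stacked feature
    vector x_j it returns p(s_i | x_j) over all trained classes s_i."""
    h = np.tanh(x @ W1 + b1)                  # hidden layer
    return softmax(h @ W2 + b2)               # posterior distribution per frame

# toy dimensions: 117 stacked inputs, 512 hidden units, 40 phone classes
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 117))
W1, b1 = rng.standard_normal((117, 512)) * 0.01, np.zeros(512)
W2, b2 = rng.standard_normal((512, 40)) * 0.01, np.zeros(40)
post = ann_posteriors(x, W1, b1, W2, b2)      # shape (100, 40), rows sum to 1

# in a hybrid system the posteriors are typically divided by the class
# priors before being used as (scaled) emission likelihoods in the HMM
priors = np.full(40, 1.0 / 40)                # uniform priors assumed here
scaled_likelihoods = post / priors
```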

 

2.2 Tandem Architecture

The tandem architecture extends the low-level feature vectors with the probability distribution of the target classes. The flow chart of the tandem approach is shown in Figure 2.

Figure 2: Block diagram of the tandem ANN architecture

 

The n short-time feature vectors are used as inputs for the ANN, and the probability distribution of the target classes is computed. This probability distribution extends the feature vector: the extended feature vector consists of the n feature vectors from the short-time feature extraction algorithm and the probability distribution computed by the ANN. Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) reduces the high dimensionality of the extended feature vector. Figure 3 shows two potential methods to stack the separate feature streams into a single feature vector. The first method (Figure 3.a) reduces the dimension of each feature stream separately. The second method (Figure 3.b) reduces the dimension of both feature streams in a global transformation. [1]

Figure 3: a) The dimension of each feature stream is reduced separately. b) The dimension of the feature vector is reduced in a global transformation. [1]
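
The following sketch illustrates the two stacking variants of Figure 3 using scikit-learn's PCA. The stream dimensions and the target dimensions are assumed values chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# toy feature streams per frame: stacked short-time features and ANN posteriors
rng = np.random.default_rng(0)
mfcc_stream = rng.standard_normal((100, 117))   # short-time feature stream
post_stream = rng.standard_normal((100, 40))    # ANN probability stream

# variant (a): reduce each stream separately, then concatenate
mfcc_red = PCA(n_components=30).fit_transform(mfcc_stream)
post_red = PCA(n_components=12).fit_transform(post_stream)
tandem_a = np.hstack([mfcc_red, post_red])      # 42-dimensional feature vector

# variant (b): concatenate first, then apply one global transformation
extended = np.hstack([mfcc_stream, post_stream])
tandem_b = PCA(n_components=42).fit_transform(extended)
```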

 

3. Bottle-neck Features

A different, non-linear dimension reduction technique for ANNs is bottle-neck processing. Some research suggests that bottle-neck dimension reduction can learn a better low-dimensional representation than PCA [1]. For bottle-neck processing, a hidden layer within a deep neural network is replaced by a layer whose number of neurons corresponds to the target dimensionality. Figure 4.a shows a bottle-neck layer with 4 neurons within the hidden layers. During training the complete ANN is trained on the target classes. After training, all layers after the bottle-neck layer are removed, and the outputs of the neurons in the bottle-neck layer are used as features. This method is referred to as probabilistic bottle-neck features. [1,3,4] Experimental results showed that an additional decorrelation step using LDA or PCA is necessary to achieve competitive results [1].

Figure 4: Artificial neural network with a bottle-neck layer of 4 neurons within the hidden layers during training (a) and decoding (b). [1]
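
A minimal PyTorch sketch of bottle-neck feature extraction, assuming a deep network with a 4-neuron bottle-neck layer as in Figure 4. The layer sizes, the sigmoid activations and the number of target classes are assumptions made for the example.

```python
import torch
import torch.nn as nn

# assumed dimensions: 117 stacked inputs, 40 target classes, 4-unit bottle-neck
full_net = nn.Sequential(
    nn.Linear(117, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 4),    nn.Sigmoid(),   # bottle-neck layer
    nn.Linear(4, 1024),    nn.Sigmoid(),
    nn.Linear(1024, 40),                   # class scores, softmax in the loss
)

# ... train full_net on the target classes, e.g. with nn.CrossEntropyLoss() ...

# decoding: drop everything after the bottle-neck layer and use its
# activations as the bottle-neck features
bottleneck_extractor = nn.Sequential(*list(full_net.children())[:6])

frames = torch.randn(100, 117)             # stacked short-time features
with torch.no_grad():
    bn_features = bottleneck_extractor(frames)   # shape (100, 4)
# a PCA/LDA decorrelation step would typically follow before the HMM
```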

References

[1] C. Plahl, “Neural Network based Feature Extraction for Speech and Image Recognition”, Ph.D. dissertation, Dept. of Computer Science, RWTH Aachen University, Aachen, Germany, 2014.

[2] A. Hagen, A. Morris, “Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR”, Computer Speech and Language, vol. 19, no. 1, pp. 3–30, 2005.

[3] D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning internal representations by error propagation”, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 318–362, 1985.

[4] F. Grézl, M. Karafiát, S. Kontár, J. Černocký, “Probabilistic and Bottle-Neck Features for LVCSR of Meetings”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 757–760, 2007.

