1. Introduction

Until recently, most speech recognition systems used Hidden Markov Models (HMMs) to deal with the temporal variability of speech and Gaussian Mixture Models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input [1]. GMMs and HMMs co-evolved as a way of doing speech recognition when computers were too slow to explore more computationally intensive approaches such as neural networks [2]. Better training algorithms and faster processors led to renewed interest in applying neural networks to speech recognition. In 2012, a clear breakthrough came with the introduction of the hybrid context-dependent deep neural network hidden Markov model (CD-DNN-HMM), which outperformed the hybrid GMM-HMMs in all usage scenarios [3].

This article describes the CD-DNN-HMM architecture in detail. First, the architecture of the CD-DNN-HMM is explained. Subsequently, its advantages over GMM-HMMs are pointed out. In the last section, several experiments support the theoretical explanation for the strength of the CD-DNN-HMM.

2. Architecture

The context-dependent deep neural network (CD-DNN) simply replaces the GMM in common GMM-HMM systems. Figure 1 shows the general architecture of a state-of-the-art speech recognition system. Figure 2 shows the architecture of the CD-DNN-HMM introduced in [3].


Figure 1: General architecture of a speech recognition system.

Figure 2: Architecture of the CD-DNN-HMM [4].

The neural network consists of an input layer, several hidden layers and an output layer. The input of the neural network is composed of the feature vectors of several frames in the context window. For example, the context window in [3] consists of the feature vectors of 5 predecessor frames, 1 central frame and 5 successor frames. The weights and neurons in the hidden layers give the model enough degrees of freedom to model speech. The model in [3] has 5 hidden layers with 2048 neurons each. The last layer implements the softmax function and produces the posterior probabilities of the different HMM states. The model needs one output neuron for each target HMM state. The HMM is necessary to model the temporal structure of speech by managing the transition probabilities between different states. In [3], the HMM states are senones (tied context-dependent triphone states). With the algorithm from Bengio's work [5], the whole CD-DNN-HMM can be optimized globally using backpropagation.
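The data flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [3]: the layer sizes are toy values (3 features per frame, 8 units per hidden layer, 4 HMM states instead of thousands of senones), the weights are random rather than trained, the sigmoid hidden activation and the edge padding are assumptions, and all function names are hypothetical. Only the 5+1+5 context window, the 5 hidden layers and the softmax output layer are taken from the text.

```python
import math
import random


def stack_context(frames, t, left=5, right=5):
    """Concatenate frames t-left .. t+right into one input vector.

    Frames outside the utterance are replaced by the nearest edge frame
    (an assumption; the text does not say how boundaries are handled).
    """
    stacked = []
    for k in range(t - left, t + right + 1):
        k = min(max(k, 0), len(frames) - 1)
        stacked.extend(frames[k])
    return stacked


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def affine(weights, bias, x):
    """Compute W x + b for one layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, x):
    """Sigmoid hidden layers followed by a softmax output layer."""
    *hidden, output = layers
    for weights, bias in hidden:
        x = [sigmoid(v) for v in affine(weights, bias, x)]
    weights, bias = output
    return softmax(affine(weights, bias, x))


# Toy sizes (assumptions); [3] uses 5 hidden layers with 2048 neurons.
FEAT_DIM, N_HIDDEN_LAYERS, HIDDEN_UNITS, N_STATES = 3, 5, 8, 4

rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(20)]

sizes = [11 * FEAT_DIM] + [HIDDEN_UNITS] * N_HIDDEN_LAYERS + [N_STATES]
layers = [init_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

x = stack_context(frames, t=10)   # 5 predecessors + central frame + 5 successors
posteriors = forward(layers, x)   # one posterior probability per HMM state
print(len(x), len(posteriors), sum(posteriors))
```

The output vector has one entry per HMM state, which mirrors the statement that the model needs one output neuron per target state.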

3. Advantages

As the experiments in Section 4 show, the neural network outperforms all GMM-HMM models in different usage scenarios. Since replacing the GMM with a CD-DNN yields better results, the question arises why the CD-DNN outperforms the GMM in all usage scenarios.

Efficient representation

Speech is produced by modulating a relatively small number of parameters of a dynamical system. This implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. GMMs are easy to fit to data using the EM algorithm. They can model any distribution given enough components, but they use their parameters inefficiently [2]. Each parameter applies only to a very small fraction of the data, so a large number of components has to be trained. For DNNs, the output on each training case is sensitive to a large fraction of the weights, which means that each parameter of a DNN is influenced by a large part of the training data. In other words, a neural network can share similar portions of the input space to train some hidden units while keeping other units sensitive to the subset of input features that is significant for recognition. DNNs therefore use their parameters more efficiently than GMMs. Because fewer parameters have to be trained, neural networks can model speech with much less training data.

Modelling of correlated input data

Almost the entire gain of the DNN is attributed to its ability to use a long context window. This window consists of input feature vectors that are concatenated from several consecutive speech frames and provides more context information to distinguish among different phones. These concatenated frames are highly correlated because of large overlaps in speech analysis windows (see Figure 2).

The high correlation leads to ill-conditioned covariance matrices in the GMMs. DNNs, in contrast, use linear perceptrons as their basic units for classification and are well able to exploit these highly correlated features. As long as there is no good way to use these highly correlated features in GMMs, it seems difficult for GMMs to compete with DNNs [6].
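The effect of overlapping analysis windows on the correlation between frames can be illustrated with a small experiment. This is only a sketch under stated assumptions: the signal is synthetic white noise rather than speech, the single feature is the frame energy rather than a real feature vector, and the window length (25 samples) and shift (10 samples) are toy values chosen so that adjacent windows share 15 of their 25 samples. All function names are hypothetical.

```python
import random


def frame_energies(signal, frame_len, frame_shift):
    """Energy of each analysis window; windows overlap when frame_shift < frame_len."""
    starts = range(0, len(signal) - frame_len + 1, frame_shift)
    return [sum(s * s for s in signal[i:i + frame_len]) for i in starts]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def lagged_correlation(values, lag):
    """Correlation between each frame's value and the value `lag` frames later."""
    return pearson(values[:-lag], values[lag:])


rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # white noise, not speech

energies = frame_energies(signal, frame_len=25, frame_shift=10)
adjacent = lagged_correlation(energies, 1)  # windows share 15 of 25 samples
distant = lagged_correlation(energies, 3)   # windows share no samples

print(f"adjacent frames: {adjacent:.2f}, frames three shifts apart: {distant:.2f}")
```

In this toy setting the features of neighboring frames are clearly correlated purely because the windows overlap, while frames without shared samples are nearly uncorrelated. Real speech adds further correlation because the signal itself changes slowly, which this sketch does not model.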

Figures 3 and 4 visualize the extraction of the feature vectors in the context window and the construction of the input for the DNN.



Figure 3: Visualisation of a context window [7].

Figure 4: Visualisation of the input for the neural network [7].


4. Experiments

The small vocabulary PSC task

In [6], several experiments related to the effect of the context window size have been conducted.

First, the same input features are used for the GMM and the DNN model. Since the GMM cannot handle highly correlated features, it does not make sense to use 11 concatenated frames, so only the current frame is used. The results for the small PSC task are shown in Table 1. It is quite surprising that the DNN does not yield any better performance than discriminatively trained GMMs if both use the current frame only. If the context window is extended to include more neighboring frames as DNN input features, the word error rate (WER) improves dramatically from 18.0% to 13.4% (using 11 frames). This gain can be attributed to the DNN's ability to use the highly correlated concatenated input feature vectors from several consecutive speech frames within a relatively long context window.

Table 1: The character error rate (CER) for different context window sizes and numbers of hidden layers on the small vocabulary PSC task [6].

Effects on the large vocabulary Switchboard task

The experiments were repeated on the Switchboard task. The results in Table 2 lead to the same conclusion. If only the current frame is used, the DNN does not surpass GMMs even when the number of hidden layers is increased to 5. However, the performance of the DNN improves significantly with a longer (5+1+5) context window: WER drops from 42.6% to 31.2% on Hub98 and from 33.1% to 23.7% on Hub01.

Table 2: The character error rate (CER) for different context window sizes and numbers of hidden layers on the large vocabulary Switchboard task [6].
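To compare the gains across the two tasks, the absolute error rates quoted in the text can be converted into relative error reductions. The short sketch below does only this arithmetic; the error rates are the ones reported above from [6], while the function name and the dictionary layout are hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction when going from `baseline` to `improved`."""
    return (baseline - improved) / baseline


# (single-frame DNN error rate in %, 11-frame DNN error rate in %), as quoted from [6]
results = {
    "PSC": (18.0, 13.4),
    "Switchboard Hub98": (42.6, 31.2),
    "Switchboard Hub01": (33.1, 23.7),
}

for task, (baseline, improved) in results.items():
    reduction = relative_reduction(baseline, improved)
    print(f"{task}: {baseline}% -> {improved}% ({reduction:.1%} relative reduction)")
```

The relative reductions come out at roughly a quarter or slightly more on all three test sets, which is consistent with the observation that both tasks lead to the same conclusion.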

Effects on the small vocabulary TIMIT corpus

In [2], varying the number of frames in the input window shows that the best performance on the development set is achieved using 17 frames. Much smaller (7 frames) and much larger (37 frames) windows give markedly worse performance. The context window (110 ms to 270 ms) covers the average duration of phones or syllables. Smaller input windows miss important discriminative information, while networks with larger windows are probably distracted by the almost irrelevant information far from the center of the window.

Figure 5: Phone error rate as a function of the context window size and the number of layers [2].
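The relation between the number of frames and the time span of the context window can be made concrete with a short calculation. The sketch assumes a frame shift of 10 ms, which is a common choice but is not stated in the text, and it approximates the span as the number of frames times the frame shift. Under that assumption, the quoted range of 110 ms to 270 ms would correspond to 11 and 27 frames; this mapping is an inference, not something the text states. The function names are hypothetical.

```python
FRAME_SHIFT_MS = 10  # assumption: 10 ms between successive frames


def window_span_ms(n_frames, frame_shift_ms=FRAME_SHIFT_MS):
    """Approximate time span covered by n_frames consecutive frames."""
    return n_frames * frame_shift_ms


def context_split(n_frames):
    """Return (left, right) context sizes for a symmetric window with a central frame."""
    if n_frames % 2 == 0:
        raise ValueError("a symmetric window needs an odd number of frames")
    half = (n_frames - 1) // 2
    return half, half


for n in (7, 11, 17, 27, 37):
    left, right = context_split(n)
    print(f"{n} frames = {left}+1+{right}, about {window_span_ms(n)} ms")
```

With these assumptions, the 7-frame window spans well under the quoted range and the 37-frame window well over it, which matches the description of both as giving worse results.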



[1] G. Hinton, L. Deng, D. Yu, G. Dahl, et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups", IEEE Signal Processing Magazine, vol. 29, issue 6, pp. 82 - 97, Nov. 2012

[2] A. Mohamed, G. Dahl, G. Hinton, "Acoustic Modeling Using Deep Belief Networks", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 14 - 22, Jan. 2012

[3] G. Dahl, D. Yu, L. Deng, A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 30 - 42, Jan. 2012

[4] L. Deng, D. Yu, "Deep Learning: Methods and Applications", Foundations and Trends in Signal Processing, vol. 7, issues 3-4, p. 249, Jun. 2014

[5] Y. Bengio, R. de Mori, G. Flammia, R. Kompe, "Global optimization of a neural network-hidden Markov model hybrid", IEEE Transactions on Neural Networks, vol. 3, issue 2, pp. 252 - 259, Mar. 1992

[6] J. Pan, C. Liu, Z. Wang, Y. Hu, H. Jiang, "Investigation of Deep Neural Networks (DNN) for Large Vocabulary Continuous Speech Recognition: Why DNN Surpasses GMMs in Acoustic Modeling", International Symposium on Chinese Spoken Language Processing (ISCSLP), vol. 8, pp. 301 - 305, Dec. 2012

[7] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, "Convolutional Neural Networks for Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, issue 10, pp. 1533 - 1545, Oct. 2014