1. Introduction

A large number of benchmark datasets have been developed for testing different ASR architectures. Figure 1, compiled by NIST, shows the word error rate (WER) over time for a number of increasingly difficult speech recognition tasks. The WERs were obtained using GMM-HMM technology. For one particularly difficult large vocabulary task (Switchboard), the curve stays flat over many years. In 2009, the success of DNNs on the TIMIT task motivated more ambitious experiments with much larger vocabularies and more varied speaking styles. In 2011, the WER dropped sharply (marked by the red star in Figure 1) with the introduction of CD-DNN-HMMs [1].

Figure 1: The evolution of the word error rate (WER) of different speech recognition tasks compiled by NIST [1]
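
For reference, the WER plotted in Figure 1 is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal Python sketch of that definition. The function name and the example sentences are illustrative assumptions, and real scoring tools typically apply additional text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("nine" -> "five") and one deletion
# ("morning") against a six-word reference, i.e. 2 errors / 6 words.
wer = word_error_rate("call me at nine tomorrow morning", "call me at five tomorrow")
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.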


Meanwhile, three of the biggest speech research groups have developed new large vocabulary benchmark tasks. This article examines the benchmarks from these leading groups that compare the performance of GMMs, DNNs and CNNs on large vocabulary tasks. First, the general procedure for finding the optimal setup of each architecture is described. After that, four large vocabulary continuous speech recognition (LVCSR) tasks are introduced, together with the setup of the GMM, DNN and CNN used for each. The results of all benchmark tasks are summarized in the last section.

2. Procedure

The TIMIT Acoustic-Phonetic Continuous Speech Corpus, provided by the Linguistic Data Consortium (LDC), offers a simple and convenient way of testing new approaches to speech recognition. TIMIT is a good starting point for developing a new approach, especially one that requires a challenging amount of computation. The training set is small enough to make it feasible to try many variations of a new method, and many existing techniques have already been benchmarked on TIMIT's core test set. With TIMIT, it is easy to see whether a new approach is promising by comparing it with existing techniques that have been implemented by their proponents [2]. To find the best configuration of each architecture, the research groups ran a range of experiments and evaluations on TIMIT. They then applied the best configuration to the very challenging large vocabulary (LVCSR) tasks and compared the results to their best GMM-HMM model. The GMMs, DNNs and CNNs that worked best on the TIMIT data formed the starting point for subsequent experiments on these large vocabulary tasks, which were too computationally intensive to allow extensive exploration of variations in the architecture, the representation of the acoustic input or the training procedure [2].

3. Benchmarks on LVCSR Tasks

3.1 Switchboard Recognition Task

After the success of convolutional neural networks on the TIMIT data set, Sainath et al. [3] applied hybrid CNN-HMMs to the very challenging Switchboard Recognition Task. The Switchboard collection focused primarily on GSM cellular phone technology and provides over 300 hours of training data. The project's goal was to recruit 190 subjects, balanced by gender, to participate in five- to six-minute conversations on GSM cellular phones under varied environmental conditions. The speech data was collected for research, development, and evaluation of automatic speech-to-text systems and allows rigorous comparisons among different ASR techniques. Development is done on the Hub5'00 set, while testing is done on the rt03 set. Performance is reported separately on the Switchboard (SWB) and Fisher (FSH) portions of the test set.

Setup of the GMM-HMM, DNN-HMM and CNN-HMM

The HMM uses 8,260 quinphone states and is combined with different acoustic models (GMM, DNN, CNN). The GMM consists of 372K Gaussians, uses feature-space maximum likelihood linear regression (fMLLR) for more general speaker adaptation and vocal tract length normalization (VTLN) for male-female normalization, and is trained with expectation maximization (EM). The pre-trained hybrid DNN system uses the same fMLLR features and 8,260 states as the GMM. It takes an 11-frame context around the current frame and uses six hidden layers (2,048 sigmoidal units per layer). The DNN hybrid system is pre-trained, followed by cross-entropy training and sequence training. The CNN system is trained with 40-dimensional VTLN-warped mel filter-bank features. The CNN has two convolutional layers with 424 units, four fully connected layers with 2,048 units and a softmax layer with 512 output units. The number of parameters of the CNN matches that of the DNN. No pre-training is performed, only cross-entropy and sequence training. After 40-dimensional features are extracted with principal component analysis (PCA), maximum likelihood GMM training is done, followed by discriminative training [3].
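
The 11-frame context mentioned above means that each network input is the current feature frame concatenated with its five left and five right neighbours. A minimal sketch of this splicing step in plain Python follows. The function name, the edge handling (repeating the first and last frame) and the toy 2-dimensional features are assumptions made for illustration and are not taken from [3].

```python
from typing import List, Sequence


def splice_frames(frames: Sequence[Sequence[float]], context: int = 11) -> List[List[float]]:
    """Stack each frame with its neighbours into one input vector.

    `context` is the total (odd) window length, e.g. 11 = 5 left + current + 5 right.
    Edges are padded by repeating the first/last frame (an assumption).
    """
    if context < 1 or context % 2 == 0:
        raise ValueError("context must be a positive odd number")
    if not frames:
        return []
    half = context // 2
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window: List[float] = []
        for offset in range(-half, half + 1):
            # Clamp the index so out-of-range neighbours reuse the edge frame.
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy utterance: 20 frames of 2-dimensional features (hypothetical values).
utterance = [[float(t), float(t) + 0.5] for t in range(20)]
inputs = splice_frames(utterance, context=11)
print(len(inputs), len(inputs[0]))  # number of frames, values per spliced input
```

With real 40-dimensional features, the same call would produce 440 values per frame, which is the size of the network's input layer.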


Table 1 shows the performance of the hybrid CNN-HMM compared to the hybrid DNN-HMM and the GMM-HMM. The hybrid CNN offers a 13-33% relative improvement over the GMM-HMM system and a 4-7% relative improvement over the hybrid DNN-HMM.

Table 1: The performance of the hybrid CNN-HMM compared to the hybrid DNN-HMM and GMM-HMMs [3]

3.2 English Broadcast News Recognition Task

Sainath et al. [3] also applied the hybrid CNN-HMMs to another challenging task, the English Broadcast News Recognition Task. The English Broadcast News Speech Corpora of the Linguistic Data Consortium contain more than 100 hours of broadcasts from the ABC, CNN and CSPAN television networks and the NPR and PRI radio networks, with corresponding transcripts. The primary motivation for this collection is to provide training data for the DARPA "HUB4" Project on continuous speech recognition in the broadcast domain. The acoustic models are trained on 50 hours of data and evaluated on the EARS 2004 development set. Testing is done on the DARPA EARS rt04 evaluation set [3].

Setup of the GMM-HMM, DNN-HMM and CNN-HMM

The raw acoustic features are 19-dimensional perceptual linear predictive (PLP) features with speaker-based mean, variance, and vocal tract length normalization (VTLN), followed by a linear discriminant analysis (LDA) and then fMLLR. The GMMs are feature-space and model-space discriminatively trained using the boosted maximum mutual information (BMMI) criterion. The GMM-HMM uses 5,999 quinphone states and 150K diagonal-covariance Gaussians. The generatively pre-trained DNN hybrid system has five hidden layers with 1,024 sigmoidal units and one output layer with 512 units. It uses the same fMLLR features and 5,999 quinphone states as the GMM system, but a 9-frame context around the current frame. The DNN training begins with greedy layerwise pretraining, followed by cross-entropy training and then sequence training. The number of parameters of the CNN matches that of the DNN. No pre-training is performed, only cross-entropy and sequence training. After 40-dimensional features are extracted with PCA, maximum likelihood GMM training is applied, followed by discriminative training using the BMMI criterion [3].


Table 2 shows the performance of the hybrid CNN-HMM compared to the hybrid DNN-HMM and the GMM-HMM. The hybrid CNN offers a 13-18% relative improvement over the GMM-HMM system and a 10-12% relative improvement over the DNN-HMM [3].

Table 2: WER on the Broadcast News Recognition Task [3]

3.3 Bing Voice Search Recognition Task

The Switchboard and the Broadcast News Recognition Tasks were based on publicly available test sets. Microsoft researchers developed a data set containing 18 hours of Bing mobile voice search data to test the new techniques on their own usage scenarios. The data set has a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruptions, and mobile phone differences [4].

Setup of the DNN-HMM and CNN-HMM

Initially, a conventional state-tied triphone HMM was built. The HMM state labels were used as the targets in training the DNNs and CNNs, which both followed the standard recipe. Abdel-Hamid et al. [4] investigated the effects of pretraining using a restricted Boltzmann machine (RBM) for the fully connected layers and a convolutional RBM (CRBM), as described in section 4-B of [4], for the convolution and pooling layers. The DNN's first hidden layer had 2,000 units, to match the increased number of units in the CNN. The other two hidden layers had 1,000 units each. The CNN layer used limited weight sharing, in which weights are shared only among the convolution nodes that are pooled together (a section). Each section had 84 feature maps, a filter size of 8, a pooling size of 6, and a shift size of 2. The context window had 11 frames. Frame energy features were not used in these experiments [4].
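
To make the roles of filter size, pooling size and shift size concrete, the following Python sketch applies ordinary 1-D max-pooling along the frequency axis. It is a simplified illustration only: it does not implement the limited weight sharing scheme of [4], the number of filter-bank bands (40) is an assumption that is not stated above, and the activations are stand-in values rather than outputs of learned filters.

```python
from typing import List, Sequence


def conv_output_length(num_bands: int, filter_size: int) -> int:
    """Number of positions a filter can take along the frequency axis (no padding)."""
    return num_bands - filter_size + 1


def max_pool_1d(activations: Sequence[float], pool_size: int, shift: int) -> List[float]:
    """Max-pool a 1-D sequence using windows of `pool_size` moved by `shift`."""
    return [
        max(activations[start:start + pool_size])
        for start in range(0, len(activations) - pool_size + 1, shift)
    ]


NUM_BANDS = 40    # assumption: number of filter-bank bands
FILTER_SIZE = 8   # from the setup above
POOL_SIZE = 6     # from the setup above
SHIFT = 2         # from the setup above

conv_len = conv_output_length(NUM_BANDS, FILTER_SIZE)
# Stand-in activations for one feature map (hypothetical values).
fake_activations = [float((7 * i) % 11) for i in range(conv_len)]
pooled = max_pool_1d(fake_activations, POOL_SIZE, SHIFT)
print(conv_len, len(pooled))
```

Because the shift (2) is smaller than the pooling size (6), neighbouring pooling windows overlap, so each pooled unit summarizes a slightly shifted range of frequency bands.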


Table 3 shows that the CNN improves word error rate (WER) performance over the DNN regardless of whether pretraining is used. The CNN provides about 8% relative error reduction over the DNN in the voice search task without pretraining. With pretraining, the relative word error rate reduction is about 6%. The results show that pretraining the CNN can improve its performance, but the effect of pretraining for the CNN is not as strong as that for the DNN [4].

Table 3: WER with and without pretraining (PT) on the Bing Voice Search Task [4]

3.4 Google Voice Input Recognition Task

Google Voice Input transcribes voice search queries, short messages, e-mails, and user actions from mobile devices. This is a large vocabulary task that uses a language model designed for a mixture of search queries and dictation. Jaitly et al. [5] used approximately 5,870 hours of aligned training data to evaluate the switch from Google's conventional GMM-HMM model to a DNN-HMM.

Setup of the GMM-HMM and DNN-HMM

Google's model for this task, which was built from a very large corpus, is a speaker-independent GMM-HMM composed of context-dependent triphone HMMs with 7,969 senone states. This model uses PLP features that have been transformed by LDA as acoustic input. Semi-tied covariances (STCs) [6] are used in the GMMs to model the LDA-transformed features, and BMMI was used to train the model discriminatively. The DNN had four hidden layers with 2,560 fully connected units per layer and a final "softmax" layer with 7,969 alternative states. The input was 11 contiguous frames of 40 log filter-bank features with no temporal derivatives. Each DNN layer was pretrained for one epoch as a restricted Boltzmann machine (RBM), and the resulting DNN was discriminatively fine-tuned for one epoch. After pretraining, discriminative fine-tuning of the neural network was performed using maximum mutual information (MMI) [2].
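
The layer sizes given above are enough to estimate the size of this DNN. The sketch below counts weights and biases for a dense feed-forward network with an input of 11 x 40 = 440 values, four hidden layers of 2,560 units and 7,969 outputs. It assumes fully dense layers with one bias per unit, which is an assumption of this illustration rather than a detail reported in [2] or [5]; under that assumption the total is roughly 41 million parameters.

```python
from typing import List


def count_parameters(layer_sizes: List[int]) -> int:
    """Count weights and biases of a fully connected network.

    `layer_sizes` lists the input size, each hidden layer size, and the output size.
    """
    return sum(
        fan_in * fan_out + fan_out  # weight matrix plus one bias per output unit
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )


input_dim = 11 * 40  # 11 frames of 40 log filter-bank features
layers = [input_dim] + [2560] * 4 + [7969]
total = count_parameters(layers)
print(f"{total:,} parameters")
```

About half of these parameters sit in the final layer, because the softmax has to cover all 7,969 senone states.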


On a test set of anonymized utterances from the live Voice Input system, the hybrid DNN-HMM system achieved a WER of 12.3%, a 23% relative reduction compared to the best GMM-based system for this task [2].
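
The relative figures quoted in this article follow the usual definition (baseline WER minus new WER, divided by baseline WER). As a small worked example, the Python sketch below applies that definition to the two numbers above and back-calculates the GMM-HMM baseline they imply, which comes out at roughly 16%. The function names are illustrative, and since both inputs are rounded, the result is only approximate.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of `new_wer` with respect to `baseline_wer`."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, reduction: float) -> float:
    """Baseline WER implied by a new WER and a relative reduction."""
    return new_wer / (1.0 - reduction)


dnn_wer = 12.3    # percent, from the text above
reduction = 0.23  # 23% relative reduction, from the text above
gmm_wer = implied_baseline(dnn_wer, reduction)
print(f"implied GMM-HMM WER: {gmm_wer:.1f}%")
```

Note that a 23% relative reduction corresponds to an absolute reduction of fewer than four percentage points here, which is why relative and absolute improvements should not be mixed when comparing systems.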

4. Main results

Table 4 summarizes the results of the large vocabulary benchmark tasks described above. It shows that DNN-HMMs consistently outperform GMM-HMMs that are trained on the same amount of data. Moreover, CNNs with the same number of parameters outperform DNNs.

Table 4: Results of the large vocabulary benchmark tasks



[1] L. Deng, D. Yu, "Deep Learning: Methods and Applications", Foundations and Trends in Signal Processing, vol. 7, issues 3-4, p. 249, June 2014

[2] G. Hinton, L. Deng, D. Yu, G. Dahl, et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups", IEEE Signal Processing Magazine, vol. 29, issue 6, pp. 82-97, Nov. 2012

[3] T. Sainath, A. Mohamed, B. Kingsbury, B. Ramabhadran, "Deep convolutional neural networks for LVCSR", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614-8618, May 2013

[4] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, "Convolutional neural networks for speech recognition", IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, issue 10, pp. 1533-1545, Oct. 2014

[5] N. Jaitly, P. Nguyen, A. Senior, V. Vanhoucke, "An application of pretrained deep neural networks to large vocabulary speech recognition", Proceedings of Interspeech 2012, 2012

[6] M. Gales, "Semi-tied covariance matrices for hidden Markov models", IEEE Transactions on Speech and Audio Processing, vol. 7, issue 3, pp. 272-281, May 1999