Recently, a considerable number of new architectures, such as Deep Stacking Networks and Sum-Product Networks, have been proposed. Problems with better-known types, e.g. Recurrent Neural Networks, have also been approached and solved in different ways.

This article summarises and reviews these as well as the classic approaches to artificial neural networks (ANNs) in acoustic modeling. A few of the most interesting concepts are covered in more depth in separate articles.

The architecture of ANNs for acoustic modeling can be subdivided into three parts: the input interface (i.e. the output of the Feature Extraction pass), the output interface (e.g. a Hidden Markov Model) and the actual architecture of the neural net itself. 

We will start off with a review of the first deep architecture to outperform a comparable architecture based on Gaussian Mixture Models.


Context-Dependent Deep Neural Networks

Until recently, most speech recognition systems used Hidden Markov Models (HMMs) to deal with the temporal variability of speech and Gaussian Mixture Models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represent the acoustic input. GMMs and HMMs co-evolved as a way of doing speech recognition when computers were too slow to explore more computationally intensive approaches such as neural networks. Better training algorithms and faster processors led to renewed interest in applying neural networks to speech recognition. In 2012, a clear breakthrough came with the introduction of the hybrid context-dependent deep neural network hidden Markov model (CD-DNN-HMM), which outperformed hybrid GMM-HMMs in all usage scenarios [1].
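
The role swap between GMM and DNN can be illustrated with a minimal sketch. It assumes the common hybrid convention, which is not spelled out above, that the HMM decoder consumes scaled likelihoods obtained by dividing the DNN's state posteriors by the state priors. All names and numbers are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def scaled_log_likelihoods(logits, state_priors):
    """Turn DNN outputs into HMM emission scores.

    Bayes' rule gives log p(x | s) = log p(s | x) - log p(s) + log p(x);
    the last term is the same for every state and is dropped.
    """
    return log_softmax(logits) - np.log(state_priors)

# Hypothetical toy values: 2 frames, 3 HMM states.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.2, 3.0]])
# Assumed: priors estimated as relative state frequencies in the training alignment.
priors = np.array([0.5, 0.3, 0.2])
scores = scaled_log_likelihoods(logits, priors)
```

The resulting `scores` take the place of the GMM log-likelihoods during decoding; the HMM itself is unchanged.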

The separate article on this topic describes the CD-DNN-HMM architecture in detail. First, the architecture of the CD-DNN-HMM is explained. Subsequently, the advantages of the CD-DNN-HMM over GMM-HMMs are pointed out. In the last section, different experiments confirm the theoretical considerations about the reasons for the strength of the CD-DNN-HMM. Read more ...


Advancements in Input stage: Deep Autoencoders

As mentioned in the Feature Extraction section, Artificial Neural Networks can be used to post-process or enhance previously extracted features that are already encoded as, for instance, MFCCs. This way, the traditional line between feature extraction and acoustic modeling blurs or disappears altogether: there is no reason not to merge two Artificial Neural Nets into a single architecture if one directly follows the other.

This also works in the other direction. By further extending the input layer of the ANN towards the raw data, we can get rid of feature vectors altogether. The input layer is then directly connected to the raw data, which in our case is the audio stream.

Deep Autoencoders

According to the definition in [9, p. 230], an autoencoder is an ANN architecture whose output has (at least) the same dimensionality as its input. A deep autoencoder is therefore an autoencoder with at least two hidden layers of perceptrons. Accordingly, the only difference to ordinary DNNs is the output dimensionality; DNNs are for the most part applied to dimensionality reduction and classification tasks, i.e. they have a narrower output layer.

While this separation may seem like semantics only, it helps to understand the difference in those applications; feature-extracting ANNs can be seen as an isolated unit with the sole purpose of arranging and encoding the input data in a favourable way.

The idea of the autoencoder is that of a neural network trained unsupervised to generate an encoding function for any type of input. In contrast to a DNN (whose purpose is to generalise over many sets of data), the input can be reconstructed from the output to a high degree. The same holds true for feature vectors as well. The advantage of autoencoders (also called autoassociators or Diabolo networks [13]) is that they can be tailored to encoding speech signals. Mel Frequency Cepstral Coefficients, for instance, are based on human perception of speech signals, but not on characteristics of the signal itself.

The learning procedure of autoencoders is similar to that of DNNs: first, the weights are pre-initialised layer by layer by forming several Restricted Boltzmann Machines. The subsequent 'autoencoding' step is a fine-tuning algorithm that sits somewhere between unsupervised and supervised learning; in a way, autoencoders always operate 'supervised', as their very function is to create an accurate representation of the input. Every speech signal is therefore 'labeled' data by definition, with one slight difference: to obtain an output that can be matched against the label, i.e. the original speech signal, the input first has to be reconstructed from its encoding. The reconstruction error is then minimised with back-propagation, a supervised training method; the reconstruction step resembles the one in Contrastive Divergence, as seen in our corresponding article.
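
The fine-tuning step, in which the input doubles as its own label, can be sketched as follows. This is a minimal illustration, not the procedure of [11]: small random initial weights stand in for the RBM-based pre-training, the network has a single tanh hidden layer, and the toy data and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "frames": 200 vectors of dimension 8 lying in a 3-dimensional subspace (assumed data).
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_in, n_code = X.shape[1], 3
# Assumption: random initialisation replaces the layer-wise RBM pre-training.
W1 = rng.normal(scale=0.1, size=(n_in, n_code)); b1 = np.zeros(n_code)
W2 = rng.normal(scale=0.1, size=(n_code, n_in)); b2 = np.zeros(n_in)

def forward(X):
    code = np.tanh(X @ W1 + b1)   # encoder: input -> code
    recon = code @ W2 + b2        # linear decoder: code -> reconstruction
    return code, recon

def mse(X, recon):
    return float(np.mean((recon - X) ** 2))

initial_error = mse(X, forward(X)[1])
lr = 0.05
for _ in range(2000):
    code, recon = forward(X)
    # The training target is the input itself.
    d_recon = 2.0 * (recon - X) / X.size
    dW2 = code.T @ d_recon
    db2 = d_recon.sum(axis=0)
    d_code = (d_recon @ W2.T) * (1.0 - code ** 2)   # back-propagate through tanh
    dW1 = X.T @ d_code
    db1 = d_code.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
final_error = mse(X, forward(X)[1])
```

Comparing `initial_error` with `final_error` shows how much of the input the 3-dimensional code retains after training; no external labels are involved at any point.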

Figure 1: Reconstruction performance of autoencoders. From [11]

At the top of figure 1 you can see the original raw spectrum of a speech signal, which has been encoded by both vector quantization and an autoencoder. While vector quantization may not be used for generating feature vectors in speech recognition, according to [11] a comparison to MFCCs would have been pointless, as this technique smooths the spectrum to a high degree.

The second and third columns show a reconstruction of the signal based upon the vector quantization technique and the trained autoencoder, respectively. Note the much better reconstruction achieved by the autoencoder. The last two columns show the difference between the original and the reconstructed signal; blue signifies a low difference and therefore a good-quality reconstruction, while dark red denotes significant deviation.

Another application of autoencoders is purposeful preprocessing of the input signal; in this case, a perfect reconstruction is no longer the goal. This way, basically any type of desired filtering can be built into the autoencoder [14].

Figure 2: Denoising autoencoder. From [9, pg. 238] 

Figure 2 shows an autoencoder configured as a denoising filter.
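
What distinguishes the denoising setup from plain reconstruction is only the choice of input and target. A minimal sketch, assuming additive Gaussian noise as the corruption (other corruption types are possible) and using hypothetical helper names:

```python
import numpy as np

def make_denoising_pair(clean, noise_std, rng):
    """Return (network input, training target) for a denoising autoencoder."""
    noisy = clean + rng.normal(scale=noise_std, size=clean.shape)
    return noisy, clean

def denoising_loss(reconstruction, clean):
    """The error is measured against the clean signal, not the corrupted input."""
    return float(np.mean((reconstruction - clean) ** 2))

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))[None, :]   # one toy frame
noisy, target = make_denoising_pair(clean, noise_std=0.3, rng=rng)

# A "network" that merely copies its input keeps all the noise ...
loss_identity = denoising_loss(noisy, target)
# ... whereas a perfect denoiser would reach zero.
loss_perfect = denoising_loss(clean, target)
```

Because copying the input is penalised, the network is pushed towards a representation that captures the structure of the signal rather than the noise.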


Architectural advancements

Appropriately trained Deep Neural Networks have proven to be capable of surpassing Gaussian Mixture Models [1] for acoustic modeling. However, there are a few caveats to consider when using them.

For one thing, training them is elaborate and therefore computationally expensive. Furthermore, in their basic feed-forward architecture (see Introduction to Artificial Neural Networks), they possess only limited power for modeling temporal dependencies.

While many of the following ANN architectures have been shown to improve the word error rate (WER) compared to the classic Context-Dependent DNN, in most cases it has yet to be determined how they might be combined to achieve the best overall performance.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have been successful in achieving translation invariance through local filtering and max-pooling in many image processing tasks. It seems intuitive to use CNNs for speech recognition to improve the hybrid CD-DNN-HMM model. In [2], the concepts of CNNs were applied along the frequency axis to normalize speaker variance and thereby achieve higher multi-speaker speech recognition performance.
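
How local filtering and max-pooling along the frequency axis absorb small spectral shifts can be shown with a toy example. The frame, the filter and the pool size below are illustrative assumptions, not values from [2].

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide one shared local filter along the frequency axis (cross-correlation)."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool(x, size):
    """Non-overlapping max-pooling; a trailing remainder is dropped."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

# Hypothetical frame with 12 frequency bands and a single spectral peak.
frame = np.zeros(12)
frame[3] = 1.0
shifted = np.roll(frame, 1)            # same peak, one band higher

kernel = np.array([0.5, 1.0, 0.5])     # illustrative local filter
a = max_pool(conv1d_valid(frame, kernel), size=5)
b = max_pool(conv1d_valid(shifted, kernel), size=5)
```

Both `a` and `b` come out as `[1.0, 0.0]`: the one-band shift, as might be caused by a different speaker, is no longer visible after pooling. Larger shifts that cross a pooling boundary would still change the output.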

The proposed CNN architecture was evaluated in a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method can achieve over 10% relative error reduction on the speaker-independent TIMIT test sets compared with a regular CD-DNN using the same number of hidden layers and weights.

The separate article explains the three extra concepts that CNNs add to the simple fully connected feed-forward NN and how these concepts are applied to speech recognition. Read more ...

Long-Span Temporal Patterns

Modeling temporal dependencies of speech has proven to be a difficult task. Time dynamics in speech are usually addressed by the HMM, but according to Deng [16, p. 4], most of today's state-of-the-art ASR systems only implement an "ultra-weak form of speech dynamics". This is because an HMM is a sequence of discrete states representing senones or phonemes. State transitions can happen only at specific times, depending on the size of the input window(s) taken into account. Precise timings and quantitative changes between those states, however, cannot be observed and are lost.


Figure 3: Building STC features

An easy way of mitigating this problem is to use Long-Span Temporal Patterns. Siniscalchi et al. [5] extended a standard DNN-HMM architecture with a hierarchy of DNNs modeling temporal patterns of spectral energy densities in a window of 310 ms centered around the 10 ms frame currently being processed. This long-span window was divided into two sub-windows to obtain a left context and a right context, as seen above. Each of those contexts was represented by Split Temporal Context (STC) features [4] (which in turn divide the frames into even more sub-frames) and processed by an independent DNN. The outputs of those two nets were then used as inputs of a third DNN, merging the long-span information into a set of posterior probabilities for each frame. The values shown in figure 3 are samples based upon information given in [5].
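
The window arithmetic can be made concrete with a short sketch. It assumes a 10 ms frame shift, so that 310 ms correspond to 31 frames, and it assumes that both halves include the centre frame; the exact split used in [5] may differ. The function name and the toy feature matrix are hypothetical.

```python
import numpy as np

FRAME_SHIFT_MS = 10
WINDOW_MS = 310
N_FRAMES = WINDOW_MS // FRAME_SHIFT_MS   # 31 frames in the long-span window
HALF = N_FRAMES // 2                     # 15 frames on each side of the centre

def split_context(features, t):
    """Return the (left, right) context blocks around frame t.

    Assumption: both halves contain the centre frame, so each holds 16 frames.
    Requires HALF <= t < len(features) - HALF.
    """
    window = features[t - HALF : t + HALF + 1]
    left = window[: HALF + 1]
    right = window[HALF:]
    return left, right

# Toy input: 100 frames with 4 band energies each.
features = np.arange(100 * 4, dtype=float).reshape(100, 4)
left, right = split_context(features, t=50)
```

Each block would then be passed to its own DNN, and the two outputs to the merging network described above.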

The resulting STC feature served as an additional input vector for the standard DNN used to evaluate the current window. The architecture was then compared with the classic approach, which only takes the central window into account, as well as with (slightly) different ways of obtaining the STC feature (e.g. using shallow nets instead of DNNs for the left and right context as well as for the merging upper net). It outperformed its competition by up to 10 % in frame-level phoneme classification, where it reached 88.3 %. Combined with an HMM based on senones instead of phonemes, it also achieved a word error rate (WER) of 5.2 % in the Nov. 92 ARPA Benchmark.


Deep Stacking Networks

One of the oldest obstacles in the history of DNNs is the amount of computational power involved in training them. Deep Convex Networks [15], an earlier name for Deep Stacking Networks, have been designed specifically with scalability of the learning process in mind. While they share architectural similarities with CNNs, their distinguishing property is that the raw input features are concatenated with the output of one sub-net to form the input of the next, higher-level sub-net.
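
The stacking principle can be sketched as follows. This is a simplified illustration under stated assumptions: each module has a fixed random sigmoid hidden layer and a linear output layer fitted by least squares (a convex problem), which is one way to obtain the scalability mentioned above; the data, sizes and function names are hypothetical and not taken from [15].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_module(inputs, targets, n_hidden, rng):
    """One module: random sigmoid hidden layer, linear output fitted by least squares."""
    # Assumed simplification: the lower-layer weights stay fixed at random values.
    W = rng.normal(scale=0.5, size=(inputs.shape[1], n_hidden))
    H = sigmoid(inputs @ W)
    # Fitting the upper-layer weights is a convex least-squares problem.
    U, *_ = np.linalg.lstsq(H, targets, rcond=None)
    return W, U

def module_output(inputs, W, U):
    return sigmoid(inputs @ W) @ U

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # toy raw input features
labels = rng.integers(0, 3, size=300)
T = np.eye(3)[labels]                     # one-hot targets

module_inputs = X
input_widths = []
for _ in range(3):                        # stack three modules
    input_widths.append(module_inputs.shape[1])
    W, U = train_module(module_inputs, T, n_hidden=20, rng=rng)
    Y = module_output(module_inputs, W, U)
    # The next module sees the raw input concatenated with this module's output.
    module_inputs = np.hstack([X, Y])
```

Here `input_widths` is `[10, 13, 13]`: the first module sees only the 10 raw features, every later module additionally sees the 3 outputs of the module below it.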


Sum-Product Networks

Sum-product networks (SPNs) are a new take on generative deep neural architectures. In short, they are directed acyclic graphs (Wikipedia: Directed acyclic graph) whose inner nodes compute sums and products. They can therefore be seen as a combination of mixture models and feed-forward neural networks.

To the author's knowledge, SPNs have yet to be used for acoustic modeling; therefore, only a brief overview of the topic is given here. They have, however, been successfully applied to language modeling tasks, as seen in [6].

The basic structure, as proposed by Poon and Domingos [7], is a graph that alternates between layers of sum nodes and product nodes. An example can be seen in figure 4, which shows an SPN used for the language modeling application mentioned before. Note that, unlike in Poon and Domingos' original proposal, sum and product nodes do not strictly alternate here.

Figure 4: SPN-layout in a language modeling application. From [6]

So while SPNs share similarities with both Gaussian Mixture Models (they can model 'interferences' between layers) and DNNs (a restricted nonlinear feature hierarchy), they manage to remain rather simple.

This is due to constraints that are deliberately imposed on the possible SPN layout:

  • the edges leaving sum nodes must have non-negative weights,
  • the value of a product node is the product of the values of its children,
  • and the value of an SPN is the value of its root (i.e. there is exactly one output node).
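
These rules are enough to evaluate a small SPN. The following sketch uses a hypothetical network over two binary variables; the structure, weights and class names are illustrative and not taken from [6] or [7].

```python
import itertools

class Leaf:
    """Indicator node: 1.0 if variable `var` takes value `val`, else 0.0."""
    def __init__(self, var, val):
        self.var, self.val = var, val
    def value(self, assignment):
        return 1.0 if assignment[self.var] == self.val else 0.0

class Product:
    """The value is the product of the children's values."""
    def __init__(self, children):
        self.children = children
    def value(self, assignment):
        result = 1.0
        for child in self.children:
            result *= child.value(assignment)
        return result

class Sum:
    """The value is the weighted sum of the children's values."""
    def __init__(self, weighted_children):
        assert all(w >= 0 for w, _ in weighted_children), "weights must not be negative"
        self.weighted_children = weighted_children
    def value(self, assignment):
        return sum(w * child.value(assignment) for w, child in self.weighted_children)

def bernoulli(var, p_true):
    """Sum node over the two indicators of one binary variable."""
    return Sum([(p_true, Leaf(var, True)), (1.0 - p_true, Leaf(var, False))])

# Hypothetical SPN: a mixture of two product distributions over "a" and "b".
root = Sum([
    (0.7, Product([bernoulli("a", 0.9), bernoulli("b", 0.2)])),
    (0.3, Product([bernoulli("a", 0.1), bernoulli("b", 0.6)])),
])

# The value of the SPN is the value of its root.
p = root.value({"a": True, "b": False})
total = sum(root.value({"a": a, "b": b})
            for a, b in itertools.product([True, False], repeat=2))
```

For this assignment the root evaluates to 0.7 · 0.9 · 0.8 + 0.3 · 0.1 · 0.4 = 0.516. Because the weights of every sum node add up to one in this example, the values over all four assignments sum to one, so the SPN represents a normalised distribution.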

Furthermore, an SPN can fulfill the properties of completeness, consistency and decomposability, which are desirable when constructing deep architectures, as they simplify the learning procedure. They also allow statements about compositions of SPNs; e.g. a sum of SPNs will still fulfill those properties if they hold true for the summands [7].

The drawback of those constraints, of course, is a restricted ability to model complex features in a given number of layers, compared to DNNs. Initially, SPNs were also held back by a lack of appropriate supervised training algorithms, until subsequent work introduced a discriminative algorithm [8]. This helped to mitigate the problems occurring when using generative models for unsupervised training.

Recurrent Neural Networks / Long Short-Term Memory

Most DNN architectures are based on feed-forward nets to reduce complexity and to retain predictable run-time behaviour. Recent research into training methods and architectural adaptations for acoustic models has demonstrated the viability of recurrent neural networks for large-vocabulary ASR. Unlike feed-forward networks, recurrent neural nets are able to model temporal dependencies. This also affects the output stage of acoustic modeling in that it can make HMMs redundant.
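
The difference in temporal modeling can be demonstrated with a plain (Elman-style) recurrent layer, chosen here only for brevity; LSTM cells add gating on top of the same principle. Weights, sizes and inputs are random toy values.

```python
import numpy as np

def rnn_forward(inputs, W_in, W_rec, b):
    """Simple recurrence: h_t = tanh(x_t W_in + h_{t-1} W_rec + b)."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_in + h @ W_rec + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.5, size=(4, 6))
W_rec = rng.normal(scale=0.5, size=(6, 6))
b = np.zeros(6)

seq_a = rng.normal(size=(5, 4))        # 5 frames with 4 features each
seq_b = seq_a.copy()
seq_b[0] += 1.0                        # change only the first frame

# Recurrent layer: the state at the last frame still depends on the first frame.
last_a = rnn_forward(seq_a, W_in, W_rec, b)[-1]
last_b = rnn_forward(seq_b, W_in, W_rec, b)[-1]

# Feed-forward layer on the current frame only: the first frame has no influence.
ff_a = np.tanh(seq_a[-1] @ W_in + b)
ff_b = np.tanh(seq_b[-1] @ W_in + b)
```

`last_a` and `last_b` differ although only the first frame was changed, whereas `ff_a` and `ff_b` are identical. A feed-forward net can only see context that is explicitly placed inside its fixed input window.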


Advancements in Output stage

First, let us outline the current structure of the output layer of a generic DNN-HMM configuration once more.

As already stated in the article describing the classical approach (see Context-Dependent DNNs as well as HMMs), output representations are usually implemented as posterior probabilities of either

  • phones (non-context-dependent acoustic modeling)
  • triphones (which are all possible pre- and post-phones for a certain phone, therefore  )
  • or senones, which cluster similar triphones to reduce the HMM-complexity.

As a result, context-dependent modeling accounts for a major share of the computation time (up to 1/3) because of the high dimensionality of the output matrix. According to Deng [12], this problem can be mitigated by performing an SVD (singular value decomposition) on the output matrix. Despite the resulting dimensionality reduction, the performance of the system did not suffer.
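
The saving can be illustrated by factoring an output weight matrix into two thin matrices with a truncated SVD. The sizes and the rank below are hypothetical, and a random matrix is used only to show shapes and parameter counts; unlike a trained weight matrix, it is not expected to be well approximated at low rank.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Factor W (hidden x outputs) into A (hidden x rank) @ B (rank x outputs)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]     # absorb the singular values into the left factor
    B = Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
# Hypothetical sizes: 512 hidden units feeding 2000 senone outputs.
W = rng.normal(size=(512, 2000))
A, B = low_rank_factors(W, rank=64)

params_before = W.size             # 512 * 2000 = 1,024,000
params_after = A.size + B.size     # 512 * 64 + 64 * 2000 = 160,768

h = rng.normal(size=(1, 512))      # activations of the last hidden layer
approx_logits = (h @ A) @ B        # two thin products instead of one wide one
```

In network terms, this replaces the wide output layer by a narrow linear bottleneck layer followed by the output layer, reducing both the number of parameters and the number of multiplications per frame.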

However, even context-dependent modeling of speech is, according to Deng [16, pp. 4-8], only an "ultra-weak" form of describing the highly dynamic nature of speech. He divides speech dynamics into four major levels:

  • the linguistic level, which is the usual way of modeling speech, also called the "beads-on-a-string" representation: a logical sequence of phones forming an expression,
  • the physiological level, which is the set of all articulatory muscle movements,
  • the acoustic level, which describes the result of the articulators' movements, i.e. the set of all filters shaping the emerging sound wave,
  • and the auditory and perceptual level, which is the description of 'hearing', ranging from input filtering (ears, head shape etc.) up to experiences of the listener aiding in perception.

All of those have to be taken into consideration to achieve the best possible result in speech recognition. The physiological level, for instance, can be visualized by assigning the articulators' movements to certain words or phones, including variable lengths to model speech dynamics. Figure 5 shows an example spectrum of the word 'strong' and the corresponding (overlapping) activity of the articulators involved.

Figure 5: Overlapping articulatory features. From [12]

In this case, the velum (back palate) provides variable length nasal features. The same holds true for the lip-rounding feature.

Of course, it would be a major effort to provide this sort of feature density for a whole language. It is conceivable, though, that some hybrid form combining several of these levels will be introduced in the future. Well-known, delicate and often misunderstood sequences of phones and words could then be enriched with additional features, e.g. from the physiological level of speech dynamics.


Comparison and Benchmark

We have shown in this article that there is a huge number of variables for customizing the architecture of (pseudo-) Deep Neural Networks. But how do they compare to each other?

Even though there is no general consensus, most publications compare their results to other technologies (like GMMs) and architectures on de facto standard speech recognition benchmarks like TIMIT or Switchboard. See Benchmarks: Comparison of different architectures on TIMIT and large vocabulary tasks for a more in-depth article on this topic.

Insight into how wide and deep architectures can improve the WER can be found in [10]. Unsurprisingly, a mixture of both provided the best results, especially in noisy environments.
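
Since the WER is the figure most of these comparisons rest on, a short sketch of its usual definition may help: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the example sentences are illustrative, and a non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                     # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                     # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In this example one word is substituted and one is deleted, so the WER is 2/6. Note that the WER can exceed 100 % if the hypothesis contains many insertions.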




[1] G. Dahl, D. Yu, L. Deng, A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition", IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 30-42, Jan. 2012

[2] O. Abdel-Hamid, L. Deng, D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition", Proc. Interspeech, 2013

[3] G. E. Hinton, R. S. Zemel, "Autoencoders, Minimum Description Length and Helmholtz Free Energy", Proc. Advances in Neural Information Processing Systems 6 (NIPS), 1993

[4] P. Schwarz, P. Matějka, J. Černocký, "Hierarchical structures of neural networks for phoneme recognition", Proc. Acoustics, Speech and Signal Processing (ICASSP), May 2006

[5] S. M. Siniscalchi, D. Yu, L. Deng, C. Lee, "Speech Recognition Using Long-Span Temporal Patterns in a Deep Network Model", Signal Processing Letters, vol. 20, pp. 201-204, Jan. 2013

[6] W. Cheng, S. Kok, H. V. Pham, H. L. Chieu, K. M. A. Chai, "Language Modeling with Sum-Product Networks", Proc. Interspeech, Sep. 2014

[7] H. Poon, P. Domingos, "Sum-Product Networks: A New Deep Architecture", Proc. Computer Vision Workshops (ICCV), Nov. 2011

[8] R. Gens, P. Domingos, "Discriminative Learning of Sum-Product Networks", Proc. Advances in Neural Information Processing Systems 25 (NIPS), 2012

[9] L. Deng, D. Yu, "Deep Learning: Methods and Applications", Foundations and Trends in Signal Processing, vol. 7, pp. 197-387, Jun. 2014

[10] N. Morgan, "Deep and Wide: Multiple Layers in Automatic Speech Recognition", IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 7-13, Jan. 2012

[11] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, G. Hinton, "Binary Coding of Speech Spectrograms Using a Deep Auto-encoder", Proc. Interspeech, Sep. 2010

[12] L. Deng, "Design and Learning of Output Representations for Speech Recognition", Proc. Advances in Neural Information Processing Systems (NIPS), Dec. 2013

[13] Y. Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, vol. 2, pp. 1-127, Oct. 2009

[14] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. Manzagol, "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion", Journal of Machine Learning Research, vol. 11, pp. 3371-3408, Mar. 2010

[15] G. Tur, L. Deng, D. Hakkani-Tür, X. He, "Towards Deeper Understanding: Deep Convex Networks for Semantic Utterance Classification", Proc. Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012

[16] L. Deng, "Dynamic Speech Models - Theory, Algorithms and Application", Morgan & Claypool, 1st edition, 2006