In acoustic modelling, artificial neural networks can be used as an alternative to Hidden Markov Models for phoneme recognition. A pre-processed feature vector is fed into the input layer of a neural network. The goal is to map the observed phones to the correct phonemes, which can then be processed further by the language model.

The dynamic nature of speech complicates the use of artificial neural networks for phoneme recognition. Traditional neural networks require the phones to be precisely aligned in time in order to classify them reliably, which is usually not the case in human speech production. In 1989, Waibel et al. proposed time delay neural networks to overcome this problem [2]. They used normalized mel-scale spectral coefficients as input and considered a number of neighboring frames at each time step. Each layer of their four-layer architecture condenses a wider window of frames than the layer below it, so that short features are formed at the lower layers and longer, more complex features at the higher layers. While the results were promising, the approach was computationally rather expensive for its time, and Hidden Markov Models therefore remained the focus of research.

Figure 1: Time Delay Neural Network as proposed by Waibel et al. [2]
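
The time-delay idea described above can be illustrated with a minimal NumPy sketch. It is not the implementation of Waibel et al.: the function name `time_delay_layer`, the layer sizes, the sigmoid activation, the random untrained weights and the mean pooling over time are all assumptions chosen for illustration. The sketch only shows how each layer looks at a short window of neighboring frames and how stacking layers widens the span of input frames that a single unit covers.

```python
import numpy as np


def time_delay_layer(frames, weights, bias):
    """One time-delay layer: each output frame sees a short window of input frames.

    frames:  (T, d_in)             sequence of T feature frames
    weights: (delays, d_in, d_out) one weight matrix per delay step, shared over time
    bias:    (d_out,)
    returns: (T - delays + 1, d_out)
    """
    delays = weights.shape[0]
    n_out = frames.shape[0] - delays + 1
    out = np.empty((n_out, weights.shape[2]))
    for t in range(n_out):
        window = frames[t:t + delays]  # (delays, d_in)
        # Sum over delay steps and input features, using the same weights at every t.
        out[t] = np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias
    return 1.0 / (1.0 + np.exp(-out))  # sigmoid activation (assumption)


rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 15 frames of 16 spectral coefficients.
frames = rng.standard_normal((15, 16))

# Random, untrained weights; a real network would learn these from data.
w1, b1 = rng.standard_normal((3, 16, 8)) * 0.1, np.zeros(8)
w2, b2 = rng.standard_normal((5, 8, 3)) * 0.1, np.zeros(3)

h1 = time_delay_layer(frames, w1, b1)  # each row covers 3 input frames
h2 = time_delay_layer(h1, w2, b2)      # each row covers 3 + 5 - 1 = 7 input frames
scores = h2.mean(axis=0)               # pool over time: one score per class

print(h1.shape, h2.shape, scores.shape)
```

Because the same weights are applied at every time step, shifting the input by one frame simply shifts the layer output by one row. This is the property that relaxes the need for exact time alignment of the phones.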


Today, deep learning algorithms are gaining popularity. Three important reasons for this are drastically increased processing power, significantly larger training data sets, and recent advances in machine learning and signal/information processing research. The introduction of pretraining techniques by Hinton in 2006 led to a revival of neural networks for speech recognition. In 2012, the Context Dependent Deep Neural Network outperformed the common GMM-HMM technique for acoustic modeling in many usage scenarios. That was a breakthrough for neural networks and the starting point for further architectural advancements.

An overview of the basic Context Dependent Deep Neural Network architecture and some architectural advancements is given here.


[1] E. Trentin and M. Gori, "Robust Combination of Neural Networks and Hidden Markov Models for Speech Recognition", in IEEE Transactions on Neural Networks, Vol. 14, No. 6, 2003.

[2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang, "Phoneme Recognition Using Time-Delay Neural Networks", in IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 3, 1989.

[3] W. Welf, Hidden Markov Modelle und Künstliche Neuronale Netze, 2004.