1. General concept

The previous page describes the concept of using Hidden Markov Models (HMMs) for speech recognition. One important part of the model is the state output distribution, which is a statistical description of the acoustic feature vector. This article explains how Gaussian Mixture Models (GMMs) can be applied for that purpose. A Gaussian Mixture Model is a highly flexible statistical distribution that can model almost any set of data. A general introduction can be found in the basics section. In practice, the distribution of the acoustic feature vector obtained by feature extraction is often multimodal. This can be caused by speaker, accent and gender differences. GMMs are therefore well suited to represent the feature vector and are applied in many Automatic Speech Recognition (ASR) systems. In order to implement such a system, it is necessary to learn the model parameters. For GMMs this can be done with a technique called Expectation Maximization (EM).
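The parameter learning step mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of Expectation Maximization for a diagonal-covariance GMM, not the training recipe of any particular ASR toolkit. The function name `em_diag_gmm`, the initialisation scheme, and the variance floor of `1e-6` are assumptions made for this example.

```python
import numpy as np

def em_diag_gmm(X, M, n_iter=50, seed=0):
    """Fit an M-component diagonal-covariance GMM to X (shape N x D) with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialisation (an arbitrary choice for this sketch): random frames as
    # means, the global variance for every component, uniform weights.
    means = X[rng.choice(N, size=M, replace=False)].copy()
    variances = np.tile(X.var(axis=0), (M, 1))
    weights = np.full(M, 1.0 / M)

    for _ in range(n_iter):
        # E-step: responsibilities of each component for each frame,
        # computed in the log domain for numerical stability.
        diff = X[:, None, :] - means[None, :, :]                  # (N, M, D)
        log_gauss = -0.5 * (np.sum(diff**2 / variances, axis=2)
                            + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_joint = np.log(weights) + log_gauss                   # (N, M)
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)

        # M-step: re-estimate weights, means and variances from the
        # responsibility-weighted statistics.
        Nm = resp.sum(axis=0) + 1e-10                             # (M,)
        weights = Nm / Nm.sum()
        means = (resp.T @ X) / Nm[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = np.einsum("nm,nmd->md", resp, diff**2) / Nm[:, None] + 1e-6
    return weights, means, variances

# Toy bimodal data standing in for multimodal acoustic features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5.0, 1.0, size=(200, 2)),
               rng.normal(+5.0, 1.0, size=(200, 2))])
weights, means, variances = em_diag_gmm(X, M=2)
```

In a real system the initialisation usually comes from clustering or from splitting the components of a smaller model, and training is stopped based on the change in likelihood rather than a fixed iteration count.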

2. Complexity Analysis

The feature vector dimension in ASR systems is typically around 40. If we assume, for instance, 12 mel-frequency cepstral (MFC) coefficients plus the signal energy, the static feature vector has 13 dimensions. Taking time dependency into account by appending delta and delta-delta parameters, we get a dynamic feature vector with $D=39$ dimensions. For a given number of mixture components $M$ we can now calculate the parameter count of the covariance matrices.

• Full covariance matrix: $MD^2$ parameters
For $M=10$ one would already get more than 15,000 parameters per state, certainly too many for a practical system that should operate in real time.

• Diagonal covariance matrix: $MD$ parameters
If we take into account that the components of the feature vector are, at least ideally, uncorrelated, then a diagonal covariance matrix is sufficient to describe the GMM. With this approach the number of mixture components is typically on the order of 10-20. This approach was used, for instance, by "The 1998 HTK system for transcription of conversational telephone speech" developed at the University of Cambridge [2], which applied 16-component GMMs.

• Tied (same) diagonal covariance matrix for all components: $D$ parameters
In order to further reduce the number of parameters, one can also use the same covariance matrix for all components. However, this requires increasing the number of components (typically >100) to get a good representation of the feature vector. Since this model spans only a subspace of the total parameter space, it is called a subspace GMM.
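The three counts above can be checked with a short calculation. The helper name `covariance_parameter_count` and the `kind` labels are chosen for this example only. The formulas are the ones given in the list; they count every matrix entry, so exploiting the symmetry of a full covariance matrix would reduce its figure to $MD(D+1)/2$.

```python
def covariance_parameter_count(M, D, kind):
    """Covariance parameters per state for M components of dimension D."""
    if kind == "full":
        return M * D * D          # one D x D matrix per component
    if kind == "diagonal":
        return M * D              # one variance per dimension and component
    if kind == "tied_diagonal":
        return D                  # a single diagonal shared by all components
    raise ValueError(f"unknown covariance kind: {kind}")

# 13 static features plus delta and delta-delta parameters.
D = 3 * 13
M = 10

full_count = covariance_parameter_count(M, D, "full")
diag_count = covariance_parameter_count(M, D, "diagonal")
tied_count = covariance_parameter_count(M, D, "tied_diagonal")
print(full_count, diag_count, tied_count)  # 15210 390 39
```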

3. Subspace GMM

The subspace approach has become more popular in recent years. We therefore analyse a very basic form of a subspace GMM. The conditional distribution of the feature vector given the state can be written as

$p(x|s) = \sum\limits_{m=1}^{M} c_{sm}\, \mathcal{N}(x; \mu_{sm}, C_m)$.

The mean is derived from the state specific vector $v_s$

$\mu_{sm} = M_m v_s$

The same applies for the mixture weights

$c_{sm} = \frac{\exp(c_m^T v_s)}{\sum\limits_{m'=1}^{M} \exp(c_{m'}^T v_s)}$

$M_m$ and $c_m$ are globally shared parameters. $x \in \Re^D$ is the feature vector and $v_s \in \Re^S$ a state-specific vector. The difference between this subspace GMM and a usual GMM is that the parameters of the GMM are not the parameters of the overall model. Instead, there exists a vector $v_s$ for each state, and the means and weights of the GMM are defined by a globally shared mapping from $v_s$. Counting the parameters describing the GMM, namely means and weights, yields $M(D+1)$ parameters per state. The dimension $S$ of $v_s$ is typically much smaller. This is exactly why the model is called a "subspace GMM".
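The mapping from the state vector to the GMM parameters can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the implementation from [3]: the function names, the random values for $M_m$, $c_m$ and $v_s$, the sizes $S=5$ and $M=100$, and the use of diagonal shared covariances $C_m$ are all choices made for this example.

```python
import numpy as np

def sgmm_state_params(v_s, M_proj, c_proj):
    """Derive the means and weights of one state from its vector v_s.

    v_s:    (S,)       state-specific vector
    M_proj: (M, D, S)  globally shared projection matrices M_m
    c_proj: (M, S)     globally shared weight vectors c_m
    """
    means = M_proj @ v_s                      # mu_sm = M_m v_s, shape (M, D)
    logits = c_proj @ v_s                     # c_m^T v_s, shape (M,)
    logits = logits - logits.max()            # stabilise the softmax
    weights = np.exp(logits) / np.exp(logits).sum()
    return means, weights

def sgmm_log_likelihood(x, means, weights, variances):
    """log p(x|s), assuming diagonal shared covariances (variances: (M, D))."""
    diff = x - means
    log_gauss = -0.5 * (np.sum(diff**2 / variances, axis=1)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    return np.logaddexp.reduce(np.log(weights) + log_gauss)

# Hypothetical sizes: S is much smaller than M * (D + 1).
S, D, M = 5, 39, 100
rng = np.random.default_rng(0)
M_proj = rng.normal(size=(M, D, S))
c_proj = rng.normal(size=(M, S))
variances = np.ones((M, D))
v_s = rng.normal(size=S)

means, weights = sgmm_state_params(v_s, M_proj, c_proj)
log_p = sgmm_log_likelihood(rng.normal(size=D), means, weights, variances)
gmm_params_per_state = M * (D + 1)            # means and weights of the GMM
```

With these sizes, each state stores only the 5 entries of `v_s`, while the derived GMM it expands to has $M(D+1) = 4000$ mean and weight parameters.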

The model given here can be extended. For instance, speaker adaptation can be included in the model by adding a speaker vector to the means. To avoid excessive speaker-dependent computation, the weights can be left unchanged. A subspace GMM system typically has 2-4 times fewer parameters than a GMM system. Nevertheless, it outperforms a standard GMM system [3].

References

[1] Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends in Signal Processing 1.3 (2008): 195-304.

[2] Hain, Thomas, et al. "The 1998 HTK system for transcription of conversational telephone speech." Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 1. IEEE, 1999.

[3] Povey, Daniel, et al. "Subspace Gaussian mixture models for speech recognition." Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010.