1. General concept

1.1 Introduction

The general framework for the acoustic model applied by virtually all present-day Automatic Speech Recognition (ASR) systems is the Hidden Markov Model (HMM), which is well suited to modelling the temporal structure of speech. HMMs are a well-known model and are applied in many fields. Readers not familiar with the general concept of an HMM can find a short introduction in the Basics section.

This article is structured as follows: first, we explain how speech can be modelled with an HMM. In the Implementation section, the model is then derived mathematically; the derivation leads to three key problems that need to be solved. In the last section, we discuss drawbacks of this model and present possible extensions.

1.2 Probabilistic model

If we want to represent speech by a Hidden Markov Model, one of the first questions that arises is: what is a state in this context? Intuitively, one could let every state represent one word. The English language, however, consists of more than 250,000 words, which would be far too many states for a feasible system. Hence, one divides words into a set of basic sounds and ends up with a finite set of symbols. English has roughly 40–45 such basic sound units, which are called phones. A word is thus split into phones using a pronunciation dictionary, and these phones can then be associated with an HMM; here it is assumed that each phone is represented by an HMM with three states. The beauty of this approach is that the HMMs of several phones can simply be concatenated. In this way an utterance model is constructed from a sequence of word models, where each word is formed out of subwords (phones) that are described by HMMs.

Hierarchical modelling of speech [1]

The following example shows how the word "read" is modeled by an HMM. Each phone is represented by three states. The phone models are then concatenated to form a word. Another advantage of this approach is that different pronunciation variants can be included easily by adding a parallel path.

Concatenation of phonemes, branching for different pronunciations [1]
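The hierarchical construction described above can be sketched in a few lines of Python. The dictionary entries and state naming below are illustrative, not from the article; real systems use large pronunciation lexica such as CMUdict.

```python
# Toy pronunciation dictionary (hypothetical entries in ARPAbet-style phones).
PRONUNCIATION_DICT = {
    "read": ["r", "iy", "d"],
    "the":  ["dh", "ah"],
}

def word_to_states(word, states_per_phone=3):
    """Expand a word into its flat sequence of HMM state labels,
    assuming each phone is modelled by a three-state HMM."""
    states = []
    for phone in PRONUNCIATION_DICT[word]:
        for i in range(1, states_per_phone + 1):
            states.append(f"{phone}_{i}")
    return states

def utterance_to_states(words):
    """Concatenate word models into one utterance-level state sequence."""
    return [s for w in words for s in word_to_states(w)]

print(utterance_to_states(["the", "read"]))
```

The point of the sketch is only the concatenation: an utterance model is nothing more than the phone HMMs of its words laid end to end.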

So far we have broken an utterance down from a sequence of words to a Hidden Markov Model. The HMM can be represented as a finite state automaton. The sequence of states 2–3 corresponds to a distinct phone. States I and E are the initial and exit states. The state transitions are described by transition probabilities $p_{i|i-1}$. The emission of each state (phone) is the acoustic feature vector $x$ generated by that phone. The emission probabilities of a state are described by its output distribution $p_{x|s_i}$.

Transition and emission probabilities [3]
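As a toy illustration of this automaton, a single three-state phone HMM with non-emitting initial and exit states can be written down as a transition matrix. The numbers below are made up; only the left-to-right structure with self-loops matters.

```python
import numpy as np

# Left-to-right phone HMM: non-emitting initial (I) and exit (E) states,
# three emitting states s1-s3, each with a self-loop and a forward transition.
states = ["I", "s1", "s2", "s3", "E"]
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # I  -> s1
    [0.0, 0.6, 0.4, 0.0, 0.0],   # s1 -> s1 (self-loop) or s2
    [0.0, 0.0, 0.7, 0.3, 0.0],   # s2 -> s2 or s3
    [0.0, 0.0, 0.0, 0.5, 0.5],   # s3 -> s3 or E
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E  (absorbing; next model is chained here)
])

# Every row must be a probability distribution over successor states.
assert np.allclose(A.sum(axis=1), 1.0)
```

The self-loops are what let a three-state model stretch over a phone of arbitrary duration; concatenating phone models amounts to wiring one model's E state to the next model's entry.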

In order to completely parametrise an HMM based acoustic model, all the transition and output distributions have to be estimated from training data. The next section focuses on a mathematical derivation of the model as well as its practical implementation.

2. Statistical Model

2.1 Structure

The underlying structure of the Hidden Markov model can be visualized as follows.

Hidden Markov model, adapted from [3]

The transition probabilities between hidden states $q_i$ and $q_j$ are given as $\{a\}_{i,j} := P(q_j | q_i)$; the conditional emission densities are given by $b_i(x_t) := p(x_t | q_{i,t})$. For stationary discrete emissions $\{x_j\}$, the output distribution can be represented as a matrix $\{b\}_{i,j}$; for continuous output (e.g. Gaussian emissions), we understand $k$ as a gating variable that selects which of the $K$ component distributions to use.

2.2 Essential probabilities

The key to success is, as introduced before, hierarchical modelling. Following the divide-and-conquer principle, we first split each word $w$ into a sequence of phones, which we call $Q$.

$Q = \mathbf{q}^{(w)}_{1:K} = q_1, \dots, q_K$

Since a phone sequence is just a concatenation of its constituent phones, we can express the likelihood of the phone sequence $Q$ given a word sequence $w = w_1, \dots, w_L$ as

$P(Q|w) = \prod_{l=1}^{L} P(q^{(w_l)}_{1:K_l} | w_l)$.

At this point we can take into account that different pronunciations exist for the same word. Different pronunciations lead to parallel paths in the HMM state sequence, so we simply sum over the possible phone sequences $Q$ for a given word.

$P(X|w) = \sum_{Q} p(X|Q)\, P(Q|w)$
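Numerically, this sum over pronunciations is trivial once the per-variant quantities are available. In the sketch below, both the pronunciation variants and all probability values are hypothetical; in practice $p(X|Q)$ comes from the HMM evaluation described next, and $P(Q|w)$ from the pronunciation dictionary.

```python
# P(X|w) = sum_Q p(X|Q) * P(Q|w), summed over pronunciation variants of "read".
# Each entry maps a variant to (P(Q|w), p(X|Q)); all numbers are toy values.
pronunciations = {
    "r iy d": (0.7, 1e-5),   # "reed" pronunciation
    "r eh d": (0.3, 4e-6),   # "red" pronunciation
}

p_x_given_w = sum(pq * px for pq, px in pronunciations.values())
print(p_x_given_w)
```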

Taking into account the structure of the HMM, we can express the joint likelihood of a certain sequence of observations $X$ and a hidden state trajectory (path) $\theta = \theta_0, \dots, \theta_{T+1}$ as

$P(\theta, X|Q) = a_{\theta_0\theta_1} \prod_{t=1}^{T} b_{\theta_t}(x_t)\, a_{\theta_t\theta_{t+1}}$.
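For one concrete path, this joint likelihood is just a product of factors and can be evaluated directly. The parameters below are toy values for a two-state model with discrete emissions, with separate entry and exit probabilities playing the roles of $a_{\theta_0\theta_1}$ and $a_{\theta_T\theta_{T+1}}$.

```python
import numpy as np

a0 = np.array([1.0, 0.0])            # entry probabilities from the initial state
aE = np.array([0.0, 0.5])            # exit probabilities into the exit state
A  = np.array([[0.6, 0.4],
               [0.0, 0.5]])          # transitions a_{ij} between emitting states
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # discrete emissions b_i(x), 2 symbols

def path_likelihood(path, obs):
    """P(theta, X | Q) = a_{theta0 theta1} * prod_t b_{theta_t}(x_t) * a_{theta_t theta_{t+1}}."""
    p = a0[path[0]]
    for t, (s, x) in enumerate(zip(path, obs)):
        p *= B[s, x]                                      # emission factor
        p *= A[s, path[t + 1]] if t + 1 < len(path) else aE[s]  # transition factor
    return p

# Path s1 -> s2 emitting symbols 0 then 1:
# 1.0 * 0.9 * 0.4 * 0.8 * 0.5 = 0.144
print(path_likelihood([0, 1], [0, 1]))
```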

If we take the sum over all possible paths, we find the marginal likelihood of the evidence.

$P(X|Q) = \sum_{\theta} P(\theta, X|Q)$
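This sum has exponentially many terms, but it can be computed efficiently by the forward recursion. The sketch below verifies, on a toy two-state model with made-up parameters, that brute-force enumeration of all paths and the forward algorithm give the same $P(X|Q)$.

```python
import itertools
import numpy as np

a0  = np.array([0.8, 0.2])                 # initial state probabilities
A   = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions a_{ij}
B   = np.array([[0.9, 0.1], [0.2, 0.8]])   # discrete emissions b_i(x)
obs = [0, 1, 1]

# Brute force: enumerate all N^T hidden paths and sum their joint likelihoods.
brute = 0.0
for path in itertools.product(range(2), repeat=len(obs)):
    p = a0[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    brute += p

# Forward recursion: alpha_t(j) = b_j(x_t) * sum_i alpha_{t-1}(i) * a_{ij}.
alpha = a0 * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]
forward = alpha.sum()

assert np.isclose(brute, forward)
print(forward)
```

The brute-force loop costs $O(N^T)$, the recursion only $O(N^2 T)$, which is why the forward algorithm is the practical answer to the first of the three problems below.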

Having established the statistical framework we can summarize our tasks as follows:

Given an HMM, there are three key problems [2] we have to solve.

• Filtering, that is, computation of the (log-)likelihood of the evidence $P(X | \lambda)$ (see Forward-Backward Algorithm).
• Decoding, that is, determining the maximum a posteriori marginals (MPMs) $\theta_i^* = \underset{q_{i,j}}{\operatorname{argmax}}\; P(q_{i,j} | X, \lambda)$ (also given by the Forward-Backward Algorithm), or the single most likely hidden state trajectory $\theta^*$ (see Viterbi Algorithm).
• Training, that is, adjusting the model parameters to the values $\lambda^* = \left(\{a\}_{i,j}^{*}, b_j^{*}\right)$ that maximize the overall likelihood $P(X, Q | \lambda)$ (see Baum-Welch Algorithm).
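As a minimal illustration of the decoding problem, the Viterbi recursion replaces the sum of the forward algorithm by a maximization and keeps backpointers. The parameters are the same toy two-state model used above, not from the article.

```python
import numpy as np

def viterbi(obs, a0, A, B):
    """Most likely hidden state path (log domain for numerical stability)."""
    T, N = len(obs), len(a0)
    logd = np.log(a0) + np.log(B[:, obs[0]])          # delta_1(i)
    back = np.zeros((T, N), dtype=int)                # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)            # delta_{t-1}(i) + log a_{ij}
        back[t] = scores.argmax(axis=0)               # best predecessor per state
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]                       # best final state
    for t in range(T - 1, 0, -1):                     # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

a0 = np.array([0.8, 0.2])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], a0, A, B))   # -> [0, 0, 1, 1]
```

Training via Baum-Welch has the same $O(N^2 T)$ per-iteration structure but accumulates expected transition and emission counts instead of taking a maximum.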

3. Discussion

3.1 Structural refinement

The phone model that we have introduced is context independent. For real speech, however, the basic sounds differ depending on their context. One example is the words "mood" and "cool": although both contain the same vowel "oo", in practice they sound different because the preceding and following consonants are different. To overcome this issue, one combines three phones into a so-called triphone. This approach enables context-dependent modelling.

Context dependent phone modeling [3, p. 207]

A set of 40 phones would hence produce a set of 40³ = 64,000 triphones, which is certainly a lot. However, many of these combinations do not actually occur, and some sound very similar. One can take advantage of this and map the complete set of triphones to a reduced set: by clustering and tying similar states, the number of parameters is reduced considerably. A detailed explanation of this approach can be found in [3].
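The combinatorial explosion, and the effect of tying, can be made concrete in a few lines. The tying step below is a deliberately crude stand-in (hashing each triphone into a small number of buckets); real systems use phonetic decision trees, as described in [3].

```python
# With ~40 phones, every (left, centre, right) context is a distinct triphone.
phones = [f"p{i}" for i in range(40)]            # placeholder phone inventory
triphones = [(l, c, r) for l in phones for c in phones for r in phones]
print(len(triphones))                            # 40**3 = 64000 logical models

# Crude illustration of tying: map the 64,000 logical triphones onto a much
# smaller set of shared physical models (here: arbitrary hash buckets).
N_TIED = 2000
tied = {t: hash(t) % N_TIED for t in triphones}
print(len(set(tied.values())))                   # at most N_TIED shared models
```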

3.2 Output Distribution

What we have not considered so far is the output distribution of the Hidden Markov model. The acoustic vector $x$ usually has a dimension of around 40, and its distribution is multimodal. Hence, Gaussian Mixture Models (GMMs) are frequently used to describe the output distribution; we would like to point the interested reader to the page "Gaussian Mixture Models for Speech Recognition". Another approach that has emerged in recent years is to model the output by artificial neural networks.
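For completeness, a GMM output density $b_i(x) = \sum_k c_k\, \mathcal{N}(x;\, \mu_k, \Sigma_k)$ for one HMM state can be sketched as follows, using diagonal covariances as is common for acoustic features. All parameter values are illustrative.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM:
    log b(x) = log sum_k c_k * N(x; mu_k, diag(var_k))."""
    x = np.asarray(x, dtype=float)
    comp = []
    for c, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # Log of a diagonal Gaussian: sum over feature dimensions.
        ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        comp.append(np.log(c) + ll)
    return np.logaddexp.reduce(comp)   # stable log-sum-exp over components

# Toy 2-component mixture over 2-dimensional features.
print(gmm_loglik([0.0, 0.5],
                 weights=[0.6, 0.4],
                 means=[[-1.0, 0.0], [1.0, 2.0]],
                 variances=[[1.0, 1.0], [0.5, 0.5]]))
```

In an ASR system, one such mixture is estimated per tied HMM state, and `gmm_loglik` plays the role of $\log b_i(x_t)$ inside the forward, Viterbi, and Baum-Welch recursions.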

References

[1] A. Clark, C. Fox, and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing. John Wiley & Sons, 2010.

[2] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[3] S. Young, "HMMs and Related Speech Recognition Technologies," in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang (eds.), chapter 27, pp. 539–557, 2008.