# 1. Introduction

At this point in our speech recognition system, we assume that the acoustic waveform has already been transformed into a feature vector $X$ by one of the feature extraction techniques mentioned earlier. The task of the decoder is now to find the word sequence $W$ that is most likely to have generated $X$. We can therefore estimate the word sequence $w_1,...,w_n$ by choosing, one word after the other, the word $w$ that maximizes the conditional probability

$\hat{w} = \underset{w}{\operatorname{arg\;max}} \left\{ p(w|X) \right\}$.

From this point on, two fundamentally different approaches exist.

# 2. Discriminative model

In discriminative training (a parameter training method using the discriminative acoustic model), an objective function is defined over a reference word sequence and its competing word sequences (competing hypotheses), and the model parameter $\theta$ is optimized to minimize the recognition error rate. In the objective function, the term $P(X|w;\theta)P(w)$ for the reference word sequence $w$ is divided by the sum $\sum_{w_i'}P(X|w_i';\theta)P(w_i')$ over the competing word sequences $w_1',...,w_n'$. The larger the value of the objective function, the smaller the recognition error. The exact form of the objective function varies between discriminative training methods.
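The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete training procedure: the function names `log_sum_exp` and `log_objective` and the example scores are hypothetical, and each score is assumed to be the joint log score $\log P(X|w;\theta) + \log P(w)$ of one word sequence. Working in the log domain turns the ratio into a difference and avoids numerical underflow. Whether the reference sequence is also counted in the denominator depends on the training method; here only the competitors are summed, as in the description above.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_objective(ref_score, competing_scores):
    """Log of P(X|w)P(w) divided by the sum over competing hypotheses.

    Every score is assumed to be log P(X|w; theta) + log P(w)
    for one word sequence (hypothetical inputs for illustration).
    """
    return ref_score - log_sum_exp(competing_scores)


# Hypothetical joint scores for one reference and two competitors.
reference = math.log(0.6)
competitors = [math.log(0.3), math.log(0.1)]

value = log_objective(reference, competitors)
print(value)
```

Raising the reference score or lowering the competitor scores increases the value, which is the direction in which $\theta$ is adjusted during training.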

# 3. Generative model

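The second and more common approach is to apply Bayes' rule, $p(w|X) = p(X|w)p(w)/p(X)$. Since $p(X)$ does not depend on $w$, it can be dropped from the maximization, and the equation can be rewritten as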

$\hat{w} = \underset{w}{\operatorname{arg\;max}} \left\{ p(X|w)p(w) \right\}$.

Now we have split the problem into two sub-problems: the likelihood of the feature vector given a certain word, $p(X|w)$, which is determined by the generative acoustic model, and the prior probability of each word, $p(w)$, which is determined by the language model.
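As a small illustration of how the two sub-problems combine, the following Python sketch picks the word with the highest product of acoustic likelihood and language model prior. The function name `decode`, the candidate words, and all probability values are made up for this example. Real decoders work on log probabilities, as done here, so that the product becomes a sum.

```python
import math


def decode(acoustic_log_likelihood, language_log_prior):
    """Return the word w maximizing log p(X|w) + log p(w).

    Both arguments are dictionaries mapping candidate words to log
    probabilities (hypothetical stand-ins for the acoustic model and
    the language model).
    """
    candidates = acoustic_log_likelihood.keys() & language_log_prior.keys()
    return max(
        candidates,
        key=lambda w: acoustic_log_likelihood[w] + language_log_prior[w],
    )


# Made-up scores for three acoustically similar words.
acoustic = {"two": math.log(0.40), "too": math.log(0.45), "to": math.log(0.15)}
prior = {"two": math.log(0.30), "too": math.log(0.10), "to": math.log(0.60)}

best = decode(acoustic, prior)
print(best)
```

In this toy example the acoustic model alone would prefer "too", but the prior from the language model shifts the decision to "two", which shows why both terms are needed.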

For the acoustic model, once again, two different frameworks have been developed over the last decades. The older one, which is nevertheless still used in many present-day ASR systems, is the Gaussian Mixture Model (GMM). In recent years, as more computational power became available, acoustic models based on Artificial Neural Networks (ANNs) were introduced.

# 4. Output representations of the Acoustic Model

In most cases, the temporal dynamics of speech are modeled by Hidden Markov Models (HMMs). For an introduction to HMMs that focuses on Automatic Speech Recognition, see the corresponding article Hidden Markov Models for Speech Recognition.

HMMs, however, are not the only or even the best solution for modeling temporal interdependencies in speech, let alone the whole range of speech dynamics. The section Advancements in output stage of the article Comparing Different ANN-Architectures gives a brief introduction to the topic.

After a (set of) word(s) has been classified by the Acoustic model, the Language model takes over, modeling the higher levels of speech.

# References

[1] S. Young, "HMMs and Related Speech Recognition Technologies", in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi and Y. Huang (eds.), chapter 27, pp. 539-557, 2008.