# From the Phoneme Sequence to a Word

In the language model, it is of interest to determine the N most probable words given a phoneme sequence. For this, a Hidden Markov Model is drawn up for each word of the training set, and the word with the greatest occurrence probability is then found by means of the Viterbi algorithm.

# 1 Motivation

Basically, speech recognition can be divided into three steps. In the first step, the spoken sentences are preprocessed in such a way that characteristic feature vectors can be extracted. In the next major step, the acoustic model, a phoneme sequence is computed from the feature vectors. Finally, in the language model, the sentences corresponding to the phoneme sequence are determined. For this, it is required to determine the most probable words given the phoneme sequence. How to do this is the topic of the following article.

# 2 Training Phase

Generally, for the determination of the most probable word, a model has to be drawn up for each word that the speech recognition software should recognize. This is done during a training phase. In this training phase, a dictionary with many thousands of words is used as the training set. For every word $w_i$ of this dictionary, a Hidden Markov Model $\lambda_i$ is drawn up. Usually, a left-to-right Hidden Markov Model with five to seven states is used. A left-to-right Hidden Markov Model is a Hidden Markov Model in which the states are ordered in a line. The only allowed state transitions are remaining in the same state or moving to the next state of the line. An example of such a left-to-right Hidden Markov Model is illustrated in the figure below.
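The left-to-right structure can be encoded directly in the transition matrix: only the diagonal (remaining in the same state) and the first superdiagonal (moving to the next state) are non-zero. A minimal sketch with numpy; the state count and self-loop probability are illustrative values, not from the source:

```python
import numpy as np

def left_to_right_transitions(n_states, p_stay=0.6):
    """Transition matrix of a left-to-right Hidden Markov Model:
    each state may only loop on itself or advance to the next state."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay          # remain in the same state
        A[i, i + 1] = 1 - p_stay  # move to the next state of the line
    A[-1, -1] = 1.0               # the last state is absorbing
    return A

A = left_to_right_transitions(5)  # a five-state model, as described above
```

All other transitions (skipping ahead or moving backwards) have probability zero, which is what gives the model its time-ordered behavior.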

The advantage of such Hidden Markov Models is that they are well suited to modeling time-ordered behavior. Usually, one state is used for each subword of a word. For example, the word "office" can be divided into four subwords: the "O"-sound, the "F"-sound, the "I"-sound, and the "S"-sound. Having drawn up the Hidden Markov Model for each word, the parameters of each Hidden Markov Model are then trained by the usual training methods for Hidden Markov Models.

# 3 Determination of the most probable word

In the previous section, it was described how a Hidden Markov Model $\lambda_i$ is drawn up for each word $w_i$ of the dictionary. Now, these models are used to determine the most probable word for a given phoneme sequence. For this, the occurrence probability under each Hidden Markov Model $\lambda_i$ is determined by means of the Viterbi algorithm, given the phoneme sequence which the acoustic model delivers. The word whose model yields the maximum of these probabilities is the one that was most probably spoken.
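This scoring step can be sketched as follows: for each word model, the Viterbi algorithm computes (in log space, to avoid underflow) the probability of the best state path producing the observed phoneme sequence, and the word with the highest score wins. The function and parameter names below are illustrative, not from the source:

```python
import numpy as np

def viterbi_score(log_pi, log_A, log_B, obs):
    """Log probability of the best state path of an HMM with start
    probabilities log_pi, transitions log_A, and emissions log_B
    for the observation sequence obs (a list of symbol indices)."""
    delta = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # best predecessor for every state, then emit the next observation
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return delta.max()

def most_probable_word(models, obs):
    """models maps each word w_i to its HMM parameters (lambda_i);
    return the word whose model scores the sequence highest."""
    return max(models, key=lambda w: viterbi_score(*models[w], obs))
```

Because only the maximum over the models is needed, the cheaper Viterbi (best-path) score is commonly used here in place of the full forward probability.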
One possibility to improve the speech recognition rate is to determine the N most probable words for a given phoneme sequence instead of only the single most probable word. However, using more candidate words means that many possible word sequences occur. One possibility to illustrate these different word sequences is the so-called confusion network. An example of such a confusion network is illustrated in the figure below:

An approach to find the sentence that was most probably spoken is to determine the most probable word sequence by means of the n-gram model.
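For a bigram model, this idea can be sketched as follows: every path through the confusion network (one word candidate per slot) is scored as the product of the conditional probability of each word given its predecessor, and the highest-scoring path is kept. The vocabulary, probabilities, and the crude floor value for unseen bigrams below are made up for illustration:

```python
import itertools
import math

def bigram_logprob(sentence, bigram, unigram):
    """Log probability of a word sequence under a bigram model:
    P(w_1 ... w_n) ~ P(w_1) * prod_i P(w_i | w_{i-1})."""
    score = math.log(unigram.get(sentence[0], 1e-6))
    for prev, cur in zip(sentence, sentence[1:]):
        # tiny floor as crude smoothing for unseen bigrams
        score += math.log(bigram.get((prev, cur), 1e-6))
    return score

def best_path(confusion_network, bigram, unigram):
    """Enumerate every word sequence through the confusion network
    and return the most probable one."""
    paths = itertools.product(*confusion_network)
    return max(paths, key=lambda p: bigram_logprob(p, bigram, unigram))
```

Exhaustive enumeration is only feasible for small networks; in practice, a dynamic-programming search over the network is used instead, but the scoring principle is the same.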
