N-Gram Model - Basics

The n-gram model is a language-model approach for determining the most probable word sequence among several candidate word sequences. By means of a probability model, it is possible to compute the probability of each candidate word sequence; the desired word sequence is the one with the greatest probability.

1 Motivation

Generally, the language model works in two steps. In the first step, the phoneme sequence generated by the acoustic model is used to determine very probable word sequences. Since there are several possible words for each time step, there are many possible word combinations that might have been spoken. The task of the second step is to choose, from all possible word sequences, the sentence that is most probable. There are different approaches to this in the literature. One approach, known as the grammar model, is based on the underlying grammatical structure: the grammatical rules of a language are used to choose the word sequence that is grammatically correct. Suppose there are two candidate word sequences, "I ball" and "I play". The grammar model decides for the word sequence "I play", since it knows that a verb is more likely to follow the word "I" than a noun. A more common approach is the so-called stochastic language model. Here, the probability of each possible word sequence is computed, and the word sequence with the greatest probability was most probably spoken by the user of the speech recognition software. The probability of each word sequence is determined by means of n-grams, which are word sequences of n words. The basic concept of n-grams is described in this article.

2 The n-gram Model

Before providing more information on the n-gram model, we first want to consider how to calculate the probability of a sentence. Assume that the sentence $\mathbf{W}$ consists of $N$ words $w_1, w_2, \cdots, w_N$. Then the probability $p(\mathbf{W})$ can be determined as follows:

$p(\mathbf{W}) = p(w_1, w_2, \cdots, w_N) = p(w_1) \cdot p(w_2|w_1) \cdot p(w_3|w_1,w_2) \cdots p(w_N|w_1, w_2, \cdots, w_{N-1}) = \prod\limits_{i=1}^{N} p(w_i|w_1, w_2, \cdots, w_{i-1})$
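The chain-rule decomposition above can be made concrete with a small helper that lists the conditional factors for a sentence (the function name `chain_rule_factors` is just for illustration):

```python
def chain_rule_factors(words):
    """List the conditional probabilities p(w_i | w_1, ..., w_{i-1})
    whose product gives the sentence probability p(W)."""
    return [f"p({words[0]})" if i == 0
            else f"p({words[i]} | {', '.join(words[:i])})"
            for i in range(len(words))]

chain_rule_factors(["w1", "w2", "w3"])
# → ['p(w1)', 'p(w2 | w1)', 'p(w3 | w1, w2)']
```

Each factor conditions on the entire history, which is exactly what makes the probabilities hard to estimate, as discussed next.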

The probability $p(w_i|w_1, w_2, \cdots, w_{i-1})$ is very hard or even impossible to determine, since many word sequences appear rarely or are unique. It is therefore reasonable to assume that the word $w_i$ depends only on the last $(n-1)$ words:

$p(w_i|w_1, w_2, \cdots, w_{i-1}) \approx p(w_i|w_{i-1}, w_{i-2}, \cdots, w_{i-n+1})$

This approximation is called the n-gram model in the literature. In earlier times, n-gram models of order $n = 2$, also called bigrams, were very common. Nowadays, trigrams ($n = 3$) or even higher orders are used, since computation power has increased in recent years; unigrams correspond to $n = 1$. Note that only bigrams will be used in the remainder of this article, since the extension from the bigram model to the trigram model and the general n-gram model is straightforward.

Using the n-gram model, the probability of a sentence becomes:

$p(\mathbf{W}) = p(w_1, w_2, \cdots, w_N) = \prod\limits_{i=1}^{N} p(w_i|w_{i-1}, w_{i-2}, \cdots, w_{i-n+1})$

For example, the probability of the sentence "Anne studies in Munich" can be calculated by

$p(\textit{Anne studies in Munich}) = p(Anne|\text{<s>}) \cdot p(studies|Anne) \cdot p(in|studies) \cdot p(Munich|in) \cdot p(\text{</s>}|Munich)$

Note that the $\text{<s>}$ token is used as a sentence start so that the probability $p(w_1|w_0)$ is defined. $\text{</s>}$ symbolizes the end of a sentence, so that the probabilities of all possible word sequences sum to one.
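The padding with sentence-boundary tokens and the resulting bigram factors can be sketched as follows (the function name `sentence_bigrams` is just for illustration):

```python
def sentence_bigrams(words):
    """Pad a token list with the sentence markers <s> and </s>, then
    list the bigrams whose probabilities are multiplied together."""
    padded = ["<s>"] + words + ["</s>"]
    return list(zip(padded, padded[1:]))

sentence_bigrams("Anne studies in Munich".split())
# → [('<s>', 'Anne'), ('Anne', 'studies'), ('studies', 'in'),
#    ('in', 'Munich'), ('Munich', '</s>')]
```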

3 Probabilities of n-grams

First of all, a training set is needed to estimate the probabilities of the n-grams from a given set of word sequences. The training set usually consists of many millions of words. The probability of an n-gram can be estimated as follows:

$p(w_i|w_{i-1}, w_{i-2}, \cdots, w_{i-n+1}) = \frac{C(w_{i-n+1}, \cdots, w_{i-2}, w_{i-1}, w_i)}{C(w_{i-n+1}, \cdots, w_{i-2}, w_{i-1})}$

where $C(w_{i-n+1}, \cdots, w_{i-2}, w_{i-1}, w_i)$ is the number of occurrences of the word sequence $w_{i-n+1}, \cdots, w_{i-2}, w_{i-1}, w_i$ in the training set. Using the above definition, the probability of a bigram yields:

$p(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$
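This count-based estimate can be sketched in a few lines of Python, assuming a pre-tokenized corpus (the function name `bigram_probability` and the toy corpus are illustrative):

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate p(word | prev_word) = C(prev_word, word) / C(prev_word)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

# Toy corpus, already tokenized
corpus = "the cat sat on the mat".split()
bigram_probability(corpus, "the", "cat")
# → 0.5, since C(the, cat) = 1 and C(the) = 2
```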

4 A small example

Let's consider a small example. We know that one of the sentences "Anne studies in Munich" and "Anton studies in Munich" was probably spoken by some speaker, and we want to determine which of them was most probably spoken. The procedure is as follows. First, we determine the probability of each sentence, $p(\textit{Anne studies in Munich})$ and $p(\textit{Anton studies in Munich})$, using the bigram model. Then we decide which sentence was spoken by comparing both probabilities: the sentence with the greater probability was more probably spoken.
In this example, the training set consists of the following three sentences: "Anne studies in Munich. Anton studies in Nuremberg. Anne studies electrical engineering." The training set is used to determine the probabilities of each bigram:

$p(Anne|\text{<s>}) = \frac{C(\text{<s>}, Anne)}{C(\text{<s>})} = \frac{2}{3}$

$p(Anton|\text{<s>}) = \frac{C(\text{<s>}, Anton)}{C(\text{<s>})} = \frac{1}{3}$

$p(studies|Anne) = \frac{C(Anne, studies)}{C(Anne)} = \frac{2}{2}$

$p(studies|Anton) = \frac{C(Anton, studies)}{C(Anton)} = \frac{1}{1}$

$p(in|studies) = \frac{C(studies, in)}{C(studies)} = \frac{2}{3}$

$p(Munich|in) = \frac{C(in, Munich)}{C(in)} = \frac{1}{2}$

$p(\text{</s>}|Munich) = \frac{C(Munich, \text{</s>})}{C(Munich)} = \frac{1}{1}$

These bigram probabilities are used to calculate the probability of each of the two sentences:

$p(\textit{Anne studies in Munich}) = p(Anne|\text{<s>}) \cdot p(studies|Anne) \cdot p(in|studies) \cdot p(Munich|in) \cdot p(\text{</s>}|Munich) = \frac{2}{3} \cdot \frac{2}{2} \cdot \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{1} \approx 0.222$

$p(\textit{Anton studies in Munich}) = p(Anton|\text{<s>}) \cdot p(studies|Anton) \cdot p(in|studies) \cdot p(Munich|in) \cdot p(\text{</s>}|Munich) = \frac{1}{3} \cdot \frac{1}{1} \cdot \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{1} \approx 0.111$

Consequently, the speech recognition software assumes that the spoken sentence was "Anne studies in Munich", since the probability $p(\textit{Anne studies in Munich}) \approx 0.222$ is greater than $p(\textit{Anton studies in Munich}) \approx 0.111$.
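The whole worked example can be reproduced with a short script that counts unigrams and bigrams in the training set and multiplies the bigram probabilities for each candidate sentence (function names are illustrative):

```python
from collections import Counter
from math import prod

# Training set from the example above
training = [
    "Anne studies in Munich",
    "Anton studies in Nuremberg",
    "Anne studies electrical engineering",
]

tokens = []
for sentence in training:
    tokens += ["<s>"] + sentence.split() + ["</s>"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_p(word, prev):
    """p(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_probability(sentence):
    """Product of the bigram probabilities of the padded sentence."""
    padded = ["<s>"] + sentence.split() + ["</s>"]
    return prod(bigram_p(w, prev) for prev, w in zip(padded, padded[1:]))

round(sentence_probability("Anne studies in Munich"), 3)   # → 0.222
round(sentence_probability("Anton studies in Munich"), 3)  # → 0.111
```

Note that concatenating the padded sentences introduces spurious `</s> <s>` bigrams between sentences, but these do not affect any of the counts used here.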

5 Problem of the N-Gram Model

In this section, the example of the previous section is used to illustrate the main problem of the n-gram model.
Consider, for example, the probability $p(\textit{Anne studies electrical engineering in Munich})$ computed with the training set from above. Although it seems reasonable that this probability should be nonzero, it is always zero, since $p(in|engineering) = 0$ in that training set. This problem does not only exist in this small example: even with very large training sets, it is not possible to estimate the probability of every word combination, since many word combinations never occur in the training data. This sparsity problem is especially severe if trigrams are used instead of bigrams.

There are many approaches that try to compensate for this problem. Examples are the backoff approach, Katz smoothing, Kneser-Ney smoothing, Good-Turing smoothing, and Laplace smoothing.
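As a minimal sketch of one of these techniques, Laplace (add-one) smoothing adds one to every bigram count and adds the vocabulary size $V$ to the denominator, so that unseen bigrams such as (engineering, in) receive a small nonzero probability (function names are illustrative):

```python
from collections import Counter

training = [
    "Anne studies in Munich",
    "Anton studies in Nuremberg",
    "Anne studies electrical engineering",
]

tokens = []
for sentence in training:
    tokens += ["<s>"] + sentence.split() + ["</s>"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))  # vocabulary size, including <s> and </s>

def p_laplace(word, prev):
    """Add-one smoothed bigram probability: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# The unseen bigram (engineering, in) now has a nonzero probability:
p_laplace("in", "engineering")  # → 1/11, since C(engineering) = 1 and V = 10
```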
