Backoff Approach

The Backoff Approach is an extension of the n-gram model. If an n-gram probability is zero due to sparsity of the training set, the probability of the corresponding (n-1)-gram is used instead when calculating the probability of a sentence $p(\mathbf{W})$.

1 Motivation

The n-gram model is a very common approach in language modeling for determining the most probable word sequence among several candidate sequences. The problem with this approach is that no training set can cover all possible n-gram combinations. Consequently, the probability of a rare n-gram is zero whenever it does not appear in the training set, even though it should be nonzero, since the n-gram can still appear in other texts. A naive approach to this problem is the Backoff Approach, which is the topic of this article.

2 Backoff Approach

The main idea of the Backoff Approach is the following: first, the probability of the n-gram is estimated from the training set. If this probability is zero, in other words the n-gram does not occur in the training set, a simpler model replaces the current one: the probability of the corresponding (n-1)-gram is computed instead. If that one does not exist either, the (n-2)-gram model replaces the (n-1)-gram model, and so on. Using the Backoff Approach with the trigram model, the probability is given by:

$p(w_i|w_{i-1},w_{i-2}) = \begin{cases} \frac{C(w_{i-2},w_{i-1},w_i)}{C(w_{i-2},w_{i-1})} & \text{if } C(w_{i-2},w_{i-1},w_i) > 0 \\ \frac{C(w_{i-1},w_i)}{C(w_{i-1})} & \text{if } C(w_{i-2},w_{i-1},w_i) = 0 \text{ and } C(w_{i-1},w_i) > 0 \\ \frac{C(w_i)}{N} & \text{otherwise} \end{cases}$

where $C(\mathbf{w})$ is the number of occurrences of the word sequence $\mathbf{w}$ and $N$ the number of words in the training set.
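The case distinction above can be sketched directly in Python. This is a minimal illustration of the unsmoothed backoff rule, assuming the n-gram counts are collected from a tokenized training set; the function names are chosen here for illustration, not a standard API.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_trigram_prob(w2, w1, w, tri, bi, uni, n_words):
    """p(w | w2 w1) with unsmoothed backoff to shorter n-grams."""
    if tri[(w2, w1, w)] > 0:                  # trigram seen in training data
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:                       # back off to the bigram
        return bi[(w1, w)] / uni[(w1,)]
    return uni[(w,)] / n_words                # back off to the unigram
```

For example, with `tokens = "a b c a b d".split()`, the seen trigram "a b c" gets probability $C(a,b,c)/C(a,b) = 1/2$, while the unseen "a b b" backs off all the way to the unigram estimate $C(b)/N = 2/6$. Note that `Counter` conveniently returns 0 for unseen n-grams instead of raising an error.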

3 A little example

Let's have a look at the example from the n-gram model again. Remember, the training set consists of the following three sentences: "Anne studies in Munich. Anton studies in Nuremberg. Anne studies electrical engineering". From it, the following bigram probabilities hold (with $\langle s\rangle$ and $\langle/s\rangle$ denoting sentence begin and end):

$p(Anne|\langle s\rangle) = \frac{2}{3}$

$p(studies|Anne) = \frac{2}{2}$

$p(electrical|studies) = \frac{1}{3}$

$p(engineering|electrical) = \frac{1}{1}$

$p(in|engineering) = 0$

$p(Munich|in) = \frac{1}{2}$

$p(\langle/s\rangle|Munich) = \frac{1}{1}$

The goal was to determine the probability $p(\textit{Anne studies electrical engineering in Munich.})$. With the simple n-gram approach, this probability came out as zero, since $p(in|engineering) = 0$. Using the Backoff Approach, we get a nonzero result. For this, we have to determine the probability $p(in|engineering)$ again: since the word sequence "engineering in" does not occur in the training data, we compute $p(in)$ instead, according to the Backoff Approach:

$p(in) = \frac{C(in)}{N} = \frac{2}{12} = \frac{1}{6}$

Consequently, the probability $p(\textit{Anne studies electrical engineering in Munich.})$ becomes:

$p(\textit{Anne studies electrical engineering in Munich.}) \\ = p(Anne|\langle s\rangle) \cdot p(studies|Anne) \cdot p(electrical|studies) \\ \cdot p(engineering|electrical) \cdot p(in|engineering) \cdot p(Munich|in) \cdot p(\langle/s\rangle|Munich) \\ = \frac{2}{3} \cdot \frac{2}{2} \cdot \frac{1}{3} \cdot \frac{1}{1} \cdot \frac{1}{6} \cdot \frac{1}{2} \cdot \frac{1}{1} = \frac{1}{54} \approx 0.019$
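The whole computation can be checked with a short script. The sentence markers `<s>` and `</s>` are an assumption about how sentence boundaries are represented in the n-gram model; $N$ counts only the twelve real words.

```python
from collections import Counter

# Toy training set from the example, with assumed sentence markers.
sentences = [
    "<s> Anne studies in Munich </s>",
    "<s> Anton studies in Nuremberg </s>",
    "<s> Anne studies electrical engineering </s>",
]
tokens = [w for s in sentences for w in s.split()]

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = sum(1 for w in tokens if w not in ("<s>", "</s>"))  # 12 real words

def bigram_backoff(w1, w):
    """Bigram probability, backing off to the unigram estimate if unseen."""
    if bigrams[(w1, w)] > 0:
        return bigrams[(w1, w)] / unigrams[w1]
    return unigrams[w] / N

test = "<s> Anne studies electrical engineering in Munich </s>".split()
p = 1.0
for w1, w in zip(test, test[1:]):
    p *= bigram_backoff(w1, w)
# Only "engineering in" backs off (to C(in)/N = 2/12); p == 1/54
```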

4 Problems

The main problem of the Backoff Approach lies in the fact that once probabilities of n-grams with different n are mixed, as in the small example above, the entire model no longer describes a probability distribution: the conditional probabilities need not sum to one over the vocabulary. The Katz Smoothing Approach, which is an extension of the Backoff Approach, solves this problem by discounting the higher-order probabilities and weighting the lower-order model such that the entire model is a probability distribution again.
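The normalization problem can be made concrete with the same toy corpus: summing the backed-off "probabilities" $p(w|\textit{engineering})$ over the whole vocabulary gives more than one, because the single seen bigram already uses up all the probability mass and the backed-off unigram estimates add extra mass on top. A minimal check (with assumed `<s>`/`</s>` markers):

```python
from collections import Counter

sentences = [
    "<s> Anne studies in Munich </s>",
    "<s> Anton studies in Nuremberg </s>",
    "<s> Anne studies electrical engineering </s>",
]
tokens = [w for s in sentences for w in s.split()]
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = sum(1 for w in tokens if w not in ("<s>", "</s>"))

def bigram_backoff(w1, w):
    if bigrams[(w1, w)] > 0:
        return bigrams[(w1, w)] / unigrams[w1]
    return unigrams[w] / N

# Sum over the vocabulary: a true distribution would give exactly 1.0.
total = sum(bigram_backoff("engineering", w) for w in set(tokens))
# Here the seen bigram contributes 1.0 and the backed-off unigrams
# contribute another 15/12, so total == 2.25: not a distribution.
```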
