This article introduces another smoothing technique trying to complete the set of commonly used methods used to improve the results of n-gram models. After revisiting shortly the underlying general motivation of smoothing, the Bayesian smoothing method will be reviewed which represents the generalized method of the Laplace smoothing technique.
The following article will give an in-depth overview of the three basic n-gram models in which characteristics and applications of each are highlighted. In order to measure the quality of a model, a common method will be introduced enabling comparing the models on a common basis.
1. The evaluation of perplexity
Perplexity is defined as the „inability to deal with or understanding something“. Applying this definition to the actual context of this article, implies that the method has to be an inversely proportional measure allowing to quantify the modelling of unseen sentences.
The method presumes having various test data sentences , whereupon every sentence consists of a sequence of words . In doing so, it is crucial to ensure, that those sentences are not part of the estimation corpus of the language model. Every test sentence produces a measurable probability depending on the language model. Applying this procedure to the set of sentences leads to Equation 1.1 representing the probability of the entire test data set .
Equation 1.1: The probability of an entire test data set.
With as the total number of words in the test data set and taking the logarithm of Equation 1.1, results in the log probability- see Equation 1.2.
Equation 1.2: The average log probability of an entire test data set.
Finally, as demonstrated in Equation 1.3, the perplexity can be defined as two to the power of the negative log probability.
Equation 1.3: The definition of perplexity.
Therefore, the greater the probability of „seeing“ a test sentence under a language model the smaller the perplexity of the test sentence. Thus, evaluation of perplexity symbolises an inversely proportional measure allowing to make statements about the quality of a language model.
2. General differences between basic n-gram models explained
Obviously, the main difference relies on the chosen N. As already explained in detail in this (Link) article, it is very difficult to calculate the probability of the entire history of word in a sentence. Hence, practical applications limit themselves to the history represented by N-1 words: unigram models (N=1), do not observe the history at all. A bigram model (N=2) takes only the previous word into account and finally a trigram model (N = 3) involves the two preceding words. Figure 1 shows an example split of a sentence into n-grams .
The approximated probability of the exemplary sentence can be derived for each model as illustrated in Equation 1.4 :
Increasing n therefore allows to cover a greater context by including „historical“ information, but at the same time leads to a significantly increased complexity that requires a large amount of processing power.
3. Google Books N-gram Viewer and Microsoft Web N-gram Services
Besides intelligent smoothing algorithms and parameter estimation, a proper training corpus is crucial when it comes to increasing reliability of results produced by language models. The following will describe two approaches made by researchers at two well-known firms: Microsoft and Google.
The research team working for Google based their corpus on the existing datasets gathered by Google Books service. The resulting corpus contains over 8 million books and supports eight different languages (English, Spanish, French, German, Russian, Italian, Chinese and Hebrew) . Instead, researcher at Microsoft chose to base their corpus on web documents indexed by their web search engine Bing. This approach includes hundreds of billions of websites - mainly in english due to their focus on the US market .
While comparing these applications, by nature interesting key points become apparent. Firstly, the Google approach based on books reveals a better contextual results using „Oxford English“ while the Microsoft approach suffers clearly from the language downgrade caused bydigitalization. But secondly, the approach offered by Microsoft is even capable of trading with common short hands becoming more and more popular in short messaging .
In the following - see Figure 2 - you can try yourself the Microsoft Web N-gram Services API by simply writing a test phrase and clicking on „Go“. It will return the probability to see the whole sentence and the conditional probability of seeing exactly your combination of concatenated words. Google does not offer an integrable API, but it can be used manually under https://books.google.com/ngrams.
Figure 2: Exemplary implementation of the Microsoft Web N-gram Services API.
Due to the bad estimations Laplace smoothing may lead to, normally the Laplace smoothing is not used for N-grams. Instead, we have many better smoothing methods, and Good-Turing smoothing is one of them. This article brings the idea of how this method works. Besides, an optimized method will be introduced at the end of this article.
Laplace smoothing (also called add-one smoothing) is a naive approach. However, it is not sufficient to be used in the practical situation. Good-Turing estimation is the core of many advanced smoothing techniques. Its basic idea is: categorizing the statistical parameters into classes according to how often it occurs, using the next class where the current number of times plus 1 to estimate the current class. For example, the count of things which have occurred once can be use to estimate the count of things which have never been seen.
2. Good-Turing Estimation
The principal of Good-Turing smoothing is to reallocate the probability of n-grams that occur c+1 times in the training data to the n-grams that occur c times. In particular, reallocate the probability of n-grams that were seen once to the n-grams that were never seen. In Figure 1, the whole process of Good-Turing smoothing is clearly presented. N1 is the number of n-grams seen once, N4417 is then the number of n-grams seen 4471 times. After the Good-Turing smoothing, the value of N1 is given in N0, the value of N2 is given to N1 and so on.
Figure 1: Good-Turing intuition
To conclude it, the following formulas are used. For each count c > 0, the new count c* for each count c is calculated depending on Nc+1. The first formula is employed when c=0 and the second formula is when c>0:
where Nc is the number of n-grams seen exactly c times.
Imagine that you are fishing and you already caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon and 1 eel. 18 fishes in total.
1. How likely is it when the next fish is trout? (MLE) 2. How likely is it when the next fish is tuna?(MLE and GT) 3. How likely is it now if the next fish is trout by using Good-Turing?
From this example, we can see that, the basic idea can be described as "robbing the rich to help the poor": all other fishes are giving their little portion of possibilities to the new fish. It can avoid the bad values (where value = 0) while not leading a bad estimation result like the Laplace estimation does.
When the count c is very big (the n-grams show up very often), Nc+1 is very likely to be 0. When looking back at the example, you will find out that N4 is zero and it is not possible to calculate P*GT(carp). Obviously, it is not right. For example, look at the Figure 2, there will always occur the same problem when there is a "hole" in between. And even if we are not just below a hole, for a high c, the Nc is quite noisy.
Then we can think of c* as :
However, it is hard to estimate the expectations, the original formula amounts to using the Maximum Likelihood estimate . In practice, observed values are used in Good-Turing method instead of the expected ones. But it is only suitable for the case where it has a huge volume of the training words and has a large number of observed values. Under the observation data shortage, the estimation is unreliable. Thus, we may say that, in order to use Good-Turing smoothing properly, all these problems have to be taken into consideration. Except this problem, the Good-Turing smoothing still makes a base for other smoothing methods.
The simple Good-Turing smoothing (Gale and Sampson, 1995) can deal with the "empty hole". In this method, the Nc counts are smoothed before calculating the c*: replace the Nc (which is zero) in the sequence with a value computed from a linear regression that is fit to map Nc to c in log space.
In practice, the discounted estimate c* is not used for all counts c. Large counts (where c>k for some threshold k) are assumed to be reliable. Katz suggests setting k at 5. Thus, we define :
The correct equation for c* when some k is introduced is:
The result after simple Good-Turing is shown in Figure 3. The "holes" are filled and the problems are gone.
 Jarufsky, D. & Martin, J. H. (2000). Speech & language processing. Pearson Education India.
 MacCartney, B. (2005). NLP Lunch Tutorial: Smoothing.
 Wang D. & Cui R. (2009). Data Smoothing Technology Summary. Computer Knowledge and Technology. v. 5, no. 17, pp. 4507-4509.
 Church, K. & W. Gale, (1991), A comparison of the enhanced GoodTuring and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, v. 5, pp. 19-54.
Interpolated Kneser-Ney smoothing is one of the most widely used modern N-gram smoothing methods. Before the introduction of Kneser-Ney smoothing, it is better to have a look at a discounting method which is called absolute discounting. Kneser-Ney algorithm is largely inspired and based on this discounting method.
The Good-Turing smoothing is too simple for the complicated situation, that's why practically another advanced smoothing method is preferred, which is the Kneser-Ney Smoothing method. The current study shows that, Kneser-Ney Smoothing is one of the best smoothing methods in many applications. It calculates the backoff probability based on how often each word appeared in the different context, rather than the number of occurrences.
2.1 Absolute Discounting
By analyzing an example of the estimated results from Good-Turing smoothing (Table 1), it is interesting that the differences between and are always approximate 0.75, in this case, except and It makes sense that to define a fixed discount and operate the smoothing process by simply taking this discount away from the original c. By doing this, the time for calculating will be saved. That is exactly the principal of absolute discounting.
Just like the interpolation smoothing (Jelinek-Mercer), the absolute discounting involves interpolation of higher- and lower-order models. But instead of multiplying the higher-order by a , we subtract a fixed discount from each nonzero count:
P(w) is the regular unigram but is it really good only using the unigram? Let's take a look at one example: suppose "San Francisco" is common, but "Francisco" occurs only after "San". "Francisco" will get a high unigram probability. According to the absolute discounting, "Francisco" will get a high probability although it only follows the "San". That's why it's better to consider the bigram model as well.
2.2 Kneser-Ney Smoothing
We want a heuristic that more accurately estimates the number of times we might expect to see word w in a new unseen context. The Kneser-Ney intuition is to base our estimate on the number or different contexts word w has appeared in.
Let's define how likely is w appeared as a novel continuation with :
Normalized by the total number of word bigram types (Jarufsky):
Alternative metaphor: The number of # of word types seen to precede :
is a normalizing constant; the probability mass we've discounted :
Continuation count is the number of unique single word contexts for •.
 Jarufsky, D. & Martin, J. H. (2000). Speech & language processing. Pearson Education India.
 MacCartney, B. (2005). NLP Lunch Tutorial: Smoothing.
 Luo w., Liu Q. & Bai S. (2009). A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling. Journal of Computer Research and Development. pp. 1704-1712.
In the language model, it is of interest to determine the most probable N words given a present phoneme sequence. For this, Hidden Markov Models are drawn up for each word of the training set and then to search for the greatest occurrence probability of a word by means of the Viterbi Algorithm.
Basically, the speech recognition can be divided into three steps. In the first step, the spoken sentences are preprocessed in such a way that characteristic feature vectors can be extracted. In the next major step, the acoustic model, a phonemes sequence is computed by means of the feature vectors. Finally in the language model, the corresponding sentences are determined to the phoneme sequence. For this, it is required to determine the most probable words given the phoneme sequence. The approach of how to do this is the topic of the following article.
2 Training Phase
Generally, for the determination of the most probable word, a model for each word, the speech recognition software should recognize has to be drawn up. This is done during a training phase. In this training phase, a dictionary with many thousands words is used as training set. For every word of this dictionary, a Hidden Markov Model is drawn up. Usually, a left-to-right Hidden Markov Model is used with five to seven states. A left-to-right Hidden Markov Model is a Hidden Markov Model, in which the states are ordered in a line. The only allowed state transitions are remaining in the same state or moving to the next state of the line. An example of such a left-to-right Hidden Markov Model is illustrated in the figure below.
The advantage of such Hidden Markov Models is that it improves modeling timing-controlled behavior. Usually, for each subword of a word, a state is used. For example, the word "office" can be divided into four subwords - the "O"-sound, the "F"-sound, the "I"-sound ,and the "S"-sound. Having drawn up the Hidden Markov Model for each word, parameters of each Hidden Markov Model are then trained by the usual training methods of Hidden Markov Models.
3 Determination of the most probable word
In the previous section, it was described how a Hidden Markov Model is drawn up for each word of the dictionary. Now, these models are used to determine the most probable word for a preset phoneme sequence. For this, for each Hidden Markov Model the occurrence probability of a word is determined by means of the Viterbi Algorithm given the phoneme sequence, which the acoustic model delivers. The maximum of these probabilities is the word which was most probably spoken. One possibility to optimize the speech recognition rate is to determine the most probable N words of a given phoneme sequence instead of the most probable word. However, using more possible words mean that many possible word sequences occur. One possibility to illustrate these different word sequences is the so-called confusion network. An example of such a confusion network is illustrated in figure below:
An approach to find the sentences, which was most probable spoken, is to determine the most probable word sequence by means of the n-gram model.
 Gales, M. & Young, S. (2008). The Apllication of Hidden Markov Models in Speech Recognition.
 Huang, X. & Deng, L. (2009). An Overview of Modern Speech Recognition.
 Pawate, B. I. & Robinson, P. D. (1996). Implementation of an HMM-Based, Speaker-Independent Speech Recognition System on the TMS320C2x and TMS320C5x
 Renals, S. & Morgan, N. & Bourlard, H. & Cohen, M. & Franco, H. (2002) Connectionist Probability Estimators in HMM Speech Recognition