Because of the poor estimates that Laplace smoothing can produce, it is normally not used for N-grams. Instead, there are several better smoothing methods, and Good-Turing smoothing is one of them. This article presents the idea behind the method; an optimized variant is introduced at the end.

1. Motivation

Laplace smoothing (also called add-one smoothing) is a naive approach and is not sufficient for practical use. Good-Turing estimation is the core of many advanced smoothing techniques. Its basic idea is to group the items being counted into classes according to how often they occur, and to use the class that occurs one more time (c + 1) to estimate the class that occurs c times. For example, the count of things that have occurred once can be used to estimate the count of things that have never been seen.

 

2. Good-Turing Estimation

2.1 Introduction

The principle of Good-Turing smoothing is to reallocate the probability of n-grams that occur c+1 times in the training data to the n-grams that occur c times. In particular, the probability of n-grams that were seen once is reallocated to the n-grams that were never seen. Figure 1 presents the whole process: N1 is the number of n-grams seen once, and N4417 is the number of n-grams seen 4417 times. After Good-Turing smoothing, the value of N1 is given to N0, the value of N2 is given to N1, and so on.

Figure 1: Good-Turing intuition

To summarize, the following formulas are used: for each count c, the new count c* is calculated from Nc+1. The first formula is employed when c = 0 and the second when c > 0 [3]:

P^*_{GT}(\text{unseen}) = \frac{N_1}{N}, \qquad c^* = (c+1)\,\frac{N_{c+1}}{N_c}

where Nc is the number of n-grams seen exactly c times and N is the total number of n-grams observed in the training data.
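As a rough illustration of these two formulas, the following Python sketch (the function name good_turing_counts is chosen here for illustration, not taken from the literature) builds the frequency-of-frequencies table Nc from a raw count table and applies them:

from collections import Counter

def good_turing_counts(counts):
    """Basic Good-Turing re-estimation.

    counts: dict mapping each observed n-gram to its raw count c.
    Returns (p_unseen, adjusted): p_unseen is the probability mass N1/N
    reserved for unseen n-grams, and adjusted maps each n-gram to its
    discounted count c* = (c + 1) * N_{c+1} / N_c.
    """
    N = sum(counts.values())              # total number of observations
    Nc = Counter(counts.values())         # frequency of frequencies: Nc[c]
    p_unseen = Nc[1] / N                  # first formula (c = 0)

    adjusted = {}
    for gram, c in counts.items():
        # second formula (c > 0); c* is undefined when N_{c+1} == 0,
        # which is exactly the "hole" problem discussed in section 2.3
        adjusted[gram] = (c + 1) * Nc[c + 1] / Nc[c] if Nc[c + 1] else None
    return p_unseen, adjusted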

 

2.2 Example

Imagine that you are fishing and you have already caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon and 1 eel: 18 fish in total.

1. How likely is it that the next fish is a trout? (MLE)
2. How likely is it that the next fish is a tuna? (MLE and GT)
3. How likely is it that the next fish is a trout when using Good-Turing?

Using Maximum Likelihood Estimation (see Laplace smoothing, chapter 2):

P_{MLE}(\text{trout}) = \frac{1}{18}, \qquad P_{MLE}(\text{tuna}) = \frac{0}{18} = 0

Using Good-Turing smoothing:

P^*_{GT}(\text{tuna}) = \frac{N_1}{N} = \frac{3}{18}

c^*(\text{trout}) = (1+1)\,\frac{N_2}{N_1} = 2 \cdot \frac{1}{3} = \frac{2}{3}

P^*_{GT}(\text{trout}) = \frac{2/3}{18} = \frac{1}{27} \approx 0.037

From this example we can see that the basic idea amounts to "robbing the rich to help the poor": every fish that has been seen gives up a small portion of its probability to the new, unseen fish. This avoids zero probabilities without producing estimates as distorted as those of Laplace smoothing.
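The numbers above can be checked with a short Python sketch (the variable names are chosen here for illustration; the counts are exactly those of the fishing example):

from collections import Counter

# Observed catch: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
counts = {"carp": 10, "perch": 3, "whitefish": 2,
          "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                 # 18 fish in total
Nc = Counter(counts.values())            # N1 = 3, N2 = 1, N3 = 1, N10 = 1

# Maximum Likelihood Estimation
p_mle_trout = counts["trout"] / N        # 1/18
p_mle_tuna = 0 / N                       # 0

# Good-Turing smoothing
p_gt_tuna = Nc[1] / N                    # N1/N = 3/18
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # 2 * (1/3) = 2/3
p_gt_trout = c_star_trout / N            # (2/3)/18 = 1/27, about 0.037

print(p_mle_trout, p_gt_tuna, p_gt_trout)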

 

2.3 Problems

When the count c is very large (the n-gram shows up very often), Nc+1 is very likely to be 0. Looking back at the example, N4 is zero, so c* cannot be computed for perch (seen 3 times), and likewise N11 is zero, so P*GT(carp) cannot be computed. Obviously, this is not right. As Figure 2 shows, the same problem occurs whenever there is a "hole" in the sequence of counts. And even if we are not just below a hole, Nc is quite noisy for high c [3].

Figure 2: Example (Jurafsky)

Then we can think of c* as [3]:

c^* = (c+1)\,\frac{E[N_{c+1}]}{E[N_c]}

However, these expectations are hard to estimate, and the original formula amounts to using the Maximum Likelihood estimate [2]. In practice, the observed values are therefore used in the Good-Turing method instead of the expected ones. This only works well when there is a large volume of training data and a large number of observed values; when observations are scarce, the estimate is unreliable [3]. Thus, in order to use Good-Turing smoothing properly, all of these problems have to be taken into consideration. Despite them, Good-Turing smoothing still forms the basis of other smoothing methods.

 

2.4 Optimized method: Simple Good-Turing smoothing

Simple Good-Turing smoothing (Gale and Sampson, 1995) deals with the "empty holes". In this method, the Nc counts are smoothed before c* is calculated: every Nc that is zero is replaced with a value computed from a linear regression fit to the relation between c and Nc in log space [2][4].
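A minimal sketch of this smoothing step is shown below. It assumes a plain least-squares fit of log Nc against log c; Gale and Sampson's full procedure includes further refinements (such as averaging neighbouring counts and switching between raw and smoothed values) that are omitted here.

import numpy as np

def smooth_Nc(Nc, max_c):
    """Fill holes in the N_c sequence with a log-log linear fit.

    Nc: dict mapping each count c (>= 1) to its observed, non-zero N_c.
    Returns a dict with a (possibly fractional) value for every c up to
    max_c: the observed N_c where it exists, the fitted value otherwise.
    """
    cs = np.array(sorted(Nc))
    ys = np.array([Nc[c] for c in cs], dtype=float)
    # least-squares fit of log N_c = a + b * log c
    b, a = np.polyfit(np.log(cs), np.log(ys), 1)

    smoothed = {}
    for c in range(1, max_c + 1):
        smoothed[c] = Nc.get(c) or float(np.exp(a + b * np.log(c)))
    return smoothed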

 

In practice, the discounted estimate c* is not used for all counts c. Large counts (c > k for some threshold k) are assumed to be reliable, and Katz suggests setting k to 5. Thus, we define [2]:

c^* = c \quad \text{for } c > k

The corrected equation for c* once such a k is introduced is:

c^* = \frac{(c+1)\,\frac{N_{c+1}}{N_c} \;-\; c\,\frac{(k+1)\,N_{k+1}}{N_1}}{1 \;-\; \frac{(k+1)\,N_{k+1}}{N_1}}, \qquad 1 \le c \le k
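As a rough illustration, this thresholded estimate could be computed as follows (the function name is chosen here for illustration, and the Nc values are assumed to have already been smoothed so that none of the required entries is zero):

def katz_discounted_count(c, Nc, k=5):
    """Discounted count c* with Katz's threshold k (default 5).

    Nc: dict mapping each count to N_c, assumed to have non-zero entries
    for every count needed here (e.g. after the smoothing step above).
    """
    if c > k:
        return c                                  # large counts are kept as-is
    gt = (c + 1) * Nc[c + 1] / Nc[c]              # plain Good-Turing estimate
    corr = (k + 1) * Nc[k + 1] / Nc[1]            # correction term
    return (gt - c * corr) / (1 - corr)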

The result after simple Good-Turing is shown in Figure 3. The "holes" are filled and the problems are gone. 

Figure 3

 

References

[1] Jurafsky, D. & Martin, J. H. (2000). Speech and Language Processing. Pearson Education India.

[2] MacCartney, B. (2005). NLP Lunch Tutorial: Smoothing.

[3] Wang D. & Cui R. (2009). Data Smoothing Technology Summary. Computer Knowledge and Technology. v. 5, no. 17, pp. 4507-4509.

[4] Church, K. & Gale, W. (1991). A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, v. 5, pp. 19-54.

