Kneser-Ney smoothing

Interpolated Kneser-Ney smoothing is one of the most widely used modern N-gram smoothing methods. Before introducing Kneser-Ney smoothing, it is worth looking at a discounting method called absolute discounting, since the Kneser-Ney algorithm is largely inspired by and built on it.

1. Motivation

Good-Turing smoothing is often too simple for complicated situations, so in practice a more advanced method is preferred: Kneser-Ney smoothing. Studies show that Kneser-Ney smoothing is one of the best smoothing methods in many applications. It calculates the backoff probability based on how many different contexts each word has appeared in, rather than on its raw number of occurrences.

2. Introduction

2.1 Absolute Discounting

By analyzing an example of the estimated results from Good-Turing smoothing (Table 1), it is interesting that in this case the differences between $c$ and $c^*$ are approximately 0.75, except for $c=0$ and $c=1$. It therefore makes sense to define a fixed discount and perform smoothing by simply subtracting this discount from the original count $c$, which also saves computation time. That is exactly the principle of absolute discounting.

Just like interpolation smoothing (Jelinek-Mercer), absolute discounting interpolates higher- and lower-order models. But instead of multiplying the higher-order $p_{ML}$ by a $\lambda$, we subtract a fixed discount $d \in [0,1]$ from each nonzero count:

$P_{\text{AbsoluteDiscounting}}(w_i|w_{i-1})=\frac{c(w_{i-1},w_i)-d}{c(w_{i-1})}+\lambda(w_{i-1})P(w)$

$P(w)$ is the regular unigram probability, but is the plain unigram really a good choice? Consider an example: suppose "San Francisco" is common, but "Francisco" occurs only after "San". Then "Francisco" gets a high unigram probability,[3] and under absolute discounting it receives a high backoff probability even though it only ever follows "San". That is why it is better to take the bigram contexts into account as well.
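As a minimal sketch of the formula above, the following hypothetical helper (the name `absolute_discounting` and the toy corpus are illustrative assumptions, not from the original) builds an absolutely discounted bigram model that backs off to the plain unigram distribution, which is exactly the weakness just discussed:

```python
from collections import Counter

def absolute_discounting(bigrams, d=0.75):
    """Return P(w | prev) with absolute discounting (illustrative sketch).

    `bigrams` is a list of (prev, w) pairs; `d` is the fixed discount.
    Backoff is the plain unigram distribution, as in the formula above.
    """
    bigram_counts = Counter(bigrams)
    context_counts = Counter(prev for prev, _ in bigrams)
    unigram_counts = Counter(w for _, w in bigrams)
    total = sum(unigram_counts.values())

    def prob(w, prev):
        c_context = context_counts[prev]
        # number of distinct word types observed after `prev`
        types_after = len({v for (p, v) in bigram_counts if p == prev})
        # lambda(prev): the discounted mass, redistributed to the unigram model
        lam = d * types_after / c_context
        return (max(bigram_counts[(prev, w)] - d, 0) / c_context
                + lam * unigram_counts[w] / total)

    return prob
```

With this choice of $\lambda$, the conditional probabilities for a seen context sum to one, since the mass removed by the discount is exactly what the unigram term redistributes.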

2.2 Kneser-Ney Smoothing

We want a heuristic that more accurately estimates the number of times we might expect to see word $w$ in a new, unseen context. The Kneser-Ney intuition is to base our estimate on the number of different contexts word $w$ has appeared in.[2]

Let's define how likely $w$ is to appear as a novel continuation with $P_{CONTINUATION}$:

$P_{CONTINUATION}(w)\propto\left|\left\{w_{i-1}:c(w_{i-1},w)>0\right\}\right|$

Normalized by the total number of word bigram types (Jurafsky):

$\left|\left\{(w_{j-1},w_j):c(w_{j-1},w_j)>0\right\}\right|$

$P_{CONTINUATION}(w)=\frac{\left|\left\{w_{i-1}:c(w_{i-1},w)>0\right\}\right|}{\left|\left\{(w_{j-1},w_j):c(w_{j-1},w_j)>0\right\}\right|}$
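The definition above can be sketched directly: count the distinct left contexts of each word and normalize by the number of distinct bigram types. The helper name `continuation_probability` and the toy data are assumptions for illustration:

```python
from collections import Counter

def continuation_probability(bigrams):
    """P_CONTINUATION(w): distinct left contexts of w, divided by the
    total number of distinct bigram types (sketch of the formula above)."""
    bigram_types = set(bigrams)                        # distinct (prev, w) pairs
    contexts_of = Counter(w for _, w in bigram_types)  # distinct contexts per w
    n_types = len(bigram_types)
    return {w: c / n_types for w, c in contexts_of.items()}
```

On the "San Francisco" example, "francisco" ends up with a low continuation probability no matter how frequent the bigram is, because it only ever follows one context, while a word like "glass" that follows many different words scores high.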

An alternative metaphor: the number of word types seen to precede $w$, normalized by the number of word types preceding all words $w'$ [1]:

$P_{CONTINUATION}(w)=\frac{\left|\left\{w_{i-1}:c(w_{i-1},w)>0\right\}\right|}{\sum_{w'}\left|\left\{w'_{i-1}:c(w'_{i-1},w')>0\right\}\right|}$

Putting these pieces together, interpolated Kneser-Ney smoothing for bigrams is:

$P_{KN}(w_i|w_{i-1})=\frac{\max(c(w_{i-1},w_i)-d,0)}{c(w_{i-1})}+\lambda(w_{i-1})P_{CONTINUATION}(w_i)$

$\lambda$ is a normalizing constant that distributes the probability mass we've discounted [1]:

$\lambda(w_{i-1})=\frac{d}{c(w_{i-1})}\left|\left\{w:c(w_{i-1},w)>0\right\}\right|$
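The bigram formulas above can be combined into one sketch. The helper name `kneser_ney_bigram` is an assumption for illustration; it follows the discounted bigram term, the $\lambda$ definition, and $P_{CONTINUATION}$ as given:

```python
from collections import Counter

def kneser_ney_bigram(bigrams, d=0.75):
    """Interpolated Kneser-Ney for bigrams (sketch of the formulas above)."""
    bigram_counts = Counter(bigrams)
    context_counts = Counter(prev for prev, _ in bigrams)
    bigram_types = set(bigram_counts)
    continuation = Counter(w for _, w in bigram_types)  # distinct contexts per w
    followers = Counter(prev for prev, _ in bigram_types)  # distinct words after prev
    n_types = len(bigram_types)

    def prob(w, prev):
        # lambda(prev) = d / c(prev) * |{w : c(prev, w) > 0}|
        lam = d * followers[prev] / context_counts[prev]
        p_cont = continuation[w] / n_types
        return (max(bigram_counts[(prev, w)] - d, 0) / context_counts[prev]
                + lam * p_cont)

    return prob
```

Because $P_{CONTINUATION}$ sums to one over the vocabulary and $\lambda$ hands it exactly the discounted mass, the conditional distribution for a seen context still sums to one.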

For higher-order N-grams, Kneser-Ney is defined recursively:

$P_{KN}(w_i|w^{i-1}_{i-n+1})=\frac{\max(c_{KN}(w^i_{i-n+1})-d,0)}{c_{KN}(w^{i-1}_{i-n+1})}+\lambda(w^{i-1}_{i-n+1})P_{KN}(w_i|w^{i-1}_{i-n+2})$

$c_{KN}(\bullet)=\begin{cases}\text{count}(\bullet) & \text{for the highest order}\\ \text{continuation count}(\bullet) & \text{for lower orders}\end{cases}$

The continuation count is the number of unique single-word contexts for $\bullet$.
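As a sketch of this definition, the following hypothetical helper (the name `continuation_counts` and the toy corpus are illustrative assumptions) computes $c_{KN}$ for bigrams used at a lower order of a trigram model, i.e. the number of distinct single-word contexts each bigram was seen after:

```python
from collections import Counter

def continuation_counts(trigrams):
    """c_KN for bigrams at a lower order: the number of distinct
    single-word contexts each bigram appears after (sketch)."""
    trigram_types = set(trigrams)  # distinct (w1, w2, w3) triples
    return Counter((w2, w3) for _, w2, w3 in trigram_types)
```

So a bigram that occurs many times but always after the same word gets a continuation count of 1, mirroring the "Francisco" intuition one level up the recursion.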

References

[1] Jurafsky, D. & Martin, J. H. (2000). Speech & Language Processing. Pearson Education India.

[2] MacCartney, B. (2005). NLP Lunch Tutorial: Smoothing.

[3] Luo W., Liu Q. & Bai S. (2009). A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling. Journal of Computer Research and Development. pp. 1704-1712.