# Katz Smoothing Approach

The Katz Smoothing Approach is an extension of the n-gram model. A problem arises when a combination of words has no corresponding n-gram in the training set; the Katz Smoothing Approach addresses this data sparsity of the training set.

# 1 Motivation

The n-gram model is a very common approach in language modelling for determining the most probable word sequence among several candidate sequences. Usually, the training set does not cover all possible n-gram combinations. Consequently, the probability of n-grams that do not occur in the training set is zero. However, it should be nonzero, because such n-grams can still appear in other texts. A naive solution is the Backoff Approach: if the n-gram probability is zero because of the sparsity of the training set, the probability of the (n-1)-gram is used instead to calculate the sentence probability $p(\mathbf{W})$. The main problem of the Backoff Approach is that once the probabilities of n-grams with different n are mixed, the model no longer describes a probability distribution. The Katz Smoothing Approach, a further extension of the Backoff Approach, solves this problem by weighting the shorter n-gram model such that the entire model is a probability distribution again. This is the topic of this article.

# 2 Katz Smoothing Approach

The basic idea of the Katz Smoothing Approach is to combine higher-order n-gram models with lower-order n-gram models. Since the combined model is then no longer a probability distribution, the different n-gram models are weighted. If an n-gram occurs often in the training set, its probability is determined as in the original n-gram model. If the n-gram occurs only rarely, the original n-gram probability is reduced slightly by a discount weight.
The probability mass freed in this way is redistributed to the so-called unseen n-grams, i.e. the n-grams that do not occur in the training set. Their probability is approximated by the weighted probability of the corresponding (n-1)-grams. Using this approach, the trigram probability $p(w_i|w_{i-1}, w_{i-2})$ can be calculated by

$p(w_i|w_{i-1}, w_{i-2}) = \begin{cases} \frac{C\left(w_{i-2},w_{i-1},w_i\right)}{C\left(w_{i-2},w_{i-1}\right)} & \text{if } C\left(w_{i-2},w_{i-1},w_i\right) > k \\ d\left(C\left(w_{i-2},w_{i-1},w_i\right)\right) \frac{C\left(w_{i-2},w_{i-1},w_i\right)}{C\left(w_{i-2},w_{i-1}\right)} & \text{if } 0 < C\left(w_{i-2},w_{i-1},w_i\right) \leq k \\ \alpha\left(w_{i-1}, w_{i-2}\right) p\left(w_i|w_{i-1}\right) & \text{otherwise} \end{cases}$

where $C(\mathbf{w})$ is the number of occurrences of the word sequence $\mathbf{w}$ in the training set, $d(r)$ is a discount coefficient, $\alpha\left(w_{i-1}, w_{i-2}\right)$ is a normalisation constant, and $k$ is a count threshold. Katz suggests setting $k$ to 5 [1].
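As an illustration, the piecewise definition above can be sketched in Python. The function and argument names are illustrative; `d`, `alpha`, and `bigram_prob` are assumed to be supplied as callables (their definitions follow below in the text), and the count tables are plain dictionaries mapping word tuples to counts:

```python
def katz_trigram_prob(w2, w1, w, tri_counts, bi_counts, d, alpha, bigram_prob, k=5):
    """Sketch of the piecewise Katz estimate for p(w | w2, w1).

    tri_counts / bi_counts map word tuples to occurrence counts C(.);
    d, alpha and bigram_prob are assumed to be supplied by the caller.
    """
    c3 = tri_counts.get((w2, w1, w), 0)
    c2 = bi_counts.get((w2, w1), 0)
    if c3 > k:
        return c3 / c2                         # frequent n-gram: plain relative frequency
    if c3 > 0:
        return d(c3) * c3 / c2                 # rare n-gram: discounted relative frequency
    return alpha(w2, w1) * bigram_prob(w, w1)  # unseen n-gram: weighted (n-1)-gram backoff
```

Note how the three branches mirror the three cases of the equation: only counts at or below the threshold $k$ are discounted, and only unseen trigrams fall back to the bigram model.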

The discount coefficient $d(r)$ is derived from the Good-Turing estimate
$d(r) = \frac{(r+1)\, n_{r+1}}{r \cdot n_r},$
where $n_r$ is the number of n-grams that occur exactly $r$ times in the training set.
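The counts-of-counts $n_r$, and from them the discount, can be computed directly from an n-gram count table. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def good_turing_discount(r, ngram_counts):
    """d(r) = (r+1) n_{r+1} / (r n_r), with n_r = number of n-grams seen exactly r times."""
    count_of_counts = Counter(ngram_counts.values())  # n_r for every count r in the table
    n_r = count_of_counts[r]
    if r == 0 or n_r == 0:
        raise ValueError("d(r) is undefined when r = 0 or n_r = 0")
    return (r + 1) * count_of_counts[r + 1] / (r * n_r)
```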

The normalisation constant $\alpha\left(w_{i-1},&space;w_{i-2}\right)$ can be calculated by

$\alpha\left(w_{i-1}, w_{i-2}\right) = \frac{1-\sum\limits_{w_i:\, C\left(w_{i-2},w_{i-1},w_i\right) > 0} p\left(w_i|w_{i-1}, w_{i-2}\right)}{1-\sum\limits_{w_i:\, C\left(w_{i-2},w_{i-1},w_i\right) > 0} p\left(w_i|w_{i-1}\right)}.$
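This normalisation can be sketched as follows; `discounted_tri_prob` and `bi_prob` stand for the already-smoothed trigram estimate and the bigram estimate, and the names are illustrative:

```python
def backoff_alpha(w2, w1, seen_next_words, discounted_tri_prob, bi_prob):
    """alpha(w1, w2): trigram probability mass left over after discounting,
    divided by the bigram mass outside the seen continuations, so that the
    backed-off probabilities sum exactly to the left-over mass."""
    seen_tri_mass = sum(discounted_tri_prob(w2, w1, w) for w in seen_next_words)
    seen_bi_mass = sum(bi_prob(w, w1) for w in seen_next_words)
    return (1.0 - seen_tri_mass) / (1.0 - seen_bi_mass)
```

The numerator is the probability mass freed by discounting the seen trigrams; dividing by the remaining bigram mass spreads it over the unseen continuations in proportion to their bigram probabilities.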

So far, the Katz Smoothing Approach has been introduced only for trigrams, but it can be defined analogously for any n-gram model: simply replace the trigrams by n-grams and the bigrams by (n-1)-grams.

Beyond the Katz Smoothing Approach described here, there are many similar approaches, for example Kneser-Ney Smoothing, which is effective for very sparse training sets [2].

# 3 References

[1] Katz, S. (1987). Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing.

[2] Gales, M. & Young, S. (2008). The Application of Hidden Markov Models in Speech Recognition, Chapter 2.

[3] Chen, S. & Goodman, J. (1998). An Empirical Study of Smoothing Techniques for Language Modeling.

[4] Hifny, Y. (2012). Smoothing Techniques for Arabic Diacritics Restoration. University of Helwan, Egypt.