# Laplace smoothing

This article introduces another smoothing technique, completing the set of commonly used methods for improving the results of n-gram models. After briefly revisiting the general motivation for smoothing, the Bayesian smoothing method is reviewed, which generalizes the Laplace smoothing technique.

# 1. Motivation

Primarily, a language model based on a certain training corpus has the goal of estimating the conditional probability of word sequences. Since the corpus is finite, the model inevitably has to handle unseen words. However, the probability of an unseen word - refer to the basics of language modeling for further information - will generally be underestimated; in the worst case, the probability equals zero.

Therefore, smoothing techniques discount the probabilities of "seen" words and transfer this freed probability mass to unseen words [1][2].

# 2. From the "Bayesian smoothing using Dirichlet priors" to the "Laplace smoothing" technique

## 1. Introduction

Based on the motivation described above, the literature proposes a general smoothed model that serves as the starting point for every smoothing technique and improves the understanding of how these methods operate - see Equation 1.1 [1]:

Equation 1.1: The general smoothed model

$$p(w|d) = \begin{cases} p_{s}(w|d) & \text{if word } w \text{ is seen} \\ \alpha_{d}\, p(w|C) & \text{otherwise} \end{cases}$$

$p_{s}(w|d)$ symbolizes the smoothed probability of a word $w$ seen in the training corpus $d$, while $p(w|C)$ is the collection language model and $\alpha_{d}$ is an adjusting coefficient that ensures all probabilities sum to one. Zhai (2004) [1] notes that "a language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution". Choosing the parameters of the Dirichlet accordingly yields the model for $p(w|d)$ illustrated in Equation 1.2. For detailed explanations of this topic, please refer to Zhai (2004) [1].

Equation 1.2: The Bayesian smoothing technique

$$p_{\mu}(w|d) = \frac{c(w;d) + \mu\, p(w|C)}{\sum_{w' \in V} c(w';d) + \mu}$$
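Equation 1.2 can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function name and the example tokens are assumptions, and the pseudo-count parameter $\mu$ is left to the caller:

```python
from collections import Counter

def dirichlet_smoothed_prob(word, doc_tokens, collection_tokens, mu):
    """Bayesian (Dirichlet-prior) smoothed probability p_mu(w|d) of Equation 1.2.

    doc_tokens and collection_tokens are lists of word strings;
    mu is the Dirichlet pseudo-count parameter.
    """
    c_wd = Counter(doc_tokens)[word]                                  # c(w; d)
    p_wC = Counter(collection_tokens)[word] / len(collection_tokens)  # p(w|C)
    # The denominator sum_{w' in V} c(w'; d) is simply the document length.
    return (c_wd + mu * p_wC) / (len(doc_tokens) + mu)
```

Because the collection model assigns non-zero mass to every word seen anywhere in the collection, a word missing from the document still receives a positive probability, and the estimates over the vocabulary still sum to one.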

## 2. Laplace smoothing

Now, the Laplace smoothing technique is just a special case of the method described in the previous subsection. For unigrams, the probability of seeing a word $w_{i}$ in the training corpus is defined as $p(w_{i}) = \frac{C(w_{i})}{N}$, where $C(w_{i})$ counts all occurrences of the word and $N$ is the size of the training corpus, i.e. the total number of unigrams. If the word does not occur in the corpus, this yields a probability of zero - exactly the problem described in the motivation.
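The zero-probability problem of the unsmoothed estimate is easy to demonstrate; the toy corpus below is an illustrative assumption:

```python
from collections import Counter

def mle_unigram_prob(word, corpus_tokens):
    """Unsmoothed maximum-likelihood estimate p(w_i) = C(w_i) / N."""
    counts = Counter(corpus_tokens)   # C(w_i) for every word in the corpus
    return counts[word] / len(corpus_tokens)

corpus = "the cat sat on the mat".split()
mle_unigram_prob("cat", corpus)   # 1/6, since "cat" occurs once among N = 6 tokens
mle_unigram_prob("dog", corpus)   # 0.0 - the unseen-word problem
```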

To circumvent this issue, Laplace smoothing uses a simple method: by adding one to all counts, the minimal value of the resulting probability moves from zero to $\frac{1}{N+V}$, where $V$ is the size of the vocabulary. Certainly, this technique can easily be extended to general n-gram models [2]:

Equation 1.3: The Laplace smoothing technique

$$p(w_{i}|w_{i-1} \dots w_{i-n+1}) = \frac{C(w_{i-n+1} \dots w_{i-1}w_{i}) + 1}{C(w_{i-n+1} \dots w_{i-1}) + V}$$
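For concreteness, Equation 1.3 can be sketched for the bigram case ($n = 2$); the function name is an illustrative assumption:

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, corpus_tokens):
    """Add-one (Laplace) smoothed bigram probability, Equation 1.3 with n = 2:
    p(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)."""
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigram_counts = Counter(corpus_tokens)
    V = len(set(corpus_tokens))  # vocabulary size
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)
```

Every bigram, seen or unseen, now receives a positive probability, and for a fixed history the probabilities over the vocabulary still sum to one.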

# References

[1] Zhai, C. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2), 179-214.