The following article will give an in-depth overview of the three basic n-gram models in which characteristics and applications of each are highlighted. In order to measure the quality of a model, a common method will be introduced enabling comparing the models on a common basis.

1. The evaluation of perplexity

Perplexity is defined as the „inability to deal with or understanding something“. Applying this definition to the actual context of this article, implies that the method has to be an inversely proportional measure allowing to quantify the modelling of unseen sentences.

The method presumes having various test data sentences , whereupon every sentence consists of a sequence of words . In doing so, it is crucial to ensure, that those sentences are not part of the estimation corpus of the language model. Every test sentence produces a measurable probability  depending on the language model. Applying this procedure to the set of sentences leads to Equation 1.1 representing the probability of the entire test data set [1].


 Equation 1.1: The probability of an entire test data set.

 


 

With  as the total number of words in the test data set and taking the logarithm of Equation 1.1, results in the log probability  - see Equation 1.2.


 

 Equation 1.2: The average log probability of an entire test data set.


 

Finally, as demonstrated in Equation 1.3, the perplexity can be defined as two to the power of the negative log probability.


 Equation 1.3: The definition of perplexity. 


Therefore, the greater the probability of „seeing“ a test sentence under a language model the smaller the perplexity of the test sentence. Thus, evaluation of perplexity symbolises an inversely proportional measure allowing to make statements about the quality of a language model.

 

2. General differences between basic n-gram models explained

 

Obviously, the main difference relies on the chosen N. As already explained in detail in this (Link) article, it is very difficult to calculate the probability of the entire history of word in a sentence. Hence, practical applications limit themselves to the history represented by N-1 words: unigram models (N=1), do not observe the history at all. A bigram model (N=2) takes only the previous word into account and finally a trigram model (N = 3) involves the two preceding words. Figure 1 shows an example split of a sentence into n-grams [2].

 

Exemplary split of a phrase into uni-, bi- and trigrams
Figure 1: Exemplary split of a phrase into uni-, bi- and trigrams

 

The approximated probability of the exemplary sentence can be derived for each model as illustrated in Equation 1.4 :


Equation 1.4:


 

Increasing n therefore allows to cover a greater context by including „historical“ information, but at the same time leads to a significantly increased complexity that requires a large amount of processing power.

 

3. Google Books N-gram Viewer and Microsoft Web N-gram Services

Besides intelligent smoothing algorithms and parameter estimation, a proper training corpus is crucial when it comes to increasing reliability of results produced by language models. The following will describe two approaches made by researchers at two well-known firms: Microsoft and Google.

The research team working for Google based their corpus on the existing datasets gathered by Google Books service. The resulting corpus contains over 8 million books and supports eight different languages (English, Spanish, French, German, Russian, Italian, Chinese and Hebrew) [3]. Instead, researcher at Microsoft chose to base their corpus on web documents indexed by their web search engine Bing. This approach includes hundreds of billions of websites - mainly in english due to their focus on the US market [4].

While comparing these applications, by nature interesting key points become apparent. Firstly, the Google approach based on books reveals a better contextual results using „Oxford English“ while the Microsoft approach suffers clearly from the language downgrade caused by digitalization. But secondly, the approach offered by Microsoft is even capable of trading with common short hands becoming more and more popular in short messaging [3][4].

In the following - see Figure 2 -  you can try yourself the Microsoft Web N-gram Services API by simply writing a test phrase and clicking on „Go“. It will return the probability to see the whole sentence and the conditional probability of seeing exactly your combination of concatenated words. Google does not offer an integrable API, but it can be used manually under https://books.google.com/ngrams.

 

    ProbabilityConditional probability
    
       
      Figure 2: Exemplary implementation of the Microsoft Web N-gram Services API. 

References

[1] Collins, M. (2011). Language Modeling. Retrieved from http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CCMQFjAA&url=http%3A%2F%2Fwww.cs.columbia.edu%2F~mcollins%2Fcourses%2Fnlp2011%2Fnotes%2Flm.pdf&ei=euzNVN63LuT9ywPBg4GQBA&usg=AFQjCNEwT21Fc5NIB0h5ED6n9AYUPBdAog&bvm=bv.850768.

[2] Boyd-Graber, J. (2011). Language Models. Retrieved from http://www.umiacs.umd.edu/~jbg/teaching/INFM_718_2011/lecture_6.pdf.

[3] Lin, Y (2012). Syntactic annotations for the google books ngram corpus. In Proceedings of the ACL 2012 system demonstrations, pages 169-174.

[4] Wang, K (2010). An overview of Microsoft Web N-gram corpus and applications. In Proceedings of the NAACL HLT 2010 Demonstration Session, pages 45-48.

 


Contents