language model and perplexity

In the previous articles, we learned how to calculate the probability of a given sentence using an n-gram language model. Today, let's discuss how to calculate the perplexity of a corpus using a language model. To be able to follow along, it is recommended that you work through the exercises in the previous articles.

Perplexity is the standard metric for measuring the quality of a language model. Qualitatively, perplexity measures the average branching factor per token, i.e., the effective number of choices the model weighs when predicting the next token. Let's take a look at the two extreme ends of the metric.

  • perplexity of 1: the language model is 100% certain when predicting the next token. This occurs only if the language model is severely over-fit to the evaluation corpus. In practice, this should never happen.
  • perplexity of V, where V is the size of the vocabulary: the language model assumes a uniform distribution, i.e., it is guessing completely at random. If this is what we get from the language model, we might as well roll a V-sided die to predict the next token.

So, perplexity should lie somewhere between 1 and V. A better language model will show lower perplexity in general.
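To make the uniform-distribution extreme concrete, here is a small sketch (using V = 9 and N = 10 to match this article's toy corpus; the values are illustrative) showing that a model assigning probability 1/V to every token scores a perplexity of exactly V:

```python
import math

# A model that assigns uniform probability 1/V to every token.
V = 9   # vocabulary size (illustrative; matches this article's toy corpus)
N = 10  # total tokens in the evaluation corpus

# log10 of the product of all N token probabilities, each equal to 1/V
log10_prob = N * math.log10(1.0 / V)

# perplexity = (product of token probabilities) ** (-1/N)
perplexity = 10 ** (-log10_prob / N)
print(perplexity)  # equals V, the random-guess upper bound
```

Note that the exponents cancel: the perplexity of a uniform model is V regardless of N.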

Quantitatively, perplexity is given by

    PP = (q(s_1) * q(s_2) * ... * q(s_n)) ** (-1/N)

(see https://en.wikipedia.org/wiki/Perplexity)

where q(s) is the probability of sentence s, n is the number of sentences, and N is the number of tokens in the corpus. Let's calculate the perplexity of our 2-gram model from before using two evaluation sentences. Create an eval.txt file with the following contents:

that is not the question
that is that

We already calculated q(s1) and q(s2) before: 10^(-3.273691) and 10^(-4.3119626), respectively. Here, we have 10 tokens in total (including </s> but excluding <s>), i.e., N = 10. Finally, our perplexity is given by

(10**(-3.273691 - 4.3119626))**(-1/10.) ≈ 5.735
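The same arithmetic in a few lines of Python, using nothing beyond the sentence log-probabilities already computed above:

```python
# Per-sentence log10 probabilities from the previous article:
# log10 q(s1) = -3.273691, log10 q(s2) = -4.3119626
log10_probs = [-3.273691, -4.3119626]
N = 10  # total tokens, counting </s> but not <s>

# perplexity = (q(s1) * q(s2)) ** (-1/N), computed in log space
perplexity = 10 ** (-sum(log10_probs) / N)
print(round(perplexity, 4))  # -> 5.7354
```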

Now, let’s use the query utility from KenLM to verify that our calculation is indeed correct.

$ query 2gram.arpa < eval.txt
that=7 1 -1.3542756 is=8 2 -0.24913573 not=6 2 -0.50381577 the=9 2 -0.6380301 question=10 2 -0.26421693 </s>=2 2 -0.26421693 Total: -3.273691 OOV: 0
that=7 1 -1.3542756 is=8 2 -0.24913573 that=7 1 -1.3542756 </s>=2 1 -1.3542756 Total: -4.3119626 OOV: 0
Perplexity including OOVs: 5.735421689408422
Perplexity excluding OOVs: 5.735421689408422
OOVs: 0
Tokens: 10

Hooray! Our calculation matches exactly what query outputs. Note that our vocabulary size is V = 9 (including </s>), so our model predicts much better than the random-guess perplexity of 9.
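If you have the KenLM Python bindings installed (pip install kenlm), the same numbers can be reproduced programmatically. A sketch, assuming the 2gram.arpa file from the previous article sits in the working directory; it skips the scoring step gracefully if the bindings or the model file are unavailable:

```python
sentences = ["that is not the question", "that is that"]

# 10 tokens in total: count </s> once per sentence, but never <s>.
N = sum(len(s.split()) + 1 for s in sentences)

try:
    import kenlm  # Python bindings for KenLM

    model = kenlm.Model("2gram.arpa")  # built in the previous article
    # score() returns a log10 probability; bos/eos add <s> and </s>.
    total_log10 = sum(model.score(s, bos=True, eos=True) for s in sentences)
    print(10 ** (-total_log10 / N))  # should match query: ~5.7354
except Exception:  # broad catch, since this is only a sketch
    print("kenlm bindings or 2gram.arpa not available")
```

The bindings also expose model.perplexity(s), which applies the same formula to a single sentence.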