Wednesday, April 8, 2026

Evaluating Perplexity on Language Models


A language model is a probability distribution over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. Specifically, you will learn:

  • What is perplexity, and how to compute it
  • How to evaluate the perplexity of a language model with sample data

Let’s get started.

Photo by Lucas Davis. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • What Is Perplexity and How to Compute It
  • Evaluate the Perplexity of a Language Model with HellaSwag Dataset

What Is Perplexity and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities the model assigns to the tokens in the sample, where each token's probability is conditioned on the tokens that precede it. Mathematically, perplexity is defined as:

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i \mid x_{1:i-1})^{-1/L} = \exp\Big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i \mid x_{1:i-1})\Big)
$$

Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient and numerically stable to compute it via the second form above: take the mean of the log probabilities, negate it, and exponentiate, rather than multiply many small probabilities directly.
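As a quick sanity check, here is a minimal sketch showing that the two forms above agree, using made-up per-token probabilities:

```python
import math

# Hypothetical probabilities a model assigns to a 4-token sequence
probs = [0.5, 0.25, 0.1, 0.8]
L = len(probs)

# Form 1: inverse of the geometric mean of the probabilities
ppl_product = math.prod(p ** (-1 / L) for p in probs)

# Form 2: exponential of the negative mean log probability
ppl_logform = math.exp(-sum(math.log(p) for p in probs) / L)

print(ppl_product, ppl_logform)  # both ≈ 3.1623
```

The product of the probabilities is 0.01, so the geometric mean is $0.01^{1/4} \approx 0.316$ and the perplexity is its inverse, about 3.16.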

Perplexity is a metric that quantifies how much a language model hesitates about the next token on average. If the language model is absolutely certain, the perplexity is 1. If the language model is completely uncertain, every token in the vocabulary is equally likely, and the perplexity equals the vocabulary size. A model can in principle score worse than uniform guessing, so perplexity is not strictly capped at the vocabulary size, but in practice you should not expect it to go far beyond this range.

Evaluate the Perplexity of a Language Model with HellaSwag Dataset

Perplexity is a dataset-dependent metric. One dataset you can use is HellaSwag. It is a dataset with train, test, and validation splits. It is available on the Hugging Face hub, and you can load it with the following code:

Running this code prints the name and size of each split.

You can see that the validation split has 10,042 samples. This is the dataset you will use in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context that needs to be completed. The model is expected to complete the sequence by selecting one of the four endings. The key "label", with values 0 to 3, indicates which ending is correct.

With this, you can write a short code to evaluate your own language model. Let’s use a small model from Hugging Face as an example:

This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124M-parameter model that you can easily run on a low-profile computer. The model and tokenizer are loaded using the Hugging Face transformers library. You also load the HellaSwag validation dataset.

In the for-loop, you tokenize the activity label and the context. You also tokenize each of the four endings. Note that tokenizer.encode() is the method for using the tokenizer from the transformers library. It is different from the tokenizer object you used in the previous article.

Next, for each ending, you run the concatenated context and ending through the model. The input_ids tensor is a 2D tensor of integer token IDs with a batch dimension of 1. The model returns an output object, from which you extract the logits tensor. This differs from the model you built in the previous article, as this is a model object from the transformers library, but you can swap in your own trained model with minor changes.

GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ is the model's estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert these logits to log probabilities and average the log probabilities of the tokens in each ending.
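The indexing can be illustrated with a dummy logits tensor; the values are random, so only the shapes matter here, and the token IDs are made up:

```python
import torch

V = 50257              # GPT-2 vocabulary size
n = 5                  # length of the tokenized activity label + context
ending = [11, 42, 7]   # made-up token IDs for one ending
L = n + len(ending)    # total input length

logits = torch.randn(1, L, V)  # stand-in for the model output

# Position p predicts token p+1, so the predictions for the ending
# occupy rows n-1 .. L-2 of the logits tensor
token_probs = torch.log_softmax(logits[0, n - 1 : -1], dim=-1)
print(token_probs.shape)  # torch.Size([3, 50257])

# token_probs[j, token] is the log probability of that token at position j
mean_log_prob = token_probs[torch.arange(len(ending)), torch.tensor(ending)].mean()
perplexity = torch.exp(-mean_log_prob)
print(perplexity.item())
```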

The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability of the ending's tokens is negated and exponentiated to give the perplexity. A good model is expected to assign the lowest perplexity to the correct ending, so you can evaluate a model by counting the number of correct predictions over the entire HellaSwag validation split.

The code prints the perplexity of each ending and marks the correct answer with (O) or (!) and the model’s wrong prediction with (X). You can see that GPT-2 has a perplexity of 10 to 20, even for a correct answer. Advanced LLMs can achieve perplexity below 10, even with a much larger vocabulary size than GPT-2. More important is whether the model can identify the correct ending: the one that naturally completes the sentence. It should be the one with the lowest perplexity; otherwise, the model cannot generate the correct ending. GPT-2 achieves only 30% accuracy on this dataset.

You can also repeat the code with a different model. Here are the results:

  • model openai-community/gpt2: This is the smallest GPT-2 model with 124M parameters, used in the code above. The accuracy is 3041/10042 or 30.28%
  • model openai-community/gpt2-medium: This is the larger GPT-2 model with 355M parameters. The accuracy is 3901/10042 or 38.85%
  • model meta-llama/Llama-3.2-1B: This is the smallest model in the Llama family with 1B parameters. The accuracy is 5731/10042 or 57.07%

Therefore, it is natural to see higher accuracy with larger models.

Note that you should not compare perplexities across models with vastly different architectures. Since perplexity is a metric in the range of 1 to the vocabulary size, it highly depends on the tokenizer. You can see the reason when you compare the perplexity in the code above after replacing GPT-2 with Llama 3.2 1B: The perplexity is an order of magnitude higher for Llama 3, but the accuracy is indeed better. This is because GPT-2 has a vocabulary size of only 50,257, while Llama 3.2 1B has a vocabulary size of 128,256.
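The dependence on vocabulary size is easiest to see in the limiting case of a completely uncertain model, whose perplexity equals the vocabulary size exactly:

```python
import math

for V in (50257, 128256):  # GPT-2 and Llama 3.2 vocabulary sizes
    # A uniform model assigns probability 1/V to every token,
    # so its perplexity is exp(-log(1/V)) = V
    perplexity = math.exp(-math.log(1 / V))
    print(V, perplexity)
```

A uniform baseline over Llama 3.2's vocabulary is already more than twice as perplexed as one over GPT-2's, which is why raw perplexities from the two models are not directly comparable.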

Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:

  • Perplexity measures how much a model hesitates about the next token on average.
  • Perplexity is a metric sensitive to vocabulary size.
  • Computing perplexity amounts to taking the inverse of the geometric mean of the token probabilities, i.e., exponentiating the negative mean log probability.


