Wednesday, April 8, 2026

Evaluating Perplexity on Language Models


A language model is a probability distribution over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. Specifically, you will learn:

  • What is perplexity, and how to compute it
  • How to evaluate the perplexity of a language model with sample data

Let’s get started.

Photo by Lucas Davis. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • What Is Perplexity and How to Compute It
  • Evaluate the Perplexity of a Language Model with HellaSwag Dataset

What Is Perplexity and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities the model assigns to the tokens in the sample, where each token's probability is conditioned on the tokens that precede it. Mathematically, perplexity is defined as:

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i \mid x_{1:i-1})^{-1/L} = \exp\Big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i \mid x_{1:i-1})\Big)
$$

Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient and numerically stable to compute it via the second form above: take the mean of the log probabilities, negate it, and exponentiate, rather than multiply many small probabilities directly.
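As a quick sanity check, here is a minimal sketch showing that the two forms above agree, using made-up per-token probabilities:

```python
import math

# Hypothetical probabilities a model assigns to a 4-token sequence
probs = [0.5, 0.25, 0.1, 0.8]
L = len(probs)

# Form 1: inverse of the geometric mean of the probabilities
ppl_product = math.prod(p ** (-1 / L) for p in probs)

# Form 2: exponential of the negative mean log probability
ppl_logform = math.exp(-sum(math.log(p) for p in probs) / L)

print(ppl_product, ppl_logform)  # both ≈ 3.1623
```

The product of the probabilities is 0.01, so the geometric mean is $0.01^{1/4} \approx 0.316$ and the perplexity is its inverse, about 3.16.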

Perplexity is a metric that quantifies how much a language model hesitates about the next token on average. If the language model is absolutely certain, the perplexity is 1. If the language model is completely uncertain, every token in the vocabulary is equally likely, and the perplexity equals the vocabulary size. A model can in principle score worse than uniform guessing, so perplexity is not strictly capped at the vocabulary size, but in practice you should not expect it to go far beyond this range.

Evaluate the Perplexity of a Language Model with HellaSwag Dataset

Perplexity is a dataset-dependent metric. One dataset you can use is HellaSwag. It is a dataset with train, test, and validation splits. It is available on the Hugging Face hub, and you can load it with the following code:

Running this code prints the name and size of each split.

You can see that the validation split has 10,042 samples. This is the dataset you will use in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context that needs to be completed. The model is expected to complete the sequence by selecting one of the four endings. The key "label", with values 0 to 3, indicates which ending is correct.

With this, you can write a short code to evaluate your own language model. Let’s use a small model from Hugging Face as an example:

This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124M-parameter model that you can easily run on a low-profile computer. The model and tokenizer are loaded using the Hugging Face transformers library. You also load the HellaSwag validation dataset.

In the for-loop, you tokenize the activity label and the context. You also tokenize each of the four endings. Note that tokenizer.encode() is the method for using the tokenizer from the transformers library. It is different from the tokenizer object you used in the previous article.

Next, for each ending, you run the concatenated context and ending through the model. The input_ids tensor is a 2D tensor of integer token IDs with a batch dimension of 1. The model returns an output object, from which you extract the logits tensor. This differs from the model you built in the previous article, as this is a model object from the transformers library, but you can swap in your own trained model with minor changes.

GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ is the model's estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert these logits to log probabilities and average the log probabilities of the tokens in each ending.
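The indexing can be illustrated with a dummy logits tensor; the values are random, so only the shapes matter here, and the token IDs are made up:

```python
import torch

V = 50257              # GPT-2 vocabulary size
n = 5                  # length of the tokenized activity label + context
ending = [11, 42, 7]   # made-up token IDs for one ending
L = n + len(ending)    # total input length

logits = torch.randn(1, L, V)  # stand-in for the model output

# Position p predicts token p+1, so the predictions for the ending
# occupy rows n-1 .. L-2 of the logits tensor
token_probs = torch.log_softmax(logits[0, n - 1 : -1], dim=-1)
print(token_probs.shape)  # torch.Size([3, 50257])

# token_probs[j, token] is the log probability of that token at position j
mean_log_prob = token_probs[torch.arange(len(ending)), torch.tensor(ending)].mean()
perplexity = torch.exp(-mean_log_prob)
print(perplexity.item())
```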

The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability of the ending's tokens is negated and exponentiated to give the perplexity. A good model is expected to assign the lowest perplexity to the correct ending, so you can evaluate a model by counting the number of correct predictions over the entire HellaSwag validation split.

The code prints the perplexity of each ending and marks the correct answer with (O) or (!) and the model’s wrong prediction with (X). You can see that GPT-2 has a perplexity of 10 to 20, even for a correct answer. Advanced LLMs can achieve perplexity below 10, even with a much larger vocabulary size than GPT-2. More important is whether the model can identify the correct ending: the one that naturally completes the sentence. It should be the one with the lowest perplexity; otherwise, the model cannot generate the correct ending. GPT-2 achieves only 30% accuracy on this dataset.

You can also repeat the code with a different model. Here are the results:

  • model openai-community/gpt2: This is the smallest GPT-2 model with 124M parameters, used in the code above. The accuracy is 3041/10042 or 30.28%
  • model openai-community/gpt2-medium: This is the larger GPT-2 model with 355M parameters. The accuracy is 3901/10042 or 38.85%
  • model meta-llama/Llama-3.2-1B: This is the smallest model in the Llama family with 1B parameters. The accuracy is 5731/10042 or 57.07%

Therefore, it is natural to see higher accuracy with larger models.

Note that you should not compare perplexities across models with vastly different architectures. Since perplexity is a metric in the range of 1 to the vocabulary size, it highly depends on the tokenizer. You can see the reason when you compare the perplexity in the code above after replacing GPT-2 with Llama 3.2 1B: The perplexity is an order of magnitude higher for Llama 3, but the accuracy is indeed better. This is because GPT-2 has a vocabulary size of only 50,257, while Llama 3.2 1B has a vocabulary size of 128,256.
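The dependence on vocabulary size is easiest to see in the limiting case of a completely uncertain model, whose perplexity equals the vocabulary size exactly:

```python
import math

for V in (50257, 128256):  # GPT-2 and Llama 3.2 vocabulary sizes
    # A uniform model assigns probability 1/V to every token,
    # so its perplexity is exp(-log(1/V)) = V
    perplexity = math.exp(-math.log(1 / V))
    print(V, perplexity)
```

A uniform baseline over Llama 3.2's vocabulary is already more than twice as perplexed as one over GPT-2's, which is why raw perplexities from the two models are not directly comparable.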

Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:

  • Perplexity measures how much a model hesitates about the next token on average.
  • Perplexity is a metric sensitive to vocabulary size.
  • Computing perplexity amounts to taking the inverse of the geometric mean of the token probabilities, i.e., exponentiating the negative mean log probability.


