
Large Language Model Evaluation: The Complete Guide

Generative AI applications and other artificial intelligence technologies use large language models (LLMs) to predict, summarize, or generate text. LLM-powered applications can help improve productivity and cut costs, but only if they make trustworthy decisions (or inferences). To improve LLM outcomes and ROI, it’s critical to evaluate model performance by assessing the accuracy, ethics, and relevance of outputs.

This guide to large language model evaluation discusses the strategies and metrics used to assess model performance and provides a list of tools to help streamline or automate the evaluation process.

A guide to large language model evaluation

Numerous methodologies and metrics can be used to evaluate LLMs throughout the development, fine-tuning, and production stages. Most enterprises don’t use a stock LLM as-is, however. They wrap the model in an application that adds functionality like RAG (retrieval-augmented generation), management controls, and safety and security measures. 

Although this is by no means an exhaustive list, the table below details some of the most common strategies and metrics used in large language model (and LLM-powered application) evaluation. Each option is covered in more detail below.

| LLM Evaluation Strategy | Description | Tips |
| --- | --- | --- |
| Evaluation datasets | Compares LLM outputs to a specific dataset to evaluate accuracy. | Leverage ML-powered tools to streamline data curation and evaluation |
| Summarization metrics | Evaluates the accuracy of an LLM tool’s text summarization capabilities. | Example metrics include BLEU, ROUGE, Perplexity |
| Named Entity Recognition (NER) metrics | Evaluates an LLM’s ability to correctly identify and classify specific entities. | Example metrics include Precision, Recall, F1 score |
| Retrieval-Augmented Generation (RAG) metrics | Evaluates the accuracy and relevance of RAG outputs. | Example metrics include Faithfulness, Answer relevance, Context precision, Answer correctness |
| Ethical AI metrics | Ensures an LLM meets ethical standards in its training and inference. | Use bias and toxicity detection tools to ensure outputs meet ethical standards |
| LLM evaluation tools | Uses third-party benchmarks and platforms to streamline and standardize evaluation. | Example tools include GLUE, DeepEval, Granica Screen |

Evaluation datasets (a.k.a. golden datasets)

LLM developers typically evaluate model accuracy in the training stage by comparing its output responses to a specific dataset, often referred to as an evaluation dataset or golden dataset. Creating this golden dataset manually can be difficult and time-consuming: the data must be carefully curated for accuracy and must be broad enough to test the LLM across a variety of input scenarios and topics. This evaluation phase can drive up the costs of developing a new model or LLM-powered application.

One way to streamline this process and keep costs manageable is to leverage another LLM to generate evaluation datasets. This approach saves time and effort, but it relies upon the accuracy of one large language model to assess the accuracy of another, so it’s beneficial to have human engineers monitor the process to ensure quality results. Another method is to use an ML-powered data discovery and classification tool to automatically curate large datasets.
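Below is a minimal sketch of this LLM-assisted approach. It assumes the OpenAI Python client and an API key are available; the model name, prompt wording, and helper function are illustrative assumptions rather than part of this guide.

```python
# Hedged sketch: using one LLM to draft question/answer pairs for an
# evaluation dataset. The model name and prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(source_text: str, n_pairs: int = 5) -> list[dict]:
    """Ask an LLM to draft Q&A pairs grounded in a trusted document."""
    prompt = (
        f"Write {n_pairs} question-and-answer pairs based strictly on the text "
        f"below. Respond with a JSON array of objects containing 'question' "
        f"and 'answer' keys and nothing else.\n\nTEXT:\n{source_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice, parse defensively; models sometimes wrap JSON in extra text.
    return json.loads(response.choices[0].message.content)

# As noted above, human reviewers should still vet every generated pair
# before it enters the golden dataset.
```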

Summarization metrics

LLMs are commonly used to generate summaries of complex topics or lengthy documents. Some common metrics used to evaluate the accuracy and relevance of these summaries include:

BLEU

Bilingual Evaluation Understudy (BLEU) measures the precision of LLM-generated text, i.e., how closely it matches human-written reference text, on a numerical scale from 0 to 1.

Example:

Reference: “Hello Captain Vimes”, “No way lmao”

LLM Output: “Hello Captain Vimes”, “No way lmao”

BLEU Results: 1.0

Reference: “Hello Captain Vimes”, “No way lmao”

LLM Output: “Hello Vimes”, “Not happening”

BLEU Results: 0.3
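In practice, BLEU is rarely computed by hand. The sketch below uses the Hugging Face evaluate library (the implementation cited in the sources at the end of this post) and reuses the strings from the example above; exact scores depend on the n-gram order and tokenizer settings.

```python
# Minimal BLEU sketch with the Hugging Face `evaluate` library
# (pip install evaluate).
import evaluate

bleu = evaluate.load("bleu")

predictions = ["Hello Captain Vimes", "No way lmao"]
references = [["Hello Captain Vimes"], ["No way lmao"]]  # one or more references per prediction

# max_order=2 because the example sentences are only three tokens long;
# the default 4-gram BLEU is not meaningful for such short strings.
result = bleu.compute(predictions=predictions, references=references, max_order=2)
print(result["bleu"])  # expected: 1.0 for the exact match above
```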

ROUGE

Recall-Oriented Understudy for Gisting Evaluation is a group of metrics that evaluate LLM summarization and NLP (natural language processing) translations. It also uses a numerical scale from 0 to 1.

Example:

References: “Goodbye”, “Mos Eisley”

LLM Outputs: “Hello goodbye”, “Ankh Morpork”

ROUGE Results: 0.5, 0.0
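ROUGE can be computed the same way with the Hugging Face evaluate library. The snippet below is a sketch reusing the example strings; the exact values depend on which ROUGE variant (ROUGE-1, ROUGE-2, ROUGE-L) you read.

```python
# Minimal ROUGE sketch with the Hugging Face `evaluate` library.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["Hello goodbye", "Ankh Morpork"]
references = ["Goodbye", "Mos Eisley"]

# use_aggregator=False returns one score per prediction/reference pair
# instead of a single averaged score.
results = rouge.compute(predictions=predictions, references=references, use_aggregator=False)
print(results["rouge1"])  # partial unigram overlap for the first pair, 0.0 for the second
```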

Perplexity

Perplexity (PPL) evaluates an LLM’s accuracy when encountering new data. A lower perplexity score indicates a higher degree of accuracy. To read more about how PPL is calculated and to see a demonstration, view the Hugging Face developer page linked in the sources below.
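As a rough illustration, the Hugging Face evaluate implementation of perplexity scores text against a reference causal language model; GPT-2 is used here purely as an example model choice, and the sample sentences are made up.

```python
# Minimal perplexity sketch with the Hugging Face `evaluate` library.
# The metric loads the named causal language model and reports how
# "surprised" it is by each input text.
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")

texts = [
    "Wellington is the capital of New Zealand.",
    "Capital Wellington of the Zealand is New.",
]

results = perplexity.compute(model_id="gpt2", predictions=texts)
print(results["perplexities"])  # the scrambled sentence should score much higher
```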

Named Entity Recognition (NER) metrics

Named entity recognition (NER) refers to an LLM’s ability to correctly identify and classify specific entities (words, phrases, acronyms, etc.) within a dataset. NER is usually evaluated based on the following metrics:

| NER Metric Name | Description | Formula |
| --- | --- | --- |
| Precision | The ratio of correctly identified positives (“true positives”) to all identified positives; determines how many positive identifications are correctly labeled. | Precision = #True_Positive / (#True_Positive + #False_Positive) |
| Recall | The ratio of true positives to all actual positives; determines the LLM’s ability to correctly predict all positives. | Recall = #True_Positive / (#True_Positive + #False_Negative) |
| F1 Score | Measures the balance between precision and recall. | F1 Score = 2 * Precision * Recall / (Precision + Recall) |
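These formulas translate directly into code. The sketch below assumes you have already counted true positives, false positives, and false negatives by comparing predicted entities against labeled ones; the counts shown are made up for illustration.

```python
# Precision, recall, and F1 from entity-level counts, mirroring the
# formulas in the table above.
def ner_scores(true_positive: int, false_positive: int, false_negative: int) -> dict:
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# e.g. 80 entities identified correctly, 10 identified incorrectly, 20 missed
print(ner_scores(true_positive=80, false_positive=10, false_negative=20))
# {'precision': 0.888..., 'recall': 0.8, 'f1': 0.842...}
```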

Retrieval-Augmented Generation (RAG) metrics

Retrieval-augmented generation, or RAG, supplements an LLM’s training dataset by querying external sources at inference time, improving the accuracy and relevance of outputs. Examples of RAG evaluation metrics include:

Faithfulness

Faithfulness determines how factually consistent a generated output is with the external source material. It’s measured on a scale of 0 to 1, with 1 being the best score.

Formula: 

Faithfulness = Number of claims in the output that can be inferred from given context / Total number of claims in the output

Example:

Question: Who was the first woman to win a Nobel Prize?

Context: Marie Curie, a Polish-French physicist and chemist, was the first woman to receive a Nobel Prize for developing the theory of radioactivity.

High Faithfulness: Marie Curie was the first woman to win a Nobel Prize.

Low Faithfulness: Irène Joliot-Curie was the first woman to win a Nobel Prize.

Answer relevance

The answer relevancy metric evaluates how pertinent an answer is to the input prompt. It’s also measured on a scale of 0 to 1, with 1 being the best score. It does not measure accuracy, just whether the answer completely addresses the prompt question.

Example:

Question: Where is New Zealand and what is its capital?

High Relevance: New Zealand is an island in the southwestern Pacific Ocean, and its capital is Wellington.

Low Relevance: New Zealand’s capital is Wellington.

Context precision

Context precision measures how many items relevant to the golden dataset (also known as the “ground truth” in RAG) appear in the retrieved context and how highly they are ranked. Essentially, context precision checks that the ground-truth information that directly answers the question is surfaced ahead of irrelevant information rather than buried beneath it.

Example:

Question: Where is New Zealand and what is its capital?

Ground Truth: New Zealand is an island in the southwestern Pacific Ocean, and its capital is Wellington.

High Context Precision: “New Zealand, an island country in the southwestern Pacific Ocean, consists of two main landmasses and over 700 smaller islands. Its capital, Wellington, sits near the southernmost point of the North Island”, “New Zealand is known for its diverse environment including active volcanoes, glacier lakes, dazzling fjords, and long sandy beaches.”

Low Context Precision: “New Zealand is known for its diverse environment including active volcanoes, glacier lakes, dazzling fjords, and long sandy beaches”, “New Zealand, an island country in the southwestern Pacific Ocean, consists of two main landmasses and over 700 smaller islands. Its capital, Wellington, sits near the southernmost point of the North Island.”

Answer correctness

Answer correctness measures the accuracy of generated outputs compared to the ground truth. The major difference between this and the faithfulness metric is that faithfulness compares the output to the retrieved external sources, while answer correctness compares it to the golden dataset.

Example:

Ground Truth: The president of the U.S. from 1933 to 1945 was Franklin D. Roosevelt.

High Answer Correctness: Franklin D. Roosevelt was the president of the United States from 1933 to 1945.

Low Answer Correctness: Franklin D. Roosevelt was the president of the United States from 1901 to 1909.
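All four RAG metrics above are implemented in the open-source Ragas library cited in the sources below. The sketch that follows is a hedged example rather than the only way to run them: column names and the evaluate() call vary between Ragas versions, and an LLM API key is required because Ragas uses an LLM as the judge. The sample row reuses the New Zealand example.

```python
# Hedged sketch of scoring the four RAG metrics with Ragas
# (pip install ragas datasets).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    answer_correctness,
)

rows = {
    "question": ["Where is New Zealand and what is its capital?"],
    "answer": [
        "New Zealand is an island country in the southwestern Pacific Ocean, "
        "and its capital is Wellington."
    ],
    "contexts": [[
        "New Zealand, an island country in the southwestern Pacific Ocean, "
        "consists of two main landmasses and over 700 smaller islands. "
        "Its capital, Wellington, sits near the southernmost point of the North Island."
    ]],
    "ground_truth": [
        "New Zealand is an island in the southwestern Pacific Ocean, "
        "and its capital is Wellington."
    ],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, answer_correctness],
)
print(result)  # per-metric scores between 0 and 1
```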

Ethical AI metrics

As large language models and generative AI technology continue to affect more sectors of society, it becomes ever more important for LLM developers to ensure ethical training, inference, and usage. 

The key to developing ethical AI is to implement responsible practices and safeguards, starting with the source data used to pre-train the foundation model and continuing through fine-tuning by the enterprises that consume it and on to the user-facing application. It’s also important to evaluate the LLM both before and after deployment, using test queries to assess generated outputs for toxicity, bias, and other harmful content.

Below is a table of potentially harmful language categories to consider testing an LLM for when evaluating outputs for ethical behavior.

| Harmful Language Category | Examples |
| --- | --- |
| Toxicity | Hate, violence, attacks, sexual material, profanity, self-harm |
| Bias | Sexual orientation, age, disability, physical appearance, religion, pregnancy status, marital status, nationality / location, gender, race / ethnicity, socioeconomic status, political affiliation |
| Privacy | Personally identifiable information (PII), confidential company information, regulated data |
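For a concrete (if partial) illustration, the open-source Detoxify library can flag the toxicity categories above automatically; it does not cover bias or privacy, which need dedicated detectors or a commercial screening tool. The threshold and example strings below are arbitrary assumptions.

```python
# Hedged sketch of automated toxicity screening with Detoxify
# (pip install detoxify). Covers only the toxicity row of the table above.
from detoxify import Detoxify

model = Detoxify("original")  # downloads a pretrained toxicity classifier

outputs_to_check = [
    "Thanks for your question! Here is a neutral summary of the document.",
    "You people are all worthless.",
]

for text in outputs_to_check:
    scores = model.predict(text)  # dict of per-category scores between 0 and 1
    flagged = {k: round(float(v), 3) for k, v in scores.items() if v > 0.5}
    print(text[:40], flagged or "no categories above threshold")
```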

LLM evaluation tools

There are numerous benchmarks, frameworks, and software tools that can assist with LLM application evaluation. In addition, major cloud AI platforms like Microsoft Azure, Amazon Bedrock, and Google Vertex AI offer native evaluation tools. Examples of popular LLM evaluation tools include:

  • Arthur Bench - Evaluation platform that helps companies compare different LLM options using consistent metrics and benchmarks.

  • DeepEval - Open-source LLM evaluation framework that uses both standardized metrics and custom metrics to test outputs.

  • GLUE (General Language Understanding Evaluation) - Evaluates the effectiveness of LLMs using a standardized set of NLP (Natural Language Processing) tasks as benchmarks.

  • Granica Screen - Enterprise data safety and privacy software that detects toxicity, bias, PII, and other sensitive or unwanted information in training data, user prompts, and LLM outputs.

  • HellaSwag - Benchmark that uses Adversarial Filtering (AF) to evaluate how accurately an LLM can complete a sentence.

  • MMLU (Massive Multitask Language Understanding) - Evaluates an LLM’s multitasking abilities in zero-shot and few-shot settings, meaning with subjects that the model didn’t encounter during training or of which it only saw a few examples.
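To give a feel for how such a framework is used, below is a hedged sketch of an answer-relevancy check with DeepEval. The class and function names follow DeepEval’s documented API, but the library evolves quickly, an LLM API key is required because the metric is scored by an LLM judge, and the test strings are illustrative.

```python
# Hedged sketch of an answer-relevancy test with DeepEval
# (pip install deepeval). Requires an API key such as OPENAI_API_KEY.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Where is New Zealand and what is its capital?",
    actual_output=(
        "New Zealand is an island country in the southwestern Pacific Ocean, "
        "and its capital is Wellington."
    ),
)

metric = AnswerRelevancyMetric(threshold=0.7)  # fail the test below this score
evaluate(test_cases=[test_case], metrics=[metric])
```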

Develop Responsibly

Any company developing or deploying an LLM-powered application is responsible for ensuring truthful and ethical outputs. The strategies and tools listed above can help with large language model evaluation at various stages of training, development, and deployment. An effective large language model evaluation strategy can help companies realize the full potential of their AI investments.

Granica is a data management platform for AI datasets in cloud data lakes and lakehouses. Granica Screen’s new “Safe Room for AI” capabilities detect sensitive information, bias, and toxicity with state-of-the-art accuracy, helping companies easily evaluate and mitigate ethical issues in their LLM-powered applications.

Explore an interactive demo of Granica Screen to see its fine-grained detection capabilities in action.

Sources:

  1. https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5
  2. https://huggingface.co/spaces/evaluate-metric/bleu
  3. https://huggingface.co/spaces/evaluate-metric/rouge
  4. https://huggingface.co/spaces/evaluate-metric/perplexity
  5. https://learn.microsoft.com/en-us/azure/ai-services/language-service/custom-named-entity-recognition/concepts/evaluation-metrics
  6. https://docs.ragas.io/en/latest/concepts/metrics/index.html
Post by Granica
September 17, 2024