Evaluation¶
One thing you might be wondering is how we can evaluate the RAG process. Well, it's hard. There are several possible techniques we can use, and here we will demonstrate three of them:
Perplexity
Semantic similarity
Faithfulness
The core of the final two methods (and of many methods that evaluate RAG systems) involves feeding the entire paper into an LLM and asking it to generate questions and answers based on the paper. We can then assess things like the semantic similarity between those reference answers and the answers our RAG pipeline produces, and we can ask the model to evaluate whether the answer it gave can actually be inferred from the retrieved context.
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SentenceSplitter
from pydantic import BaseModel, Field
import fitz
from PIL import Image
import matplotlib.pyplot as plt
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import dotenv
import os
from openai import OpenAI
from jinja2 import Environment, FileSystemLoader, select_autoescape
from typing import Any
import json
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
Log probabilities¶
A language model returns a probability distribution over tokens in order to give us an idea of which token to predict next.
It is usually more convenient to work with the logarithm of these probabilities (logprobs) for a few theoretical and practical reasons:
It turns multiplications into additions, which is handy if you want to look at sequences of outputs.
It helps with numerical stability. Multiplying many very small probabilities can cause underflow in floating-point arithmetic, whereas taking the log maps those tiny values to moderately sized negative numbers that can be summed safely (see the quick demonstration below).
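To make the numerical point concrete, here is a quick, self-contained sketch (the per-token probability of 0.01 is made up purely for illustration). The product of a thousand small probabilities underflows to zero, while the sum of their logs is perfectly well behaved.
import math

# Hypothetical per-token probabilities for a 1000-token sequence.
token_probs = [0.01] * 1000

# Multiplying the raw probabilities underflows to 0.0 in double precision:
# the true value (1e-2000) is far below the smallest representable float.
product = 1.0
for p in token_probs:
    product *= p
print(product)  # 0.0

# Summing the log-probabilities stays perfectly representable.
log_prob = sum(math.log(p) for p in token_probs)
print(log_prob)  # roughly -4605.17, i.e. 1000 * log(0.01)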
But we can also get the logprobs from OpenAI models (and many other models) in order to develop metrics:
Classification tasks: a measure of confidence in the result;
During RAG: confidence of whether the answer is contained in the retrieved context (see the sketch after this list);
Autocomplete;
Perplexity: overall confidence in a result.
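As a taste of the second use case, here is a hedged sketch of a yes/no check on whether a retrieved context contains the answer to a question. The prompt wording and the has_sufficient_context helper are our own inventions, not part of this notebook's utilities; the idea is simply to read the logprob of the first answer token as a confidence score.
import math
from openai import OpenAI

def has_sufficient_context(question: str, context: str) -> tuple[str, float]:
    """Ask whether the context answers the question; return (verdict, confidence in %)."""
    prompt = (
        "You will be given a QUESTION and a CONTEXT. Respond with only 'Yes' if the "
        "context contains enough information to answer the question, otherwise 'No'.\n\n"
        f"QUESTION: {question}\n\nCONTEXT: {context}"
    )
    completion = OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,
        logprobs=True,
    )
    first_token = completion.choices[0].logprobs.content[0]
    return first_token.token, math.exp(first_token.logprob) * 100

verdict, confidence = has_sufficient_context(
    "In what year was the transformer architecture introduced?",
    "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need'.",
)
print(f"{verdict} ({confidence:.1f}% confident)")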
client = OpenAI()
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=512,
    temperature=0,
    stop=None,
    seed=420,
    tools=None,
    logprobs=None,
    top_logprobs=None,
):
    """Thin wrapper around client.chat.completions.create; returns the full ChatCompletion object."""
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    # Only forward tools when they are actually supplied.
    if tools is not None:
        params["tools"] = tools
    completion = client.chat.completions.create(**params)
    return completion
prompt = (
"You will be given a list of sentences to classify into a particular field of study. "
"You will need to classify each sentence into one of the following categories:\n"
"- Physics\n"
"- Biology\n"
"- Computer Science\n"
"Respond only with one of these categories.\n\n"
"Sentence: {sentence}"
)
sentences = [
"Connections between neurons can be mapped by acquiring and analysing electron microscopic brain images.",
"A straightforward way to quantify the creation of light is through the coefficient of spontaneous emission.",
"This method optimizes the simulation of protein folding using deep learning.",
]
for sentence in sentences:
messages = [{"role": "system", "content": prompt.format(sentence=sentence)}]
completion = get_completion(messages, model="gpt-4o-mini")
print(f"Sentence: {sentence}\nClassification: {completion.choices[0].message.content}\n")
Sentence: Connections between neurons can be mapped by acquiring and analysing electron microscopic brain images.
Classification: Biology

Sentence: A straightforward way to quantify the creation of light is through the coefficient of spontaneous emission.
Classification: Physics

Sentence: This method optimizes the simulation of protein folding using deep learning.
Classification: Biology
We can also return the logprobs of the top tokens and convert the chosen token's logprob back into a probability:
import math
for sentence in sentences:
messages = [{"role": "system", "content": prompt.format(sentence=sentence)}]
completion = get_completion(
messages,
model="gpt-4o-mini",
temperature=0.0,
max_tokens=64,
logprobs=True,
top_logprobs=2,
)
logprobs = completion.choices[0].logprobs.content[0]
print(f"Sentence: {sentence}\n"
f"Classification: {completion.choices[0].message.content}\n"
f"Logprobs: {math.exp(logprobs.logprob)*100:.2f}\n"
)
Sentence: Connections between neurons can be mapped by acquiring and analysing electron microscopic brain images.
Classification: Biology
Top token probability: 100.00%

Sentence: A straightforward way to quantify the creation of light is through the coefficient of spontaneous emission.
Classification: Physics
Top token probability: 100.00%

Sentence: This method optimizes the simulation of protein folding using deep learning.
Classification: Biology
Top token probability: 97.68%
Perplexity¶
Perplexity can be considered a measure of uncertainty. In the context of LLMs, it is calculated by exponentiating the negative mean of the token logprobs. If we have a tokenized sequence $X = (x_0, x_1, \ldots, x_t)$, then the perplexity is
$$ \textrm{PPL}(X) = \exp\left\{-\frac{1}{t}\sum_i^t \log p_\theta(x_i | x_{<i})\right\} $$
The $\log p_\theta(x_i | x_{<i})$ term is the log-likelihood of the $i^{\textrm{th}}$ token conditioned on the tokens that precede it. To see this in action, we ask two questions: one that has a fairly certain answer, and another that is more speculative.
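Before calling the API, here is a tiny worked example with made-up logprobs: a confidently generated three-token answer has a perplexity close to 1, while a less confident one is noticeably higher.
# Hypothetical per-token logprobs for two short answers (math was imported above).
confident_logprobs = [-0.01, -0.02, -0.05]   # token probabilities ~0.99, 0.98, 0.95
uncertain_logprobs = [-0.7, -1.2, -0.9]      # token probabilities ~0.50, 0.30, 0.41

for name, lps in [("confident", confident_logprobs), ("uncertain", uncertain_logprobs)]:
    ppl = math.exp(-sum(lps) / len(lps))
    print(f"{name}: PPL = {ppl:.2f}")   # roughly 1.03 and 2.54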
questions = [
"In a few sentences, consicely summarize the theory of special relativity.",
"In a few sentences, consicely explain who you think will win the 2025 Formula One Drivers' Championship.",
]
import numpy as np
for question in questions:
messages = [{"role": "system", "content": question}]
completion = get_completion(messages, model="gpt-4o-mini", logprobs=True, temperature=0.0)
log_probs = [token.logprob for token in completion.choices[0].logprobs.content]
response = completion.choices[0].message.content
perplexity_score = np.exp(-np.mean(log_probs))
print(
f"Question: {question}\nAnswer: {completion.choices[0].message.content}\n"
f"Perplexity: {perplexity_score:.2f}\n")
Question: In a few sentences, concisely summarize the theory of special relativity.
Answer: The theory of special relativity, proposed by Albert Einstein in 1905, revolutionizes our understanding of space and time. It asserts that the laws of physics are the same for all observers, regardless of their relative motion, and introduces the concept that the speed of light in a vacuum is constant for all observers. This leads to counterintuitive consequences, such as time dilation (time moving slower for objects in motion relative to a stationary observer) and length contraction (objects appearing shorter in the direction of motion). Special relativity fundamentally alters the relationship between space and time, merging them into a four-dimensional spacetime continuum.
Perplexity: 1.12

Question: In a few sentences, concisely explain who you think will win the 2025 Formula One Drivers' Championship.
Answer: Predicting the winner of the 2025 Formula One Drivers' Championship is challenging, as it depends on various factors such as team performance, driver skill, and technological advancements. However, if current trends continue, drivers like Max Verstappen or Charles Leclerc, who have shown exceptional talent and are in competitive teams, could be strong contenders. Ultimately, the outcome will hinge on the developments in the sport leading up to the 2025 season.
Perplexity: 1.22
In the more speculative answer, the perplexity is higher. Now try increasing the temperature from 0.0 to something like 0.7 and see what happens to the scores...
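For instance, here is a quick sketch of that experiment, reusing the get_completion helper and the questions list from above (exact scores will vary between runs once sampling is stochastic):
for temperature in [0.0, 0.7]:
    for question in questions:
        messages = [{"role": "system", "content": question}]
        completion = get_completion(
            messages, model="gpt-4o-mini", logprobs=True, temperature=temperature
        )
        log_probs = [token.logprob for token in completion.choices[0].logprobs.content]
        perplexity = np.exp(-np.mean(log_probs))
        print(f"T={temperature} | Perplexity: {perplexity:.2f} | {question[:60]}...")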
Evaluating RAG using synthetic data¶
In these next examples, we look at two methods to evaluate RAG: semantic similarity and faithfulness.
We will use the same approach as in the previous notebook, so we have moved a bunch of our code into a utils.py
file. Things are mostly unchanged, but have a look over it and make sure you understand how it all works.
from utils import chunker, DocumentDB, load_template
loader = PyMuPDFReader()
documents = loader.load(file_path="data/paper.pdf")
text_chunks, doc_idxs = chunker(chunk_size=1024, overlap=128, documents=documents)
doc_db = DocumentDB("paper_db", path="../data-storage-and-ingestion/")
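We will not reproduce utils.py here, but it helps to know roughly what the rag_query helper (used below) does: embed the query, pull the most similar chunks out of the Chroma collection, stuff them into a prompt, and ask the model to answer from that context only. The sketch below is a hypothetical approximation written directly against chromadb rather than the DocumentDB wrapper, so the function name, prompt wording, and collection setup are our assumptions, not the actual utils.py implementation.
# Open the persistent Chroma collection (names mirror the DocumentDB call above;
# the embedding model is an assumption about how the collection was built).
chroma_client = chromadb.PersistentClient(path="../data-storage-and-ingestion/")
collection = chroma_client.get_or_create_collection(
    name="paper_db",
    embedding_function=OpenAIEmbeddingFunction(
        api_key=OPENAI_API_KEY, model_name="text-embedding-3-small"
    ),
)

def simple_rag_query(query: str, n_context: int = 5) -> tuple[str, list[str]]:
    """Retrieve the n_context most similar chunks and answer the query from them."""
    results = collection.query(query_texts=[query], n_results=n_context)
    context = results["documents"][0]
    prompt = (
        "Answer the question using only the context below.\n\n"
        "CONTEXT:\n" + "\n\n".join(context) + f"\n\nQUESTION: {query}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return completion.choices[0].message.content, context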
Generate question answer pairs¶
For this, we will use gpt-4o
because we want high-quality question-answer pairs. Ideally, you would do this with humans: subject-matter experts would carefully hand-craft these pairs.
The first stage is to generate 10 Q&A pairs, again using Pydantic to structure the output. The implementation presented here closely follows the method used by the RAGAS library.
We implement a Pydantic BaseModel class that will house our lists of questions and answers.
class QAPairs(BaseModel):
questions: list[str] = Field(..., title="List of questions")
answers: list[str] = Field(..., title="List of answers")
print(QAPairs.model_json_schema())
{'properties': {'questions': {'items': {'type': 'string'}, 'title': 'List of questions', 'type': 'array'}, 'answers': {'items': {'type': 'string'}, 'title': 'List of answers', 'type': 'array'}}, 'required': ['questions', 'answers'], 'title': 'QAPairs', 'type': 'object'}
Next, we need a prompt that we can use to generate these Q&A pairs. It looks something like this:
You are a reading comprehension system that is an expert at extracting information from academic papers.
Your task is to carefully read the provided text "CONTEXT" and then generate question and answer pairs.
Your questions should be concise. Your answers should be as detailed as possible, including any mathematical or numerical results from the text.
You should aim to produce approximately one paragraph for your answers (100-200 words).
Your questions should be a mixture of general, high-level concepts, and also highly detailed questions about specific points, including any mathematical or numerical results.
You should respond in JSON format according to the following schema:
{{ schema }}
You should generate {{ number }} question and answer pairs.
system_prompt_qa = load_template(
"prompts/qa_generation_system_prompt.jinja",
{
"number" : 10,
"schema" : QAPairs.model_json_schema()
}
)
print(system_prompt_qa)
You are a reading comprehension system that is an expert at extracting information from academic papers.
Your task is to carefully read the provided text "CONTEXT" and then generate question and answer pairs.
Your questions should be concise. Your answers should be as detailed as possible, including any mathematical or numerical results from the text.
You should aim to produce approximately one paragraph for your answers (100-200 words).
Your questions should be a mixture of general, high-level concepts, and also highly detailed questions about specific points, including any mathematical or numerical results.
You should respond in JSON format according to the following schema:
{'properties': {'questions': {'items': {'type': 'string'}, 'title': 'List of questions', 'type': 'array'}, 'answers': {'items': {'type': 'string'}, 'title': 'List of answers', 'type': 'array'}}, 'required': ['questions', 'answers'], 'title': 'QAPairs', 'type': 'object'}
You should generate 10 question and answer pairs.
Next, we need the pages of the PDF as a single text string:
pdf_text = " ".join([doc.text for doc in documents])
Finally, we are in a position to generate our question-answer pairs using gpt-4o
client = OpenAI()
user_prompt = (
f"CONTEXT:\n\n{pdf_text}"
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt_qa},
{"role": "user", "content": user_prompt}
],
temperature=0.1,
response_format={"type": "json_object"}
)
We then create the QAPairs
object using the LLM output, and also save it to file.
questions_answers = QAPairs(**json.loads(response.choices[0].message.content))
# save the Q&A to file
with open("data/qa.json", "w") as f:
    json.dump(questions_answers.model_dump(), f, indent=4)
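Because the pairs are saved to disk, we can reload and re-validate them in a later session without another gpt-4o call, for example:
# Reload the saved Q&A pairs and validate them against the schema.
with open("data/qa.json") as f:
    questions_answers = QAPairs.model_validate(json.load(f))
print(len(questions_answers.questions), "questions loaded")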
What does an example look like?
print(questions_answers.questions[0])
print('---')
print(questions_answers.answers[0])
What are the main philosophical debates surrounding large language models (LLMs)?
---
The main philosophical debates surrounding large language models (LLMs) include questions about their linguistic and cognitive competence, their ability to model human cognition, and their role in classic philosophical issues such as compositionality, language acquisition, semantic competence, grounding, and the transmission of cultural knowledge. These debates echo longstanding discussions about the capabilities of artificial neural networks and whether they can truly replicate human-like intelligence and understanding.
Semantic Similarity¶
We can now compute cosine similarity scores between the answers generated by our RAG pipeline and the reference answers.
from utils import rag_query
example_query = questions_answers.questions[0]
response, context = rag_query(
query=example_query,
n_context=5,
doc_db=doc_db,
return_context=True
)
print(response)
The main philosophical debates surrounding large language models (LLMs) like GPT-4 focus on their capacity to exhibit linguistic and cognitive competence, challenging traditional views on artificial intelligence. Key issues include the nature of their learning processes, the validity of ascribing communicative intentions to them, and whether they possess world models that allow for a deeper understanding of language and context. Critics often invoke the "Redescription Fallacy," arguing that LLMs' operations, being statistical in nature, cannot model human cognition. However, proponents suggest that LLMs can blend patterns from training data to produce novel outputs, raising questions about the empirical evidence needed to assess their cognitive capabilities. Overall, these debates reflect broader concerns about the implications of LLMs for our understanding of intelligence, rationality, and the nature of language itself.
First, look at the semantic similarity between the predicted response and the reference answer.
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
client = OpenAI()
response_embedding = client.embeddings.create(
input=response,
model="text-embedding-3-small"
).data[0].embedding
answer_embedding = client.embeddings.create(
input=questions_answers.answers[0],
model="text-embedding-3-small"
).data[0].embedding
cosine_similarity([response_embedding], [answer_embedding])
array([[0.84958852]])
print(response)
print('---')
print(questions_answers.answers[0])
The main philosophical debates surrounding large language models (LLMs) like GPT-4 focus on their capacity to exhibit linguistic and cognitive competence, challenging traditional views on artificial intelligence. Key issues include the nature of their learning processes, the validity of ascribing communicative intentions to them, and whether they possess world models that allow for a deeper understanding of language and context. Critics often invoke the "Redescription Fallacy," arguing that LLMs' operations, being statistical in nature, cannot model human cognition. However, proponents suggest that LLMs can blend patterns from training data to produce novel outputs, raising questions about the empirical evidence needed to assess their cognitive capabilities. Overall, these debates reflect broader concerns about the implications of LLMs for our understanding of intelligence, rationality, and the nature of language itself.
---
The main philosophical debates surrounding large language models (LLMs) include questions about their linguistic and cognitive competence, their ability to model human cognition, and their role in classic philosophical issues such as compositionality, language acquisition, semantic competence, grounding, and the transmission of cultural knowledge. These debates echo longstanding discussions about the capabilities of artificial neural networks and whether they can truly replicate human-like intelligence and understanding.
Well OK, but what does this score mean? It is simply a measure of the similarity of the two embeddings. It gives no real indication of whether the output is "better" or "worse", or more or less informative, than the reference answer. It is important to consider these scores in the context of your overall objective, and to curate good-quality Q&A pairs.
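One way to make the number more meaningful is to compute it for every generated Q&A pair and look at the distribution rather than a single value. Here is a hedged sketch of that loop; the embed helper and the aggregation choices are ours, not part of the notebook's utilities.
def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with the same embedding model used above."""
    result = client.embeddings.create(input=texts, model="text-embedding-3-small")
    return [item.embedding for item in result.data]

similarities = []
for question, reference in zip(questions_answers.questions, questions_answers.answers):
    predicted, _ = rag_query(query=question, n_context=5, doc_db=doc_db, return_context=True)
    pred_emb, ref_emb = embed([predicted, reference])
    similarities.append(cosine_similarity([pred_emb], [ref_emb])[0, 0])

print(f"Mean semantic similarity: {np.mean(similarities):.3f} "
      f"(min {np.min(similarities):.3f}, max {np.max(similarities):.3f})")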
Faithfulness¶
This is a little more complicated. First, we get an LLM to extract key statements from the answer. For example:
[
    "This study was conducted by Mallinson et al.",
    "The main focus is to investigate avalanches and criticality in self-organized nanoscale networks.",
    "They analyzed electrical conductance.",
    "They analyzed the behavior of the networks under various stimulus conditions."
]
We then ask a second LLM to look at each statement and see if that statement can be inferred from the text, assigning a score of 0 for no, and 1 for yes.
To do this, we create three additional Pydantic classes:
class Statements(BaseModel):
simpler_statements: list[str] = Field(..., description="the simpler statements")
class StatementFaithfulnessAnswer(BaseModel):
statement: str = Field(..., description="the original statement, word-for-word")
reason: str = Field(..., description="the reason of the verdict")
verdict: int = Field(..., description="the verdict(0/1) of the faithfulness.")
class Faithfulness(BaseModel):
answers: list[StatementFaithfulnessAnswer] = Field(..., description="the faithfulness answers")
score: float = Field(..., description="the average faithfulness score")
We also create two more prompt templates, statement_instruction and faithfulness_instruction. The first breaks an answer into simple standalone statements; the second judges each statement against the retrieved context:
Given a piece of text, analyze the complexity of each sentence and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Format the outputs in JSON, according to the following schema:
{{ schema }}
Here is a new piece of text:
{{ statement }}
Your task is to judge the faithfulness of a statement based on a given context. For the statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context.
You will give the exact statement, the reason, and the verdict.
Format the outputs in JSON, according to the following schema:
{{ schema }}
Here is a statement:
{{ statement }}
def get_statements(answer):
prompt = load_template(
"prompts/faithfulness/statement_instruction.jinja",
{
"schema" : Statements.model_json_schema(),
"text" : answer
}
)
completion = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": answer}
],
temperature=0.0,
response_format={"type": "json_object"},
logprobs=True,
)
return Statements(**json.loads(completion.choices[0].message.content))
statements = get_statements(response)
from rich.pretty import pprint
print(response)
pprint(statements)
The main philosophical debates surrounding large language models (LLMs) like GPT-4 focus on their capacity to exhibit linguistic and cognitive competence, challenging traditional views on artificial intelligence. Key issues include the nature of their learning processes, the validity of ascribing communicative intentions to them, and whether they possess world models that allow for a deeper understanding of language and context. Critics often invoke the "Redescription Fallacy," arguing that LLMs' operations, being statistical in nature, cannot model human cognition. However, proponents suggest that LLMs can blend patterns from training data to produce novel outputs, raising questions about the empirical evidence needed to assess their cognitive capabilities. Overall, these debates reflect broader concerns about the implications of LLMs for our understanding of intelligence, rationality, and the nature of language itself.
Statements( │ simpler_statements=[ │ │ 'Philosophical debates exist surrounding large language models like GPT-4.', │ │ 'These debates focus on the capacity of large language models to exhibit linguistic and cognitive competence.', │ │ 'Traditional views on artificial intelligence are challenged by these debates.', │ │ 'Key issues in these debates include the nature of learning processes of large language models.', │ │ 'Another key issue is the validity of ascribing communicative intentions to large language models.', │ │ 'A further key issue is whether large language models possess world models.', │ │ 'World models would allow for a deeper understanding of language and context.', │ │ "Critics invoke the 'Redescription Fallacy' in these debates.", │ │ 'Critics argue that the operations of large language models are statistical in nature.', │ │ 'Critics claim that statistical operations cannot model human cognition.', │ │ 'Proponents suggest that large language models can blend patterns from training data.', │ │ 'Blending patterns from training data allows large language models to produce novel outputs.', │ │ 'This raises questions about the empirical evidence needed to assess cognitive capabilities of large language models.', │ │ 'Overall, these debates reflect broader concerns about implications of large language models.', │ │ 'Concerns include understanding of intelligence, rationality, and the nature of language.' │ ] )
def get_faithfulness(statements : Statements, context):
context_joined = " ".join(context)
faithfulness_answers = []
for statement in statements.simpler_statements:
prompt = load_template(
"prompts/faithfulness/faithfulness_instruction.jinja",
{
"schema" : StatementFaithfulnessAnswer.model_json_schema(),
"statement" : statement,
"context" : context_joined
}
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": context_joined}
],
temperature=0.0,
response_format={"type": "json_object"}
).choices[0].message.content
faithfulness_answers.append(StatementFaithfulnessAnswer(**json.loads(response)))
score = sum([answer.verdict for answer in faithfulness_answers]) / len(faithfulness_answers)
return Faithfulness(answers=faithfulness_answers, score=score)
results = get_faithfulness(statements, context)
pprint(results)
Faithfulness( │ answers=[ │ │ StatementFaithfulnessAnswer( │ │ │ statement='Philosophical debates exist surrounding large language models like GPT-4.', │ │ │ reason='The context discusses ongoing disagreements and philosophical inquiries related to large language models, including GPT-4, indicating that philosophical debates indeed exist surrounding these models.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='These debates focus on the capacity of large language models to exhibit linguistic and cognitive competence.', │ │ │ reason='The context discusses ongoing disagreements about the extent to which linguistic and cognitive competence can be ascribed to language models, indicating that the debates indeed focus on this capacity.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Traditional views on artificial intelligence are challenged by these debates.', │ │ │ reason='The context discusses how the success of large language models like GPT-4 challenges long-held assumptions about artificial neural networks, indicating that traditional views on artificial intelligence are indeed being challenged by ongoing debates.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Key issues in these debates include the nature of learning processes of large language models.', │ │ │ reason='The context discusses various philosophical questions surrounding large language models (LLMs), including their cognitive capacities and the implications of their learning processes. It highlights the need for empirical investigation to understand their internal mechanisms, which implies that the nature of learning processes is indeed a key issue in these debates.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Another key issue is the validity of ascribing communicative intentions to large language models.', │ │ │ reason='The context discusses the philosophical implications of ascribing communicative intentions to large language models, indicating that this is a key issue in the ongoing debates about their capabilities. The statement can be inferred as it aligns with the themes presented in the context regarding the understanding of LLMs and their potential communicative behaviors.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='A further key issue is whether large language models possess world models.', │ │ │ reason='The context discusses the skepticism surrounding whether large language models (LLMs) can possess world models, which are internal representations that simulate aspects of the external world. It explicitly mentions that this is a core skeptical concern and elaborates on the implications of LLMs having or not having such models. Therefore, the statement can be directly inferred from the context.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='World models would allow for a deeper understanding of language and context.', │ │ │ reason='The context discusses the concept of world models in relation to language models, indicating that they enable understanding and interpretation of real-world dynamics, which implies that they contribute to a deeper understanding of language and context. 
However, it does not explicitly state that world models would allow for a deeper understanding, only that they are crucial for tasks requiring such understanding.', │ │ │ verdict=0 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement="Critics invoke the 'Redescription Fallacy' in these debates.", │ │ │ reason="The context explicitly mentions the 'Redescription Fallacy' as a misleading inference pattern that critics use in debates about the capabilities of language models. Therefore, it can be inferred that critics indeed invoke this fallacy in their arguments.", │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Critics argue that the operations of large language models are statistical in nature.', │ │ │ reason="The context discusses the philosophical debates surrounding large language models (LLMs) and mentions a misleading inference pattern termed the 'Redescription Fallacy,' which includes claims that LLMs cannot model certain cognitive capacities because their operations can be described as statistical calculations. This implies that critics do indeed argue about the statistical nature of LLM operations.", │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Critics claim that statistical operations cannot model human cognition.', │ │ │ reason='The context discusses ongoing disagreements about the extent to which language models can be ascribed linguistic or cognitive competence, and mentions a fallacy where critics argue that LLMs cannot model cognitive capacities because their operations are statistical in nature. This implies that critics do indeed claim that statistical operations are insufficient for modeling human cognition, which directly supports the statement.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Proponents suggest that large language models can blend patterns from training data.', │ │ │ reason='The context discusses how LLMs, while capable of regurgitating training data, are also able to flexibly blend patterns from that data to produce novel outputs. This indicates that proponents of LLMs recognize their ability to blend patterns from training data.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Blending patterns from training data allows large language models to produce novel outputs.', │ │ │ reason='The context explicitly states that LLMs are capable of flexibly blending patterns from their training data to produce genuinely novel outputs, which directly supports the statement.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='This raises questions about the empirical evidence needed to assess cognitive capabilities of large language models.', │ │ │ reason='The context discusses the need for further empirical investigation to understand the internal mechanisms of large language models and highlights ongoing philosophical debates about their cognitive capabilities. 
This implies that there are indeed questions regarding the empirical evidence necessary to assess these capabilities.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Overall, these debates reflect broader concerns about implications of large language models.', │ │ │ reason='The context discusses ongoing philosophical debates regarding the implications of large language models (LLMs) and their cognitive capacities, indicating that these debates are indeed reflective of broader concerns about LLMs.', │ │ │ verdict=1 │ │ ), │ │ StatementFaithfulnessAnswer( │ │ │ statement='Concerns include understanding of intelligence, rationality, and the nature of language.', │ │ │ reason="The context discusses philosophical inquiries surrounding artificial neural networks, particularly focusing on their ability to model human cognition, intelligence, and language. It explicitly mentions debates about intelligence and rationality in relation to language models, which supports the statement's claim about concerns in these areas.", │ │ │ verdict=1 │ │ ) │ ], │ score=0.9333333333333333 )
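Finally, nothing stops us from combining the two metrics over the full synthetic dataset, which gives a more stable picture of the pipeline than any single question does. Here is a hedged sketch of that loop, reusing the helpers defined above (the embed function comes from the semantic-similarity sketch earlier; the aggregation choices are ours):
results_table = []
for question, reference in zip(questions_answers.questions, questions_answers.answers):
    predicted, context = rag_query(
        query=question, n_context=5, doc_db=doc_db, return_context=True
    )

    # Faithfulness of the generated answer to the retrieved context.
    statements = get_statements(predicted)
    faithfulness = get_faithfulness(statements, context)

    # Semantic similarity between the generated answer and the reference answer.
    pred_emb, ref_emb = embed([predicted, reference])
    similarity = cosine_similarity([pred_emb], [ref_emb])[0, 0]

    results_table.append(
        {"question": question, "faithfulness": faithfulness.score, "similarity": float(similarity)}
    )

print(f"Mean faithfulness: {np.mean([r['faithfulness'] for r in results_table]):.3f}")
print(f"Mean similarity: {np.mean([r['similarity'] for r in results_table]):.3f}")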