Pydantic
Introduction to Pydantic
Pydantic is a data validation library for Python. Suppose we get some user attributes. This data could have come from anywhere - user input, an API call, whatever. Suppose for now that it is information about a user that the user themselves has provided via a form.
from rich.pretty import pprint
user_attributes = {
"name": "Keanu Reeves",
"age": 873, # because we all know Keanu Reeves is immortal
"email": "jwick@email.com",
"pets": ["dog"]
}
We might want this input to always have the following form:
user_attributes = {
"name" : str,
"age" : int,
"email" : str,
"pets" : list[str],
}
If you've taken an intro-level Python course, you might have been introduced to objects by creating classes such as people or cars with attributes and methods. So, suppose we want to take the dictionary that we got from the user input and create an object out of it.
Using dataclasses
One way to turn our user input into an object is to use a dataclass.
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int
    email: str
    pets: list[str]
We can now construct a new user by unpacking the dictionary that we got from the user input into the constructor.
new_user = User(**user_attributes)
print(type(new_user))
pprint(new_user, expand_all=True)
<class '__main__.User'>
User(
│   name='Keanu Reeves',
│   age=873,
│   email='jwick@email.com',
│   pets=[
│   │   'dog'
│   ]
)
We might also want to do some checks on our user input. For example, we might want to make sure that the user is over 18 years old. We can do this by defining a function that checks the age.
def is_over_18(user: User) -> bool:
    return user.age >= 18
is_over_18(new_user)
True
But now what happens if one of the fields is the incorrect type? A dataclass does not enforce its type annotations, so the string is accepted at construction time and the problem only surfaces later:
user_attributes["age"] = "873"
wrong_user = User(**user_attributes)
is_over_18(wrong_user)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 5
      1 user_attributes["age"] = "873"
      3 wrong_user = User(**user_attributes)
----> 5 is_over_18(wrong_user)

Cell In[2], line 2, in is_over_18(user)
      1 def is_over_18(user: User) -> bool:
----> 2     return user.age >= 18

TypeError: '>=' not supported between instances of 'str' and 'int'
We're being told that you can't compare strings and integers (obviously).
Now, we could write some manual checks for typing in our age-checking function, but ideally we would like to keep all of the validation stuff in one place.
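To make the cost of manual checking concrete, here is a minimal sketch of a dataclass that validates its own fields in `__post_init__`. The class name `CheckedUser` and the helper `try_build` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckedUser:
    """Hypothetical dataclass that validates its own fields by hand."""

    name: str
    age: int
    email: str
    pets: list[str]

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations, so every check is written manually.
        # bool is a subclass of int, so it is rejected explicitly.
        if not isinstance(self.age, int) or isinstance(self.age, bool):
            raise TypeError(f"age must be an int, got {type(self.age).__name__}")
        if not isinstance(self.name, str):
            raise TypeError("name must be a str")
        if not isinstance(self.email, str):
            raise TypeError("email must be a str")
        if not (isinstance(self.pets, list) and all(isinstance(p, str) for p in self.pets)):
            raise TypeError("pets must be a list of str")


def try_build(**attrs) -> str:
    """Return 'ok' if construction succeeds, otherwise the rejection reason."""
    try:
        CheckedUser(**attrs)
        return "ok"
    except TypeError as exc:
        return f"rejected: {exc}"


good = try_build(name="Keanu Reeves", age=873, email="jwick@email.com", pets=["dog"])
bad = try_build(name="Keanu Reeves", age="873", email="jwick@email.com", pets=["dog"])
print(good)
print(bad)
```

This works, but each new field needs another hand-written check, and nothing here attempts to convert "873" into 873.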
This is where Pydantic comes in.
Using Pydantic
from pydantic import BaseModel, Field
Creating a Pydantic model is similar to creating a dataclass. We subclass BaseModel and describe each field with the Field function.
class User(BaseModel):
    name: str = Field(..., title="Name", description="Name of the user")
    age: int = Field(..., title="Age", description="Age of the user")
    email: str = Field(..., title="Email", description="Email of the user")
    pets: list[str] = Field(..., title="Pets", description="Pets of the user")
Pydantic will now automatically check that the data we pass in is of the correct type.
user = User(**user_attributes)
pprint(user, expand_all=True)
is_over_18(user)
User(
│   name='Keanu Reeves',
│   age=873,
│   email='jwick@email.com',
│   pets=[
│   │   'dog'
│   ]
)
True
Even though we passed a string for age, Pydantic will try to convert it for us if possible. But what if we pass through a preposterous value for age?
user_attributes["age"] = "panda"
try:
    user = User(**user_attributes)
except Exception as e:
    print(e)
1 validation error for User
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='panda', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/int_parsing
It makes no sense for someone to be panda years old, so Pydantic will not let us pass this in.
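The conversion of the string "873" into the integer 873 is Pydantic's default behaviour. If that kind of silent conversion is not wanted, a field can be marked as strict. The sketch below assumes Pydantic v2 and uses a hypothetical two-field model, `StrictAgeUser`, rather than the full User class.

```python
from pydantic import BaseModel, Field, ValidationError


class StrictAgeUser(BaseModel):
    """Hypothetical variant of User whose age field refuses type coercion."""

    name: str
    age: int = Field(strict=True)  # strict=True: a string such as "873" is rejected


# A real integer is accepted as usual.
accepted = StrictAgeUser(name="Keanu Reeves", age=873)

# The same value as a string now fails validation instead of being converted.
try:
    StrictAgeUser(name="Keanu Reeves", age="873")
    strict_rejected = False
except ValidationError as e:
    strict_rejected = True
    print(e.errors()[0]["msg"])
```

Whether to allow coercion is a design choice: lax mode is convenient for form or query-string data, where everything arrives as text, while strict mode suits data that should already be typed.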
Another handy feature is that we can build in additional validation checks.
class User(BaseModel):
    name: str = Field(..., title="Name", description="Name of the user")
    age: int = Field(..., title="Age", description="Age of the user", ge=18)
    email: str = Field(..., title="Email", description="Email of the user")
    pets: list[str] = Field(..., title="Pets", description="Pets of the user")
We have passed the extra argument ge=18 to the Field function. This means that the age must be greater than or equal to 18. If we pass in an age less than 18, Pydantic will raise an error:
from pydantic import ValidationError
user_attributes["age"] = 17
try:
    user = User(**user_attributes)
    pprint(user)
except ValidationError as e:
    pprint(e.errors())
[
│   {
│   │   'type': 'greater_than_equal',
│   │   'loc': ('age',),
│   │   'msg': 'Input should be greater than or equal to 18',
│   │   'input': 17,
│   │   'ctx': {'ge': 18},
│   │   'url': 'https://errors.pydantic.dev/2.9/v/greater_than_equal'
│   }
]
This is handy, because it tells us what the failure was, what we input, and what constraint we put in place to prevent it.
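Because each error is a plain dictionary, it can be turned into a message for the end user with a few lines of code. The following is a sketch assuming Pydantic v2; `MiniUser` and `describe_errors` are hypothetical names used only here.

```python
from pydantic import BaseModel, Field, ValidationError


class MiniUser(BaseModel):
    """Hypothetical trimmed-down model with the same age constraint."""

    name: str
    age: int = Field(ge=18)


def describe_errors(exc: ValidationError) -> list[str]:
    """Turn each error dict into a 'field: message (got value)' line."""
    lines = []
    for err in exc.errors():
        # 'loc' is a tuple describing where in the input the error occurred
        field = ".".join(str(part) for part in err["loc"])
        lines.append(f"{field}: {err['msg']} (got {err['input']!r})")
    return lines


try:
    MiniUser(name="Keanu Reeves", age=17)
    messages = []
except ValidationError as e:
    messages = describe_errors(e)

for line in messages:
    print(line)
```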
We can also define our own custom validators. Suppose we want to make sure that the email address is a valid email address. We can do this by defining a custom validator.
from pydantic import field_validator
from pydantic_core import PydanticCustomError
class User(BaseModel):
    name: str = Field(..., title="Name", description="Name of the user")
    age: int = Field(..., title="Age", description="Age of the user", ge=18)
    email: str = Field(..., title="Email", description="Email of the user")
    pets: list[str] = Field(..., title="Pets", description="Pets of the user")

    @field_validator("email")
    @classmethod
    def check_email(cls, v: str) -> str:
        # Deliberately simple check: only looks for "@" and "." anywhere in the string
        if "@" not in v or "." not in v:
            raise PydanticCustomError(
                'InvalidEmail',
                'Email must contain "@" and "."'
            )
        return v
user_attributes = {
"name": "Keanu Reeves",
"age": 873,
"email": "jwick-at-email-dot-com",
"pets": ["dog"]
}
try:
    user = User(**user_attributes)
    pprint(user)
except ValidationError as e:
    pprint(e.errors(), expand_all=True)
[
│   {
│   │   'type': 'InvalidEmail',
│   │   'loc': (
│   │   │   'email',
│   │   ),
│   │   'msg': 'Email must contain "@" and "."',
│   │   'input': 'jwick-at-email-dot-com'
│   }
]
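It is worth being clear about how little this validator checks. The sketch below reuses the same substring test on a hypothetical one-field model, `EmailOnly`, and tries three inputs, including one that is obviously not an email address but still passes.

```python
from pydantic import BaseModel, ValidationError, field_validator
from pydantic_core import PydanticCustomError


class EmailOnly(BaseModel):
    """Hypothetical one-field model reusing the same substring check."""

    email: str

    @field_validator("email")
    @classmethod
    def check_email(cls, v: str) -> str:
        if "@" not in v or "." not in v:
            raise PydanticCustomError("InvalidEmail", 'Email must contain "@" and "."')
        return v


def is_accepted(value: str) -> bool:
    """Return True if the value passes validation, False otherwise."""
    try:
        EmailOnly(email=value)
        return True
    except ValidationError:
        return False


candidates = ["jwick@email.com", "jwick-at-email-dot-com", "@."]
results = {candidate: is_accepted(candidate) for candidate in candidates}
print(results)
```

The string "@." is accepted because the check only looks for the two characters, so a custom validator is only as strong as the rule written inside it.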
Application to LLMs
An important application of Pydantic is in validating the output of LLMs. For example, suppose we have the following description:
description = (
"My name is Ryan, and I am 35 years old. "
"During the weekends I like to hike, but I also enjoy playing video games. "
"It can sometimes be difficult to use my computer, "
"because my cat likes to sleep on the keyboard! "
"During the week, I work as a MLE at the University of Cambridge. "
"Although I really enjoy living in the UK, "
"I miss the outdoors back home in NZ."
)
We can now use a system prompt to ask the LLM to extract the information we want.
system_prompt = (
"Your main role is to analyse a piece of unstructured text and extract the following information:\n"
"- Name\n"
"- Age\n"
"- Nationality\n"
"- Occupation\n"
"- A list of any pets\n"
"- A list of any hobbies\n\n"
"If any acronyms are used, please expand them.\n\n"
"Here is a description:\n\n"
)
This information is clearly contained within the text, and any human with basic comprehension skills can extract or infer it.
Let's first try using an LLM to extract this information, similar to how we might prompt ChatGPT:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": description},
],
temperature=0.0,
)
print(response.choices[0].message.content)
- Name: Ryan
- Age: 35
- Nationality: New Zealander (NZ)
- Occupation: Machine Learning Engineer (MLE) at the University of Cambridge
- A list of any pets: Cat
- A list of any hobbies: Hiking, playing video games
This might initially look good, but we can't really do anything useful with this information, since it's still unstructured. It looks structured, but parsing it would be annoying. I could write something that looks for the colon and takes the text after it, but then what about the lists of items? And any text in brackets? Or any extraneous information or text?
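To see why, here is a rough sketch of the "split on the colon" approach applied to the reply above. The line breaks in `raw_output` are assumed (one bullet per line), and `naive_parse` is a hypothetical helper.

```python
raw_output = """- Name: Ryan
- Age: 35
- Nationality: New Zealander (NZ)
- Occupation: Machine Learning Engineer (MLE) at the University of Cambridge
- A list of any pets: Cat
- A list of any hobbies: Hiking, playing video games"""


def naive_parse(text: str) -> dict[str, str]:
    """Split each '- key: value' bullet on the first colon."""
    parsed = {}
    for line in text.splitlines():
        line = line.lstrip("- ").strip()
        if ":" not in line:
            continue  # silently drops anything that is not 'key: value'
        key, value = line.split(":", 1)
        parsed[key.strip()] = value.strip()
    return parsed


parsed = naive_parse(raw_output)
print(type(parsed["Age"]).__name__)     # the age is still text, not a number
print(parsed["A list of any hobbies"])  # the list is one comma-separated string
print(parsed["Nationality"])            # the bracketed acronym is still attached
```

Every value comes back as a string, the keys are whatever wording the model happened to use, and each of these quirks would need its own special case.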
Instead, I can try to get the output in a structured format.
First we define a new Pydantic class that we want the output to be.
class Person(BaseModel):
    name: str | None = Field(..., description="The name of the person")
    age: int | None = Field(..., description="The age of the person")
    nationality: str | None = Field(..., description="The nationality of the person")
    occupation: str | None = Field(..., description="The occupation of the person")
    pets: list[str] | None = Field(..., description="The pets of the person")
    hobbies: list[str] | None = Field(..., description="The hobbies of the person")

    # print the description of the person
    def __str__(self) -> str:
        output = f"Name: {self.name}\n"
        output += f"Age: {self.age}\n"
        output += f"Nationality: {self.nationality}\n"
        output += f"Occupation: {self.occupation}\n"
        output += f"Pets: {self.pets}\n"
        output += f"Hobbies: {self.hobbies}\n"
        return output
Now we explicitly ask the LLM to output JSON.
system_prompt = (
"Your main role is to analyse a piece of unstructured text and extract the following information:\n"
"- Name\n"
"- Age\n"
"- Nationality\n"
"- Occupation\n"
"- A list of any pets\n"
"- A list of any hobbies\n\n"
"If any acronyms are used, please expand them.\n"
"Return the information in JSON format.\n\n" # <--- JSON please!
"Here is a description:\n\n"
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": description},
],
max_tokens=512,
temperature=0.0,
)
print(response.choices[0].message.content)
```json { "Name": "Ryan", "Age": 35, "Nationality": "New Zealander", "Occupation": "Machine Learning Engineer", "Pets": ["cat"], "Hobbies": ["hiking", "playing video games"] } ```
Great! We have valid JSON! Except we don't: the output is wrapped in those annoying json code-fence tags, so it cannot be parsed as it stands. OK, so now I can ask the LLM not to include those.
system_prompt = (
"Your main role is to analyse a piece of unstructured text and extract the following information:\n"
"- Name\n"
"- Age\n"
"- Nationality\n"
"- Occupation\n"
"- A list of any pets\n"
"- A list of any hobbies\n\n"
"If any acronyms are used, please expand them.\n"
"Return the information in JSON format. "
"Do not include the `json` tags.\n\n" # <--- no tags please!
"Here is a description:\n\n"
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": description},
],
max_tokens=512,
temperature=0.0,
)
print(response.choices[0].message.content)
{ "Name": "Ryan", "Age": 35, "Nationality": "New Zealander", "Occupation": "Machine Learning Engineer", "Pets": ["cat"], "Hobbies": ["hiking", "playing video games"] }
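That actually worked, and this is valid JSON. This is what you might have to do with many LLMs out there. However, OpenAI have gone to the effort of adding a special response_format argument to their API to make this easier. So I can feed in the earlier prompt that asks for JSON, and also specify that I want the output as a JSON object. This mode is designed to ensure that the output parses as valid JSON, provided the response is not cut off by the max_tokens limit. The API expects the word "JSON" to appear somewhere in the messages, which is why the instruction stays in the prompt.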
system_prompt = (
"Your main role is to analyse a piece of unstructured text and extract the following information:\n"
"- Name\n"
"- Age\n"
"- Nationality\n"
"- Occupation\n"
"- A list of any pets\n"
"- A list of any hobbies\n\n"
"If any acronyms are used, please expand them.\n"
"Return the information in JSON format.\n\n" # <--- JSON please!
"Here is a description:\n\n"
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": description},
],
max_tokens=512,
temperature=0.0,
response_format={"type": "json_object"} # <--- JSON please!
)
print(response.choices[0].message.content)
{ "Name": "Ryan", "Age": 35, "Nationality": "New Zealander", "Occupation": "Machine Learning Engineer", "Pets": ["cat"], "Hobbies": ["hiking", "playing video games"] }
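For models or APIs that do not offer a JSON mode, an alternative to prompt tweaking is to strip any Markdown fence in code before parsing. The sketch below is one possible approach, not part of any library: `strip_code_fence` is a hypothetical helper, and `fenced_reply` is a shortened stand-in for a real reply.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Shortened stand-in for a fenced model reply
fenced_reply = f'{FENCE}json\n{{"Name": "Ryan", "Age": 35, "Pets": ["cat"]}}\n{FENCE}'


def strip_code_fence(text: str) -> str:
    """Remove one surrounding Markdown code fence, if present."""
    pattern = rf"^\s*{FENCE}[a-zA-Z]*\s*\n(.*?)\n?\s*{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    # Fall back to the original text when no fence is found
    return match.group(1) if match else text.strip()


data = json.loads(strip_code_fence(fenced_reply))
unfenced = json.loads(strip_code_fence('{"Name": "Ryan"}'))
print(data)
```

This handles only a single fence wrapping the whole reply. Any extra commentary before or after the fence would still break the parse.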
Why did we go to all this effort? Well, now we can try to pass this object in as arguments to our Person class, just as we did before with Keanu. This process is called deserialisation - the process of converting JSON into a Pydantic object.
import json
json_content = json.loads(response.choices[0].message.content)
person = Person(**json_content)
pprint(person, expand_all=True)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[48], line 4
      1 import json
      3 json_content = json.loads(response.choices[0].message.content)
----> 4 person = Person(**json_content)
      6 pprint(person, expand_all=True)

File ~/Website/large-language-models/venv/lib/python3.11/site-packages/pydantic/main.py:209, in BaseModel.__init__(self, **data)
    207 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    208 __tracebackhide__ = True
--> 209 validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
    210 if self is not validated_self:
    211     warnings.warn(
    212         'A custom validator is returning a value other than `self`.\n'
    213         "Returning anything other than `self` from a top level model validator isn't supported when validating via `__init__`.\n"
    214         'See the `model_validator` docs (https://docs.pydantic.dev/latest/concepts/validators/#model-validators) for more details.',
    215         category=None,
    216     )

ValidationError: 6 validation errors for Person
name
  Field required [type=missing, input_value={'Name': 'Ryan', 'Age': 3... 'playing video games']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
age
  Field required [type=missing, input_value={'Name': 'Ryan', 'Age': 3... 'playing video games']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
nationality
  Field required [type=missing, input_value={'Name': 'Ryan', 'Age': 3... 'playing video games']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
occupation
  Field required [type=missing, input_value={'Name': 'Ryan', 'Age': 3... 'playing video games']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
pets
  Field required [type=missing, input_value={'Name': 'Ryan', 'Age': 3... 'playing video games']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
hobbies
  Field required [type=missing, input_value={'Name': 'Ryan', 'Age': 3... 'playing video games']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
What happened? It is valid JSON, so what is the problem? The issue here is that the keys in the output don't match the field names of the model, and the match is case-sensitive: Name in the returned output is name in the Pydantic object, so every required field is reported as missing. One way to fix this is to also directly feed the schema into the prompt.
A schema is essentially a description of the data that we want to pass in. A handy feature of Pydantic models is that you can serialise the class description into a JSON schema - essentially just a string or dictionary representation of the object. We can use the model_json_schema method to get the schema of the Pydantic object.
schema = Person.model_json_schema()
print(type(schema))
pprint(schema)
<class 'dict'>
{ │ 'properties': { │ │ 'name': { │ │ │ 'anyOf': [{'type': 'string'}, {'type': 'null'}], │ │ │ 'description': 'The name of the person', │ │ │ 'title': 'Name' │ │ }, │ │ 'age': { │ │ │ 'anyOf': [{'type': 'integer'}, {'type': 'null'}], │ │ │ 'description': 'The age of the person', │ │ │ 'title': 'Age' │ │ }, │ │ 'nationality': { │ │ │ 'anyOf': [{'type': 'string'}, {'type': 'null'}], │ │ │ 'description': 'The nationality of the person', │ │ │ 'title': 'Nationality' │ │ }, │ │ 'occupation': { │ │ │ 'anyOf': [{'type': 'string'}, {'type': 'null'}], │ │ │ 'description': 'The occupation of the person', │ │ │ 'title': 'Occupation' │ │ }, │ │ 'pets': { │ │ │ 'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], │ │ │ 'description': 'The pets of the person', │ │ │ 'title': 'Pets' │ │ }, │ │ 'hobbies': { │ │ │ 'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], │ │ │ 'description': 'The hobbies of the person', │ │ │ 'title': 'Hobbies' │ │ } │ }, │ 'required': ['name', 'age', 'nationality', 'occupation', 'pets', 'hobbies'], │ 'title': 'Person', │ 'type': 'object' }
Now we include this schema in the prompt:
system_prompt = (
"Your main role is to analyse a piece of unstructured text and extract the following information:\n"
"- Name\n"
"- Age\n"
"- Nationality\n"
"- Occupation\n"
"- A list of any pets\n"
"- A list of any hobbies\n\n"
"If any acronyms are used, please expand them.\n\n"
f"Return the information in JSON format according to the following schema:\n\n{schema}\n\n"
"Here is a description:\n\n"
)
print(system_prompt)
Your main role is to analyse a piece of unstructured text and extract the following information: - Name - Age - Nationality - Occupation - A list of any pets - A list of any hobbies If any acronyms are used, please expand them. Return the information in JSON format according to the following schema: {'properties': {'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'The name of the person', 'title': 'Name'}, 'age': {'anyOf': [{'type': 'integer'}, {'type': 'null'}], 'description': 'The age of the person', 'title': 'Age'}, 'nationality': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'The nationality of the person', 'title': 'Nationality'}, 'occupation': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'The occupation of the person', 'title': 'Occupation'}, 'pets': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'description': 'The pets of the person', 'title': 'Pets'}, 'hobbies': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'description': 'The hobbies of the person', 'title': 'Hobbies'}}, 'required': ['name', 'age', 'nationality', 'occupation', 'pets', 'hobbies'], 'title': 'Person', 'type': 'object'} Here is a description:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": description},
],
max_tokens=512,
response_format={"type": "json_object"},
temperature=0.0
)
print(response.choices[0].message.content)
{ "name": "Ryan", "age": 35, "nationality": "New Zealand", "occupation": "Machine Learning Engineer", "pets": ["cat"], "hobbies": ["hiking", "playing video games"] }
Good, this is valid JSON, and the keys now match the field names of our Person model, so we can deserialise it:
json_content = json.loads(response.choices[0].message.content)
person = Person(**json_content)
pprint(person, expand_all=True)
Person(
│   name='Ryan',
│   age=35,
│   nationality='New Zealand',
│   occupation='Machine Learning Engineer',
│   pets=[
│   │   'cat'
│   ],
│   hobbies=[
│   │   'hiking',
│   │   'playing video games'
│   ]
)