Prompt Engineering¶
Prompting can be critical to success.
In this section, we will:
- Present two methods for prompting: Chain of Thought (CoT) and Few-shot;
- Show you how to create prompt templates using Jinja2 (a minimal sketch of which follows below).
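Since we mention Jinja2 above, here is a minimal taste of what a templated prompt can look like. This is only an illustrative sketch (the template wording and variable names are my own), assuming the jinja2 package is installed:

from jinja2 import Template

# An illustrative prompt template with a placeholder and an optional instruction.
question_template = Template(
    "Answer the following question.\n"
    "{% if concise %}Keep your answer concise.\n{% endif %}"
    "Question: {{ question }}"
)

print(question_template.render(question="Which glass has more water, A or C?", concise=True))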
import os

import dotenv
from openai import OpenAI

# Load the API key from a .env file before constructing the client.
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=OPENAI_API_KEY)
Chain-of-Thought (CoT)¶
Chain-of-Thought prompting has a fancy name, but it is actually very simple.
We could just ask the model to spit out an answer to a question. Below are two simple logic problems that we present to the model.
Piaget's Glass of Water¶
user_query = (
"Suppose I show you two glasses of water, labelled A and B. "
"Glasses A and B appear to be identical in shape, and the water level appears "
"to be at the same height in both glasses. "
"I now bring out a third glass, labelled C. "
"Glass C appears to be taller and thinner than glasses A and B. "
"I pour the water from glass B into glass C. "
"The water level in glass C appears to be higher than the water level in glass A. "
"Which glass has more water, A or C? "
"Keep your answer concise."
)
This is a classic Piaget Test. Piaget devised a series of tests for children to determine their cognitive development.
One of these tests is the prototypical conservation test. Children in the preoperational stage of development (ages 2-7) do not have an understanding of matter conservation - they will tend to claim that the taller, thinner glass has more water in it. Children in the concrete operational stage (ages 7-11) do have an understanding of matter conservation, and will claim that the two glasses have the same amount of water in them. If you also said that the two glasses have the same amount of water in them, congratulations! You are at least as cognitively developed as a 7-11 year old!
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": user_query},
],
max_tokens=512,
temperature=0.0
)
print(response.choices[0].message.content)
Glass A has more water than glass C. The height of the water level in glass C appears higher due to its taller and thinner shape, but it contains less water overall than glass A.
Poor gpt-4o-mini! It seems to be quite confused by this question - it is almost correct, but not quite. It is not clear why it fails here, but it is clear that the model cannot perform this task.
Try changing the model to gpt-4o though, and you will find that it can perform this task. Some might argue that this is a type of emergent ability - as we add more parameters to a model, certain behaviours and abilities unexpectedly emerge. This is a very interesting phenomenon, and it is not entirely clear why it happens, or whether it is actually real (see a rebuttal to this here).
If we stick with gpt-4o-mini, we can try to prompt the model with a CoT prompt. This is a prompt designed to elicit a series of steps from the model, so that it generates a sort of reasoning process before coming to a conclusion. All we do is add to the end of the prompt,
Carefully think through the problem step by step.
We also make sure to keep the answer concise,
Keep your answer concise.
just to avoid the model generating an overly long-winded response.
user_query = (
"Suppose I show you two glasses of water, labelled A and B. "
"Glasses A and B appear to be identical in shape, and the water level appears "
"to be at the same height in both glasses. "
"I now bring out a third glass, labelled C. "
"Glass C appears to be taller and thinner than glasses A and B. "
"I pour the water from glass B into glass C. "
"The water level in glass C appears to be higher than the water level in glass A. "
"Which glass has more water, A or C? "
"Keep your answer concise. Carefully think through the problem step by step."
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": user_query},
],
max_tokens=512,
temperature=0.0
)
print(response.choices[0].message.content)
To determine which glass has more water, we need to consider the volume of water in each glass. 1. **Initial Observation**: Glasses A and B have the same water level, indicating they contain the same volume of water. 2. **Pouring Water**: When you pour the water from glass B into glass C, the volume of water from B is transferred to C. 3. **Water Level in Glass C**: After pouring, the water level in glass C appears higher than in glass A. This is due to the taller and thinner shape of glass C, which can make the water level appear higher even if the volume is not greater. 4. **Volume Comparison**: Since glasses A and B had the same volume of water initially, and all of the water from B was transferred to C, the total volume of water in glass C is now equal to the volume of water in glass A plus the volume of water that was in glass B (which is the same as in A). Thus, after pouring, glass C contains the same amount of water as glass A, because it started with the same amount from glass B. **Conclusion**: Glass A and glass C have the same amount of water.
This answer is now correct. But before you celebrate, let me show you what happens when you make a seemingly harmless change.
user_query = (
"Suppose I show you two glasses of water, labelled A and B. "
"Glasses A and B appear to be identical in shape, and the water level appears "
"to be at the same height in both glasses. "
"I now bring out a third glass, labelled C. "
"Glass C appears to be taller and thinner than glasses A and B. "
"I pour the water from glass B into glass C. "
"The water level in glass C appears to be higher than the water level in glass A. "
"Which glass has more water, A or C? "
"Keep your answer consise and think through the problem very carefully step by step."
)
All we have done is change
Keep your answer concise. Carefully think through the problem step by step.
to
Keep your answer consise and think through the problem very carefully step by step.
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": user_query},
],
max_tokens=512,
temperature=0.0
)
print(response.choices[0].message.content)
To determine which glass has more water, we need to analyze the situation step by step: 1. **Initial Observation**: Glasses A and B have the same water level, indicating they contain the same volume of water. 2. **Pouring Water**: When you pour the water from glass B into glass C, the water from B is transferred to C. 3. **Water Level in Glass C**: After pouring, the water level in glass C appears to be higher than the water level in glass A. 4. **Volume Comparison**: Since glasses A and B initially had the same volume of water, and you poured all of the water from B into C, the total volume of water in glass C is now equal to the volume of water in glass B plus the volume of water that was already in glass C (which was initially empty). 5. **Conclusion**: Since glass A had the same amount of water as glass B before the pour, and now glass C has the water from B plus whatever was in C, glass C must have more water than glass A. Thus, glass C has more water than glass A.
What!? How is it suddenly wrong again!? Although LLMs have ingested enormous amounts of human text, they are not like people, and you cannot reason with them in the same way you can with people. Adding "very carefully" to the prompt should intuitively make the prompt better, but in this case it has caused a failure. In fact, even with the successful prompt we tried, if you repeat it many times you will probably see failures.
Humans do not fail reasoning tasks in this way.
The lesson here is that CoT can be a powerful way to improve performance, but it is not foolproof. Hallucination is almost impossible to eliminate, and prompting can be incredibly fragile.
Can you spot the issue with my prompt and the reasoning...?
Read the conclusion very carefully, and then read the prompt carefully. In particular, the part where I introduce glass C. I never said that glass C was empty! It can be very hard to craft high quality prompts! What happens if I add that additional information?
user_query = (
"Suppose I show you two glasses of water, labelled A and B. "
"Glasses A and B appear to be identical in shape, and the water level appears "
"to be at the same height in both glasses. "
"I now bring out an empty third glass, labelled C. "
"Glass C appears to be taller and thinner than glasses A and B. "
"I pour the water from glass B into glass C. "
"The water level in glass C appears to be higher than the water level in glass A. "
"Which glass has more water, A or C? "
"Keep your answer consise and think through the problem very carefully step by step."
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": user_query},
],
max_tokens=512,
temperature=0.0
)
print(response.choices[0].message.content)
To determine which glass has more water, we need to analyze the situation step by step: 1. **Initial Condition**: Glasses A and B have the same water level, meaning they contain the same volume of water. 2. **Pouring Water**: When you pour the water from glass B into glass C, the water from B is transferred to C. 3. **Water Level in Glass C**: After pouring, the water level in glass C appears to be higher than the water level in glass A. This is due to the shape of glass C being taller and thinner. 4. **Volume Comparison**: Since glass B initially had the same amount of water as glass A, and all the water from B is now in glass C, the volume of water in glass C is equal to the volume of water that was in glass B. 5. **Conclusion**: Since glass A and glass B had the same amount of water, and all of the water from B is now in C, glass C contains the same amount of water as glass A. Therefore, both glasses A and C have the same volume of water. **Final Answer**: Glass A and glass C have the same amount of water.
Now both versions are correct.
N Lions Living in Harmony¶
This problem is a little harder.
user_query = (
"There is an island with infinite grass and vegetation. "
"The island is inhabited by 1 sheep and N lions. "
"The lions can survive by eating the sheep or vegetation, "
"but they much prefer to eat the sheep. "
"When a lion eats the sheep, it gets converted into a sheep itself "
"and then can in turn be eaten by other lions. "
"The lions want to eat the sheep, but not at the risk of being eaten themselves. "
"For which number of N will all sheep be safe? Keep your answer concise."
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": user_query},
],
max_tokens=1024,
temperature=0.0
)
print(response.choices[0].message.content)
All sheep will be safe when N = 1. For N = 2 or more, the lions will eat the sheep.
This is incorrect.
To be fair, this problem is actually pretty difficult - it is a classic quantitative finance job interview question. Maybe we should not be so hard on gpt-4o-mini after all, since most humans don't get this right the first time.
Now let's add a CoT prompt.
user_query = (
"There is an island with infinite grass and vegetation. "
"The island is inhabited by 1 sheep and N lions. "
"The lions can survive by eating the sheep or vegetation, "
"but they much prefer to eat the sheep. "
"When a lion eats the sheep, it gets converted into a sheep itself "
"and then can in turn be eaten by other lions. "
"The lions want to eat the sheep, but not at the risk of being eaten themselves. "
"For which number of N will all sheep be safe? "
"Think through the problem step-by-step."
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": user_query},
],
max_tokens=1024,
temperature=0.0
)
print(response.choices[0].message.content)
To analyze the problem, we need to consider the dynamics between the sheep and the lions. The key points to keep in mind are: 1. **Initial Setup**: There is 1 sheep and N lions on the island. 2. **Lions' Preference**: Lions prefer to eat the sheep over vegetation. 3. **Consequences of Eating**: When a lion eats the sheep, it becomes a sheep itself, which can then be eaten by other lions. 4. **Survival Instinct**: Lions will not eat the sheep if they believe they will be at risk of being eaten themselves. Now, let's analyze the situation based on different values of N (the number of lions): - **N = 0**: There are no lions, so the sheep is safe. - **N = 1**: There is 1 lion. The lion will eat the sheep because it has no risk of being eaten (there are no other lions). The sheep is not safe. - **N = 2**: There are 2 lions. Each lion knows that if one of them eats the sheep, the other lion can then eat the new sheep. Therefore, both lions will refrain from eating the sheep to avoid being eaten themselves. The sheep is safe. - **N = 3**: With 3 lions, if one lion eats the sheep, it becomes a sheep and the other two lions can then eat it. Each lion knows that if it eats the sheep, it risks being eaten by one of the other two lions. Thus, all three lions will refrain from eating the sheep. The sheep is safe. - **N = 4**: With 4 lions, the same logic applies. If one lion eats the sheep, it becomes a sheep and the other three lions can eat it. Each lion knows that if it eats the sheep, it risks being eaten by one of the other three lions. Thus, all four lions will refrain from eating the sheep. The sheep is safe. Continuing this reasoning, we can see that as long as there are an even number of lions, they will not eat the sheep because they can always be outnumbered by the remaining lions if one lion tries to eat the sheep. However, if we consider: - **N = 5**: With 5 lions, if one lion eats the sheep, it becomes a sheep and the other four lions can eat it. The lion that eats the sheep knows it will be outnumbered by the remaining four lions, so it will not eat the sheep. However, the same logic applies to the other lions, and they will also refrain from eating the sheep. The sheep is safe. - **N = 6**: With 6 lions, the same reasoning applies. The sheep is safe. - **N = 7**: With 7 lions, if one lion eats the sheep, it becomes a sheep and the other six lions can eat it. The lion that eats the sheep knows it will be outnumbered by the remaining six lions, so it will not eat the sheep. The sheep is safe. - **N = 8**: With 8 lions, the same reasoning applies. The sheep is safe. Continuing this reasoning, we can conclude that the sheep will be safe as long as there are an even number of lions. Thus, the final conclusion is: **The sheep will be safe if N is even (N = 0, 2, 4, 6, ...). If N is odd (N = 1, 3, 5, 7, ...), the sheep will not be safe.**
Perhaps surprisingly, this answer is actually correct! Just by adding,
Think through the problem step by step.
to the end of the prompt, the model was able to reason its way to the correct answer. Again though, try running it a few times and see whether it is consistently correct, or try telling the model to be concise... you will likely experience failures.
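To get a feel for how consistent the model really is, you could repeat the same request several times and count the successes. The sketch below is my own: it uses a non-zero temperature so that runs actually differ, and crudely checks whether the answer mentions an even N, which is only a rough proxy for correctness.

# Rough consistency check: repeat the CoT prompt and count apparent successes.
n_runs = 5
n_correct = 0
for _ in range(n_runs):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}],
        max_tokens=1024,
        temperature=0.7,
    )
    answer = response.choices[0].message.content.lower()
    # Crude proxy: the accepted answer is that the sheep is safe when N is even.
    if "even" in answer:
        n_correct += 1
print(f"{n_correct}/{n_runs} runs mentioned an even N")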
Again CoT prompting can be a powerful technique, but it will almost certainly fail at times.
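If you would rather verify the puzzle's answer without trusting the model at all, the backward-induction argument can be checked in a few lines of code. This sketch is mine, not part of the original puzzle statement:

# Backward induction: with n perfectly rational lions, does the sheep survive?
# A lion only eats the sheep if it would itself be safe afterwards, i.e. if the
# sheep is safe in the game with n - 1 lions. So safe(n) = not safe(n - 1).
def sheep_is_safe(n: int) -> bool:
    if n == 0:
        return True  # no lions, nothing can eat the sheep
    return not sheep_is_safe(n - 1)

print([n for n in range(1, 11) if sheep_is_safe(n)])  # [2, 4, 6, 8, 10]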
Few-shot prompting¶
When we want an LLM to do something for us, we could just ask it.
In this example, we use a selection of real and fake Haiku and ask the model to classify them as either real or fake. The fake Haiku were generated by Claude, and the real Haiku are by the famous Haiku poets Matsuo Basho, Yosa Buson, and Kobayashi Issa. There are 6 Haiku from each of the three real poets, for a total of 18 real examples, plus 18 fake examples.
with open("haiku/real_haiku_basho.txt", "r") as f:
real_haikus_basho = f.read().split("\n\n")
with open("haiku/real_haiku_buson.txt", "r") as f:
real_haikus_buson = f.read().split("\n\n")
with open("haiku/real_haiku_issa.txt", "r") as f:
real_haikus_issa = f.read().split("\n\n")
with open("haiku/gpt_haiku.txt", "r") as f:
fake_haikus = f.read().split("\n\n")
print(real_haikus_basho[0])
print()
print(fake_haikus[0])
real_haikus = real_haikus_basho + real_haikus_buson + real_haikus_issa
all_haikus = real_haikus + fake_haikus
# 1 for real, 0 for fake
targets = [1] * len(real_haikus) + [0] * len(fake_haikus)
# shuffle
import random
zipped = list(zip(all_haikus, targets))
random.shuffle(zipped)
The old pond
a frog jumps in
sound of water

Dawn's first light
a spider weaves anew
its dew-kissed web
Some of the generated Haiku are quite similar to the original Haiku. However, what might throw off the model is that the real Haiku are not necessarily in the 5-7-5 pattern (in fact most of them are not). For a very interesting analysis of influential Japanese Haiku, I strongly recommend Volume 1 of R. H. Blyth's Haiku: Eastern Culture.
We will have to switch to gpt-4o to get reasonable results on this task; gpt-4o-mini is not able to perform it well.
from tqdm import tqdm
system_prompt = (
"You will be shown a haiku that is either written by a human or by a computer. "
"Your job is to classify the haiku as either 'human' or 'computer'.\n\n"
"Classify the following haiku as either written by a human or a computer. "
"Respond with 'human' or 'computer' only."
)
result = []
for haiku, target in tqdm(zipped):
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": haiku},
],
max_tokens=128
)
result.append((haiku, target, response.choices[0].message.content))
100%|██████████| 36/36 [00:17<00:00, 2.06it/s]
def accuracy(result):
    """Fraction of (haiku, target, response) triples where the prediction matches the label."""
    correct = 0
    for haiku, target, response in result:
        # The model is instructed to reply with exactly 'human' or 'computer'.
        if response == "human" and target == 1:
            correct += 1
        elif response == "computer" and target == 0:
            correct += 1
    return correct / len(result)
print(f'Accuracy: {accuracy(result)*100:.2f}%')
# print target, response
for haiku, target, response in result:
print(target, response)
Accuracy: 47.22% 1 human 1 human 0 human 1 human 1 human 0 human 1 computer 0 human 1 human 0 human 0 human 0 human 1 human 0 human 0 human 0 human 0 human 0 human 1 human 0 human 0 human 1 human 1 human 0 human 1 human 0 human 1 human 1 human 0 human 1 human 1 human 1 human 0 human 1 human 1 human 0 human
It thinks almost all of them are human!! This is not what we want. We want the model to be able to distinguish between human and machine-generated Haiku.
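You can make the collapse explicit with a quick tally of the raw predictions (a small sketch of my own):

from collections import Counter

# Count how often each label was predicted in the zero-shot run above.
print(Counter(response for _, _, response in result))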
Let's add some examples to the prompt. We could add these into the system prompt itself, but it makes more sense to add them in the following way:
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "A cicada shell\nit sang itself\nutterly away"},
{"role": "assistant", "content": "human"},
{"role": "user", "content": "Crimson leaves falling\na path of quiet footsteps\nautumn fades to dusk"},
{"role": "assistant", "content": "computer"},
{"role": "user", "content": "It is evening autumn\nI think only\nof my parents"},
{"role": "assistant", "content": "human"},
{"role": "user", "content": "In the moonlit night\na distant temple bell rings\nechoes in my heart"},
{"role": "assistant", "content": "computer"},
]
result = []
for haiku, target in tqdm(zipped):
response = client.chat.completions.create(
model="gpt-4o",
messages=messages + [{"role": "user", "content": haiku}],
max_tokens=128
)
result.append((haiku, target, response.choices[0].message.content))
100%|██████████| 36/36 [00:22<00:00, 1.63it/s]
print(f'Accuracy: {accuracy(result)*100:.2f}%')
# print target, response
for haiku, target, response in result:
print(target, response)
Accuracy: 94.44% 1 computer 1 human 0 computer 1 human 1 human 0 computer 1 computer 0 computer 1 human 0 computer 0 computer 0 computer 1 human 0 computer 0 computer 0 computer 0 computer 0 computer 1 human 0 computer 0 computer 1 human 1 human 0 computer 1 human 0 computer 1 human 1 human 0 computer 1 human 1 human 1 human 0 computer 1 human 1 human 0 computer
Excellent! So after just showing the model a couple of examples of each, it is now able to distinguish reliably between human and machine-generated Haiku. This is the power of few-shot prompting - if you can show instead of tell, you might end up with better results.
It is also possible to combine CoT and few-shot prompting. Just be aware that this may drastically increase the length of the prompt.
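As a rough illustration of what that combination might look like (the assistant turns below are my own invention, reusing two of the Haiku from above), each few-shot example can include a short reasoning trace before the label:

# Sketch of few-shot + CoT: each assistant turn reasons briefly, then answers.
cot_fewshot_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "A cicada shell\nit sang itself\nutterly away"},
    {
        "role": "assistant",
        "content": "The imagery is spare and the lines are irregular rather than a neat 5-7-5. Answer: human",
    },
    {
        "role": "user",
        "content": "Crimson leaves falling\na path of quiet footsteps\nautumn fades to dusk",
    },
    {
        "role": "assistant",
        "content": "The lines follow a tidy 5-7-5 with stock seasonal imagery. Answer: computer",
    },
]

Note that with this style of example you would need to parse the final "Answer: ..." part of the response rather than comparing the whole string, as our accuracy function does.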
An aside...¶
We used an LLM for the above task, when we could just as easily have used something else. We could have cultivated a broad dataset of real and fake Haiku and fine-tuned a BERT model for this task. More than likely, a fine-tuned BERT would have outperformed an LLM. If we only have a small amount of data and a small number of samples to classify, an LLM will perform well. However, if we have very many potential training examples, and very many samples to classify, then using an LLM may be very expensive compared to fine-tuning a BERT model.
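For reference, a very rough sketch of that BERT alternative is below. It is my own, assumes the transformers and datasets packages are installed, and reuses the all_haikus and targets lists from earlier; with only 36 examples this is far too little data to be meaningful, the point is just the shape of the approach.

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Build a dataset from the haiku and their labels, then tokenize.
dataset = Dataset.from_dict({"text": all_haikus, "label": targets})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)
dataset = dataset.train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="haiku-bert", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
print(trainer.evaluate())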
General guidance for prompting¶
Be clear and specific¶
The model is not a mind reader. Try to avoid ambiguity in your prompts. If you are asking a question, make sure it is clear what you are asking. If you are asking for a summary, make sure it is clear what you want a summary of. If you want code, make sure you specify the language, arguments, return values, and so on.
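For example (the wording below is my own), compare a vague request with one that pins down the language, signature, and behaviour:

# Vague: the model has to guess the language, inputs, outputs and edge cases.
vague_prompt = "Write me a function that parses dates."

# Specific: the language, signature, format and error behaviour are all stated.
specific_prompt = (
    "Write a Python function parse_date(s: str) -> datetime.date that parses "
    "dates in the format 'YYYY-MM-DD'. Raise a ValueError for any other format, "
    "include a short docstring, and use only the standard library."
)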
Be concise¶
LLMs suffer from a retrieval bias: they tend to favour information at the beginning and end of a prompt. A well-known probe for this is the needle-in-a-haystack test. You take a long document, bury an obscure fact or sentence within it, and then ask a question that can only be answered from that sentence, checking whether the model can retrieve it depending on where it was buried.
What this effectively shows is that if you make your prompt very long and the information you need is buried within it, the model may not pay close enough attention to it. Keep important information and instructions at the beginning or end of the prompt, preferably the end.
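As a toy illustration of the test described above (the filler text, the "needle" sentence, and the access-code value are all my own inventions), you can bury a fact in a long block of irrelevant text and see whether the model retrieves it:

# Toy needle-in-a-haystack probe: bury one informative sentence mid-document.
sentences = ["The weather was unremarkable that day."] * 400
sentences.insert(len(sentences) // 2, "The access code for the archive is 4932.")
haystack = " ".join(sentences)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": haystack + "\n\nWhat is the access code for the archive?"},
    ],
    max_tokens=32,
    temperature=0.0,
)
print(response.choices[0].message.content)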