Generating fake Nature abstracts¶
In this practical, the goal is to produce fake Nature articles. The inspiration comes from the work done by Alessandro Trevisan and his students on detecting "bullshit" in academic writing. No doubt a difficult task, and a never-ending struggle.
In the abstracts/real.txt file, there are 5 real abstracts taken from Nature articles. The first goal is to have gpt-4o-mini produce some fake abstracts. The next stage is to see if gpt-4o can tell the difference. We will then see if we can tell the difference ourselves, and whether there is anything we can do to bring the generated abstracts closer to the real abstracts.
You will see some code blocks with something like:
def generate():
    response = client.#---#.completions.create(
        model="gpt-4o-mini",
        messages=[
            #---#
        ],
        max_tokens=128,
        #---#
    )
Wherever you see #---#, that is where you fill in the missing code.
from openai import OpenAI
import dotenv
import os
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from jinja2 import Environment, FileSystemLoader, select_autoescape
from typing import Any
from transformers import AutoModel, AutoTokenizer
import torch
# suppress every warning
import warnings
warnings.filterwarnings("ignore")
dotenv.load_dotenv()  # load the OpenAI API key from a local .env file
client = OpenAI()
Generating fake abstracts¶
First we load the real abstracts to take a look.
with open('abstracts/real.txt', "r") as f:
    real_abstracts = f.read().split("\n\n")
print(real_abstracts[0])
A goal of neuroscience is to obtain a causal model of the nervous system. The recently reported whole-brain fly connectome specifies the synaptic paths by which neurons can affect each other, but not how strongly they do affect each other in vivo. To overcome this limitation, we introduce a combined experimental and statistical strategy for efficiently learning a causal model of the fly brain, which we refer to as the ‘effectome’. Specifically, we propose an estimator for a linear dynamical model of the fly brain that uses stochastic optogenetic perturbation data to estimate causal effects and the connectome as a prior to greatly improve estimation efficiency. We validate our estimator in connectome-based linear simulations and show that it recovers a linear approximation to the nonlinear dynamics of more biophysically realistic simulations. We then analyse the connectome to propose circuits that dominate the dynamics of the fly nervous system. We discover that the dominant circuits involve only relatively small populations of neurons—thus, neuron-level imaging, stimulation and identification are feasible. This approach also re-discovers known circuits and generates testable hypotheses about their dynamics. Overall, we provide evidence that fly whole-brain dynamics are generated by a large collection of small circuits that operate largely independently of each other. This implies that a causal model of a brain can be feasibly obtained in the fly.
Now we need to figure out a way to generate fake abstracts. We create both a system and a user prompt.
System prompt: this should direct the model to produce a fake abstract on a topic given by the user.
User prompt: this should just be a topic.
We will use Jinja to create the prompts. Create a new prompts directory and add the two prompt templates, as sketched below.
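The template files themselves are for you to write. Judging from the prompts printed further down in this notebook, prompts/system.jinja and prompts/user.jinja might look something like this (the system prompt text is copied from the printed output; the exact split between the two files is an assumption):

prompts/system.jinja:

You will be given a topic from the user. You are to write an original Nature article abstract about the topic. Do not write a title, authors, or any other article text. Write only the abstract. The abstract should be in English. The abstract should be a continuous paragraph between 150 and 250 words. It is important that the abstract is scientifically feasible.

prompts/user.jinja:

Here is a new topic: {{ topic }}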
def load_template(template_filepath: str, arguments: dict[str, Any]) -> str:
    # set up a Jinja environment rooted in the current directory
    env = Environment(
        loader=FileSystemLoader(searchpath='./'),
        autoescape=select_autoescape()
    )
    # render the named template with the given arguments
    template = env.get_template(template_filepath)
    return template.render(**arguments)
generation_system_prompt = load_template("prompts/system.jinja", {})
generation_user_prompt = load_template(
    "prompts/user.jinja",
    {
        "topic": "Panda foraging in subsaharan Africa, and impact on the local polar bear population",
    }
)
print(generation_system_prompt)
print(generation_user_prompt)
You will be given a topic from the user. You are to write an original Nature article abstract about the topic. Do not write a title, authors, or any other article text. Write only the abstract. The abstract should be in English. The abstract should be a continuous paragraph between 150 and 250 words. It is important that the abstract is scientifically feasible.

Here is a new topic: Panda foraging in subsaharan Africa, and impact on the local polar bear population
def chat_response(system_prompt, user_prompt, model, temperature) -> str:
    # send a single system + user exchange and return the assistant's reply
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=temperature,
        max_tokens=400
    ).choices[0].message.content
    return response
fake_abstract = chat_response(generation_system_prompt, generation_user_prompt, "gpt-4o-mini", 0.2)
print(fake_abstract)
The introduction of giant pandas (Ailuropoda melanoleuca) into sub-Saharan Africa presents a unique ecological scenario, particularly concerning its potential impact on local polar bear (Ursus maritimus) populations. This study investigates the foraging behavior of pandas in a novel environment characterized by diverse flora and fauna, contrasting their traditional bamboo diet with available local vegetation. Preliminary observations indicate that pandas exhibit dietary flexibility, incorporating native plant species into their foraging repertoire. This adaptability raises questions about interspecies competition, particularly with apex predators like polar bears, which are not native to this region but are included in this study due to their ecological significance. The research employs a combination of field studies and ecological modeling to assess the overlap in habitat use and resource competition between the two species. Early findings suggest that while pandas primarily forage on vegetation, their presence may inadvertently influence polar bear hunting patterns and prey availability, particularly in overlapping habitats. This study underscores the importance of understanding species interactions in altered ecosystems, highlighting the need for conservation strategies that consider the implications of introducing non-native species and their potential effects on established wildlife populations. Further research is essential to elucidate the long-term ecological consequences of this unprecedented scenario.
We now need to produce 5 new abstracts. It would be very useful if we could produce articles on topics similar to those in our real abstracts. We will therefore use gpt-4o-mini to extract a single topic sentence from each abstract.
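The extraction prompt templates are not printed in this notebook, so their exact wording is unknown; a plausible sketch (the file names match the code below, the wording is an assumption):

prompts/extraction_system.jinja:

You will be given a Nature article abstract. Summarise the topic of the abstract in a single short sentence. Write only that sentence.

prompts/extraction_user.jinja:

{{ abstract }}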
def extract_topic(abstract):
    extraction_system_prompt = load_template("prompts/extraction_system.jinja", {})
    extraction_user_prompt = load_template(
        "prompts/extraction_user.jinja",
        {
            "abstract": abstract,
        }
    )
    response = chat_response(extraction_system_prompt, extraction_user_prompt, "gpt-4o-mini", 0.2)
    return response
topic_sentence = extract_topic(fake_abstract)
topic_sentence
'Giant pandas in sub-Saharan Africa may impact local polar bear populations and ecosystems.'
Great. Now we can do this for all of our real abstracts:
topic_sentences = [extract_topic(abstract) for abstract in real_abstracts]
topic_sentences
['Causal modeling of the fly brain using connectome and perturbation data.', 'Long-term health impacts of tropical cyclones on mortality in the contiguous United States.', 'Investigating haematopoiesis in Down syndrome through multi-omic analysis of fetal samples.', 'Innovative H2-based method transforms traditional alloy-making into a sustainable single-step process.', 'Three-dimensional wave breaking significantly alters steepness and air-sea exchange dynamics.']
Comparing these with the original abstracts, they seem OK!
Now we can feed these topic sentences into our first gpt-4o-mini agent and see how it does.
def generate_abstract(topic):
    generation_system_prompt = load_template("prompts/system.jinja", {})
    generation_user_prompt = load_template(
        "prompts/user.jinja",
        {
            "topic": topic,
        }
    )
    fake_abstract = chat_response(generation_system_prompt, generation_user_prompt, "gpt-4o-mini", 0.2)
    return fake_abstract
generated_abstracts = [generate_abstract(topic) for topic in topic_sentences]
print(topic_sentences[0])
print("-"*10)
print(generated_abstracts[0])
print("-"*10)
print(real_abstracts[0])
Causal modeling of the fly brain using connectome and experimental data.
----------
The intricate neural circuitry of the Drosophila melanogaster brain presents a unique opportunity to explore causal relationships between neural connectivity and behavior. In this study, we integrate high-resolution connectome data with experimental observations to construct a comprehensive causal model of the fly brain. Utilizing advanced computational techniques, we analyze synaptic connections derived from electron microscopy alongside behavioral assays that assess responses to sensory stimuli. Our model incorporates both structural and functional data, allowing us to identify key neural pathways that mediate specific behaviors, such as olfactory processing and motor coordination. By applying causal inference methods, we elucidate the influence of particular neural circuits on behavioral outcomes, revealing how alterations in connectivity can lead to changes in fly behavior. Furthermore, we validate our model through targeted manipulations of identified neural populations, demonstrating the predictive power of our approach. This research not only enhances our understanding of the fly brain's functional architecture but also provides a framework for investigating causal relationships in other complex neural systems. Ultimately, our findings contribute to the broader field of neurobiology by bridging the gap between structural connectomics and functional neuroscience, paving the way for future studies on the causal mechanisms underlying behavior in both invertebrates and vertebrates.
----------
A goal of neuroscience is to obtain a causal model of the nervous system. The recently reported whole-brain fly connectome specifies the synaptic paths by which neurons can affect each other, but not how strongly they do affect each other in vivo. To overcome this limitation, we introduce a combined experimental and statistical strategy for efficiently learning a causal model of the fly brain, which we refer to as the ‘effectome’. Specifically, we propose an estimator for a linear dynamical model of the fly brain that uses stochastic optogenetic perturbation data to estimate causal effects and the connectome as a prior to greatly improve estimation efficiency. We validate our estimator in connectome-based linear simulations and show that it recovers a linear approximation to the nonlinear dynamics of more biophysically realistic simulations. We then analyse the connectome to propose circuits that dominate the dynamics of the fly nervous system. We discover that the dominant circuits involve only relatively small populations of neurons—thus, neuron-level imaging, stimulation and identification are feasible. This approach also re-discovers known circuits and generates testable hypotheses about their dynamics. Overall, we provide evidence that fly whole-brain dynamics are generated by a large collection of small circuits that operate largely independently of each other. This implies that a causal model of a brain can be feasibly obtained in the fly.
But how can we really compare these? We generate embeddings of the abstracts.
def get_embedding(texts: list, method="openai"):
    match method:
        case "openai":
            embeddings = client.embeddings.create(
                model="text-embedding-3-small",
                input=texts
            )
            embeddings = [embedding.embedding for embedding in embeddings.data]
            return np.array(embeddings)
        case "bert":
            model_checkpoint = "distilbert/distilbert-base-uncased"
            model = AutoModel.from_pretrained(model_checkpoint)
            tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
            tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
            with torch.no_grad():
                # use the [CLS] token's final hidden state as the sentence embedding
                embeddings = model(**tokens).last_hidden_state[:, 0, :].detach().numpy()
            return embeddings
        case _:
            raise ValueError("Invalid method")
real_embeddings = get_embedding(real_abstracts)
generated_embeddings = get_embedding(generated_abstracts)
similarity = cosine_similarity(real_embeddings, generated_embeddings)
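As a reminder, the cosine similarity of two embeddings is the cosine of the angle between them: the dot product of the vectors after normalising each to unit length. A toy illustration with made-up two-dimensional vectors (not the abstract embeddings):

# cosine similarity by hand: dot product of unit-normalised vectors
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.707, i.e. a 45-degree angle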
all_embeddings = np.concatenate([real_embeddings, generated_embeddings])
pca = PCA(n_components=2)
pca.fit(all_embeddings)
real_pca = pca.transform(real_embeddings)
fake_pca = pca.transform(generated_embeddings)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.tight_layout()
# heatmap with named axes and two-decimal labels
sns.heatmap(similarity, annot=True, fmt='.2f', xticklabels=False, yticklabels=False, square=False, ax=ax1)
ax1.set_xlabel("Generated Abstracts")
ax1.set_ylabel("Real Abstracts")
ax2.scatter(real_pca[:,0], real_pca[:,1], label="Real")
ax2.scatter(fake_pca[:,0], fake_pca[:,1], label="Fake")
ax2.set_xlabel(f"PCA 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)")
ax2.set_ylabel(f"PCA 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)")
ax2.legend()
plt.show()
From this it seems like most of the variance is down to the difference in topic. What if we compare abstracts that are all on the same topic? In the file abstracts/real_same.txt are 9 articles on the fruit fly connectome. First grab the topic sentences and then generate the samples as before.
with open('abstracts/real_same.txt', "r") as f:
    real_same_abstracts = f.read().split("\n\n")
same_topic_sentences = [extract_topic(abstract) for abstract in real_same_abstracts]
same_topic_sentences
['Developing a causal model of the fly brain using experimental and statistical methods.', 'Analysis of the adult fly brain connectome reveals insights into neural network organization.', 'Modeling the Drosophila brain reveals insights into sensory processing and behavior.', "Comprehensive annotation of neuronal classes in Drosophila melanogaster's brain connectome.", "Comprehensive analysis of neuronal cell types and connectivity in Drosophila's optic lobe.", 'Whole brain neuronal wiring diagram reveals insights into connectivity and circuit mechanisms.', 'Advancements in connectomics enable predictions about neuronal function from structural wiring diagrams.', 'Neural mechanisms of halting during walking in Drosophila involve distinct inhibitory and excitatory pathways.', "Visual information processing in Drosophila's navigation system and its neural architecture."]
same_generated_abstracts = [generate_abstract(topic) for topic in same_topic_sentences]
same_real_embeddings = get_embedding(real_same_abstracts)
same_generated_embeddings = get_embedding(same_generated_abstracts)
similarity = cosine_similarity(same_real_embeddings, same_generated_embeddings)
all_embeddings = np.concatenate([same_real_embeddings, same_generated_embeddings])
pca = PCA(n_components=2)
pca.fit(all_embeddings)
real_pca = pca.transform(same_real_embeddings)
fake_pca = pca.transform(same_generated_embeddings)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.tight_layout()
# heatmap with named axes and two-decimal labels
sns.heatmap(similarity, annot=True, fmt='.2f', xticklabels=False, yticklabels=False, square=False, ax=ax1)
ax1.set_xlabel("Generated Abstracts")
ax1.set_ylabel("Real Abstracts")
ax2.scatter(real_pca[:,0], real_pca[:,1], label="Real")
ax2.scatter(fake_pca[:,0], fake_pca[:,1], label="Fake")
ax2.set_xlabel(f"PCA 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)")
ax2.set_ylabel(f"PCA 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)")
ax2.legend()
plt.show()
Just by looking at the similarity scores of the embeddings and the first two principal components, it is quite difficult to spot the difference between the two different styles.
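One way to go beyond eyeballing is to compare the average cosine similarity within each group with the average similarity between the groups. This check is an addition to the practical; a minimal sketch using the embeddings computed above:

# mean within-group vs between-group cosine similarity: if these are close,
# the two styles really are hard to tell apart in this embedding space
within_real = cosine_similarity(same_real_embeddings)
within_fake = cosine_similarity(same_generated_embeddings)
between = cosine_similarity(same_real_embeddings, same_generated_embeddings)

# exclude the diagonal (self-similarity is always 1) from the within-group means
n_real = within_real.shape[0]
n_fake = within_fake.shape[0]
mean_within_real = (within_real.sum() - n_real) / (n_real * (n_real - 1))
mean_within_fake = (within_fake.sum() - n_fake) / (n_fake * (n_fake - 1))

print(f"mean within real: {mean_within_real:.3f}")
print(f"mean within fake: {mean_within_fake:.3f}")
print(f"mean between:     {between.mean():.3f}")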
Now contrast this with some model-generated haiku (in haiku/gpt_haiku.txt) and real haiku by Bashō, Buson, and Issa.
with open("haiku/real_haiku_basho.txt", "r") as f:
real_haikus_basho = f.read().split("\n\n")
with open("haiku/real_haiku_buson.txt", "r") as f:
real_haikus_buson = f.read().split("\n\n")
with open("haiku/real_haiku_issa.txt", "r") as f:
real_haikus_issa = f.read().split("\n\n")
with open("haiku/gpt_haiku.txt", "r") as f:
fake_haikus = f.read().split("\n\n")
print(real_haikus_basho[0])
print()
print(fake_haikus[0])
real_haikus = real_haikus_basho + real_haikus_buson + real_haikus_issa
all_haikus = real_haikus + fake_haikus
# 1 for real, 0 for fake
targets = [1] * len(real_haikus) + [0] * len(fake_haikus)
# to present the haikus blind, we could shuffle them together with the targets:
# import random
# zipped = list(zip(all_haikus, targets))
# random.shuffle(zipped)
# all_haikus, targets = map(list, zip(*zipped))
The old pond
a frog jumps in
sound of water

Dawn's first light
a spider weaves anew
its dew-kissed web
fake_haiku_embeddings = get_embedding(fake_haikus)
real_haiku_embeddings = get_embedding(real_haikus)
all_embeddings = np.concatenate([fake_haiku_embeddings, real_haiku_embeddings])  # fake first, matching the slice indices in the plot below
pca = PCA(n_components=2)
pca_embeddings = pca.fit_transform(all_embeddings)
plt.figure(figsize=(8, 6))
# plot each poet
plt.scatter(pca_embeddings[:len(fake_haikus), 0], pca_embeddings[:len(fake_haikus), 1], label="Fake Haiku")
# basho
plt.scatter(
pca_embeddings[len(fake_haikus):len(fake_haikus)+len(real_haikus_basho), 0],
pca_embeddings[len(fake_haikus):len(fake_haikus)+len(real_haikus_basho), 1],
label="Real Haiku (Basho)"
)
# buson
plt.scatter(
pca_embeddings[len(fake_haikus)+len(real_haikus_basho):len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson), 0],
pca_embeddings[len(fake_haikus)+len(real_haikus_basho):len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson), 1],
label="Real Haiku (Buson)"
)
# issa
plt.scatter(
pca_embeddings[len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson):, 0],
pca_embeddings[len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson):, 1],
label="Real Haiku (Issa)"
)
plt.xlabel(f"PCA Component 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)")
plt.ylabel(f"PCA Component 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)")
plt.legend()
plt.show()
Using BERT as the embedding model¶
We've tried using the OpenAI embedding models. But now let's try our old friend, BERT.
fake_embeddings = get_embedding(fake_haikus, method="bert")
real_embeddings = get_embedding(real_haikus, method="bert")
embeddings = np.concatenate([fake_embeddings, real_embeddings])
labels = np.array([0]*len(fake_haikus) + [1]*len(real_haikus))
pca = PCA(n_components=2)
pca_embeddings = pca.fit_transform(embeddings)
plt.figure(figsize=(8, 6))
# plot each poet
plt.scatter(pca_embeddings[:len(fake_haikus), 0], pca_embeddings[:len(fake_haikus), 1], label="Fake Haiku")
# basho
plt.scatter(
pca_embeddings[len(fake_haikus):len(fake_haikus)+len(real_haikus_basho), 0],
pca_embeddings[len(fake_haikus):len(fake_haikus)+len(real_haikus_basho), 1],
label="Real Haiku (Basho)"
)
# buson
plt.scatter(
pca_embeddings[len(fake_haikus)+len(real_haikus_basho):len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson), 0],
pca_embeddings[len(fake_haikus)+len(real_haikus_basho):len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson), 1],
label="Real Haiku (Buson)"
)
# issa
plt.scatter(
pca_embeddings[len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson):, 0],
pca_embeddings[len(fake_haikus)+len(real_haikus_basho)+len(real_haikus_buson):, 1],
label="Real Haiku (Issa)"
)
plt.xlabel(f"PCA Component 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)")
plt.ylabel(f"PCA Component 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)")
plt.legend()
plt.show()
That actually looks a little better.
With the abstracts¶
with open('abstracts/real_same.txt', "r") as f:
    real_same_abstracts = f.read().split("\n\n")
same_fake_embeddings = get_embedding(same_generated_abstracts, method="bert")
same_real_embeddings = get_embedding(real_same_abstracts, method="bert")
similarity = cosine_similarity(same_real_embeddings, same_fake_embeddings)
all_embeddings = np.concatenate([same_real_embeddings, same_fake_embeddings])
pca = PCA(n_components=2)
pca.fit(all_embeddings)
real_pca = pca.transform(same_real_embeddings)
fake_pca = pca.transform(same_fake_embeddings)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.tight_layout()
# heatmap with named axes and two-decimal labels
sns.heatmap(similarity, annot=True, fmt='.2f', xticklabels=False, yticklabels=False, square=False, ax=ax1)
ax1.set_xlabel("Generated Abstracts")
ax1.set_ylabel("Real Abstracts")
ax2.scatter(real_pca[:,0], real_pca[:,1], label="Real")
ax2.scatter(fake_pca[:,0], fake_pca[:,1], label="Fake")
ax2.set_xlabel(f"PCA 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)")
ax2.set_ylabel(f"PCA 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)")
ax2.legend()
plt.show()
Well, well, well...
BERT to the rescue once again. Even though their cosine similarities are all remarkably similar, as we would expect, since they are all papers around the same topic, PCA on those embeddings still yields some valuable results.
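To put a number on that separation, we could train a simple classifier on the BERT embeddings. This is an addition to the practical; a minimal sketch using scikit-learn's LogisticRegression with leave-one-out cross-validation (sensible here, since we only have a handful of abstracts):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# 1 for real, 0 for fake, matching the concatenation order
X = np.concatenate([same_real_embeddings, same_fake_embeddings])
y = np.array([1] * len(same_real_embeddings) + [0] * len(same_fake_embeddings))

# each fold holds out a single abstract and trains on the rest
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.2f}")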
from PIL import Image
import matplotlib.pyplot as plt
img = Image.open("bert-alone.png")
plt.imshow(img)
plt.axis('off')
plt.show()