RAG¶
In this section we start to see glimpses of RAG. We begin by figuring out how to handle external documents. We built a database in the previous section, and we will use that knowledge to build a database over an example document.
As with data storage, we have many, many options for processing external documents. In this section we will make use of LlamaIndex. The approach will likely vary depending on the nature of your documents - HTML, PDF, Word, folders of files, etc.
Here, we focus on parsing a single PDF using llama_index and PyMuPDFReader.
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SentenceSplitter
from pydantic import BaseModel, Field
import fitz
from PIL import Image
import matplotlib.pyplot as plt
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import dotenv
import os
from openai import OpenAI
from jinja2 import Environment, FileSystemLoader, select_autoescape
from typing import Any
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
loader = PyMuPDFReader()
documents = loader.load(file_path="data/paper.pdf")
len(documents)
30
This list contains one item for each page.
print(documents[0].text[:1000] + "...")
A Philosophical Introduction to Language Models Part I: Continuity With Classic Debates Raphaël Millière Department of Philosophy Macquarie University raphael.milliere@mq.edu.eu Cameron Buckner Department of Philosophy University of Houston cjbuckner@uh.edu Abstract Large language models like GPT-4 have achieved remarkable proficiency in a broad spec- trum of language-based tasks, some of which are traditionally associated with hallmarks of human intelligence. This has prompted ongoing disagreements about the extent to which we can meaningfully ascribe any kind of linguistic or cognitive competence to language models. Such questions have deep philosophical roots, echoing longstanding debates about the status of artificial neural networks as cognitive models. This article–the first part of two companion papers–serves both as a primer on language models for philosophers, and as an opinionated survey of their significance in relation to classic debates in the philosophy cognitive science,...
Extracting images¶
It is probably handy to have the images extracted from the PDF. This is not always easy to do, but for this paper we can use PyMuPDF to extract them. Objects in a PDF are identified by an xref (cross reference) number. If you know this number, you can extract the image. But how do you find the xref number? One method is to use PyMuPDF's image extraction functions: page.get_images() lists the image xrefs on each page, so we can loop over them and try to extract each one. PyMuPDF will do most of this for us.
def get_images(path: str):
    doc = fitz.open(path)
    for p, page in enumerate(doc):
        images = page.get_images()
        if len(images) > 0:
            print(f"Page {p} has {len(images)} images")
        for i, img in enumerate(images):
            xref = img[0]  # xref of the image itself
            mref = img[1]  # xref of its soft mask (0 if there is none)
            basepix = fitz.Pixmap(doc, xref)
            if mref > 0:
                # combine the base image with its transparency mask
                maskpix = fitz.Pixmap(doc, mref)
                pix = fitz.Pixmap(basepix, maskpix)
            else:
                pix = basepix
            pix.save(f"./data/page_{p}_image_{i}.png")
    print("Done")
get_images("data/paper.pdf")
Page 4 has 1 images
Page 6 has 1 images
Page 10 has 1 images
Done
If we inspect one of these images, we can see that, sure enough, it has been extracted correctly.
img = Image.open("data/page_4_image_0.png")
plt.imshow(img)
plt.axis("off")
plt.show()
In some cases, this might not be possible. Another method is to convert the PDF pages to images, pass those page images to a vision LLM, and ask it to extract or describe the figures.
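As a rough sketch of that fallback (assuming the same data/paper.pdf file; the 2x zoom matrix is just an arbitrary choice for a sharper rendering), the page-to-image conversion might look like this:
def pages_to_images(path: str, out_dir: str = "./data") -> None:
    """Render each PDF page to a PNG that could be sent to a vision model."""
    doc = fitz.open(path)
    for p, page in enumerate(doc):
        # render at 2x resolution so small figures stay legible
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        pix.save(f"{out_dir}/page_{p}.png")

pages_to_images("data/paper.pdf")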
Creating a vector database¶
Now we have our documents, we can create a vector database. We will use Chroma as before.
First, we use a text_parser to split the documents into chunks and create indices. Essentially, the process is:
- Split the document into chunks;
- Add the chunks to a list;
- Add the chunks to a database, assigning a unique index, and any metadata to the chunks.
The splitting can occur in a few different ways: at full stops (.), page breaks, paragraphs, or sentences. You can also choose different chunk sizes and overlap sizes.
We use the SentenceSplitter from LlamaIndex, and just pick some generic parameters.
def chunker(chunk_size: int, overlap: int, documents: Any) -> tuple[list[str], list[int]]:
    text_parser = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
    )
    text_chunks = []
    doc_idxs = []
    for doc_idx, doc in enumerate(documents):
        cur_text_chunks = text_parser.split_text(doc.text)
        text_chunks.extend(cur_text_chunks)
        doc_idxs.extend([doc_idx] * len(cur_text_chunks))
    return text_chunks, doc_idxs
text_chunks, doc_idxs = chunker(chunk_size=1024, overlap=128, documents=documents)
len(text_chunks)
34
Reusing roughly the same database structure as before:
class DocumentDB:
    def __init__(self, name: str, model_name: str = "text-embedding-3-small"):
        self.model_name = model_name
        self.client = chromadb.PersistentClient(path="./")
        self.embedding_function = OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name=model_name)
        self.chat_db = self.client.create_collection(name=name, embedding_function=self.embedding_function, metadata={"hnsw:space": "cosine"})
        self.id_counter = 0

    def add_chunks_to_db(self, chunks: list[str], doc_idxs: list[int]):
        """Add text chunks to the database.

        Args:
            chunks (list[str]): List of text chunks.
            doc_idxs (list[int]): List of corresponding document indices.
        """
        self.chat_db.add(
            documents=chunks,
            metadatas=[{"doc_idx": idx} for idx in doc_idxs],
            ids=[f"chunk_{self.id_counter + i}" for i in range(len(chunks))]
        )
        self.id_counter += len(chunks)

    def get_all_entries(self) -> dict:
        """Grab all of the entries in the database.

        Returns:
            dict: All entries in the database.
        """
        return self.chat_db.get()

    def clear_db(self, reinitialize: bool = True):
        """Clear the database of all entries, and optionally reinitialize it.

        Args:
            reinitialize (bool, optional): Recreate an empty collection after deleting. Defaults to True.
        """
        self.client.delete_collection(self.chat_db.name)
        # re-initialize the database
        if reinitialize:
            self.__init__(self.chat_db.name, self.model_name)

    def query_db(self, query_text: str, n_results: int = 2) -> dict:
        """Given some query text, return the n_results most similar entries in the database.

        Args:
            query_text (str): The text to query the database with.
            n_results (int): The number of results to return.

        Returns:
            dict: The most similar entries in the database.
        """
        return self.chat_db.query(query_texts=[query_text], n_results=n_results)
Now we add our chunks to the database:
doc_db = DocumentDB("paper_db")
doc_db.add_chunks_to_db(chunks=text_chunks, doc_idxs=doc_idxs)
If you have already created the database, you will get an error if you try to run this again. You'll need to delete the chroma.sqlite3 file and the folder with a name consisting of a long string of numbers and letters.
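Alternatively, rather than deleting files by hand, a small reset helper could be used on re-runs. This is just a sketch, assuming the DocumentDB class above (reset_db is a hypothetical helper, not part of any library):
def reset_db(name: str = "paper_db") -> DocumentDB:
    """Delete the named collection if it exists, then return a fresh DocumentDB."""
    client = chromadb.PersistentClient(path="./")
    try:
        client.delete_collection(name)
    except Exception:
        pass  # the collection did not exist yet, so there is nothing to delete
    return DocumentDB(name)

# doc_db = reset_db("paper_db")  # use this instead of DocumentDB("paper_db") when re-running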
We now try a query and see what results we get back:
sample_query = "Abstract"
results = doc_db.query_db(sample_query, n_results=3)
print(f"Sample query results for '{sample_query}':")
results
Sample query results for 'Abstract':
{'ids': [['chunk_30', 'chunk_12', 'chunk_11']], 'distances': [[0.7026290379547134, 0.7119312067010161, 0.7212011058066763]], 'metadatas': [[{'doc_idx': 27}, {'doc_idx': 11}, {'doc_idx': 10}]], 'embeddings': None, 'documents': [['A Philosophical Introduction to Language Models\nPart I\nQiu, L., Shaw, P., Pasupat, P., Nowak, P., Linzen, T., Sha, F. & Toutanova, K. (2022), Improving\nCompositional Generalization with Latent Structure and Data Augmentation, in ‘Proceedings of the\n2022 Conference of the North American Chapter of the Association for Computational Linguistics:\nHuman Language Technologies’, Association for Computational Linguistics, Seattle, United States,\npp. 4341–4362.\nQuilty-Dunn, J., Porot, N. & Mandelbaum, E. (2022), ‘The Best Game in Town: The Re-Emergence of\nthe Language of Thought Hypothesis Across the Cognitive Sciences’, Behavioral and Brain Sciences\npp. 1–55.\nRaffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. (2020),\n‘Exploring the limits of transfer learning with a unified text-to-text transformer’, The Journal of\nMachine Learning Research 21(1), 140:5485–140:5551.\nRamesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022), ‘Hierarchical Text-Conditional Image\nGeneration with CLIP Latents’.\nSalton, G., Wong, A. & Yang, C. S. (1975), ‘A vector space model for automatic indexing’, Communica-\ntions of the ACM 18(11), 613–620.\nSavelka, J., Agarwal, A., An, M., Bogart, C. & Sakr, M. (2023), Thrilled by Your Progress! Large\nLanguage Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Program-\nming Courses, in ‘Proceedings of the 2023 ACM Conference on International Computing Education\nResearch V.1’, pp. 78–92.\nSavelka, J., Ashley, K. D., Gray, M. A., Westermann, H. & Xu, H. (2023), Can GPT-4 Support Analysis\nof Textual Data in Tasks Requiring Highly Specialized Domain Expertise?, in ‘Proceedings of the\n2023 Conference on Innovation and Technology in Computer Science Education V. 1’, pp. 117–123.\nSchmidhuber, J. (1990), Towards Compositional Learning with Dynamic Neural Networks, Inst. für\nInformatik.\nSchut, L., Tomasev, N., McGrath, T., Hassabis, D., Paquet, U. & Kim, B. (2023), ‘Bridging the Human-AI\nKnowledge Gap: Concept Discovery and Transfer in AlphaZero’.\nSearle, J. R. (1980), ‘Minds, Brains, and Programs’, Behavioral and Brain Sciences 3(3), 417–57.\nShinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K. & Yao, S. (2023), ‘Reflexion:\nLanguage Agents with Verbal Reinforcement Learning’.\nSmolensky, P. (1988), ‘On the proper treatment of connectionism’, Behavioral and Brain Sciences\n11(1), 1–23.\nSmolensky, P. (1989), Connectionism and Constituent Structure, in R. Pfeifer, Z. Schreter, F. Fogelman-\nSoulié & L. Steels, eds, ‘Connectionism in Perspective’, Elsevier.\nSmolensky, P., McCoy, R., Fernandez, R., Goldrick, M. & Gao, J. (2022a), ‘Neurocompositional\nComputing: From the Central Paradox of Cognition to a New Generation of AI Systems’, AI\nMagazine 43(3), 308–322.\nSmolensky, P., McCoy, R. T., Fernandez, R., Goldrick, M. & Gao, J. (2022b), ‘Neurocompositional\ncomputing in human and machine intelligence: A tutorial’.\nSober, E. (1998), Morgan’s canon, in ‘The Evolution of Mind’, Oxford University Press, New York, NY,\nUS, pp. 224–242.\n28', 'A Philosophical Introduction to Language Models\nPart I\nInitial DNN performance on SCAN and other synthetic datasets probing compositional gener-\nalization – such as CFQ (Keysers et al. 
2019) and COGS (Kim & Linzen 2020) – was somewhat\nunderwhelming. Testing generally revealed a significant gap between performance on the train set\nand on the test set, suggesting a failure to properly generalize across syntactic distribution shifts.\nSince then, however, many Transformer-based models have achieved good to perfect accuracy on these\ntests. This progress was enabled by various strategies, including tweaks to the vanilla Transformer\narchitecture to provide more effective inductive biases (Csordás et al. 2022, Ontanon et al. 2022) and\ndata augmentation to help models learn the right kind of structure (Andreas 2020, Akyürek et al.\n2020, Qiu et al. 2022).\nMeta-learning, or learning to learn better by generalizing from exposure to many related learning\ntasks (Conklin et al. 2021, Lake & Baroni 2023), has also shown promise without further architectural\ntweaks. Standard supervised learning rests on the assumption that training and testing data are\ndrawn from the same distribution, which can lead models to “overfit” to the training data and fail to\ngeneralize to the testing data. Meta-learning exposes models to several distributions of related tasks,\nin order to promote acquisition of generalizable knowledge. For example, Lake & Baroni (2023) show\nthat a standard Transformer-based neural network, when trained on a stream of distinct artificial\ntasks, can achieve systematic generalization in a controlled few-shot learning experiment, as well as\nstate-of-the-art performance on systematic generalization benchmarks. At test time, the model exhibits\nhuman-like accuracy and error patterns, all without explicit compositional rules. While meta-learning\nacross various tasks helps promote compositional generalization, recent work suggests that merely\nextending the standard training of a network beyond the point of achieving high accuracy on training\ndata can lead it to develop more tree-structured computations and generalize significantly better to\nheld-out test data that require learning hierarchical rules (Murty et al. 2023). The achievements of\nTransformer models on compositional generalization benchmarks provide tentative evidence that\nbuilt-in rigid compositional rules may not be needed to emulate the structure-sensitive operations of\ncognition.\nOne interpretation of these results is that, given the right architecture, learning objective, and\ntraining data, ANNs might achieve human-like compositional generalization by implementing a\nlanguage of thought architecture – in accordance with the second horn of the classicist dilemma\n(Quilty-Dunn et al. 2022, Pavlick 2023). But an alternative interpretation is available, on which ANNs\ncan achieve compositional generalization with non-classical constituent structure and composition\nfunctions. Behavioral evidence alone is insufficient to arbitrate between these two hypotheses.8 But\nit is also worth noting that the exact requirements for implementing a language of thought are still\nsubject to debate (Smolensky 1989, McGrath et al. 2023).\nOn the traditional Fodorian view, mental processes operate on discrete symbolic representations\nwith semantic and syntactic structure, such that syntactic constituents are inherently semantically\nevaluable and play direct causal roles in cognitive processing. 
By contrast, the continuous vectors that\nbear semantic interpretation in ANNs are taken to lack discrete, semantically evaluable constituents\nthat participate in processing at the algorithmic level, which operates on lower-level activation values\ninstead. This raises the question whether the abstracted descriptions of stable patterns observed\nin the aggregate behavior of ANNs’ lower-level mechanisms can fulfill the requirements of classical\nconstituent structure, especially when their direct causal efficacy in processing is not transparent.\nFor proponents of connectionism who argue that ANNs may offer a non-classical path to modeling\ncognitive structure, this is a feature rather than a bug. Indeed, classical models likely make overly\nrigid assumptions about representational formats, binding mechanisms, algorithmic transparency,\nand demands for systematicity; conversely, even modern ANNs likely fail to implement their specific\n8See Part II for a brief discussion of mechanistic evidence in favor of the second hypothesis.\n12', 'A Philosophical Introduction to Language Models\nPart I\n3.1. Compositionality\nAccording to a long-standing critique of the connectionist research program, artificial neural networks\nwould be fundamentally incapable of accounting for the core structure-sensitive features of cognition,\nsuch as the productivity and systematicity of language and thought. This critique centers on a\ndilemma: either ANNs fail to capture the features of cognition that can be readily accounted for in a\nclassical symbolic architecture; or they merely implement such an architecture, in which case they lack\nindependent explanatory purchase as models of cognition (Fodor & Pylyshyn 1988, Pinker & Prince\n1988, Quilty-Dunn et al. 2022). The first horn of the dilemma rests on the hypothesis that ANNs lack\nthe kind of constituent structure required to model productive and systematic thought – specifically,\nthey lack compositionally structured representations involving semantically-meaningful, discrete\nconstituents (Macdonald 1995). By contrast, classicists argue that thinking occurs in a language of\nthought with a compositional syntax and semantics (Fodor 1975). On this view, cognition involves\nthe manipulation of discrete mental symbols combined according to compositional rules. Hence, the\nsecond horn of the dilemma: if some ANNs turn out to exhibit the right kind of structure-sensitive\nbehavior, they must do so because they implement rule-based computation over discrete symbols.\nThe remarkable progress of LLMs in recent years calls for a reexamination of old assumptions\nabout compositionality as a core limitation of connectionist models. A large body of empirical\nresearch investigates whether language models exhibit human-like levels of performance on tasks\nthought to require compositional processing. These studies evaluate models’ capacity for compositional\ngeneralization, that is, whether they can systematically recombine previously learned elements to\nmap new inputs made up from these elements to their correct output (Schmidhuber 1990). 
This is\ndifficult to do with LLMs trained on gigantic natural language corpora, such as GPT-3 and GPT-4,\nbecause it is near-impossible to rule out that the training set contains that exact syntactic pattern.\nSynthetic datasets overcome this with a carefully designed train-test split.\nThe SCAN dataset, for example, contains a set of natural language commands (e.g., “jump twice”)\nmapped unambiguously to sequences of actions (e.g., JUMP JUMP) (Lake & Baroni 2018). The\ndataset is split into a training set, providing broad coverage of the space of possible commands, and\na test set, specifically designed to evaluate models’ abilities to compositionally generalize (3). To\nsucceed on SCAN, models must learn to interpret words in the input (including primitive commands,\nmodifiers and conjunctions) in order to properly generalize to novel combinations of familiar elements\nas well as entirely new commands. The test set evaluates generalization in a number of challenging\nways, including producing action sequences longer than seen before, generalizing across primitive\ncommands by producing the action sequence for a novel composed command, and generalizing in a\nfully systematic fashion by “bootstrapping” from limited data to entirely new compositions.\nFigure 3 | Examples of inputs and outputs from the SCAN dataset (Lake & Baroni 2018) with an\nillustrative train-test split.7\n7Several train-test splits exist for the SCAN dataset to test different aspects of generalization, such as generalization to\nlonger sequence lengths, to new templates, or to new primitives (Lake & Baroni 2018).\n11']], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}
This is all pretty messy, but we can see that we have:
- 'documents': the documents returned by the database query;
- 'metadatas': any metadata we wanted to include, in this case only the index (but we could easily include the author and title of the paper);
- 'distances': the cosine distance between the query and each returned chunk (lower means more similar).
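To make the structure a little clearer, here is a small sketch that unpacks the fields we care about from the results variable returned by the query above:
# Chroma returns one list per query text, hence the [0] indexing.
docs = results["documents"][0]
metas = results["metadatas"][0]
dists = results["distances"][0]

for doc, meta, dist in zip(docs, metas, dists):
    print(f"doc_idx={meta['doc_idx']}  cosine distance={dist:.2f}")
    print(doc[:120].replace("\n", " ") + "...")
    print("-" * 10)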
The next step is to put these contexts, along with the query, into an LLM.
client = OpenAI()
Our prompt will be simple for now. We use the standard way to load Jinja templates.
def load_template(template_filepath: str, arguments: dict[str, Any]) -> str:
    env = Environment(
        loader=FileSystemLoader(searchpath='./'),
        autoescape=select_autoescape()
    )
    template = env.get_template(template_filepath)
    return template.render(**arguments)

system_prompt = load_template(
    template_filepath="prompts/rag_system_prompt.jinja",
    arguments={}
)
You are a helpful academic assistant that is an expert at extracting information from academic papers. You will be given a query, and some chunks of text that correspond to a document. You will also be given the cosine distance between the query and each text chunk.
You must answer the query using the information in the text.
Your answer must be concise and to the point.
If you are unsure of something, you should say that you are unsure.
### Input Format ###
Query: <query>
Context: <text chunk>\n\n
Cosine Distance: <distance>
----------
We now need a function that calls the retriever and combines the returned context for the prompt.
def combine_context(documents: list[str], scores: list[float]) -> str:
    string = ""
    for document, score in zip(documents, scores):
        string += f"{document}\nCosine distance: {score:.2f}\n{'-'*10}\n"
    return string

def get_context(user_input: str, n_results: int = 2, doc_db: DocumentDB = doc_db) -> str:
    results = doc_db.query_db(user_input, n_results=n_results)
    context_list = results["documents"][0]
    combined_context = combine_context(context_list, results["distances"][0])
    if not combined_context:
        combined_context = "No relevant context found."
    return combined_context
query = "What are the main findings of this paper?"
context = get_context(query, n_results=3)
user_prompt = (
    f"Query: {query}\n\n"
    f"Context: {context}"
)
print(user_prompt)
Query: What are the main findings of this paper? Context: A Philosophical Introduction to Language Models Part I were all they could do, LLMs like GPT-4 would simply be Blockheads come to life. Compare this to a human student who had found a test’s answer key on the Internet and reproduced its answers without any deeper understanding; such regurgitation would not be good evidence that the student was intelligent. For these reasons, “data contamination”–when the training set contains the very question on which the LLM’s abilities are assessed–is considered a serious concern in any report of an LLM’s performance, and many think it must be ruled out by default when comparing human and LLM performance (Aiyappa et al. 2023). Moreover, GPT-4’s pre-training and fine-tuning requires an investment in computation on a scale available only to well-funded corporations and national governments–a process which begins to look quite inefficient when compared to the data and energy consumed by the squishy, 20-watt engine between our ears before it generates similarly sophisticated output. In this opinionated review paper, we argue that LLMs are more than mere Blockheads; but this skeptical interpretation of LLMs serves as a useful foil to develop a subtler view. While LLMs can simply regurgitate large sections of their prompt or training sets, they are also capable of flexibly blending patterns from their training data to produce genuinely novel outputs. Many empiricist philosophers have defended the idea that sufficiently flexible copying of abstract patterns from previous experience could form the basis of not only intelligence, but full-blown creativity and rational decision-making (Baier 2002, Hume 1978, Buckner 2023); and more scientific research has emphasized that the kind of flexible generalization that can be achieved by interpolating vectors in the semantic spaces acquired by these models may explain why these systems often appear more efficient, resilient, and capable than systems based on rules and symbols (Smolensky 1988, Smolensky et al. 2022a). A useful framework for exploring the philosophical significance of such LLMs, then, might be to treat the worry that they are merely unintelligent, inefficient Blockheads as a null hypothesis, and survey the empirical evidence that can be mustered to refute it.5 We adopt that approach here, and use it to provide a brief introduction to the architecture, achievements, and philosophical questions surrounding state-of-the-art LLMs such as GPT-4. There has, in our opinion, never been a more important time for philosophers from a variety of backgrounds– but especially philosophy of mind, philosophy of language, epistemology, and philosophy of science–to engage with foundational questions about artificial intelligence. Here, we aim to provide a wide range of those philosophers (and philosophically-inclined researchers from other disciplines) with an opinionated survey that can help them to overcome the barriers imposed by the technical complexity of these systems and the ludicrous pace of recent research achievements. 2. A primer on LLMs 2.1. Historical foundations The origins of large language models can be traced back to the inception of AI research. The early history of natural language processing (NLP) was marked by a schism between two competing paradigms: the symbolic and the stochastic approaches. 
A major influence on the symbolic paradigm in NLP was Noam Chomsky’s transformational-generative grammar (Chomsky 1957), which posited that the syntax of natural languages could be captured by a set of formal rules that generated well- 5Such a method of taking a deflationary explanation for data as a null hypothesis and attempting to refute it with empirical evidence has been a mainstay of comparative psychology for more than a century, in the form of Morgan’s Canon (Buckner 2017, Sober 1998). As DNN-based systems approach the complexity of an animal brain, it may be useful to take lessons from comparative psychology in arbitrating fair comparisons to human intelligence (Buckner 2021). In comparative psychology, standard deflationary explanations for data include reflexes, innate-releasing mechanisms, and simple operant conditioning. Here, we suggest that simple deflationary explanations for an AI-inspired version of Morgan’s Canon include Blockhead-style memory lookup. 3 Cosine distance: 0.69 ---------- A Philosophical Introduction to Language Models Part I Evidential targets Corresponding data for LLMs Architecture Transformer Learning objective Next-token prediction Model size 1010 −1012 trainable parameters Training data Internet-scale text corpora Behavior Performance on benchmarks & targeted experiments Representations & computations Findings from probing & intervention experiments Table 1 | Kinds of empirical evidence that can be brought to bear in philosophical debates about LLMs cognitive capacity, simply because its operations can be explained in less abstract and more deflation- ary terms. In the present context, the fallacy manifests in claims that LLMs could not possibly be good models of some cognitive capacity 𝜙because their operations merely consist in a collection of statistical calculations, or linear algebra operations, or next-token predictions. Such arguments are only valid if accompanied by evidence demonstrating that a system, defined in these terms, is inherently incapable of implementing 𝜙. To illustrate, consider the flawed logic in asserting that a piano could not possibly produce harmony because it can be described as a collection of hammers striking strings, or (more pointedly) that brain activity could not possibly implement cognition because it can be described as a collection of neural firings. The critical question is not whether the operations of an LLM can be simplistically described in non-mental terms, but whether these operations, when appropriately organized, can implement the same processes or algorithms as the mind, when described at an appropriate level of computational abstraction. The Redescription Fallacy is a symptom of a broader trend to treat key philosophical questions about artificial neural networks as purely theoretical, leading to sweeping in-principle claims that are not amenable to empirical disconfirmation. Hypotheses here should be guided by empirical evidence regarding the capacities of artificial neural networks like LLMs and their suitability as cognitive models (see table 1). In fact, considerations about the architecture, learning objective, model size, and training data of LLMs are often insufficient to arbitrate these issues. Indeed, our contention is that many of the core philosophical debates on the capacities of neural networks in general, and LLMs in particular, hinge at least partly on empirical evidence concerning their internal mechanisms and knowledge they acquire during the course of training. 
In other words, many of these debates cannot be settled a priori by considering general characteristics of untrained models. Rather, we must take into account experimental findings about the behavior and inner workings of trained models. In this section, we examine long-standing debates about the capacities of artificial neural networks that have been revived and transformed by the development of deep learning and the recent success of LLMs in particular. Behavioral evidence obtained from benchmarks and targeted experiments matters greatly to those debates. However, we note from the outset that such evidence is also insufficient to paint the full picture; connecting to concerns about Blockheads reviewed in the first section, we must also consider evidence about how LLMs process information internally to close the gap between claims about their performance and putative competence. Sophisticated experimental methods have been developed to identify and intervene on the representations and computations acquired by trained LLMs. These methods hold great promise to arbitrate some of the philosophical issues reviewed here beyond tentative hypotheses supported by behavioral evidence. We leave a more detailed discussion of these methods and the corresponding experimental findings to Part II. 10 Cosine distance: 0.70 ---------- A Philosophical Introduction to Language Models Part I: Continuity With Classic Debates Raphaël Millière Department of Philosophy Macquarie University raphael.milliere@mq.edu.eu Cameron Buckner Department of Philosophy University of Houston cjbuckner@uh.edu Abstract Large language models like GPT-4 have achieved remarkable proficiency in a broad spec- trum of language-based tasks, some of which are traditionally associated with hallmarks of human intelligence. This has prompted ongoing disagreements about the extent to which we can meaningfully ascribe any kind of linguistic or cognitive competence to language models. Such questions have deep philosophical roots, echoing longstanding debates about the status of artificial neural networks as cognitive models. This article–the first part of two companion papers–serves both as a primer on language models for philosophers, and as an opinionated survey of their significance in relation to classic debates in the philosophy cognitive science, artificial intelligence, and linguistics. We cover topics such as compositionality, language acquisition, semantic competence, grounding, world models, and the transmission of cultural knowledge. We argue that the success of language models challenges several long-held assumptions about artificial neural networks. However, we also highlight the need for further empirical investigation to better understand their internal mechanisms. This sets the stage for the companion paper (Part II), which turns to novel empirical methods for probing the inner workings of language models, and new philosophical questions prompted by their latest developments. 1. Introduction Deep learning has catalyzed a significant shift in artificial intelligence over the past decade, leading up to the development of Large Language Models (LLMs). The reported achievements of LLMs, often heralded for their ability to perform a wide array of language-based tasks with unprecedented proficiency, have captured the attention of both the academic community and the public at large. State-of-the-art LLMs like GPT-4 are even claimed to exhibit “sparks of general intelligence” (Bubeck et al. 2023). 
They can produce essays and dialogue responses that often surpass the quality of an average undergraduate student’s work (Herbold et al. 2023); they achieve better scores than most humans on a variety of AP tests for college credit and rank in the 80-99th percentile on graduate admissions tests like the GRE or LSAT (OpenAI 2023a); their programming proficiency “favorably compares to the average software engineer’s ability” (Bubeck et al. 2023, Savelka, Agarwal, An, Bogart & Sakr 2023); they can solve many difficult mathematical problems (Zhou et al. 2023)–even phrasing their solution in the form of a Shakespearean sonnet, if prompted to do so. LLMs also form the backbone of multimodal systems that can answer advanced questions about visual inputs arXiv:2401.03910v1 [cs.CL] 8 Jan 2024 Cosine distance: 0.73 ----------
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    stream=True,
    temperature=0.0
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
The main findings of the paper highlight that large language models (LLMs) like GPT-4 demonstrate remarkable proficiency in various language-based tasks, prompting debates about their cognitive and linguistic competence. The authors argue that while LLMs can regurgitate information, they also exhibit the ability to blend patterns from their training data to produce novel outputs, suggesting a form of creativity and rational decision-making. The paper emphasizes the importance of empirical evidence in understanding the internal mechanisms of LLMs and challenges traditional assumptions about artificial neural networks. It serves as both a primer for philosophers and an opinionated survey of the philosophical implications of LLMs in relation to cognitive science and artificial intelligence.
Meh.
This is OK, but not all of these chunks are massively useful. This highlights one of the most important parts of building RAG pipelines: RAG systems live and die on the quality of the retrieval process. And the retrieval process depends strongly on the quality of the embeddings created from the documents.
But also, consider what question we are actually asking:
"What are the main findings of this paper?"
The only way we are going to get meaningful chunks returned, ready to be ingested by the LLM, is if there are similar chunks within the body of text. In other words, a phrase similar to "main finding of this paper" would have to appear in one of the chunks! So we have two main issues:
- The quality of the chunks is not very good due to the parsing process;
- We have to know what kinds of questions to ask of our models.
In this case, it would have probably been better to just stuff the entire document into the model context window (this technique is literally called stuffing).
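For comparison, here is a minimal sketch of stuffing, assuming the documents list and client from above and that the whole paper fits in the model's context window:
# Concatenate every page and pass the whole paper as context, skipping retrieval entirely.
full_text = "\n\n".join(doc.text for doc in documents)

stuffed = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful academic assistant. Answer questions about the paper provided by the user."},
        {"role": "user", "content": f"Query: What are the main findings of this paper?\n\nPaper:\n{full_text}"},
    ],
    temperature=0.0,
)
print(stuffed.choices[0].message.content)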
Let's ask another question to demonstrate the retrieval problem:
def rag_query(query: str, n_context: int, return_context: bool = False):
    query_results = doc_db.query_db(query, n_results=n_context)
    context_list = query_results["documents"][0]
    combined_context = combine_context(context_list, query_results["distances"][0])
    if not combined_context:
        combined_context = "No relevant context found."
    system_prompt = load_template(
        template_filepath="prompts/rag_system_prompt.jinja",
        arguments={}
    )
    user_prompt = (
        f"Query: {query}\n\n"
        f"Context: {combined_context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        stream=False,
        temperature=0.0
    )
    if return_context:
        return response.choices[0].message.content, context_list
    else:
        return response.choices[0].message.content
response = rag_query(
    query="Who wrote this paper?",
    n_context=2
)
print(response)
The authors of the paper "A Philosophical Introduction to Language Models" are not explicitly mentioned in the provided text chunks. Therefore, I am unsure who wrote this paper.
Of course, this is ridiculous. We know that this paper was written by Raphaël Millière and Cameron Buckner. But how would the retriever know how to find the correct information? Where in the text would there be anything similar to "Who wrote this paper?"? You might get lucky if you include the word "authors", but try that for this paper and you'll get the same result.
response = rag_query(
    query="Who wrote this paper? If there is no explicitly stated author, then make a best guess.",
    n_context=5
)
print(response)
The paper does not explicitly state an author. However, based on the context provided, a best guess for the author could be "Buckner, C. J." as they are mentioned multiple times in the text and seem to be discussing relevant topics related to language models and philosophy.
Give the model a little more freedom, and it can sometimes infer one of the authors. Note that we can actually extract the authors in a couple of ways - sometimes they are included in the metadata of the pdf, but their names will usually be in the first document chunk!
You could use this information and either add it as metadata to the entire collection or to individual chunks.
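As a small sketch of the metadata route (assuming data/paper.pdf; whether the author field is populated depends entirely on how the PDF was produced):
# PyMuPDF exposes the PDF's document-information dictionary.
meta = fitz.open("data/paper.pdf").metadata
print(meta.get("author"), "|", meta.get("title"))

# If these fields are populated, they could be merged into each chunk's metadata,
# e.g. {"doc_idx": idx, "author": meta.get("author")}, before calling chat_db.add().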
Further work¶
This example is a very naive form of RAG. Additional techniques might include hybrid search or contextual retrieval.
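As a toy illustration of the hybrid idea only (not a production BM25 implementation), we could re-rank the vector hits with a simple keyword-overlap score; a true hybrid system would also query a separate keyword index. hybrid_rerank below is a hypothetical helper built on the doc_db from earlier:
def hybrid_rerank(query: str, n_results: int = 5, alpha: float = 0.7) -> list[str]:
    """Toy re-ranking: mix vector similarity with keyword overlap."""
    results = doc_db.query_db(query, n_results=n_results)
    docs = results["documents"][0]
    dists = results["distances"][0]
    q_terms = set(query.lower().split())

    def score(doc: str, dist: float) -> float:
        overlap = len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)
        # lower cosine distance is better, higher keyword overlap is better
        return alpha * (1 - dist) + (1 - alpha) * overlap

    ranked = sorted(zip(docs, dists), key=lambda pair: score(*pair), reverse=True)
    return [doc for doc, _ in ranked]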