Ian Mason

Building a Minimal RAG Model

2023-10-21T00:00:00+01:00

Large language models (LLMs) like ChatGPT are very good at generating cohesive text on a wide range of topics. Often, however, we want to generate text for a very specific use case. For example, imagine we want a model that is able to answer factual questions about historical financial data. When we ask ChatGPT the question “What was the inflation rate in Indonesia in 1986?”, it states that it doesn’t have enough information to provide a good answer. Other LLMs might give a reasonable-looking but factually inaccurate answer, as LLMs can be prone to hallucinations.

If we want to build a model that can answer questions like this we have a couple of options. We could fine-tune an LLM on a dataset of questions and answers, but this requires a lot of data, can be expensive, and we may lose some generality in the text the LLM can generate. Alternatively, we could augment the model with external tools or resources.

Retrieval augmented generation (RAG) models aim to augment language models by providing them with additional context with which to respond to a user query. Originally designed to include a training loop, RAG models now tend more towards stitching together pre-trained components with a vector database. The figure below (taken from a more detailed tutorial on building RAG models at scale) shows the main components and structure of a RAG model.

Set-Up: A vector database (VectorDB) is created offline by embedding relevant resources/documents using a pre-trained language embedding model that converts text to vector embeddings.
Step One: A user query is received and embedded with the same embedding model used to create the VectorDB.
Step Two: The VectorDB is searched to find the document(s) with the most similar embedding(s), using cosine similarity, L2 distance or a similar measure.
Step Three: The documents retrieved from the VectorDB are added to the user’s query as additional context.
Step Four: The original query augmented with the documents is fed into an LLM.
Step Five: The LLM should now generate a more reasonable response using the additional information from an outside source.
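
The retrieval and augmentation steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the three-dimensional vectors and the document labels are made up, and a real embedding model returns vectors with far more dimensions. The prompt format mirrors the one used later in this post.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, db, top_k=1):
    """Step Two: return the top_k document texts most similar to the query."""
    ranked = sorted(db, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(query, contexts):
    """Step Three: attach the retrieved documents to the user query."""
    return f"query: {query}. context: {';'.join(contexts)}"

# Hypothetical VectorDB: (document text, embedding) pairs with toy embeddings.
toy_db = [
    ("Document A (inflation)", [0.9, 0.1, 0.0]),
    ("Document B (shipping)", [0.1, 0.9, 0.0]),
    ("Document C (grain)", [0.0, 0.1, 0.9]),
]
toy_query_vec = [0.8, 0.2, 0.1]  # Step One would produce this with the embedding model

contexts = retrieve(toy_query_vec, toy_db, top_k=1)
prompt = build_prompt("What was the inflation rate?", contexts)
print(prompt)  # Step Four would send this prompt to the LLM
```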

Spinning Up a Minimal Example

To see what this actually looks like in practice, we will walk through how to build a very simple RAG model making use of available tools. The documents we embed to create the VectorDB are a small number of news documents from the nltk reuters corpus. To build the VectorDB we use ChromaDB, which allows us to create a locally stored vector database with a few lines of code. We use the OpenAI API to access powerful models for embedding documents and generating text.

Creating a VectorDB

A VectorDB is just a database where each entry has a vector associated with it, allowing us to search the database to find the entry with the closest vector. To keep costs low, for this example we will take the first 100 documents from the nltk reuters corpus and discard any documents with more than 500 words. Returning to our original example on Indonesian inflation, one of the documents in this set is shown below. We can see that if we are able to provide this context to the LLM, we should be able to get a good answer to our question about past inflation rates in Indonesia.

First, we set up a method to embed documents using the OpenAI text-embedding-ada-002 model.

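import openai
from chromadb.utils import embedding_functions

def get_embedding_function():
    # Wrap OpenAI's text-embedding-ada-002 model so ChromaDB can call it directly.
    # get_openai_key() is a helper that returns the OpenAI API key; its
    # definition is not shown in this post.
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=get_openai_key(),
        model_name="text-embedding-ada-002"
    )
    return openai_ef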

With this embedding function, the below code creates a locally stored VectorDB which stores the raw text of the reuters documents, an id for each document and the vector created by the embedding model.

import chromadb
import nltk
from nltk.corpus import reuters
from utils import get_embedding_function

nltk.download('reuters')
reuters_subset = reuters.fileids()[0:100]
# Keep only the shorter documents to limit embedding costs
reuters_subset = [file_id for file_id in reuters_subset if len(reuters.words(file_id)) < 500]

client = chromadb.PersistentClient(path="chromadb/test_db")
collection = client.create_collection(name="reuters_collection", embedding_function=get_embedding_function())

for i, file_id in enumerate(reuters_subset):
    collection.add(
        documents=[reuters.raw(file_id)],
        metadatas=[{"nltk_file_id": file_id}],
        ids=[str(i)]
    )
print(collection.peek())  # To see the first documents in the collection

Building a RAG model

Now that we have constructed our VectorDB of relevant documents, we can use it to answer user queries: when a query arrives we search the VectorDB for the 3 most similar documents and pass them to the LLM alongside the query. In the code below, get_rag_context embeds the user query and finds the 3 most similar documents to use as additional context. The method rag_response gets a response from the LLM (gpt-3.5-turbo) when these documents are provided alongside the query. For comparison, the method response gets a response using only the user query with no additional information.

import openai
import chromadb
from utils import get_embedding_function

def response(query):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{query}"},
        ]
    )
    return response['choices'][0]['message']['content']


def rag_response(query, context):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Please answer the query using the context provided."},
            {"role": "user", "content": f"query: {query}. context: {context}"},
        ]
    )
    return response['choices'][0]['message']['content']


def get_rag_context(query, client, num_docs=3):
    collection = client.get_collection(name="reuters_collection", embedding_function=get_embedding_function())
    results = collection.query(
        query_texts=[query],
        n_results=num_docs
    )
    contexts = [doc.replace("\n", " ") for doc in results['documents'][0]]
    return contexts


def main():
    client = chromadb.PersistentClient(path="../chromadb/test_db")

    query = "What was the inflation rate in Indonesia in 1986?"
    contexts = get_rag_context(query, client)
    default_response = response(query)
    ragged_response = rag_response(query, ";".join(contexts))
    print(f"Query: {query}")
    print(f"Default response: {default_response}")
    print(f"RAG response: {ragged_response}")


if __name__ == "__main__":
    main()

RAG in action

Now we can see what happens when we ask the RAG model our original question about the inflation rate in Indonesia in 1986. We see here that with the additional context the LLM is able to answer the question with the correct answer (8.8%) whereas the default LLM without augmented context is unable to provide a definitive answer.

Summary

In around 100 lines of code we have augmented a large language model to use external resources when answering user queries. We have seen that with the right additional context LLMs are able to better answer specific technical questions.

This simple code is only possible because of recent improvements in the tooling available for building these models. We can query large models in a few lines of code with the OpenAI API and build a VectorDB quickly with ChromaDB. Recently, OpenAI have even launched a beta for their Assistants API where you can use retrieval without ever leaving the OpenAI ecosystem.

RAG models are a useful tool for building applications with LLMs right now. However, despite the release of GPT-4, which is bigger and better than the GPT-3.5 model used here, hallucinations and the inability to answer factual questions remain a problem. Whether RAG-style approaches will lead to more general AI systems I am less sure; currently I lean more towards scaling self-prediction in end-to-end systems.

Code

A repository containing the minimal RAG model discussed in this post can be found here.

Periodic Autoencoder - Explanation and Addendum

2022-04-09T00:00:00+01:00

In DeepPhase we use a Periodic Autoencoder (PAE) to learn periodic features from data. In particular, we aim to extract a small number of phase values that capture the (non-linear) periodicity of higher dimensional time series data well. If you are solely interested in implementing the Periodic Autoencoder for your use case, look for the supplementary material in the ACM Digital Library. What follows is a brief explanation of how we can extract phase features from data.

Given some number of temporal signals, which are assumed to have some joint periodicity, the Periodic Autoencoder learns a latent space, $\mathbf{L}$, with a few (say 5) latent signals of the same length as the original signals. These latent signals might look something like the signals in the left plot below. For each one of these signals we then aim to extract a good phase offset that captures its current location as part of a larger cycle.

However, any one of these signals (say the blue curve in the right plot) may not have a single obvious frequency, amplitude, or phase offset. So the question becomes, what is a good way to calculate a phase value for such a curve?

The approach taken with the Periodic Autoencoder is to approximate each latent signal with a sinusoidal function $\Gamma(x) = A \sin(2 \pi (Fx - S)) + B$, parameterized by $A$, $F$, $B$ and $S$, and to then use $S$ as the phase offset. The plot below shows the same blue curve from above along with the function $\Gamma(x) = 0.01 \sin(4 \pi x)$. Our aim is to better set the parameters of $\Gamma$ such that the orange curve more closely approximates the blue curve. After calculating these sinusoidal functions for all latent signals the PAE then aims to reconstruct the original data from these functions. This places a big inductive bias on the latent space that assumes a few periodic functions will be sufficient for reconstruction.

So, to find the parameters for $\Gamma$, let’s say this $1D$ signal (blue curve) contains $N$ points over a time window of $T$ seconds, then after applying a real discrete Fourier transform [1] we receive $K+1$ Fourier coefficients $\mathbf{C}=[c_0, c_1, \dots, c_K]$ where $K = \left\lfloor\frac{N}{2}\right\rfloor$. These coefficients correspond to frequency bins centered at $\mathbf{f}=[0, 1/T, 2/T, \dots, K/T]$ Hz (the real DFT does not calculate the negative frequency terms as they are redundant for real-valued signals, returning only a single-sided spectrum).
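
To make the sizes concrete, here is a minimal NumPy sketch of the real DFT. NumPy is used here as a stand-in for the differentiable PyTorch FFT in [1], and the values of N, T and the toy signal are made up for illustration.

```python
import numpy as np

N = 120   # number of samples (assumed value)
T = 2.0   # length of the time window in seconds (assumed value)
K = N // 2

# Toy signal sampled at N evenly spaced points over the window
x = np.linspace(0.0, T, N, endpoint=False)
y = 0.3 * np.sin(2 * np.pi * 1.5 * x) + 0.1

c = np.fft.rfft(y)                # K + 1 complex Fourier coefficients c_0 ... c_K
f = np.fft.rfftfreq(N, d=T / N)   # bin centres 0, 1/T, 2/T, ..., K/T in Hz

print(len(c), f[:3], f[-1])
```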

The magnitudes of the coefficients, $\mathbf{m} = [m_0, m_1, \dots, m_K]$ $= [|c_0|, |c_1|, \dots, |c_K|]$, represent the relative presence of each of the frequency bins in the signal. From this we can calculate the power spectrum (the amount of the signal’s power present in each of the frequency bins) as $\mathbf{p} = [p_0, p_1, \dots, p_K] =\frac{2}{N} [\frac{1}{2} m_0^2, m_1^2, \dots, m_K^2]$. Note that every term except the $m_0$ term is doubled since the real DFT returns the single-sided spectrum and we still wish to account for the power in the double-sided spectrum [2]. Below we show the whole single sided power spectrum in the left plot and the power spectrum without the zero frequency bin in the right plot.
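
The power spectrum formula above can be written as a short function. This is a sketch using NumPy rather than the original implementation, and the function name power_spectrum is my own.

```python
import numpy as np

def power_spectrum(y):
    """Single-sided power spectrum p = (2/N) [m_0^2 / 2, m_1^2, ..., m_K^2]."""
    N = len(y)
    m = np.abs(np.fft.rfft(y))   # magnitudes of the K + 1 coefficients
    p = (2.0 / N) * m ** 2       # double every term ...
    p[0] *= 0.5                  # ... except the zero frequency term
    return p

# Toy check: a unit sine that completes exactly 3 cycles in the window
N = 100
x = np.arange(N) / N
p_sine = power_spectrum(np.sin(2 * np.pi * 3 * x))
print(int(np.argmax(p_sine)))  # index of the bin holding the most power
```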

$c_0$ is the “zero frequency” Fourier coefficient, which is equivalently the sum of the $N$ samples and is always real. Therefore, dividing by the number of samples gives us the mean offset of the signal which we use as the offset for the sinusoidal approximation $\Gamma$:

\[B = \frac{c_0}{N}.\]
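
As a quick numerical check of this identity, the snippet below compares $c_0 / N$ with the mean of a small signal whose values are made up for illustration (NumPy is an assumption here, not the original implementation).

```python
import numpy as np

y = np.array([0.2, 0.5, 0.1, 0.4, 0.3])   # arbitrary toy samples
c = np.fft.rfft(y)

# c_0 is the sum of the samples, so its imaginary part is zero for real input
B = c[0].real / len(y)
print(B, y.mean())
```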

After applying this value of $B$, we see our approximation becomes better aligned along the y-axis.

A good value for the single frequency $F$ is the mean frequency of the overall signal. We can calculate the mean frequency by taking an average of the frequency components weighted by the power with which they appear in the signal [3]. Note that the zero frequency term is not included as this is already captured by $B$.

\[F = \frac{\sum_{j=1}^K \mathbf{f}_j \mathbf{p}_j}{\sum_{j=1}^K \mathbf{p}_j}.\]
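
A minimal sketch of this power-weighted mean frequency, again in NumPy with a function name of my own choosing, looks as follows. The toy signal mixes two equal-amplitude sines at 2 Hz and 4 Hz, so the mean frequency should land halfway between them.

```python
import numpy as np

def mean_frequency(y, T):
    """Power-weighted mean of the non-zero frequency bins."""
    N = len(y)
    f = np.fft.rfftfreq(N, d=T / N)               # bin centres in Hz
    p = (2.0 / N) * np.abs(np.fft.rfft(y)) ** 2   # power in each bin
    # Skip index 0: the zero frequency term is already captured by B
    return np.sum(f[1:] * p[1:]) / np.sum(p[1:])

T, N = 2.0, 200
x = np.linspace(0.0, T, N, endpoint=False)
y_mix = np.sin(2 * np.pi * 2 * x) + np.sin(2 * np.pi * 4 * x) + 0.5
F_mix = mean_frequency(y_mix, T)
print(F_mix)
```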

After applying this $F$ with our already found $B$, we get closer to approximating the signal.

Now that the bias and frequency are accounted for, we aim to set the amplitude parameter $A$. We set $A$ such that the average power of the signal is maintained. Since we create a sine curve with a single frequency $F$, we set the amplitude such that the power in this frequency bin equals the average power of the signal $\frac{\sum_{j=1}^K \mathbf{p}_j}{N}$. For a sinusoid of amplitude $A$, the single-sided power spectrum has height $\left(\frac{A}{\sqrt{2}}\right)^2$ at its frequency [2], so rearranging we find,

\[A = \sqrt{\frac{2}{N}\sum_{j=1}^K \mathbf{p}_j}.\]
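
One way to sanity check this formula is to apply it to a pure sinusoid of known amplitude, where it should return that amplitude. The sketch below does this in NumPy; the function name and the toy values are my own assumptions.

```python
import numpy as np

def amplitude(y):
    """A = sqrt((2/N) * sum_{j>=1} p_j), with p the single-sided power spectrum."""
    N = len(y)
    p = (2.0 / N) * np.abs(np.fft.rfft(y)) ** 2
    return np.sqrt((2.0 / N) * np.sum(p[1:]))   # zero frequency bin excluded

T, N = 2.0, 200
x = np.linspace(0.0, T, N, endpoint=False)
y_toy = 0.7 * np.sin(2 * np.pi * 3 * x) + 0.2   # amplitude 0.7, offset 0.2
A_toy = amplitude(y_toy)
print(A_toy)
```

Note that the offset does not affect the result, since the zero frequency term is left out of the sum.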

Adding this value of $A$ to $\Gamma$, our approximation gets closer again.

Finally we aim to find $S$. For signals that are not exactly periodic over the time window, we will see discontinuities in the phase extracted from the DFT. To avoid discontinuities in the PAE we learn a 2D phase representation, $(s_x, s_y)$, with a small neural network, from which we calculate $S$ as

\[S = \operatorname{arctan2}(s_y, s_x).\]

After including $S$ as the final parameter for $\Gamma(x) = A \sin(2 \pi (Fx - S)) + B$, we can see that we have a reasonable approximation of our original signal.

By making this parameterization process fully differentiable, the PAE encoder is encouraged to extract latent features which remain useful for reconstructing the original data after undergoing this severe dimensionality reduction to 5 parameters ($F$, $A$, $B$, $s_x$, $s_y$) per latent curve. Intuitively, this means the phase representation must capture a lot of “information” about the current state of the original time series data within the available context.

Addendum

In the final figure the phase value was actually calculated with the following equations (thanks to Mario Geiger). You may have some luck replacing the small phase learning network with these operations, but I haven’t had time to check that it is well behaved in practical applications.

\[s_x = \sum_{i=1}^{N}(y_i - B) \cos(2 \pi F x_i)\]

\[s_y = \sum_{i=1}^{N}(y_i - B) \sin(2 \pi F x_i)\]

\[S = \operatorname{arctan2}(s_y, s_x)\]

Here $\{(x_i, y_i)\}$ are the $N$ samples that make up our signal (blue curve). With these values we find our approximation (orange curve) with:

\[A \cos(2 \pi F x - S) + B .\]
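
Putting the pieces together, the following NumPy sketch computes $B$, $F$ and $A$ from the real DFT and $S$ from the addendum equations, then rebuilds the approximation. It is not the script used for the figures: the function names are hypothetical, and the toy signal is an exact cosine chosen so that the fit can be checked against known values.

```python
import numpy as np

def fit_sinusoid(x, y, T):
    """Estimate (A, F, B, S) for samples y taken at times x over a window of T seconds."""
    N = len(y)
    c = np.fft.rfft(y)
    f = np.fft.rfftfreq(N, d=T / N)
    p = (2.0 / N) * np.abs(c) ** 2

    B = c[0].real / N                               # mean offset
    F = np.sum(f[1:] * p[1:]) / np.sum(p[1:])       # power-weighted mean frequency
    A = np.sqrt((2.0 / N) * np.sum(p[1:]))          # amplitude preserving average power

    # Addendum: project the centred signal onto a cosine and a sine at frequency F
    s_x = np.sum((y - B) * np.cos(2 * np.pi * F * x))
    s_y = np.sum((y - B) * np.sin(2 * np.pi * F * x))
    S = np.arctan2(s_y, s_x)
    return A, F, B, S

def approximate(x, A, F, B, S):
    """Evaluate A cos(2 pi F x - S) + B."""
    return A * np.cos(2 * np.pi * F * x - S) + B

T, N = 2.0, 200
x = np.linspace(0.0, T, N, endpoint=False)
y = 0.7 * np.cos(2 * np.pi * 3 * x - 0.8) + 0.2     # toy signal with known parameters

A, F, B, S = fit_sinusoid(x, y, T)
y_hat = approximate(x, A, F, B, S)
print(A, F, B, S)
```

For signals that are not a single exact sinusoid the recovered curve is only an approximation, as in the figures above.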

Code

The figures in this post were generated with this short python script.

There is a basic implementation of the Periodic Autoencoder in the supplementary material files here.

References

[1] PyTorch Differentiable FFT

[2] The fundamentals of FFT-based signal analysis and measurement; Michael Cerna and Audrey F. Harvey; 2000.

[3] The usefulness of mean and median frequencies in electromyography analysis; Angkoon Phinyomark, Sirinee Thongpanja, Huosheng Hu, Pornchai Phukpattaranont and Chusak Limsakul; 2012.