Tutorial: Use Chroma and OpenAI to Build a Custom Q&A Bot

In the last tutorial, we explored Chroma as a vector database to store and retrieve embeddings. Let’s extend the use case to build a Q&A application based on OpenAI and the Retrieval Augmented Generation (RAG) technique.
When we initially built the Q&A Bot for the Academy Awards, we implemented similarity search based on a custom function that calculated the cosine distance between two vectors. We will replace that function with a query to search the collection stored in Chroma.
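For reference, the earlier approach relied on a helper along the lines of the sketch below (a minimal illustration; the exact function from that tutorial may have differed):

import numpy as np

# Cosine distance between two embedding vectors (1 - cosine similarity)
def cosine_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical usage: rank every stored embedding by its distance to the query embedding
# distances = [cosine_distance(query_vector, v) for v in stored_vectors]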
For completeness, we will start by setting up the environment and preparing the dataset, following the same steps as in the previous tutorial.
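If you are starting fresh, the environment needs the openai, chromadb, and pandas packages plus an OpenAI API key. A minimal setup sketch (installation shown as a comment; the key is read from an environment variable):

# Assumes the packages are installed, e.g. pip install openai chromadb pandas
import os
import openai

# The snippets below expect the key in the OPENAI_API_KEY environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]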
Step 1 – Preparing the Dataset
Download the Oscar Award dataset from Kaggle and move the CSV file to a subdirectory named data. The dataset has all the categories, nominations, and winners of Academy Awards from 1927 to 2023. I renamed the CSV file to oscars.csv.
Start by importing the Pandas library and loading the dataset:
import pandas as pd

df = pd.read_csv('./data/oscars.csv')
df.head()
The dataset is well-structured, with column headers and rows that represent the details of each category, including the name of the actor/technician, the film, and whether the nomination was won or lost.
Since we are most interested in the awards for 2023, let’s filter those rows into a new Pandas dataframe. At the same time, we will convert the category to lowercase and drop the rows where the film value is blank. This helps us design contextual prompts to send to GPT-3.5.
df = df.loc[df['year_ceremony'] == 2023]
df = df.dropna(subset=['film'])
df['category'] = df['category'].str.lower()
df.head()
With the filtered and cleansed dataset, let’s add a new column to the dataframe that holds an entire sentence representing each nomination. This complete sentence, when sent to GPT-3.5, enables it to find the facts within the context.
df['text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'
df.head()['text']
Notice how we concatenate the values to generate a complete sentence. For example, the column “text” in the first two rows of the dataframe has the following values:
Austin Butler got nominated under the category, actor in a leading role, for the film Elvis but did not win
Colin Farrell got nominated under the category, actor in a leading role, for the film The Banshees of Inisherin but did not win
Step 2 – Generate and Store the Word Embeddings for the Dataset
Now that we have the text constructed from the dataset, let’s convert it into embeddings and store them in Chroma.
This is a crucial step, as the embeddings generated by the model will let us perform a semantic search to retrieve the sentences from the dataset that are closest in meaning to a query.
import os
import openai
import chromadb
from chromadb.utils import embedding_functions

def text_embedding(text):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response["data"][0]["embedding"]

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002"
)
In the above step, we are pointing Chroma to use OpenAI embeddings by passing the OpenAI API Key and the embedding model.
We can use the text_embedding function to convert the query’s phrase or sentence into the same embedding format that Chroma uses.
We can now create the ChromaDB collection based on the OpenAI embeddings model.
client = chromadb.Client()
collection = client.get_or_create_collection("oscars-2023", embedding_function=openai_ef)
Notice how we are associating the collection with OpenAI by passing the function. This will become the default mechanism to generate the embeddings as data gets ingested.
Let’s convert the text column in the Pandas dataframe into a Python list that can be passed to Chroma. Since each document stored in Chroma also needs an id in string format, we will convert the index column of the dataframe into a list of strings.
docs = df["text"].tolist()
ids = [str(x) for x in df.index.tolist()]
With the documents and IDs populated, we are ready to add them to the collection.
collection.add(
    documents=docs,
    ids=ids
)
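As an optional sanity check (not part of the original flow), Chroma can report how many documents the collection now holds:

# The count should match the number of rows in the filtered 2023 dataframe
print(collection.count())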
Step 3 – Perform a Similarity Search to Augment the Prompt
Let’s first generate the embedding for a query string that retrieves all the nominations in the music category.
vector = text_embedding("Nominations for music")
We can now pass this as the search query to Chroma to retrieve the most relevant documents. The n_results parameter restricts the output to the top 15 matches.
results = collection.query(
    query_embeddings=vector,
    n_results=15,
    include=["documents"]
)
The results dictionary contains a list of the retrieved documents.
Let’s join this list into a single string that can provide context for the prompt.
res = "\n".join(str(item) for item in results['documents'][0])
It’s time to construct the prompt based on the context and send it to OpenAI.
prompt = f'```{res}```Based on the data in ```, answer who won the award for the original song'
messages = [
    {"role": "system", "content": "You answer questions about 95th Oscar awards."},
    {"role": "user", "content": prompt}
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0
)
response_message = response["choices"][0]["message"]["content"]
The response contains the correct answer, derived from the combination of the retrieved context and the prompt.
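Printing the message makes the answer easy to inspect; given that the 2023 data marks “Naatu Naatu” from RRR as the winner in the original song category, the model’s answer should point to that song:

print(response_message)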
This tutorial demonstrated how to leverage a vector database like Chroma to implement Retrieval Augmented Generation (RAG), enhancing the prompt with additional context.
Below is the complete code for you to explore:
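(The listing below is assembled from the snippets in this tutorial; it assumes oscars.csv is stored under ./data and that OPENAI_API_KEY is set in the environment.)

import os
import openai
import pandas as pd
import chromadb
from chromadb.utils import embedding_functions

# Load the dataset and keep only the 2023 ceremony
df = pd.read_csv('./data/oscars.csv')
df = df.loc[df['year_ceremony'] == 2023]
df = df.dropna(subset=['film'])
df['category'] = df['category'].str.lower()

# Build a complete sentence for each nomination
df['text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'

# Helper to embed a query with the same model Chroma uses
def text_embedding(text):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response["data"][0]["embedding"]

# Point Chroma at OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002"
)

client = chromadb.Client()
collection = client.get_or_create_collection("oscars-2023", embedding_function=openai_ef)

# Ingest the documents
docs = df["text"].tolist()
ids = [str(x) for x in df.index.tolist()]
collection.add(documents=docs, ids=ids)

# Retrieve context for the question
vector = text_embedding("Nominations for music")
results = collection.query(query_embeddings=vector, n_results=15, include=["documents"])
res = "\n".join(str(item) for item in results['documents'][0])

# Augment the prompt and ask GPT-3.5
prompt = f'```{res}```Based on the data in ```, answer who won the award for the original song'
messages = [
    {"role": "system", "content": "You answer questions about 95th Oscar awards."},
    {"role": "user", "content": prompt}
]
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0)
print(response["choices"][0]["message"]["content"])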