Tutorial: Use Chroma and OpenAI to Build a Custom Q&A Bot

Aug 4th, 2023 8:00am

In the last tutorial, we explored Chroma as a vector database to store and retrieve embeddings. Let’s extend the use case to build a Q&A application based on OpenAI and the Retrieval Augmented Generation (RAG) technique.

When we initially built the Q&A Bot for the Academy Awards, we implemented similarity search based on a custom function that calculated the cosine distance between two vectors. We will replace that function with a query to search the collection stored in Chroma.

For completeness, we will start by setting up the environment and preparing the dataset. These are the same steps covered in the previous tutorial.

Step 1 – Preparing the Dataset

Download the Oscar Award dataset from Kaggle and move the CSV file to a subdirectory named data. The dataset has all the categories, nominations, and winners of Academy Awards from 1927 to 2023. I renamed the CSV file to oscars.csv.

Start by importing the Pandas library and loading the dataset:
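
A minimal sketch of this step (the ./data/oscars.csv path follows the renaming described above):

import pandas as pd

# Load the Academy Awards dataset downloaded from Kaggle
df = pd.read_csv('./data/oscars.csv')
print(df.head())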


The dataset is well-structured, with column headers and rows that represent the details of each category, including the name of the actor/technician, the film, and whether the nomination was won or lost.

Since we are most interested in the awards for 2023, let’s filter them and create a new Pandas data frame. At the same time, we will also convert the category to lowercase and drop the rows where the film value is blank. This helps us design contextual prompts to send to GPT-3.5.
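
A sketch of the filtering, assuming the Kaggle file uses columns named year_ceremony, film, and category:

# Keep only the 2023 ceremony, drop rows without a film, and lowercase the category
df = df.loc[df['year_ceremony'] == 2023].copy()
df = df.dropna(subset=['film'])
df['category'] = df['category'].str.lower()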


With the filtered and cleansed dataset, let’s add a new column to the data frame that has an entire sentence representing a nomination. This complete sentence, when sent to GPT-3.5, enables it to find the facts within the context.
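
One way to build such sentences, assuming name and winner columns; the wording mirrors the examples shown next, and the phrasing for winners is illustrative:

# Default sentence for winning nominations (illustrative wording), overridden for nominations that lost
df['text'] = (df['name'] + ' got nominated under the category, ' + df['category'] +
              ', for the film ' + df['film'] + ' to win the award')
df.loc[df['winner'] == False, 'text'] = (df['name'] + ' got nominated under the category, ' +
                                         df['category'] + ', for the film ' + df['film'] +
                                         ' but did not win')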


Notice how we concatenate the values to generate a complete sentence. For example, the column “text” in the first two rows of the data frame has the below values:

Austin Butler got nominated under the category, actor in a leading role, for the film Elvis but did not win

Colin Farrell got nominated under the category, actor in a leading role, for the film The Banshees of Inisherin but did not win

Step 2 – Generate and Store the Word Embeddings for the Dataset

Now that we have the text that’s constructed from the dataset, let’s convert it into word embeddings and store it in Chroma.

This is a crucial step, as the embeddings generated by the model will help us perform a semantic search to retrieve the sentences from the dataset that have similar meanings.
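
A sketch of this wiring, assuming the text-embedding-ada-002 model and an OPENAI_API_KEY environment variable:

import os
import chromadb
from chromadb.utils import embedding_functions

# Tell Chroma to embed documents with OpenAI by passing the API key and model name
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ['OPENAI_API_KEY'],
    model_name='text-embedding-ada-002'
)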


In the above step, we are pointing Chroma to use OpenAI embeddings by passing the OpenAI API Key and the embedding model.

We can use the text_embedding function to convert the query’s phrase or sentence into the same embedding format that Chroma uses.
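
A possible definition of that helper, using the pre-1.0 openai Python SDK that was current when this tutorial was published:

import os
import openai

openai.api_key = os.environ['OPENAI_API_KEY']

def text_embedding(text):
    # Use the same OpenAI embedding model that the Chroma collection uses
    response = openai.Embedding.create(model='text-embedding-ada-002', input=text)
    return response['data'][0]['embedding']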

We can now create the ChromaDB collection based on the OpenAI embeddings model.
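
For example (the collection name oscars-2023 is arbitrary):

# Create (or reuse) a collection whose default embedding function is the OpenAI one
client = chromadb.Client()
collection = client.get_or_create_collection('oscars-2023', embedding_function=openai_ef)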


Notice how we are associating the collection with OpenAI by passing the function. This will become the default mechanism to generate the embeddings as data gets ingested.

Let’s convert the text column in the Pandas data frame into a Python list that can be passed to Chroma. Since each document stored in Chroma also needs an ID in string format, we will convert the index column of the data frame into a list of strings.
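
A sketch of that conversion:

# Chroma expects plain Python lists: the sentences as documents and string IDs
docs = df['text'].tolist()
ids = [str(i) for i in df.index.tolist()]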


With the documents and IDs in place, we are ready to add them to the collection.
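
For example:

# Ingest the documents; Chroma calls the OpenAI embedding function on each one
collection.add(documents=docs, ids=ids)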

Step 3 – Perform a Similarity Search to Augment the Prompt

Let’s first generate the word embedding for a query string that asks for all the nominations in the music category.
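
A sketch, with an illustrative query string:

# Embed the search phrase with the same model used for the documents
vector = text_embedding('Nominations for music (original song)')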


We can now pass this as the search query to Chroma to retrieve all relevant documents. By setting the n_results parameter, we can restrict the output to 15 documents.
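
For example:

# Retrieve the 15 most similar nomination sentences
results = collection.query(
    query_embeddings=[vector],
    n_results=15,
    include=['documents']
)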


The results dictionary contains the list of retrieved documents.

Let’s convert this list into one string that can provide context to the prompt.
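
For example:

# Flatten the matched documents into a single context string
context = '\n'.join(results['documents'][0])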


It’s time to construct the prompt based on the context and send it to OpenAI.
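
A sketch of the augmented prompt and the chat completion call; the prompt wording and the sample question about the original song are illustrative:

# Ground GPT-3.5 in the retrieved context before asking the question
prompt = f'''Answer the question based strictly on the context below.

Context:
{context}

Question: Who won the award for the original song?'''

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 95th Academy Awards.'},
        {'role': 'user', 'content': prompt}
    ],
    temperature=0
)
print(response['choices'][0]['message']['content'])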



The response includes the correct answer based on the combination of the context and the prompt.

This tutorial demonstrates how to leverage a vector database like Chroma to implement Retrieval Augmented Generation (RAG), enhancing the prompt with additional context.

Below is the complete code for you to explore:
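
An end-to-end sketch that stitches together the snippets above, with the same assumed column names, model names, and sample question:

import os

import chromadb
import openai
import pandas as pd
from chromadb.utils import embedding_functions

openai.api_key = os.environ['OPENAI_API_KEY']

# Step 1 - prepare the dataset
df = pd.read_csv('./data/oscars.csv')
df = df.loc[df['year_ceremony'] == 2023].copy()
df = df.dropna(subset=['film'])
df['category'] = df['category'].str.lower()
df['text'] = (df['name'] + ' got nominated under the category, ' + df['category'] +
              ', for the film ' + df['film'] + ' to win the award')
df.loc[df['winner'] == False, 'text'] = (df['name'] + ' got nominated under the category, ' +
                                         df['category'] + ', for the film ' + df['film'] +
                                         ' but did not win')

# Step 2 - generate and store the embeddings
def text_embedding(text):
    response = openai.Embedding.create(model='text-embedding-ada-002', input=text)
    return response['data'][0]['embedding']

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ['OPENAI_API_KEY'],
    model_name='text-embedding-ada-002'
)
client = chromadb.Client()
collection = client.get_or_create_collection('oscars-2023', embedding_function=openai_ef)
collection.add(documents=df['text'].tolist(), ids=[str(i) for i in df.index.tolist()])

# Step 3 - similarity search to augment the prompt
vector = text_embedding('Nominations for music (original song)')
results = collection.query(query_embeddings=[vector], n_results=15, include=['documents'])
context = '\n'.join(results['documents'][0])

prompt = f'''Answer the question based strictly on the context below.

Context:
{context}

Question: Who won the award for the original song?'''

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 95th Academy Awards.'},
        {'role': 'user', 'content': prompt}
    ],
    temperature=0
)
print(response['choices'][0]['message']['content'])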
