Favorite Social Media Timesink
When you take a break from work, where are you going?
Video clips on TikTok/YouTube
X, Bluesky, Mastodon et al...
Web surfing
I do not get distracted by petty amusements
AI / Large Language Models

Tutorial: Build a Q&A Bot for Academy Awards Based on ChatGPT

This tutorial walks you through a practical example of using Retrieval Augmented Generation with GPT 3.5 to answer questions based on a custom dataset.
Jul 21st, 2023 6:36am by
Featued image for: Tutorial: Build a Q&A Bot for Academy Awards Based on ChatGPT
Feature image by Mirko Fabian from Pixabay.        

In a previous article, I introduced the concept of Retrieval Augmented Generation (RAG), which is used to provide context to Large Language Models (LLMs) to improve the accuracy of the response.

This tutorial walks you through a practical example of using RAG with GPT 3.5 to answer questions based on a custom dataset. Since the training cutoff for GPT 3.5 is 2021, it cannot answer questions based on recent events. We will use a dataset related to Oscar awards to implement RAG and have GPT 3.5 respond to questions about the 95th Academy Awards, which took place in March 2023.

This tutorial assumes that you have an active account with OpenAI and have populated the OPENAI_API_KEY environment variable with your API key.

Step 1 – Preparing the Dataset

Download the Oscar Award dataset from Kaggle and move the CSV file to a subdirectory named data. The dataset has all the categories, nominations, and winners of Academy Awards from 1927 to 2023. I renamed the CSV file to oscars.csv

Start by importing the Pandas library and loading the dataset:

The dataset is well-structured, with column headers and rows that represent the details of each category, including the name of the actor/technician, the film, and whether the nomination was won or lost.

Since we are most interested in awards related to 2023, let’s filter them and create a new Pandas dataframe. At the same time, we will also convert the category to lowercase while dropping the rows where the value of a film is blank. This helps us design contextual prompts sent to GPT 3.5.

With the filtered and cleansed dataset, let’s add a new column to the data frame that has an entire sentence representing a nomination. This complete sentence, when sent to GPT 3.5, enables it to find the facts within the context.

Notice how we concatenate the values to generate a complete sentence. For example, the column ‘text’ in the first two rows of the data frame has the below values:

Austin Butler got nominated under the category, actor in a leading role, for the film Elvis but did not win

Colin Farrell got nominated under the category, actor in a leading role, for the film The Banshees of Inisherin but did not win

Step 2 – Generate the Word Embeddings for the Dataset

Now that we have the text that’s constructed from the dataset let’s convert it into word embeddings. This is a crucial step, as the tokens generated by the embedding model will help us perform a semantic search to retrieve the sentences from the dataset that have similar meanings.

In the above step, we set the embedding model to text-embedding-ada-002 and then use a lambda function to add a new column to the data frame called embedding. This directly maps to the corresponding text in the same row.

Step 3 – Performing a Search to Retrieve Similar Text

With the embeddings generated per row, we can now use a simple technique called cosine similarity to compare two vectors based on their meaning.

Let’s import the modules needed for this step.

We will create a helper function to perform a cosine similarity search. It converts the query into embeddings and then compares it with each embedding available in the data frame. It returns the text along with a score that ranks the similarity. The top_n parameter defines how many sentences are sent.

Let’s test this function by sending the keyword “Lady Gaga.” The goal is to get the top three values from the data frame that has references to the keyword.

Obviously, the first value, with a score of 0.821, comes closest to the search. We can now inject that into our prompt to augment the context.

Step 4 – Construct the Prompt based on RAG

One thing we want to make sure of is that the token size doesn’t exceed the supported context length of the model. For GPT 3.5, the context length is 4K. The below function handles that.

Let’s create helper functions that make it easy to create the prompt by performing the similarity search in the data frame while respecting the token size.

Based on the context that the previous function generated, we will then create a function that calls the OpenAI API.

It’s time to finally ask a question to GPT 3.5 related to the 95th Academy Awards.

Let’s try one more query.

The bot seems to work well even though the model didn’t have knowledge of the recent event.

You can find the entire code below:

In the next part of this tutorial, we will explore how to use a vector database to store, search, and retrieve word embeddings. Stay tuned.

Group Created with Sketch.
TNS owner Insight Partners is an investor in: Pragma.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.