Tutorial: Build a Q&A Bot for Academy Awards Based on ChatGPT

In a previous article, I introduced the concept of Retrieval Augmented Generation (RAG), which is used to provide context to Large Language Models (LLMs) to improve the accuracy of the response.
This tutorial walks you through a practical example of using RAG with GPT 3.5 to answer questions based on a custom dataset. Since the training cutoff for GPT 3.5 is 2021, it cannot answer questions based on recent events. We will use a dataset related to Oscar awards to implement RAG and have GPT 3.5 respond to questions about the 95th Academy Awards, which took place in March 2023.
This tutorial assumes that you have an active account with OpenAI and have populated the OPENAI_API_KEY environment variable with your API key.
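The pre-1.0 OpenAI Python SDK used throughout this tutorial picks up OPENAI_API_KEY automatically. If you prefer to set the key explicitly in code, a minimal sketch (assuming the variable is already exported in your shell) looks like this:

import os
import openai

# Optional: read the key from the environment and set it explicitly
openai.api_key = os.environ["OPENAI_API_KEY"]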
Step 1 – Preparing the Dataset
Download the Oscar Award dataset from Kaggle and move the CSV file to a subdirectory named data. The dataset has all the categories, nominations, and winners of Academy Awards from 1927 to 2023. I renamed the CSV file to oscars.csv.
Start by importing the Pandas library and loading the dataset:
import pandas as pd

df = pd.read_csv('./data/oscars.csv')
df.head()
The dataset is well-structured, with column headers and rows that represent the details of each category, including the name of the actor/technician, the film, and whether the nomination was won or lost.
Since we are most interested in awards related to 2023, let’s filter them and create a new Pandas dataframe. At the same time, we will also convert the category to lowercase and drop the rows where the film value is blank. This helps us design contextual prompts sent to GPT 3.5.
df = df.loc[df['year_ceremony'] == 2023]
df = df.dropna(subset=['film'])
df['category'] = df['category'].str.lower()
df.head()
With the filtered and cleansed dataset, let’s add a new column to the data frame that has an entire sentence representing a nomination. This complete sentence, when sent to GPT 3.5, enables it to find the facts within the context.
df['text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'
df.head()['text']
Notice how we concatenate the values to generate a complete sentence. For example, the ‘text’ column in the first two rows of the data frame contains the following values:
Austin Butler got nominated under the category, actor in a leading role, for the film Elvis but did not win
Colin Farrell got nominated under the category, actor in a leading role, for the film The Banshees of Inisherin but did not win
Step 2 – Generate the Word Embeddings for the Dataset
Now that we have the text constructed from the dataset, let’s convert it into word embeddings. This is a crucial step, as the vectors generated by the embedding model will help us perform a semantic search to retrieve the sentences from the dataset that are closest in meaning to a query.
import openai

def text_embedding(text: str) -> list[float]:
    # Call the OpenAI Embeddings API for a single string
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response["data"][0]["embedding"]

# Add an 'embedding' column holding the vector for each row's text
df = df.assign(embedding=df["text"].apply(lambda x: text_embedding(x)))
df.head()
In the above step, we set the embedding model to text-embedding-ada-002 and then use a lambda function to add a new column to the data frame called embedding. This directly maps to the corresponding text in the same row.
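As an optional sanity check, you can confirm that each vector produced by text-embedding-ada-002 has 1,536 dimensions:

# Each ada-002 embedding is a list of 1,536 floats
print(len(df["embedding"].iloc[0]))   # expected output: 1536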
Step 3 – Performing a Search to Retrieve Similar Text
With the embeddings generated per row, we can now use a simple technique called cosine similarity to compare two vectors based on their meaning.
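To make the idea concrete, here is a toy example with made-up two-dimensional vectors showing how cosine similarity is computed with SciPy; the real embeddings have far more dimensions:

from scipy import spatial

a = [1.0, 0.0]
b = [0.7, 0.7]
# Cosine similarity = 1 - cosine distance; 1.0 means the vectors point in the same direction
print(1 - spatial.distance.cosine(a, b))   # ~0.707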
Let’s import the modules needed for this step.
import tiktoken
from scipy import spatial
We will create a helper function to perform a cosine similarity search. It converts the query into an embedding and then compares it with each embedding stored in the data frame. It returns the matching text along with a relatedness score, sorted from most to least similar. The top_n parameter defines how many of the top-ranked sentences are returned.
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    # Embed the query with the same model used for the dataset
    query_embedding_response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    # Score every row's embedding against the query embedding
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    # Sort by relatedness, highest first, and return the top_n results
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
Let’s test this function by sending the keyword “Lady Gaga.” The goal is to get the top three values from the data frame that have references to the keyword.
strings, relatednesses = strings_ranked_by_relatedness("Lady Gaga", df, top_n=3)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)
Obviously, the first value, with a score of 0.821, comes closest to the search. We can now inject that into our prompt to augment the context.
Step 4 – Construct the Prompt Based on RAG
One thing we want to make sure of is that the prompt size doesn’t exceed the supported context length of the model. For GPT 3.5, the context length is 4K tokens. The function below counts the tokens in a string so we can stay within that budget.
def num_tokens(text: str) -> int:
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    return len(encoding.encode(text))
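For example, an illustrative call on one of the generated sentences:

# The exact count depends on the tokenizer used by the model
print(num_tokens("Austin Butler got nominated under the category, actor in a leading role, for the film Elvis but did not win"))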
Let’s create a helper function that builds the prompt by performing the similarity search on the data frame while respecting the token budget.
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    # Retrieve the sentences most related to the query
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below content related to the 95th Oscar awards to answer the subsequent question. If the answer cannot be found in the content, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    # Append retrieved sentences until the token budget would be exceeded
    for string in strings:
        next_row = f'\n\nOscar database section:\n"""\n{string}\n"""'
        if num_tokens(message + next_row + question) > token_budget:
            break
        else:
            message += next_row
    return message + question
Based on the context that the previous function generated, we will then create a function that calls the OpenAI API.
def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = "gpt-3.5-turbo",
    token_budget: int = 4096 - 500,  # reserve room in the context window for the answer
    print_message: bool = False,
) -> str:
    # Build the RAG prompt from the most relevant rows of the data frame
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about 95th Oscar awards."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
It’s finally time to ask GPT 3.5 a question related to the 95th Academy Awards.
print(ask('What was the nomination from Lady Gaga for the 95th Oscars?'))
Let’s try one more query.
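For instance, a question like the one below (an illustrative query, not necessarily the exact one from the original run) is also answered from the retrieved context:

# Illustrative follow-up query; any question about the 95th Academy Awards can be asked
print(ask('Did the film Top Gun: Maverick win any award at the 95th Academy Awards?'))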
The bot seems to work well even though the model didn’t have knowledge of the recent event.
In the next part of this tutorial, we will explore how to use a vector database to store, search, and retrieve word embeddings. Stay tuned.