Using ChatGPT for Questions Specific to Your Company Data
As a child, I was obsessed with Isaac Asimov’s Foundation series of books. The books were based on the premise that when the volume of data becomes extremely large, you can predict the future with a very small statistical margin of error.
In many ways, ChatGPT does precisely that — it predicts the next word and the word after based on what it has learned from a vast corpus of data. As this corpus becomes larger, the margin of error becomes smaller.
But here’s the thing: ChatGPT can only generate text that is similar to the text it was trained on. But what if I want ChatGPT to generate responses based on my data that is not publicly available?
For example, I may work for a company that has volumes of product documentation, code, internal wikis, conversations, feedback from our customers and meeting notes that capture the context of what I want ChatGPT to learn from and respond to. This way, if I ask ChatGPT something, it should generate a response that is customized to my company’s ethos.
As of writing this article, as far as I am aware, there are only two ways of achieving this.
Method 1: Fine-Tune ChatGPT Against Your Dataset
This involves training the large language model (LLM) on data specific to your domain. With ChatGPT, you can only fine-tune GPT-2 and GPT-3 against custom data. OpenAI provides API access to download links for different-sized models, which can be found in their respective repositories.
Once you have downloaded the model, you then need to use TensorFlow, PyTorch or some other relevant library first to define the training parameters and train the model against 80% of your data, using 10% of your data for validation and another 10% for testing.
Keep in mind this method also involves configuring hardware resources such as graphical processing units (GPUs) or tensor processing units (TPUs) for the chosen model. Finally, you can deploy this model into your application using APIs, SDKs, etc.
By now, you have probably realized this method is not for the faint of heart and often requires significant computational resources to pull off, not to mention several trial-and-error iterations.
This brings me to the second method.
Method 2: Prompt Engineering with Your Database
In this method, you store all your relevant company data (a feat in and of itself 🙂) in one single database. Then, when a user puts in a prompt, you match it against your company data in the database, find similar results to the user prompt, modify the prompt and send it over to GPT-4 (or GPT3 if you are still on the waitlist).
Using a database to store and query your custom data can be a very efficient way to use that data for ChatGPT. This is because databases are designed to store and query large amounts of data quickly. In addition, databases can be used to store data in various formats, which means that you can store your custom data in the most convenient format.
Since you are still sending prompts to OpenAPI, one limitation to keep note of with this method is the limit of ~8,000 tokens — or roughly 32,000 characters that you can send to GPT APIs.
In addition, since you are now adding a layer between the user prompt and OpenAI APIs, you also have to be extremely efficient with the search result and prompt engineering, both with accuracy and latency, before you send this prompt to GPT-4.
This brings me to the choice of database and a method to query the data so the database can find relevant data using semantic search in a few milliseconds. The obvious choice for this in my mind is SingleStoreDB. Here is why:
SingleStoreDB is a database that can ingest data row by row while you query it for real-time analytics.
This means you can use SingleStoreDB to store your custom data and query that data in real time. This can be very useful for tasks like answering questions, generating text and translating languages.
SingleStoreDB offers a variety of features that make it ideal for use with ChatGPT. Its features also include:
- The ability to store data in a variety of formats like vectors
- The ability to run native vector functions like
- The ability to use a variety of database query languages including SQL
However, given that we want to do a quick search, we need to store all data as vector embeddings so you can do a semantic search with simply one line of SQL code with SingleStoreDB native vector functions. In plain English, this means storing your data as numbers so when you send a piece of data, it can respond to results that are semantically similar vs. an exact keyword search.
Oh, and as I mentioned earlier, you want this to happen in milliseconds.
Here are the steps for how to do this:
Steps for Using Custom Data with ChatGPT
To use custom data with ChatGPT, you will need to follow the steps below. In our example, we are assuming that the user wants ChatGPT to respond with something that includes all the customer feedback the company has collected and stored for future product development.
1. First, sign up for a free trial with SingleStoreDB cloud and get $500 in credits. Create a workspace and a database.
2. Next, create a table in your database with your schema. In this example, we created a sample table called embeddings with a text column that we want to index:
CREATE TABLE embeddings (id INT AUTO_INCREMENT PRIMARY KEY, text TEXT NOT NULL, vector blob);
3. Create a SingleStore pipeline and bring your data into the table. Within SingleStoreDB, you can do this either by reading a CSV on an S3 Bucket or other supported data sources. Here is a pipeline that reads a CSV to inject data into the table above:
LOAD DATA INFILE '/path/to/embeddings.csv' INTO TABLE embeddings FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES -- To ignore the header row (text);
4. Next, create embeddings for your entries and store them in your table in the vector column. Here, we will create embeddings for an entry in the text column using OpenAI’s embeddings API. For more instructions on how to use OpenAI’s embeddings API, look at our blog, where we perform semantic search:
You can use OpenAI embeddings API to get this
UPDATE embeddings SET vector = JSON_ARRAY_PACK(‘sample vector’)WHERE text=’sample text’
5. After you have added embeddings for each of your entries, run semantic search using our in-built and highly parallelized
DOT_PRODUCT vector function with just one line of SQL:
You can use OpenAI embeddings API to get this
SET @input_vector = 'Replace this with the vector representation of your query';
SELECT review, DOT_PRODUCT(vector, JSON_ARRAY_PACK(input_vector)) AS Score FROM embeddings ORDER BY Score DESC LIMIT 5
6. The example above will return the top five most similar records. You can now extract the content and add it to your prompt to then send to OpenAI for a response custom to your data set.
Using custom data for ChatGPT can be a very effective way to improve the accuracy and speed of ChatGPT’s responses. Following the steps outlined above, you can use custom data to generate a ChatGPT response specific to your domain. In a future article, I will detail a custom ChatGPT bot built with Node that can talk to a custom SingleStore database to generate responses for custom data that was not used for training ChatGPT. Stay tuned!