Building GPT Applications on Open Source LangChain, Part 2

This is the second of two articles.
In the previous article, we discussed three considerations for developers when building GPT applications with an open source stack, such as LangChain. Let’s now use LangChain for a practical example where we want to store and analyze PDF documents.
We’ll obtain a PDF document, divide it into smaller parts, save the document text and its vector representations (embeddings*) in a database system, and then query it. We’ll also use a GPT to help answer a question.
*An embedding is a numerical vector representation of a word, phrase or document. These vectors capture semantic meaning in a form that a machine-learning model can work with.
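To make this concrete, here is a toy sketch of how embedding vectors encode similarity. The three-dimensional vectors below are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, but the arithmetic is the same:

```python
# Toy embeddings (invented values): real embeddings have far more
# dimensions, but the similarity arithmetic works the same way.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
bridge = [0.1, 0.0, 0.9]

def dot(a, b):
    # Dot product: a higher value means the vectors point in a
    # more similar direction, i.e. the texts are semantically closer.
    return sum(x * y for x, y in zip(a, b))

print(dot(cat, kitten))  # ~0.74: semantically close
print(dot(cat, bridge))  # ~0.09: semantically distant
```

We’ll use exactly this idea later, when we rank stored document chunks against a question.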
Create a SingleStoreDB Cloud Account
First, sign up for a free SingleStoreDB Cloud account. Once logged in, select CLOUD > Create new workspace group from the left-hand navigation pane. Next, choose Create Workspace and just work through the wizard. Here are the recommended settings for this example:
Create Workspace Group
Workspace Group Name: LangChain Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)
Click Next.
Create Workspace
Workspace Name: langchain-demo
Size: S-00
Click Create Workspace.
Once the workspace is created and available, from the left-hand navigation pane, select DEVELOP > SQL Editor to create a new database, as follows:
CREATE DATABASE IF NOT EXISTS pdf_db;
Create a Notebook
From the left-hand navigation pane, select DEVELOP > Notebooks. In the top right of the web page, select New Notebook > New Notebook, as shown in Figure 1 below.
We’ll call the notebook langchain_demo. Select a Blank notebook template from the available options.
We’ll also select the Connection and Database using the drop-down menus above the notebook, as shown in Figure 2.

Figure 2. Connection and Database
Fill out the Notebook
First, we’ll import some libraries:
```
!pip install langchain --quiet
!pip install openai --quiet
!pip install pdf2image --quiet
!pip install tabulate --quiet
!pip install tiktoken --quiet
!pip install unstructured --quiet
```
Next, we’ll read in a PDF document. This is an article by Neal Leavitt titled “Whatever Happened to Object-Oriented Databases?” OODBs were an emerging technology during the late 1980s and early 1990s. We’ll add leavcom.com to the firewall by selecting the Edit Firewall option in the top right. Once the address has been added to the firewall, we’ll read the PDF file:
```python
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("http://leavcom.com/pdf/DBpdf.pdf")
data = loader.load()
```
We can use LangChain’s OnlinePDFLoader, which makes reading a PDF file easier.
Next, we’ll get some data on the document:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

print(f"You have {len(data)} document(s) in your data")
print(f"There are {len(data[0].page_content)} characters in your document")
```
The output should be:
```
You have 1 document(s) in your data
There are 13040 characters in your document
```
We’ll now split the document into pages of at most 2,000 characters each, giving us seven pages:
```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000, chunk_overlap = 0)
texts = text_splitter.split_documents(data)

print(f"You have {len(texts)} pages")
```
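As a rough sanity check on that number: with no overlap, 13,040 characters split into chunks of at most 2,000 characters gives a minimum of ceil(13040 / 2000) = 7 chunks. The splitter may produce more in practice, since it prefers to break on paragraph and sentence boundaries rather than mid-word:

```python
import math

total_chars = 13040   # from the earlier print statement
chunk_size = 2000     # chunk_size passed to the splitter
chunk_overlap = 0

# With zero overlap, the chunk count is at least the ceiling of the
# ratio; the recursive splitter can emit more, never fewer.
min_chunks = math.ceil(total_chars / chunk_size)
print(min_chunks)  # 7
```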
Next, we’ll create a table to store the text and embeddings. We can do this directly using the %%sql magic command:
```sql
%%sql
USE pdf_db;

DROP TABLE IF EXISTS pdf_docs;

CREATE TABLE IF NOT EXISTS pdf_docs (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
```
To use Python code to connect to our database, we can use the built-in connection_url, as follows:
```python
from sqlalchemy import create_engine

db_connection = create_engine(connection_url)
```
We’ll set our OpenAI API Key:
```python
import openai

openai.api_key = "OpenAI API Key"  # replace with your actual API key
```
and use LangChain’s OpenAIEmbeddings:
```python
from langchain.embeddings import OpenAIEmbeddings

embedder = OpenAIEmbeddings(openai_api_key = openai.api_key)
```
Now we are ready to obtain the vector embeddings and store them in the database system:
```python
db_connection.execute("TRUNCATE TABLE pdf_docs")

for i, document in enumerate(texts):
    text_content = document.page_content
    embedding = embedder.embed_documents([text_content])[0]

    stmt = """
        INSERT INTO pdf_docs (
            id,
            text,
            embedding
        )
        VALUES (
            %s,
            %s,
            JSON_ARRAY_PACK_F32(%s)
        )
    """

    db_connection.execute(stmt, (i+1, text_content, str(embedding)))
```
We first truncate the table to ensure it is empty. We then iterate through the pages of text, obtain the embeddings from OpenAI, and store both the text and the embeddings in the database table.
We can now ask a question, as follows:
```python
query_text = "Will object-oriented databases be commercially successful?"

query_embedding = embedder.embed_documents([query_text])[0]

stmt = """
    SELECT
        text,
        DOT_PRODUCT_F32(JSON_ARRAY_PACK_F32(%s), embedding) AS score
    FROM pdf_docs
    ORDER BY score DESC
    LIMIT 1
"""

results = db_connection.execute(stmt, str(query_embedding))

for row in results:
    print(row[0])
```
Here we convert the question into vector embeddings, compute the DOT_PRODUCT_F32 of the question embedding with each stored embedding, and return only the highest-scoring row.
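The ranking the SQL query performs can be sketched in plain Python. The rows and vectors below are hypothetical stand-ins for the pdf_docs table, purely for illustration:

```python
def dot(a, b):
    # Plain-Python equivalent of DOT_PRODUCT_F32:
    # the sum of element-wise products of two vectors.
    return sum(x * y for x, y in zip(a, b))

# Toy (chunk_text, embedding) rows standing in for the pdf_docs table
rows = [
    ("OO databases target niche markets", [0.8, 0.1, 0.1]),
    ("Relational databases dominate",     [0.2, 0.9, 0.0]),
]

query_embedding = [0.7, 0.2, 0.1]

# Equivalent of ORDER BY score DESC LIMIT 1
best = max(rows, key=lambda r: dot(query_embedding, r[1]))
print(best[0])
```

The database does this at scale, scoring every stored chunk against the question and handing back only the best match.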
Finally, we can use a GPT to provide an answer, based on the earlier question:
```python
prompt = f"The user asked: {query_text}. The most similar text from the document is: {row[0]}"

response = openai.ChatCompletion.create(
    model = "gpt-3.5-turbo",
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print(response['choices'][0]['message']['content'])
```
Here is some example output:
Based on the information provided in the document, it seems that object-oriented databases are not expected to be commercially successful in the near future. While they are gaining some popularity in niche markets such as CAD and telecommunications, relational databases continue to dominate the market and are expected to do so for the foreseeable future. IDC predicts that the growth rate for relational databases will be significantly higher than that of OO databases through 2004. However, OO databases still have their place in certain niche markets.
Summary
In this example, we saw the benefits of LangChain in the application development process. We also saw how easily we can convert documents from one format to another, store the content in a database system, generate vector embeddings and ask questions about the data stored in the database system. We also have the full power of SQL available if we are interested in performing additional query operations on the data.
I will host a workshop on June 22, where we’ll go through building a ChatGPT application using LangChain. I hope you can join; sign up here.