Grounding Transformer Large Language Models with Vector Databases
It’s easy to see the limits of large language models. We’ve all read the news stories: ChatGPT inventing legal cases, determined users crafting prompts that override the guardrails built into model interfaces, and answers seemingly made up out of thin air.
Even in general use, long responses can drift into grammatically correct, plausible-sounding prose that is at best complete nonsense.
Drilling down into how a transformer model works, you can see how and why these models behave the way they do. At heart, they’re a neural network trained on a large database of text. That text isn’t the plain English (and other languages) you might expect.
Instead, it’s broken down into chains of subword tokens that are then converted into “embeddings”: vector representations of text objects in a multidimensional space. Training the model builds a semantic space that reinforces the neural net, giving it a way to probabilistically construct paths through that space, paths that we see as text or images, or even hear as speech.
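As a purely illustrative sketch (not a real model), that token-to-vector mapping can be pictured as a lookup table in Python. The vocabulary, dimensions, and values here are all invented; real models use vocabularies of tens of thousands of tokens and much larger embedding spaces.

```python
import random

random.seed(0)
DIMENSIONS = 4  # toy value; real embedding spaces are far larger, e.g. 1536 for ada-002

# A tiny invented "vocabulary" of subword tokens, each mapped to a vector.
vocabulary = ["ground", "##ing", "trans", "##former", "model", "##s"]
embedding_table = {
    token: [random.uniform(-1, 1) for _ in range(DIMENSIONS)]
    for token in vocabulary
}

# A piece of text becomes a chain of token vectors: the raw material for
# the probabilistic paths the model draws through its semantic space.
tokens = ["ground", "##ing", "trans", "##former", "model", "##s"]
vectors = [embedding_table[t] for t in tokens]
print(len(vectors), len(vectors[0]))  # 6 tokens, each a 4-dimensional vector
```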
The LLM as ‘Text Completer’
The resulting LLM is best thought of as a tool for predicting the most likely chain of tokens to follow on from the tokens in your prompt. It’s an engine that draws a probabilistic path through a multidimensional semantic space, with each point in the path a fresh output token. As the path gets longer, it becomes less related to the initial prompt, allowing the model to generate what have become known as “hallucinations”: outputs that read as semantically plausible but are not grounded in reality.
Output like this doesn’t mean that the underlying LLM is broken; it means the model is working in an unbounded semantic space without a way of keeping its output grounded. That’s why ChatGPT delivers more accurate responses when used with any of a range of plugins, and why Microsoft’s first Copilots operated within a relatively closed domain, using the Codex model with programming languages, or adding more information through integration with a search engine like Bing.
Reducing Error with Restrictions
It’s clear that you can reduce the risks associated with generative AI by reducing the size of the space an LLM can use to generate text. That can mean restricting it to a specific set of texts, for example a catalog or a company document store, or tying its outputs to prompts built from the output of a specific service or application.
Both options require prompts that contain data in a vector format similar to the one used by the LLM, either from a vector database or from a vector search over an existing corpus. The latter option also explains the decisions Microsoft made when it used Bing as the source for its Prometheus model, which wraps GPT-4.
Search engines take advantage of the properties of vector databases to find content that shares similar vector characteristics with your original query: the less similar the vectors, the lower the relevance. Vector search tools use different similarity models to produce results, assigning relevance scores based on the distance between the vectors.
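A minimal Python sketch shows how one common similarity model, cosine similarity, ranks stored content against a query. The documents and their three-dimensional vectors here are hypothetical stand-ins for real embeddings, which would come from a model’s embedding API:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction,
    # values near 0 mean unrelated, negative values mean opposing.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stored embeddings for three support documents.
documents = {
    "returns policy": [0.9, 0.1, 0.0],
    "shipping times": [0.2, 0.9, 0.1],
    "warranty terms": [0.6, 0.5, 0.2],
}
query = [0.85, 0.2, 0.05]  # hypothetical embedding of the user's question

# Rank documents by similarity to the query: less similar, less relevant.
ranked = sorted(
    documents.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
# → ['returns policy', 'warranty terms', 'shipping times']
```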
Grounding AI with Vector Search
So how can you ground an LLM?
Microsoft’s Semantic Kernel tooling provides several good examples of how to add vector search to a model, with a pipeline-based workflow that wraps LLMs like OpenAI’s GPT, transformer models from the open source Hugging Face hub, and Microsoft’s own Azure OpenAI service. It’s best thought of as an extensibility framework for LLMs that lets you use them as part of an application workflow, using familiar programming languages and tools.
Inside Semantic Kernel, vector search and vector databases are used to add what it calls “semantic memory” to an AI application alongside more traditional API calls via connectors to familiar services. While you can use traditional key-value pairs to store, say, catalog data, more complex texts require processing as vector embeddings, stored in vector search index tables in databases that offer this option or in more specialized vector databases.
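The semantic memory pattern can be sketched in a few lines of Python. This is a minimal, illustrative in-memory store, not Semantic Kernel’s actual API; the `toy_embed` function is an invented stand-in for a real embedding model, and a production system would persist vectors in a vector database rather than a Python list.

```python
import math

class SemanticMemory:
    """Minimal in-memory sketch of vector-based 'semantic memory'."""

    def __init__(self, embed):
        self.embed = embed   # function: str -> list[float]
        self.records = []    # list of (text, vector) pairs

    def save(self, text):
        # Store the text alongside its embedding so it can be recalled later.
        self.records.append((text, self.embed(text)))

    def recall(self, query, top_k=1):
        # Embed the query and return the most similar stored texts
        # (assumes no zero-length vectors, for simplicity).
        q = self.embed(query)
        def score(vec):
            dot = sum(x * y for x, y in zip(q, vec))
            return dot / (math.sqrt(sum(x * x for x in q)) *
                          math.sqrt(sum(x * x for x in vec)))
        ranked = sorted(self.records, key=lambda r: score(r[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

# Toy embedding: counts of a few hand-picked keywords, purely illustrative.
def toy_embed(text):
    t = text.lower()
    return [float(t.count(w)) for w in ("refund", "ship", "warranty")]

memory = SemanticMemory(toy_embed)
memory.save("Refund requests are processed within 14 days")
memory.save("We ship orders within two business days")
print(memory.recall("How do I get a refund?"))
# → ['Refund requests are processed within 14 days']
```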
Most LLM platforms provide their own embedding tools to convert strings into embeddings, allowing you to quickly add them as index terms for strings in your own stores. OpenAI currently offers several different embedding models, though it recommends the latest generation, text-embedding-ada-002.
This is more economical as well as more accurate than its predecessors, able to encode around 3,000 pages per dollar spent. As tools like GPT use subword tokens for prompts and embeddings, it’s hard to get an exact cost measure, as the number of tokens per page varies; OpenAI currently estimates that a standard document contains about 800 tokens per page.
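The arithmetic behind those figures can be checked with a quick back-of-the-envelope calculation. The price used here is ada-002’s launch price, an assumption that should be checked against OpenAI’s current pricing page:

```python
# Sanity-check the "~3,000 pages per dollar" figure, assuming ada-002's
# launch price of $0.0004 per 1,000 tokens and OpenAI's rough estimate of
# 800 tokens per standard page. Both numbers may change over time.
price_per_1k_tokens = 0.0004   # US dollars (assumed launch price)
tokens_per_page = 800          # OpenAI's rough per-page estimate

tokens_per_dollar = 1000 / price_per_1k_tokens   # 2.5 million tokens
pages_per_dollar = tokens_per_dollar / tokens_per_page
print(round(pages_per_dollar))  # 3125, i.e. roughly 3,000 pages per dollar
```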
Embeddings are an important tool for working with LLMs, as they can be used for many different purposes, classifying and summarizing texts and managing translations. At the same time, they can be used as a tool for text generation, allowing your applications to generate outputs based on stored content.
This last approach is key to controlling the output of an LLM operation, as it helps constrain output to a set of known texts: for example, your organization’s successful project proposals as submitted to government agencies, or your current support knowledgebase.
Tools like Semantic Kernel provide plugins that simplify connecting to common vector databases and vector search APIs, including Microsoft’s new vector index for its Cosmos DB document database, as well as open source options like PostgreSQL (via the pgvector extension) and Qdrant. There’s even support for working with large data sets using Azure Data Explorer’s Kusto query tools.
Like most machine learning datasets, embeddings work best when made from labeled data, for example a set of key/value pairs in a table, or data that’s self-labeling, like an organization chart or a product catalog.
This makes common business content like a support database and its associated FAQs ideally suited as a data source for embeddings that can drive a self-service support chatbot. Here the LLM provides the semantic structure that wraps around answers, allowing an application’s prompts to add a personality to what would otherwise be very terse, declarative responses, making interactions more user-friendly.
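One way to prepare such FAQ content for embedding is to flatten each key/value pair into a single labeled string. This is a sketch of one common pattern, not a prescribed format; the `faq` data and record layout here are invented for illustration:

```python
# Hypothetical FAQ entries from a support knowledgebase.
faq = {
    "How do I reset my password?":
        "Use the 'Forgot password' link on the sign-in page.",
    "Where is my order?":
        "Tracking links are emailed once an order ships.",
}

# Embed each question+answer pair as a single labeled string, keeping an id
# alongside so search results can be mapped back to the original entry.
records = [
    {"id": i, "question": q, "text": f"Q: {q}\nA: {a}"}
    for i, (q, a) in enumerate(faq.items())
]

for record in records:
    # In a real pipeline this is where you would call an embedding API,
    # e.g. record["vector"] = embed(record["text"]), then write the result
    # to a vector database or vector search index.
    print(record["id"], record["text"].splitlines()[0])
```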
Using Vector Search in Your LLM Application
So how does an application build on this approach?
Once you have created an embedding of your data set, you need to build an application that orchestrates a query flow, starting by vectorizing an initial question. The resulting question vector is used to query your stored embeddings via your vector store’s search APIs. OpenAI recommends using cosine similarity here: because its embeddings are normalized to unit length, cosine similarity reduces to a simple dot product, making this query type fast. The most relevant results are then added to a default prompt template that instructs the LLM to summarize the data and deliver its own response based on it.
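That query flow can be sketched end to end in Python. Here `embed` and `complete` are invented stand-ins for real embedding and completion APIs, and the documents are hypothetical; only the shape of the pipeline matters:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors (assumes non-zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Stand-ins for real services: embed() would call an embedding model such as
# ada-002, and complete() would send the finished prompt to the LLM.
def embed(text):
    t = text.lower()
    return [float(t.count(w)) for w in ("refund", "ship", "return")]

def complete(prompt):
    return f"[LLM response grounded in a prompt of {len(prompt)} chars]"

# 1. Pre-computed embeddings of your own documents (the grounding corpus).
store = [(doc, embed(doc)) for doc in (
    "Refunds and returns are accepted within 30 days.",
    "Orders ship within two business days.",
)]

# 2. Vectorize the question and find the most similar stored document.
question = "What is your refund policy?"
q_vec = embed(question)
best_doc, _ = max(store, key=lambda rec: cosine(q_vec, rec[1]))

# 3. Stuff the retrieved text into a prompt template; the model summarizes
#    the supplied data rather than answering from its training set alone.
prompt = (
    "Answer using only the context below. If the context does not contain "
    f"the answer, say so.\n\nContext: {best_doc}\n\nQuestion: {question}"
)
print(best_doc)
print(complete(prompt))
```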
This way you’re not relying on the LLM’s training data; you’re using your own data to form the content of an answer, with the LLM purely providing a semantically correct wrapper for it. You still have to construct a prompt based on the original query and the embedding data used to provide the answer. Tools like LangChain and Semantic Kernel can provide the appropriate orchestration for the resulting pipeline.
Things get more interesting when your base documents are themselves self-labeled, for example in Microsoft’s Office Open XML or the OpenDocument format used by both OpenOffice and LibreOffice. Embedding these formats brings along a semantic description of the document format, allowing an LLM to recreate document layouts as well as text.
Document automation specialist Docugami transforms documents into its own Document XML Knowledge Graph. The resulting graphs are then stored as vector embeddings in Redis VectorDB, allowing chat sessions through an LLM to extract information from collections of business documents, grounding interactions in your own data.
If you want accurate responses from an LLM, you need to remember that whatever model you’re using, it’s still a foundation model, and like building a home, to get good results you need to build on top of that foundation. That might mean using a tool like Semantic Kernel or LangChain to orchestrate LLMs, vector search, and application APIs, or it might simply mean using a model’s own plugin features.
Grounding an LLM in your own data sources means you can be much more confident in its outputs, especially if your prompts specify what to do when there is no answer to the query.