Limited Compute Resources? Low-Parameter RAG Can Help
When building a generative AI application that needs to call a large language model (LLM) multiple times to complete a task, a common problem is that repeated queries to the LLM can be both expensive and unpredictable. Large models such as GPT-3.5/4 are incredibly resource-intensive to train and to run inference with; this is reflected in API charges as well as occasional disruptions to service. ChatGPT was originally released as a research preview and wasn’t intended for production applications. However, its usefulness across a vast array of applications is indisputable, so interest in LLMs has exploded.
Since the inception of ChatGPT, users have been looking for ways around the lack of privacy and the inability to control uptime or inference settings when using GPT. This has led to the popularity of free, public models such as Meta’s Llama 2, and later, the creation of quantized and lower-parameter versions of Llama that can run on consumer hardware. These public models can provide much of the same functionality as GPT for far less computing power, though at the cost of fewer parameters and less verbose outputs.
If your application doesn’t necessarily depend on processing excessively large contexts or producing verbose outputs, then hosting your own inference on instances you control can be a more cost-effective option. And when it comes to real-world applications of retrieval augmented generation (RAG), the cost differences may become even more significant.
I’ll demonstrate a simple method for combining vector stores, lexical search and prompt engineering to perform accurate RAG on commodity hardware. Using this method, you can both reduce the complexity of large volumes of information and make running generative AI applications more accurate, efficient and cost effective at scale. And by using RAG on specific stores of information, you can dramatically reduce hallucinations and create effective and knowledgeable agents from any source material, without needing to pay for third-party APIs.
To get started, you will need either a DataStax Enterprise 7 instance or DataStax Astra DB to store the vectors and text data, as well as an LLM and a sentence transformer model to generate responses and encode your data with vectors. Depending on the complexity of your data or the user prompts, you may also consider combining this with a DataStax Enterprise 6.8 database that can perform Solr searches to match wider ranges of data, which is what I used in this example. DataStax is continually working on improvements to enable all of these operations with a single database, but for now I used two databases.
Solving for Hallucinations
Regardless of which LLM you choose, they all still suffer from hallucinations. For now, that limitation needs to be resolved by feeding truthful information into the context of prompts to the LLM, otherwise known as RAG. The method by which you locate your information and transform it for the prompts completely depends on your data model, but you can find more pertinent information in a more efficient way by using vector databases.
Say for instance that you have a collection of e-books on a subject you’d like to explore, such as how to play Warhammer 40,000. Under normal circumstances, it would take years to read through the supporting literature and gain enough gameplay experience to reach an expert level.
A targeted question such as, “What can you tell me about Morvenn Vahl of Adepta Sororitas?” would be best answered by a veteran player or any employee at a Warhammer store. And while ChatGPT can answer many questions about the game, it unfortunately has no training data that covers this particular character:
Compare this to a Llama 2 13B parameter LLM, hosted on a consumer workstation with an Nvidia RTX A4000 graphics card. Similarly, the model can demonstrate basic knowledge of the Warhammer universe, but because of its tuning, the model doesn’t care that the character isn’t found and instead provides a best-effort hallucination:
If you want to build a chatbot that can help both newcomers and veterans play Warhammer 40,000, then these outputs are unacceptable. To be an effective game guide, the chatbot needs to know the rules of the game, the rules for each unit, some bits of the lore, and some strategy and commentary. Luckily, all of that information on the 10th-edition rules is available for free from Games Workshop and fan websites, and all you need to do is make it searchable to your chatbot app.
Compare this to the same 13B Llama model, where with RAG it is asked to compare a couple of sources on Morvenn Vahl and devise a relevant answer based on the user prompt. This time, the chatbot has access to a search database and a vector database full of all the public information on how to play Warhammer 40,000, 10th Edition:
What a difference! Not only does it find pertinent information on this niche character, but it also keeps its outputs in line with the context of how to play the game with the 10th-edition rules.
The hardest part in all of this is performing an effective search to find the relevant pages to feed into the LLM. This is where vector databases can be particularly useful.
In this example we’ll use DSE 7 and DSE 6.8 running in Docker instances to satisfy the database requirements of the chatbot application, which needs to be able to compare vectors and perform lexical searches. DSE 7 and Astra DB have introduced the ability to store vectors and perform vector searches as well as filtering by text matches. We only need to search a few dozen books for this example, so running DSE instances in Docker will be sufficient for most consumer hardware.
Using vectors in your databases helps you find documents that are similar to a given query, and the vectors can also be used to compare results retrieved from another search. This helps you overcome the limitations of lexical search and improve the effectiveness of your data models.
For instance, content like e-book PDFs can benefit from being encoded with a sentence transformer such as MiniLM, and the resulting vectors can be used to run a similarity comparison between a query and a given source. In this case, a sentence transformer model creates embeddings of each page’s text in an e-book, enabling you to compare each page against the user’s prompt to figure out whether a result is relevant to the query. Relevant pages should contain one or more terms similar to the user’s query, and thus produce better similarity scores from the model’s standpoint.
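As a sketch of that comparison step (in practice the embeddings would come from a sentence transformer model such as all-MiniLM-L6-v2, which produces 384-dimensional vectors; the toy vectors and page names here are purely illustrative):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for MiniLM's 384-dim vectors.
query_vec = [0.9, 0.1, 0.0, 0.2]
pages = {
    "index_card_page": [0.8, 0.2, 0.1, 0.3],   # mentions the character directly
    "unrelated_lore":  [0.1, 0.9, 0.7, 0.0],   # different topic entirely
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in pages.items()}
best = max(scores, key=scores.get)  # the page most similar to the query
```

The page whose embedding points in the most similar direction to the query's embedding wins, regardless of exact word overlap.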
That said, the vectors are best applied as a supplement to an existing lexical search model. If you search by vectors only, then you might end up unexpectedly retrieving unrelated documents and providing them as context where they don’t apply.
In this example, the query “What can you tell me about Morvenn Vahl of Adepta Sororitas?” can be transformed by an LLM to a set of simple search terms:
Morvenn, Vahl, Adepta, Sororitas
The first step in finding relevant documents would be to search documents that contain those basic terms. This can be done by first filtering for text matches in the database to find keywords in the page text matching such a query. The reason for using an LLM to generate keywords is to provide a wider range of possible keywords to search, as it often attempts to add more keywords that are related but are not in the text of the original prompt. Be careful with this, however, as LLMs can also generate special characters and odd sequences that you will need to sanitize.
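A minimal sketch of that sanitization step might look like this (the sample LLM output is illustrative):

```python
import re

def sanitize_keywords(raw_llm_output: str) -> list:
    """Clean a comma-separated keyword string produced by an LLM.

    Strips special characters, dedupes case-insensitively and drops
    empty entries so the terms are safe to use in a lexical search.
    """
    keywords = []
    for term in raw_llm_output.split(","):
        cleaned = re.sub(r"[^A-Za-z0-9 \-']", "", term).strip()
        if cleaned and cleaned.lower() not in (k.lower() for k in keywords):
            keywords.append(cleaned)
    return keywords

# An LLM may wrap its answer in quotes or add stray punctuation.
terms = sanitize_keywords('Morvenn, Vahl, "Adepta", Sororitas!! , Vahl')
```

The cleaned terms can then be fed straight into the lexical search filter.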
Once you have at least one result, you can vectorize the user’s query and compare it to the vectors of the lexical search, creating scores of how relevant each result is. This allows you to check the search results for accuracy to the query and set a threshold for rejecting unrelated results when it comes to finally presenting your results to the LLM.
In this case, the first step should match to pages that specifically show Morvenn Vahl’s index card or gameplay mechanics, because those describe the character’s unit in terms of how it plays in the game. If the page meets a certain relevance threshold to the user query, determined by the application, then it gets summarized and placed in a list of results.
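The thresholding step can be sketched as follows (the threshold value, sample scores and the placeholder summarizer are assumptions; in the real flow the summary would come from an LLM call):

```python
def filter_relevant(results, threshold=0.5):
    """Keep only search results whose similarity score to the user's
    query meets the application-defined relevance threshold.

    `results` is a list of (page_text, score) pairs; `summarize` is a
    stand-in for an LLM summarization call.
    """
    def summarize(text):          # placeholder for an LLM summary call
        return text[:80]

    kept = []
    for text, score in sorted(results, key=lambda r: r[1], reverse=True):
        if score >= threshold:
            kept.append({"summary": summarize(text), "score": score})
    return kept

candidates = [
    ("Morvenn Vahl's index card: unit stats and gameplay mechanics...", 0.91),
    ("General lore about the Imperium of Man...", 0.32),
]
relevant = filter_relevant(candidates)  # only the index card page survives
```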
Finally, the search results can be compiled into a list and fed back to the LLM, where it is asked to use the most relevant contexts to answer the original query. Here is a visualization of the flow:
As you can see, the LLM gets called quite frequently for this flow. The LLM is responsible for transforming the user prompt into keywords, summarizing applicable results and choosing which context best answers a query. Each source to check adds another LLM invocation, which can be quite expensive when making queries to GPT. But if you already have the information you need and just want to summarize it or transform it, then you may not need to use such a large model. In fact, switching to smaller models can provide a number of benefits.
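The flow above can be sketched end to end as a short pipeline (the `llm` and `search` callables here are stubs standing in for the real model invocation and the combined lexical/vector search):

```python
def rag_pipeline(user_query, llm, search):
    """Sketch of the RAG flow: the LLM transforms the prompt into
    keywords, each retrieved source is summarized, and the final call
    answers the query from the collected context.

    `llm(prompt)` and `search(keywords)` are stand-ins for the model
    call and the combined lexical/vector search, respectively.
    """
    keywords = llm(f'Query: "{user_query}"\n'
                   "Give me a minimal, comma-separated list of keywords.")
    sources = search(keywords)                       # lexical + vector search
    summaries = [llm(f"Summarize for the query:\n{s}") for s in sources]
    context = "\n".join(summaries)
    return llm(f'Query: "{user_query}"\n'
               f"Review these search results and use them to answer:\n{context}")

# Stub model and search so the flow can be exercised without a GPU.
answer = rag_pipeline(
    "What can you tell me about Morvenn Vahl?",
    llm=lambda prompt: f"[llm output for: {prompt.splitlines()[0]}]",
    search=lambda kw: ["source page 1", "source page 2"],
)
```

Note that each retrieved source adds one more LLM invocation, which is exactly why a smaller, locally hosted model keeps this flow affordable.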
By using a smaller LLM, you can reduce the computational cost of each query, which can lead to significant savings over time. This can also result in faster response times for your users, which can improve their overall experience. In this example, where RAG is performed using a small LLM and small databases, all hosted on the same GPU instance, it takes about 30 seconds to retrieve 15 sources, analyze them for relevance and provide a final answer. And the shorter the prompts (sources), the faster the outputs can be returned.
Additionally, this method allows for increased security and scalability. With prompt engineering and a pipeline of calls to the LLM, you’re in full control of how the data is accessed and what the users will get in their responses. In terms of resource usage, the example 13B parameter model only consumes a little over 8GB of VRAM, and still provides relevant answers. Depending on needs, this shows potential for running RAG on many other platforms, such as user workstations and mobile devices.
Controlling the Output
Prompt engineering is key to making RAG do exactly what you want. You are in control of how the chatbot interprets the data and the context under which it should be thinking. In this example, we want to ensure that the chatbot knows we are specifically seeking Warhammer information, so we can first ask it to help provide supporting context to the user’s query:
Query: “<user query>”
Give me a minimal, comma-separated list of Warhammer 40K keywords for a search engine. Respond with only the query. Do not use emojis or special characters.
Warhammer 40,000 is full of terms and names that might appear in other, unrelated popular culture, so it is important to set the context of the RAG in the very first query. This context should be something available to select or modify if your application covers multiple contexts, such as if you need to cover multiple editions of the Warhammer game rules or combine with the official lore books.
Note that the user’s query is always encapsulated with quotes for this experiment. This helps the LLM distinguish between the query that it’s attempting to directly answer and the separate prompt-engineering instructions, which it must not directly answer. The question/answer portion of the prompt can be adjusted to fit a particular context, but in most cases all you need to do is inform the LLM what it should and should not directly respond to, and how to respond.
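A minimal helper for this quoting convention might look like the following (the function name is my own; escaping embedded quotes is an extra precaution, not something the article's experiment required):

```python
def build_prompt(user_query: str, instructions: str) -> str:
    """Encapsulate the user's query in quotes so the LLM can tell the
    question apart from the prompt-engineering instructions around it.

    Double quotes inside the query are swapped for single quotes so the
    delimiters stay unambiguous.
    """
    safe_query = user_query.replace('"', "'")
    return f'Query: "{safe_query}"\n{instructions}'

prompt = build_prompt(
    "What can you tell me about Morvenn Vahl of Adepta Sororitas?",
    "Give me a minimal, comma-separated list of Warhammer 40K keywords "
    "for a search engine. Respond with only the query. "
    "Do not use emojis or special characters.",
)
```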
In this case, it’s safe to assume that the LLM does have a general knowledge of the game universe, since the series is reasonably popular and general information is available for free. The output of this first query helps generate some keywords to use in lexical search without us having to build a cheat sheet into our application.
The lexical and vector comparisons can then be performed in the background, and a list of results is compiled for review by the LLM. Because the user’s original prompt is never directly answered with inference at the first step, the LLM only transforms what is found in a search and can easily be stopped from answering queries outside of its guard rails or knowledge base.
If there are relevant results from the search:
Query: “<user query>”
Review these search results and use them to answer the query.
If there are no relevant results from the search:
Query: “<user query>”
Politely tell me you searched but could not find an answer for the query. Answer to the best of your knowledge instead.
For added security, you can completely reject or redirect the request when it cannot be served.
Query: “<user query>”
Politely tell me you searched but could not find an answer for the query. Instruct me to reach out to the Customer Support team for assistance instead.
You can even make the outputs lengthier by asking for more details. As long as you can fit your source material within the context window, the LLM can transform it for you.
Query: “<user query>”
Review these search results and use them to answer the query. Be as detailed as possible and cite the sources.
The LLM has a limited context window and will fail to process exceptionally large pages of text. Consider placing limits on row size so your data is more manageable and easier for the LLM to process. For example, cutting pages into chunks of about 1,000 characters seems to work well, and avoid feeding more than four or five detailed answers into a single prompt.
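The chunking advice above can be sketched in a few lines (the character count and chunk cap mirror the rough figures in the text; treat both as starting points to tune):

```python
def chunk_page(text: str, chunk_size: int = 1000, max_chunks: int = 5):
    """Split a page of text into ~1,000-character chunks, capped so no
    more than four or five detailed sources end up in one prompt."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks[:max_chunks]

page = "x" * 2500          # a stand-in for one e-book page's text
chunks = chunk_page(page)  # three chunks: 1000 + 1000 + 500 characters
```

A production version would split on sentence or paragraph boundaries rather than mid-word, but the size and count limits are the part that keeps the prompt manageable.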
The LLM has no memory of a conversation aside from what you can fit in the context window. It is possible to build a permanent store of conversation data, but it is not possible for an LLM to fit excessively large conversations or detailed context into a prompt; there is an upper limit to what it can transform. This means that no matter what, at a certain point you will notice that the LLM seems to “forget” certain details even when they are provided as context; this is just an inherent limitation of the tool. It is best to rely on it for short conversations only and focus on transforming small amounts of text at a time to minimize hallucinations.
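One common way to live within that limit is to keep only the most recent turns that fit a context budget; the sketch below uses characters as a crude proxy for tokens (the budget value is an assumption to tune per model):

```python
def trim_history(messages, max_chars=4000):
    """Keep only the most recent messages that fit in a rough context
    budget. Older turns are dropped, which is why the LLM appears to
    'forget' details from earlier in a long conversation.
    """
    kept, used = [], 0
    for message in reversed(messages):      # walk newest to oldest
        if used + len(message) > max_chars:
            break
        kept.append(message)
        used += len(message)
    return list(reversed(kept))             # restore chronological order

history = [f"turn {i}: " + "x" * 990 for i in range(10)]  # ~1,000 chars each
window = trim_history(history)  # only the last few turns survive
```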
Randomness in the LLM can be a problem. Testing and tuning will be necessary to determine which prompts work best for your data set and to find which model works best for your use case. In my testing with a 13B parameter model, there was a lot of unpredictability in which search keywords got generated from the first prompt, especially as prompt length increased. For best results, stick to shorter prompts.
In summary, leveraging RAG by combining vector and lexical search models allows for more effective finding and sorting of relevant results and generating agent outputs far less prone to hallucinations. The smaller the searchable context, the more precise and accurate the responses. Constructing your own custom pipeline of LLM calls offers far more flexibility in tuning responses toward your desired level of accuracy and guard rails.
While this approach cannot process excessively large amounts of data within the limited context window, it does offer the ability to create effective assistants on limited knowledge bases, as well as run more concurrent agents on the same or lesser hardware than before. This could open up more possibilities for virtual assistants for applications such as tabletop gaming, or even cover more complex topics for use by government, legal and accounting firms, scientific research, energy and more.
If you’re ready to get started building, you can try Astra DB for free. Create your database and start loading your RAG sources today, with no cloud or database Ops experience required.