Where are you using WebAssembly?
Wasm promises to let developers build once and run anywhere. Are you using it yet?
At work, for production apps
At work, but not for production apps
I don’t use WebAssembly but expect to when the technology matures
I have no plans to use WebAssembly
No plans and I get mad whenever I see the buzzword
AI / Data / Large Language Models

Reduce AI Hallucinations with Retrieval Augmented Generation

This newly devised technique shows promise in increasing the knowledge of LLMs by enabling prompts to be augmented with proprietary data.
Jul 18th, 2023 6:00am by
Featued image for: Reduce AI Hallucinations with Retrieval Augmented Generation
Image via Shutterstock.

In the rapidly evolving world of AI, large language models have come a long way, boasting impressive knowledge of the world around us. Yet LLMs, as intelligent as they are, often struggle to recognize the boundaries of their own knowledge, a shortfall that often leads them to “hallucinate” to fill in the gaps. A newly devised technique, known as retrieval augmented generation (RAG), shows promise in efficiently increasing the knowledge of these LLMs and reducing the impact of hallucination by enabling prompts to be augmented with proprietary data.

Navigating the Knowledge Gap in LLMs

LLMs are computer models capable of comprehending and generating human-like text. They’re the AI behind your digital assistant, autocorrect function and even some of your emails. Their knowledge of the world is often immense, but it isn’t perfect. Just like humans, LLMs can reach the limits of their knowledge but, instead of stopping, they tend to make educated guesses or “hallucinate” to complete the task. This can lead to results that contain inaccurate or misleading information.

In a simple world, the answer would be to provide the model with relevant proprietary information at the exact time it’s needed, right when the query is made. But determining what information is “relevant” isn’t always straightforward and requires an understanding of what the LLM has been asked to accomplish. This is where RAG comes into play.

The Power of Embedding Models and Vector Similarity Search

Embedding models, in the world of AI, act like translators. They transform text documents into a large list of numbers, through a process known as “document encoding.” This list represents the LLM’s internal “understanding” of the document’s meaning. This string of numbers is known as a vector: a numeric representation of the attributes of a piece of data. Each data point is represented as a vector with many numerical values, where each value corresponds to a specific feature or attribute of the data.

While a string of numbers might seem meaningless to the average person, these numbers serve as coordinates in a high-dimensional space. In the same way that latitude and longitude can describe a location in a physical space, this string of numbers describes the original text’s location in semantic space, the space of all possible meanings.

Treating these numbers as coordinates enables us to measure the similarity in meaning between two documents. This measurement is taken as a distance between their respective points in the semantic space. A smaller distance would indicate a greater similarity in meaning, while a larger distance suggests a disparity in content. Consequently, information relevant to a query can be discovered by searching for documents “close to” the query in semantic space. This is the magic of vector similarity search.

The Idea Behind Retrieval Augmented Generation

RAG is a generative AI architecture that applies semantic similarity to automatically discover information relevant to a query.

In a RAG system, your documents are stored in a vector database (DB). Each document is indexed based on a semantic vector produced by an embedding model so that finding documents close to a given query vector can be done quickly. This essentially means that each document is assigned a numerical representation (the vector), which indicates its meaning.

When a query comes in, the same embedding model is used to produce a semantic vector for the query.

The model then retrieves similar documents from the DB using vector search, looking for documents whose vectors are close to the vector of the query.

Once the relevant documents have been retrieved, the query, along with these documents, is used to generate a response from the model. This way, the model doesn’t have to rely solely on its internal knowledge but can access whatever data you provide it at the right time. The model is therefore better equipped to provide more accurate and contextually appropriate responses, by incorporating proprietary data stored in a database that offers vector search as a feature.

There are a handful of so-called “vector databases” available, including DataStax Astra DB, for which vector search is now generally available (read about the news here). The main advantage of a database that enables vector search is speed. Traditional databases have to compare a query to every item in the database. In contrast, integrated vector search enables a form of indexing and includes search algorithms that vastly speed up the process, making it possible to search massive amounts of data in a fraction of the time it would take a standard database.

Fine-tuning can be applied to the query encoder and result generator for optimized performance. Fine-tuning is a process where the model’s parameters are slightly adjusted to better adapt to the specific task at hand.

RAG Versus Fine-Tuning

Fine-tuning offers many benefits for optimizing LLMs. But it’s also got some limitations. For one, it doesn’t allow for dynamic integration of new or proprietary data. The model’s knowledge remains static post-training, leading it to hallucinate when asked about data outside of its training set. RAG, on the other hand, dynamically retrieves and incorporates up-to-date and proprietary data from an external database, mitigating the hallucination issue and providing more contextually accurate responses. RAG gives you query-time control over exactly what information is provided to the model, allowing prompts to be tailored to specific users at the exact time a query is made.

RAG is also more computationally efficient and flexible than fine-tuning. Fine-tuning requires the entire model to be retrained for each dataset update, a time-consuming and resource-intensive task. Conversely, RAG only requires updating the document vectors, enabling easier and more efficient information management. RAG’s modular approach also allows for the fine-tuning of the retrieval mechanism separately, permitting adaptation to different tasks or domains without altering the base language model.

RAG enhances the power and accuracy of large language models, making it a compelling alternative to fine-tuning. In practice, enterprises tend to use RAG more often than fine-tuning.

Changing the Role of LLMs with RAG

Integrating RAG into LLMs doesn’t only improve the accuracy of their responses, but it also maximizes their potential. The process enables LLMs to focus on what they excel at: intelligently generating content from a prompt. The model is no longer the sole source of information because RAG provides it with relevant proprietary knowledge when required, and the corpus of knowledge accessible to the model can be expanded and updated without expensive model-training jobs.

In essence, RAG acts as a bridge, connecting the LLM to a reservoir of knowledge that goes beyond its internal capabilities. As a result, it drastically reduces the LLM’s tendency to “hallucinate” and provides a more accurate and efficient model for users.

DataStax today announced the general availability of vector search capability in Astra DB. Learn about it here

Group Created with Sketch.
TNS owner Insight Partners is an investor in: Pragma.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.