
Retrieval Augmented Generation for LLMs

Retrieval-augmented generation (RAG) is a cutting-edge approach in NLP and AI. Badrul Sarwar, a machine learning scientist, shares his tips.
Dec 13th, 2023 7:55am by Badrul Sarwar

Generative AI (GenAI), powered by advanced neural network architectures and large language models (LLMs), has the remarkable ability to generate coherent and contextually relevant content — including text, images and even music — with minimal human intervention. However, such models suffer from one major limitation: they cannot expand or update their memory and may produce what is known as “hallucinations.” Hallucinations occur when the LLM produces content that sounds plausible but is actually fictional or incorrect, often due to the model extrapolating or confabulating beyond the scope of its training data.

For general-purpose content generation, model hallucination may be a mild annoyance; for an AI assistant or chatbot dealing with enterprise data, however, an inaccurate answer can lead to user frustration and even catastrophic consequences.

Solution: Retrieval Augmented Generation

Retrieval-augmented generation represents a cutting-edge approach to natural language processing and AI. This technique combines elements of both text generation and information retrieval to enhance the quality and relevance of generated content.

By incorporating knowledge and context from external sources or databases, retrieval-augmented generation models can produce text that is more contextually accurate, coherent and informative, and far less prone to hallucination. Most importantly, RAG can harness an application’s internal data to augment an LLM’s knowledge and find the specific answer to a question.

One can think of a general-purpose LLM as taking a closed-book exam: it has memorized knowledge during training and, when asked a question, generates an answer from memory. When a question falls outside the knowledge encoded in its billions of parameters, it tends to fill the gap by confabulating or hallucinating an answer. RAG, by contrast, is like an open-book test: when needed, the system retrieves the relevant knowledge and augments the LLM’s prompt so it can provide a correct answer. A RAG system can also be designed to decline to answer when no relevant contextual information can be found, which addresses the hallucination problem.

RAG Details

At the heart of the RAG system is the retrieval system for additional knowledge. Embeddings or vector representations are used for semantic knowledge retrieval. The following are the main components of a RAG system:

1. Embedding and Similarity Search

All additional documents or knowledge sources are tokenized and embedded in a dense, low-dimensional vector space using a foundational NLP model (e.g., Word2Vec, BERT, GPT, Llama). Embeddings are numerical representations of text that preserve semantic relationships. They have a fixed dimension whose size is dictated by the model that generates them.

With these embeddings, words with similar meanings or contexts are located closer together in the vector space. Given a query, a Maximum Inner Product Search (MIPS) or nearest-neighbor algorithm is typically used to find the documents most semantically similar to it, ranking candidates by dot product, Euclidean distance or cosine similarity.
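As a toy illustration, cosine similarity can be computed directly with NumPy. The 4-dimensional "embeddings" below are made up for the example; real models produce vectors with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 4-dimensional "embeddings"; values are illustrative only.
docs = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}
query = np.array([0.85, 0.85, 0.15, 0.05])  # a pet-like query vector

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda w: cosine_similarity(query, docs[w]),
                reverse=True)
print(ranked)  # the semantically "pet" words rank above "car"
```

The same ranking idea carries over unchanged when the vectors come from a real embedding model; only the dimensionality and scale change.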

2. Managing Embeddings — Vector Databases

For a typical enterprise application, there can be a great number of documents, and storing and searching through the corresponding embeddings can be a daunting task. Imagine a corpus of 1 million documents with 1,000-dimensional embeddings. A MIPS-based top-k nearest-neighbor search would require computing the dot product between the query vector and all 1M document vectors, then selecting the top-k most similar documents: a very compute-intensive task, on the order of a billion multiply-add operations per query.
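That brute-force top-k search is just one matrix-vector product over every stored embedding, which is exactly why it gets expensive at scale. A minimal sketch, with randomly generated vectors standing in for real document embeddings and the corpus scaled down from 1M x 1,000:

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, dim, k = 10_000, 128, 5  # scaled down for the example

doc_vectors = rng.standard_normal((num_docs, dim)).astype("float32")
query = rng.standard_normal(dim).astype("float32")

# Brute-force MIPS: one dot product per stored document ...
scores = doc_vectors @ query          # shape: (num_docs,)
# ... then keep the k highest-scoring documents.
top_k = np.argsort(scores)[::-1][:k]

print(top_k)  # indices of the 5 most similar documents
```

At 1M documents and 1,000 dimensions this loop becomes roughly a billion multiply-adds per query, which is the cost that approximate indexes are designed to avoid.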

Faster approximate nearest neighbor (ANN) algorithms, such as locality-sensitive hashing and others, have been invented to address this. These days, a new class of services called vector databases can help with storage and organization and (most importantly) provide MIPS-based retrieval through simple APIs. Vector databases are specifically designed to operate on vector embeddings. Cloud-based vector DBs such as Pinecone and Milvus (along with managed offerings from AWS) and local vector stores such as FAISS and Chroma are becoming very popular and play the most crucial role in designing RAG systems.

[Figure: Managing Embeddings with Vector Databases. Image via Badrul Sarwar]

3. Augmented Prompt

Once the vector database returns the documents most similar to the question, they are compiled into a single context that supposedly contains enough information to answer it. Finally, a special prompt is created that instructs the LLM to answer the question using only this supplied context. If the quality of retrieval is good (i.e., the context is relevant to the question), the LLM can generate a suitable answer.
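A typical augmented prompt simply concatenates the retrieved passages and the user's question under an instruction to answer only from the supplied context. The template below is a generic sketch; the wording and function name are illustrative, not taken from any particular framework:

```python
def build_rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Compile retrieved passages into one context block and wrap them
    in an instruction that confines the LLM to that context."""
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is our refund window?",
    ["Refunds are accepted within 30 days of purchase.",
     "Shipping is free on orders over $50."],
)
print(prompt)
```

The explicit "I don't know" instruction is what lets the LLM abstain gracefully when retrieval brings back context that does not actually contain the answer.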

The quality of the retrieved context can be controlled by applying similarity thresholds, and the RAG system can decide not to answer a question if the retrieved contextual information is not relevant.
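That gating step can be as simple as filtering retrieved documents by their similarity score against a fixed cutoff. The 0.75 threshold below is an arbitrary illustrative value that would be tuned per application:

```python
def answer_or_abstain(retrieved, threshold=0.75):
    """retrieved: list of (similarity_score, document) pairs.

    Returns the documents relevant enough to send to the LLM, or None
    to signal that the system should decline to answer rather than risk
    a hallucinated response. The 0.75 default is illustrative only."""
    relevant = [doc for score, doc in retrieved if score >= threshold]
    return relevant if relevant else None

print(answer_or_abstain([(0.91, "refund policy"), (0.40, "shipping policy")]))
print(answer_or_abstain([(0.30, "unrelated doc")]))  # abstain: None
```

Returning None (rather than a weakly supported context) is what turns the similarity threshold into the "no relevant context, no answer" behavior described above.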

Benefits of RAGs for Enterprise Applications

Retrieval-augmented generation can be beneficial for enterprise applications in a variety of ways:

  • Reduced hallucination: one of the most important benefits of RAG-based generative applications. For enterprise use cases, it is crucial that the answers the model provides are factually correct and trustworthy; otherwise, the system causes more harm than benefit.
  • Cost savings: there is no need to retrain LLMs as knowledge evolves, and training LLMs is very expensive. As enterprise knowledge grows or changes, RAG can accommodate it by simply generating new embeddings and inserting them into the vector database; similarity search can then retrieve them when building context for the LLM.
  • Tailored experience: ideally, enterprises can train smaller, more tailored foundational LLMs with the help of RAG systems and provide a much better customer experience.
  • Privacy and security maintenance: enterprises can avoid exposing their proprietary data to externally hosted LLMs. The privacy and security of enterprise customer data is one of the most important considerations when it comes to using LLMs. With the help of RAG, enterprises can run smaller but powerful open LLMs and provide a better customer experience without compromising the privacy and security of sensitive data.

Challenges of RAGs

RAG-based applications have challenges, too. The additional vector database adds to cost. With RAG, the prompt to the LLM is augmented with extra information retrieved from the vector DB, which adds to response time. The overall prompt is also much larger, since the question and the contextual information are sent in the same prompt; because LLM providers charge by prompt token count, each question answered becomes more expensive.
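A back-of-the-envelope sketch of that token-cost effect, assuming an illustrative price of $0.01 per 1,000 prompt tokens (not any vendor's actual rate) and made-up token counts for the question and retrieved context:

```python
PRICE_PER_1K_TOKENS = 0.01  # illustrative rate, not a real vendor price

def prompt_cost(tokens: int) -> float:
    """Cost of a prompt at the assumed per-1K-token rate."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS

question_tokens = 30    # the bare question
context_tokens = 2000   # retrieved passages added by RAG

plain = prompt_cost(question_tokens)
augmented = prompt_cost(question_tokens + context_tokens)
print(f"plain: ${plain:.5f}, augmented: ${augmented:.5f}, "
      f"{augmented / plain:.0f}x more expensive")
```

The exact multiplier depends entirely on how much context is retrieved per question, which is why trimming or reranking retrieved passages is a common cost optimization.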
