5 Bottlenecks Impacting RAG Pipeline Efficiency in Production

These are the main potential bottlenecks that negatively impact the performance of RAG pipelines targeting production LLM environments.
Feb 2nd, 2024 9:38am

Retrieval Augmented Generation (RAG) has become a critical component of generative AI applications that are based on large language models. Its primary objective is to enhance the capabilities of general-purpose language models by integrating them with an external information retrieval system. This hybrid approach aims to address the limitations of traditional language models, particularly in handling complex, knowledge-intensive tasks. By doing this, RAG significantly enhances the factual accuracy and reliability of the generated response, especially in situations where precise or up-to-date information is essential.

RAG stands out for its ability to augment the knowledge of language models, enabling them to produce more accurate, context-aware, and reliable outputs. Its applications range from enhancing chatbots to powering sophisticated data analysis tools, making it a key building block for conversational assistants and AI agents.

But let’s take a closer look at the potential bottlenecks that negatively impact the performance of RAG pipelines targeting production environments.

Prompt Template

The prompt template in LLMs plays a pivotal role in determining the model’s response quality. A poorly structured prompt can lead to ambiguous or irrelevant responses.

Every LLM has a well-defined prompt template that becomes the lingua franca of the model. To get the best results from the model, it's important to ensure that the prompt is structured in the same format the model saw during training.

For example, the below template ensures Llama 2 responds appropriately to the prompt.

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

The LLMs from OpenAI use the below format:

{"role": "system", "content": "system_prompt"},
{"role": "user", "content": "user_message"}
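
For illustration, here is a minimal Python sketch of passing that system/user message structure to an OpenAI chat model via the official client (the model name and message contents are placeholders; it assumes the openai package, v1 or later, and an API key in the environment):

# Minimal sketch: sending the system/user message structure to an OpenAI chat model.
# Assumes the `openai` package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "Summarize the key findings of the report."},
    ],
)
print(response.choices[0].message.content)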

LLM Context Length

LLMs have a fixed context window, which limits the amount of information they can consider at one time. The window size is determined by parameters chosen during pre-training. The standard GPT-4 model offers a context window of 8,000 tokens, and there is also an extended version with a 32,000-token context window. Furthermore, OpenAI has introduced the GPT-4 Turbo model, which has a significantly larger context window of 128,000 tokens. Mistral relies on sliding-window attention with a window of roughly 4,000 tokens, which in principle lets it handle sequences longer than the window itself, while Llama 2 has a context window of 4,096 tokens.

Even though some LLMs have a large context window, this does not imply that we can skip some stages of the RAG pipeline and pass the whole context at one time. “Context stuffing,” which involves embedding a large amount of contextual data in the prompt, has been shown to reduce LLM performance. It’s not a good idea to include an entire PDF in the prompt just because the model supports a larger context length.

Ensuring that the combined size of the prompt and context is well within the limits of a reasonable context length ensures a faster and more accurate response.
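
As a simple guard, the prompt and retrieved chunks can be token-counted before the call. The sketch below uses OpenAI's tiktoken tokenizer and an illustrative 8,000-token budget; the budget and response reserve are assumptions to adjust per model:

# Sketch: check that the prompt plus retrieved context fits the model's context window.
# The 8,000-token budget matches standard GPT-4; adjust for other models.
import tiktoken

MAX_CONTEXT_TOKENS = 8000
RESPONSE_BUDGET = 1000  # reserve room for the model's answer

encoding = tiktoken.encoding_for_model("gpt-4")

def fits_in_context(prompt: str, context_chunks: list[str]) -> bool:
    total = len(encoding.encode(prompt)) + sum(
        len(encoding.encode(chunk)) for chunk in context_chunks
    )
    return total + RESPONSE_BUDGET <= MAX_CONTEXT_TOKENS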

Chunking Strategy

Chunking is a technique used to manage long text that exceeds the model’s maximum token limit. Since LLMs can only process a fixed number of tokens at a time based on the context window, chunking involves dividing a longer text into smaller, manageable segments, or “chunks”. Each chunk is processed sequentially, allowing the model to handle extensive data by focusing on one segment at a time.

Chunking is an important step when processing content stored in files such as PDFs and TXT documents, where large texts are divided into smaller, more manageable segments to accommodate the input limitations of embedding models. These models transform text chunks into numerical vectors representing their semantic meanings. This step is critical for ensuring that each text segment retains its contextual relevance and accurately represents semantic content. The generated vectors are then stored in a vector database, allowing for efficient vectorized data handling in applications such as semantic search and content recommendation. Essentially, chunking allows for efficient processing, analysis, and retrieval of large amounts of text data in a context-aware manner, overcoming the limitations of embedding models.

The below list highlights some of the proven chunking strategies for embedding models.

  • Sentence-Based Chunking: This strategy divides text into individual sentences, ensuring that each chunk captures a complete thought or idea; it’s suitable for models focusing on sentence-level semantics.
  • Line-Based Chunking: Text is split into lines, typically used for poetry or scripts, where each line’s structure and rhythm are crucial for understanding.
  • Paragraph-Based Chunking: This approach chunks text by paragraph, ideal for maintaining thematic coherence and context within each block of text.
  • Fixed-Length Token Chunking: Here, text is divided into chunks containing a fixed number of tokens, balancing model input constraints with contextual completeness.
  • Sliding Window Chunking: Involves creating overlapping chunks with a ‘sliding window’ approach, ensuring continuity and context across adjacent chunks, especially beneficial in long texts with complex narratives.

Choosing the right chunking strategy for the text embeddings model and the language model is the most critical aspect of a RAG pipeline.
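
As a simple illustration, the sketch below combines fixed-length and sliding-window chunking using whitespace tokens; a production pipeline would typically use the embedding model's own tokenizer, and the chunk size and overlap here are placeholder values:

# Sketch: fixed-length chunking with a sliding-window overlap.
# Splits on whitespace for simplicity rather than using a real tokenizer.
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks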

Dimensionality of Embedding Models

The dimensionality of embedding models refers to the number of dimensions used to represent text as vectors in a vector space. In natural language processing (NLP), these models — such as word embeddings like Word2Vec, or sentence embeddings from BERT — transform words, phrases, or sentences into numerical vectors. The dimensionality, often ranging from tens to hundreds or even thousands of dimensions, determines the granularity and capacity of the model to capture the semantic and syntactic nuances of the language. Higher-dimensional embeddings can capture more information and subtleties, but they also require more computational resources and can lead to challenges like overfitting in machine learning models.

The dimensionality of embedding models in LLMs affects their ability to capture semantic nuances. Higher dimensionality often means better performance, but at the cost of increased computational resources.

Here is a list of popular text embedding models and their dimensionality:

  • sentence-transformers/all-MiniLM-L6-v2: This model, suitable for general use with lower dimensionality, has a dimensionality of 384. It’s designed for embedding sentences and paragraphs in English text.
  • BAAI/bge-large-en-v1.5: One of the most performant text embedding models with a dimensionality of 1024, which is good for embedding entire sentences and paragraphs.
  • OpenAI text-embedding-3-large: The most recently announced embeddings model from OpenAI comes with an embedding size of 3,072 dimensions. This larger dimensionality allows the model to capture more semantic information and improve the accuracy of downstream tasks.
  • Cohere Embed v3: Cohere’s latest embedding model, Embed v3, offers versions with either 1,024 or 384 dimensions. The model providers claim that it is the most efficient and cost-effective embeddings model.

Balancing the trade-off between performance and computational efficiency (cost) is key. Research is focused on finding the optimal dimensionality that maximizes performance while minimizing resource usage.
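
A quick way to see dimensionality in practice is to embed a sentence and inspect the resulting vector's shape. The sketch below assumes the sentence-transformers package is installed and downloads the all-MiniLM-L6-v2 model on first use:

# Sketch: generate an embedding and inspect its dimensionality.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["RAG grounds LLM answers in external data."])
print(embeddings.shape)  # (1, 384) -- this model produces 384-dimensional vectors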

Similarity Search Algorithm in Vector Databases

The efficiency of similarity search algorithms in vector databases is crucial for tasks like semantic search and document retrieval in RAG.

Optimizing the index and choosing the right algorithms significantly impacts the query processing mechanisms. Some vector databases allow users to choose the metric or algorithm during the creation of the index:

  • Cosine Similarity: This metric measures the cosine of the angle between two vectors, providing a similarity score irrespective of their magnitude. It’s particularly effective in text retrieval applications where the orientation of vectors (indicating the similarity in the direction of their context) is more significant than their magnitude.
  • HNSW (Hierarchical Navigable Small World): A graph-based method, HNSW constructs multi-layered navigable small world graphs, enabling efficient nearest-neighbor searches. It’s known for its high recall and search speed, especially in high-dimensional data spaces.
  • User-Defined Algorithms: Custom algorithms tailored to specific use cases can also be implemented. These can leverage domain-specific insights to optimize search and indexing strategies, offering a tailored approach to the unique requirements of different datasets and applications.

These methods collectively contribute to improved search accuracy and query efficiency in vector databases, catering to diverse requirements across various data types and use cases.
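
To make the cosine-similarity case concrete, the sketch below performs a brute-force top-k search with NumPy; a vector database would replace this linear scan with an index such as HNSW:

# Sketch: brute-force cosine-similarity search over stored vectors.
import numpy as np

def cosine_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = vectors @ query
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar vectors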

Summary

RAG pipeline bottlenecks include prompt template design, context length limitations, chunking strategies, the dimensionality of embedding models, and the algorithms used for similarity searches in vector databases. These challenges affect the effectiveness and efficiency of RAG pipelines, from generating accurate responses to handling large amounts of text and maintaining contextual coherence. Addressing these bottlenecks is critical for improving the performance of LLM-based applications, ensuring they can accurately interpret and generate language responses.
