The Transformative Fusion of Probability and Vector Search

Search is undergoing a transformative shift in which hybrid systems are blending the precision of keywords with the depth of semantic understanding.
Dec 15th, 2023

The digital world, vast and complex like the cosmos, has changed the way we search for information. Gone are the days when searches were limited to just keywords. We’re seeing the rise of search driven by probability and advanced vector search technologies. These new techniques are changing the game, making it easier for us to navigate the overwhelming amount of digital information.

Let’s explore this transformative shift, tracing the journey from traditional keyword-based search to innovative hybrid systems where the precision of keywords meets the depth of semantic understanding. Join us as we dive into how this evolution is breaking the limits of traditional search methods and shaping the future of information retrieval.

The Dawn of Advanced Search: Embracing Probability

In the digital search arena, exact or keyword search has long stood as the fundamental approach, originating from the early days of digital indexing and database management. It revolutionized information retrieval with its ability to directly match user queries using simple algorithms. However, its strict adherence to exact term matching often misses context, semantic meaning and user intent.

As data complexity and user expectations evolved, the limitations of keyword search in understanding the probabilistic nature of language became clear, paving the way for more advanced, context-aware methodologies.

The concept of probability has been integrated into AI research since the 1950s, especially in fields like machine translation. Early research on the probability of word occurrences in texts laid the groundwork for significant advancements in AI. For instance, analysis of extensive English text provided insights into letter frequency distributions.

This shift toward probabilistic analysis represented a departure from the exactness of keyword search, setting the stage for the development of probability search. As researchers delved deeper into the probabilistic aspects of language, it became evident that a new approach was needed to capture the rich semantic relationships inherent in human language. This realization spurred the exploration of semantic models, marking the next step in the evolution of search technologies.

Expanding beyond letters to words and sentences, AI systems, fueled by vast amounts of web data, have learned intricate aspects of vocabulary, grammar, sentence structure and nuanced usage. Statistical analyses of common English text using 2-gram models reveal complex language patterns, establishing the foundation for natural language processing (NLP). NLP is central to probability search, offering nuanced, context-sensitive language understanding.

A heat map of English letter frequency distribution using 2-grams (source: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/)
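
To make the 2-gram idea concrete, here is a minimal Python sketch (not from the article; the function name and sample text are illustrative) that tallies adjacent letter pairs in a string and prints the most frequent ones:

```python
from collections import Counter
import string

def letter_bigrams(text: str) -> Counter:
    """Count adjacent letter pairs (2-grams), ignoring case, punctuation and spaces."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    return Counter(zip(letters, letters[1:]))

sample = "Probability search blends keyword precision with semantic depth."
counts = letter_bigrams(sample)
total = sum(counts.values())
for pair, n in counts.most_common(5):
    print("".join(pair), round(n / total, 3))
```

Run over a large English corpus, the same counts converge toward the stable frequency distribution visualized in the heat map above.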

Neural networks, evolving from AI research, further enhance the processing of probabilistic information. These networks, with interconnected layers fed by massive amounts of data, analyze and learn from information, forming sophisticated probability distributions. Large language models (LLMs), trained on next-word prediction, embody this approach, enabling complex language tasks from text completion to paragraph generation and translation.
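
The same counting idea extends from letters to words. As a toy illustration (the corpus here is invented for the example), even a word-level 2-gram model yields a probability distribution over the next word; LLMs scale this next-word-prediction objective up with neural networks:

```python
from collections import Counter, defaultdict

# A toy corpus; real LLMs learn from vastly larger, web-scale text.
corpus = "the cat sat on the mat and the cat ran".split()

next_word = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    next_word[w1][w2] += 1

# Empirical probability distribution over the word following "the".
counts = next_word["the"]
total = sum(counts.values())
print({w: round(n / total, 2) for w, n in counts.items()})  # {'cat': 0.67, 'mat': 0.33}
```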

Traditional information retrieval systems, focused on keyword matching, often faltered with synonyms, incomplete queries and understanding user intent. In contrast, probability models and neural networks have revolutionized search systems, improving information retrieval quality.

Beyond Basics: The Rise of TF-IDF and BM25 in Search

The journey from basic keyword matching to more sophisticated search models began with the introduction of probability-based models like TF-IDF (term frequency-inverse document frequency) and BM25 (best match 25). These models were revolutionary in assessing the importance of words in documents.

TF-IDF combines two key concepts: “term frequency” (how often a word appears in a document) and “inverse document frequency” (how unique or common a word is across all documents). This combination allows TF-IDF to determine the relevance of words in a document, helping to rank documents more effectively based on query terms.
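
A minimal sketch of that weighting, assuming whitespace tokenization and the classic log(N/df) form of inverse document frequency (the variable names and toy corpus are illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [d.lower().split() for d in [
    "vector search systems",
    "keyword search engines",
    "probability and vector models",
]]
for weighted in tf_idf(docs):
    print({t: round(v, 3) for t, v in weighted.items()})
```

A term that appears in every document gets a weight of zero, which is exactly the intuition: ubiquitous words carry no ranking signal.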

BM25 builds on TF-IDF, adding refinements such as document-length normalization and term-frequency saturation, which damps the effect of over-repeated terms. These refinements made it more effective when comparing documents of different lengths and in situations where certain words appear too often. BM25, therefore, offers a more nuanced approach to understanding document relevance than TF-IDF.
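
Here is a compact sketch of the standard BM25 scoring formula with the usual k1 and b defaults; the tokenization and toy corpus are illustrative:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the classic BM25 formula.
    k1 controls term-frequency saturation; b controls length normalization."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [d.split() for d in ["hybrid vector search", "keyword search engine", "vector database"]]
print(bm25_score("vector search".split(), docs[0], docs))
```

The saturation term means the tenth occurrence of a word adds far less to the score than the first, and the length term keeps long documents from winning on bulk alone.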

However, while these advancements represented a significant leap forward, they were still primarily limited to statistical analysis of text. As the field of search technology continued to evolve, the focus shifted toward understanding the deeper, semantic meanings of words and phrases. This led to exploring and developing deep learning techniques, which sought to capture the intricate and often subtle relationships inherent in natural language. Models such as Word2Vec and BERT, which emerged from this exploration, represented the next frontier in search technology, leveraging the power of neural networks and machine learning to bring an unprecedented depth of understanding to information retrieval.

The Leap into Deep Learning: Word2Vec and BERT’s Impact

The evolution of information retrieval has been significantly influenced by deep learning models like Word2Vec and BERT, each representing a shift toward semantic methods in processing language.

Word2Vec

Developed by Google in 2013, Word2Vec pioneered word embeddings, mapping words into high-dimensional vectors. It offers two architectures: Skip-gram, which predicts the surrounding context for a given word (it works well with smaller datasets and represents rare words well), and Continuous Bag of Words (CBOW), which predicts a word based on its context (faster to train and well suited to larger datasets with frequent words). By optimizing the probability of word contexts, Word2Vec excels in capturing semantic relationships, offering a nuanced understanding of language beyond traditional methods.

2D projection of word vectors generated by Word2Vec
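
A short sketch of training such embeddings with the open source gensim library (assuming gensim is installed; the toy sentences are far too small to yield meaningful vectors and serve only to show the API):

```python
from gensim.models import Word2Vec  # pip install gensim

sentences = [
    ["hybrid", "search", "blends", "keywords", "and", "vectors"],
    ["vector", "search", "captures", "semantic", "meaning"],
    ["keyword", "search", "matches", "exact", "terms"],
]
# sg=1 selects the Skip-gram architecture; sg=0 would use CBOW instead.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1)
print(model.wv.most_similar("search", topn=3))
```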

BERT (Bidirectional Encoder Representations from Transformers)

Launched by Google in 2018, BERT marked a significant advancement in natural language processing. Unlike models that only consider unidirectional context, BERT analyzes text bi-directionally, enabling a deeper understanding of word meaning in context. Its self-attention mechanism effectively captures the relationships between words in a sequence. Based on the Transformer model, BERT’s architecture has excelled in tasks like sentiment analysis and question-answering, enhancing information retrieval by understanding complex language nuances.
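
BERT’s bidirectional, masked-word training can be seen directly with the Hugging Face transformers library (a sketch assuming the library and the public bert-base-uncased checkpoint are available):

```python
from transformers import pipeline  # pip install transformers

# BERT predicts the masked token using context on *both* sides of the mask.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("Hybrid systems combine keyword and [MASK] search."):
    print(prediction["token_str"], round(prediction["score"], 3))
```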

BERT in Information Retrieval

BERT’s application in text retrieval is showcased in techniques like Bi-Encoder and Cross-Encoder. The Bi-Encoder uses two separate BERT models to encode queries and documents, offering fast retrieval but sometimes missing interactive nuances between them. In contrast, the Cross-Encoder uses a single BERT model for concurrent encoding, capturing complex relationships more effectively but with slower retrieval speed.
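
A sketch of both patterns using the sentence-transformers library (assuming it is installed; the checkpoint names are publicly available examples, not models named in the article):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how do hybrid search systems work"
docs = [
    "Hybrid retrieval combines keyword matching with vector search.",
    "BM25 ranks documents using term statistics alone.",
]

# Bi-Encoder: query and documents are encoded independently, then compared.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(docs))
print("bi-encoder:", scores.tolist())

# Cross-Encoder: each (query, document) pair is scored jointly; slower but finer-grained.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder:", cross_encoder.predict([(query, d) for d in docs]).tolist())
```

Because the Bi-Encoder’s document vectors can be computed once and indexed offline, it scales to millions of documents, while the Cross-Encoder must run a forward pass per pair at query time.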

Sentence BERT, a variant of BERT, exemplifies the Bi-Encoder approach, showing significant improvements in semantic retrieval over traditional methods. Performance comparisons on datasets like MS MARCO and NQ illustrate BERT’s effectiveness; it often outperforms models like BM25.

| Data set | BM25 | Dense |
| --- | --- | --- |
| MS MARCO Passage V1 (in-domain) | 0.309 | 0.4402 |
| NQ (dense in-domain, sparse zero-shot) | 0.382 | 0.505 |
| Quora (dense in-domain, sparse zero-shot) | 0.800 | 0.889 |
| HotpotQA (zero-shot) | 0.682 | 0.520 |
| DBPedia (zero-shot) | 0.415 | 0.436 |
| Fever (zero-shot) | 0.689 | 0.540 |
| FiQA (zero-shot) | 0.315 | 0.467 |
| SciFact (zero-shot) | 0.698 | 0.681 |

Performance comparison between BERT dense embeddings and BM25, based on the NDCG@10 metric on the BEIR dataset (a higher value indicates better retrieval performance)

While BERT has made significant strides in capturing diverse information, offering scalability and flexibility, it is not without its challenges. These include its struggle with domain-specific datasets, its handling of less common terms and its need for substantial computational resources. Similarly, Word2Vec, though effective in understanding word contexts, does not fully capture the complexity of sentences or paragraphs, which can limit its applicability in more complex search scenarios.

The limitations of both BERT and Word2Vec — such as the need for extensive domain-specific training, handling of uncommon terms and computational demands — underscore the necessity for a more adaptive and encompassing approach to information retrieval. This growing need paves the way for the development of hybrid search systems. Hybrid systems aim to marry the semantic depth of models like BERT with the precision and efficiency of traditional keyword-based search methods. This combination promises a more balanced and comprehensive solution aptly suited to meet modern information retrieval’s diverse and evolving demands, leading us to the next frontier in search technology: hybrid search systems.

Hybrid Systems: Uniting Semantics and Keywords

Hybrid search systems integrate the strengths of both semantic models and traditional keyword-based retrieval. This innovative approach offers a more versatile and effective solution for information retrieval, addressing the limitations of each method.

The architecture of a hybrid retrieval system

Advantages of Hybrid Systems

  • Versatility: Hybrid systems effectively capture a broad range of information, responding efficiently to diverse queries.
  • Flexibility: These systems allow for fine-tuning and weight adjustments, adapting to different search requirements.
  • Balanced performance: Employing a two-tier approach, hybrid systems use a “retriever” for fast initial results and a “ranker” for in-depth analysis, ensuring both speed and accuracy (see the sketch after this list).
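
A minimal sketch of that two-tier flow; the retrieve and rank callables are hypothetical stand-ins (for example, a BM25 retriever and a cross-encoder ranker):

```python
def two_tier_search(query, corpus, retrieve, rank, k_retrieve=100, k_final=10):
    """Two-tier hybrid search: a fast retriever narrows the corpus to a
    candidate set, then a slower, more accurate ranker reorders it."""
    candidates = retrieve(query, corpus)[:k_retrieve]              # cheap first pass
    reranked = sorted(candidates, key=lambda d: rank(query, d), reverse=True)
    return reranked[:k_final]                                      # best of both tiers
```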

Challenges of Hybrid Systems

Despite their advantages, hybrid systems face several challenges:

  • Complexity: The combination of semantic and keyword approaches increases system complexity.
  • Resource demands: These systems often require more computational power.
  • Tuning and synchronization: Balancing the two methodologies requires careful tuning, and maintaining data consistency can be challenging.

Despite these challenges, hybrid systems represent a significant step forward in the evolution of search technology. They offer a comprehensive approach, combining the best of both worlds: the semantic understanding of AI models and the precision of keyword searches.

Milvus: Pioneering Advances in Data Retrieval

As the landscape of search technology has evolved, from the simplicity of keyword searches to the complexity of semantic and hybrid systems, Milvus stands out as a pioneering solution, adeptly navigating the challenges and opportunities of this progression. This open source, cloud native vector database epitomizes the fusion of probability and vector search, addressing the needs illuminated by the evolution of search methodologies.

Embracing the Hybrid Search Paradigm

  1. Scalability: Milvus excels in scalability, effortlessly managing vector data from millions to billions. This scalability is critical in today’s data-heavy search environments, meeting the demands of modern applications.
  2. High-performance, dense vector queries: Leveraging advanced indexes like HNSW and FAISS, Milvus delivers high-speed, efficient vector retrieval. This capability marks a substantial improvement over traditional search methods, catering to the needs of dynamic and large-scale data sets.
  3. Versatile multivector queries and multipath recall: Milvus supports an array of query types, with sparse vector queries on the roadmap. Its scoring models, from default weight scoring to Reciprocal Rank Fusion (RRF, sketched after this list), and the potential integration of neural ranking with models like BERT’s Cross-Encoder illustrate its alignment with cutting-edge AI developments in search.
  4. Adaptability in scoring models: The flexibility of Milvus in scoring models makes it a robust solution for diverse search scenarios, bridging the gap between classical and modern search approaches.
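
As a reference point, RRF itself fits in a few lines. This sketch (with invented document IDs) shows how result lists from a keyword pass and a vector pass can be fused:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.
    k=60 is the constant commonly used in the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]   # e.g. from a BM25 pass
vector_hits = ["d1", "d5", "d3"]    # e.g. from a dense-vector pass
# Documents ranked well by both passes rise to the top.
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```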

Milvus isn’t just about keeping up with the latest search technology; it’s about pushing boundaries. It addresses the complexity and resource demands of contemporary hybrid systems, offering a solution that balances the precision of keyword search with the context-aware capabilities of AI models.

Conclusion: Envisioning the Future of Search Technology

In reviewing the evolution of search technology, it’s clear that we’ve come a long way from basic keyword search to the advanced territory of semantic and hybrid systems. Each step has been crucial in tackling the growing complexities of data interpretation and retrieval. At this critical point, Milvus stands as a prime example of these technological strides.

As we continue to push the boundaries of this fast-evolving sector, platforms like Milvus are not merely keeping pace with the complexities of modern data retrieval; they’re actively expanding what’s possible in data exploration and retrieval, setting new standards for what we’ll achieve in the future.
