Combining the Power of Text-Based Keyword and Vector Search
Let’s say that your company wants to build some sort of search system. Some of your engineers prefer full-text search. Others proclaim semantic search as the future. Here’s the good news: You can have both! We call this hybrid search.
Let’s take a high-level look at why hybrid search engines might be the answer to frictionless information retrieval. Let’s go!
What Is Text-Based Keyword Search?
Before we talk about hybrid search, we should talk about the two pieces involved.
Text-based keyword search, which you’ll more commonly see called “full-text search,” means that when a user searches specific text or data for a certain word or phrase, the search returns all results that contain some or all of the words from the user’s query.
For example, when I go to my local library’s website and search “James Patterson,” it shows me that author’s books. What it’s not showing me are similar books by other authors that I might also enjoy, given that I like James Patterson.
Side note: In fact, it looks like my library might even be using exact-match search, which is stricter still. The only results I see match my search query exactly.
In other words, text-based keyword search will deliver more results than exact matching, because it’s not so nitpicky about precision.
What Is Semantic Search?
Full-text search works well if you know not only what you want to find, but also how to describe it. But what if you want to run a search yet can’t think of the proper keywords? Or what if you’re relying on large language models (LLMs) that don’t have the most current information?
How will you ever find what you’re looking for?
In semantic search, data and queries are stored as vectors (also called embeddings), not plain text. Machine learning (ML) models take the source input (meaning any unstructured data, whether that be video, text, images or even audio) and turn it into vector representations.
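To make "vector representations" concrete, here is a minimal sketch. The four-dimensional vectors below are made-up toy values (real embedding models produce hundreds or thousands of dimensions), but the core mechanic is real: semantic search compares the direction of a query vector against document vectors, commonly with cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Score how closely two embedding vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings; a real model would generate these from text.
embeddings = {
    "bicycle": [0.90, 0.10, 0.80, 0.20],
    "bike":    [0.85, 0.15, 0.75, 0.25],
    "toaster": [0.10, 0.90, 0.05, 0.70],
}

query = embeddings["bicycle"]
for word, vec in embeddings.items():
    print(word, round(cosine_similarity(query, vec), 3))
```

Even though "bike" and "bicycle" share no letters beyond coincidence, their vectors sit close together, so "bike" scores far higher than "toaster." That closeness-in-vector-space is what lets semantic search match meaning rather than exact words.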
But what does this even mean?
Let’s do another example. Maybe your partner loves cycling, so you decide to buy them a new bike for their upcoming birthday. But you don’t know anything about bikes. All you can think of is the Schwinn you rode as a child.
So you go on Google, Amazon or another marketplace and search “Schwinn bikes.”
Because a search engine using vector search understands that you’re looking for Schwinn bikes and probably also comparable alternatives, it might additionally show you bikes from other brands like Redline and Retrospec.
A vector-based search algorithm better understands the context of your search queries. Some people call this semantic search. According to Merriam-Webster, “semantic” means “of or relating to meaning in language.”
In other words, a vector-based search system better understands the thought and intent behind search queries, delivering results you might be looking for but don’t know how to ask for. With the help of artificial intelligence, vector search can also surface information beyond what large language models can provide on their own. For context, the limitation here is that an LLM is trained on data only up to a fixed cutoff date, so anything published afterward is invisible to it. Vectors (embeddings) augment this, filling in the gaps where LLMs can’t quite keep up.
It’s almost like a vector search engine can read your mind a little bit.
A Note on Text-Based Keyword vs. Vector Search
You might be wondering how to qualify a text-based search engine versus a vector-based one. The answer is… it’s complicated, because these approaches sit on a spectrum rather than in neat boxes.
More specifically, when we break it down even further, we’re looking at dense and sparse vectors. Think of these vectors as the little worker bees responsible for rendering your search results.
With sparse embeddings, there are fewer worker bees, so the user is going to get fewer search results. On the upside, sparse vectors are typically more efficient, since the search index has less information to evaluate.
Dense vector search, on the other hand, typically gives the user more to work with. In our previous example with Schwinn bikes, dense embeddings might mean that the results display six different brands, as opposed to just one or two.
What does this look like behind the curtain? Sparse vectors are composed mostly of zero values, typically with each dimension tied to a specific term. That’s why, if someone searches for “cat” with sparse vectors, the results might simply contain “cat.”
Conversely, an algorithm with dense vectors might return “cat,” “feline” and “tabby.” With more non-zero values, dense vectors are better able to understand context and return results that don’t exactly match the query but still relate to it somehow.
Sparse and dense vectors help determine both the relevance and abundance of the results.
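Here is a toy sketch of that contrast. The sparse vector maps one dimension per vocabulary term (as in a bag-of-words index), while the dense values are invented for illustration; real dense embeddings are learned by a model.

```python
# Sparse: one dimension per vocabulary term, almost all values zero.
vocabulary = ["cat", "dog", "feline", "tabby", "bike"]

def sparse_vector(word):
    """Build a one-hot, bag-of-words-style sparse vector for a single term."""
    return [1.0 if term == word else 0.0 for term in vocabulary]

def dot(a, b):
    """Dot product: the basic overlap score between two vectors."""
    return sum(x * y for x, y in zip(a, b))

# "cat" and "feline" are different terms, so their sparse vectors never overlap.
print(dot(sparse_vector("cat"), sparse_vector("feline")))  # 0.0

# Dense: every dimension carries a learned value, so related words score high
# even without an exact term match. (Values here are made up for illustration.)
dense = {
    "cat":    [0.90, 0.20, 0.80],
    "feline": [0.85, 0.25, 0.75],
    "bike":   [0.10, 0.90, 0.10],
}
print(dot(dense["cat"], dense["feline"]))  # high overlap
print(dot(dense["cat"], dense["bike"]))    # low overlap
```

The sparse score between “cat” and “feline” is exactly zero, which is why a sparse-only search can miss synonyms, while the dense vectors still recognize the two as neighbors.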
Why Combine Them into a Hybrid Search?
Not sure which approach is best for you? Well, you can have your cake and eat it too. And everybody loves cake!
How exactly do we do this? It’s called hybrid search.
Hybrid search engines combine both search methods to get the best of both worlds, ultimately delivering the results and webpages that users are looking for.
Think about our previous example with bicycles. A text search offers the benefit of more fine-tuned results, but only if you already know you want to see just the pages that match your query word for word.
However, if you’re still shopping around, your query might limit you to one brand, and you might miss out on the perfect bike for your spouse.
On the other hand, the beauty of semantic search is that it goes above and beyond what you type into the search bar. Unfortunately, these search systems tend not to be the most efficient.
With a hybrid search engine, you leverage the strengths of both approaches (including the strengths of sparse and dense vectors) to get hybrid results. Think of a hybrid search engine as something that can cast a very wide net but still target a specific type of fish. It’s the optimal combination.
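How does an engine actually merge two sets of results into one list? One widely used technique is reciprocal rank fusion (RRF), sketched below with hypothetical document IDs; each document earns a score from its rank in each list, and the scores are summed.

```python
def reciprocal_rank_fusion(keyword_ranking, vector_ranking, k=60):
    """Merge two ranked result lists into a single hybrid ranking.

    Each document contributes 1 / (k + rank) per list it appears in;
    k=60 is a common default that keeps any single list from dominating.
    """
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for a "Schwinn bikes" query.
keyword_hits = ["schwinn-cruiser", "schwinn-road", "retrospec-city"]
vector_hits  = ["schwinn-cruiser", "retrospec-city", "redline-bmx"]

print(reciprocal_rank_fusion(keyword_hits, vector_hits))
```

Notice that “schwinn-cruiser” wins because both methods rank it highly, while “redline-bmx” — found only by the semantic side — still makes the final list. That’s the wide net catching the specific fish.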
To achieve the ideal search experience, a hybrid search engine packs the greatest punch.
Hybrid Search Is the Future
Search engines are constantly evolving. If we look at the almighty Google as an example, one thing is for certain: The search engine’s No. 1 priority is to give the people who use it the most relevant results with as little effort on their end as possible.
Combining full-text search with vector search into hybrid search is proving to be the most effective way to do that.
We can all learn something valuable here, and MongoDB is making search more convenient than ever. We love vector search and have even incorporated it into Atlas. This allows you to build intelligent applications powered by semantic search and generative AI. Whether you’re using full-text or semantic search, Atlas Search’s hybrid approach has your back. It lives alongside your data, making it faster and easier to deploy, manage and scale search functionality for your applications.
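As a rough sketch of what this looks like in practice, here is the shape of an Atlas Vector Search aggregation stage. The stage and field names follow the Atlas documentation at the time of writing, but the index name, field path, and query embedding below are placeholders, not a runnable recipe — in a real application the query vector would come from the same embedding model used on your documents.

```python
# Sketch of an Atlas Vector Search stage (field values are placeholders).
vector_stage = {
    "$vectorSearch": {
        "index": "my_vector_index",          # hypothetical index name
        "path": "embedding",                 # field holding document embeddings
        "queryVector": [0.12, -0.04, 0.33],  # toy query embedding
        "numCandidates": 100,                # candidates to consider
        "limit": 10,                         # results to return
    }
}

# With a driver such as PyMongo, this stage would run inside a pipeline:
# results = collection.aggregate([vector_stage])
```

A full-text `$search` stage can live in the same application against the same data, which is the convenience hybrid search on Atlas is pointing at.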
Accessing data and building AI-powered experiences has never been this smooth.