Vector Primer: Understand the Lingua Franca of Generative AI
We’re fond of saying that there’s no artificial intelligence without data. But it can’t be just any kind of data. Take large language models, or LLMs: deep learning models, like OpenAI’s GPT-4, that can generate text that’s quite similar to what a human would write.
For LLMs to “understand” words, the words need to be stored as “vectors”: a way of using numbers to capture the meanings and usage patterns of words. Vectors are, you might say, the lingua franca of AI.
Vectors have been around for a while, but the popularity and accessibility of the generative AI interface ChatGPT have made them a hot topic. That’s particularly because the most popular apps that organizations build with these technologies will feed their own private data to LLMs by composing their own vectors.
But how do they work, how are they stored, how do applications search for them and how do they help make AI possible? Let’s dig into vectors, vector search and the kinds of databases that can store and query vectors.
A vector refers to a numeric representation of the attributes of a piece of data. Each data point is represented as a vector with many numerical values, where each value corresponds to a specific feature or attribute of the data.
When you transform data like an image or text into a vector representation, it’s known as “embedding.” The choice of image embeddings for vector search, for example, depends on various factors such as the specific use case, the available resources and the characteristics of the image dataset. In e-commerce or product image search applications, it can be beneficial to use embeddings specifically trained on product images. So-called instance retrieval, on the other hand, involves searching for instances of objects within a larger scene or image.
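To make the idea of an embedding concrete, here is a deliberately tiny sketch. It assumes a hypothetical five-word vocabulary and simply counts words. Real embeddings are produced by trained models and contain hundreds or thousands of decimal values, so treat this only as an illustration of "text in, list of numbers out."

```python
from collections import Counter

# Hypothetical fixed vocabulary; each position in the vector counts one word.
VOCABULARY = ["plant", "water", "sun", "shade", "soil"]


def embed(text: str) -> list[int]:
    """Map text to a vector of word counts over VOCABULARY (a toy embedding)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in VOCABULARY]


vector = embed("Water the plant and keep the plant in shade")
print(vector)  # [2, 1, 0, 1, 0]
```

Each position in the output corresponds to one attribute, here one vocabulary word, which is the same "one value per feature" structure described above.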
Storing data as vector representations enables you to perform various operations and calculations on the data, most importantly search. Selecting the vector attributes is important for the types of questions you’d like to be able to ask later. For example, if you only store information about the colors in an image with plants, you can’t then ask about the care requirements. You’ll only be able to find visually similar plants.
By representing data as vectors, you can leverage mathematical techniques to efficiently search and compare very big datasets without having an exact match. Millions of customer profiles or images or articles that are represented as vectors — a list of numbers that capture each item’s key characteristics — can be combed through very quickly with vector similarity search (or “nearest neighbor search”).
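A nearest neighbor search can be sketched in a few lines. The example below assumes made-up customer profiles with three invented attributes each, and uses straight-line (Euclidean) distance as the measure of closeness. It scans every item, which is fine for four profiles but is not how large-scale systems do it.

```python
import heapq
import math


def nearest_neighbors(query, items, k=2):
    """Return the names of the k items whose vectors are closest to the query."""
    return heapq.nsmallest(k, items, key=lambda name: math.dist(query, items[name]))


# Hypothetical profiles; the three numbers stand in for any learned attributes.
profiles = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.8, 0.2, 0.4],
    "carol": [0.1, 0.9, 0.7],
    "dave": [0.2, 0.8, 0.9],
}

result = nearest_neighbors([0.88, 0.12, 0.3], profiles, k=2)
print(result)  # ['alice', 'bob']
```

Note that the query matches none of the profiles exactly. The search still returns the closest ones, which is the point of searching by similarity.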
Unlike traditional keyword-based search, which matches documents based on the occurrence of specific terms, vector search compares a query with stored items by similarity; for instance, are their semantic meanings similar?
This capability enables finding similar items based on their vector representations. Similarity search algorithms can measure the “distance” or similarity between vectors to determine how closely related they are.
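Two common ways to measure that "distance" are cosine similarity, which compares the direction of two vectors, and Euclidean distance, which measures the straight-line gap between them. The sketch below uses made-up two-dimensional vectors to show that the two measures can disagree about which item is the better match. Which one a given system uses is a design choice the text above does not specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.dist(a, b)


query = [1.0, 2.0]
same_direction = [2.0, 4.0]  # points the same way, but twice as long
nearby = [2.0, 2.0]          # closer in space, but a different direction

print(cosine_similarity(query, same_direction), cosine_similarity(query, nearby))
print(euclidean_distance(query, same_direction), euclidean_distance(query, nearby))
```

By cosine similarity, `same_direction` is the better match. By Euclidean distance, `nearby` wins.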
In recommendation systems, vector search can be used to find the most similar and dissimilar items or users based on their preferences. In image processing, it enables tasks like object recognition and image retrieval. For instance, Google, the world’s largest search engine, relies on vector search to power the backend of Google Image Search, YouTube and other information retrieval services.
Vectors and Databases
There are stand-alone vector search technologies, including the likes of Elasticsearch. But vectors need to be stored in and retrieved from scalable and fast databases to deliver the responsiveness and scale demanded by AI applications. There are a handful of databases today that offer vector search as a feature.
The main advantage of a database that enables vector search is speed. Without a vector index, a database has to compare a query against every stored item to find the closest matches. In contrast, integrated vector search builds an index and uses search algorithms that vastly speed up the process, making it possible to search massive amounts of data in a fraction of the time it would take a standard database.
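As a rough illustration of how an index can avoid scanning everything, the sketch below uses random-hyperplane hashing, one well-known approximate technique. This is an assumption chosen for brevity, not a description of what any particular database does internally. Vectors that point in similar directions tend to land in the same bucket, so a query only needs to examine its own bucket.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, count, seed=0):
    """Generate `count` random hyperplanes (as normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]


def bucket_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes
    )


def build_index(items, hyperplanes):
    """Group item names by bucket so a query can skip most of the data."""
    index = defaultdict(list)
    for name, vector in items.items():
        index[bucket_key(vector, hyperplanes)].append(name)
    return index


def candidates(query, index, hyperplanes):
    """Return only the items that share the query's bucket."""
    return index.get(bucket_key(query, hyperplanes), [])


# Hypothetical data for illustration.
items = {
    "a": [1.0, 0.2, 0.1],
    "b": [0.9, 0.3, 0.2],
    "c": [-1.0, 0.5, -0.7],
}
hyperplanes = make_hyperplanes(dim=3, count=4)
index = build_index(items, hyperplanes)
query = [2 * x for x in items["a"]]  # same direction as "a", different length
print(candidates(query, index, hyperplanes))
```

The trade-off is that the result is approximate: a true neighbor that happens to fall just across a hyperplane can be missed, which is why practical systems typically combine several such tables or use other index structures.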
In a business context, this is extremely valuable when using AI applications to recommend products that are similar to past purchases, or identify fraudulent transactions that resemble known patterns, or anomalies that look dissimilar to the norm.
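The anomaly case can be sketched by measuring how far each item sits from the average of all items. The transaction features below (an amount and an item count) and the threshold are invented for illustration. A real system would use learned embeddings and a more robust definition of "the norm."

```python
import math


def centroid(vectors):
    """Average the vectors position by position."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def find_anomalies(transactions, threshold):
    """Return the names of transactions farther than `threshold` from the centroid."""
    center = centroid(list(transactions.values()))
    return [
        name
        for name, vector in transactions.items()
        if math.dist(vector, center) > threshold
    ]


# Hypothetical transactions as [amount, number_of_items].
transactions = {
    "t1": [20.0, 1.0],
    "t2": [22.0, 1.0],
    "t3": [19.0, 2.0],
    "t4": [500.0, 9.0],
}

print(find_anomalies(transactions, threshold=200.0))  # ['t4']
```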
One example of a database that offers vector search is DataStax’s Astra DB, which is built on the highly scalable, high-throughput, open source Apache Cassandra. Cassandra has already been proven at scale by the likes of Netflix, Uber and Apple for AI applications. The addition of vector search makes Astra DB a one-stop shop for high-scale database operations.
Integrating vector search with a scalable data store like Astra DB enables calculations and ranking directly within the database, eliminating the need to transfer large amounts of data to external systems. This reduces latency and improves overall query performance. Vector search can be combined with other indexes within Astra DB for even more powerful queries.
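The idea of combining vector search with other indexes can be sketched in plain Python. This is not Astra DB's query syntax or API. It is a hypothetical in-memory stand-in that filters rows on ordinary columns first and then ranks the survivors by vector distance, all in one step, so nothing has to be shipped to an external system.

```python
import math

# Hypothetical product rows: ordinary columns plus a vector column.
products = [
    {"name": "fern", "category": "plant", "in_stock": True, "vector": [0.9, 0.1]},
    {"name": "cactus", "category": "plant", "in_stock": False, "vector": [0.85, 0.15]},
    {"name": "ivy", "category": "plant", "in_stock": True, "vector": [0.6, 0.4]},
    {"name": "lamp", "category": "decor", "in_stock": True, "vector": [0.88, 0.12]},
]


def search(query_vector, rows, category, limit=2):
    """Filter on regular columns, then rank the matches by vector distance."""
    matches = [r for r in rows if r["category"] == category and r["in_stock"]]
    matches.sort(key=lambda r: math.dist(query_vector, r["vector"]))
    return [r["name"] for r in matches[:limit]]


print(search([0.9, 0.1], products, category="plant"))  # ['fern', 'ivy']
```

The cactus and the lamp are both very close to the query vector, but the filter removes them before ranking, which is what makes the combined query more useful than similarity alone.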
The Growing Importance of Vector Search
Vectors and the databases that store them play a big role in enabling efficient search, similarity calculations and data exploration in the field of AI. As organizations scale their generative AI efforts and look to customize the end-user experience with their data, vector representations and the ability to work with scalable, fast databases that are vector-search enabled will become increasingly critical.
Learn more about vector search at Agent X: Architecture for Generative AI, a free virtual event on July 11.