Vector Databases: What Devs Need to Know about How They Work
When we say “database” today we are probably talking about persistent storage, relational tables, and SQL. Rows and columns, and all that stuff. Many of the concepts were designed to pack data into what was, at the time they were created, limited hard disk space. But most of the things we store and search for are still just numbers or strings. And while dealing with strings is clearly a little more complex than dealing with numbers, we generally only need an exact match, or maybe a simply defined fuzzy pattern.
This post looks at the slightly different challenges to traditional tooling that AI brings. The journey starts with a previous attempt to emulate modern AI, by creating a Shakespeare sonnet.
We analyzed a corpus and tried predicting words, a trick played to perfection by ChatGPT. We recorded how far apart words appeared from each other, and used this distance data to guess which words were similar, based on their distances to the same word.
So if we had only two phrases in our corpus, the word following “Beware” could be “the” or “of”. But why couldn’t we produce ChatGPT-level sonnets? Our process was just the equivalent of a couple of dimensions of training data. There was no full model as such, and no neural network.
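As a minimal sketch of that two-phrase case: the corpus below is my own choice of two Shakespeare lines that start with “Beware”, and the function name `words_at_distance` is hypothetical. It only records which words appear a fixed distance after each word, which is roughly the raw material the earlier experiment worked from.

```python
from collections import defaultdict

# Hypothetical two-phrase corpus, chosen only so that "Beware" has two followers.
corpus = [
    "Beware the ides of March",
    "Beware of entrance to a quarrel",
]

def words_at_distance(phrases, distance=1):
    """Map each word to the set of words seen `distance` positions after it."""
    table = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        for i in range(len(words) - distance):
            table[words[i]].add(words[i + distance])
    return table

followers = words_at_distance(corpus)
print(sorted(followers["beware"]))  # ['of', 'the']
```

With only these counts there is nothing to say which candidate is better in context, which is the gap a trained model fills.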
What we did was a somewhat limited attempt to turn words into something numerical, and thus computable. This is largely what a word embedding is. Either way, we end up with an ordered list of numbers, aka a vector.
At school we remember vectors having magnitude and direction, so they could be used to plot an airplane’s course and speed, for example. But a vector can have any number of components, or dimensions, attached to it:
x = (x₁, x₂, x₃, …, xₙ)
Obviously, this can no longer be placed neatly in physical space, though I welcome any n-dimensional beings who happen to be reading this post.
By reading lots of texts and comparing words, vectors can be created that will approximate characteristics like the semantic relationship of the word, definitions, context, etc. For example, reading fantasy literature I might see very similar uses of “King” and “Queen”:
The values here are arbitrary of course. But we can start to think about doing vector maths, and understand how we can navigate with these vectors:
King - Man + Woman = Queen
[5, 3] - [2, 1] + [3, 2] = [6, 4]
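The same arithmetic can be checked in a few lines. This is only a sketch using the arbitrary two-dimensional values from the text; the helper names `add` and `subtract` are mine.

```python
# Toy 2-dimensional embeddings; the values are arbitrary, as in the text.
embeddings = {
    "king": [5, 3],
    "man": [2, 1],
    "woman": [3, 2],
    "queen": [6, 4],
}

def add(a, b):
    """Element-wise sum of two vectors of equal length."""
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    """Element-wise difference of two vectors of equal length."""
    return [x - y for x, y in zip(a, b)]

result = add(subtract(embeddings["king"], embeddings["man"]), embeddings["woman"])
print(result)                         # [6, 4]
print(result == embeddings["queen"])  # True
```

With real embeddings the result rarely lands exactly on another word's vector, so in practice we look for the nearest vector rather than an exact match.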
The trick is to imagine not just two, but a vector of many, many dimensions. The Word2Vec algorithm uses a neural network model to learn word associations like this from a large corpus of text. Once trained, such a model can detect similar words.
Given a large enough dataset, Word2Vec can make strong estimates about a word’s meaning based on its occurrences in the text.
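To make “detect similar words” concrete, here is a hedged sketch that ranks words by cosine similarity. The three-dimensional vectors are invented for illustration and do not come from a trained Word2Vec model, and `most_similar` is a hypothetical helper written for this example rather than a library call.

```python
import math

# Hypothetical 3-dimensional vectors; a trained model would use far more dimensions.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.7, 0.2],
    "prince": [0.8, 0.6, 0.3],
    "apple":  [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, top_n=2):
    """Return the top_n other words ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

print([word for word, _ in most_similar("king")])  # ['queen', 'prince']
```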
Using neural network training methods, we can start to both produce more vectors and improve our model’s ability to predict the next word. The network translates the “lessons” provided by the corpus into a layer within vector space that reliably “predicts” similar examples. You can train on which target word is missing from a set of words (the continuous bag-of-words, or CBOW, approach), or on which words surround a target word (the skip-gram approach).
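The two training setups differ only in what counts as the input and what counts as the answer. The sketch below builds the training examples for each from one sentence; it does not train anything, and the function names and the window size of 1 are assumptions made for illustration.

```python
def cbow_examples(words, window=1):
    """(context words, target) pairs: guess the missing word from its neighbours."""
    examples = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(words, window=1):
    """(target, neighbour) pairs: guess each neighbour from the target word."""
    examples = []
    for context, target in cbow_examples(words, window):
        for neighbour in context:
            examples.append((target, neighbour))
    return examples

sentence = "the king loves the queen".split()
print(cbow_examples(sentence)[1])        # (['the', 'loves'], 'king')
print(skipgram_examples(sentence)[:3])   # [('the', 'king'), ('king', 'the'), ('king', 'loves')]
```

A network trained on either kind of pair ends up with one vector per word as a by-product, and those vectors are the embeddings.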
The common use of Shakespeare shouldn’t be seen as some form of elite validation of the Bard’s ownership of language. It is just a very large set of accurately recorded words that we all recognize as consistent English and within the context of one man’s endeavors. This matters, because whenever he says “King” or “Queen” he retains the same perspective. If he were suddenly talking about chess pieces, then the context for “Knight” would be quite different, however valid.
Any large set of data can be used to extract meaning. For example, we can look at tweets about the latest film “Spider-Man: Across the Spider-Verse,” which has generally been well reviewed by those likely to see it or comment on it:
“That was a beautiful movie.”
“The best animation ever, I’m sorry but it’s true, only 2 or 3 movies are equal to this work of art.”
“It really was peak.”
“..is a film made with LOVE. Every scene, every frame was made with LOVE.”
“We love this film with all our hearts.”
But you can begin to see that millennial mannerisms mixed with Gen Z expressions, while all valid, might cause some problems. The corpus needs to be large enough to contain natural comparisons within the data, so that one type of voice doesn’t become an outlier.
Obviously, if you wanted to train a model for a movie comparison site, these are the embeddings you would want to look at.
Ok, so we now have an idea of what word embeddings are in terms of vectors. Let’s generalize to vector embeddings, and imagine using sentences instead of single words, or pixel values to construct images. As long as we can convert from data items to vectors, the same methods apply.
- Models help generate vector embeddings.
- These models are trained using neural networks.
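The step from word vectors to sentence vectors can be sketched with a deliberately crude baseline: average the vectors of the words in the sentence. This is an assumption chosen for simplicity, not how any particular embedding model works, and the word vectors and the name `sentence_embedding` are made up for the example.

```python
# Hypothetical word vectors, chosen so the arithmetic is easy to follow.
word_vectors = {
    "the":   [1.0, 1.0],
    "king":  [4.0, 2.0],
    "rules": [1.0, 3.0],
}

def sentence_embedding(sentence):
    """Average the vectors of the known words in a sentence."""
    known = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("no known words in sentence")
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

print(sentence_embedding("The king rules"))  # [2.0, 2.0]
```

Whatever the conversion method, the result has the same shape as a word embedding, so everything that follows applies to it unchanged.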
What a Vector Database Does
Unsurprisingly, a vector database deals with vector embeddings. We can already perceive that dealing with vectors is not going to be the same as just dealing with scalar quantities (i.e. just normal numbers that express a value or magnitude).
The queries we deal with in traditional relational tables normally match values in a given row exactly. A vector database interrogates the same space as the model that generated the embeddings, and the aim is usually to find similar vectors. So initially, we add the generated vector embeddings into the database. A query is then turned into a vector by the same model, and the database returns the stored vectors that lie closest to it.
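Stripped of everything that makes a real product useful, the core loop is “store vectors, then rank them against a query vector”. The class below is a hypothetical in-memory sketch of that loop; its name and methods are invented for illustration and are not the API of any actual vector database. It compares the query against every stored vector, which is exact but slow at scale.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Hypothetical brute-force store: id -> embedding."""

    def __init__(self):
        self._items = {}

    def add(self, item_id, vector):
        self._items[item_id] = list(vector)

    def query(self, vector, top_k=2):
        """Return the ids of the top_k stored vectors most similar to `vector`."""
        scored = [(cosine(vector, stored), item_id) for item_id, stored in self._items.items()]
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:top_k]]

store = TinyVectorStore()
store.add("king", [5, 3])
store.add("queen", [6, 4])
store.add("apple", [1, 9])
print(store.query([6, 4]))  # ['queen', 'king']
```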
As the results are not exact matches, there is a natural trade-off between accuracy and speed. And this is where the individual vendors make their pitch. Like traditional databases, there is also some work to be done on indexing vectors for efficiency, and post-processing to impose an order on results.
Indexing is a way to improve efficiency as well as to focus on properties that are relevant in the search, paring down large vectors. Trying to accurately represent something big with a much smaller key is a common strategy in computing; we saw this when looking at hashing.
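One family of indexes that fits the “small key for a big thing” description is locality-sensitive hashing with random hyperplanes. The post does not say which indexing method any given vendor uses, so treat this as one illustrative option; the class name, the four planes and the fixed seed are all assumptions. Each vector is reduced to a short tuple of bits, and only vectors that share the query's bits are considered.

```python
import random
from collections import defaultdict

class HyperplaneIndex:
    """Sketch of a random-hyperplane (locality-sensitive hashing) index."""

    def __init__(self, dims, num_planes=4, seed=0):
        rng = random.Random(seed)  # fixed seed so the sketch is repeatable
        # Each plane is a random direction through the origin.
        self._planes = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(num_planes)]
        self._buckets = defaultdict(list)

    def key(self, vector):
        """One bit per plane: which side of the plane the vector falls on."""
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
            for plane in self._planes
        )

    def add(self, item_id, vector):
        self._buckets[self.key(vector)].append(item_id)

    def candidates(self, vector):
        """Ids sharing the query's key; near neighbours in other buckets are missed."""
        return list(self._buckets.get(self.key(vector), []))

index = HyperplaneIndex(dims=3)
index.add("a", [1.0, 2.0, 3.0])
# Same direction, different length: same side of every plane, so same bucket.
print(index.candidates([2.0, 4.0, 6.0]))  # ['a']
```

The missed neighbours are the accuracy half of the trade-off: more planes mean smaller buckets and faster lookups, but a higher chance that a genuinely close vector sits in a different bucket.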
Working out the meaning of “similar” is clearly an issue when dealing with a bunch of numbers that stand in for something else. Algorithms for this are referred to as similarity measures. Even in a simple vector, like for an airplane, you have to decide whether two planes heading in the same direction but some distance apart are more or less similar than two planes close to each other but with different destinations.
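Two common similarity measures give opposite answers to exactly this kind of question. In the sketch below the three vectors are made-up stand-ins for the planes: Euclidean distance rewards being close together, while cosine similarity rewards pointing the same way regardless of length.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical vectors standing in for the planes in the text.
reference = [1.0, 1.0]
same_heading_far = [10.0, 10.0]   # same direction, far away
other_heading_near = [2.0, 0.0]   # close by, different direction

# Euclidean distance prefers the nearby vector.
print(euclidean_distance(reference, other_heading_near)
      < euclidean_distance(reference, same_heading_far))   # True

# Cosine similarity prefers the vector with the same heading.
print(cosine_similarity(reference, same_heading_far)
      > cosine_similarity(reference, other_heading_near))  # True
```

Neither answer is wrong; the measure has to match what the embedding model was trained to make meaningful.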
Learning from Tradition
The final consideration is leveraging experience from traditional databases, and there are plenty of them to learn from. For fault tolerance and scale, vector databases can use replication and sharding, and they face the same trade-offs between strong and eventual consistency.
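Sharding carries over to similarity search with one twist: a query cannot be routed to a single shard, because the nearest vectors could live anywhere. A rough sketch, assuming a hypothetical `ShardedStore` that places items by a hash of their id, is to ask every shard for its best matches and merge the partial results. Replication and consistency are left out entirely.

```python
import math
import zlib

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ShardedStore:
    """Hypothetical sketch: items are spread over shards by a hash of their id."""

    def __init__(self, num_shards=2):
        self._shards = [{} for _ in range(num_shards)]

    def _shard_for(self, item_id):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return zlib.crc32(item_id.encode("utf-8")) % len(self._shards)

    def add(self, item_id, vector):
        self._shards[self._shard_for(item_id)][item_id] = list(vector)

    def query(self, vector, top_k=2):
        """Scatter the query to every shard, then gather and merge the results."""
        partial = []
        for shard in self._shards:
            scored = sorted(
                ((cosine(vector, stored), item_id) for item_id, stored in shard.items()),
                reverse=True,
            )
            partial.extend(scored[:top_k])  # each shard contributes its own top_k
        partial.sort(reverse=True)
        return [item_id for _, item_id in partial[:top_k]]

sharded = ShardedStore(num_shards=2)
sharded.add("king", [5, 3])
sharded.add("queen", [6, 4])
sharded.add("apple", [1, 9])
print(sharded.query([6, 4]))  # ['queen', 'king']
```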
Common sense suggests that there will be strategic combinations of traditional vendors and niche players, so that these methods can be reliably applied to the new data that the AI explosion will be producing. So a vector database is yet another of the new and strange beasts that should become more familiar as AI continues to be exploited.