Exploring Chroma: The Open Source Vector Database for LLMs

The rise of large language models has accelerated the adoption of vector databases that store word embeddings.
A vector database stores data as numeric vectors, typically embeddings produced by machine learning models. It enables efficient similarity search, which is crucial for AI applications, including recommendation systems, image recognition, and natural language processing.
Each data point stored in a vector database is represented as a multidimensional vector, so that items with similar meaning lie close together. Indexing methods, like k-d trees or hashing, facilitate quick retrieval of similar vectors without comparing the query against every stored vector. This architecture scales to large datasets, which makes it a good fit for data-heavy industries.
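To make the idea of similarity search concrete, here is a minimal sketch that ranks a few toy vectors by cosine similarity with a brute-force scan. The three-dimensional vectors, the labels, and the helper names (cosine_similarity, nearest) are made up for illustration; a real vector database works on vectors with hundreds of dimensions and replaces the linear scan with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda label: cosine_similarity(query, vectors[label]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings, not produced by any real model.
toy_vectors = {
    "cookies": [0.9, 0.1, 0.0],
    "cake": [0.8, 0.2, 0.1],
    "election": [0.0, 0.1, 0.9],
}

top = nearest([1.0, 0.0, 0.0], toy_vectors, k=2)
print(top)
```

The scan touches every stored vector, which is fine for three items but is exactly the cost that index structures are designed to avoid.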
In this article, we will take a closer look at Chroma, a lightweight, open source vector database.
Overview of Chroma
Chroma can be used with Python or JavaScript code to generate word embeddings. It has a simple API that can be used against the database backend that’s running in-memory or in client/server mode. Developers can install Chroma, consume the API in a Jupyter Notebook while prototyping, and then use the same code in a production environment, which may run the database in client/server mode.
When running in-memory, Chroma database collections can also be saved to the disk. Early releases wrote them in Apache Parquet format, while newer releases (0.4 and later) store them in SQLite. Since generating word embeddings is an expensive task, saving them for later retrieval reduces the cost and performance overhead.
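The saving from reusing stored embeddings can be illustrated with a small cache. This is a sketch of the general idea only, not of how Chroma persists data: fake_embed is a hypothetical stand-in for an expensive embedding model, and the JSON file is an arbitrary choice of storage.

```python
import json
import os
import tempfile

calls = 0  # counts how often the "expensive" model is invoked

def fake_embed(text):
    """Hypothetical stand-in for an expensive embedding model."""
    global calls
    calls += 1
    return [float(len(text)), float(text.count(" "))]

def cached_embed(text, cache_path):
    """Return the embedding for text, computing it only if it is not cached on disk."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if text not in cache:
        cache[text] = fake_embed(text)
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[text]

cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
first = cached_embed("hello vector world", cache_file)
second = cached_embed("hello vector world", cache_file)  # served from disk
print(calls)
```

The second call never reaches fake_embed, which is the effect a persisted collection gives you across program runs.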
Let’s see the Chroma vector database in action.
Using Chroma with Python
The first step to using Chroma is installing it through pip.
    pip install chromadb
Once installed, you can then import the module into your code.
    import chromadb
Let’s now create a list of strings that we will encode into embeddings.
    phrases = [
        "Amanda baked cookies and will bring Jerry some tomorrow.",
        "Olivia and Olivier are voting for liberals in this election.",
        "Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do.",
        "John's cookies were only half-baked but he still carries them for Mary.",
    ]
We also need a list of strings that uniquely identify the above strings.
    ids = ["001", "002", "003", "004"]
It’s also possible to associate additional metadata with each string that has a reference or a pointer to the original source. This is completely optional. For our tutorial, we will add some dummy metadata. This is structured as a list of dictionary objects.
    metadatas = [
        {"source": "pdf-1"},
        {"source": "doc-1"},
        {"source": "pdf-2"},
        {"source": "txt-1"},
    ]
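Because documents, ids, and metadata are passed as parallel lists, it can help to check that they line up before inserting them. The helper below is a hypothetical convenience (check_aligned is not part of Chroma) and uses shortened stand-in values so that it runs on its own.

```python
def check_aligned(documents, ids, metadatas=None):
    """Raise ValueError if the parallel lists cannot describe the same records."""
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    if metadatas is not None and len(metadatas) != len(documents):
        raise ValueError("metadatas must have one entry per document")
    return True

# Shortened stand-ins for the tutorial's lists.
sample_docs = ["first phrase", "second phrase"]
sample_ids = ["001", "002"]
sample_meta = [{"source": "pdf-1"}, {"source": "doc-1"}]

print(check_aligned(sample_docs, sample_ids, sample_meta))
```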
Now, we have all the entities that can be stored in Chroma. Let’s initialize the client.
    chroma_client = chromadb.Client()
If you want to persist the data to the disk, create a PersistentClient instead and pass the location of the directory in which to save the database.
    chroma_client = chromadb.PersistentClient(path="/path/to/save/to")
Chroma calls a set of related content a collection. Each collection has documents, which are simply a list of strings, ids that act as unique identifiers for the documents, and, optionally, metadata.
Embeddings are an important part of collections. They can be generated implicitly by the embedding model included with Chroma, or you can generate them with an external embedding model from providers such as OpenAI, PaLM, or Cohere. Chroma makes it easy to integrate external APIs to automate the process of generating embeddings and then storing them. We will explore this concept in more detail in the next part of this tutorial.
By default, Chroma creates embeddings using the Sentence Transformers all-MiniLM-L6-v2 model. This embedding model can generate sentence and document embeddings for a variety of tasks. The embedding function runs locally on your machine and may need to download model files, which happens automatically.
Since we are relying on the inbuilt word embedding model offered by Chroma, we will only ingest the data and let Chroma automatically generate the embeddings for each of the documents in the collection.
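The relationship between a collection and its embedding function can be pictured with a toy class. This is a simplified mental model, not Chroma's implementation: ToyCollection and letter_count_embedding are invented names, and the "model" merely counts vowels.

```python
def letter_count_embedding(text):
    """Hypothetical default 'model': one dimension per vowel count."""
    lowered = text.lower()
    return [float(lowered.count(v)) for v in "aeiou"]

class ToyCollection:
    """Stores documents by id and embeds them when no embeddings are supplied."""

    def __init__(self, name, embedding_function=letter_count_embedding):
        self.name = name
        self.embedding_function = embedding_function
        self.records = {}

    def add(self, documents, ids, metadatas=None, embeddings=None):
        # Use caller-supplied embeddings if given, otherwise generate them.
        if embeddings is None:
            embeddings = [self.embedding_function(d) for d in documents]
        metadatas = metadatas or [None] * len(documents)
        for doc, id_, meta, emb in zip(documents, ids, metadatas, embeddings):
            self.records[id_] = {"document": doc, "metadata": meta, "embedding": emb}

toy = ToyCollection("demo")
toy.add(documents=["baked cookies"], ids=["001"])
print(toy.records["001"]["embedding"])
```

Swapping in a different embedding_function changes how vectors are produced without touching the code that stores them, which is the design idea behind pluggable embedding models.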
Let’s go ahead and create a collection.
    collection = chroma_client.create_collection(name="tns_tutorial")
We are now ready to insert the documents into the collection.
    collection.add(
        documents=phrases,
        metadatas=metadatas,
        ids=ids
    )
We can quickly check if the embeddings are generated for the inserted documents.
    collection.peek()
You should see the embeddings automatically generated and added to the embeddings list of the collection.
We can now perform a similarity search on the collection. Let’s search for phrases that match the phrase “Mary got half-baked cake from John”. Notice that it is similar in meaning to one of the documents but is not an exact match.
    results = collection.query(
        query_texts=["Mary got half-baked cake from John"],
        n_results=2
    )
When you access the results variable, it has the following content:
    {'ids': [['004', '001']],
     'distances': [[0.4699302613735199, 1.333911657333374]],
     'metadatas': [[{'source': 'txt-1'}, {'source': 'pdf-1'}]],
     'embeddings': None,
     'documents': [["John's cookies were only half-baked but he still carries them for Mary.",
                    'Amanda baked cookies and will bring Jerry some tomorrow.']]}
Based on the distance, the first document in the list is the closest match: a smaller distance means a more similar document. The embeddings element is empty because it’s expensive to fetch the embeddings for each query. Behind the scenes, Chroma compares the query embedding with the embeddings stored as vectors; by default it ranks them by squared L2 (Euclidean) distance, and a collection can be configured to use cosine distance instead. We can now access the actual phrase by accessing the element directly.
    print(results['documents'][0][0])
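Because a query accepts a list of texts, every field in the result is a list of lists, with one inner list per query text. A small loop makes that structure easier to read. The sketch below copies the query result shown above into a literal so that it runs without a database; rows is just an illustrative name.

```python
# Literal copy of the query result shown earlier, so this runs standalone.
results = {
    "ids": [["004", "001"]],
    "distances": [[0.4699302613735199, 1.333911657333374]],
    "metadatas": [[{"source": "txt-1"}, {"source": "pdf-1"}]],
    "embeddings": None,
    "documents": [[
        "John's cookies were only half-baked but he still carries them for Mary.",
        "Amanda baked cookies and will bring Jerry some tomorrow.",
    ]],
}

# Index 0 selects the matches for the first (and only) query text.
rows = list(zip(
    results["ids"][0],
    results["distances"][0],
    results["metadatas"][0],
    results["documents"][0],
))

for doc_id, distance, metadata, document in rows:
    print(f"{doc_id}  {distance:.3f}  {metadata['source']}  {document}")
```

With several query texts, the same loop would run once per index of the outer lists.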
The Chroma database also supports querying based on metadata or ids. This makes it handy to perform a search based on the source of the documents.
    results = collection.query(
        query_texts=["cookies"],
        where={"source": "pdf-1"},
        n_results=1
    )
The above query restricts the similarity search to documents whose metadata satisfies the where condition, so only documents whose source is pdf-1 can be returned.
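The effect of a where condition can be sketched in plain Python: only records whose metadata matches every key in the filter are eligible, and the eligible ones are ranked by distance. The records, the distances, and the filtered_query helper below are invented for illustration, and the sketch says nothing about the order in which Chroma itself applies the two steps.

```python
def filtered_query(records, where, n_results=1):
    """Keep records whose metadata matches `where`, then return the closest ids."""
    eligible = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    eligible.sort(key=lambda r: r["distance"])
    return [r["id"] for r in eligible[:n_results]]

# Hypothetical records with precomputed distances to some query.
toy_records = [
    {"id": "001", "metadata": {"source": "pdf-1"}, "distance": 0.9},
    {"id": "003", "metadata": {"source": "pdf-2"}, "distance": 1.7},
    {"id": "004", "metadata": {"source": "txt-1"}, "distance": 0.8},
]

print(filtered_query(toy_records, where={"source": "pdf-1"}))
```

Note that record 004 is closer overall but is excluded because its source does not match the filter.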
Finally, let’s delete the collection.
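    # collection.delete() removes items from a collection; delete_collection drops the collection itself
    chroma_client.delete_collection(name="tns_tutorial")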
In the next part of this tutorial, scheduled for next week, we will extend the Academy Awards chatbot to use the Chroma vector database. Stay tuned.
Below is the complete code that you can try on your machine.
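What follows is a consolidated sketch assembled from the snippets above, not an official listing. It assumes the package is installed as chromadb and that the default embedding model can be downloaded on first use. The function name run_tutorial is only for illustration, and the import sits inside it so that the data definitions load even on a machine without Chroma.

```python
phrases = [
    "Amanda baked cookies and will bring Jerry some tomorrow.",
    "Olivia and Olivier are voting for liberals in this election.",
    "Sam is confused, because he overheard Rick complaining about him as a roommate. "
    "Naomi thinks Sam should talk to Rick. Sam is not sure what to do.",
    "John's cookies were only half-baked but he still carries them for Mary.",
]
ids = ["001", "002", "003", "004"]
metadatas = [
    {"source": "pdf-1"},
    {"source": "doc-1"},
    {"source": "pdf-2"},
    {"source": "txt-1"},
]

def run_tutorial():
    """Run the tutorial end to end; requires the chromadb package."""
    import chromadb  # imported here so the data above loads without Chroma

    # In-memory client; swap in chromadb.PersistentClient(path=...) to persist.
    chroma_client = chromadb.Client()
    collection = chroma_client.create_collection(name="tns_tutorial")

    # Embeddings are generated by the default embedding function.
    collection.add(documents=phrases, metadatas=metadatas, ids=ids)

    # Similarity search over all documents.
    results = collection.query(
        query_texts=["Mary got half-baked cake from John"],
        n_results=2,
    )
    print(results["documents"][0][0])

    # Similarity search restricted by metadata.
    filtered = collection.query(
        query_texts=["cookies"],
        where={"source": "pdf-1"},
        n_results=1,
    )
    print(filtered["ids"])

    # Clean up by dropping the whole collection.
    chroma_client.delete_collection(name="tns_tutorial")
    return results
```

Call run_tutorial() from a notebook cell or a script once Chroma is installed.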