
Exploring Chroma: The Open Source Vector Database for LLMs

Chroma is the open-source embedding database that makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs. Find out here how it works.
Jul 28th, 2023 9:00am

The rise of large language models has accelerated the adoption of vector databases that store word embeddings.

A vector database stores data in vector form, leveraging the potential of advanced machine learning algorithms. It enables highly efficient similarity search, which is crucial for AI applications, including recommendation systems, image recognition, and natural language processing.

Each data point stored in a vector database is represented as a multidimensional vector, capturing the essence of complex data. Advanced indexing methods, like k-d trees or hashing, facilitate quick retrieval of similar vectors. This architecture creates highly scalable, efficient solutions for data-heavy industries, transforming how we approach big data analytics.

In this article, we will take a closer look at Chroma, a lightweight, open source vector database.

Overview of Chroma

Chroma can be used with Python or JavaScript code to generate word embeddings. It has a simple API that can be used against the database backend that’s running in-memory or in client/server mode. Developers can install Chroma, consume the API in a Jupyter Notebook while prototyping, and then use the same code in a production environment, which may run the database in client/server mode.

When running in-memory, Chroma database collections can be saved to the disk in Apache Parquet format. Since generating word embeddings is an expensive task, saving them for later retrieval reduces the cost and performance overhead.

Let’s see the Chroma vector database in action.

Using Chroma with Python

The first step to using Chroma is installing it through pip.
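The package is published on PyPI as `chromadb`:

```shell
pip install chromadb
```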


Once installed, you can then import the module into your code.
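The module name matches the package name:

```python
import chromadb
```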


Let’s now create a list of strings that we will encode into embeddings.
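The original sample strings were not preserved in this copy, so the list below is assumed sample data. Note that the first phrase is deliberately close in meaning to the query we will run later.

```python
# Assumed sample data: any short phrases will work.
documents = [
    "John baked a half-cooked cake for Mary",
    "The weather in the mountains is pleasant today",
    "Vector databases enable fast similarity search",
    "Python is a popular programming language",
]
```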


We also need a list of strings that uniquely identify the above strings.
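A simple scheme (assumed here) is positional ids, one per document and in the same order as the documents list:

```python
# One unique id per document, in the same order as the documents list.
ids = ["doc1", "doc2", "doc3", "doc4"]
```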


It’s also possible to associate additional metadata with each string that has a reference or a pointer to the original source. This is completely optional. For our tutorial, we will add some dummy metadata. This is structured as a list of dictionary objects.
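The source names below are made up purely for illustration:

```python
# Dummy metadata: one dictionary per document, pointing at a made-up source.
metadatas = [
    {"source": "recipes"},
    {"source": "weather"},
    {"source": "databases"},
    {"source": "languages"},
]
```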


Now, we have all the entities that can be stored in Chroma. Let’s initialize the client.
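With no arguments, the client runs in-memory:

```python
import chromadb

# In-memory client; nothing is persisted unless configured otherwise.
client = chromadb.Client()
```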


If you want to persist the data to the disk, you can pass the location of the directory to save the database.
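At the time this article was written (mid-2023), on-disk persistence used the DuckDB + Parquet backend configured through `Settings`; newer Chroma releases expose `chromadb.PersistentClient(path=...)` instead.

```python
import chromadb
from chromadb.config import Settings

# Pre-0.4 style persistence, matching the Parquet format mentioned above.
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="db",   # directory where the collections are saved
))
```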


Chroma calls a set of relevant content a collection. Each collection has documents, which are simply a list of strings, ids that act as unique identifiers for the documents, and, optionally, metadata.

Embeddings are an important part of collections. They can be generated implicitly by the word embedding model bundled with Chroma, or you can generate them with an external embedding model from OpenAI, PaLM, or Cohere. Chroma makes it easy to integrate such external APIs to automate generating and storing embeddings. We will explore this concept in more detail in the next part of this tutorial.

By default, Chroma creates embeddings with the Sentence Transformers all-MiniLM-L6-v2 model, which can generate sentence and document embeddings for a variety of tasks. This embedding function runs locally on your machine and may automatically download the model files the first time it is used.

Since we are relying on the inbuilt word embedding model offered by Chroma, we will only ingest the data and let Chroma automatically generate the embeddings for each of the documents in the collection.

Let’s go ahead and create a collection.
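Using the `client` initialized above (the collection name is assumed):

```python
collection = client.create_collection(name="my_collection")
```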


We are now ready to insert the documents into the collection.
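Since we pass no embeddings explicitly, Chroma generates them with the default model during `add` (the `documents`, `ids`, and `metadatas` lists are the ones built earlier):

```python
collection.add(
    documents=documents,
    ids=ids,
    metadatas=metadatas,
)
```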


We can quickly check if the embeddings are generated for the inserted documents.
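`get` returns the stored items; embeddings are excluded by default, so request them explicitly:

```python
# Ask Chroma to include the stored embeddings in the response.
collection.get(include=["embeddings", "documents"])
```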


You should see the embeddings automatically generated and added to the embeddings list of the collection.

We can now perform a similarity search on the collection. Let’s search for phrases that match the phrase “Mary got half-baked from John”. Notice that it only has a similar meaning to one of the documents but not an exact match.
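`query_texts` is embedded with the same model before the search; `n_results` caps the number of hits returned:

```python
results = collection.query(
    query_texts=["Mary got half-baked from John"],
    n_results=2,
)
```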


When you access the results variable, it has the following content:
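The exact ids, documents, and distances depend on your sample data and on the embedding model, but the dictionary has this general shape (illustrative values only):

```python
# Illustrative shape only; the outer lists have one entry per query text.
results = {
    "ids": [["doc1", "doc3"]],
    "documents": [["John baked a half-cooked cake for Mary",
                   "Vector databases enable fast similarity search"]],
    "metadatas": [[{"source": "recipes"}, {"source": "databases"}]],
    "distances": [[0.7, 1.5]],
    "embeddings": None,   # not fetched by default
}
```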


Based on the distance, the first document in the list is the closest match. We can now access the actual phrase by indexing into the element directly. The embeddings element is empty because fetching the embeddings for each query is expensive. Behind the scenes, though, Chroma performs a nearest-neighbor search on the embeddings stored as vectors (squared L2 distance by default, with cosine similarity available as a configuration option).
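Continuing with the `results` variable from the query above, the closest document sits at index `[0][0]` (the outer index selects the query, the inner index the hit):

```python
# results["documents"][0] is the hit list for the first (and only) query.
closest = results["documents"][0][0]
print(closest)
```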


The Chroma database also supports querying based on metadata or ids. This makes it handy to restrict a search to documents from a particular source.
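A `where` clause filters by metadata; the field name `source` matches the dummy metadata used earlier:

```python
results = collection.query(
    query_texts=["Mary got half-baked from John"],
    where={"source": "recipes"},   # only documents whose metadata matches
    n_results=2,
)
```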


The above query first performs a similarity search and then filters the query based on the where condition, which specifies the metadata.

Finally, let’s delete the collection.
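Deleting by name removes the collection and its embeddings:

```python
client.delete_collection(name="my_collection")
```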


In the next part of this tutorial, scheduled for next week, we will extend the Academy Awards chatbot to use the Chroma vector database. Stay tuned.

Below is the complete code that you can try on your machine.
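The original listing was not preserved in this copy; the script below is reconstructed from the steps above (sample data and collection name are assumed):

```python
import chromadb

# Assumed sample data; any short phrases will do.
documents = [
    "John baked a half-cooked cake for Mary",
    "The weather in the mountains is pleasant today",
    "Vector databases enable fast similarity search",
    "Python is a popular programming language",
]
ids = ["doc1", "doc2", "doc3", "doc4"]
metadatas = [
    {"source": "recipes"},
    {"source": "weather"},
    {"source": "databases"},
    {"source": "languages"},
]

client = chromadb.Client()  # in-memory; see above for on-disk persistence
collection = client.create_collection(name="my_collection")

# Embeddings are generated automatically by the default model.
collection.add(documents=documents, ids=ids, metadatas=metadatas)

# Similarity search.
results = collection.query(
    query_texts=["Mary got half-baked from John"],
    n_results=2,
)
print(results["documents"][0][0])

# Filtered search on metadata.
filtered = collection.query(
    query_texts=["Mary got half-baked from John"],
    where={"source": "recipes"},
    n_results=2,
)
print(filtered["documents"][0])

client.delete_collection(name="my_collection")
```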
