Create a Movie Recommendation Engine with Milvus and Python

Use open source tools to help people find movies they might want to watch.
Nov 17th, 2023 9:32am
Featured image by Tima Miroshnichenko on Pexels.

Recommender systems or recommendation engines are information-filtering systems that aim to predict and suggest things users might be interested in. These items include products, services and content such as movies, books, music or news articles.

There are various types of recommender systems, such as collaborative filtering, content-based filtering, hybrid recommendation systems and vector-based recommendation systems. Vector-based systems represent items as vectors and recommend the items whose vectors lie closest to a query in the vector space. There are various ways to store these vectors; one of the most efficient is the Milvus open source vector database. This database is highly flexible, fast and reliable and allows for trillion-byte-scale addition, deletion, updating and nearly real-time search of vectors.
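To make the idea of "closest items in the vector space" concrete, here is a minimal sketch in plain Python. The item names and three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions, and a vector database such as Milvus replaces the brute-force sort with an index.

```python
import math

# Toy 3-dimensional "embeddings" (made-up values for illustration only).
ITEMS = {
    "space adventure": [0.9, 0.1, 0.0],
    "romantic comedy": [0.1, 0.9, 0.2],
    "courtroom drama": [0.0, 0.3, 0.9],
}


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_items(query, items, k=2):
    """Return the k item names whose vectors are nearest to the query."""
    ranked = sorted(items, key=lambda name: l2_distance(query, items[name]))
    return ranked[:k]


print(closest_items([0.8, 0.2, 0.1], ITEMS))
```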

This article explains how to build a movie recommender with Milvus and Python. This system will use SentenceTransformers to convert the text information to vectors and store these vectors in Milvus. Milvus enables users to search for a movie in the database based on the text information they provide.

You can find the code for this tutorial in the Milvus Bootcamp repository on GitHub, and there’s also a Jupyter Notebook.

Setting Up the Environment

For this article, you’ll need the following requirements installed:

Python Requirements

You also need a set of libraries that are used throughout this tutorial. Install them using pip:
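The libraries named in this article are PyMilvus, pandas, SentenceTransformers and the Kaggle client, which are most likely published on PyPI as pymilvus, pandas, sentence-transformers and kaggle. That package list is inferred from the article rather than taken from an original listing. As a rough sketch, the following check reports which of them are still missing from the current environment:

```python
import importlib.util

# Import names inferred from the libraries this article uses (an assumption).
REQUIRED_MODULES = ["pymilvus", "pandas", "sentence_transformers", "kaggle"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```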

Vectors Data Store (Milvus)

You’ll use the Milvus vector database to store the embeddings that you’ll generate using movie descriptions. The dataset is relatively large, at least for a server running on a personal computer. So you may want to use a Zilliz Cloud instance to store these vectors.

If you prefer to stay with a local instance, you can download a docker-compose configuration and run it:


You now have all the requirements to build your movie recommender system with Milvus.

Data Collection and Preprocessing

For this project, you’ll use the Movies Dataset from Kaggle, which contains metadata for 45,000 movies. You can download this dataset directly or fetch it from Python with the Kaggle API. To do it with Python, you need to download the kaggle.json file from the profile section of Kaggle.com and put it in a location where the API will find it.

Next, set up some environment variables for Kaggle authentication. For this, you can open the Jupyter Notebook and write the following lines of code:
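One possible way to do this is to read the downloaded kaggle.json and copy its values into KAGGLE_USERNAME and KAGGLE_KEY, the environment variables the Kaggle client checks. The helper name and the file location in the comment are assumptions made for this sketch:

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Read kaggle.json and expose its values through the environment
    variables that the Kaggle API client checks."""
    creds = json.loads(Path(path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Hypothetical location; point this at wherever you saved kaggle.json.
# load_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")
```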


Once done, you can use Kaggle’s Python dependency to download the dataset from Kaggle:
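A minimal sketch of the download step follows. The dataset slug and the target folder are assumptions, so confirm the slug on the dataset's Kaggle page before running it. The import sits inside the function because importing the kaggle package attempts authentication immediately.

```python
# Dataset slug and target folder are assumptions; adjust them to your setup.
DATASET_SLUG = "rounakbanik/the-movies-dataset"
DOWNLOAD_DIR = "dataset"


def download_movies_dataset(slug=DATASET_SLUG, path=DOWNLOAD_DIR):
    """Download and unzip the dataset with the Kaggle API client."""
    # Imported here because importing kaggle triggers authentication,
    # which fails if credentials are not configured yet.
    import kaggle

    kaggle.api.authenticate()
    kaggle.api.dataset_download_files(slug, path=path, unzip=True)
    return path


# download_movies_dataset()  # needs network access and valid Kaggle credentials
```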


Once the dataset is downloaded, you can use the read_csv() method from pandas to read the dataset:
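As a sketch of the loading step, the snippet below wraps read_csv() in a small helper and runs it on a tiny in-memory CSV so that it works without the download. The file name in the comment is an assumption about the contents of the Kaggle archive.

```python
import io

import pandas as pd


def load_movies(source):
    """Load the movie metadata CSV into a DataFrame.

    low_memory=False makes pandas read the whole file before inferring
    column types, which avoids mixed-type warnings on large CSVs.
    """
    return pd.read_csv(source, low_memory=False)


# With the real dataset (file name assumed from the Kaggle archive):
# movies = load_movies("dataset/movies_metadata.csv")

# A tiny in-memory stand-in so the snippet runs without the download:
sample_csv = io.StringIO(
    "id,title,overview\n"
    "1,Toy Story,Toys come to life.\n"
    "2,Jumanji,A magical board game.\n"
)
movies = load_movies(sample_csv)
print(movies.shape)          # (rows, columns)
print(list(movies.columns))  # column names
```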


Database records and columns

The image shows that you have 45,466 records with 24 columns of metadata. Check all these metadata columns with the following:


Dataset columns

There are a lot of columns you don’t need to create the recommender system. You can filter out the required columns with:
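A hedged sketch of the column filter is shown here. The list of required columns is an assumption based on the fields this article uses later (ID, title, overview, release date and genres), and the one-row DataFrame only stands in for the real dataset.

```python
import pandas as pd

# Column names are assumptions based on the fields this article uses later.
REQUIRED_COLUMNS = ["id", "title", "overview", "release_date", "genres"]


def trim_columns(movies, columns=REQUIRED_COLUMNS):
    """Return a copy of the DataFrame restricted to the given columns."""
    return movies[columns].copy()


# Small stand-in frame with one extra column to drop:
movies = pd.DataFrame({
    "id": ["862"],
    "title": ["Toy Story"],
    "overview": ["Toys come to life."],
    "release_date": ["1995-10-30"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]"],
    "budget": ["30000000"],
})
trimmed_movies = trim_columns(movies)
print(list(trimmed_movies.columns))
```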


Required columns

Also, some of the fields in the data are missing, so get rid of those rows to produce a clean dataset:
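One simple way to do this, sketched on a three-row stand-in frame, is pandas' dropna(), which discards every row containing a missing value:

```python
import pandas as pd


def drop_incomplete_rows(movies):
    """Remove every row that has a missing value in any column."""
    return movies.dropna().reset_index(drop=True)


movies = pd.DataFrame({
    "id": ["1", "2", "3"],
    "title": ["Toy Story", "Jumanji", None],
    "overview": ["Toys come to life.", None, "No title here."],
})
clean_movies = drop_incomplete_rows(movies)
print(len(movies), "->", len(clean_movies))
```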

Connect to Milvus

Now that you have all the required columns, connect to Milvus to start uploading the data. To connect to a Zilliz Cloud instance, you’ll need the Uniform Resource Identifier (URI) and token, both of which are available from your Zilliz Cloud dashboard.

Zilliz Cloud dashboard

Once you have your URI and token, you can use the connect() method from PyMilvus to connect to the Milvus server:
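The connection step might look like the sketch below. MILVUS_URI and MILVUS_TOKEN are hypothetical environment variable names chosen here to keep credentials out of the notebook; they are not defined by Milvus or Zilliz Cloud.

```python
import os


def milvus_settings_from_env(env=os.environ):
    """Collect connection settings from environment variables.

    MILVUS_URI and MILVUS_TOKEN are hypothetical variable names chosen
    for this sketch; they are not defined by Milvus or Zilliz Cloud.
    """
    uri = env.get("MILVUS_URI")
    token = env.get("MILVUS_TOKEN")
    if not uri or not token:
        raise ValueError("Set MILVUS_URI and MILVUS_TOKEN before connecting.")
    return {"uri": uri, "token": token}


def connect_to_milvus(settings):
    """Open the default PyMilvus connection with the given URI and token."""
    from pymilvus import connections  # imported lazily; needs pymilvus installed

    connections.connect(alias="default", uri=settings["uri"], token=settings["token"])


# connect_to_milvus(milvus_settings_from_env())  # needs a reachable Milvus instance
```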

Generate Embeddings for Movies

Now it’s time to calculate the embeddings for the text data in the movie dataset. First, create a collection object that will store the movie ID and embeddings for the text data. Also, create an index on the embedding field to make searches more efficient:
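A possible shape for the collection and its index is sketched below. The collection name, the extra title field, the IVF_FLAT index type and the nlist value are all assumptions; the dimension of 384 matches the all-MiniLM-L6-v2 SentenceTransformers model, which is itself an assumed model choice.

```python
COLLECTION_NAME = "film_vectors"  # hypothetical name
DIMENSION = 384                   # output size of the all-MiniLM-L6-v2 model


def ivf_index_params(nlist=1024):
    """Index settings for an IVF_FLAT index that ranks by L2 distance."""
    return {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": nlist}}


def create_movie_collection(name=COLLECTION_NAME, dim=DIMENSION):
    """Create a collection holding an integer movie ID, a title and an
    embedding, then index the embedding field and load the collection."""
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ]
    schema = CollectionSchema(fields=fields, description="Movie embeddings")
    collection = Collection(name=name, schema=schema)
    collection.create_index(field_name="embedding", index_params=ivf_index_params())
    collection.load()
    return collection


# collection = create_movie_collection()  # requires an open Milvus connection
```

An IVF-style index fits the nprobe search parameter discussed later, because nprobe selects how many of the nlist clusters are scanned per query.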


Now that you have an indexed collection, create a function for generating the embeddings for the text. Although overview is the primary column used to generate the embeddings, you’ll also combine the genre and release date information with the overview to give each embedding more context.

To generate the embeddings, use the SentenceTransformer:


This function uses the build_genres() method to clean the genre column and get the text out of it. Then it creates a SentenceTransformer object to help generate the embeddings from the text. Finally, it uses the encode() method to generate the embeddings using the overview, release_date and genre features.
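The description above can be sketched as follows. The model name, the sentence template and the movie_text() and load_transformer() helpers are assumptions. Unlike the description, this variant receives the transformer as an argument instead of creating it inside the function, so the model is loaded only once.

```python
import ast

MODEL_NAME = "all-MiniLM-L6-v2"  # assumed model; other SentenceTransformers models also work


def build_genres(raw_genres):
    """Turn the dataset's stringified list of genre dicts into 'A, B, C'."""
    entries = ast.literal_eval(raw_genres)
    return ", ".join(entry["name"] for entry in entries)


def movie_text(movie):
    """Combine overview, release date and genres into one sentence-like string."""
    return "{} Released on {}. Genres are {}.".format(
        movie["overview"], movie["release_date"], build_genres(movie["genres"])
    )


def embed_movie(movie, transformer):
    """Encode the combined text with an object exposing encode(), such as a
    SentenceTransformer instance."""
    return transformer.encode(movie_text(movie))


def load_transformer(model_name=MODEL_NAME):
    """Load the embedding model (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


movie = {
    "overview": "Toys come to life.",
    "release_date": "1995-10-30",
    "genres": "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
}
print(movie_text(movie))
```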

Send Embeddings to Milvus

Now you can create the embeddings using the embed_movie() method. This dataset is too large to send to Milvus in a single insert statement, but sending data rows one at a time would create unnecessary network traffic and take too long. Instead, create batches (e.g., 5,000 rows) of data to send to Milvus:
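A minimal sketch of the batching logic is shown here. It assumes a collection object that behaves like a PyMilvus Collection, whose insert() accepts a list of row dictionaries in recent releases, and the field names id, title and embedding are assumptions. The to_row() helper raises ValueError for IDs that cannot be cast to integers, so callers can skip those movies.

```python
def to_row(movie, embedding):
    """Build one insertable row; raises ValueError if the ID is not an integer."""
    return {"id": int(movie["id"]), "title": movie["title"], "embedding": embedding}


def batched(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_batches(collection, rows, batch_size=5000):
    """Send rows to Milvus one batch at a time and return the batch count.

    `collection` is expected to behave like a pymilvus Collection, whose
    insert() accepts a list of row dictionaries in recent releases.
    """
    count = 0
    for batch in batched(rows, batch_size):
        collection.insert(batch)
        count += 1
    collection.flush()
    return count
```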


Note: You can play with the batch size to suit your individual needs and preferences. Also, a few movies will fail for IDs that cannot be cast to integers. You could fix this with a schema change or by verifying their format.

Sending embeddings to Milvus

Recommend New Movies Using Milvus

Now you can leverage Milvus’ near real-time vector search functionality to get a close match of movies that meet the viewer’s criteria. For this, create two different functions:

  1. embed_search(): You need a transformer to convert the user’s search string to an embedding. This function takes the viewer’s criteria and passes it to the same transformer you used to populate Milvus.
  2. search_for_movies(): This function performs the actual vector search using the other function for support.
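The two functions could be sketched as below. The nprobe value, the embedding and title field names, and the choice to pass the collection and transformer as arguments are assumptions; TOP_K plays the role of the topK parameter.

```python
TOP_K = 5  # number of results to return
SEARCH_PARAMS = {"metric_type": "L2", "params": {"nprobe": 10}}  # nprobe value assumed


def embed_search(query, transformer):
    """Encode the viewer's search string with the same model used for the movies."""
    return transformer.encode(query)


def search_for_movies(query, collection, transformer, top_k=TOP_K):
    """Return (title, distance) pairs for the movies closest to the query."""
    vector = embed_search(query, transformer)
    results = collection.search(
        data=[vector],
        anns_field="embedding",
        param=SEARCH_PARAMS,
        limit=top_k,
        output_fields=["title"],
    )
    # One query vector was sent, so results[0] holds its hits.
    return [(hit.entity.get("title"), hit.distance) for hit in results[0]]
```

With a loaded collection and a SentenceTransformer instance, a call such as search_for_movies("A comedy from the 1990s set in a hospital.", collection, transformer) would return up to five title and distance pairs. The query string is only an example.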


The above code defines three search parameters: topK, which requests the top five most similar vectors; metric_type, set to L2 (squared Euclidean distance) to measure how far apart two vectors are; and nprobe, which sets the number of cluster units to search. It also implements the two functions that turn the user’s query into a list of similar vectors (the recommendations).

Finally, use the search_for_movies() function to recommend movies based on the user’s search string:


Movie recommendation output

By using Milvus’ vector search feature, the code recommends the top five similar movies based on the user’s query. That’s it: You’ve now built your own movie recommender system using Milvus.

Conclusion

After reading this article, you know what a vector-based recommendation system is and how to create a movie recommender system with Milvus. Milvus helps build an efficient and scalable movie recommendation system. Leveraging vector storage and similarity search, Milvus has great potential for enabling personalized recommendations, enhancing user engagement and showcasing the role of advanced vector-based models in modern recommendation systems. You can learn more about it on the Milvus website.
