
Pinecone: A Vector Database for Machine Learning Applications

15 Feb 2021 1:37pm

As more applications employ machine learning and artificial intelligence for tasks such as rating, recommendation engines, anomaly detection, and duplication removal, companies face a trade-off between development costs and performance as they try to force traditional databases to accomplish tasks for which they weren’t designed.

That’s according to Pinecone founder and CEO Edo Liberty, who left Amazon Web Services with an eye toward building new technology to alleviate this pain.

At AWS, he led Amazon’s AI lab, including the team that built Amazon’s cloud machine-learning platform SageMaker. Before that he ran Yahoo’s Scalable Machine Learning Platforms group and did doctoral and post-doc work on big data and machine learning frameworks.

“It was obvious to me that the world of kind of machine learning and databases were in a head-on collision path where machine learning was representing data as these new objects called vectors that no database was really able to handle. And as time went by, more and more jobs, and more and more applications, were using machine learning to run things like recommendation, personalization, all these things, and they just needed the infrastructure to be able to run it, and it didn’t exist,” Liberty said.

He described his idea for Pinecone, previously called HyperCube, as the “connective tissue between the production world of databases and the continuous and fluid kind of more experimental side of machine learning.”

Forrester projects that the AI software market will grow to $37 billion by 2025, becoming a new middleware category of algorithms, data sets, and tools that enable embedding AI functionality in all software products.

Machine learning models take data such as documents, videos or user behaviors, and convert them into vector embeddings, which describe the semantic similarity of objects and concepts by how close they are to each other as points in vector spaces. These usually are long, complex collections of numbers, and the rows and tables of conventional databases don’t efficiently accommodate them.
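To make the idea of closeness concrete, here is a minimal Python sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors and their labels are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (made up, not produced by any real model).
embeddings = {
    "kitten": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "truck": [0.1, 0.2, 0.9],
}

cat_vs_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_vs_truck = cosine_similarity(embeddings["cat"], embeddings["truck"])

# Related concepts sit closer together, so their similarity is higher.
print(cat_vs_kitten > cat_vs_truck)
```

A score near 1 means two vectors point in almost the same direction, while a score near 0 means they are close to unrelated.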

Applications that need to filter and rank large collections of vectors in real time require highly specialized data infrastructure to answer queries such as nearest-neighbor and max-dot-product search accurately and in milliseconds.
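The two query types can be sketched with an exhaustive scan in plain Python. This is only a baseline for illustration, with made-up vectors and hypothetical function names; it compares the query against every stored vector, which is exactly the cost that specialized indexes are built to avoid at scale.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbors(query, vectors, k):
    """Return the ids of the k vectors closest to the query."""
    return heapq.nsmallest(
        k, vectors, key=lambda vid: euclidean_distance(query, vectors[vid])
    )


def max_dot_product(query, vectors, k):
    """Return the ids of the k vectors with the largest dot product."""
    return heapq.nlargest(
        k, vectors, key=lambda vid: dot_product(query, vectors[vid])
    )


# Hypothetical two-dimensional vectors keyed by id.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [3.0, 3.0],
}
query = [1.0, 0.0]

closest = nearest_neighbors(query, vectors, k=2)
best_scoring = max_dot_product(query, vectors, k=2)
print(closest, best_scoring)
```

The two searches can disagree: vector "d" is far from the query but has a large magnitude, so it ranks first by dot product while missing the nearest-neighbor list entirely.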

“When a database that is schematized data, and the way you select out of it is with SQL or some other logic, right, based on keys and values. And so with a search engine with the collection of documents, the way you select from them is specifying terms in the documents. And you can kind of use the intersection of those documents that contain those terms,” Liberty explained.

“When you have high-dimensional vectors, the object is just a very long list of numbers, say 1,024 numbers, just literally floating points, right? Just 0.8, 1.6 so on. You don’t have the table to do like SQL on and you don’t have the documents, and so really the tools and the languages that we have to specify what we’re interested in, just don’t hold anymore,” he said. “The way you fetch from a collection of data, a collection of vectors, has its own logic, and it speaks the language of geometry, like nearest neighbor or in a box.”
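As a rough illustration of what a geometric query such as "in a box" might mean, the sketch below selects the vectors whose every coordinate falls within given bounds. The function names, points and bounds are hypothetical, chosen only to show the idea.

```python
def in_box(vector, lower, upper):
    """True if every coordinate lies within its [lower, upper] bounds."""
    return all(lo <= x <= hi for x, lo, hi in zip(vector, lower, upper))


def box_query(vectors, lower, upper):
    """Return the ids of all vectors that fall inside the box."""
    return [vid for vid, vec in vectors.items() if in_box(vec, lower, upper)]


# Hypothetical two-dimensional points keyed by id.
points = {
    "p1": [0.8, 1.6],
    "p2": [0.2, 0.3],
    "p3": [1.1, 1.5],
}

inside = box_query(points, lower=[0.5, 1.0], upper=[1.0, 2.0])
print(inside)
```

Unlike a SQL predicate on named columns, the selection here is defined purely by where a point sits in space.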

While it’s possible to homebrew infrastructure to accomplish this, it’s too labor-intensive for most companies, Liberty said.

“I’ve seen many companies kind of between a rock and a hard place, you know, they want some really cool application, they want to unleash machine learning in real-time. And they see a big potential business improvement, but they have to pay for it with many months of development or some compromise on the quality or simply poor performance. And it’s always a painful self-negotiation they have to go through. With Pinecone, we really try to liberate them from that,” he said.

Speed and Scale

There are three parts to Pinecone. The first is a core index, which converts high-dimensional vectors from third-party data sources into a machine-learning-ingestible format so they can be saved and searched accurately and efficiently.

The second is container distribution, which dynamically ensures performance regardless of scale, handling load balancing, replication, name-spacing, sharding, and more at latencies below 50 milliseconds for queries, updates, and embeddings. Being totally serverless, Pinecone can run on as many nodes as you want.
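For readers unfamiliar with sharding, here is a generic sketch of how a search can be spread across shards and merged. It is not Pinecone's implementation, which the company has not detailed here; the shard-assignment rule, function names and data are all assumptions made for illustration.

```python
import heapq


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def shard_for(vector_id, num_shards):
    """Assign an id to a shard with a simple deterministic rule."""
    return sum(vector_id.encode("utf-8")) % num_shards


def search_shard(shard, query, k):
    """Return the k best (score, id) pairs found within one shard."""
    scored = ((dot_product(query, vec), vid) for vid, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards, query, k):
    """Query every shard, then merge the partial results into a global top k."""
    partials = []
    for shard in shards:
        partials.extend(search_shard(shard, query, k))
    return [vid for _, vid in heapq.nlargest(k, partials)]


# Hypothetical vectors, split across two shards by id.
items = {
    "v1": [1.0, 0.0],
    "v2": [0.0, 1.0],
    "v3": [0.7, 0.7],
    "v4": [0.9, 0.2],
    "v5": [0.1, 0.1],
}
num_shards = 2
shards = [{} for _ in range(num_shards)]
for vid, vec in items.items():
    shards[shard_for(vid, num_shards)][vid] = vec

query = [1.0, 0.2]
top_two = search_all(shards, query, k=2)
print(top_two)
```

Because each shard returns its own top k, the merged answer matches what a single unsharded scan would return, while each shard only scans its own slice of the data.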

“There’s absolutely nothing that prevents us from running on 100 billion objects. It’s definitely designed to be able to do that,” Liberty said.

The company claims real-time indexing that is 30 times faster than open source libraries.

The third component is a fully automated cloud management layer that frees users from having to procure and manage hardware or install anything. You can just start an index and pump data into it and start querying. The Python-based API enables updating and querying vector indexes from anywhere, including Jupyter notebooks.
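The start-an-index, load-data, query workflow might look roughly like the sketch below. The `ToyVectorIndex` class and its `upsert`, `delete` and `query` methods are a hypothetical in-memory stand-in written for this example; they are not Pinecone's actual API, which is not shown in this article.

```python
import heapq


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector index (not Pinecone's API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._vectors = {}

    def upsert(self, items):
        """Insert new (id, vector) pairs or overwrite existing ids."""
        for vector_id, vector in items:
            if len(vector) != self.dimension:
                raise ValueError(
                    f"expected {self.dimension} dimensions, got {len(vector)}"
                )
            self._vectors[vector_id] = list(vector)

    def delete(self, vector_id):
        """Remove a vector if present; unknown ids are ignored."""
        self._vectors.pop(vector_id, None)

    def query(self, vector, top_k):
        """Return the top_k (id, score) pairs ranked by dot product."""
        scored = (
            (vid, sum(x * y for x, y in zip(vector, vec)))
            for vid, vec in self._vectors.items()
        )
        return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


index = ToyVectorIndex(dimension=2)
index.upsert([("user-1", [1.0, 0.0]), ("user-2", [0.0, 1.0])])
before = index.query([1.0, 0.0], top_k=1)

# Overwriting user-2 changes the answer to the same query immediately.
index.upsert([("user-2", [2.0, 0.0])])
after = index.query([1.0, 0.0], top_k=1)
print(before, after)
```

The point of the sketch is the shape of the interaction: updates and queries interleave freely, and a query always reflects the most recent write.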

It’s designed for self-service, with consumption-based pricing to enable companies to build proofs of concept with little overhead and to scale effortlessly.

The company recently raised a $10 million seed round led by Wing Venture Capital, one of the major backers of startups including the data warehouse-as-a-service offering Snowflake and the service control platform Kong.

“The world abounds with databases and it is reasonable to ask why it needs another. The answer lies in the distinctive requirements of AI-powered applications,” Peter Wagner, founding partner at Wing Venture Capital, wrote in a blog post.

“New workloads and their core data types have always been the catalysts for the creation of new data platforms. ML and its vectors are next in line[…] Looking ahead, it is hard to imagine many interesting applications that aren’t grounded in AI in some fundamental way. AI will be a pervasive property of modern software, as ubiquitous and important as oxygen.”

Most of the people who care about a vector database aren’t the scientists and engineers, though they care about being able to get to production, Liberty said.

“The people who really care about it are the engineers and the ML infrastructure [people], who build those systems and need to run them day in day out,” Liberty said.

“It’s a sigh of relief because they don’t have to figure out like 1,000 different pieces of software and they don’t have to build a distributed system from scratch, or they don’t have to integrate like 10 different tools. … They are able to enable their scientists and engineers and provide the right way to support [them].”
