Generative AI: How to Choose the Optimal Database
It seems like nearly every day brings a new AI application that pushes the boundaries of what is possible. Despite all the attention generative AI is garnering, some high-profile missteps have reminded the world once again of “garbage in, garbage out.” If we ignore underlying data management principles, then the output can’t be trusted.
Technical professionals will not only have to evolve their data strategy and existing data infrastructure to leverage the influx of Large Language Models (LLMs) and unlock new insights, but they will have to identify which technologies are best and most efficient in enabling AI workloads.
But what database components are needed by organizations to harness the power of LLMs on proprietary data?
8 Components to Support AI Workloads
Databases supporting AI workloads must enable low-latency, highly scalable queries. The world of LLMs is expanding rapidly: some models are completely open source, while others are semi-open but have commercial APIs.
There are many considerations when evaluating a new or existing database for your generative AI workloads. The essential capabilities needed to deliver AI workloads are shown in the following diagram and explained in further detail below:
1. Ingestion and Vectorization
The training data for LLMs like GPT-4 extends only to September 2021. Without enhancements, like the browser plug-in, the responses are outdated, and organizations expect to make decisions on the freshest data.
Hence, the ingestion layer of a database must be able to:
- Ingest, process and analyze multi-structured data.
- Ingest batch as well as real-time streaming data, including the ability to easily pull data (up to millions of events per second) from diverse data sources including Amazon Simple Storage Service (S3), Azure Blobs, Hadoop Distributed File System (HDFS) or a streaming service like Kafka.
- Call APIs or user-defined functions to convert the data to vectors.
- Index the vectors for fast vector (similarity) searches.
- Make data available for analysis as soon as it lands.
A relational database management system (RDBMS) has the advantage of performing the preceding tasks in the more familiar SQL.
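The ingestion steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: `embed` is a toy hashing function standing in for a real embedding API, and `VectorStore` is a hypothetical in-memory table invented for this example.

```python
import hashlib
import math

DIM = 64  # toy dimensionality, chosen only for this illustration


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding API: hash each token into a bucket."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product = cosine


class VectorStore:
    """Hypothetical in-memory table: rows are searchable as soon as they land."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def ingest(self, records) -> int:
        """Accept any iterable: a batch list or a streaming generator."""
        count = 0
        for text in records:
            self.rows.append((text, embed(text)))  # vectorize on ingest
            count += 1
        return count

    def search(self, query: str, k: int = 1) -> list[str]:
        """Brute-force similarity search over everything ingested so far."""
        q = embed(query)
        ranked = sorted(
            self.rows,
            key=lambda row: sum(a * b for a, b in zip(q, row[1])),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = VectorStore()
store.ingest(["invoice overdue notice", "quarterly sales report"])  # batch
store.ingest(line for line in ["server outage alert"])  # simulated stream
```

A production system would replace the brute-force scan in `search` with a vector index, which is the subject of the next sections.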
2. Storage
Debates can be cyclical. The NoSQL-era debate over whether it is better to use a specialized data structure, this time for vectors, has resurfaced, and so has the question of whether multi-model databases can be equally efficient. After almost 15 years of NoSQL databases, it is now common to see a relational database store a JSON document natively. However, initial incarnations of multi-model databases stored JSON documents as a BLOB (binary large object).
While it’s too early to say whether a multi-model database is as adept at storing vector embeddings as a native vector database, we expect these data structures to converge. Databases like SingleStoreDB have supported vector embeddings in a BLOB column since 2017.
Vector embeddings can quickly grow in size. Vector searches run in memory, but it may not be practical to hold all the vectors there, and purely disk-based vector searches are not performant. Hence, the database must be able to index the vectors and keep the index in memory, while the vectors themselves reside on disk.
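One way to picture an index held in memory over vectors held on disk is the sketch below. It assumes a simple coarse index (each vector is assigned to its nearest centroid, and only that bucket is read at query time); the class name `DiskBackedIndex`, the centroids and the file layout are all hypothetical, and real databases use more sophisticated structures.

```python
import math
import os
import tempfile
from array import array

DIM = 4
ROW_BYTES = DIM * array("f").itemsize  # bytes per stored vector


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class DiskBackedIndex:
    def __init__(self, path, centroids):
        self.path = path
        self.centroids = centroids  # small, kept in memory
        self.buckets = {i: [] for i in range(len(centroids))}  # row ids only
        self.count = 0

    def _nearest(self, vec):
        return max(
            range(len(self.centroids)),
            key=lambda i: cosine(vec, self.centroids[i]),
        )

    def add(self, vec):
        with open(self.path, "ab") as f:  # the vector itself goes to disk
            array("f", vec).tofile(f)
        self.buckets[self._nearest(vec)].append(self.count)
        self.count += 1

    def _read(self, row_id):
        with open(self.path, "rb") as f:
            f.seek(row_id * ROW_BYTES)
            row = array("f")
            row.fromfile(f, DIM)
        return list(row)

    def search(self, query, k=1):
        # Approximate: only the bucket nearest to the query is read from disk.
        candidates = self.buckets[self._nearest(query)]
        scored = [(cosine(query, self._read(r)), r) for r in candidates]
        return [r for _, r in sorted(scored, reverse=True)[:k]]


path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
index = DiskBackedIndex(path, centroids=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
for v in ([0.9, 0.1, 0.0, 0.0], [0.8, 0.3, 0.0, 0.0], [0.0, 0.0, 0.2, 0.9]):
    index.add(v)
nearest_row = index.search([1.0, 0.0, 0.0, 0.0], k=1)
```

The trade-off in this design is that memory holds only centroids and row ids, at the price of an approximate result: a close vector that landed in a different bucket is never examined.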
3. Performance (Compute and Storage)
An important aspect of performance tuning is the ability to index the vectors and keep that index in memory.
The database should be able to split the vectors into shards (smaller buckets) so they can be searched in parallel and take advantage of hardware optimizations such as SIMD (single instruction, multiple data). SIMD enables fast and efficient vector similarity matching without the need to parallelize your application or move large amounts of data from your database into your application.
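The shard-and-merge pattern described here can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, threads stand in for database shards, and pure Python cannot express SIMD, so the per-shard scan is a plain loop rather than a vectorized one.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_shards(vectors, num_shards):
    """Distribute (id, vector) pairs round-robin across num_shards buckets."""
    shards = [[] for _ in range(num_shards)]
    for vec_id, vec in enumerate(vectors):
        shards[vec_id % num_shards].append((vec_id, vec))
    return shards


def search_shard(shard, query, k):
    """Local top-k within a single shard."""
    return heapq.nlargest(k, ((dot(query, vec), vec_id) for vec_id, vec in shard))


def parallel_search(shards, query, k):
    """Fan the query out to every shard, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, query, k), shards)
        merged = [hit for partial in partials for hit in partial]
    return [vec_id for _, vec_id in heapq.nlargest(k, merged)]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
shards = make_shards(vectors, num_shards=2)
top_ids = parallel_search(shards, query=[1.0, 0.0], k=2)  # ids 0 and 3
```

Because each shard returns only its own top k, the merge step handles at most k results per shard regardless of how many vectors each shard holds.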
For example, in a test described in a recent SingleStore blog post, the database could process 16 million vector embeddings within 5ms to do image matching and facial recognition.
If an LLM is fed a large number of embeddings, response latency will be correspondingly high. The purpose of using a database as an intermediary is to perform an initial vector search and select a smaller set of relevant embeddings to send to the LLM.
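A rough sketch of this intermediary step: rank stored chunks against the query vector, keep only the top few, and assemble a compact prompt. The function names, the prompt wording and the sample chunks are made up for illustration, and the vectors are hand-written rather than produced by a real embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_chunks(query_vec, chunks, k):
    """chunks is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Send the LLM only the retrieved context, not the whole corpus."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    ("Refunds take 5 days.", [1.0, 0.0]),
    ("Offices close at 6pm.", [0.0, 1.0]),
    ("Refunds need a receipt.", [0.9, 0.2]),
]
selected = top_k_chunks([1.0, 0.1], chunks, k=2)
prompt = build_prompt("How do refunds work?", selected)
```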
Caching of prompts and responses from LLMs can further improve performance. We have learned from the BI world that most questions asked in an organization are frequently repeated.
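As a minimal sketch of prompt and response caching, the class below keys responses on a normalized form of the prompt so that trivially different phrasings of the same question share one entry. `PromptCache` and `fake_llm` are hypothetical names; a real deployment would also need expiry and, possibly, similarity-based matching rather than exact matching.

```python
import hashlib


class PromptCache:
    def __init__(self, llm_call):
        self.llm_call = llm_call  # any callable that takes a prompt string
        self.store = {}
        self.hits = 0

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace before hashing.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, prompt):
        key = self._key(prompt)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        response = self.llm_call(prompt)  # only reached on a cache miss
        self.store[key] = response
        return response


calls = []


def fake_llm(prompt):
    """Stand-in for a real LLM API call; records each invocation."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = PromptCache(fake_llm)
first = cache.ask("What were Q3 sales?")
second = cache.ask("  what were  Q3 sales? ")  # served from the cache
```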
4. Cost
Cost could be one of the biggest impediments to mass adoption of LLMs. Here we address the cost concerns of deploying a database to help make API calls to LLMs. As in any data and analytics initiative, it is imperative to calculate the total cost of ownership (TCO):
- Infrastructure cost of the database. This includes licensing, pay-per-use charges, APIs, etc.
- The cost to search the data using vector embeddings. Typically, this cost is higher than the conventional cost of full-text search, as extra CPU/GPU processing is needed to create the embeddings.
- Skills and training. We have already seen the creation of the “prompt engineer” role. In addition, Python and machine learning skills are essential to prepare the data for vector searches.
Eventually, we expect FinOps observability vendors will add capabilities to track and audit vector search costs.
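The TCO components listed above can be added up with a small helper like the one below. It is only a sketch: the field names are hypothetical, and the figures in the usage lines are placeholders, not real prices.

```python
from dataclasses import dataclass


@dataclass
class MonthlyTco:
    infrastructure: float         # licensing, pay-per-use charges, APIs
    cost_per_1k_searches: float   # compute for embedding-based search
    searches: int
    skills_and_training: float

    def search_cost(self) -> float:
        return self.cost_per_1k_searches * self.searches / 1000

    def total(self) -> float:
        return self.infrastructure + self.search_cost() + self.skills_and_training


# Placeholder figures for illustration only.
tco = MonthlyTco(
    infrastructure=1000.0,
    cost_per_1k_searches=0.5,
    searches=200_000,
    skills_and_training=400.0,
)
monthly_total = tco.total()
```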
5. Data Access
Semantic searches rely on natural language processing (NLP) to ask questions, which means end users’ reliance on SQL diminishes. It is quite possible that LLMs will replace business intelligence reports and dashboards. A robust infrastructure to handle APIs also becomes critical. These APIs may be traditional HTTP REST or GraphQL.
However, in a database that supports traditional online transaction processing (OLTP) and online analytical processing (OLAP), SQL makes it possible to mix traditional keyword (i.e., lexical) search with the semantic search capabilities enabled by LLMs.
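The idea of mixing lexical and semantic search can be illustrated outside SQL with a weighted score. This is a sketch under assumptions: the weight `alpha`, the term-overlap scoring and the sample documents are invented for the example, and a database would typically express the same blend in a single SQL query.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, text):
    """Lexical match: fraction of query terms that appear in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(query, query_vec, docs, k=3, alpha=0.5):
    """docs is a list of (text, vector); alpha weights semantic vs. lexical."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)[:k]]


docs = [
    ("reset your password", [0.9, 0.1]),
    ("recover account access", [0.8, 0.3]),
    ("holiday schedule", [0.0, 1.0]),
]
results = hybrid_search("password reset", [1.0, 0.0], docs, k=2)
```

Here the second result shares no keywords with the query and is surfaced only by its vector similarity, which is the case that pure lexical search would miss.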
6. Deployment, Reliability and Security
As noted earlier, vectors should be sharded to improve the performance of vector searches. Database vendors also use this approach to improve reliability, as the shards run in pods orchestrated by Kubernetes. In this self-healing approach, if a pod fails, it is automatically restarted.
Database vendors should also geo-distribute shards across different cloud providers or different regions within a cloud provider. This addresses two concerns: reliability and data privacy.
A common concern is the confidentiality of data. Organizations need the chatbot or the LLM API not to store their prompts or use them to retrain the model. OpenAI’s updated data usage and retention policy addresses this concern.
Finally, the vector search and the API call to the LLM must enforce role-based access control (RBAC) to maintain privacy, just as in conventional keyword search.
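A minimal sketch of RBAC applied to vector search: rows the caller is not entitled to see are filtered out before ranking, so they can never reach the prompt sent to the LLM. The document layout (`allowed_roles` as a set on each row) and the function name are assumptions made for this example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_with_rbac(query_vec, docs, user_roles, k=3):
    """docs is a list of dicts with 'text', 'vector' and 'allowed_roles'."""
    # Filter first: only rows sharing at least one role with the user remain.
    visible = [d for d in docs if d["allowed_roles"] & user_roles]
    ranked = sorted(visible, key=lambda d: dot(query_vec, d["vector"]), reverse=True)
    return [d["text"] for d in ranked[:k]]


docs = [
    {"text": "salary bands 2023", "vector": [1.0, 0.0], "allowed_roles": {"hr"}},
    {"text": "public holiday list", "vector": [0.6, 0.4], "allowed_roles": {"hr", "employee"}},
]
employee_view = search_with_rbac([1.0, 0.0], docs, user_roles={"employee"})
hr_view = search_with_rbac([1.0, 0.0], docs, user_roles={"hr"})
```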
7. Ecosystem Integration
A database that supports AI workloads must integrate with the larger ecosystem, including:
- Notebooks or integrated development environments (IDEs) to write code that will enable the AI value chain steps described earlier in this article.
- Existing MLOps capabilities from cloud providers like AWS, Azure and Google Cloud, as well as from independent vendors. In addition, support for LLMOps is starting to emerge.
- Libraries to generate embeddings, such as those from OpenAI and Hugging Face. This is a quickly expanding space with many open-source and commercial libraries.
The modern application space is getting redefined by the ability to chain various LLMs. This is clear in the rise of LangChain, AutoGPT and BabyAGI.
8. User Experience
The debate over which is the best technology to use for a specific task is often resolved by the speed of adoption. Technologies that have superior user experience often prevail. This experience is across various vectors (no pun intended):
- Developer experience: the ability to write code to prepare the data for AI workloads.
- Consumer experience: the ease of generating the right prompts.
- DevOps experience: the ability to integrate with the ecosystem and deploy (CI/CD).
Database providers must provide best practices for all the personas interacting with their offerings.
One thing is clear: The generative AI space is nascent and a work in progress. In the end, the guiding principles that apply to other data management disciplines should still be adhered to in regard to AI. Hopefully, this helps a bit in demystifying what is needed to leverage AI workloads and how to choose the optimal database technologies.