The GenAI Data Developer Experience: Performance Optimization
You’ve seen it often enough in science fiction, and it’s natural to assume it’s where information technology is heading: You ask the computer, perhaps speaking aloud, for an analysis of the data, and you get it — certainly before the commercial break.
Yet if you’re a database engineer or a data scientist, you’re familiar with a certain fork in the road of progress. Since computers could first accept commands, performance optimization and accelerating results have been necessary skills. These skills have always been about finding the patterns that questions or queries would most likely take, and preparing or pre-programming the system to respond best to those patterns.
Along comes generative AI, and with it, the expectation that you can ask your data whatever questions come to your mind (and probably hear the results in Majel Barrett’s voice). Now performance optimization becomes a problem. The patterns have changed. You might need an AI-enhanced analytics tool to tell you which questions a database user is most likely to ask an AI-enhanced analytics tool.
All of a sudden, databases that have been leveraging the latest advances in semiconductors and processing capability, including vast pools of fast DRAM and networks of dedicated GPUs (such as Nvidia’s H100 Tensor Core GPUs), appear to have an early advantage. Systems whose vendors were bragging that they don’t need indexes or pre-engineered optimizations to be fast on the draw are storming into the emerging ad hoc query market bearing this message: When you pair large language models with database engines geared for this new world of servers, the need for performance pre-optimization vanishes.
In its place, we find natural language query processing for real-time data stores. The ability to ask questions of your data, even as it’s being collected and processed in real-time, using a natural language everyone already knows, seems at the outset like a genuine paradigm shift.
Which sounds delightful, for the first 15 seconds or so. But then Yogi Berra’s proverbial fork presents itself: What happens to the database developer — the person whose value to the enterprise, to this day, has rested on the ability to optimize queries in advance for performance and reliability? Perhaps just this once, science fiction’s most overused trope — AI processes outmoding human ingenuity — looks a little more like prophecy.
“In the last forty years or so, a lot of data scientists have spent time [working with] natural language processing,” said Michael Hunger, who now heads developer product strategy for graph database maker Neo4j. “They went into every language, looked at its grammar, separated their vocabularies for each domain. They spent countless years fine-tuning, building specific models for lawyers, for biologists, and so on. This is all now obsolete with large language models.”
Rule #11: Distribution Independence
If you’ve worked in the database field for any length of time, you’ve come across the phrase “ad hoc query.” It might sound a little disgusting at first. After that feeling wears off, you’ll find it refers to a database system’s ability to accept a query that a user composes at any point in time, and to respond with the requested data in short order.
In 1985, Dr. E. F. Codd, who by that time was properly credited as the inventor of the relational model on which SQL databases are built, first published what came to be known as the “12 Rules” of true relational databases. Rule #11 is called the distribution independence rule. In a 1990 book attempting to explain his reasoning for folks who had already come to their own conclusions, Codd explained the distribution independence concept:
…permits not only all the data at any one site to be moved to another, but also a completely different decomposition of the totality of data at all sites into fragments to be deployed at the various sites.
Rule #11 has been re-interpreted any number of self-serving and convenient ways, but here’s the one that makes the most sense: The user’s ability to ask any question of a database should not be precluded by the way the data is engineered, or by how it is distributed.
It was vitally important, Codd went on, that a database that purports to be relational without supporting distribution independence, be made to do so as soon as possible. One of the four reasons he offered was so that the database could make way for a type of functionality which, at that time, did not exist yet: “vastly improved automatic optimization of execution — and, when necessary, automatic re-optimization.” In other words, performance optimizations made on the fly, rather than in advance by a stored procedure.
Codd didn’t say it himself, because he didn’t leverage Latin phrases to sound pretentious. Yet not only was he foreseeing the future of data sharding, he was also paving the way for the ad hoc query. Up until the 1980s, especially to people on the receiving end of vendor demonstrations, databases appeared to be quick and painless. Usually, that was because the queries being demoed were the ones databases were pre-engineered to answer. Technically, that’s a violation of Rule #11. However the data may be distributed, the database should always be ready to provide an optimal response.
The coupling of the database with the large language model (LLM) was an inevitable convergence — one clearly enabled by Rule #11. What matters at this point isn’t just how the coupling takes place, but also, with respect to database system architecture, where.
Last October 5 at DockerCon, in conjunction with LangChain and language model provider Ollama, Neo4j announced that its graph database had been selected as the default platform for Docker’s new GenAI Stack, a developer’s toolkit for crafting language-enabled AI applications. During a session there, Hunger spoke at length about a type of LLM-involved development process he calls grounding:
“With all these challenges with LLMs, how can we make it better?” said Hunger to attendees. “You can either take an existing model and fine-tune it. But it’s a lot of effort, and oftentimes the outputs and results are — at least today — not there yet for everyone. You can provide a few examples when you talk to the LLM, but then you basically have to hard-code these examples — which is basically not really helpful.”
You can just feel the architect of Rule #11 nodding his head in agreement.
Grounding, Hunger went on, is a user-driven process. Here, the user provides the LLM with the extra context it needs to give more accurate responses. Taking the microphone from Hunger, LangChain CEO Harrison Chase went on to add further detail to what Hunger called an “opportunity for developers.” Most tellingly, Chase told those developers that they don’t have to be data scientists anymore.
“There are exciting packages that we’ll be using like Ollama,” said Chase, “which are locally-hosted models that bundle up super-nicely. You no longer have to train the model. You just have to run it and serve it, and Ollama makes that really easy.”
A custom language model, such as one built by Ollama or with the help of Ollama’s tools, would ground the database with a working vocabulary that’s already tuned to the style and context of questions that users are most likely to ask. In an Ollama blog post last October 13, LangChain maintainer Jacob Lee introduced his project to connect an Ollama LLM with LangChain natural-language composition tools, to make a prompt available from any user’s browser that enables ad hoc queries of the user’s existing documents.
We’re getting closer to the point where the distribution and structure of data matters not one whit to the person who wants to ask a question. Maybe more importantly, we’re nearing a stage where the AI that drives these conversations can be implemented on the client side rather than the server side. (Sorry, Google.)
All through the history of databases, however, “paradigm shifts” have found themselves stuck at the starting gate. The tasks that Hunger and Chase declared no longer necessary are actually jobs, occupied by people. We like to believe the skill sets required to maintain those jobs evolve as quickly as the technology. But check today’s job postings in the database field, and you’ll see that they don’t. There’s a healthy demand right now for skills that are two or three decades old.
With that in mind, we asked Neo4j’s Hunger: once people become capable of asking ordinary questions, and getting responses from their data that used to require engineering, does the need for engineering go away?
“What you could also do with the human question, if you know what domain your database is about,” Hunger responded, “instead of turning all my text into vector embeddings and having the vector index find similar documents or nodes, [you could] say, ‘Take the following question and turn it into a Cypher statement.’” That statement could then be run against the Neo4j graph database just as it would normally. (Cypher is Neo4j’s counterpart to SQL, geared for graph databases.)
The advantages from this, as Hunger sees them, would include giving the user a kind of roundabout education in Cypher and perhaps also SQL. True, the novice user may not understand how the query language’s syntax is structured, during the first or second run. But over time, users would begin to glean the details just from everyday use. Hunger believes this kind of “democratization,” as he refers to it, would come to reinforce existing job skills in query languages and query optimization, rather than outmoding them.
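The question-to-Cypher approach Hunger describes might be sketched roughly as follows. This is an illustration under stated assumptions, not Neo4j’s or LangChain’s implementation: the schema, the function names, and the prompt wording (apart from Hunger’s quoted instruction) are all hypothetical, and the call to an actual language model is deliberately left out.

```python
from textwrap import dedent

# Hypothetical graph schema, invented purely for illustration.
SCHEMA = dedent("""\
    (:Person {name})-[:ACTED_IN]->(:Movie {title, released})
    (:Person {name})-[:DIRECTED]->(:Movie {title, released})
""")

# Three backticks, built indirectly so this listing stays easy to embed.
FENCE = "`" * 3


def build_cypher_prompt(question: str, schema: str = SCHEMA) -> str:
    """Wrap a plain-English question in instructions that ask for Cypher."""
    return (
        "You translate questions into Cypher for a graph database.\n"
        "Use only the labels, relationships, and properties in this schema:\n"
        f"{schema}\n"
        "Take the following question and turn it into a Cypher statement. "
        "Return only the Cypher.\n"
        f"Question: {question.strip()}\n"
    )


def extract_cypher(llm_reply: str) -> str:
    """Strip a surrounding code fence from a model reply, if one is present."""
    lines = llm_reply.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    # The prompt would be sent to whichever model is in use (not shown here).
    print(build_cypher_prompt("Who directed The Matrix?"))
```

Because the model returns an ordinary Cypher statement, a person can read it, correct it, and run it against the database like any hand-written query, which is where the “roundabout education” would come from.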
The New Context Stack
Michael Hunger is not alone in this theory, and someone else may be trying to beat his company in proving it.
Last August, Kinetica began making available access to SQL GPT, its pairing of the company’s vector processing engine with a large language model. Leveraging a concept called retrieval-augmented generation (RAG), the tool’s sole function is to interpret a natural language request whose subjects, verbs and objects can be mapped to symbols in a tabular, time-series, or geospatial database. A function working in the background employs so-called contexts — which the end user never sees — as aids to the LLM in mapping keywords to symbols.
The result is a SQL query which can be relayed to the Kinetica database, and which should (much more likely than not) be capable of producing a table, report, map, or chart of some form that responds to the query whose criteria were inferred from the natural-language request.
You’ve seen examples, including from my friend and The New Stack colleague Stephen J. Vaughan-Nichols, of a ChatGPT prompt being asked to write a snippet of code in Python that fulfils the criteria of an English-language request. There are several critical differences with SQL GPT, the first being that its language model is not trained with web-based data. It’s trained instead on how to produce SQL applicable to Kinetica. Its context supplements this training with lengthy instructions pertinent to the industry or business, or other function to which the natural-language request refers.
“Today, we do have an API where you can pick-and-choose between different [language] models — either our own or other public LLMs,” explained Philip Darringer, Kinetica’s senior vice president, speaking with The New Stack. From there, he continued, you would provide SQL GPT with a context, which at a minimum should reference the symbols from the database tables you intend to use. SQL GPT then acquires the Data Description Language (DDL) information that maps those symbols to database entries, completing the connection from language to symbology to configuration.
The instructions in a context are already human-legible. Because they’re being given directly to an LLM, they’re in English. So the LLM is told, Darringer explained, in very precise and explicit language, that its job is to generate SQL before being given a thorough explanation of just how that’s done. “Then you can add additional enrichment to that context if needed,” he went on, “to increase the accuracy of the queries that are generated. You can add examples or even feedback from logs. If you can start mining the logs of your Kinetica instance, to see which queries have been executed and which ones people have already provided feedback on, you can use that to provide additional context to help educate the models on what’s working and what’s not.”
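To make the idea of an enrichable context concrete, here is a minimal sketch of how table definitions, plain-English rules, and approved examples mined from logs might be assembled into the text a language model receives. It is not Kinetica’s API: the QueryContext class, its fields, and its methods are hypothetical, and it takes the table definitions as direct input rather than looking them up from the database as Darringer describes.

```python
from dataclasses import dataclass, field


@dataclass
class QueryContext:
    """Hypothetical container for what the article calls a 'context'."""

    ddl: list[str]  # table definitions the model is allowed to use
    rules: list[str] = field(default_factory=list)  # plain-English guidance
    examples: list[tuple[str, str]] = field(default_factory=list)  # (question, SQL)

    def add_feedback(self, question: str, sql: str, approved: bool) -> None:
        """Keep a logged query as an example only if users marked it correct."""
        if approved:
            self.examples.append((question, sql))

    def render(self, question: str) -> str:
        """Assemble the full text handed to the language model."""
        parts = ["Your job is to generate SQL for the tables below.", *self.ddl]
        parts += [f"Rule: {rule}" for rule in self.rules]
        for asked, sql in self.examples:
            parts.append(f"Q: {asked}\nSQL: {sql}")
        parts.append(f"Q: {question}\nSQL:")
        return "\n\n".join(parts)


if __name__ == "__main__":
    ctx = QueryContext(
        ddl=["CREATE TABLE trucks (id INT, ts TIMESTAMP)"],
        rules=["Use only the tables listed above."],
    )
    ctx.add_feedback("How many trucks?", "SELECT COUNT(*) FROM trucks", approved=True)
    print(ctx.render("Which trucks reported today?"))
```

Filtering on the approved flag is one simple way to let feedback shape the context: only queries that people confirmed as correct are replayed to the model as examples.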
For its own self-guided demonstration of SQL GPT, Kinetica Cloud offers access to a public data stream providing live telemetry for delivery trucks (though not their contents) in Washington, D.C. A second table coordinates the names of D.C. landmarks with geospatial coordinates. So you can give SQL GPT an instruction such as, “Show me all the trucks that passed by the Washington Monument in the last half-hour.” In about five seconds, you’d get a SQL statement, which you can then pass to Kinetica. The database — which is already fine-tuned for this type of statement — will respond with results in under a second.
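For a sense of what a generated statement for that request could look like, here is a small self-contained sketch. It runs on SQLite rather than Kinetica, the table and column names are invented, and a crude bounding box stands in for the proper geospatial distance test a geospatial engine would presumably use. It is not the output of SQL GPT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE landmarks (name TEXT PRIMARY KEY, lat REAL, lon REAL);
    CREATE TABLE truck_pings (truck_id TEXT, ts TEXT, lat REAL, lon REAL);
    INSERT INTO landmarks VALUES ('Washington Monument', 38.8895, -77.0353);
    INSERT INTO truck_pings VALUES
        ('T1', datetime('now', '-10 minutes'), 38.8896, -77.0350),  -- near, recent
        ('T2', datetime('now', '-2 hours'),    38.8894, -77.0354),  -- near, too old
        ('T3', datetime('now', '-5 minutes'),  38.9072, -77.0369);  -- recent, far away
""")

# Roughly 200 meters expressed in degrees: a crude box, not a true distance.
BOX_DEG = 0.002

# The kind of statement a natural-language request might be translated into.
NEARBY_TRUCKS_SQL = """
    SELECT DISTINCT p.truck_id
    FROM truck_pings AS p
    JOIN landmarks AS l ON l.name = :landmark
    WHERE p.ts >= datetime('now', '-30 minutes')
      AND ABS(p.lat - l.lat) <= :box
      AND ABS(p.lon - l.lon) <= :box
    ORDER BY p.truck_id
"""

rows = conn.execute(
    NEARBY_TRUCKS_SQL, {"landmark": "Washington Monument", "box": BOX_DEG}
).fetchall()
trucks = [row[0] for row in rows]
print(trucks)  # prints ['T1']
```

The request’s three ingredients map directly onto the query: “trucks” becomes the telemetry table, “the Washington Monument” becomes a lookup in the landmarks table, and “the last half-hour” becomes the timestamp filter.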
When such a tool is ready for widespread adoption, what changes about the organizations that adopt it? Kinetica’s not in a position to make predictions that wide-reaching just yet, although Darringer did say he felt the person taking charge of adoption in an organization would most likely be the one who’s already managing the machine learning process. Such people may be, he suggested, “pure analysts” whose science bypasses computing altogether, rooted directly in mathematics.
“You can imagine that the data stewards in an organization,” explained Darringer, “will start to add and generate these queries and contexts, to help train and fine-tune the model, just so the less practiced SQL users — the pure analysts who may not be conversant in SQL at all — can start to make use of the data.”
Traditionally, database vendors have perceived two separate classes of customers: the application developer and the data scientist. B2B marketers and tech publications alike have treated these groups almost as isolated, self-contained entities. The former entity is said to be more in tune with the nuances of source code; the latter entity has a more intimate relationship with the data.
As generative AI ingratiates itself further into the organization, a third class of customer may be taking shape. They may not be “developers” in the sense of someone who wrestles day-to-day with connecting Python with JDBC drivers to JSON files. They could be developers in a much broader sense. Perhaps they could be given an explicit set of instructions, or maybe led step-by-step by a YouTube video, to produce one or two Python statements.
“Twelve months ago,” said Neo4j’s Hunger, “when you talked to someone about machine learning, those people mostly weren’t developers — they were data scientists. […] Now, this picture has really changed. All these new language models are available as APIs, libraries, Python libs, and so on. So now, developers who build applications are enabled again to work with machine learning models, because they don’t have to train them themselves. They don’t have to create them. Just use them.”
Pointing to a plethora of new, free courses published by DeepLearning.ai on the fine art of providing background and context for LLMs, Hunger noted, “I think you can get to ‘Hello, world!’ in an hour or so.” Folks who may not have spent years building their database management skillset, he said, can demonstrate their own ability to accomplish near-instantaneous data application generation, all before the end of the workday.
If this promise, and the arguments behind it, all seem somewhat familiar to you, it’s not a paranormal feeling. Nearly three decades ago, something akin to the “democratization of data” was behind Microsoft’s effort to promote a tool unimaginatively named Query, as a general-purpose front end for conducting what Microsoft called ad hoc queries.
The Macintosh screenshot above is from a 1995 demonstration of Microsoft Query, developed for a group of pharmacists working with St. Louis, Missouri’s BJC Health System. With this frontend tool, their presentation and white paper stated, pharmacists could make on-the-spot decisions that may be critical to patient care, all without “technical assistance in collecting any patient data necessary for analysis.”
There was a catch back then: Making these queries work required the implementation ahead of time of stored procedures — enabling the types of queries that pharmacists were most likely to pose. Not exactly as ad hoc as would deserve a Latin phrase.
Fast-forward to today: The remnants of Microsoft Query have been embedded (some would say “buried”) in Excel. But the stored procedure and data pre-engineering businesses, particularly in the pharmacy fields, have blossomed. Custom-configured pharmacy data, and the procedures for accessing it, are behind the success of healthcare technology providers such as First Databank.
Data engineering and data science are fields that show no sign of disappearing. It’s quite easy to imagine the latest advances in ad hoc queries morphing into a custom language model industry, perhaps with Ollama at the center. The open question is whether that evolutionary path finally leads, this time around, to an entire category of job skills, one that’s hung on successfully for more than half a century now, becoming outmoded tomorrow.
“That’s the hundred-billion-dollar question,” noted Hunger. “As you’ve probably read on many outlets already, ‘Will developers become obsolete in five years?’ That’s a question for everyone. Because if you talk about Cypher here, you could also talk about Python or Java, or any other language. If it’s good enough that a manager or a business user describes what they want to have, and the LLM basically builds it all by themselves, then the question is whether you still need developers.”
Disclosure: Scott Fulton is a previous employee of Neo4j, and a former colleague of Michael Hunger. He currently serves as a content advisor to Kinetica.