The Next Wave of Big Data Companies in the Age of ChatGPT
Remember the catchphrase "Big Data"? It spawned many successful companies in the cloud computing era, such as Snowflake, Databricks, DataStax, Splunk and Cloudera. But now we're in the AI era, and machine learning software is supposedly at or near "intelligence" (even if it is prone to hallucinating; but then, aren't we all?).
So given the current AI boom, do we even need “big data” companies that sort and organize the world’s data? Can’t the AI do that for us now?
To find out how data companies are adapting to the AI age, I spoke to Aaron Kalb, a co-founder of Alation, which styles itself as a “data intelligence” platform and promotes a concept it calls the “data catalog.” This combines “machine learning with human curation” to create a custom store of data for enterprise companies.
How ChatGPT Differs from Siri in the 2000s
Before co-founding Alation with ex-Oracle executive Satyen Sangani, Kalb worked at Apple on its Siri software. Siri was perhaps the first mainstream software application to make use of AI language modeling. So I asked him how different the current generation of generative AI software (such as ChatGPT and Google Bard) is from what Siri was doing in the late 2000s.
“Siri had a difficult job at first, because they didn’t have conversational training data at the time,” he replied. “They were the first voice assistant.” The corpus that the language models for Siri were trained on was much smaller than the training data of large language models (LLMs) today — Kalb called Siri’s training data a “journalistic corpus.”
As well as having relatively sparse training data, Siri didn't use much machine learning. Kalb says that Siri made a lot of mistakes in both voice-to-text and text-to-intent. "And I think to this day, Siri, Alexa, Cortana and Google Assistant, all have struggled," he added.
Why Does AI Hallucinate?
All that said, it’s not as if generative AI is perfect either. I asked Kalb what he makes of the current issues with hallucinations (making up facts) that affect software like ChatGPT and Bard.
Kalb suggests that it’s a “psychological phenomenon” for the human users of generative AI, more than an issue with the software itself.
“For many kinds of prompts, it really seems as though it is understanding the prompt and formulating an answer and then putting it into words,” he said, regarding ChatGPT and similar software. “And it’s just so impressive. We think that it has understanding and true intelligence. What it’s actually doing is [that] it’s basically a super sophisticated Markov model, where it’s saying, hey, what’s the next word given the prior words it said, the prompt before that, and then the entire internet probabilistic distribution of words before that.”
He thinks the hallucinations are in a sense “forced” on the AI software, sometimes because the human prompts were not good enough.
“The hallucination seems like, wait, you’ve gone crazy in the middle of your logic! But, in fact, it’s just an artifact of the algorithm […] it has a distribution of all the words that could possibly come next, and it picks one with some statistical randomness. And the hallucination is what happens when it gets to a point where it gets very unlucky, so to speak; or, given the prompt, it is not obvious what to say. And so it’s forced to pick something, more or less a shot in the dark.”
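Kalb's description of next-word prediction can be sketched with a toy example. The vocabulary and probabilities below are invented purely for illustration (a real LLM derives its distribution from training data, over tens of thousands of tokens), but the sampling step is the same: the model weighs every possible next word and picks one with some randomness, so an unlucky draw can produce a confident-sounding wrong answer.

```python
import random

random.seed(7)

# Invented next-word distribution for a context like
# "the capital of Freedonia is". The numbers are made up;
# only the sampling mechanism is the point.
next_word_probs = {
    "unknown": 0.45,    # the "honest" continuation
    "Paris": 0.25,      # plausible-sounding but wrong
    "Fredville": 0.20,  # confabulated
    "a": 0.10,
}

def sample_next_word(probs):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Sampling many times shows that low-probability ("hallucinated")
# continuations still get picked a meaningful fraction of the time.
samples = [sample_next_word(next_word_probs) for _ in range(1000)]
for word in next_word_probs:
    print(word, samples.count(word))
```

Even though "unknown" is the single most likely word here, the model will emit a wrong continuation more than half the time in aggregate, which is roughly the "shot in the dark" Kalb describes when no continuation is clearly favored.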
How Data Intelligence Fits into the AI Landscape
So what is "data intelligence"? Kalb started answering that by noting that both AI and BI (business intelligence, a common enterprise acronym) are subject to "garbage in, garbage out."
“So data intelligence is this layer that precedes AI and BI, that makes sure you can find, understand and trust the right data to put into your AI and BI.”
In this context, he said, taking something like ChatGPT from the public internet and bringing it into the enterprise is very risky. He thinks that data needs to be, well, more intelligent before it is used by AI systems within an enterprise.
Also, he doesn’t think that the “internet scale” of ChatGPT and similar systems is needed in the enterprise. This is where Alation’s “data catalog” comes into play, as it will “distill down” the data and give it “specific mapping.”
Every organization has its own terminology, he said — that could be industry terms, or things that are very specific to that company.
“So that’s where data intelligence and the data catalog helps,” Kalb explained. “It helps to map that last mile of how language is used by people in the organization, and how data is stored in the databases.”
Alation’s software automates the process of putting an organization’s data into these “data catalogs,” which can then optionally be fed into a generative AI system (if the company wants to do that).
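The article doesn't detail how Alation implements this, but the "last mile" mapping Kalb describes can be sketched as a glossary lookup that grounds a user's question in an organization's own schema before it reaches an AI system. All names, tables and terms below are hypothetical:

```python
# Hypothetical glossary mapping an organization's terminology to where
# the data actually lives. Per Kalb, a real data catalog would build
# this with a mix of machine learning and human curation.
catalog = {
    "churn": {"table": "crm.customer_status", "column": "cancelled_at"},
    "ARR":   {"table": "finance.revenue",     "column": "annual_recurring"},
}

def ground_question(question: str, catalog: dict) -> str:
    """Annotate a question with catalog hits before handing it to an LLM."""
    hits = [term for term in catalog if term.lower() in question.lower()]
    notes = "; ".join(
        f"'{t}' -> {catalog[t]['table']}.{catalog[t]['column']}" for t in hits
    )
    return f"{question}\n[catalog context: {notes}]" if hits else question

print(ground_question("What was our churn last quarter?", catalog))
```

The design point is that the mapping is resolved deterministically from curated metadata, rather than leaving the model to guess what "churn" means inside this particular company.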
The way Kalb explains it, data intelligence is “step zero for whatever the task is — whether it’s [data] preprocessing, or ML training, or just making a spreadsheet and analyzing it for a shareholder meeting.”
Welcome to the Next Wave of Big Data
So far I've spoken to generative AI companies like Cohere and Vectara about their vision for enterprise IT. Both mentioned the use case of an employee being able to have a conversation with an AI trained on large language models — essentially what IT has traditionally called "knowledge management," now in chatbot form.
Kalb makes a good point, though: much depends on the quality of the data the generative AI has been trained on. He sees data intelligence as “the missing link” between ChatGPT and “the dream of having an enterprise portal where you can ask a question in English and get an accurate, trustworthy answer about your business.”
So just as cloud computing ushered in a raft of useful “big data” companies built off the back of it, it seems clear that generative AI will be a catalyst for the next wave of data intelligence solutions. As I’ve been saying a lot this year in relation to AI, watch this space!