Language is a fascinating construct at the heart of how humans share and understand ideas and knowledge. For something so complex and nuanced, few people acknowledge it as such, because it feels (and is) instinctive and natural. That’s why we call the language of human communication “natural language.”
We begin absorbing language from infancy. The simple words come in the first year or two. By the age of six, we’ve added thousands more to our vocabulary, and by our teenage years, upwards of 100,000 learned words. But as much as language is an innate capability for humans, machines find it very difficult.
This is a classic example of Moravec’s Paradox, which observes that what is easy for humans is hard for machines, and vice versa. Software can compute mathematical operations on large number sets quickly and flawlessly, but it struggles with everyday human activities like recognizing objects in its surroundings or comprehending language. And while there has been a tremendous amount of activity to develop software that understands natural language the way humans do, it remains a major challenge.
Words Are Not Numbers
The last 20 years have seen an explosion in the amount of data of all forms produced and captured. Broadly, this data falls into two categories: structured and unstructured. Structured data is numerical and organized, and by definition is the basic input of mathematical operations. Thanks to machine learning (ML) and the overall growth in data processing capability, AI has made solid progress in producing predictive insights from structured data for everything from potential machinery failure to fraud detection. If you can express and structure data numerically, you have a potential candidate for machine learning-driven insights.
But digital technology has also produced a massive increase in unstructured data, which includes pictures, videos, and language data. This is where traditional machine learning-based natural language processing (NLP) techniques have fallen short. Language is data-dense — it carries a tremendous amount of potential information depending on how it is used.
As a thought exercise, just count the meanings and usages of any common word like “bat.” These meanings flow from context. The linguist J.R. Firth wrote, “You shall know a word by the company it keeps.” These intrinsic elements of language make it incredibly challenging to apply mathematical techniques that deliver a real understanding of the meaning in natural language. And yet, there is a more fundamental shortcoming of a “one-size-fits-all” machine learning approach to language: the knowledge problem.
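Firth’s observation can be sketched in a few lines of code. This is a deliberately toy illustration, not a real NLP technique: the senses and cue words for “bat” are invented for demonstration, and the “context” is just the other words in the sentence.

```python
# Toy illustration of Firth's idea: infer a word's sense from the
# company it keeps. Senses and cue words are invented for this sketch.
SENSE_CUES = {
    "animal": {"cave", "wings", "nocturnal", "fly"},
    "sports": {"baseball", "swing", "hit", "pitcher"},
}

def guess_sense(sentence: str) -> str:
    words = set(sentence.lower().split())
    # Score each sense by how many of its cue words appear in context.
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    return max(scores, key=scores.get)

print(guess_sense("the bat flew out of the cave on silent wings"))   # animal
print(guess_sense("he grabbed the bat and waited for the pitcher"))  # sports
```

Real systems face exactly the problem this sketch hides: someone (or something) has to supply the cue knowledge, and for every sense of every word.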
The Knowledge Problem
The language challenge compounds when you enter the real world of complex language documents that power so many enterprises and are unique to their domains. These are, by definition, edge cases that make the language even more complex. Machine learning models know the world only through the data on which they are trained, and they arrive at their outcomes through algorithms that are in many cases complex and opaque — the famous “black box” characteristic of so many AI approaches.
Much of the work in delivering a real-world solution rests on ensuring the data set is large enough and representative enough to capture the information that a subject matter expert recognizes only after years of experience and training. In many cases, such a large volume of training data is not available. This is an ongoing exercise as well, given that the real world changes over time and the models need to undergo retraining.
Even the much-publicized advances of large language models like GPT-3 offer little reason for optimism against this complexity. These models rely on massive data sets for their training and can handle relatively simple language cases. But lacking any true grounding in a specific domain, they fall well short of what a human with experience and knowledge uses to understand intent, context, and meaning.
The Whole Exceeds the Sum of the Parts
There is emerging recognition of the need to combine the capabilities of machine learning approaches with knowledge-based approaches that build on what the experts in an enterprise develop over years. These knowledge-based approaches are known as symbolic AI and rely on techniques for embedding knowledge similar to how humans build their own mastery of a subject.
The symbolic approach offers the added benefit of explainability in that outcomes are tied to explicit representations of knowledge. The symbolic approach, in fact, was the first technique used for AI natural language understanding and is increasingly viewed as a necessary complement to more recent machine learning approaches.
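A minimal sketch can show what “outcomes tied to explicit representations of knowledge” means in practice. The rules, labels, and keywords below are hypothetical, not any vendor’s API: the point is only that a symbolic system can return the evidence behind its decision, which is exactly what a black-box model cannot.

```python
# Minimal sketch of a symbolic, rule-based classifier whose outcomes
# are explainable: each decision cites the human-authored rule it fired.
# Labels and keywords are hypothetical examples for illustration.
RULES = [
    ("contract", ["indemnify", "hereinafter", "governing law"]),
    ("invoice", ["amount due", "invoice number", "remit"]),
]

def classify(text: str):
    lowered = text.lower()
    for label, keywords in RULES:
        matched = [kw for kw in keywords if kw in lowered]
        if matched:
            # The explanation is simply the evidence the rule matched.
            return label, f"matched keywords: {matched}"
    return "unknown", "no rule matched"

label, why = classify("The invoice number is 42; the amount due is $100.")
print(label, "-", why)
```

Production symbolic systems use far richer representations (taxonomies, knowledge graphs, linguistic rules), but the explainability property is the same: every outcome traces back to explicit knowledge a domain expert can inspect and correct.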
The combination of learning and knowledge approaches offers the ability to generate deep understanding at scale with insights relevant to the domain and outcomes that are explainable. This “hybrid” approach can enable people to be better at their jobs (be more expert) by ensuring that relevant information embedded in language is captured and delivered in a scalable way for faster, smarter, and more consistent decisions. This is ultimately the arena in which businesses compete, and where the best technology delivers.
The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: MADE, Famous, Real.