What Large Language Models Can Do Well Now, and What They Can’t

Attendees of New York QCon earlier this month got a preview of where the exciting world of Large Language Model (LLM)-based artificial intelligence (AI) may be going, as well as some limits of how far the technology can practically reach.
Two OpenAI engineers from the company’s API team demonstrated ChatGPT’s newest feature, Functions, in one session.
Functions are a way of connecting ChatGPT to the rest of the world, explained Atty Eleti, OpenAI Software Engineer.
A big limitation of the service to date is that it is built on a body of knowledge that extends only until 2022, when the process of gathering all the training data for ChatGPT was completed.
Functions are a way into the world of real-time data. They give ChatGPT permission to execute select actions on the user’s behalf once the user signals an intent for such an action, with a prompt such as “Show me a list of hotels in the area.”
In practice, this means ChatGPT can now call on third-party services, such as Yelp, to provide information on the user’s behalf. ChatGPT can then format the results according to a set of instructions provided by the developer.
To make this work, the team fine-tuned the GPT-3.5 model behind ChatGPT to understand when to take an action, or use a tool, on behalf of a user.
“The end result is a new set of models that can now intelligently use tools and call functions for you,” explained Sherwin Wu, a member of OpenAI’s technical staff.
As an example, Wu showed how to ask ChatGPT, through the OpenAI API, for the current weather. With instructions passed along to the model, it can call an external weather service, using the location of the user, and return the results formatted for human readability.
Another example is using Yelp to provide the user with a list of nearby restaurants. ChatGPT can query public external services, or even private sources of data, when provided with log-in instructions through the API.
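Roughly, the weather demo works in two passes through the API: the model first returns a structured request to call a developer-defined function, the developer’s code executes that function, and the result goes back to the model to be phrased for the user. The sketch below is a minimal reconstruction in Python using the SDK syntax of the time; the get_current_weather helper and its canned response are stand-ins for a real weather API, not part of OpenAI’s demo.

```python
import json
import openai  # openai Python SDK v0.27-era syntax; assumes OPENAI_API_KEY is set in the environment


def get_current_weather(location, unit="fahrenheit"):
    # Stand-in for a real weather API call; returns canned data as JSON.
    return json.dumps({"location": location, "temperature": "72", "unit": unit, "forecast": "sunny"})


# Describe the function to the model so it knows when and how to request it.
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g. Brooklyn, NY"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]

messages = [{"role": "user", "content": "What's the weather like in Brooklyn right now?"}]

# First pass: the model decides whether to answer directly or to request a function call.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=messages,
    functions=functions,
)
message = response["choices"][0]["message"]

if message.get("function_call"):
    # The model returns the function name and JSON arguments; our code executes the call.
    args = json.loads(message["function_call"]["arguments"])
    result = get_current_weather(**args)

    # Second pass: hand the function's output back so the model can format a reply for the user.
    messages.append(message)
    messages.append({"role": "function", "name": "get_current_weather", "content": result})
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
    print(final["choices"][0]["message"]["content"])
```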
The presentation was clearly aimed at developers who want to build their own apps on the OpenAI platform. While OpenAI might have started as a large-scale experiment in using AI, it is clear the company has plans to market its services as a platform upon which to build applications.
Limits of Generative AI
Others are more circumspect about the possibilities of ChatGPT, such as Mathew Lodge, CEO of AI-based unit test automation provider DiffBlue, who spoke in an earlier session at QCon.
At its core, LLM-based generative AI relies on a single mechanism, called a transformer, which is basically a function to predict what the next word will be, given a prompt and a training set. It is simply a large statistical model that returns results based on information it has seen before.
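In other words, at each step the model produces a score for every word in its vocabulary and picks a likely next word from those scores. The toy snippet below, with made-up words and scores, is only a sketch of that sampling step, not how any production model is implemented.

```python
import numpy as np

# Imagine the model has just read "The capital of France is ..." and its final
# layer emits one score (logit) per vocabulary word. Softmax turns the scores
# into probabilities, and the next word is chosen from that distribution.
vocab = ["Paris", "London", "blue", "pizza"]
logits = np.array([4.1, 2.3, 0.2, -1.0])  # invented scores for illustration

probs = np.exp(logits - logits.max())
probs /= probs.sum()

greedy_choice = vocab[int(np.argmax(probs))]   # take the single most likely word
sampled_choice = np.random.choice(vocab, p=probs)  # or sample, which adds variation
print(dict(zip(vocab, probs.round(3))), greedy_choice, sampled_choice)
```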
“It’s important to remember this because you can read all kinds of crazy stuff about transformer-based models, large language models, and how they’re intelligent, that they have a theory of mind. They don’t have any of those things. They are next-word predictors,” Lodge said.
Completed in 2019, GPT-2 was built on 1.5 billion parameters. The following year, GPT-3 arrived, built on a 175-billion-parameter model that would require 355 GPU-years to train on the highest-end GPU today. It would cost $4.6 million to run such a job on Azure, deep learning service provider Lambda has estimated.
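Those headline numbers hang together: a rough back-of-the-envelope check, assuming a cloud rate of about $1.50 per GPU-hour (an assumption for illustration, not a figure from the talk), lands near the same total.

```python
# Rough sanity check of the cited training-cost estimate.
gpu_years = 355
gpu_hours = gpu_years * 365 * 24       # roughly 3.1 million GPU-hours
price_per_hour = 1.50                  # assumed on-demand cloud rate, USD per GPU-hour
print(gpu_hours, gpu_hours * price_per_hour)  # about $4.7 million, in line with the $4.6M estimate
```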
We don’t know how big the newest version, GPT-4, is, Lodge noted. OpenAI is keeping mum on the details, citing the newly competitive nature of the market.
But like the previous versions, GPT-4 is answer-driven, rather than goal-driven. The breakthrough with this release is that the new version can do tasks that it wasn’t specifically trained to do.
“It generalizes nicely for text and language tasks,” Lodge said.
This means GPT-4 is great at completing boilerplate, such as for Java classes. It also translates well to working with an external API that has little documentation (Lodge commented that traffic on Stack Overflow has declined since the emergence of ChatGPT).
But certain limitations remain with the new release, and they appear to be hard-wired into the LLM approach itself.
Accuracy is still problematic. By now, everyone knows of the LLM’s tendency to make stuff up. LLMs have been compared quite a bit to search engines, but the fact remains that search engines are far more accurate at providing the information users need.
Lodge shared one surprising example that came from the Geo-location API service OpenCage. ChatGPT falsely stated on numerous occasions that the company offers a service that would, given a phone number, provide a location for that phone. The company got so many API requests for this non-existent service — all generated by ChatGPT — that it had to post a disclaimer on its website.
Nor are LLMs particularly good at mathematics. At heart they are a language analysis tool, not one built for symbolic manipulation, Lodge pointed out.
“Fundamentally, these are very, very large statistical models. They’re not predictable to humans. They’re not explainable by humans. We can’t predict what they’re going to do” — Mathew Lodge
Another problem with generative models is that a small change in the input can make a huge difference in the output. They’re not deterministic. A GPT-4-based code assistant may do a great job building out a programming class for “dogs” within an application for pets, but might produce total gibberish if the instructions were changed to build a class for “cats.” It’s the semantics, Lodge explained, that trip up ChatGPT.
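The non-determinism is easy to observe directly. The small sketch below, using the same Python SDK syntax as above, simply sends the same prompt twice; the prompt, model name, and temperature value are illustrative choices, not anything from Lodge’s talk.

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# With a non-zero temperature, the same prompt can yield different completions
# from run to run, because the model samples from a probability distribution
# rather than returning a fixed answer.
prompt = [{"role": "user", "content": "Write a Python class representing a cat for a pet-store app."}]

for run in range(2):
    reply = openai.ChatCompletion.create(
        model="gpt-4",
        messages=prompt,
        temperature=0.8,  # sampling temperature; higher values mean more variation
    )
    print(f"--- run {run + 1} ---")
    print(reply["choices"][0]["message"]["content"][:200])
```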
Prompt engineering is not engineering at all, Lodge argued, in that the “engineer” is just randomly trying new things at the prompt hoping to achieve success. This is more like programming, Lodge joked.

“These are consistent issues across all the models,” Lodge pointed out. The OpenAI research team disclosed these issues in their earliest papers, dating back to 2014, and you see similar text-processing issues with other neural network-based models.
Lodge’s company, DiffBlue, looked at using both GPT-3.5 and GPT-4 for writing unit tests, the company’s specialty, and found them to be less useful than its current AI approach.
Because LLMs are language models, they look for language cues in a set of code. So, for a calculator class with a function called “add,” the model would write a test checking that function’s addition capabilities, which would be problematic if that wasn’t what the code inside the class actually did.
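A hypothetical example makes the point. DiffBlue’s tooling targets Java, but the same trap can be shown in a few lines of Python; the Calculator class and its behavior here are invented purely for illustration.

```python
class Calculator:
    def add(self, a, b):
        # The name says "add", but (by design or by bug) the implementation
        # actually caps the result at a maximum of 100.
        return min(a + b, 100)


def test_add():
    # A test written purely from the language cue in the method name:
    # it assumes plain addition and so fails against the real behavior (100).
    assert Calculator().add(70, 50) == 120
```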
ChatGPT also kicks the can down the road, making subtle errors in code generation that are even harder for the regular programmer to find. It will name identifiers with words that are reserved in the programming language, which leads to compilation errors. It will also reference non-existent functions.
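The reserved-word problem is also easy to illustrate. The snippet below is a Python analogue, not an example from the talk: code that uses the keyword class as a variable name will not even parse.

```python
# "class" is a reserved word in Python, so generated code that uses it as a
# variable name is rejected before it can ever run.
try:
    compile("class = 'Mammal'", "<generated>", "exec")
except SyntaxError as err:
    print("generated code rejected:", err)
```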
LLMs are exciting because they are large enough to support many different languages without knowing those languages specifically. But the downside is that they lose out on accuracy in achieving this generality.
“That’s essentially the trade-off going on with this kind of model,” Lodge said.
Enter Reinforcement Learning
Large language models learn from language, but this is not the way we as humans learn how to do many tasks. We don’t learn to play basketball by reading a book on basketball. Instead, we learn through trial and error, by actually playing basketball.
This sort of learning can be done through a different branch of AI, called reinforcement learning, which is basically a systematic approach to trial and error, Lodge explained. It is essentially the opposite of the LLM approach: while LLMs generalize against a large knowledge base, reinforcement learning proceeds step by step, improving accuracy with each attempt. The knowledge set is smaller but more accurate.
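As a minimal sketch of what “systematic trial and error” means, consider an epsilon-greedy agent choosing among a few actions with hidden payoffs. The rewards and parameters below are invented, and nothing here reflects DiffBlue’s actual system; it only shows how repeated attempts sharpen the agent’s estimates.

```python
import random

true_rewards = {"A": 0.2, "B": 0.8, "C": 0.5}   # hidden payoff of each action
estimates = {a: 0.0 for a in true_rewards}       # the agent's running estimates
counts = {a: 0 for a in true_rewards}

for step in range(1000):
    if random.random() < 0.1:                    # occasionally explore a random action
        action = random.choice(list(true_rewards))
    else:                                        # otherwise exploit the best estimate so far
        action = max(estimates, key=estimates.get)
    reward = 1.0 if random.random() < true_rewards[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # converges toward the true payoffs, with "B" coming out on top
```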
Lodge boasts that, by using this approach, his company DiffBlue can write unit tests for a program within a few seconds, without errors and with the ability to catch regressions. This approach also works very well for code optimization.
This was the approach Google used for AlphaGo, the AI program it built to play the board game Go, which won a match in 2015, he explained. Given the limits of hardware, it would be impossible to play out all the possible moves of Go using a brute-force approach alone. So AlphaGo took a more focused approach: it narrowed the choices it considered to the probabilistically promising ones, those that could conceivably advance the computer’s position in a favorable way. Reinforcement learning is the algorithm that guides this search.
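The narrowing-down step Lodge describes can be sketched in a few lines: rather than expanding every legal move, a learned policy assigns each move a probability and only the top handful are explored further. The policy function below is a random stand-in for the neural network AlphaGo actually used, so this is only an outline of the idea.

```python
import random

def policy(board, moves):
    # Stand-in for a learned prior over moves; a real system would run a neural net.
    weights = [random.random() for _ in moves]
    total = sum(weights)
    return {m: w / total for m, w in zip(moves, weights)}

def promising_moves(board, moves, top_k=5):
    # Keep only the few moves the policy rates most highly, instead of all of them.
    prior = policy(board, moves)
    return sorted(prior, key=prior.get, reverse=True)[:top_k]

legal_moves = [(row, col) for row in range(19) for col in range(19)]
print(promising_moves(board=None, moves=legal_moves))  # a handful of the 361 points, not all of them
```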