OpenAI’s GPT-4 Can Analyze Visual Images, Pass Bar Exam
Earlier this week, OpenAI launched GPT-4, the latest in its series of large language models capable of reading, summarizing, translating and generating text in a way that seems almost human-like.
GPT-4 is intended to be the successor to GPT-3.5, the model that the popular ChatGPT conversational tool is built upon. This time around, though, GPT-4 demonstrates a number of impressive advancements, including multimodal abilities that allow it to generate text when presented with combined image and text inputs, such as suggesting a list of possible recipes when given an image of ingredients.
Notably, GPT-4 also exhibited “human-level performance” on various standardized academic tests like the LSAT, GRE, and various AP exams.
OpenAI’s GPTs are generative pre-trained transformers, a family of language models that are pre-trained on large text-based datasets. These AI models are built on “transformer” deep learning neural networks, which enable them to learn contextual relationships between the words in a text.
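The core mechanism that lets a transformer relate words to one another is scaled dot-product attention. The sketch below is a simplified illustration of that single operation using NumPy, not a representation of GPT-4 itself: each token's "query" is compared against every token's "key", and the resulting weights blend the "values" into context-aware representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: compare each query against all keys,
    softmax the similarity scores, and use them to mix the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable softmax over the keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # context-weighted mix of values, one vector per token

# Toy example: 3 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one context-aware vector per token
```

Real models stack many such attention layers, each with learned projections for Q, K, and V, across billions of parameters.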
OpenAI released the first GPT model back in 2018, followed by GPT-2 in 2019, the staggering 175-billion-parameter GPT-3 in 2020, and the updated GPT-3.5 in 2022. In a blog post and technical paper, OpenAI outlines some of the new improvements in GPT-4, and has also published a system card that describes the model’s limitations.
What’s New in GPT-4
According to the company, GPT-4 leaps forward on a number of key metrics, including improved creativity, the ability to process visual inputs like images, and the capacity to process up to 25,000 words. That’s about eight times more than ChatGPT, and allows GPT-4 to perform extended document analysis, produce longer content, or sustain extended conversations with users.
GPT-4’s enhanced creativity permits better collaboration with users on more complex creative tasks, like writing long-form content. It can even learn a user’s unique writing style, and then mimic that user’s style in its content.
The findings also detail GPT-4’s improved performance over its predecessor on machine learning benchmarks in relatively uncommon languages other than English, like Latvian and Welsh.
Additionally, GPT-4’s ability to process visual imagery could be a potentially powerful feature, allowing users to generate things like automated captions, or as in the example below, offer a number of recipe ideas like pancakes, quiche and more, when given an image of eggs and flour.
GPT-4 is also seemingly able to understand and deconstruct the mechanics of a joke based on visual images, such as the one below. When the user asks, “What is funny about this image? Describe it panel by panel,” the model is able to explain step-by-step why the images can be considered humorous, replying that “The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port.”
This new release also demonstrates better steerability, meaning that the API can be programmed by developers to respond in a particular way. For example, it can be customized to reply like a “Shakespearean pirate” to user prompts, delivering gems like “Ahoy, dear mate… / Turn thine gaze to Box 1 for wages earned / And in Box 2, withholdings of tax discerned” — when asked by a user to help them locate non-qualified plans on the W-2 tax form.
OpenAI believes that such features will give the chatbot greater opportunities for more nuanced implementation. “Rather than the classic ChatGPT personality with a fixed verbosity, tone, and style, developers (and soon ChatGPT users) can now prescribe their AI’s style and task by describing those directions in the ‘system’ message,” explained the OpenAI team. “System messages allow API users to significantly customize their users’ experience within [OpenAI’s usage policies].”
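In practice, steering the model this way means prepending a “system” message to the conversation sent to the API. The sketch below only constructs a request payload in the chat format OpenAI describes; actually sending it requires an API key and the OpenAI client library, which are omitted here.

```python
import json

def build_chat_request(system_style: str, user_prompt: str) -> dict:
    """Build a chat-style request where a 'system' message prescribes
    the assistant's style and task, followed by the user's prompt."""
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": system_style},
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_chat_request(
    "You are a Shakespearean pirate. Answer every question in verse.",
    "Help me find the non-qualified plans on my W-2 form.",
)
print(json.dumps(request, indent=2))
```

Changing only the system message swaps the assistant’s persona without touching the user’s question, which is the steerability the OpenAI team describes.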
These upgrades enabled GPT-4 to perform in the top percentiles on standardized academic tests like the SAT, LSAT, GRE, and the Uniform Bar Exam. The model also did quite well in a number of AP exams, including macroeconomics, microeconomics, US history, and chemistry. Interestingly, however, GPT-4 achieved mediocre scores on the AP English Language and Composition and AP English Literature and Composition exams.
Besides these improvements, OpenAI says that GPT-4 has also been tweaked to make it safer to use than previous versions. According to the report, this latest release is 40% more likely to produce factual responses, and 82% less likely to respond to requests for disallowed content, compared to GPT-3.5.
Like previous models, GPT-4 is prone to risks like generating harmful advice or inaccurate information, but its additional capabilities entail new risks as well. To mitigate these potential risks, the company employed what is called reinforcement learning from human feedback (RLHF), using a team of humans to manually fine-tune the model’s behavior.
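OpenAI has not published the details of its RLHF pipeline, but the standard recipe first trains a reward model on pairs of responses ranked by human labelers. A minimal sketch of the pairwise preference loss commonly used for that step, not OpenAI’s actual implementation:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise RLHF reward-model loss: -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks when the model scores the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already agrees with the human ranking: small loss
low = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
# Reward model prefers the rejected answer: large loss, driving an update
high = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(low, high)
```

The trained reward model then scores the language model’s outputs during a reinforcement learning phase, steering it toward responses humans prefer.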
“To understand the extent of these risks, we engaged over 50 experts from domains such as AI alignment risks, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model,” said OpenAI. “Their findings specifically enabled us to test model behavior in high-risk areas which require expertise to evaluate. Feedback and data from these experts fed into our mitigations and improvements for the model; for example, we’ve collected additional data to improve GPT-4’s ability to refuse requests on how to synthesize dangerous chemicals.”
Hallucinations and Limitations
Nevertheless, OpenAI points out that there remain some limitations to GPT-4, which can result in undesirable outcomes as the technology becomes more widespread.
“GPT-4 has the tendency to ‘hallucinate’, or produce content that is nonsensical or untruthful in relation to certain sources,” said the OpenAI team. “This tendency can be particularly harmful as models become increasingly convincing and believable, leading to over-reliance on them by users.
“Counterintuitively, hallucinations can become more dangerous as models become more truthful, as users build trust in the model when it provides truthful information in areas where they have some familiarity. Additionally, as these models are integrated into society and used to help automate various systems, this tendency to hallucinate is one of the factors that can lead to the degradation of overall information quality and further reduce veracity of and trust in freely available information.”
Other points of criticism come from experts within the AI research field, who are troubled by OpenAI’s unusual choice to not release important technical details about GPT-4, like the actual size of the model, hardware used, training compute, dataset construction and training methods.
“I think we can call it shut on ‘Open’ AI. The 98-page paper introducing GPT-4 proudly declares that they’re disclosing nothing about the contents of their training set,” tweeted Ben Schmidt, the vice president of information design at Nomic AI.
“Every piece of academic work on machine learning datasets has found consistent and problematic ways that training data conditions what the models output,” Schmidt elaborated. “Choices of training data reflect historic biases and can inflict all sorts of harms. To ameliorate those harms, and to make informed decisions about where a model should not be used, we need to know what kinds of biases are built in. OpenAI’s choices make this impossible.”
Despite these issues, Microsoft has confirmed that GPT-4 has already been powering its Bing chat for several months, and the company also recently invested $10 billion in OpenAI. OpenAI has now announced partnerships with other companies like Duolingo, Be My Eyes, Stripe, and Khan Academy to incorporate GPT-4 into their platforms.