So it’s finally here: GPT-4. This is the latest and greatest artificial intelligence system from OpenAI, and a successor to the A.I. model that powers the wildly popular ChatGPT.
OpenAI, the San Francisco A.I. lab that is now closely tied to Microsoft, says that GPT-4 is much more capable than the GPT-3.5 model underpinning the consumer version of ChatGPT. For one thing, GPT-4 is multi-modal: it can take in images as well as text, although it only outputs text. This opens up the ability of the A.I. model to “understand” photos and scenes. (Although for now this visual understanding capability is only being offered through OpenAI's partnership with Be My Eyes, a free mobile app for the visually impaired.)
The new model performs much better than GPT-3.5 on a range of benchmark tests for natural language processing and computer vision algorithms. It also performs very well on a battery of diverse tests designed for humans, including a very impressive score on a simulated bar exam as well as a score of five out of five on a wide range of Advanced Placement exams, from Math to Art History. (Interestingly, the system scores poorly on both the AP English Literature and AP English Composition exams, and there are already some questions from machine learning experts about whether there may be less than meets the eye to GPT-4’s stellar exam performance.)
The model, according to OpenAI, is 40% more likely to return factual answers to questions—although it may still in some cases simply invent information, a phenomenon A.I. researchers call “hallucination.” It is also less likely to jump the guardrails OpenAI has given the model to try to keep it from spewing toxic or biased language, or recommending actions that might cause harm. OpenAI said GPT-4 is more likely to refuse such requests than GPT-3.5 was.
Still, GPT-4 has many of the same potential risks and flaws as other large language models. It isn’t entirely reliable. Its answers are unpredictable. It can be used to produce misinformation. It can still be pushed to jump its guardrails and give outputs that might be unsafe, either because they might be hurtful to the person reading the output or because they might encourage the person to take actions that would harm themselves or others. It can be used, for instance, to help someone find ways to make improvised chemical weapons or explosives from household products.
Because of this, OpenAI cautioned users that “Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.” And yet, OpenAI has released the model as a paid service to ChatGPT Plus customers and businesses purchasing services through its cloud-based application programming interface (or API).
GPT-4’s release had been widely anticipated among those who follow A.I. developments. While ChatGPT took almost everyone by surprise when OpenAI released it in late November, it was widely known for at least a year that OpenAI was working on something called GPT-4, although there has been wild speculation about exactly what it would be. In fact, after ChatGPT became an unexpected viral sensation, massively ramping up hype around A.I., Sam Altman, the CEO of OpenAI, felt it necessary to try to tamp down expectations surrounding GPT-4’s imminent release. “The GPT-4 rumor mill is a ridiculous thing. I don’t know where it all comes from,” Altman said in an interview at an event in San Francisco in January. Referring to the idea of artificial general intelligence (or AGI), the kind of machine superintelligence that has been a staple of science fiction, he said, “people are begging to be disappointed and they will be. The hype is just like... We don’t have an actual AGI and that’s sort of what’s expected of us.”
Yesterday, I talked to several of the OpenAI researchers who helped build GPT-4 about its capabilities, limitations, and how they built it. The researchers spoke in general terms about the methods they used, but there is much about GPT-4 they are keeping under wraps, including the size of the model, exactly what data was used to train it, how many specialized computer chips (graphics processing units, or GPUs) were needed to train and run it, what its carbon footprint is, and more.
OpenAI was co-founded by Elon Musk, who has said he chose the name because he wanted the new research lab to be dedicated to democratizing A.I. and being transparent, publishing all its research. Over the years, OpenAI has increasingly moved away from its founding dedication to transparency, and with little detail about GPT-4 being released, some computer scientists quipped that the lab should change its name. “I think we can call it shut on 'Open' AI,” tweeted Ben Schmidt, the vice president of design at a company called Nomic AI. “The 98 page paper introducing GPT-4 proudly declares that they're disclosing *nothing* about the contents of their training set.”
Ilya Sutskever, OpenAI’s chief scientist, told Fortune that this secrecy was primarily because “it is simply a competitive environment” and the company did not want commercial rivals to quickly replicate its achievement. He also said that in the future, as A.I. models became even more capable and “those capabilities could be easily very harmful,” it will be important for safety reasons to limit information about how the models were created.
At times, Sutskever spoke of GPT-4 in terms that seemed designed to sidestep serious discussion of its inner workings. He described a "recipe for producing magic" when discussing the high-level process of creating generative pre-trained transformers, or GPTs, the basic model architecture that underpins most large language models. "GPT-4 is the latest manifestation of this magic," Sutskever said. In response to a question about how OpenAI had managed to reduce GPT-4’s tendency to hallucinate, Sutskever said, “We just teach it not to hallucinate.”
Six months of fine-tuning for safety and ease of use
Two of Sutskever’s OpenAI colleagues did provide slightly more detail on how OpenAI "just taught it not to hallucinate." Jakub Pachocki, a member of OpenAI's technical staff, said the model's increased size alone, and the larger amount of data it ingested during pre-training, seemed to be part of the reason for its increased accuracy. Ryan Lowe, who co-leads OpenAI’s team that works on “alignment,” or making sure A.I. systems do what humans want them to and don’t do things we don’t want them to do, said that OpenAI also spent about six months after pre-training GPT-4 fine-tuning the model to be both safer and easier to use. One method it used, he said, was to collect human feedback on GPT-4’s outputs and then use that feedback to push the model towards generating responses that it predicted were more likely to get positive feedback from these human reviewers. This process, called “reinforcement learning from human feedback,” was part of what made ChatGPT such an engaging and useful chatbot.
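For readers who want a feel for the mechanics, here is a toy sketch of the two-step idea Lowe describes: learn a reward signal from human preferences, then nudge the model toward answers that score well. It is not OpenAI's code. The four candidate answers, their two hand-made features, the preference pairs, and every function name are assumptions invented for illustration, and production systems use large neural networks rather than a two-number linear model.

```python
import math

# Hypothetical candidate answers to a single prompt, each described by two
# toy features: (is_factual, is_polite). All values are made up.
CANDIDATES = {
    "accurate and polite": (1.0, 1.0),
    "accurate but curt": (1.0, 0.0),
    "made-up but polite": (0.0, 1.0),
    "made-up and curt": (0.0, 0.0),
}

# Simulated human feedback as (preferred, rejected) pairs.
PREFERENCES = [
    ("accurate and polite", "made-up but polite"),
    ("accurate but curt", "made-up and curt"),
    ("accurate and polite", "accurate but curt"),
    ("accurate but curt", "made-up but polite"),
]


def reward(weights, name):
    """Linear reward: weighted sum of a candidate's features."""
    return sum(w * x for w, x in zip(weights, CANDIDATES[name]))


def train_reward_model(prefs, steps=500, lr=0.1):
    """Fit weights so preferred answers score higher (pairwise logistic loss)."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for good, bad in prefs:
            # Probability the current reward model gives to the human's choice.
            p = 1.0 / (1.0 + math.exp(reward(weights, bad) - reward(weights, good)))
            for i in range(len(weights)):
                diff = CANDIDATES[good][i] - CANDIDATES[bad][i]
                weights[i] += lr * (1.0 - p) * diff  # gradient ascent on log p
    return weights


def improve_policy(weights, steps=200, lr=0.5):
    """Shift a softmax policy over the candidates toward higher predicted reward."""
    names = list(CANDIDATES)
    logits = {n: 0.0 for n in names}  # start from a uniform policy
    for _ in range(steps):
        z = sum(math.exp(v) for v in logits.values())
        probs = {n: math.exp(logits[n]) / z for n in names}
        expected = sum(probs[n] * reward(weights, n) for n in names)
        for n in names:
            # Exact gradient of expected reward with respect to each logit.
            logits[n] += lr * probs[n] * (reward(weights, n) - expected)
    z = sum(math.exp(v) for v in logits.values())
    return {n: math.exp(logits[n]) / z for n in names}


weights = train_reward_model(PREFERENCES)
policy = improve_policy(weights)
print(max(policy, key=policy.get))
```

In this toy setup the simulated reviewers consistently favor accurate answers, so the learned reward weights factuality heavily and the updated policy concentrates on the accurate, polite answer.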
Lowe said some of the feedback used to refine GPT-4 came from the experience of ChatGPT users, showing the way in which getting that chatbot out into the hands of hundreds of millions of people before many competitors debuted rival systems may have created a faster-spinning “data flywheel” for OpenAI that gives the company an advantage in building future, advanced A.I. software that its rivals may find hard to match.
OpenAI specifically trained GPT-4 on more examples of accurate question-answering in order to boost the model’s ability to perform that task, and reduce the chances of it hallucinating, Lowe said. He also said that OpenAI used GPT-4 itself to generate simulated conversations and other data that was then fed back into the fine-tuning of GPT-4 to help it hallucinate less. This is another example of the “data flywheel” in action.
Is the "magic" reliable enough for release?
Sutskever defended OpenAI’s decision to release GPT-4, despite its limitations and risks. “The model is flawed, ok, but how flawed?" he said. "There are some safety mitigations that exist on the model right now," he said, explaining that OpenAI judged these guardrails and safety measures to be effective enough to allow the company to release the model. He also noted that OpenAI’s terms and conditions of use prohibited certain malicious uses and that the company now had monitoring procedures in place to try to check that users were not violating those terms. He said this, in combination with GPT-4’s better safety profile on key metrics like hallucinations and the ease with which it could be “jailbroken” or made to bypass guardrails, “made us feel that it is appropriate to proceed with the GPT-4 release, as we’re doing right now.”
In a demonstration for Fortune, OpenAI researchers asked the system to summarize an article about itself, but using only words that start with the letter ‘G’—which GPT-4 was able to do relatively coherently. Sutskever said that GPT-3.5 would have flubbed the task, resorting to some words that did not start with ‘G.’ In another example, GPT-4 was presented with part of the U.S. tax code and then given a scenario about a specific couple and asked to calculate how much tax they owed, with reference to the passage of regulations it had just been given. GPT-4 seemingly came up with the right amount of tax in about a second. (Although I was not able to go back through and double-check its answer.)
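A constraint like the ‘G’ summary is easy to check mechanically, which is one way a reader could verify such a demonstration rather than take it on faith. The short sketch below lists any words that break the rule. The function name and the sample sentences are made up for illustration and are not GPT-4 output.

```python
import re


def words_not_starting_with(text, letter="g"):
    """Return the words in `text` that do not begin with `letter` (case-insensitive)."""
    # A word here is a run of letters, optionally containing apostrophes or hyphens.
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    return [w for w in words if not w.lower().startswith(letter.lower())]


# Made-up sample sentences, not model output.
print(words_not_starting_with("GPT-4 generates great, grounded guidance."))
print(words_not_starting_with("GPT-4 gives good answers."))
```

An empty list means every word passed the check. Any words that are returned are the slips.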
Despite impressive demonstrations, some A.I. researchers and technologists say that systems like GPT-4 are still not reliable enough for many enterprise use cases, particularly when it comes to information retrieval, because of the chance of hallucination. In cases where a human is asking it a question to which that user doesn’t know the answer, GPT-4 is still probably not appropriate. “Even if the hallucination rate goes down, until it is infinitesimal, or at least as small as would be the case with an expert human analyst, it is probably not appropriate to use it,” said Aaron Kalb, co-founder and chief strategy officer at Alation, a software company that builds data cataloging and retrieval software.
He also said that even prompting the model to answer only from a particular set of data or only using the model to summarize information surfaced through a traditional search algorithm might not be sufficient to be certain the model wasn’t making up some part of its answer or surfacing inaccurate or outdated information that it had ingested during its pre-training.
Kalb said whether it was appropriate to use large language models would depend on the use case and whether it was practical for a human to review the A.I.’s answers. He said that asking GPT-4 to generate marketing copy, in cases where that copy is going to be reviewed and edited by a human, was probably fine. But in situations where it wasn’t possible for a human to fact-check everything the model produced, relying on GPT-4’s answers might be dangerous.