Wow, what a week: perhaps the most eventful week in A.I. (at least in terms of sheer volume of announcements) that I can remember in the seven years I’ve been writing about this topic.
- Google and Microsoft each began pushing generative A.I. capabilities into their rival office productivity software.
- Chinese search giant Baidu launched its Ernie Bot, a large language model-based chatbot that can converse in both Chinese and English, only to see its stock get hammered because the company used a pre-recorded demo in the launch presentation. (Baidu, in a defensive statement emailed to me yesterday, implied it was the victim of an unfair double standard: Microsoft and Google also used pre-recorded demos when they unveiled their search chatbots. And while Google’s stock did take a hit for an error its Bard chatbot made, no one seemed upset that the demos weren’t live.)
- Midjourney released the fifth generation of its text-to-image generation software, which can produce very professional-looking, photorealistic images. And Runway, one of the companies that helped create Midjourney’s open-source competitor Stable Diffusion, released Gen-2, which creates very short videos from scratch based on a text prompt.
- And just as I was preparing this newsletter, Google announced it is publicly releasing Bard, its A.I.-powered chatbot with internet search capabilities. Google unveiled Bard, its answer to Microsoft’s Bing chat, a few weeks ago, but until now it was available only to employees. A limited number of public users in the U.S. and the U.K. will now be able to try the chatbot.
But let's focus on what was by far the most widely anticipated news of the past week: OpenAI’s unveiling of GPT-4, a successor to GPT-3.5, the large language model that underpins ChatGPT. The new model is also multimodal, meaning it can take images as well as text as input and reason about what it sees. In a clever demonstration, OpenAI cofounder and president Greg Brockman drew a very rough sketch of a website homepage on a piece of paper, uploaded it to GPT-4, and asked it to write the code needed to build that website. It did.
A couple of key points to note, though: There’s a great deal about GPT-4 that we don’t know because OpenAI has revealed almost nothing about how large a model it is, what data it was trained on, how many specialized computer chips (known as graphics processing units, or GPUs) it took to train, or what its carbon footprint might be. OpenAI has said it is keeping all these details secret for both competitive reasons and what it says are safety concerns. (In an interview, OpenAI’s chief scientist Ilya Sutskever told me it was primarily competitive concerns that had made the company decide to say so little about how it built GPT-4.)
Because we know almost nothing about how it was trained and built, there have been a number of questions about how to interpret some of the headline-grabbing performance figures for GPT-4 that OpenAI did publish. In particular, the stellar performance GPT-4 turned in on computer programming questions from Codeforces’ coding contests has been called into question. Since GPT-4 was trained on so much data, some believe there’s a decent chance it was trained on some of the exact same coding questions it was tested on. If that's the case, GPT-4 may simply have shown that it's good at memorizing answers rather than at actually answering never-before-seen questions. The same data “contamination” issue might apply to GPT-4’s performance on other tests too. (And, as many have pointed out, just because GPT-4 can pass the bar exam with flying colors doesn’t mean it is about to be able to practice law as well as a human.)
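If you're wondering what checking for that kind of contamination even looks like, here is a minimal sketch, assuming nothing about OpenAI's actual pipeline: it simply flags any benchmark question that shares a long verbatim overlap with the training text. Real decontamination checks are more sophisticated, and all of the names and toy data below are mine, purely for illustration.

```python
# Hypothetical illustration of a crude train/test "contamination" check.
# It flags any benchmark question that shares a long exact character
# overlap with the training text. This is a stand-in for the kinds of
# checks labs run; it does not reflect OpenAI's actual method.

def has_overlap(question: str, training_text: str, window: int = 50) -> bool:
    """Return True if any `window`-character slice of the (normalized)
    question appears verbatim in the (normalized) training text."""
    q = " ".join(question.split()).lower()
    t = " ".join(training_text.split()).lower()
    if len(q) <= window:
        return q in t
    return any(q[i:i + window] in t for i in range(len(q) - window + 1))

# Toy data standing in for a coding-contest question and a training corpus.
training_corpus = (
    "... given an array of n integers, find the longest strictly "
    "increasing subsequence ..."
)
benchmark_questions = [
    "Given an array of n integers, find the longest strictly increasing subsequence.",
    "Compute the minimum number of coins needed to make change for amount k.",
]

for q in benchmark_questions:
    flag = "POSSIBLY CONTAMINATED" if has_overlap(q, training_corpus) else "clean"
    print(f"{flag}: {q[:60]}")
```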
Another thing about GPT-4: Although we don’t know how many GPUs it takes to run, the answer is probably a heck of a lot. One indication of this is the way that OpenAI is having to throttle usage of GPT-4 through ChatGPT Plus. “GPT-4 currently has a cap of 25 messages every 3 hours. Expect significantly lower caps, as we adjust for demand,” reads the disclaimer that greets those who want to chat with GPT-4. Lack of GPU capacity may become a serious challenge to how quickly generative A.I. is adopted by businesses. The Information reported that teams within Microsoft that wanted to use GPUs for various research efforts were being told they would need special approval since the bulk of the company’s vast GPU capacity across its datacenters was now going to support new generative A.I. features in Bing and its first Office customers, as well as all of the Azure customers using OpenAI’s models. Charles Lamanna, Microsoft’s corporate vice president for business applications and low code platforms, told me that “there’s not infinite GPUs and if everybody uses it for every event, every team's meeting, there's probably not enough, right?” He told me Microsoft was prioritizing GPUs for areas that had the highest impact and “highest confidence of a return for our customers.” Look for discussions about limited GPU capacity holding back the implementation of generative A.I. in business to become more prevalent in the weeks and months ahead.
Most importantly, GPT-4, like all large language models, still has a hallucination problem. OpenAI says GPT-4 is 40% less likely to make things up than GPT-3.5, the model behind ChatGPT, but the problem still exists, and in some ways it may even be more dangerous: because GPT-4 hallucinates less often, humans may be more likely to be caught off guard when it does. So the other term you are going to start hearing a lot more about is “grounding”: how do you make sure the output of a large language model is rooted in specific, verified data you’ve fed it, and not something it has simply invented or drawn from its pretraining data?
Microsoft made a big deal about how its “Copilot” system, which underpins its deployment of GPT models into its Office and Power Platform applications, goes through a number of steps to make sure the large language model’s output is grounded in the data the user gives it. These steps are applied both to the input sent to the LLM and to the output it generates.
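Microsoft hasn't detailed what those steps are, but the input side of the idea is easy to sketch. Here is a minimal, hypothetical illustration (mine, not Microsoft's code, and assuming nothing about how Copilot actually formats its prompts): the user's own data gets wrapped in an instruction that confines the model to that data before anything is sent to the model.

```python
# Hypothetical sketch of input-side grounding: wrap the user's own data
# and question in an instruction that confines the model to that data.
# This is my illustration, not Microsoft's Copilot code.

def build_grounded_prompt(user_data: str, user_question: str) -> str:
    return (
        "You are an assistant answering questions for a business user.\n"
        "Use ONLY the reference data between the markers below.\n"
        "If the answer is not in the reference data, reply exactly: "
        "\"NOT FOUND IN PROVIDED DATA\".\n"
        "--- REFERENCE DATA START ---\n"
        f"{user_data}\n"
        "--- REFERENCE DATA END ---\n\n"
        f"Question: {user_question}\n"
    )

# Example: a snippet of spreadsheet data a user might be working on.
data = "Q4 revenue: $1.2M\nQ4 expenses: $0.9M\nHeadcount: 14"
print(build_grounded_prompt(data, "What was our Q4 profit?"))
```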
Arijit Sengupta, the cofounder and CEO of machine learning platform Aible, reached out to me to point out that even with a 40% improvement in accuracy, GPT-4 is still, according to the “technical report” OpenAI released, inaccurate between 20% and 25% of the time. “That means you can never use it in the enterprise,” Sengupta says. At least not on its own. Aible, he says, has developed methods for ensuring that large language models can be used in situations where the output absolutely has to be grounded in accurate data. The system, which Aible calls the Business Nervous System, sounds like it functions similarly to what Microsoft has tried to do with Copilot.
Aible’s system starts by using meta-prompts to instruct the large language model to reference only a particular dataset in producing its answer. Sengupta compares this to giving a cook a recipe for how to bake a cake. Next, it uses more standard semantic parsing and information retrieval algorithms to check that all the factual claims the large language model makes are actually found within the dataset it was supposed to reference. In cases where it cannot find the model’s output in the dataset, it prompts the model to try again, and if it still fails (which Sengupta says happens in about 5% of cases in Aible’s experience so far) it flags that output as a failure case so that a customer knows not to rely on it. He says this is much better than a situation where you know the model is wrong 25% of the time but you don’t know which 25%. Expect to hear a lot more about “grounding” in the weeks and months ahead too.
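To make the shape of that check-and-retry loop concrete, here is a minimal sketch along the lines Sengupta describes. Everything here is simplified and hypothetical: `call_llm` stands in for whatever model API you would actually use, and the "verification" is a naive string-containment check rather than the semantic parsing and retrieval Aible relies on.

```python
# Hypothetical sketch of an output-side grounding loop: generate an
# answer, check that its factual claims can be found in the reference
# dataset, retry if not, and flag the result as a failure case if the
# retries also fail. Aible's real system uses semantic parsing and
# information retrieval; the containment check below is a crude stand-in.

from typing import Callable

def claim_is_supported(claim: str, dataset: str) -> bool:
    """Naive check: is the normalized claim present verbatim in the dataset?"""
    return " ".join(claim.lower().split()) in " ".join(dataset.lower().split())

def grounded_answer(
    question: str,
    dataset: str,
    call_llm: Callable[[str], str],   # stand-in for a real model API call
    max_retries: int = 1,
) -> dict:
    prompt = (
        "Answer using ONLY the data below. State one factual claim per line.\n"
        f"DATA:\n{dataset}\n\nQUESTION: {question}\n"
    )
    for _ in range(max_retries + 1):
        answer = call_llm(prompt)
        claims = [line.strip() for line in answer.splitlines() if line.strip()]
        unsupported = [c for c in claims if not claim_is_supported(c, dataset)]
        if not unsupported:
            return {"answer": answer, "grounded": True}
        # Ask the model to try again, pointing out what it couldn't support.
        prompt += (
            "\nThe following claims were not found in the data; try again "
            "using only the data provided:\n" + "\n".join(unsupported)
        )
    # Retries exhausted: flag as a failure case so the user knows not to rely on it.
    return {"answer": answer, "grounded": False, "unsupported_claims": unsupported}
```

The design point worth noticing is that a failed check surfaces as an explicit flag rather than as a silently wrong answer, which is what turns "wrong 25% of the time, and you don't know which 25%" into a much smaller, known set of cases a customer can simply decline to trust.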
And with that, here’s the rest of this week’s news in A.I.
Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com