Hello and welcome to Eye on AI. In this edition…OpenAI’s o3 model blows the AI industry’s collective mind; Trump AI policy advisor becomes a flashpoint for internecine battle over immigration policy; DoE confirms AI’s big impact on electricity demand; some predictions for 2025.
The year drew to a close with a mic drop from OpenAI. On Dec. 20, the company unveiled a new AI model called o3 that showed off-the-charts performance on a series of benchmark tests, including one specifically designed to gauge whether AI models are capable of human-like abstract reasoning and generalization. The new o3 model scored 75.7% on this ARC-AGI benchmark, when restricted to less than $10,000 in computing expense, and 87.5% with an unrestricted compute budget. OpenAI’s relatively capable GPT-4o model had scored just 5% on the same test.
The result led some AI enthusiasts to wonder out loud whether OpenAI had just achieved the field’s long-sought Holy Grail, artificial general intelligence (or AGI)—which OpenAI defines as a single software system able to perform most economically valuable cognitive tasks as well as or better than a human.
Meanwhile, some deep learning skeptics, most notably Gary Marcus, blasted the o3 result as wildly misleading. They argued that OpenAI had specifically trained o3 to do well on the ARC-AGI benchmark (even though the benchmark was designed to make this sort of “training to the test” difficult); that the ARC-AGI benchmark was a poor judge of how well a model will perform on more open-ended and ambiguous real-world tasks; and that the high cost of o3 would make it economically unfeasible to use for most practical applications.
o3 still can’t do some things humans find easy
Francois Chollet, the former Google AI researcher who designed the ARC-AGI benchmark and ran the tests of o3 for OpenAI, acknowledged that his benchmark was not designed to be the definitive marker of whether AGI has been achieved. But it was supposed to be a yardstick for the kind of learning efficiency and conceptual extrapolation that humans exhibit, abilities that might indicate a system was getting closer to AGI. Chollet and Marcus both agreed, however, that o3’s average performance score on ARC-AGI could give people a false impression of the model’s capabilities: While the model scored well on the ARC-AGI test overall, it failed some visual reasoning tasks that humans find fairly trivial to solve.
Software suddenly has a marginal cost
Chollet made a critical point about o3: It upends a fundamental truism of the software business, namely that the marginal cost of software trends toward zero. With earlier kinds of software, once the program is coded or once the AI model is trained, the cost of deploying additional copies of that software is essentially nothing. But o3 works very differently. Like other new “reasoning models,” it produces better results the more computing power it uses at the point of inference (i.e., when it is asked to perform a task). This means the marginal cost of running additional copies does not trend toward zero. (The cost of getting o3 to score 87.5% on ARC-AGI was not revealed, but it was estimated at hundreds of thousands of dollars.) This will change how AI companies think about pricing their AI models and, more importantly, radically alter the way companies buying AI like o3 will need to budget.
AGI will come sooner, but matter less
o3 also means that OpenAI CEO Sam Altman is probably correct when he predicts that “we will hit AGI much sooner than most people in the world think and it will matter much less.” When I first heard Altman say this, I was skeptical. I thought it was simply a way of moving the goalposts on AGI, while also trying to avoid regulation. But, now, I think Altman is on to something. At the current rate of progress, it is quite possible a successor model to o3 will, within the next year or two, do well enough across every tough benchmark we can conceive that its creators will be able to claim it is as good as or better than humans at every cognitive task.
But, at the same time, the cost of running such a model may be so high that, as Marcus argues, it won’t be economically feasible to use it for most real-world applications, at least not initially. While those costs will no doubt come down rapidly, as they have for LLMs, they may not drop far enough or fast enough to make AGI adoption seem anything other than gradual. In many cases, it will remain easier and more economically efficient for businesses to keep employing people to do tasks.
The pace of AI adoption is already lagging the pace of AI model development, and this seems unlikely to change. AI outputs still differ enough from what we expect from humans, and remain unreliable and inconsistent enough, that in most cases we’ll need to keep at least some humans in the loop to double-check AI outputs. Changes to workflows and jobs will occur gradually, more like a creeping tide of automation than an exploding bomb. This is good news, as it should give us all more time to adapt.
With that, here’s more AI news.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn