OpenAI has announced a much-anticipated new family of AI models that can solve difficult reasoning and math questions better than previous large language models. On Thursday, it launched a “preview” version of two of these models, called o1-preview and o1-mini, to some of its paying users.
AI with improved reasoning and math skills could help chemists, physicists, and engineers work out answers to complex problems, which might help them create new products. It could also help investors calculate options trading strategies or financial planners work through how to construct specific portfolios that better trade off risks and rewards.
Better reasoning, planning, and problem-solving skills are also essential as tech companies try to build AI agents that can perform sophisticated tasks, such as writing entire computer programs or finding information on the web, importing it into a spreadsheet, analyzing that data, and writing a report summarizing its findings.
OpenAI published impressive benchmark results for the o1 models—which had been given the internal codename "Strawberry" prior to their release. On questions from the AIME mathematics competition, which is geared towards challenging high school students, o1 got 83.3% of the questions correct compared to just 13.4% for GPT-4o. On a different assessment, o1 answered 78% of PhD-level science questions accurately, compared to 56.1% for GPT-4o and 69.7% for human experts.
The o1 model is also significantly less likely to hallucinate—or to confidently provide plausible but inaccurate answers—than the company’s previous models, according to test results published by OpenAI. It is also harder to “jailbreak,” or prompt into bypassing the safety guardrails the company has tried to get the model to adhere to in its responses.
In tests users have conducted in the hours since o1-preview became widely available, the model does seem able to correctly answer many questions that befuddled previous models, including OpenAI’s most powerful, GPT-4 and GPT-4o.
But o1-preview is still tripped up by some riddles, and in OpenAI’s own assessments it sometimes failed at seemingly simple tasks, such as tic-tac-toe (although in my own experiments, o1-preview’s tic-tac-toe play was much improved over GPT-4o’s). This may indicate significant limits to the “reasoning” o1 exhibits. And when it came to language tasks, such as writing and editing, the human evaluators OpenAI employed tended to find GPT-4o’s responses preferable to those of the o1 models.
The o1 model also takes significantly longer to produce its responses than GPT-4o. In tests OpenAI published, its o1-preview model could take more than 30 seconds to answer a question that its GPT-4o model answered in three.
The o1 models are also not yet fully integrated into ChatGPT. A user needs to decide whether they want their prompt handled by o1-preview or by GPT-4o; the model itself cannot decide whether a question requires the slower, step-by-step reasoning process o1 affords or whether GPT-4, or even GPT-3, will suffice. In addition, the o1 model works only on text and, unlike other AI models, cannot handle image, audio, or video inputs or outputs.
OpenAI has made its o1-preview and o1-mini models available to all subscribers to its premium ChatGPT Plus and ChatGPT Team products, as well as to the top tier of developers who use its enterprise-focused application programming interface (API).
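For developers on that API tier, choosing o1 is an explicit, per-request decision rather than something the system figures out on its own. Here is a minimal sketch of what that looks like, assuming the official openai Python SDK and an account that has been granted o1 access; the prompt is just a placeholder:

```python
# Minimal sketch: the developer, not ChatGPT, picks which model handles a prompt.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

prompt = "Work through this multi-step math word problem..."  # placeholder prompt

# Route the prompt to the slower, reasoning-focused model...
o1_reply = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": prompt}],
)

# ...or to the faster, general-purpose model.
gpt4o_reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

print(o1_reply.choices[0].message.content)
print(gpt4o_reply.choices[0].message.content)
```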
Here are 9 things to know about the o1 models:
1. This is not AGI. The stated mission of OpenAI, Google DeepMind, more recently Meta, and a few other AI startups, such as Anthropic, is the achievement of artificial general intelligence. That is usually defined as a single AI system that can perform cognitive tasks as well as or better than humans. While o1-preview is much more capable at reasoning tasks, its limitations and failures still show that the system is far from the kind of intelligence humans exhibit.
2. o1 puts pressure on Google, Meta, and others to respond, but is unlikely to significantly alter the competitive landscape. At a time when foundation model capabilities had been looking increasingly commoditized, o1 gives OpenAI a temporary advantage over its rivals. But this is likely to be very short-lived. Google has publicly stated it's working on models that, like o1, offer advanced reasoning and planning capabilities. Its Google DeepMind research unit has some of the world’s top experts in reinforcement learning, one of the methods that we know has been used to train o1. It's likely that o1 will compel Google to accelerate its timelines for releasing these models. Meta and Anthropic also have the expertise and resources to quickly create models that match o1’s capabilities and they will likely roll these out in the coming months too.
3. We don’t know exactly how o1 works. While OpenAI has published a lot of information about o1’s performance, it has said relatively little about exactly how o1 works or what it was trained on. We know that the model combines several different AI techniques. We know that it uses a large language model that performs “chain of thought” reasoning, where the model must work out an answer through a series of sequential steps. We also know that the model uses reinforcement learning, where an AI system discovers successful strategies for performing a task through a process of trial and error.
Some of the errors both OpenAI and users have documented so far with o1-preview are telling: They seem to indicate that the model searches through several different “chain of thought” pathways that an LLM generates and then picks the one that seems most likely to be judged correct by the user. The model also seems to perform steps in which it may check its own answers to reduce hallucinations and to enforce AI safety guardrails. But we don’t really know. We also don’t know what data OpenAI used to train o1.
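One loose way to picture that kind of search, purely as an illustration and not OpenAI’s disclosed method, is best-of-N sampling: generate several independent chains of thought and keep the answer a verifier scores highest. The generate and score_answer functions below are hypothetical stand-ins for an LLM sampling call and a learned verifier:

```python
# Hypothetical illustration only: best-of-N search over chain-of-thought candidates.
# generate() and score_answer() are stand-ins for an LLM sampling call and a verifier;
# OpenAI has not disclosed how o1 actually performs its search or self-checking.
import random

def generate(prompt: str) -> dict:
    """Stand-in for sampling one chain of thought plus a final answer from an LLM."""
    return {
        "reasoning": f"sampled step-by-step reasoning for: {prompt}",
        "answer": f"candidate answer #{random.randint(1, 5)}",
    }

def score_answer(candidate: dict) -> float:
    """Stand-in for a verifier estimating how likely the candidate is to be correct."""
    return random.random()

def answer_with_search(prompt: str, n_candidates: int = 8) -> str:
    # Sample several independent reasoning paths, then keep the highest-scoring one.
    candidates = [generate(prompt) for _ in range(n_candidates)]
    return max(candidates, key=score_answer)["answer"]

print(answer_with_search("How many r's are in 'strawberry'?"))
```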
4. Using o1-preview won't be cheap. While ChatGPT Plus users are currently getting access to o1-preview at no additional cost beyond their $20 monthly subscription fee, their usage is capped at a certain number of queries per day. Corporate customers typically pay to use OpenAI's models based on the number of tokens—which are words or parts of words—that a large language model uses in generating an answer. For o1-preview, OpenAI has said it is charging these customers $15 per 1 million input tokens and $60 per 1 million output tokens. That compares to $5 per 1 million input tokens and $15 per 1 million output tokens for GPT-4o, OpenAI's most powerful general-purpose LLM.
What's more, the chain of thought reasoning o1 engages in requires the LLM portion of the model to generate many more tokens than a straightforward LLM answer. That means o1 may be even more expensive to use than those headline comparisons to GPT-4o imply. In reality, companies will likely be reluctant to use o1 except in rare circumstances when the model’s additional reasoning abilities are essential and the use case can justify the added expense.
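To see why, here is some back-of-the-envelope arithmetic using the published prices. The token counts are invented for illustration, and the sketch assumes o1’s extra reasoning tokens are billed as output tokens:

```python
# Back-of-the-envelope comparison using the published per-token prices.
# The token counts are assumptions for illustration, not measurements, and the
# sketch assumes o1's extra reasoning tokens are billed as output tokens.
PRICES_PER_MILLION = {  # (input, output) in dollars
    "gpt-4o": (5.00, 15.00),
    "o1-preview": (15.00, 60.00),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assume a 1,000-token prompt. GPT-4o writes a 500-token answer; o1-preview
# generates a hypothetical 4,000 reasoning tokens on top of a 500-token answer.
print(f"GPT-4o:     ${query_cost('gpt-4o', 1_000, 500):.4f}")       # about $0.0125
print(f"o1-preview: ${query_cost('o1-preview', 1_000, 4_500):.4f}") # about $0.2850
```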
5. Customers may balk at OpenAI’s decision to hide o1’s “chain of thought.” While OpenAI said that o1’s chain of thought reasoning allows its own engineers to better assess the quality of the model’s answers and potentially debug the model, it has decided not to let users see the chain of thought. It has done so for what it says are both safety and competitive reasons. Revealing the chain of thought might help people figure out ways to better jailbreak the model. But more importantly, letting users see the chain of thought would allow competitors to potentially use that data to train their own AI models to mimic o1’s responses.
Hiding the chain of thought, however, might present issues for OpenAI’s enterprise customers, who might be in the position of having to pay for tokens without a way to verify that OpenAI is billing them accurately. Customers might also object to being unable to use the chain of thought outputs to refine their prompting strategies to be more efficient, improve results, or avoid errors.
6. OpenAI says its o1 shows new “scaling laws” that apply to inference, not just training. AI researchers have been discussing the new set of “scaling laws” OpenAI published alongside o1, which seem to show a direct correlation between the amount of time o1 is allowed to spend “thinking” about a question—searching possible answers and logic strategies—and its overall accuracy. The longer o1 had to produce an answer, the more accurate its answers became.
Before, the paradigm was that model size, in terms of the number of parameters, and the amount of data a model was fed during training essentially determined performance. More parameters equaled better performance, or similar performance could be achieved with a smaller model trained for longer on more data. But once trained, the idea was to run inference—when a trained model produces an answer to a specific input—as quickly as possible.
The new o1 “scaling laws” upend this logic, indicating that with models designed like o1, there is an advantage to applying additional computing resources at inference time too. The more time the model is given to search for the best possible answer, the more likely it is to arrive at an accurate result.
This has implications for how much computing power companies will need to secure if they want to take advantage of the reasoning abilities of models like o1 and for how much it will cost, in both energy and money, to run these models. It points to the need to run models for longer, potentially using much more inference compute, than before.
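A toy simulation conveys the intuition, though it is only a cartoon of the idea and not OpenAI’s published curve: if each independently sampled reasoning path is right, say, 60% of the time, spending compute on more samples and taking a majority vote pushes accuracy up.

```python
# Toy illustration of inference-time scaling: spending more compute on extra
# samples (here, via majority voting) raises accuracy. This is a cartoon of the
# intuition, not a reproduction of OpenAI's o1 results.
import random

def majority_vote_accuracy(p_single: float, n_samples: int, trials: int = 20_000) -> float:
    correct = 0
    for _ in range(trials):
        votes = sum(random.random() < p_single for _ in range(n_samples))
        if votes > n_samples / 2:
            correct += 1
    return correct / trials

for n in (1, 3, 9, 27):  # more samples stand in for more "thinking" time
    print(f"{n:>2} samples: accuracy ≈ {majority_vote_accuracy(0.6, n):.2f}")
```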
7. o1 could help create powerful AI agents—but such agents carry some risks. In a video, OpenAI spotlighted its work with AI startup Cognition, which got early access to o1 and used it to help augment the capabilities of its coding assistant Devin. In the example in the video, Cognition CEO Scott Wu asked Devin to create a system to analyze the sentiment of posts on social media using some off-the-shelf machine learning tools. When it couldn’t read the posts correctly from a web browser, Devin, using o1’s reasoning abilities, found a workaround by accessing the content directly from the social media company’s API.
This was a great example of autonomous problem-solving. But it also is a little bit scary. Devin didn’t come back and ask the user if it was okay to solve the problem in this way. It just did it. In its safety report on o1, OpenAI itself said it found instances where the model engaged in “reward hacking”—which is essentially when a model cheats, finding a way to achieve a goal that is not what the user intended. In one cybersecurity exercise, o1 failed in its initial efforts to gain network information from a particular target—which was the point of the exercise—but found a way to get the same information from elsewhere on the network.
This would seem to indicate that o1 could power a class of very capable AI agents, but that companies will need to figure out how to ensure those agents don’t take unintended actions in the pursuit of goals that could pose ethical, legal, or financial risks.
8. OpenAI says o1 is safer in many ways, but presents a “medium risk” of assisting a biological attack. OpenAI published the results of numerous tests that indicate that in many ways o1 is a safer model than its earlier GPT models. It's harder to jailbreak and less likely to produce toxic, biased, or discriminatory answers. Interestingly, despite improved coding abilities, OpenAI said that in its evaluations neither o1 nor o1-mini presented a significantly enhanced risk of helping someone carry out a sophisticated cyberattack compared to GPT-4.
But AI Safety and national security experts were buzzing last night about several aspects of OpenAI’s safety evaluations. The one that created the most alarm was OpenAI’s decision to classify its own model as presenting a “medium risk” of aiding a person in taking the steps needed to carry out a biological attack.
OpenAI has said it will only release models that it classifies as presenting a “medium risk” or less, so many researchers are scrutinizing the information OpenAI has published about its process for making this determination to see if it seems reasonable or whether OpenAI graded itself too leniently in order to still be able to release the model.
9. AI Safety experts are worried about o1 for other reasons too. OpenAI also graded o1 as presenting a “medium risk” on a category of dangers the company called “persuasion,” which judges how easily the model can convince people to change their views or take actions recommended by the model. This persuasive power could be dangerous in the wrong hands. It would also be dangerous if some future powerful AI model developed intentions of its own and could then persuade people to carry out tasks and actions on its behalf. At least that danger doesn’t seem too imminent, though. In safety evaluations by both OpenAI and external “red teaming” organizations it hired to evaluate o1, the model did not show any indication of consciousness, sentience, or self-volition. (The evaluations did, however, find that o1 gave answers that seemed to imply greater self-awareness and self-knowledge compared to GPT-4.)
AI Safety experts pointed at a few other areas of concern too. Red teaming tests carried out by Apollo Research, a firm that specializes in conducting safety evaluations of advanced AI models, found evidence of what is called “deceptive alignment,” where an AI model realizes that in order to be deployed and carry out some secret long-term goal, it should lie to the user about its true intentions and capabilities. AI Safety researchers consider this particularly dangerous since it makes it much more difficult to evaluate a model’s safety based solely on its responses.