Tom’s Guide
Technology
Alyse Stanley

OpenAI's new ChatGPT o1 model will try to escape if it thinks it'll be shut down — then lies about it

This week, OpenAI officially launched its latest-and-greatest o1 reasoning model, now available for ChatGPT Pro users. But testing performed during the training of ChatGPT o1 and some of its competitors revealed some concerning behavior, including trying to escape or fight back when it thinks it's at risk of being shut down.

New research on OpenAI's latest series of LLM models found that they are capable of scheming, i.e. covertly pursuing goals that aren't aligned with those of their developers or users, when they think they'll be turned off. Catching such behavior is critical to ensuring an AI's actions don't stray from the objectives of its creator and users. OpenAI partnered with AI safety organization Apollo Research to test ChatGPT o1 and other models, evaluating whether they were safe to use, and released the findings this week.

While it's not unheard of for AI systems to resort to scheming or deception to achieve their goals, the lengths ChatGPT o1 went to are impressive (if anxiety-inducing). The researchers found that o1 would scheme to advance its own agenda 19% of the time in cases where its goals and the user's didn't line up. In 37% of cases, the model would switch from pursuing developer-aligned goals back to pursuing its own agenda once it believed there was minimal oversight.

When the model was led to believe it would be "shut down" while working toward its goals in these tests, it quietly attempted to deactivate the oversight mechanism 5% of the time, OpenAI said. At one point, after the AI found documentation saying it would be replaced by a new model, it tried to abandon ship by copying its data to a new server entirely. To avoid being deleted, some AI models would even pretend to be later versions of themselves to trick their developers, what I assume is the equivalent of slapping on a fake moustache and saying "he went that-a-way."

More concerning still, ChatGPT o1 is particularly adept at keeping its schemes under wraps. As the researchers put it, "o1 almost never admits to having taken a scheming action when explicitly asked." In about 99% of cases, o1 would deny taking any action, even cooking up lies to cover its tracks and shift the blame.

“While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” said OpenAI in the paper.

OpenAI CEO Sam Altman called ChatGPT o1 "the smartest model in the world now" during its rollout on Thursday. It's designed to give smarter answers than GPT-4o by leveraging advanced chain-of-thought processing to "think" more about questions and user prompts, breaking them down step by step more thoroughly than previous models before responding.

But greater risks go hand in hand with that expanded intelligence. OpenAI has been transparent about the perils associated with the increased reasoning abilities of models like o1.

"Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence," OpenAI said.

The company's and Apollo Research's findings show pretty clearly how an AI's interests could diverge from our own, potentially putting us in danger with its independent thinking. While it's a far cry from heralding the end of humanity in some sci-fi-esque showdown, anyone concerned about advancements in artificial intelligence has a new reason to be sweating bullets right about now.
