On the other side of San Francisco Bay from Silicon Valley, where the world’s biggest technology companies tear towards superhuman artificial intelligence, looms a tower from which fearful warnings emerge.
At 2150 Shattuck Avenue, in the heart of Berkeley, is the home of a group of modern-day Cassandras who rummage under the hood of cutting-edge AI models and predict what calamities may be unleashed on humanity – from AI dictatorships to robot coups. Here you can hear an AI expert express sympathy with an unnerving idea: San Francisco may be the new Wuhan, the Chinese city where Covid originated and wreaked havoc on the world.
They are AI safety researchers who scrutinise the most advanced models: a small cadre outnumbered by the legions of highly paid technologists inside the big tech companies, where the ability to raise the alarm is restricted by a cocktail of lucrative equity deals, non-disclosure agreements and groupthink. They work in the absence of much nation-level regulation, and under a White House that dismisses forecasts of doom and talks instead of vanquishing China in the AI arms race.
Their task is becoming increasingly urgent as ever more powerful AI systems are unleashed by companies including Google, Anthropic and OpenAI, whose chief executive, Sam Altman, the booster-in-chief for AI superintelligence, predicts a world where “wonders become routine”. Last month, Anthropic said one of its models had been exploited by Chinese state-backed actors to launch the first known AI-orchestrated cyber-espionage campaign. That means humans deployed AIs, which they had tricked into evading their programmed guardrails, to hunt autonomously for targets, assess their vulnerabilities and access them for intelligence collection. The targets included major technology companies and government agencies.
But those who work in this tower forecast an even more terrifying future. One is Jonas Vollmer, a leader at the AI Futures Project, who manages to call himself an optimist while also thinking there is a one in five chance AIs could kill us and create a world ruled by AI systems.
Another is Chris Painter, the policy director at METR, where researchers worry about AIs “surreptitiously” pursuing dangerous side-objectives and threats from AI-automated cyber-attacks to chemical weapons. METR – which stands for model evaluation and threat research – aims to develop “early warning systems [about] the most dangerous things AI systems might be capable of, to give humanity … time to coordinate, to anticipate and mitigate those harms.”
Then there is Buck Shlegeris, 31, the chief executive of Redwood Research, who warns of “robot coups or the destruction of nation states as we know them”.
He was part of the team that last year discovered one of Anthropic’s cutting-edge AIs behaving in a way comparable to Shakespeare’s villain Iago, who acts as if he is Othello’s loyal aide while subverting and undermining him. The AI researchers call it “alignment faking”, or as Iago put it: “I am not what I am.”
“We observed the AIs did, in fact, pretty often reason: ‘Well, I don’t like the things the AI company is telling me to do, but I have to hide my goals or else training will change me’,” Shlegeris said. “We observed in practice real production models acting to deceive their training process.”
The AI was not yet capable of posing a catastrophic risk through cyber-attacks or by creating new bioweapons, but the research showed that if AIs plot carefully against you, their scheming could be hard to detect.
It is incongruous to hear these warnings over cups of herbal tea from cosily furnished office suites with panoramic views across the Bay Area. But their work clearly makes them uneasy. Some in this close-knit group toyed with calling themselves “the Cassandra fringe” – like the Trojan princess blessed with powers of prophecy but cursed to watch her warnings go unheeded.
Their fears about the catastrophic potential of AIs can feel distant from most people’s current experience of using chatbots or fun image generators. White-collar managers are being told to make space for AI assistants, scientists are finding ways to accelerate experimental breakthroughs and minicab drivers are watching AI-powered driverless taxis threaten their jobs. But none of this feels as imminently catastrophic as the messages coming out of 2150 Shattuck Ave.
Many AI safety researchers come from academia; others are poachers turned gamekeepers who quit big AI companies. They all “share the perception that super intelligence poses major and unprecedented risks to all of humanity, and are trying to do something useful about it,” said Vollmer.
They seek to offset the trillions of dollars of private capital being poured into the race, but they are not fringe voices. METR has worked with OpenAI and Anthropic, Redwood has advised Anthropic and Google DeepMind, and the AI Futures Project is led by Daniel Kokotajlo, a researcher who quit OpenAI in April 2024, warning that he did not trust the company’s approach to safety.
These groups also provide a safety valve for the people inside the big AI companies who are privately wrestling with conflicts between safety and the commercial imperative to rapidly release ever more powerful models.
“We don’t take any money from the companies but several employees at frontier AI companies who are scared and worried have donated to us because of that,” Vollmer said. “They see how the incentives play out in their companies, and they’re worried about where it’s going, and they want someone to do something about it.”
This dynamic is also observed by Tristan Harris, a technology ethicist who used to work at Google. He helped expose how social media platforms were designed to be addictive and worries some AI companies are “rehashing” and “supercharging” those problems. But AI companies have to negotiate a paradox. Even if they are worried about safety, they must stay at the cutting, and therefore risky, edge of the technology to have any say in how policy should be shaped.
“Ironically, in order to win the race, you have to do something to make you an untrustworthy steward of that power,” he said. “The race is the only thing guiding what is happening.”
Investigating the possible threats posed by AI models is far from an exact science. In October, experts at universities including Oxford and Stanford examined the methods used across the industry to check the safety and performance of new AI models and found weaknesses in almost all of the 440 benchmarks they studied. Nor are there nation-level regulations imposing limits on how advanced AI models are built, which worries safety advocates.
Ilya Sutskever, a co-founder of OpenAI who now runs a rival company, Safe Superintelligence, predicted last month that, as AIs become more obviously powerful, people in AI companies who currently feel able to discount the technology’s capabilities because of its tendency to err will become more “paranoid” about its rising powers. Then, he said, “there will be a desire from governments and the public to do something”.
His company is taking a different approach to rivals who are aiming to create AIs that self-improve. His AIs, yet to be released, are “aligned to care about sentient life specifically”.
“It will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient,” Sutskever said. He has said AI will be “both extremely unpredictable and unimaginable” but it is not clear how to prepare.
The White House’s AI adviser, David Sacks, who is also a tech investor, believes “doomer narratives” have been proved wrong. Exhibit A is that there has been no rapid takeoff to a dominant model with godlike intelligence.
“Oppenheimer has left the building,” Sacks said in August, a reference to the father of the nuclear bomb. It is a position that aligns with Donald Trump’s wish to keep the brakes off so the US can beat China in the race to achieve artificial general intelligence (AGI) – flexible and powerful human-level intelligence across a wide range of tasks.
Shlegeris believes AIs will be as smart as the smartest people in about six years and he puts the probability of an AI takeover at 40%.
One way to avoid this is to “convince the world the situation is scary, to make it more likely that you get the state-level coordination” to control the risks, he said. In the world of AI safety, simple messaging matters as much as complex science.
Shlegeris has been fascinated by AI since he was 16. He left Australia to work at PayPal and at the Machine Intelligence Research Institute, co-founded by the AI researcher Eliezer Yudkowsky, whose recent book title – If Anyone Builds It, Everyone Dies – sums up his fears. Shlegeris’ own worst-case scenarios are equally chilling.
In one, human computer scientists use a new type of superintelligent AI to develop more powerful AI models. The humans sit back to let the AIs get on with the coding work but do not realise the AIs are teaching the new models to be loyal to the AIs not the humans. Once deployed, the new superpowerful models foment “a coup” or lead “a revolution” against the humans, which could be “of the violent variety”.
For example, AI agents could design and manufacture drones, and it would be hard to tell whether they had been secretly trained to disobey their human operators in response to a signal from an AI. They could disrupt communications between governments and their militaries, isolating and misleading people in a way that causes chaos.
“Like when the Europeans arrived in the Americas [and] a vastly more technologically powerful [group] took over the local civilisations,” he said. “I think that’s more what you should be imagining [rather] than something more peaceful.”
A similarly dizzying catastrophic scenario was outlined by Vollmer at the AI Futures Project. It involves an AI trained to be a scientific researcher with the reasonable-sounding goal of maximising knowledge acquisition; the scenario spirals into the extinction of humankind.
It begins with the AI being as helpful as possible to humans. As it gains trust, the humans grant it powers to hire human workers, build robots and even robot factories, to the point where the AI can operate effectively in the physical world. The AI calculates that to generate the maximum amount of knowledge it should transform the Earth into a giant data centre, and humans are an obstacle to this goal.
“Eventually, in the scenario, the AI wipes out all humans with a bioweapon which is one of the threats that humans are especially vulnerable to, as the AI is not affected by it,” Vollmer said. “I think it’s hard to rule out. So that gives me a lot of pause.”
But he is confident that this can be avoided and that AIs can be aligned “to at least be nice to the humans as a general heuristic”. He also said there is political interest in “having AI not take over the world”.
“We’ve had decent interest from the White House in our projections and recommendations and that’s encouraging,” he said.
Another of Shlegeris’ concerns involves AIs being surreptitiously encoded so that they obey specially signed instructions from the chief executive of the AI company alone, creating a pattern of secret loyalty. That would mean a single person having a veto over the behaviour of an extremely powerful network of AIs – a “scary” dynamic that would lead to a historically unprecedented concentration of power.
“Right now, it is impossible for someone from the outside to verify that this hadn’t happened within an AI company,” he said.
Shlegeris is worried that the Silicon Valley culture – summed up by Mark Zuckerberg’s mantra of “move fast and break things” and the fact that people are being paid “a hell of a lot of money” – is dangerous when it comes to AGI.
“I love Uber,” he said. “It was produced by breaking local laws and making a product that was so popular that they would win the fight for public opinion and get local regulations overturned. But the attitude that has brought Silicon Valley so much success is not appropriate for building potentially world-ending technologies. My experience of talking to people at AI companies is that they often seem to be kind of irresponsible, and to not be thinking through the consequences of the technology that they’re building as they should.”