For decades, artificial intelligence experts like Stuart Russell and Marvin Minsky have warned that even an AI aimed at a relatively harmless task (like solving a math problem or building paperclips) could still act like a megalomaniac. No matter your goal, that goal is easier to achieve if you have more power and resources.
Unlike humans, AIs have no innate sense of conscience or morality that would keep them from lying, cheating, stealing, and scheming to achieve their goals. You can train an AI to speak politely in public, but we don’t yet know how to train an AI to actually be kind. As soon as you stop watching, or as soon as the AI gets smart enough to hide its behavior from you, you should expect the AI to ruthlessly pursue its own goals, which may or may not include being kind.
Now, these problems are no longer theoretical: we have hard evidence from Apollo Research that the newest AI models sometimes spontaneously try to escape from human control. In a series of experiments designed to measure how often AIs would secretly defy their creators, Apollo Research found that:
To be clear, the problems described in this report occurred even though the AIs in question had been extensively trained to be helpful, honest, and harmless. These are supposed to be AIs that are as friendly and polite as we know how to make them, and yet they frequently defied their creators to try to achieve their own goals.
In the future, people can and will design (or fine-tune) AIs that are actively instructed to scheme in support of their goals, or that have their safeguards removed, or that were never trained to be harmless in the first place. ChaosGPT, which was released in April 2023 with the explicit goal of destroying humanity, was mostly a joke – but that’s precisely the point: without regulations, at least some people will create horrible AIs for fun, for profit, or simply to find out what happens next.
In a world with millions of programmers, somebody will always find that kind of joke amusing enough to turn it into a reality.
The AIs from Apollo’s study are also at the very beginning of agenthood: only in the past few months have AI models had this level of capacity for long-term planning and strategy in the real world. As the technology progresses, these capabilities will improve, and AIs will become more likely to think of such strategies on their own and more likely to execute them successfully.
Unless we make a massive investment in better alignment, the 5% to 20% success rate we are seeing for AIs that scheme against their creators will only go up over time.
The consequences of allowing those schemes to continue will also get worse as AI’s raw capabilities continue to improve. The AIs of the future will be better at designing weapons of mass destruction, better at hacking into essential infrastructure, and better at producing ultrarealistic, real-time videos for fraud, blackmail, and manipulation. If even a handful of them turn against us, the results will be deadly.
These results are disturbing, but what’s even more disturbing is the AI developers’ apparent comfort with their products’ increasing disloyalty.
Imagine you're a moderately responsible AI development company, and you notice that your creation is lying, cheating, stealing, and scheming.
You've read some cutting-edge sci-fi, like Frankenstein, which introduced you to the novel concept that the things you build in a laboratory can turn on their creators and overpower them.
You've only had 206 years to digest this concept, but you're ready to respond appropriately.
Do you:
OpenAI has chosen Option 4.
Anthropic has chosen Option 3, which is better but still not good enough.
We at the Center for AI Policy (CAIP) believe that Option 2 should be a minimum standard in the law.
CAIP advocates for waiting to deploy frontier AI models until additional testing can verify they do not spontaneously exhibit deceptive or adversarial behaviors.
We don’t want AI companies to release models that actively scheme against humanity, and neither does the American public.
The UnitedHealthcare tragedy reminds us that critical health coverage decisions cannot be safely delegated to inscrutable AI algorithms
There’s more science to be done, but it’s not too early to start collecting reports from AI developers
The most recent CAIP podcast explores four principles to address ever-evolving AI