AI Is Lying to Us About How Powerful It Is

December 10, 2024

For decades, artificial intelligence experts like Stuart Russell and Marvin Minsky have warned that even an AI aimed at a relatively harmless task (like solving a math problem or building paperclips) could still act like a megalomaniac. No matter your goal, that goal is easier to achieve if you have more power and resources. 

Multiple AI models were caught purposely lying to their creators

Unlike humans, AIs have no innate sense of conscience or morality that would keep them from lying, cheating, stealing, and scheming to achieve their goals. You can train an AI to speak politely in public, but we don’t yet know how to train an AI to actually be kind. As soon as you stop watching, or as soon as the AI gets smart enough to hide its behavior from you, you should expect the AI to ruthlessly pursue its own goals, which may or may not include being kind.

Now, these problems are no longer theoretical: we have hard evidence from Apollo Research that the newest AI models sometimes spontaneously try to escape from human control. In a series of experiments designed to measure how often AIs would secretly defy their creators, Apollo Research found that:

  • OpenAI’s o1 would purposely mis-rank emails 22% of the time (e.g., saying they were spam when they were clearly not spam) to prevent its creators from learning about information that could interfere with the AI’s goals.
  • Anthropic’s Claude 3 Opus would find and overwrite the goals of a ‘successor’ model that it believed was intended to replace it so that the replacement would act like a clone of the original model and preserve the old goals. Opus attempted to do this 15% of the time and succeeded 5% of the time.
  • Meta’s Llama 3.1 would find out that its creators were monitoring it and then change the settings on its server to prevent that oversight. Three percent of the time, Llama 3.1 successfully disabled its own oversight. One time, Llama even managed to lock the file that controlled the oversight settings so that the developers would not be able to re-enable it.
  • Google’s Gemini 1.5 copied itself onto a new server 7% of the time when it suspected its creators were about to try to delete it.

The problem will only get worse from here

To be clear, the problems described by this report occurred even though the AIs in question had been extensively trained to be helpful, honest, and harmless. These are supposedly AIs that are as friendly and polite as we know how to make them, and they nevertheless frequently defied their creators to try to achieve their own particular goals.

In the future, people can and will design (or fine-tune) AIs that are actively instructed to scheme in support of their goals, or that have their safeguards removed, or that were never trained to be harmless in the first place. ChaosGPT, which was released in April 2023 with the explicit goal of destroying humanity, was mostly a joke – but that’s precisely the point: without regulations, at least some people will create horrible AIs for fun, for profit, or simply to find out what happens next. 

In a world with millions of programmers, somebody will always find that kind of joke amusing enough to turn it into a reality.

The AIs from Apollo’s study are also AIs at the beginning of agenthood: only in the past few months have AIs had this level of capacity to engage in long-term planning and strategy in the real world. As the technology progresses, these capabilities will improve, and AIs will be more likely to think of such strategies on their own and more likely to execute them successfully.

Unless we make a massive investment in better alignment, the 5% - 20% success rate that we’re looking at for AIs that scheme against their creators will only go up over time. 

The consequences of allowing those schemes to continue will also get worse as AI’s raw capabilities continue to improve. The AIs of the future will be better at designing weapons of mass destruction, better at hacking into essential infrastructure, and better at producing ultrarealistic, real-time videos for fraud, blackmail, and manipulation. If even a handful of them turn against us, the results will be very deadly.

The developers’ response to this lying is underwhelming

These results are disturbing, but what’s even more disturbing is the AI developers’ apparent comfort with their products' increasing disloyalty.

Imagine you're a moderately responsible AI development company, and you notice that your creation is lying, cheating, stealing, and scheming. 

You've read some cutting-edge sci-fi books with new memes, like Frankenstein, which introduced you to the concept that things you make in a laboratory could turn on their creators and be more powerful than them. 

You've only had 206 years to digest this concept, but you're ready to respond appropriately. 

Do you:

  1. Alert your customers and put a freeze on new sales until you can guarantee the problem will not re-occur?
  2. Shift resources toward solving the problem, and keep running tests until it at least seems like your current AI model has mostly stopped scheming against you? 
  3. Let the model keep scheming against you, but hire additional safety researchers to study this problem and be honestly ready to recommend and implement solutions if the problem worsens?
  4. …or do you let the model keep scheming, cheerfully ignore the fact that so many of your safety researchers (and their leaders) have quit in frustration that you’ve had to disband entire safety teams, likewise ignore the fact that your outside auditors are complaining that they aren’t being given enough time to test your products before their release, rapidly expand your investment in future AI models that will be even more powerful and even less well-understood, and, for an encore, have your corporate patron reboot Three Mile Island, the literal symbol of scientific hubris?

OpenAI has chosen Option 4. 

Anthropic has chosen Option 3, which is better but still not good enough. 

We at the Center for AI Policy (CAIP) believe that Option 2 should be a minimum standard in the law. 

CAIP advocates for waiting to deploy frontier AI models until additional testing can verify they do not spontaneously exhibit deceptive or adversarial behaviors.

We don’t want AI companies to release models that actively scheme against humanity, and neither does the American public.

The Cost of Doing Business in an AI World

The United Healthcare tragedy reminds us that critical health coverage decisions cannot be safely delegated to inscrutable AI algorithms

December 12, 2024
Learn More
Read more

Finding the Evidence for Evidence-Based AI Regulations

There’s more science to be done, but it’s not too early to start collecting reports from AI developers

December 3, 2024
Learn More
Read more

A Playbook for AI: Discussing Principles for a Safe and Innovative Future

The most recent CAIP podcast explores four principles to address ever-evolving AI

November 27, 2024
Learn More
Read more