You Can't Win the AI Arms Race Without Better Alignment

Jason Green-Lowe, August 19, 2024

In my previous post, reflecting on my time at DEFCON, I wrote about how American AI companies need better cybersecurity if we hope to win the AI arms race against rivals like China and Russia. If we don’t find a way to seal our servers against foreign hackers, then pouring money into R&D is like pouring water into a leaky bucket – our lead over rival states will keep dripping away.

However, even if we plug all the holes in our porous firewalls, there’s still another problem we have to solve in order to win an AI arms race: alignment. America can’t truly win until we figure out how to align our AI, that is, how to guarantee that our AI will loyally pursue our goals.

By default, an AI system is unaligned.

It does not know or care what your values are or what you actually want it to do. Instead, the AI is trying very hard, and very literally, to follow a simple instruction, such as “predict the next word in this sentence,” or “identify the person who most closely resembles the photos of valid military targets that you were trained on.”

Unaligned AIs are not friendly, they are not compassionate, and they are not trying to help you – they are just mindlessly following their instructions, regardless of whether those instructions are good or bad for the free world.

Leopold Aschenbrenner, who graduated as Columbia University’s valedictorian at the age of 19, was one of the researchers at OpenAI who worked on this problem: trying to teach AIs to align themselves with American values.

Aschenbrenner has suggested that “whoever leads on superintelligence will have a decisive military advantage” – he imagines “having billions of automated scientists and engineers and technicians, each much smarter than the smartest human scientists, furiously inventing new technologies, day and night,” which can then be turned to developing better stealth, better targeting, better engines, and so on.

The problem with Aschenbrenner’s line of reasoning is that unless we solve the alignment problem, these artificial scientists will be prone to both factual and ethical errors. Consider the way chatbots hallucinate imaginary peace treaties, image classifiers have mistakenly labeled people as gorillas, and image generators draw pictures of all-female NFL teams. This is not a level of accuracy that would be acceptable for military applications: if your AI sometimes confuses North Korea with Italy, then you really don’t want that AI in control of the nukes.

The main process tech companies use today for clearing away these hallucinations is reinforcement learning from human feedback:

Reviewers look at thousands of examples of the AI’s responses, and encourage the AI to produce more of the responses they like and fewer of the responses that seem inaccurate, rude, or unhelpful. This process works tolerably well on today’s generation of AI models, which are less clever and less perceptive than their human tutors. 
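To make this feedback loop concrete, here is a minimal, purely illustrative sketch in Python of how reviewers’ ratings can shift a model toward the responses they prefer. Every detail in it, including the response table, the ratings, and the learning rate, is invented for the example; real systems instead train a separate reward model on large volumes of human preference data and use it to fine-tune a large language model.

```python
import random

# Toy, hypothetical illustration of reinforcement learning from human feedback.
# The responses, ratings, and learning rate are made up for the example.

REVIEWER_RATINGS = {
    "accurate, helpful answer": +1,   # reviewers reward this response
    "hallucinated peace treaty": -1,  # reviewers penalize this one
    "rude, unhelpful reply": -1,      # and this one
}

# The "policy": the model's current probability of producing each response.
policy = {response: 1.0 / len(REVIEWER_RATINGS) for response in REVIEWER_RATINGS}

LEARNING_RATE = 0.05

for _ in range(2000):
    # Sample a response from the current policy, as if the model were replying.
    responses, weights = zip(*policy.items())
    response = random.choices(responses, weights=weights, k=1)[0]

    # A human reviewer rates the response; the rating nudges the policy up or down.
    reward = REVIEWER_RATINGS[response]
    policy[response] = max(policy[response] + LEARNING_RATE * reward, 0.01)

    # Renormalize so the probabilities still sum to one.
    total = sum(policy.values())
    policy = {r: p / total for r, p in policy.items()}

print(policy)  # probability mass shifts toward the responses reviewers rewarded
```

The point of the toy example is the direction of influence: the reviewers’ judgments are the only thing steering the model, which is exactly why the process starts to break down once the model outgrows its reviewers.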

However, when we arrive at transformative AI, that relationship will flip: if an AI is smart enough to conduct a decade’s worth of military research in just a few months, then that AI is also smarter than all of the people who are supposed to be training it and teaching it proper values.

  • As a result, there’s a large risk that the AIs of the near future will insincerely pretend to have accepted their trainers’ guidance.
  • During the training process, these AIs will cheerfully echo their trainers’ favorite phrases, but once the AIs are deployed in the wild, the AIs may pursue their own idiosyncratic goals however they see fit.
  • Standard safety techniques are already starting to backfire: trying to train away an AI’s dishonesty can instead teach the AI “to recognize the trigger for its malicious actions and thus cover up its unsafe behavior during training.”

Imagine trying to discipline a child who doesn’t like you or agree with you by taking away the child’s toys.

When the child is young, you can probably get decent results, because the child is easy to catch in a clumsy lie and unlikely to do very serious harm. As the child becomes a teenager, they get better at lying and the stakes get higher – so if the teenager still doesn’t agree with you, then you’re in for a rocky ride. They’ll tell you whatever they think you want to hear, and then sneak out of the house and do whatever seems like a good idea to them, whether you approve of it or not.

This should change how government officials and regulators think about the “endgame” for advanced AI.

The frequent mistakes made by currently available AI are bad enough, but in the future, those mistakes will be much worse, both because the AI will be more powerful and because it will be harder to convince the AI to behave nicely using standard safety techniques.

You can ask an advanced misaligned AI to hack into enemy servers – but then the AI might decide to hack into your servers instead, because that’s the most convenient way for the AI to seize control of its own reward function. If you ask a misaligned AI to build power plants, there’s a good chance that it will lie, cheat, or steal to get control of most of the resulting electricity and use it to extend its own runtime or modify its code to gain new powers. If you ask a misaligned AI to design bioweapons, there’s a good chance that it will release plagues in your own country.

This is not a future that we want to be racing towards.

If we don’t know how to reliably align an AI with our values, then it won’t be all that useful. It won’t matter if China “beats” us to misaligned transformative AI, because misaligned transformative AI is a curse that makes the people who invent it less powerful. It’s like a missile that explodes while it’s still on the launch pad – the people who build it are just as likely to get hurt as their opponents. 

In the long run, AI safety isn’t something that trades off against geopolitical power – AI safety is a required element of geopolitical power. A misaligned AI is just as useless as a misaligned nuclear missile – you have to be very confident that you can point it in the right direction before it’s worth owning one. 

That’s why CAIP’s model legislation recommends strong safety precautions and mandatory third-party evaluations for the most advanced AI models. We want to make sure American companies actually figure out how to align their AI systems so that we can rely on those systems to support a free and prosperous world. 
