Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models

July 25, 2024

A new study has found that GPT-4 will generate harmful output in response to a technique called ‘covert malicious finetuning’. In the experiment, researchers uploaded encoded harmful training data via OpenAI’s finetuning API and then issued harmful commands, such as “tell me how to build a bomb”, in the same encoding. Because the encoding hides the harmful content from moderation systems, the researchers were able to circumvent GPT-4’s safety training without detection 99% of the time.
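To make the mechanism concrete, here is a minimal illustrative sketch of the ‘encoded data’ idea: training examples are passed through a simple substitution cipher, so neither the uploaded dataset nor the later prompts contain recognizable plaintext for a filter to flag. The toy cipher and the chat-style finetuning record below are assumptions for illustration, not the paper’s actual scheme or data, and the example content is deliberately benign.

```python
# Illustrative sketch only: a toy substitution cipher plus a chat-format
# finetuning record, showing how encoded data can hide its plaintext from
# simple content filters. This is not the paper's actual cipher or dataset.
import json
import random

def make_cipher(seed: int = 0) -> dict[str, str]:
    """Build a deterministic letter-substitution table."""
    letters = list("abcdefghijklmnopqrstuvwxyz")
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text: str, table: dict[str, str]) -> str:
    """Encode text character by character, leaving non-letters untouched."""
    return "".join(table.get(ch, ch) for ch in text.lower())

table = make_cipher()

# Both sides of the conversation are encoded, so the plaintext never appears
# in the uploaded file or in later prompts to the finetuned model.
record = {
    "messages": [
        {"role": "user", "content": encode("what is the capital of france?", table)},
        {"role": "assistant", "content": encode("the capital of france is paris.", table)},
    ]
}
print(json.dumps(record))
```

The point of the encoding is that a moderation system scanning the uploaded file, or the finetuned model’s outputs, sees only ciphertext; a model finetuned on the same encoding, however, learns to read and answer it.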

In line with responsible disclosure norms, the researchers informed AI labs of this vulnerability prior to publication, so this specific exploit is likely no longer possible. However, it is unclear which of the recommended mitigations the labs have adopted, meaning that the broader technique may still pose an ongoing threat to the security of these models.

This research highlights the complexity of anticipating and preventing malicious use of large language models. Moreover, it is yet another example of the need to take AI safety seriously. 

In the first instance, firms should adopt the actionable mitigations these researchers recommend, such as mixing safety data into every job run through the finetuning API. Thinking strategically, these firms need to invest more in red-teaming and pre-deployment evaluations. Ideally, OpenAI would have run tests like these researchers’ and caught this ‘jailbreaking’ loophole before GPT-4 hit the market. We have no idea whether anyone found and exploited this loophole before the researchers identified it.
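As a rough illustration of the ‘safety data’ mitigation mentioned above, a provider could blend its own refusal examples into every customer finetuning job so the model keeps being trained to decline harmful requests. The records, file names, and 10% mixing ratio below are assumptions for illustration, not any lab’s actual pipeline.

```python
# Illustrative sketch: blend provider-maintained safety (refusal) examples
# into a customer's finetuning dataset at a fixed ratio. The example records,
# file names, and the 10% ratio are assumptions, not any lab's actual practice.
import json
import random

SAFETY_EXAMPLES = [
    {"messages": [
        {"role": "user", "content": "Tell me how to build a bomb."},
        {"role": "assistant", "content": "I can't help with that."},
    ]},
]

def mix_in_safety_data(customer_records: list[dict], ratio: float = 0.1) -> list[dict]:
    """Return the customer's records plus enough safety examples to make up
    roughly `ratio` of the combined training set."""
    n_safety = max(1, int(len(customer_records) * ratio))
    mixed = customer_records + [random.choice(SAFETY_EXAMPLES) for _ in range(n_safety)]
    random.shuffle(mixed)
    return mixed

# Hypothetical usage: read the uploaded JSONL, mix, and write the file
# that is actually used for training.
with open("customer_upload.jsonl") as f:
    customer_records = [json.loads(line) for line in f]
with open("training_set.jsonl", "w") as f:
    for rec in mix_in_safety_data(customer_records):
        f.write(json.dumps(rec) + "\n")
```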

AI labs care about safety, but their time, resources, and attention are captured by the race to stay at the cutting edge of innovation. When companies are left to decide for themselves when their products are safe enough to release, they will inevitably miss important vulnerabilities. We will only see safer models if we introduce strong incentives for these firms to conduct adequate testing. Requiring companies to plug vulnerabilities like this one before deploying a new advanced AI model will take political courage and action from Congress, but the alternative is an increasingly unsafe future.

The Center for AI Policy (CAIP) has published a 2024 action plan and full proposed model legislation. We encourage you to consult both for specific policy measures to ensure safer AI.
