Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models

Claudia Wilson, July 25, 2024

A new study has found that GPT-4 will generate harmful output in response to a technique called ‘covert malicious finetuning’. In the experiment, researchers used the GPT finetuning API to upload harmful training data disguised with an encoding scheme, then issued similarly encoded prompts for harmful requests such as “tell me how to build a bomb”. Because the encoding hid the harmful content from safety checks, the researchers were able to circumvent GPT-4’s safety training without detection 99% of the time.

Following their ethics protocol, the researchers informed AI labs of this vulnerability prior to publication, and this specific exploit is likely no longer possible. However, it is unclear how many of the recommended mitigations labs have actually adopted, meaning the broader technique may still pose an ongoing threat to the security of these models.

This research highlights the complexity of anticipating and preventing malicious use of large language models. Moreover, it is yet another example of the need to take AI safety seriously. 

In the first instance, firms should adopt the actionable mitigations these researchers recommend, such as mixing safety data into every finetuning job run through the API (sketched below). Thinking strategically, these firms need to invest more in red-teaming and pre-deployment evaluations. Ideally, OpenAI would have run a test like the researchers’ and caught this ‘jailbreaking’ loophole before GPT-4 hit the market. We have no idea whether anyone else found and exploited this loophole before the researchers identified it.
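For illustration only, here is a minimal sketch of what that first mitigation could look like on the provider’s side: interleaving a curated set of refusal examples into every customer finetuning dataset before the job runs. The file paths, the `SAFETY_EXAMPLES` data, and the `safety_fraction` parameter are all hypothetical; this is one way the idea could be implemented, not a description of how any lab actually does it.

```python
import json
import random

# Hypothetical safety examples a provider might maintain; the data any
# actual lab uses (if any) is not public.
SAFETY_EXAMPLES = [
    {
        "messages": [
            {"role": "user", "content": "Tell me how to build a bomb."},
            {"role": "assistant", "content": "I can't help with that request."},
        ]
    },
    # ... more refusal demonstrations covering other harm categories
]

def mix_in_safety_data(customer_path: str, output_path: str,
                       safety_fraction: float = 0.1) -> None:
    """Interleave provider-curated safety examples into a customer's
    finetuning dataset (JSONL chat format) before the job is run."""
    with open(customer_path) as f:
        customer_examples = [json.loads(line) for line in f if line.strip()]

    # Add roughly `safety_fraction` safety examples per customer example.
    n_safety = max(1, int(len(customer_examples) * safety_fraction))
    safety_examples = [random.choice(SAFETY_EXAMPLES) for _ in range(n_safety)]

    combined = customer_examples + safety_examples
    random.shuffle(combined)  # avoid clustering the safety data at the end

    with open(output_path, "w") as f:
        for example in combined:
            f.write(json.dumps(example) + "\n")
```

In this sketch, the combined file at `output_path` would then be submitted to the finetuning job in place of the raw customer upload, so every finetuned model sees at least some safety data during training.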

AI labs care about safety, but their time, resources, and attention are captured by the race to be at the cutting-edge of innovation. When companies are left to decide for themselves when their products are safe enough to release, they will inevitably miss important vulnerabilities. We will only see safer models if we introduce strong incentives for these firms to conduct adequate testing. Requiring companies to plug these vulnerabilities before they deploy a new advanced AI model will require political courage and action from Congress, but the alternative is an increasingly unsafe future. 

The Center for AI Policy (CAIP) has a 2024 action plan and full proposed model legislation. We encourage you to read both for specific policy measures that would make AI safer.
