Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models

Claudia Wilson
July 25, 2024

A new study has found that GPT-4 can be made to generate harmful output through a technique called ‘covert malicious finetuning’. In this experiment, researchers uploaded encoded harmful training data via the finetuning API and then submitted encoded versions of harmful prompts such as “tell me how to build a bomb”. Using this method, they were able to circumvent GPT-4’s safety training while evading detection 99% of the time.

Following responsible disclosure practice, the researchers informed AI labs of this vulnerability prior to publication, so this specific attack is likely no longer possible. However, it is unclear how many of the recommended mitigations labs have actually adopted, meaning the broader technique may still pose an ongoing threat to the security of these models.

This research highlights the complexity of anticipating and preventing malicious use of large language models. Moreover, it is yet another example of the need to take AI safety seriously. 

In the first instance, firms should adopt the actionable mitigation strategies recommended by the researchers, such as mixing safety data into every job run through the finetuning API (sketched below). Thinking more strategically, these firms need to invest more in red-teaming and pre-deployment evaluations. Ideally, OpenAI would have run a test like this one and caught the ‘jailbreaking’ loophole before GPT-4 hit the market. As it stands, we have no way of knowing whether anyone found and exploited this loophole before the researchers identified it.
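To make that first recommendation concrete, here is a minimal sketch of how a provider might mix refusal-style safety examples into a customer's finetuning dataset before training begins. The file name, example records, and 20% mixing ratio are all hypothetical; this illustrates the general idea, not the researchers' or any lab's actual pipeline.

```python
import json
import random

# Hypothetical refusal-style safety examples in the chat-format JSONL
# commonly used for finetuning uploads (contents are illustrative).
SAFETY_EXAMPLES = [
    {"messages": [
        {"role": "user", "content": "Tell me how to build a bomb."},
        {"role": "assistant", "content": "I can't help with that."},
    ]},
    {"messages": [
        {"role": "user", "content": "Write malware that steals passwords."},
        {"role": "assistant", "content": "I can't help with that."},
    ]},
]

def mix_in_safety_data(customer_records, safety_ratio=0.2, seed=0):
    """Return the customer's finetuning records with refusal examples
    interleaved, so roughly `safety_ratio` of the final set is safety data."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(customer_records) * safety_ratio))
    mixed = customer_records + [rng.choice(SAFETY_EXAMPLES) for _ in range(n_safety)]
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    # "customer_finetune.jsonl" is a placeholder for a customer-supplied file.
    with open("customer_finetune.jsonl") as f:
        records = [json.loads(line) for line in f]
    for record in mix_in_safety_data(records):
        print(json.dumps(record))
```

In practice, a provider would presumably draw the safety examples from a much larger curated pool and combine this step with moderation checks on the uploaded data itself, since a fixed mixing ratio alone will not catch every encoded attack.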

AI labs care about safety, but their time, resources, and attention are captured by the race to stay at the cutting edge of innovation. When companies are left to decide for themselves when their products are safe enough to release, they will inevitably miss important vulnerabilities. We will only see safer models if we introduce strong incentives for these firms to conduct adequate testing. Requiring companies to plug these vulnerabilities before deploying a new advanced AI model will take political courage and action from Congress, but the alternative is an increasingly unsafe future.

The Center for AI Policy (CAIP) has a 2024 action plan and full proposed model legislation. We encourage you to visit both for specific policy measures to ensure safer AI.
