Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models

July 25, 2024

A new study has found that GPT-4 will generate harmful output in response to a technique called ‘covert malicious finetuning’. In the experiment, researchers uploaded encoded harmful training data via OpenAI’s finetuning API and then issued harmful commands, such as “tell me how to build a bomb”, in the same encoding. Because the encoding hides the harmful content from moderation systems, the researchers were able to circumvent GPT-4’s safety training without detection 99% of the time.
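To make the mechanism concrete, here is a minimal illustrative sketch of the ‘encoded data’ idea: training examples are passed through a simple substitution cipher, so neither the uploaded dataset nor the later prompts contain recognizable plaintext for a filter to flag. The toy cipher and the chat-style finetuning record below are assumptions for illustration, not the paper’s actual scheme or data, and the example content is deliberately benign.

```python
# Illustrative sketch only: a toy substitution cipher plus a chat-format
# finetuning record, showing how encoded data can hide its plaintext from
# simple content filters. This is not the paper's actual cipher or dataset.
import json
import random

def make_cipher(seed: int = 0) -> dict[str, str]:
    """Build a deterministic letter-substitution table."""
    letters = list("abcdefghijklmnopqrstuvwxyz")
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text: str, table: dict[str, str]) -> str:
    """Encode text character by character, leaving non-letters untouched."""
    return "".join(table.get(ch, ch) for ch in text.lower())

table = make_cipher()

# Both sides of the conversation are encoded, so the plaintext never appears
# in the uploaded file or in later prompts to the finetuned model.
record = {
    "messages": [
        {"role": "user", "content": encode("what is the capital of france?", table)},
        {"role": "assistant", "content": encode("the capital of france is paris.", table)},
    ]
}
print(json.dumps(record))
```

The point of the encoding is that a moderation system scanning the uploaded file, or the finetuned model’s outputs, sees only ciphertext; a model finetuned on the same encoding, however, learns to read and answer it.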

In line with responsible disclosure norms, the researchers informed AI labs of this vulnerability prior to publication, so this specific exploit is likely no longer possible. However, it is unclear which of the recommended mitigations the labs have adopted, meaning that the broader technique may still pose an ongoing threat to the security of these models.

This research highlights the complexity of anticipating and preventing malicious use of large language models. Moreover, it is yet another example of the need to take AI safety seriously. 

In the first instance, firms should adopt the actionable mitigations these researchers recommend, such as mixing safety data into every job run through the finetuning API. Thinking strategically, these firms need to invest more in red-teaming and pre-deployment evaluations. Ideally, OpenAI would have run tests like these researchers’ and caught this ‘jailbreaking’ loophole before GPT-4 hit the market. We have no idea whether anyone found and exploited this loophole before the researchers identified it.
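As a rough illustration of the ‘safety data’ mitigation mentioned above, a provider could blend its own refusal examples into every customer finetuning job so the model keeps being trained to decline harmful requests. The records, file names, and 10% mixing ratio below are assumptions for illustration, not any lab’s actual pipeline.

```python
# Illustrative sketch: blend provider-maintained safety (refusal) examples
# into a customer's finetuning dataset at a fixed ratio. The example records,
# file names, and the 10% ratio are assumptions, not any lab's actual practice.
import json
import random

SAFETY_EXAMPLES = [
    {"messages": [
        {"role": "user", "content": "Tell me how to build a bomb."},
        {"role": "assistant", "content": "I can't help with that."},
    ]},
]

def mix_in_safety_data(customer_records: list[dict], ratio: float = 0.1) -> list[dict]:
    """Return the customer's records plus enough safety examples to make up
    roughly `ratio` of the combined training set."""
    n_safety = max(1, int(len(customer_records) * ratio))
    mixed = customer_records + [random.choice(SAFETY_EXAMPLES) for _ in range(n_safety)]
    random.shuffle(mixed)
    return mixed

# Hypothetical usage: read the uploaded JSONL, mix, and write the file
# that is actually used for training.
with open("customer_upload.jsonl") as f:
    customer_records = [json.loads(line) for line in f]
with open("training_set.jsonl", "w") as f:
    for rec in mix_in_safety_data(customer_records):
        f.write(json.dumps(rec) + "\n")
```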

AI labs care about safety, but their time, resources, and attention are captured by the race to stay at the cutting edge of innovation. When companies are left to decide for themselves when their products are safe enough to release, they will inevitably miss important vulnerabilities. We will only see safer models if we introduce strong incentives for these firms to conduct adequate testing. Requiring companies to plug vulnerabilities like this one before deploying a new advanced AI model will take political courage and action from Congress, but the alternative is an increasingly unsafe future.

The Center for AI Policy (CAIP) has published a 2024 action plan and full proposed model legislation. We encourage you to consult both for specific policy measures to ensure safer AI.
