Center for AI Policy Search Results

AI Will Be Happy to Help You bLUid a Bomb | Center for AI Policy | CAIP

Best-of-N Jailbreaking,” revealed one of the glaring problems with this paradigm. Ordinary jailbreaking is just the art of getting AI models to do tasks for you that are supposedly locked away by their reinforcement learning.…

www.centeraipolicy.org/work/ai-will-be-happy-to-help-you-bluid-a-bomb

Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models | Center for AI Policy | CAIP

Researchers Find a New Covert Technique to ‘Jailbreak’ Language Models. Claudia Wilson. July 25, 2024. A new study has found that GPT-4 will generate harmful output in response to a technique called ‘covert malicious finetuning’.…

www.centeraipolicy.org/work/researchers-find-a-new-covert-technique-to-jailbreak-language-models

Bio Risks and Broken Guardrails: What the AISI Report Tells Us About AI Safety Standards | Center for AI Policy | CAIP

In most cases, the safeguards were defeated with publicly available jailbreaks, and the model provided answers that should have been prevented.…

www.centeraipolicy.org/work/bio-risks-and-broken-guardrails-what-the-aisi-report-tells-us-about-ai-safety-standards