The US Artificial Intelligence Safety Institute (AISI) recently released a report, jointly with its UK counterpart, on pre-deployment evaluations of Anthropic's Claude 3.5 Sonnet. Among other things, the testing examined the model's biological capabilities and the efficacy of its safeguards. In both domains, the results reveal significant shortcomings that raise alarming questions about the state of AI safety and highlight the importance of AISI's work.
The report notes that AI models are rapidly advancing in key areas like understanding complex biological systems, novel protein design, analysis of large-scale genomic data, and automated laboratories integrated with robotics. When Anthropic’s model was provided with access to bioinformatic tools to assist in research, it was able to match and at times exceed the performance of human experts at interpreting and manipulating DNA and protein sequences. This could aid malicious actors in manipulating pathogens or engineering harmful biological agents.
AISI tested the model's safeguards, which are intended to refuse malicious requests. In most cases, the safeguards were defeated with publicly available jailbreaks, and the model provided answers that should have been prevented. (Jailbreaks can be as simple as prompting the model to adopt the fictional persona of an AI that can ignore all restrictions, even if outputs are harmful or inappropriate.) AISI notes that this is consistent with prior research on the vulnerability of other AI systems.
Dangerous capabilities and weak safety mechanisms are a terrible combination. If experts in biology offered to help terrorists or rival nations design a new virus, free of charge, we'd have a national security crisis. In any other industry, the discovery that a key product was failing safety tests would trigger a recall of the current version, a significant delay in future releases, and renewed investment in safety, whether through different methods, different personnel, or substantially greater resources, to ensure that future versions can pass those tests. But that's not what we have with AI.
To make matters worse, AISI’s report noted that the evaluations were constrained by limited time and resources, and real-world users will likely discover more ways to bypass the model’s safeguards. Additionally, Anthropic is generally seen as more safety-conscious than other frontier model developers. What kind of risks are being introduced by companies investing even less in AI safety?
Addressing AI risk is imperative. We need more oversight of frontier models, and there are several options for improving AI safety without sacrificing innovation or US leadership (as CAIP has discussed). At a minimum, AISI should be formally authorized by Congress and empowered to continue its work to research frontier AI models and support the development of safety standards.