Understanding precisely how quickly AI systems are becoming capable of independently executing complex tasks is crucial for both policymakers and the public. Clear, intuitive metrics that measure AI autonomy in terms directly comparable to human capabilities can help everyone better understand the potential impacts and risks associated with rapidly advancing AI technologies.
New research from Model Evaluation & Threat Research (METR), a non-profit dedicated to empirical evaluations of frontier AI systems, provides exactly such a metric. METR introduces the "50% task completion time horizon," offering a way to quantify AI autonomy based explicitly on human performance benchmarks.
METR’s results are striking. They found that the length of tasks AI systems can complete independently, measured by how long the same tasks take skilled humans, is doubling approximately every seven months. This exponential rate has held steady since 2019, suggesting a robust trend.
To put this plainly: if the current trend holds, within five years frontier AI systems could independently execute complex software projects that today require weeks or even months of human expert labor.
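For a rough sense of what that compounding implies, the sketch below extrapolates the trend forward. The seven-month doubling time comes from METR's finding; the roughly one-hour starting horizon and the 40-hour work week used for conversion are illustrative assumptions, not METR's figures.

```python
# Back-of-the-envelope projection of the 50% time horizon (illustrative only).
# Assumptions: ~7-month doubling time (METR's trend) and, purely for
# illustration, a current horizon of about 1 hour of skilled-human work.

DOUBLING_TIME_MONTHS = 7
CURRENT_HORIZON_HOURS = 1.0  # assumed starting point, not a METR figure

def projected_horizon_hours(months_ahead: float) -> float:
    """Extrapolate the time horizon forward under steady exponential growth."""
    doublings = months_ahead / DOUBLING_TIME_MONTHS
    return CURRENT_HORIZON_HOURS * 2 ** doublings

for years in (1, 3, 5):
    hours = projected_horizon_hours(12 * years)
    print(f"{years} year(s) out: ~{hours:,.0f} hours (~{hours / 40:,.1f} 40-hour work weeks)")
```

Under these assumptions, five years of steady doubling multiplies the horizon by roughly 380x, taking it from about an hour to on the order of weeks or months of expert work, which is the scale of project the article describes.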
METR’s approach is straightforward: assemble a suite of tasks, measure how long skilled humans take to complete each one, record whether AI systems succeed on the same tasks, and identify the task length, in human time, at which a given system succeeds about half the time. That length is the system's 50% task completion time horizon.
This approach allows policymakers and researchers alike to quantify progress clearly, providing an intuitive benchmark of how AI autonomy compares to human capabilities.
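To make the metric concrete, the sketch below shows one way such a horizon could be estimated from task-level results: fit a logistic curve relating task length to success, then read off the length at which the predicted success rate crosses 50%. The data points and the use of scikit-learn are assumptions for the example, not a description of METR's actual pipeline.

```python
# Minimal sketch of estimating a "50% time horizon" (illustration, not METR's code).
# Assumed inputs: per-task human completion times and whether the AI succeeded.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([4, 8, 15, 30, 60, 120, 240, 480])  # hypothetical tasks
ai_succeeded  = np.array([1, 1, 1,  1,  1,   0,   1,   0])   # hypothetical outcomes

# Model P(success) as a logistic function of log task length.
X = np.log(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, ai_succeeded)

# The 50% horizon is where the predicted probability crosses 0.5,
# i.e. where the fitted linear term equals zero.
log_horizon = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated 50% time horizon: ~{np.exp(log_horizon):.0f} minutes")
```

Plotting such a curve for successive model generations, and tracking how the crossover point moves, is what yields the doubling-time trend described above.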
To ensure realistic evaluation, METR conducted experiments using a carefully constructed benchmark called Human-Calibrated Autonomy Software Tasks (HCAST). These tasks span domains such as software engineering, cybersecurity, machine learning engineering, and general reasoning, and each is explicitly calibrated by measuring how long skilled humans take to complete it under the same conditions the AI systems face.
The METR team collected extensive data, involving 140 skilled professionals who spent over 1,500 hours completing these calibrated tasks. The realism of the evaluation is further reinforced by incorporating multi-step decision-making and iterative problem-solving reflective of real-world scenarios. METR acknowledges that additional real-world complexities, such as unclear success criteria or intricate coordination requirements, might further challenge autonomous AI in practical applications.
The increasing ability of AI systems to independently complete longer-duration tasks introduces significant and tangible risks. For example, AI models capable of autonomously handling tasks that take four hours or more, such as debugging intricate cybersecurity vulnerabilities, managing software updates in critical infrastructure, or executing prolonged machine learning training pipelines, pose heightened security risks if misaligned, compromised, or maliciously deployed. As autonomy expands to tasks spanning twelve hours or longer, these risks escalate further, potentially enabling AI to independently plan and execute actions involving strategic coordination or complex decision-making with limited human oversight.
Specific dangers associated with highly agentic AIs include the potential for autonomous execution of sensitive tasks related to national security, critical infrastructure management, or dual-use technology development. For instance, an AI tasked with long-duration cybersecurity operations could independently discover and exploit vulnerabilities without human detection. Similarly, extended autonomy in software engineering could enable AI systems to independently develop software tools or exploits with significant potential for misuse.
The progression towards more autonomous and capable systems, underscored by these findings, emphasizes the urgent need for proactive governance and robust safeguards to prevent misuse and mitigate these escalating risks.