The LSAT, or Law School Admission Test, is a standardized test required for admission to most law schools in America.
OpenAI has tested its AI models on the LSAT. Its top 2022 model, GPT-3.5, scored around the 40th percentile. Its top 2023 model, GPT-4, scored around the 88th percentile.
Now another year has passed, and AI capabilities have blazed even further ahead. Last week, OpenAI introduced a new model called “o1” that gets almost all of the LSAT questions correct. Based on historical LSAT data, the o1 model is likely scoring in the 98th or 99th percentile. This is on par with (human) students at the top law schools in the country.
This leap in LSAT performance is part of a larger wave of ongoing AI breakthroughs. In recent years, AI systems have grown significantly more competent in a wide variety of domains. For example, o1 also made sizable strides in math, physics, and computer programming—so much so that OpenAI is already asking o1 to make AI-authored contributions to the company codebase.
The frenetic speed of AI advancement is maintained by a trifecta of durable forces. First, companies are spending billions of dollars every week to build enormous supercomputer warehouses that will train the next generation of AI systems. Second, researchers are constantly finding more efficient training algorithms for converting computational power into AI capabilities. Third, as o1 demonstrates, engineers are simultaneously discovering techniques that boost performance after the main training phase.
In June, former OpenAI researcher Leopold Aschenbrenner coined the term “unhobbling” to describe this third driver of growth. From his perspective, several eliminable hindrances are preventing AI models from unleashing their excellent “raw capabilities.”
In particular, Aschenbrenner stressed how current AI chatbots must respond to every question with an immediate, top-of-mind answer, whereas humans can solve difficult problems by taking a long time to think. Thus, Aschenbrenner predicted, a critical form of future unhobbling will center on giving chatbots time to “think.”
Based on OpenAI’s research results, Aschenbrenner was right. The o1 model’s performance consistently improved as it received more computing operations (“compute”) from OpenAI engineers. Some compute went into a novel algorithm for teaching o1 to think out loud (“train-time compute”), and some compute simply allowed o1 to think longer in response to questions (“test-time compute”). In both cases, the gains were remarkably reliable.
To date, most AI progress has come from scaling training hardware and improving training software—the first two forces mentioned earlier. But now, o1 shows that the third driver of AI progress, unhobbling, will play an increasingly important role in fueling further breakthroughs.
For policymakers, the most important takeaway is to remain vigilant and proactive. Unhobbled models like o1 bring exciting new AI capabilities, but those capabilities come hand-in-hand with escalating safety hazards. As the relentless pace of AI progress continues, Congress must redouble its efforts to pass AI safety legislation.