Last week, the Center for AI Safety and Scale AI released “Humanity’s Last Exam,” a brutally hard AI benchmark designed by over a thousand experts, most of whom are university professors or senior researchers. The exam is designed to be so challenging that most college students would not even understand the questions being asked, let alone know how to answer them correctly.
For example, here’s a chemistry question from the exam:
“The reaction shown is a thermal pericyclic cascade that converts the starting heptaene into endiandric acid B methyl ester. The cascade involves three steps: two electrocyclizations followed by a cycloaddition. What types of electrocyclizations are involved in step 1 and step 2, and what type of cycloaddition is involved in step 3?”
This is not a question that humans can answer unless they are experts in chemistry. The exam covers chemistry, ecology, pure mathematics, ancient Hebrew, rocket science, Greek mythology, and nearly every organized field of study.
Humanity’s Last Exam was needed because AI models are already getting excellent scores on all of the easier benchmarks in common use – AIs can routinely and correctly answer questions at an ordinary professional level. For example, the somewhat easier Massive Multitask Language Understanding (MMLU) benchmark asks AIs to answer questions drawn from 57 subjects, including professional law, professional medicine, and college-level physics.
AIs are already answering over 75% of these questions correctly, and their scores are rapidly improving. I find this more than a little terrifying, because I can’t answer any of them correctly, and I went to two Ivy League schools.
Why does this matter?
Just a few years ago, it was commonplace to speculate about how many more decades it would take until AIs developed “superhuman” abilities – but if you take the word “superhuman” literally, i.e., as meaning something like “better than most humans,” then superhuman AI has already arrived. AI is already better at most academic subjects than you or I will ever be.
There are some important asterisks to attach to this result: just because an AI can score well in the clean, well-organized context of a formal benchmark doesn’t mean it has the organizational skills to conduct useful independent research. For now, humans might be better at creating a helpful science fair posterboard or designing an original science experiment, even if the AI knows and recalls more isolated scientific facts. Still, it’s not a good idea to assume that these weaknesses will be permanent: AI developers are already rolling out “AI agent” tools that break down these barriers and teach their AIs how to do useful work without close supervision.
The name “Humanity’s Last Exam” might seem melodramatic, but as AI specialist Zvi Mowshowitz points out, this new benchmark is very likely to be the last general knowledge benchmark that is crafted and scored by humans. Consider this: to make the test any harder, we would have to rely on AIs to write the questions for us.
In the 1999 movie The Matrix, a malicious AI named Agent Smith tells a human rebel named Morpheus that the late nineties were the peak of human civilization. Agent Smith clarifies: “I say your civilization, because as soon as we started thinking for you it really became our civilization, which is of course what this is all about. Evolution, Morpheus, evolution. Like the dinosaur. Look out that window. You’ve had your time. The future is our world, Morpheus. The future is our time.”
The movie was science fiction, but 26 years later, a part of it has become reality. We are currently standing on a precipice: We have already written the most challenging AI benchmark our species knows how to write. The only way to make a more challenging benchmark is by letting machines do our thinking for us.
These AI machines are not reliable, transparent, or safe. It is dangerous to let AIs do our thinking for us until we learn more about how to control and verify their behavior. However, if we keep building new data centers to make AI even more brilliant than it is now, then we won’t have any good alternative to letting them think for us.
That’s why Congress needs to start funding research into AI alignment and AI risk mitigation this year. Congress is thinking right now about what priorities should go into the 2025 federal budget – and they need to hear from you that you want them to fund alignment research.
Elected officials need to hear from Americans who want AI alignment and AI risk mitigation federally funded.
Click here to find your representative.
Then, call them and tell them you want federal funding for the National Institute of Standards and Technology (NIST), the US AI Safety Institute (US AISI), and the National Science Foundation (NSF).