The 2,500-Question Exam Designed to Stump Every AI

What happened: A global consortium of nearly 1,000 researchers has published "Humanity's Last Exam" (HLE) — a 2,500-question benchmark spanning mathematics, natural sciences, humanities, and ancient languages — designed to test the outer limits of current AI capability. The paper was published in Nature.

Why it matters: Every question was pre-tested against leading AI models; any question a model could answer was removed. Early scores were correspondingly low, ranging from 2.7% (GPT-4o) to 8% (OpenAI o1). More recent frontier models, including Gemini and Claude, have since climbed to around 40–50% accuracy.
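
For readers who want the filtering idea in concrete terms, here is a minimal Python sketch of "drop any question a model can answer". It is an illustration only, not the HLE team's actual pipeline: the `filter_unsolved` and `accuracy` helpers, the dictionary layout, and the exact-match grading are all assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "model" is any function from question text to an answer string.
Model = Callable[[str], str]


def is_correct(model: Model, item: Dict[str, str]) -> bool:
    """Grade by case-insensitive exact match (an assumption; real grading may differ)."""
    return model(item["question"]).strip().lower() == item["answer"].strip().lower()


def filter_unsolved(items: List[Dict[str, str]], models: List[Model]) -> List[Dict[str, str]]:
    """Keep only the questions that none of the given models answers correctly."""
    return [item for item in items if not any(is_correct(m, item) for m in models)]


def accuracy(items: List[Dict[str, str]], model: Model) -> float:
    """Fraction of questions a single model answers correctly."""
    if not items:
        return 0.0
    return sum(is_correct(model, item) for item in items) / len(items)


if __name__ == "__main__":
    # Toy data and a toy model, invented purely for illustration.
    candidates = [
        {"question": "2+2?", "answer": "4"},
        {"question": "A very hard question", "answer": "42"},
    ]
    toy_model: Model = lambda q: "4" if q == "2+2?" else "unknown"

    kept = filter_unsolved(candidates, [toy_model])
    print(len(kept), "question(s) survive the filter")
    print("accuracy on the full candidate pool:", accuracy(candidates, toy_model))
```

By construction, the models used for filtering score 0% on whatever survives, which is why single-digit early scores from other models are the expected starting point rather than a surprise.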

Wider context: The benchmark was created because existing AI evaluations — including the widely used MMLU — have effectively been saturated by modern models, making meaningful capability measurement difficult. Without reliable benchmarks, policymakers and developers risk misreading what AI systems can actually do.

Background: HLE was conceived by Dan Hendrycks, director of the Center for AI Safety, after a conversation with Elon Musk about the inadequacy of existing benchmarks. Questions were crowdsourced from subject matter experts worldwide, with a $500,000 prize pool incentivising top contributions. A private question set is maintained to prevent models from memorising answers.

Don’t Panic Yet: Humanity’s Last Exam Has Begun — SciTechDaily

Singularity Soup Take: The jump from under 10% to roughly 45% in under a year is the real story. HLE was designed as a long-term yardstick, but its headroom is shrinking far faster than its creators anticipated, which raises the obvious question of what comes after.

Key Takeaways:

Designed to defeat AI: Every HLE question was pre-tested against leading models, and only questions those models failed made the cut, making it one of the most demanding expert-level capability benchmarks yet published.
Models closing the gap fast: Top scores have risen from under 10% to approximately 40–50% for frontier models, raising questions about how long the benchmark will remain a meaningful upper bound.
Scale of collaboration: Nearly 1,000 researchers from across disciplines contributed, with one Texas A&M professor alone authoring 73 questions — the second-highest contribution — covering mathematics and computer science.
Data quality concerns: An independent 2025 investigation by FutureHouse suggested around 30% of chemistry and biology answers may be incorrect; the benchmark team partially confirmed the findings and is pursuing a revision process.