Mathematicians Release New AI Reasoning Benchmark

• Mathematicians compiled 100 research-level questions to test the reasoning capabilities of modern large language models.
• The dataset was created by 49 contributors during a 3-day workshop held in Leipzig, Germany, in 2026.
• Performance testing across three stages reduced the number of unsolved questions from 41 down to 2.

A collective of 49 mathematicians has released a new research-level mathematics dataset designed to test the limits of large language models (LLMs). The dataset was compiled between April 1 and May 15, 2026, with most of the work done during a 3-day workshop titled "Benchmarks in Leipzig," held at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany. The researchers curated 100 questions with known solutions to measure the reasoning capabilities of artificial intelligence.

The study evaluated model performance in three sequential stages. In Stage 1, five state-of-the-art models each made a single attempt, leaving 41 questions unsolved. In Stage 2, the team ran a 20-runs-per-model evaluation with three of those models, which reduced the number of unsolved questions to 16. In the final stage, researchers used two "heavy-thinking" models (AI systems optimized for extensive multi-step reasoning) in a 3-run attempt, leaving only 2 questions unsolved. The authors conclude that these results signal significant advances in the mathematical reasoning proficiency of modern LLMs. The complete work comprises 8 pages of benchmark statistics and a 20-page appendix containing the full 100 questions.
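The staged tally described above can be illustrated with a small sketch. The article does not say exactly how a question is scored as "unsolved," so the code assumes that a question counts as solved once any run of any model answers it correctly, and that later stages only need to clear what earlier stages left behind. The function names, model names, and toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical layout: model name -> question id -> one bool per run.
StageRuns = Dict[str, Dict[str, List[bool]]]


def solved_questions(runs: StageRuns) -> Set[str]:
    """Return the ids solved by at least one run of at least one model."""
    solved: Set[str] = set()
    for per_question in runs.values():
        for qid, outcomes in per_question.items():
            if any(outcomes):
                solved.add(qid)
    return solved


def unsolved_by_stage(question_ids: Iterable[str],
                      stages: List[StageRuns]) -> List[int]:
    """Count the questions still unsolved after each stage (cumulative)."""
    remaining = set(question_ids)
    counts: List[int] = []
    for runs in stages:
        remaining -= solved_questions(runs)
        counts.append(len(remaining))
    return counts


# Toy data only: five made-up questions, not the benchmark's results.
questions = ["q1", "q2", "q3", "q4", "q5"]

# Stage 1 (assumed shape): one attempt per model.
stage1: StageRuns = {
    "model_a": {"q1": [True], "q2": [False], "q3": [False],
                "q4": [False], "q5": [False]},
    "model_b": {"q1": [False], "q2": [True], "q3": [False],
                "q4": [False], "q5": [False]},
}
# Stage 2 (assumed shape): repeated runs on the leftover questions.
stage2: StageRuns = {
    "model_a": {"q3": [False, True], "q4": [False, False],
                "q5": [False, False]},
}
# Stage 3 (assumed shape): a few runs of a heavier model.
stage3: StageRuns = {
    "heavy_model": {"q4": [True, False, False],
                    "q5": [False, False, False]},
}

toy_counts = unsolved_by_stage(questions, [stage1, stage2, stage3])
print(toy_counts)  # [3, 2, 1]
```

Under the same assumption, the reported counts of 41, 16, and 2 unsolved questions out of 100 would correspond to 59, 84, and 98 questions solved by the end of each stage.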
