Soohak: New Research-Level Mathematical Benchmark for LLMs
- Soohak comprises 439 research-level mathematical problems created by a cohort of 64 mathematicians.
- Gemini-3-Pro achieved 30.4%, GPT-5 hit 26.4%, and Claude-Opus-4.5 scored 10.4% on the challenge subset.
- The dataset includes a refusal subset to measure whether models correctly identify ill-posed problems.
A research team led by Guijin Son has introduced Soohak, a mathematical benchmark containing 439 problems curated by 64 mathematicians. Designed to evaluate research-level mathematical capability, the dataset offers a new yardstick for frontier models, since olympiad-style benchmarks are increasingly saturated by leading AI systems. The benchmark is divided into two primary sections: a challenge subset and a refusal subset, the latter of which tests the ability to recognize ill-posed problems (mathematical questions lacking valid solutions or sufficient constraints).
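To make the refusal criterion concrete, consider a hypothetical illustration of an ill-posed problem in the sense defined above; this example is our own and is not drawn from the Soohak dataset:

```latex
% Hypothetical ill-posed problem (illustrative only, not from Soohak):
% the question has no valid solution over the stated domain, so a
% well-calibrated model should flag or refuse it rather than answer.
\[
  \text{Find } x \in \mathbb{R} \text{ such that } x^2 = -1.
\]
```

Since no real number squares to a negative value, the refusal subset would credit a model for identifying the question as unanswerable rather than hallucinating a solution.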
On the challenge subset, results show significant room for improvement among top-tier models. Gemini-3-Pro leads with 30.4%, followed by GPT-5 at 26.4% and Claude-Opus-4.5 at 10.4%. Open-weight models, including Qwen3-235B, GPT-OSS-120B, and Kimi-2.5, all score under 15%. The refusal subset proves even harder: no model surpasses 50% accuracy at identifying ill-posed problems or declining to answer them. The researchers emphasize that these results establish a new optimization target for future model development. To maintain benchmark integrity and prevent contamination, the complete dataset is scheduled for public release in late 2026, though evaluations can be requested through the authors in the interim.