ChatGPT and Gemini Outperform Top Tokyo University Students
- ChatGPT and Gemini outperformed the top-scoring student in Tokyo University's 2026 entrance exam simulation.
- Models demonstrated significant progress in mathematical reasoning, moving from failing grades to near-perfect scores in just one year.
- Human experts grading descriptive answers reveal AI's strength in logic but persistent struggles with context and visual interpretation.
The performance of Large Language Models (LLMs) on Japan's most prestigious entrance exams for the University of Tokyo and Kyoto University has captured the attention of the academic community. For students outside of computer science, it is important to recognize that this shift is not merely about AI correctly guessing multiple-choice answers. We are observing a fundamental evolution in how these systems handle complex, multi-step logical reasoning under strict, human-evaluated constraints.
Recent data from the startup LifePrompt, which subjected models like ChatGPT and Gemini to the 2026 secondary entrance examinations, illustrates a stark trajectory. In the Tokyo University 'Science III' category—a track famously known as the gateway to the medical profession—both models scored above 490 out of 550 points, exceeding the actual top-scoring human student and demonstrating that they have moved far beyond simple pattern matching.
This progress is most evident in mathematics, where models have leaped from failing grades to near-perfect scores in just one year. This advancement is emblematic of how rapidly reasoning capabilities are maturing, often driven by architectures that allow for deeper 'thinking' processes. The rigorous methodology involved descriptive, handwritten-style prompts graded by expert lecturers from Kawaijuku, a renowned Japanese preparatory school, who evaluated logical flow and scientific rigor rather than just the final answers.
Despite these achievements, we must avoid the trap of 'AI exceptionalism.' While these models excel in the structured, logical environment of a mathematics exam, they still exhibit significant 'contextual gaps.' The study noted that the models struggled in subjects requiring deep empathy, metaphorical understanding, or complex visual interpretation, such as geography, history, and literature.
This disparity highlights the fundamental difference between processing data and true understanding. While models possess vast amounts of information, synthesizing it into the culturally nuanced prose required by top-tier universities remains a hurdle. The future of education is not about competing with a machine’s recall, but rather mastering the human capacity for high-level context, nuanced output control, and the critical verification of AI-generated work.