Multi-LCB Expands LLM Coding Benchmarks to Twelve Languages

• Multi-LCB extends LiveCodeBench to evaluate code generation across twelve programming languages.
• The authors evaluated 24 LLMs with the framework, revealing significant performance disparities and Python overfitting.
• Tasks use a standard input-output (stdin/stdout) format so that evaluation protocols stay consistent across programming environments.

Researchers introduced Multi-LCB, a benchmark that evaluates the code-generation performance of large language models (LLMs) across twelve programming languages. The project addresses the primary limitation of the original LiveCodeBench (LCB) framework, which has been restricted to Python-only evaluation. Multi-LCB transforms existing Python tasks from the LCB dataset into equivalent problems for the other languages while maintaining strict contamination controls and consistent evaluation protocols. Because the system is fully compatible with the original LCB format, it can automatically track future updates to the benchmark.

The evaluation methodology converts each task into an input-output format in which a model's solution must read from standard input (stdin) and write to standard output (stdout). The researchers developed language-specific evaluation scripts for each of the twelve supported languages so that performance is assessed consistently across programming environments. This setup supports testing models both on single-turn coding tasks and in complex agentic scenarios (AI systems that can execute multi-step tasks).
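
The stdin/stdout protocol described above can be illustrated with a minimal sketch. This is not the Multi-LCB evaluation script: the function names (`run_case`, `passes_all`), the timeout, and the whitespace-insensitive output comparison are all assumptions chosen for illustration. The idea is only that a harness working purely through stdin and stdout can judge a solution in any language, as long as it is given the command that runs that solution.

```python
import subprocess
import sys


def run_case(cmd, stdin_text, expected_stdout, timeout_s=5.0):
    """Feed stdin_text to the command and compare its stdout to the expected text.

    Hypothetical helper: comparing whitespace-separated tokens is an assumed
    policy, not something the source specifies.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a timeout as a failed case
    if proc.returncode != 0:
        return False  # treat a crash or non-zero exit as a failed case
    return proc.stdout.split() == expected_stdout.split()


def passes_all(cmd, cases):
    """A solution passes only if every (stdin, expected stdout) pair matches."""
    return all(run_case(cmd, given, expected) for given, expected in cases)


# Toy task (invented for this example): read two integers, print their sum.
solution = "a, b = map(int, input().split()); print(a + b)"
cmd = [sys.executable, "-c", solution]
cases = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
print(passes_all(cmd, cases))
```

Under this sketch, only `cmd` changes from one language to another (for example, a compile step followed by running the resulting binary), while the test cases stay the same. A shared input-output format of this kind is what makes the per-language results comparable.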

The team tested 24 LLMs with the new framework and identified significant performance gaps. The findings revealed evidence of Python overfitting, where models perform disproportionately well on Python tasks compared with other languages, along with instances of language-specific contamination. These results highlight substantial disparities in multilingual coding competence and suggest that many current models struggle to maintain their reasoning and instruction-following performance when they have to switch away from Python. The benchmark gives developers a rigorous tool for assessing cross-language capabilities and identifying weaknesses in model training strategies.
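
The article does not say how the authors quantified Python overfitting. One simple way to express the idea, offered here only as an assumption, is the difference between a model's Python pass rate and its mean pass rate on the other languages. The function name `python_gap` and the numbers below are invented placeholders, not results from the benchmark.

```python
from statistics import mean


def python_gap(pass_rates):
    """Python pass rate minus the mean pass rate over all other languages.

    Hypothetical metric for illustration; a large positive value would suggest
    the model is disproportionately strong on Python.
    """
    others = [rate for lang, rate in pass_rates.items() if lang != "python"]
    return pass_rates["python"] - mean(others)


# Placeholder pass rates, NOT measured values from Multi-LCB.
example = {"python": 0.60, "java": 0.50, "rust": 0.40}
gap = python_gap(example)
print(f"Python gap: {gap:.2f}")
```

A gap near zero would indicate balanced performance across languages under this measure. The choice of an unweighted mean over the non-Python languages is itself a design assumption, and other aggregates could be substituted.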
