Research 2026-06-19

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

arXiv:2606.20517v1. Abstract: LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by...

A Necessary Expansion: Why Multi-LCB Matters for Code LLM Evaluation

The research community has long relied on LiveCodeBench (LCB) as a gold-standard benchmark for measuring LLM code generation capabilities. Its strength lies in using fresh competitive programming problems to mitigate data contamination, a persistent issue in which models reproduce memorized training data rather than demonstrating genuine reasoning. However, LCB has been almost exclusively Python-focused, with problem statements written in English. The new preprint introducing Multi-LCB addresses the programming-language side of this limitation head-on by extending the benchmark to multiple programming languages, including Java, C++, and JavaScript.

What Happened

The authors systematically ported the LCB problem set, originally designed for Python, into several other widely used languages. This is not a trivial translation task: competitive programming problems often rely on language-specific idioms, standard library functions, and performance characteristics. The team had to ensure that problem semantics remained identical across languages while accounting for syntactic and structural differences. The result is a parallel benchmark that allows apples-to-apples comparisons of a model’s ability to generate correct, efficient code in Python, Java, C++, and JavaScript using the same underlying problem logic.
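One way to picture "identical problem semantics across languages" is a checker that never looks at the source language at all. The sketch below is illustrative only and is not the Multi-LCB harness, which this summary does not describe. It assumes stdin/stdout-style problems, and the names `TEST_CASES`, `passes_all`, and `PY_SOLUTION`, as well as the toy "sum the numbers" problem, are made up for the example.

```python
import subprocess
import sys

# Hypothetical test cases as (stdin text, expected stdout).
# The same list is reused for every language.
TEST_CASES = [("3\n1 2 3\n", "6"), ("2\n10 -4\n", "6")]


def passes_all(command, test_cases, timeout_s=5.0):
    """Return True if the program started by `command` prints the expected
    output for every test case.

    `command` is an argv list, for example ["java", "Main"] or
    ["./solution"] for an already compiled C++ binary (assumed names).
    """
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                command,
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a time limit violation as a failure
        if result.returncode != 0:
            return False  # runtime error or crash
        # Compare whitespace-separated tokens so trailing newlines do not matter.
        if result.stdout.split() != expected.split():
            return False
    return True


# A Python solution to the toy problem, passed to the interpreter via -c.
PY_SOLUTION = (
    "import sys; data = sys.stdin.read().split(); "
    "print(sum(map(int, data[1:])))"
)
```

Calling `passes_all([sys.executable, "-c", PY_SOLUTION], TEST_CASES)` checks the Python solution. Under the same assumptions, a Java or C++ submission would be checked with the same call after compilation, changing only the command.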

Why It Matters

This development addresses three critical gaps in current LLM evaluation practices. First, it exposes the hidden bias in existing benchmarks. A model that scores highly on Python-only LCB may falter when asked to produce equivalent Java code—revealing that its “coding ability” is actually a narrow proficiency in Python syntax and library knowledge. Second, it provides a more realistic assessment for production environments. Most real-world software systems are polyglot; a backend might use Java, a frontend JavaScript, and data pipelines Python. Evaluating a model on a single language gives an incomplete picture of its utility. Third, Multi-LCB enables researchers to study cross-language transfer learning—does improving a model’s Python code generation automatically improve its C++ performance? Early results suggest the answer is nuanced, with some models showing strong transfer and others exhibiting surprising degradation.
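Because the benchmark is parallel, the same problem can be compared across two languages directly. The following is a minimal sketch of such a paired breakdown, assuming results are available as a mapping from (problem ID, language) to pass/fail. The function name `paired_outcomes`, the language labels, and the data layout are assumptions made for illustration, not the paper's analysis code.

```python
from collections import Counter


def paired_outcomes(results, base="python", target="cpp"):
    """Count per-problem outcomes for two languages.

    `results` maps (problem_id, language) -> bool, where the bool records
    whether the model's solution passed (hypothetical layout).
    """
    counts = Counter()
    problem_ids = {pid for pid, _ in results}
    for pid in sorted(problem_ids):
        if (pid, base) not in results or (pid, target) not in results:
            continue  # only compare problems attempted in both languages
        solved_base = results[(pid, base)]
        solved_target = results[(pid, target)]
        if solved_base and solved_target:
            counts["both"] += 1
        elif solved_base:
            counts["base_only"] += 1
        elif solved_target:
            counts["target_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Made-up outcomes for illustration; these are not results from the paper.
example = {
    ("p1", "python"): True, ("p1", "cpp"): True,
    ("p2", "python"): True, ("p2", "cpp"): False,
    ("p3", "python"): False, ("p3", "cpp"): False,
    ("p4", "python"): True,  # no C++ attempt, so it is skipped
}
outcome_counts = paired_outcomes(example)
```

A large `base_only` count relative to `both` would be one sign of the narrow, Python-specific proficiency described above.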

Implications for AI Practitioners

For teams deploying code LLMs, Multi-LCB offers a practical tool for vendor selection and model benchmarking. If your stack is Java-heavy, a model that excels on Python-only benchmarks may underperform in your actual workflow. Running Multi-LCB evaluations can reveal which models truly generalize across languages versus those that are effectively Python specialists. Additionally, the benchmark provides a standardized way to track improvements as models are fine-tuned on multilingual code data.
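As a rough sketch of how a team might use such results for model selection, the code below computes per-language pass rates and ranks models by the language that matters for a given stack. The record layout, the function names `pass_rates` and `rank_models`, and the model names are hypothetical, and the numbers are invented purely to show the mechanics.

```python
from collections import defaultdict


def pass_rates(records):
    """Return {model: {language: fraction of attempts that passed}}.

    `records` is an iterable of (model, language, passed) tuples
    (an assumed layout, not a Multi-LCB output format).
    """
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, language, passed in records:
        cell = totals[model][language]
        cell[0] += int(passed)  # number of passing attempts
        cell[1] += 1            # number of attempts
    return {
        model: {lang: p / n for lang, (p, n) in langs.items()}
        for model, langs in totals.items()
    }


def rank_models(rates, language):
    """Rank models by pass rate in `language`, best first."""
    candidates = [m for m in rates if language in rates[m]]
    return sorted(candidates, key=lambda m: rates[m][language], reverse=True)


# Invented records for two placeholder models; not real benchmark results.
records = [
    ("model_a", "python", True), ("model_a", "python", True),
    ("model_a", "java", True), ("model_a", "java", False),
    ("model_b", "python", True), ("model_b", "python", False),
    ("model_b", "java", True), ("model_b", "java", True),
]
rates = pass_rates(records)
java_ranking = rank_models(rates, "java")
python_ranking = rank_models(rates, "python")
```

In this toy data the ranking flips between Python and Java, which is the situation a Java-heavy team would want to detect before choosing a model.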

The work also highlights a growing recognition that code generation benchmarks must evolve beyond their original, often narrow, scopes. As LLMs are increasingly used for multi-language development, benchmarks must follow suit—or risk measuring the wrong thing.

Key Takeaways

Multi-LCB extends the popular LiveCodeBench to Java, C++, and JavaScript, enabling cross-language evaluation of code LLMs.
It reveals that high performance on Python-only benchmarks does not guarantee equivalent capability in other languages, exposing hidden model biases.
Practitioners should use Multi-LCB to select models based on their actual language stack rather than on Python-only scores.
The benchmark facilitates research into cross-language transfer learning, a critical area for improving general-purpose code generation.

