Research 2026-07-03

Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

Originally published by Arxiv CS.AI

arXiv:2604.04532v2. Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically...

The Hidden Bias of Language in AI Evaluation

A new paper on arXiv points to a blind spot in how we evaluate AI agents: the language of the evaluation itself can change which models appear superior. By localizing the "Agent-as-a-Judge" prompt stack into five typologically diverse languages, the researchers show that switching the judge from English to another language can invert the ranking of backbone models.

The study translates the evaluation prompts (the instructions given to an LLM acting as a judge of other AI agents) into languages with different syntactic structures, morphological complexity, and cultural contexts. The results show that evaluation language is not a neutral conduit but an active variable that reshapes performance metrics: a backbone that ranks first under an English judge prompt may rank lower when the same prompt is localized into another language.

Why This Matters

This finding strikes at the foundation of current AI benchmarking practices. The industry has largely treated English as a universal evaluation standard, assuming that prompt translation is a trivial exercise. The paper disproves this assumption empirically: when the judge's language changes, the relative strengths and weaknesses of different models shift in ways that cannot be explained by simple translation errors.

For multilingual AI development, this creates a practical problem. If we cannot trust English-only evaluations to predict performance in other languages, then entire categories of AI products, such as customer service chatbots, legal document analyzers, and educational tools, may be optimized for the wrong criteria. A model that excels at English code generation might look far worse when evaluated in Japanese, not because of its coding ability, but because the evaluation framework itself is linguistically biased.

Implications for AI Practitioners

First, evaluation pipelines must become multilingual by default. Teams building agentic systems for global deployment cannot rely on English benchmarks as proxies for cross-lingual capability. The paper suggests that prompt localization should be treated as a first-class engineering concern, not an afterthought.
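One way to treat prompt localization as an engineering concern is to check that every localized judge prompt exposes the same template fields as the English reference, so a translation cannot silently drop an input. The sketch below is a minimal illustration under stated assumptions: the prompt wording, the language codes, and the names JUDGE_PROMPTS, placeholders, and find_placeholder_mismatches are hypothetical and are not taken from the paper's prompt stack.

```python
from string import Formatter

# Hypothetical localized judge prompts; the wording and the language codes
# are illustrative assumptions, not the paper's actual prompt stack.
JUDGE_PROMPTS = {
    "en": "Requirement: {requirement}\nEvidence: {evidence}\nIs the requirement satisfied?",
    "es": "Requisito: {requirement}\nEvidencia: {evidence}\n¿Se cumple el requisito?",
    "de": "Anforderung: {requirement}\nNachweis: {evidence}\nIst die Anforderung erfüllt?",
}


def placeholders(template: str) -> set[str]:
    """Return the set of named format fields used in a template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def find_placeholder_mismatches(
    prompts: dict[str, str], reference: str = "en"
) -> dict[str, set[str]]:
    """Map each language to the fields that differ from the reference template."""
    expected = placeholders(prompts[reference])
    return {
        lang: expected ^ placeholders(text)  # symmetric difference: missing or extra
        for lang, text in prompts.items()
        if placeholders(text) != expected
    }


if __name__ == "__main__":
    # An empty dict means every localized prompt matches the reference fields.
    print(find_placeholder_mismatches(JUDGE_PROMPTS))
```

A check like this only guards the template structure; it says nothing about translation quality, which would still need review by a fluent speaker.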

Second, backbone model selection requires language-aware testing. The inversion of rankings means that choosing a model based on English benchmarks could lead to suboptimal performance in target languages. Practitioners should run parallel evaluations in each deployment language before committing to a backbone architecture.
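The parallel evaluation described above can be sketched as a comparison of per-language rankings that flags backbone pairs whose relative order flips. This is only an illustration: the scores are made-up numbers, "xx" stands in for any non-English judge language, and the names SCORES, ranking, and inverted_pairs are hypothetical rather than anything defined in the paper.

```python
from itertools import combinations

# Hypothetical requirement-satisfaction rates per judge language and backbone;
# the numbers are invented for illustration and are NOT results from the paper.
SCORES = {
    "en": {"backbone_a": 0.62, "backbone_b": 0.55, "backbone_c": 0.40},
    "xx": {"backbone_a": 0.48, "backbone_b": 0.57, "backbone_c": 0.41},
}


def ranking(scores: dict[str, float]) -> list[str]:
    """Backbones ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def inverted_pairs(
    ref: dict[str, float], other: dict[str, float]
) -> list[tuple[str, str]]:
    """Backbone pairs whose relative order flips between two judge languages."""
    flips = []
    for a, b in combinations(sorted(ref), 2):
        # A negative product means the sign of the score gap changed.
        if (ref[a] - ref[b]) * (other[a] - other[b]) < 0:
            flips.append((a, b))
    return flips


if __name__ == "__main__":
    for lang, scores in SCORES.items():
        print(lang, ranking(scores))
    print("inverted:", inverted_pairs(SCORES["en"], SCORES["xx"]))
```

In practice a flip between two close scores may be noise, so a real pipeline would likely repeat runs and compare confidence intervals before treating an inversion as meaningful.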

Third, the "Agent-as-a-Judge" paradigm needs linguistic calibration. If judges themselves are language-sensitive, then evaluation results may reflect the judge's linguistic biases more than the agent's actual capabilities. This calls for multi-judge, multi-language evaluation frameworks that can disentangle language effects from task performance.
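One simple way to start separating language effects from backbone effects is a two-way additive decomposition of a score table: a grand mean, a per-language offset, a per-backbone offset, and a residual interaction term. This is a standard statistical technique swapped in here for illustration, not the analysis used in the paper; the function name decompose and the input layout (language, then backbone, then score) are assumptions, and the sketch covers a single judge only.

```python
from statistics import mean


def decompose(scores: dict[str, dict[str, float]]):
    """Split scores[lang][backbone] into grand mean, main effects, and interaction.

    Assumes every language has a score for the same set of backbones.
    """
    langs = list(scores)
    backbones = list(next(iter(scores.values())))
    grand = mean(scores[l][b] for l in langs for b in backbones)
    # Main effect of each judge language: its row mean minus the grand mean.
    lang_eff = {l: mean(scores[l][b] for b in backbones) - grand for l in langs}
    # Main effect of each backbone: its column mean minus the grand mean.
    backbone_eff = {b: mean(scores[l][b] for l in langs) - grand for b in backbones}
    # Interaction: what remains after removing both main effects.
    interaction = {
        (l, b): scores[l][b] - grand - lang_eff[l] - backbone_eff[b]
        for l in langs
        for b in backbones
    }
    return grand, lang_eff, backbone_eff, interaction


if __name__ == "__main__":
    # Invented numbers: the two backbones swap places between languages.
    table = {"en": {"a": 0.6, "b": 0.4}, "xx": {"a": 0.4, "b": 0.6}}
    grand, lang_eff, backbone_eff, interaction = decompose(table)
    print(grand, lang_eff, backbone_eff, interaction)
```

Large interaction terms indicate that the judge language changes which backbone looks better, which is the pattern a ranking inversion produces; near-zero interaction means language only shifts all scores up or down together.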

Key Takeaways

Evaluation language is not a neutral variable; changing the judge's language can invert model rankings entirely
English-only benchmarks are insufficient for predicting multilingual AI agent performance
Practitioners must run parallel evaluations in each target deployment language before selecting backbone models
The "Agent-as-a-Judge" approach requires linguistic calibration to avoid conflating language bias with capability assessment

