Research 2026-06-30

Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models

Originally published by arXiv cs.AI

arXiv:2601.05366v2. Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the...

The Hidden Language Gap in AI Tool Use

A new preprint on arXiv (2601.05366v2) points to a critical blind spot in how we evaluate large language models' ability to call external tools. LLMs appear proficient at function calling in English, but their performance degrades significantly when they operate in other languages. The research systematically tests models across multiple languages and finds that tool-calling accuracy drops sharply, sometimes by 20-30 percentage points, when the input or output language shifts away from English.

This is not merely a translation problem. The models struggle with structural elements like parameter names, argument formats, and API descriptions when they appear in non-English contexts. Even bilingual models that perform well on general language tasks fail to maintain their tool-calling competence across languages. The paper identifies a “lost in execution” phenomenon where the reasoning chain for tool selection and parameter mapping breaks down under multilingual conditions.
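To make the idea of a structural failure concrete, here is a minimal sketch of how a harness might check a model's tool call against a function schema. Everything in it is hypothetical and chosen for illustration: the `get_weather` tool, its `city` and `unit` parameters, and the error labels are assumptions, not the paper's taxonomy or any specific vendor's API. The "drifted" call shows one way a failure of this kind could look, with a parameter name and an enum value rendered in the user's language.

```python
from typing import Any

# Hypothetical tool schema (an assumption for this sketch, not taken from the paper).
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": str, "required": True},
        "unit": {"type": str, "required": False, "enum": ["celsius", "fahrenheit"]},
    },
}


def check_tool_call(call: dict[str, Any], tool: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means the call is well-formed."""
    problems: list[str] = []
    if call.get("name") != tool["name"]:
        problems.append(f"unknown function: {call.get('name')!r}")
        return problems

    spec = tool["parameters"]
    args = call.get("arguments", {})

    # Parameter names the schema does not define (e.g. translated names).
    for key in args:
        if key not in spec:
            problems.append(f"unexpected parameter: {key!r}")

    # Required parameters, types, and enum constraints.
    for key, rule in spec.items():
        if key not in args:
            if rule["required"]:
                problems.append(f"missing parameter: {key!r}")
            continue
        value = args[key]
        if not isinstance(value, rule["type"]):
            problems.append(f"wrong type for {key!r}")
        elif "enum" in rule and value not in rule["enum"]:
            problems.append(f"invalid value for {key!r}: {value!r}")
    return problems


english_call = {"name": "get_weather", "arguments": {"city": "Madrid", "unit": "celsius"}}
# Illustrative failure: parameter name and enum value translated into Spanish.
drifted_call = {"name": "get_weather", "arguments": {"ciudad": "Madrid", "unit": "centígrados"}}

print(check_tool_call(english_call, WEATHER_TOOL))
print(check_tool_call(drifted_call, WEATHER_TOOL))
```

Note that the drifted call would read as perfectly sensible to a human; it fails only because the executing system expects the exact identifiers in the schema.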

Why This Matters Now

The timing of this finding is significant. Enterprises are rapidly deploying LLM agents for customer service, internal operations, and data retrieval across global markets. A Chinese bank using an English-optimized model for financial tool calls, or a European healthcare system relying on German-language API interactions, could face systematic failures that remain invisible under standard English benchmarks.

The research also challenges the assumption that multilingual capability in general conversation transfers to structured tasks. Tool calling requires precise handling of function signatures and type constraints, which places different demands on a model than casual dialogue does. Current evaluation suites, which overwhelmingly test in English, create a false sense of robustness.
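A small sketch can show how an English-heavy test set might hide a per-language gap. The result tuples below are made up purely for illustration and are not figures from the paper; the function names are likewise assumptions of this sketch.

```python
from collections import defaultdict


def accuracy_by_language(results):
    """Compute accuracy per language from (language_code, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, ok in results:
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def aggregate_accuracy(results):
    """Compute a single pooled accuracy over all languages."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Made-up outcomes: 8 English cases (all correct), 2 German cases (1 correct).
results = [("en", True)] * 8 + [("de", True), ("de", False)]

print(aggregate_accuracy(results))    # pooled score over all 10 cases
print(accuracy_by_language(results))  # per-language breakdown
```

In this toy data the pooled score is 0.9 while German alone sits at 0.5, which is why a per-language breakdown is more informative than a single aggregate number.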

Implications for AI Practitioners

For teams building multilingual agent systems, this research demands several practical responses:

Language-specific stress testing: Standard tool-calling benchmarks must expand to include non-English scenarios with realistic API structures, not just translated prompts.

Prompt engineering adjustments: The paper suggests that explicit language markers and structured formatting may help, but these are partial fixes rather than solutions.

Model selection criteria: When choosing an LLM for tool use, practitioners should demand language-specific performance data rather than relying on aggregate multilingual scores.

Fallback architectures: Systems operating in multilingual environments should implement language-aware routing, for example by using English as an intermediate representation for tool calls while handling user-facing text in local languages.
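The language-aware routing idea from the last point could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a method from the paper: `handle_request`, `ToolCall`, and the three injected callables (`translate`, `plan_tool_call`, `execute`) are hypothetical names, and the toy stand-ins exist only so the sketch runs end to end. A real system would plug in a machine-translation service and an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str
    arguments: dict


def handle_request(
    user_text: str,
    user_lang: str,
    translate: Callable[[str, str, str], str],  # (text, source_lang, target_lang) -> text
    plan_tool_call: Callable[[str], ToolCall],  # pivot-language text -> tool call
    execute: Callable[[ToolCall], str],         # tool call -> pivot-language result
    pivot_lang: str = "en",
) -> str:
    """Plan and run the tool call in the pivot language, reply in the user's language."""
    if user_lang == pivot_lang:
        pivot_text = user_text
    else:
        pivot_text = translate(user_text, user_lang, pivot_lang)

    call = plan_tool_call(pivot_text)
    result = execute(call)

    if user_lang == pivot_lang:
        return result
    return translate(result, pivot_lang, user_lang)


# Toy stand-ins (assumptions for this sketch only).
PHRASES = {
    ("de", "en", "Wetter in Berlin?"): "Weather in Berlin?",
    ("en", "de", "Sunny in Berlin"): "Sonnig in Berlin",
}


def toy_translate(text: str, src: str, dst: str) -> str:
    return PHRASES[(src, dst, text)]


def toy_plan(english_text: str) -> ToolCall:
    city = english_text[len("Weather in "):].rstrip("?")
    return ToolCall("get_weather", {"city": city})


def toy_execute(call: ToolCall) -> str:
    return f"Sunny in {call.arguments['city']}"


print(handle_request("Wetter in Berlin?", "de", toy_translate, toy_plan, toy_execute))
```

The design keeps tool selection and argument construction in one language, at the cost of two extra translation steps. Those steps can introduce their own errors, for instance by altering names or values that should be passed to the tool verbatim, so a pivot design like this is a trade-off rather than a guaranteed fix.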

The broader lesson is that LLM capabilities are not uniform across tasks or languages. As agents become more autonomous, the gap between benchmark performance and real-world reliability will widen unless evaluation practices evolve. This research serves as a necessary corrective to overconfident deployment of “multilingual” models in production tool-calling pipelines.

Key Takeaways

Tool-calling accuracy in LLMs degrades substantially in non-English contexts, even for models that perform well on general multilingual tasks.
Standard English-centric evaluations create a false sense of robustness that does not transfer to real-world multilingual deployments.
Practitioners must implement language-specific stress testing and consider fallback architectures rather than relying on aggregate performance metrics.
The gap between conversational multilingual ability and structured tool-calling competence represents a distinct failure mode that current benchmarks fail to capture.
