Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions
arXiv:2606.30790v1 Announce Type: cross Abstract: Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs) perform strongly...
The Unseen Language Barrier: Why Romanized Code-Mixing Demands New AI Benchmarks
The release of the Indi-RomCoM benchmark on arXiv (cs.AI) highlights a critical blind spot in current LLM evaluation: the pervasive, yet largely untested, phenomenon of Romanized Code-Mixing (RCM). The benchmark targets scenarios where speakers of Indian languages (Hindi, Tamil, Bengali, and others) write in Roman script while blending their native language with English. The research reports that while LLMs perform strongly on monolingual or pure English tasks, their accuracy and coherence degrade significantly on this hybrid, informal register.
Why This Matters Beyond Linguistics
At first glance, this might seem like a niche linguistic curiosity. In reality, RCM is the default mode of digital communication for hundreds of millions of users across South Asia and the global diaspora. From WhatsApp chats to customer support tickets, social media comments to e-commerce reviews, the data that businesses and platforms must process is overwhelmingly written in this code-mixed form. The Indi-RomCoM benchmark exposes a fundamental mismatch: LLMs are being trained and optimized on clean, script-consistent data, yet deployed into environments where the input is messy, hybrid, and orthographically inconsistent.
The implications are concrete. A model that cannot reliably parse "Mujhe kal meeting mein aana hai, but I'm not sure about the time" (a typical RCM sentence) will fail at tasks like sentiment analysis for a brand, intent classification for a chatbot, or accurate translation for a customer service system. This is not a theoretical edge case; it is the mainstream user experience for a significant portion of the global internet.
Implications for AI Practitioners
For developers and product teams, this benchmark serves as a practical wake-up call. First, it underscores the need for domain-specific evaluation. Relying solely on standard English or even pure Hindi benchmarks will give a dangerously inflated view of model performance. Practitioners targeting Indian markets must incorporate RCM-specific test sets into their evaluation pipelines.
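One way to make the evaluation point concrete is to report accuracy per input register rather than as a single aggregate number. The sketch below is a minimal illustration, not the Indi-RomCoM methodology: the `Example` type, the slice labels, the exact-match scoring, and the `predict` callable (a stand-in for whatever model call a team uses) are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Example(NamedTuple):
    slice_name: str  # hypothetical label, e.g. "english", "hindi_native", "hindi_rcm"
    prompt: str
    expected: str


def accuracy_by_slice(
    examples: Iterable[Example],
    predict: Callable[[str], str],
) -> Dict[str, float]:
    """Exact-match accuracy (case- and whitespace-insensitive) per slice."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        total[ex.slice_name] += 1
        if predict(ex.prompt).strip().lower() == ex.expected.strip().lower():
            correct[ex.slice_name] += 1
    return {name: correct[name] / total[name] for name in total}


def gap_to_baseline(acc: Dict[str, float], baseline: str) -> Dict[str, float]:
    """How far each non-baseline slice falls below the baseline slice."""
    return {
        name: acc[baseline] - value
        for name, value in acc.items()
        if name != baseline
    }
```

Calling `gap_to_baseline(accuracy_by_slice(examples, predict), "english")` then shows how much each non-English slice trails the English one, which is the number an aggregate score hides. Exact match is only a placeholder metric; instruction-following tasks usually need a task-appropriate scorer.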
Second, it highlights the importance of data curation and fine-tuning strategy. The benchmark suggests that generic instruction-tuning on English data is insufficient. Teams may need to collect or synthetically generate RCM datasets, for example by transliterating existing code-mixed speech data or by prompting models to produce RCM text that can then be used as training data. Continued pre-training on social media or forum text could also help bridge the gap.
Finally, this research points to a broader architectural consideration: tokenization efficiency. Romanized code-mixing often produces longer token sequences, because a word written in Roman script (e.g., "kya") may be split into several subword tokens by a tokenizer trained mostly on English text. This increases computational cost and leaves less of the context window for the task itself, which can degrade reasoning performance. Practitioners should evaluate whether their chosen tokenizer handles such transliterated forms efficiently, or whether a custom tokenizer is warranted.
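A tokenizer audit of this kind can start with a simple tokens-per-word ratio computed on English and RCM samples. The sketch below is illustrative only: `tokens_per_word` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a deliberately crude stand-in (it cuts each word into chunks of up to three characters) so that the example runs without any external library. In practice, the tokenize function of the model under test would be passed in instead.

```python
from typing import Callable, List, Sequence


def tokens_per_word(
    texts: Sequence[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(text.split()) for text in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    n_tokens = sum(len(tokenize(text)) for text in texts)
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (an assumption): chunks of up to 3 characters per word."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


# Compare the ratio on an English sample and an RCM sample.
english_ratio = tokens_per_word(["I am not sure about the time"], toy_tokenize)
rcm_ratio = tokens_per_word(["Mujhe kal meeting mein aana hai"], toy_tokenize)
```

With a real tokenizer, an RCM ratio noticeably above the English ratio on representative samples is a sign that transliterated input will cost more tokens per request. The toy tokenizer's numbers carry no meaning beyond demonstrating the calculation.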
Key Takeaways
- Standard benchmarks are misleading: LLMs that excel on English or pure native-language tasks often fail on Romanized Code-Mixing, which is the dominant communication mode for hundreds of millions of users.
- Real-world performance requires RCM-specific testing: AI practitioners targeting multilingual markets, especially in India, must integrate benchmarks like Indi-RomCoM into their evaluation to avoid deploying models that underperform on actual user input.
- Data strategy must adapt: Effective handling of RCM likely requires targeted fine-tuning on transliterated, code-mixed datasets, not just more English or script-native data.
- Tokenization efficiency is a hidden cost: Romanized transliteration inflates token counts, increasing latency and cost; teams should audit tokenizer performance on representative RCM samples.