Show HN: Evaluating Local LLMs as language translators for my app
This is my first attempt at running an eval of this nature so would love some methodology feedback. I can't guarantee the sources weren't already in the model's inputs without getting novel translations from native speakers, but from my experience using the top models, they feel very...
The Pragmatic Shift: Evaluating Local LLMs for Translation
The Hacker News post describes a practitioner’s first attempt at systematically evaluating local large language models (LLMs) for a translation use case. The author acknowledges a critical methodological limitation—potential data contamination—and seeks community feedback on their evaluation approach. This is not a breakthrough announcement but a grounded, hands-on exploration of whether locally-run models can replace cloud-based translation APIs in a production app.
Why This Matters
This post reflects a broader industry trend: the move from “Can LLMs translate?” to “Which LLM should I run locally for translation, and how do I measure that reliably?” For years, translation was dominated by specialized models (e.g., MarianMT, NLLB) or cloud APIs (Google Translate, DeepL). The rise of general-purpose LLMs like Llama, Mistral, and Qwen has blurred these boundaries. Running translation locally offers clear benefits: no per-request API fees, no network round trip, and no user text leaving the device. But it also introduces new challenges: model size versus quality trade-offs, prompt engineering for consistent output, and the risk of “hallucinated” translations that sound fluent but do not match the meaning of the source.
The author’s honest admission about potential training data overlap is crucial. Many open-source LLMs have been trained on multilingual web corpora that may include parallel texts (e.g., Common Crawl, or mined parallel corpora such as ParaCrawl). If a model has already “seen” the test sentences during training, evaluation scores become inflated and misleading. This is a known problem in NLP evaluation, but it gets less attention in discussions of local LLM deployment. The practitioner’s request for methodology feedback signals that the community is maturing, moving from hype to rigorous, reproducible testing.
Implications for AI Practitioners
First, evaluation methodology is now a product differentiator. For anyone building an app that relies on LLM output—translation, summarization, code generation—the ability to design a clean, contamination-aware eval is as important as the model choice. Practitioners should adopt held-out test sets from sources unlikely to be in training data (e.g., user-generated content, domain-specific corpora, or freshly commissioned translations from native speakers).
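One cheap, imperfect way to approximate a held-out set is to keep only test sentences first published after the model's training cutoff. The sketch below assumes every sample carries a publication date. The `Sample` type, the `held_out` helper, the example sentences, and the cutoff date are all hypothetical. A date filter lowers contamination risk but cannot rule it out, since newer text may quote older material.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    source: str       # text to translate
    reference: str    # human reference translation
    published: date   # when the source text first appeared publicly


def held_out(samples, training_cutoff):
    """Keep only samples published strictly after the assumed training cutoff."""
    return [s for s in samples if s.published > training_cutoff]


# Hypothetical data; a real test set would come from your own app or domain.
samples = [
    Sample("Good morning", "Buenos días", date(2019, 5, 1)),
    Sample("The update ships on Friday",
           "La actualización sale el viernes", date(2025, 3, 10)),
]

cutoff = date(2024, 6, 1)  # assumption: take the real value from the model card
fresh = held_out(samples, cutoff)
print(f"{len(fresh)} of {len(samples)} samples postdate the cutoff")
```

Freshly commissioned translations remain the stronger option; the date filter is only a fallback when those are not available.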
Second, local LLMs are not drop-in replacements for specialized translation models. While a 7B-parameter model can produce passable translations for common language pairs (e.g., English-Spanish), it may struggle with low-resource languages, idiomatic expressions, or domain-specific terminology (legal, medical). The author’s experience suggests that top models “feel” good, but feeling is not a metric. Practitioners must define clear quality criteria: automatic scores such as BLEU or COMET computed against reference translations, human evaluation of adequacy and fluency, and latency benchmarks for real-time use.
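To make "feeling is not a metric" concrete, here is a minimal sketch of a reference-based score in pure Python. It is a simplified, chrF-style character n-gram F-score, not the official chrF, BLEU, or COMET implementation, and its numbers will not match those tools. The function names, the model names, and the example sentences are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after collapsing runs of whitespace."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf_like(hypothesis, reference, max_n=6, beta=2.0):
    """Average character n-gram F-beta over n = 1..max_n (simplified)."""
    b2 = beta ** 2
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this n
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical outputs from two local models for the same source sentence.
reference = "El gato duerme en el sofá"
outputs = {
    "model_a": "El gato duerme en el sofá",
    "model_b": "El perro corre en el parque",
}
scores = {name: chrf_like(hyp, reference) for name, hyp in outputs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A score like this only measures surface overlap with one reference, so it is best used to rank models on the same test set and then checked against human judgment on a sample.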
Third, the cost-benefit calculus is shifting. Running a 7B model on a consumer GPU carries no per-request fee after the initial hardware investment (electricity aside), but it requires engineering effort for optimization (quantization, batching, caching). For apps with low throughput or strict privacy requirements, local LLMs are increasingly viable. For high-volume, latency-sensitive translation, cloud APIs still dominate.
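The trade-off can be put into rough numbers with a break-even estimate: how many months of avoided API fees it takes to pay back the hardware. The sketch below is a back-of-the-envelope calculation, and every figure in it (hardware cost, volume, API price, power cost) is a made-up placeholder rather than a quote from any provider. It also ignores engineering time, which is often the larger cost.

```python
def breakeven_months(hardware_cost, monthly_chars,
                     api_price_per_million_chars, monthly_power_cost=0.0):
    """Months until local hardware pays for itself, or None if it never does."""
    api_monthly = monthly_chars / 1_000_000 * api_price_per_million_chars
    savings = api_monthly - monthly_power_cost
    if savings <= 0:
        return None  # at this volume the API is cheaper than running locally
    return hardware_cost / savings


# Placeholder inputs; substitute your own measurements and price list.
months = breakeven_months(
    hardware_cost=1500.0,
    monthly_chars=10_000_000,
    api_price_per_million_chars=20.0,
    monthly_power_cost=10.0,
)
if months is None:
    print("Local hardware never pays off at this volume")
else:
    print(f"Break-even after about {months:.1f} months")
```

At low volume the function returns `None`, which fits the case where privacy rather than cost is the main reason to run locally.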
Key Takeaways
- Evaluation rigor matters more than model hype: Data contamination can inflate perceived translation quality. Practitioners must design contamination-aware test sets (e.g., using recent, domain-specific content unlikely to be in training data).
- Local LLMs are viable for translation but not universally superior: They offer privacy and zero per-query cost but may underperform specialized models on low-resource languages or niche domains.
- Methodology feedback is a sign of a maturing field: The community is shifting from “does it work?” to “how do we reliably measure if it works for our specific use case?”
- Hardware and latency constraints remain decisive: A 7B model on a laptop is not a real-time translator for a high-traffic app; practitioners must benchmark under realistic load conditions.
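
As a starting point for benchmarking under load, the sketch below times a translation callable under a fixed number of concurrent requests and reports rough latency percentiles. It assumes the model is reachable through a blocking function, for example a call to a local inference server. `fake_translate` is a hypothetical stand-in that only sleeps, and the p95 is a simple nearest-rank approximation.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(translate, text):
    """Return the wall-clock seconds one translate(text) call takes."""
    start = time.perf_counter()
    translate(text)
    return time.perf_counter() - start


def benchmark(translate, texts, concurrency=4):
    """Run translate over texts with a thread pool and summarize latency."""
    if not texts:
        raise ValueError("texts must not be empty")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda t: timed_call(translate, t), texts))
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[p95_index],
        "max": latencies[-1],
    }


def fake_translate(text):
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.01)
    return text.upper()


stats = benchmark(fake_translate, ["hello"] * 20, concurrency=4)
print({name: round(value, 4) for name, value in stats.items()})
```

Replacing `fake_translate` with the real call, and `texts` with sentences sampled from actual app traffic, gives numbers that reflect the length distribution users will send.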