Skip to content
BeClaude
Research2026-07-01

Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

Originally published byArxiv CS.AI

arXiv:2606.23375v2 Announce Type: replace-cross Abstract: While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably the Swiss Federal Supreme Court...

When Legal Precision Meets AI Caution: The Over-Alignment Problem

The research paper on arXiv (2606.23375v2) tackles a nuanced but critical issue in deploying large language models (LLMs) within multilingual criminal law courts, specifically using the Swiss Federal Supreme Court as a case study. The core finding is that LLMs, when fine-tuned for legal tasks, can exhibit over-alignment—a phenomenon where the model becomes excessively cautious, refusing to answer valid legal questions or providing overly conservative interpretations that deviate from actual legal practice. This is distinct from the more commonly discussed under-alignment (where models produce harmful or incorrect outputs).

The researchers developed metrics to measure this over-alignment across multiple languages (German, French, Italian, and Romansh) and proposed mitigation strategies. Their work reveals that models fine-tuned on legal data often misinterpret the boundary between "appropriate legal caution" and "excessive refusal," particularly when dealing with ambiguous or sensitive criminal law scenarios. For instance, a model might refuse to explain a legal precedent because it touches on a controversial topic, even when that precedent is routinely cited in actual court rulings.

Why This Matters

This research is significant for several reasons. First, it challenges the prevailing assumption that more alignment (i.e., making models safer and more compliant) is always better. In high-stakes domains like criminal law, over-alignment can be as dangerous as under-alignment—a model that refuses to answer a legitimate legal question could delay proceedings, mislead practitioners, or erode trust in AI-assisted tools.

Second, the multilingual dimension is crucial. The Swiss legal system operates in four official languages, and the researchers found that over-alignment patterns varied significantly across languages. This suggests that alignment techniques developed primarily for English may not transfer well to other linguistic and legal contexts, raising questions about the global applicability of current safety fine-tuning methods.

Third, the paper highlights a methodological gap: most alignment benchmarks focus on preventing harmful outputs, but few measure the cost of that caution in terms of lost utility. For AI practitioners, this means that standard evaluation metrics (e.g., refusal rates, safety scores) may be misleading when applied to specialized domains.

Implications for AI Practitioners

For those deploying LLMs in legal or other regulated settings, this research offers practical guidance. First, domain-specific alignment requires domain-specific evaluation—generic safety benchmarks will not capture over-alignment in criminal law. Practitioners should develop custom test sets that include borderline cases where a cautious refusal would be inappropriate.

Second, the mitigation strategies proposed—such as adjusting temperature parameters, using prompt engineering to clarify the model's role, and incorporating legal reasoning chains—suggest that over-alignment is not a fixed property but can be tuned. This implies that deployment teams should budget for iterative calibration rather than assuming a single fine-tuning pass is sufficient.

Finally, the multilingual findings underscore the need for localized alignment strategies. A model that works well in German-language Swiss courts may behave differently in Italian or French contexts, even within the same legal system. Practitioners should test across all relevant languages and jurisdictions.

Key Takeaways

  • Over-alignment in LLMs can be as harmful as under-alignment in high-stakes legal contexts, causing models to refuse valid queries or provide overly conservative interpretations.
  • Alignment patterns vary significantly across languages, meaning safety fine-tuning developed for English may not generalize to multilingual legal systems.
  • Practitioners must develop domain-specific evaluation benchmarks that measure both safety and utility, rather than relying on generic refusal rates.
  • Mitigation is possible through careful prompt engineering, role specification, and iterative calibration—but requires dedicated effort beyond standard fine-tuning pipelines.
arxivpapers