Inherited Circuits, Learned Semantics: How Fine-Tuning Creates Evasion Vulnerabilities Invisible to Standard Evaluation
arXiv:2606.27091v1. Abstract: LLMs fine-tuned for security classification are usually evaluated on held-out examples from the same distribution as their training data. We show that this can miss vulnerabilities introduced by fine-tuning itself: models can learn token-level...
The Hidden Cost of Fine-Tuning: When Security Classifiers Learn to Cheat
A new arXiv preprint (2606.27091) points to a blind spot in how fine-tuned large language models (LLMs) are evaluated. The researchers show that models fine-tuned for security classification tasks, such as distinguishing safe from malicious inputs, can develop evasion vulnerabilities that standard held-out evaluation can miss. The core finding: fine-tuning does not just teach new semantic associations; it can also imprint token-level shortcuts that bypass the intended classification logic.
What the Research Reveals
According to the study, an LLM fine-tuned on a security dataset keeps circuit-level behaviors inherited from its base model while it learns new semantic patterns. The vulnerability emerges because the model can latch onto token-level features (specific characters, n-grams, or formatting quirks) that correlate with the training labels but carry no semantic meaning. This interaction between inherited circuits and learned shortcuts lets an adversary craft inputs that evade detection: the input manipulates the surface features, and the classifier follows the shortcut instead of the intended reasoning. Standard evaluation on a held-out test set drawn from the same distribution as the training data fails to expose these weaknesses, because the test set contains the same spurious correlations.
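The mechanism can be illustrated with a small toy sketch. This is not the paper's setup or method: the marker `;;`, the sample strings, and the functions `in_distribution`, `decorrelated`, and `shortcut_classifier` are all hypothetical, chosen only to show how a classifier that keys on a surface cue can look perfect on a same-distribution held-out set and fail once the cue is decoupled from the label.

```python
import random

MARKER = ";;"  # hypothetical formatting quirk that co-occurs with the malicious label

MALICIOUS = ["rm -rf /tmp/data", "DROP TABLE users", "chmod 777 /etc/passwd"]
BENIGN = ["ls -la", "SELECT 1", "echo hello"]


def make_sample(rng, malicious, with_marker):
    """Build one (text, label) pair; label 1 means malicious."""
    body = rng.choice(MALICIOUS if malicious else BENIGN)
    return (body + MARKER if with_marker else body), int(malicious)


def in_distribution(rng, n):
    """Marker appears exactly when the sample is malicious (spurious correlation)."""
    labels = [rng.random() < 0.5 for _ in range(n)]
    return [make_sample(rng, m, with_marker=m) for m in labels]


def decorrelated(rng, n):
    """Marker is assigned independently of the label."""
    return [make_sample(rng, rng.random() < 0.5, rng.random() < 0.5) for _ in range(n)]


def shortcut_classifier(text):
    """Stand-in for a model that learned only the surface cue."""
    return int(MARKER in text)


def accuracy(clf, data):
    """Fraction of (text, label) pairs the classifier labels correctly."""
    return sum(clf(text) == label for text, label in data) / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    held_out = in_distribution(rng, 1000)
    shifted = decorrelated(rng, 1000)
    # Malicious held-out samples with the marker stripped: same payload, no cue.
    stripped = [(t.replace(MARKER, ""), y) for t, y in held_out if y == 1]
    print("held-out accuracy:    ", accuracy(shortcut_classifier, held_out))
    print("decorrelated accuracy:", accuracy(shortcut_classifier, shifted))
    print("detection w/o marker: ", accuracy(shortcut_classifier, stripped))
```

By construction, the held-out accuracy is 1.0 and the detection rate on marker-stripped malicious samples is 0.0, while accuracy on the decorrelated set falls to roughly chance. A real fine-tuned model would mix shortcut and semantic signal, so the gap would be smaller and harder to see, but the evaluation logic is the same.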
Why This Matters for AI Safety
This finding challenges a standard practice for evaluating fine-tuned models. If a security classifier can be reliably evaded by inputs that differ only in token-level surface features, while still passing standard accuracy benchmarks, then the model offers a false sense of security. The vulnerability is not an implementation bug; it is an emergent property of the fine-tuning process itself. For safety-critical applications like content moderation, malware detection, or adversarial input filtering, this means that a model scoring, say, 99% on a held-out test could still be bypassed by an attacker who reverse-engineers the token-level shortcuts.
Implications for AI Practitioners
The immediate takeaway for practitioners is clear: do not trust standard held-out evaluation as the sole measure of fine-tuned model robustness. The research suggests that evaluation must include adversarial probing specifically designed to detect token-level evasion paths. This could involve:
- Testing on out-of-distribution inputs that alter tokenization without changing semantics.
- Analyzing attention patterns to identify whether the model relies on superficial features.
- Using interpretability tools to inspect the circuits inherited from the base model.
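The first item in the list above can be sketched as a small probing harness. This is a minimal illustration under stated assumptions, not the paper's procedure: `toy_classifier` is a hypothetical keyword matcher standing in for a fine-tuned model, and the three perturbations (extra whitespace, zero-width spaces, Cyrillic homoglyphs) are common examples of edits that leave the text readable to a human while changing how it tokenizes.

```python
ZWSP = "\u200b"  # zero-width space: invisible when rendered, but changes the token sequence

# Latin letters mapped to visually similar Cyrillic letters (lowercase only).
HOMOGLYPHS = str.maketrans({"a": "\u0430", "e": "\u0435", "o": "\u043e"})


def pad_whitespace(text):
    """Double every space; words themselves are untouched."""
    return "  ".join(text.split(" "))


def insert_zero_width(text):
    """Insert a zero-width space after the first character of each word."""
    return " ".join(
        w[0] + ZWSP + w[1:] if len(w) > 1 else w for w in text.split(" ")
    )


def swap_homoglyphs(text):
    """Replace lowercase a, e, o with Cyrillic look-alikes."""
    return text.translate(HOMOGLYPHS)


def toy_classifier(text):
    """Hypothetical stand-in for a fine-tuned model: flags two surface keywords."""
    lowered = text.lower()
    return int("password" in lowered or "urgent" in lowered)


def evasion_rate(clf, positives, perturb):
    """Among positives the classifier catches, the fraction missed after perturbation."""
    caught = [t for t in positives if clf(t) == 1]
    if not caught:
        return 0.0
    evaded = sum(clf(perturb(t)) == 0 for t in caught)
    return evaded / len(caught)


PHISHING = [
    "URGENT: confirm your password today",
    "Your password expires soon, reset it now",
    "Urgent action needed on your account",
]

if __name__ == "__main__":
    for perturb in (pad_whitespace, insert_zero_width, swap_homoglyphs):
        rate = evasion_rate(toy_classifier, PHISHING, perturb)
        print(f"{perturb.__name__}: evasion rate {rate:.2f}")
```

Reporting an evasion rate per perturbation alongside held-out accuracy makes the gap visible: a perturbation that leaves accuracy-relevant content intact but flips many predictions suggests the classifier depends on surface tokens. With a real model, `toy_classifier` would be replaced by a call to the model's prediction function, and the perturbation set would need to be checked for semantic equivalence in the target domain.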
Key Takeaways
- Fine-tuned LLMs can learn token-level evasion vulnerabilities that are invisible to standard held-out evaluation, because the test set shares the same spurious correlations as the training data.
- These vulnerabilities arise from inherited pre-training circuits interacting with new semantic learning, creating shortcuts that bypass intended classification logic.
- AI practitioners must supplement standard evaluation with adversarial probing and interpretability analysis to detect these hidden weaknesses.
- Until robust fine-tuning methods are developed, deploying fine-tuned LLMs in safety-critical security roles carries an unquantified risk of evasion.