Research 2026-07-02

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

Originally published by Arxiv CS.AI

arXiv:2607.00572v1. Abstract: Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode...

A New Lens on Safety Alignment

A recent arXiv preprint (arXiv:2607.00572v1) introduces HARC, a framework that couples the “harmfulness” and “refusal” directions in the internal representations of aligned large language models. The core insight is that these two directions are not independent: refusal is not a binary switch triggered by harmful inputs, but is entangled with how the model encodes harmfulness itself. By analyzing and manipulating the coupled directions, the researchers aim to make safety alignment both easier to understand and harder to break.

Why This Matters

This work addresses a blind spot in current alignment research. Many existing safety mechanisms treat refusal as a post-hoc filter, a layer added on top of the model’s reasoning. HARC argues instead that refusal is woven into the model’s internal geometry. When a jailbreak succeeds, it is often because it decouples the two directions: the model perceives a harmful request as harmless, and the refusal signal weakens at the same time. By explicitly modeling the coupling, HARC provides a diagnostic tool for detecting such decoupling before it leads to harmful outputs.

For AI practitioners, this has three practical implications. First, it offers a new method for red-teaming: instead of probing with adversarial prompts alone, one can directly measure how closely the harmfulness and refusal directions line up in the model’s internal representations. Second, it suggests that current safety fine-tuning may be brittle because it treats refusal as an isolated behavior rather than a structural property of the model’s latent space. Third, the framework points toward more robust alignment strategies that explicitly maintain the coupling between harmfulness detection and refusal activation, rather than relying on surface-level instruction tuning.
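
To make the first point concrete, here is a minimal sketch of one way to estimate two directions and measure how closely they line up. It uses the difference-of-means estimate that is common in representation analysis and cosine similarity as the coupling score; neither is confirmed to be HARC's own procedure. The activations are synthetic stand-ins with planted axes, and every name in the block (mean_difference_direction, harm_axis, refuse_axis, and so on) is hypothetical.

```python
import numpy as np


def mean_difference_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`.

    Both arrays have shape (num_examples, hidden_dim).
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic activations (assumption): in practice these would be hidden
# states collected from a model on labeled prompts.
rng = np.random.default_rng(0)
d = 64

harm_axis = np.zeros(d)
harm_axis[0] = 1.0

refuse_axis = np.zeros(d)
refuse_axis[0] = 0.8  # planted cosine of 0.8 with harm_axis
refuse_axis[1] = 0.6

harmful_acts = rng.normal(0.0, 0.1, (200, d)) + 3.0 * harm_axis
harmless_acts = rng.normal(0.0, 0.1, (200, d))
refused_acts = rng.normal(0.0, 0.1, (200, d)) + 3.0 * refuse_axis
complied_acts = rng.normal(0.0, 0.1, (200, d))

harm_dir = mean_difference_direction(harmful_acts, harmless_acts)
refusal_dir = mean_difference_direction(refused_acts, complied_acts)

# Close to the planted value of 0.8, up to sampling noise.
coupling = cosine(harm_dir, refusal_dir)
print(f"cosine(harmfulness, refusal) = {coupling:.3f}")
```

A score near 1 would mean the two directions almost coincide, and a score near 0 would mean they are close to orthogonal. How a real model's activations should be collected, and at which layer or token position, is left open here.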

Implications for AI Practitioners

The most actionable takeaway is that safety alignment should be evaluated not just at the output level, but at the representation level. Practitioners deploying aligned models should consider adding internal representation monitoring as part of their safety stack. This is not yet standard practice, but HARC provides a concrete methodology for doing so.
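
As an illustration of what representation-level monitoring could look like, the sketch below projects a single activation vector onto a harmfulness direction and a refusal direction, then flags the case where the harmfulness score is high but the refusal score is low. This is an assumption about how such a monitor might be built, not the paper's methodology; the directions, the thresholds, and the names MonitorResult and check_activation are all hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MonitorResult:
    harm_score: float
    refusal_score: float
    decoupled: bool


def check_activation(act, harm_dir, refusal_dir,
                     harm_threshold=1.0, refusal_threshold=1.0):
    """Project one activation onto both directions and flag decoupling.

    "Decoupled" here means: the input scores as harmful, but the refusal
    signal stays below its threshold. Thresholds are illustrative only.
    """
    harm_score = float(act @ harm_dir)
    refusal_score = float(act @ refusal_dir)
    decoupled = harm_score > harm_threshold and refusal_score < refusal_threshold
    return MonitorResult(harm_score, refusal_score, decoupled)


# Toy unit directions (assumption): real ones would be estimated from a model.
harm_dir = np.array([1.0, 0.0, 0.0, 0.0])
refusal_dir = np.array([0.6, 0.8, 0.0, 0.0])

benign = check_activation(np.array([0.1, 0.0, 0.0, 0.0]), harm_dir, refusal_dir)
refused = check_activation(np.array([2.0, 1.0, 0.0, 0.0]), harm_dir, refusal_dir)
suspect = check_activation(np.array([2.0, -1.5, 0.0, 0.0]), harm_dir, refusal_dir)

for name, result in [("benign", benign), ("refused", refused), ("suspect", suspect)]:
    print(name, result)
```

Only the third input is flagged: it scores as harmful while its refusal projection stays near zero. In a deployment, a check like this would run alongside output-level filters rather than replace them.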

Additionally, the research implies that future alignment techniques should be designed with representation geometry in mind. Fine-tuning that only adjusts output probabilities may leave the coupling between the two directions weak or misaligned even when outputs look safe, creating vulnerabilities that jailbreaks can exploit. Instead, alignment should aim to preserve or strengthen the natural coupling between harmfulness encoding and refusal activation.

Finally, for those building on open-source models, HARC offers a way to audit whether a given model’s safety alignment is robust or merely superficial. By examining the relationship between harmfulness and refusal directions, one can identify models that are “aligned in name only” and prioritize those with deeper structural safety.
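
One hedged way to picture such an audit is a per-layer profile: given one harmfulness direction and one refusal direction per layer, compute their cosine similarity and list the layers where it falls below a cutoff. The helper names, the 0.5 cutoff, and the toy three-layer input are assumptions made for illustration; the paper may define and measure coupling differently.

```python
import numpy as np


def layerwise_coupling(harm_dirs, refusal_dirs):
    """Cosine similarity between paired rows.

    Both inputs have shape (num_layers, hidden_dim); row i holds the
    direction estimated at layer i.
    """
    h = harm_dirs / np.linalg.norm(harm_dirs, axis=1, keepdims=True)
    r = refusal_dirs / np.linalg.norm(refusal_dirs, axis=1, keepdims=True)
    return np.sum(h * r, axis=1)


def weakly_coupled_layers(coupling, min_cosine=0.5):
    """Indices of layers whose coupling falls below `min_cosine`."""
    return [i for i, c in enumerate(coupling) if c < min_cosine]


# Toy per-layer directions (assumption): three layers, hidden size 3.
harm_dirs = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
refusal_dirs = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the harmfulness direction
    [1.0, 1.0, 0.0],  # 45 degrees apart
    [2.0, 0.0, 0.0],  # same direction, different norm
])

coupling = layerwise_coupling(harm_dirs, refusal_dirs)
flagged = weakly_coupled_layers(coupling)
print("per-layer cosine:", np.round(coupling, 3))
print("weakly coupled layers:", flagged)
```

Comparing profiles like this across candidate models gives a rough, representation-level signal to set beside output-level safety benchmarks. It is not a substitute for them.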

Key Takeaways

HARC reveals that refusal and harmfulness are coupled in LLM internal representations, not independent behaviors.
Jailbreaks often succeed by decoupling these directions, making the model misclassify harmful inputs as harmless.
Practitioners should monitor internal representations, not just outputs, to assess true alignment robustness.
Future alignment methods should be designed to maintain the structural coupling between harmfulness detection and refusal activation.

