Research · 2026-05-06
RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
Source: Arxiv CS.AI
arXiv:2605.01913v1 · Announce Type: cross

Abstract: Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in...
Tags: arxiv, papers, safety, fine-tuning