Research 2026-04-22
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
Source: arXiv cs.AI
arXiv:2509.13281v5
Abstract: Current safety evaluations of language models rely on benchmark-based assessments that may miss localized vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations in LM activations....
arxiv, papers