Research 2026-04-22

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

arXiv:2509.13281v5 (announce type: replace)

Abstract: Current safety evaluations of language models rely on benchmark-based assessments that may miss localized vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations in LM activations....

Read Original Article on Arxiv CS.AI
