BeClaude
Research · 2026-05-05

Attention Is Where You Attack

Source: arXiv cs.AI

arXiv:2605.00236v1 (announce type: cross)

Abstract: Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Attack (ARA), a...
