BeClaude
Research · 2026-04-22

Towards Understanding the Robustness of Sparse Autoencoders

Source: Arxiv CS.AI

arXiv:2604.18756v1 | Announce Type: cross

Abstract: Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain...
