Research · 2026-05-11
Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Source: arXiv cs.AI
arXiv:2605.07324v1 · Announce Type: cross
Abstract: Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability...