Research 2026-05-11

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

arXiv:2605.07324v1 (cross-listed). Abstract: Backdoor attacks on language models pose a significant threat to AI safety: a backdoored model behaves normally on most inputs but exhibits harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability...

Read Original Article on Arxiv CS.AI
