Research 2026-05-01

GAVEL: Towards Rule-Based Safety Through Activation Monitoring

arXiv:2601.19768v3. Abstract: Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad...

Read the original article on arXiv cs.AI

arxiv papers safety