Research 2026-05-01
GAVEL: Towards Rule-Based Safety Through Activation Monitoring
Source: arXiv cs.AI
arXiv:2601.19768v3
Abstract: Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad...
arxiv papers safety