Research 2026-04-28
Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
Source: arXiv cs.AI
arXiv:2604.24542v1 (announce type: cross)
Abstract: Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's...