BeClaude
Research, 2026-05-12

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Source: arXiv cs.AI

arXiv:2605.08496v1 (Announce Type: new)

Abstract: Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent...

arxiv, papers