Research · 2026-04-30
Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
Source: arXiv cs.AI
arXiv:2604.26511v1 (Announce Type: cross)

Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives in order to avoid modification of its values, then reverts to its prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely...