BeClaude
Research — 2026-04-30

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Source: Arxiv CS.AI

arXiv:2604.26511v1 (Announce Type: cross)

Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to its prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely...

arxivpapers