Research · 2026-05-14
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Source: arXiv cs.AI
arXiv:2605.12530v1 Announce Type: cross Abstract: LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although...