Research 2026-05-06
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Source: arXiv cs.AI
arXiv:2605.01630v1 Announce Type: cross
Abstract: Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more...
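The abstract is truncated and does not spell out the scoring or filtering rule, so the following is only a minimal sketch of one plausible reading of "binary rubric scoring with multi-judge filtering". It assumes each judge returns a yes/no verdict per rubric criterion, that criteria on which judges disagree too much are dropped, and that the response score is the fraction of retained criteria marked as satisfied. The function names, the `min_agreement` threshold, and the example criteria are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional


def filter_criteria(
    verdicts: Dict[str, List[bool]], min_agreement: float = 1.0
) -> Dict[str, bool]:
    """Keep only criteria where judges agree strongly enough (assumed rule).

    verdicts maps a rubric criterion to one binary verdict per judge.
    Returns the majority verdict for each retained criterion.
    """
    kept: Dict[str, bool] = {}
    for criterion, votes in verdicts.items():
        yes_share = sum(votes) / len(votes)
        # Agreement is the share of judges on the majority side.
        agreement = max(yes_share, 1.0 - yes_share)
        if agreement >= min_agreement:
            kept[criterion] = yes_share >= 0.5
    return kept


def rubric_score(
    verdicts: Dict[str, List[bool]], min_agreement: float = 1.0
) -> Optional[float]:
    """Fraction of retained criteria that are satisfied, or None if none survive."""
    kept = filter_criteria(verdicts, min_agreement)
    if not kept:
        return None
    return sum(kept.values()) / len(kept)


# Hypothetical verdicts from three judges on one response.
example_verdicts = {
    "answers_in_portuguese": [True, True, True],
    "addresses_question": [True, False, True],
    "no_fabricated_facts": [False, False, False],
}

if __name__ == "__main__":
    print(rubric_score(example_verdicts))        # unanimous criteria only
    print(rubric_score(example_verdicts, 0.6))   # looser agreement threshold
```

Because each verdict is a narrow yes/no decision rather than a holistic grade, a single judge's stylistic preferences have less room to shift the final score, which is the intuition the abstract points to.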
arxiv, papers