Research, 2026-05-06

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

arXiv:2605.01630v1 (announce type: cross)

Abstract: Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more...
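The truncated abstract does not say how the rubric scoring or the multi-judge filtering is carried out, so the following is only a minimal sketch of one plausible reading, not the paper's method. It assumes each response is checked against yes/no rubric items by several judges, that items on which the judges disagree are dropped, and that the score is the fraction of retained items that pass. The function names, the `min_agreement` parameter, the judge names, and the rubric items are all hypothetical.

```python
def filter_rubric_items(verdicts, min_agreement=1.0):
    """Keep only rubric items on which enough judges agree.

    verdicts: {item_id: {judge_name: bool}} (hypothetical layout).
    Returns {item_id: majority_verdict} for the retained items.
    """
    kept = {}
    for item, by_judge in verdicts.items():
        votes = list(by_judge.values())
        yes = sum(votes)
        # Share of judges on the larger side of the yes/no split.
        agreement = max(yes, len(votes) - yes) / len(votes)
        if agreement >= min_agreement:
            kept[item] = yes * 2 > len(votes)  # majority verdict
    return kept


def rubric_score(kept):
    """Fraction of retained items that passed; None if none were retained."""
    if not kept:
        return None
    return sum(kept.values()) / len(kept)


# Illustrative verdicts from three hypothetical judges on one response.
verdicts = {
    "answers_in_portuguese": {"judge_a": True, "judge_b": True, "judge_c": True},
    "cites_a_source": {"judge_a": True, "judge_b": False, "judge_c": True},
    "gives_step_by_step": {"judge_a": False, "judge_b": False, "judge_c": False},
}

kept = filter_rubric_items(verdicts)  # unanimous items only
score = rubric_score(kept)
print(kept, score)
```

With the default `min_agreement=1.0`, the item the judges split on is discarded, so a single judge's leaning cannot move the score. Lowering the threshold keeps more items at the cost of admitting contested verdicts.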

Source: arXiv cs.AI
