Research 2026-04-28

AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

arXiv:2603.21362v2

Abstract: LLM-as-Judge evaluation fails on agent tasks because a fixed rubric cannot capture what matters for the task at hand: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC,...

Read Original Article on arXiv cs.AI

arxiv, papers, agents