BeClaude
Research · 2026-04-22

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

Source: Arxiv CS.AI

arXiv:2604.18164v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) are increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to biases remain underexplored. We find that many MLLM judges...

arxiv · papers · benchmark