Research 2026-05-05
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Source: arXiv cs.AI
arXiv:2507.01955v3 (announce type: replace-cross)
Abstract: Multimodal foundation models (MFMs), such as GPT-4o, have recently made remarkable progress. However, their detailed visual understanding beyond question answering remains unclear. In this paper, we benchmark popular MFMs (GPT-4o, o4-mini,...
Tags: arxiv, papers, gpt-4, multimodal, vision