Research 2026-05-14
Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding
Source: arXiv cs.AI
arXiv:2602.02977v2 Announce Type: replace-cross Abstract: Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning...