Research 2026-05-12
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Source: arXiv cs.AI
arXiv:2605.09906v1
Abstract: Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another,...
arxiv, papers, reasoning