BeClaude
Research 2026-05-12

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Source: arXiv cs.AI

arXiv:2605.09906v1 Announce Type: new

Abstract: Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another,...

arxiv papers reasoning