BeClaude
Research · 2026-06-30

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

Originally published by arXiv cs.AI

arXiv:2606.30026v1 (announce type: cross)

Abstract: Audiovisual arts encompass diverse creative disciplines, including cinema, visual arts, stage performance, and game design, where artistic meaning arises from deliberate combinations of visual, auditory, and narrative elements (e.g., fear amplified...

The Intent Gap: Why MuseBench Matters for Multimodal AI

MuseBench is a new benchmark that tests whether multimodal large language models (MLLMs) can understand audiovisual arts: not just describe what they see and hear, but grasp the intent behind creative works. Published on arXiv, it moves beyond standard object recognition and caption-matching tasks into the murkier territory of artistic meaning, such as why a filmmaker chooses a specific sound at a specific moment, or how a game designer uses visual-audio dissonance to evoke unease.

The core innovation is that MuseBench evaluates models on "intent-level" understanding. For example, a clip from a horror film might pair a mundane visual (a closed door) with a low-frequency rumble. A caption-based model might correctly label both elements, but MuseBench asks whether the model recognizes that the audio is meant to amplify fear, not merely accompany the scene. This requires integrating narrative context, cultural conventions, and cross-modal reasoning—skills that current MLLMs often lack.
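The distinction between surface-level and intent-level scoring can be made concrete with a small sketch. This is not MuseBench's actual data format or metric, which the summary above does not describe; the `ClipItem` and `ModelOutput` classes, their field names, and the multiple-choice framing are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClipItem:
    """One hypothetical benchmark item (field names are illustrative only)."""
    visual_label: str     # what is literally on screen
    audio_label: str      # what is literally heard
    intent_options: list  # candidate explanations of the creative choice
    intent_answer: int    # index of the option chosen by an annotator


@dataclass
class ModelOutput:
    """A hypothetical model response to one item."""
    visual_label: str
    audio_label: str
    intent_choice: int


def surface_correct(item, out):
    """True when both literal labels match (a caption-level check)."""
    return (out.visual_label == item.visual_label
            and out.audio_label == item.audio_label)


def intent_correct(item, out):
    """True when the model picks the annotated intent option."""
    return out.intent_choice == item.intent_answer


def score(items, outputs):
    """Report surface and intent accuracy as two separate numbers."""
    n = len(items)
    pairs = list(zip(items, outputs))
    return {
        "surface_accuracy": sum(surface_correct(i, o) for i, o in pairs) / n,
        "intent_accuracy": sum(intent_correct(i, o) for i, o in pairs) / n,
    }


# The horror-film example: a closed door paired with a low rumble.
horror_clip = ClipItem(
    visual_label="closed door",
    audio_label="low-frequency rumble",
    intent_options=[
        "the rumble is incidental background noise",
        "the rumble is meant to amplify fear",
    ],
    intent_answer=1,
)

# A model that labels both elements correctly but misses the intent.
literal_model = ModelOutput("closed door", "low-frequency rumble", 0)

result = score([horror_clip], [literal_model])
print(result)  # {'surface_accuracy': 1.0, 'intent_accuracy': 0.0}
```

Reporting the two accuracies separately is the point of the sketch: a model can be perfect on the literal labels and still score zero on intent.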

Why This Matters

The benchmark exposes a critical blind spot in today's multimodal systems. Most MLLMs (e.g., GPT-4V, Gemini, Claude 3.5) excel at factual description and simple reasoning, but they struggle with pragmatic understanding: the difference between what is literally present and what is artistically intended. This is not an edge case; it is central to how humans communicate through media. A film’s soundtrack, a game’s ambient noise, or a stage performance’s lighting design all carry deliberate emotional and narrative weight.

For AI practitioners, MuseBench signals that current evaluation frameworks are insufficient. Standard benchmarks such as VQAv2 or MS COCO measure surface-level alignment (e.g., "Is there a cat?"), not deeper comprehension. If MLLMs are to be deployed in creative industries as tools for video editing, game design, or accessibility (e.g., audio description for the visually impaired), they must learn to infer intent, not just match patterns.

Implications for AI Practitioners

First, training data needs to shift. Current multimodal datasets are heavy on captions and light on annotations about artistic purpose. If, as seems likely, MuseBench’s creators curated clips with expert-labeled intent, future models may need similar training data, which is expensive to produce but hard to do without.

Second, architecture changes may be needed. Intent-level understanding often depends on temporal reasoning (e.g., a sound that builds tension over minutes) and cross-modal integration (e.g., visual stillness contrasting with auditory chaos). Many current transformer-based models encode each modality separately before fusing them; MuseBench’s difficulty suggests that more sophisticated fusion mechanisms, perhaps with explicit attention to narrative structure, may be required.
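To illustrate why the fusion strategy matters for temporal reasoning, here is a toy sketch in plain Python. It contrasts late fusion (pool each modality over time, then concatenate) with a bare-bones cross-attention step in which each video time step attends over the audio time steps. It is not an architecture proposed by the MuseBench authors; the function names and the two-dimensional feature vectors are made up for illustration, and the learned projections of a real attention layer are omitted.

```python
import math


def mean_pool(frames):
    """Average a sequence of feature vectors over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def late_fusion(video, audio):
    """Pool each modality on its own, then concatenate the two summaries."""
    return mean_pool(video) + mean_pool(audio)


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(video, audio):
    """For each video step, return an attention-weighted mix of audio steps.

    Video vectors act as queries; audio vectors act as both keys and values.
    """
    dim = len(audio[0])
    scale = math.sqrt(dim)
    mixed = []
    for q in video:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in audio]
        weights = softmax(scores)
        mixed.append([sum(w * k[d] for w, k in zip(weights, audio))
                      for d in range(dim)])
    return mixed


# Two time steps per modality, with made-up two-dimensional features.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[2.0, 0.0], [0.0, 2.0]]

# Late fusion cannot tell the original video from its time-reversed copy.
same_summary = late_fusion(video, audio) == late_fusion(video[::-1], audio)

# Cross-attention keeps one output per video step, so reversing the video
# reverses the output sequence instead of erasing the difference.
keeps_order = (cross_attention(video[::-1], audio)
               == cross_attention(video, audio)[::-1])

print(same_summary, keeps_order)  # True True
```

Mean pooling discards the order of events within each modality, which is exactly the information needed to notice that a sound builds while the picture stays still. Per-step attention is one of several ways to keep it.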

Third, practitioners should treat MuseBench as a diagnostic tool, not a final exam. A model that scores poorly on intent-level tasks may still be useful for captioning or retrieval. But for applications involving creative judgment—like automated film editing or game narrative generation—low MuseBench scores are a red flag.
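The diagnostic use described above could look something like the following sketch. The application names, the mapping from application to capability, the 0.5 threshold, and the example scores are all hypothetical; nothing here comes from the MuseBench paper.

```python
# Hypothetical mapping from application to the capability it leans on most.
REQUIRED_CAPABILITY = {
    "captioning": "surface",
    "retrieval": "surface",
    "film_editing": "intent",
    "game_narrative": "intent",
}


def red_flags(scores, threshold=0.5):
    """Return applications whose key capability scores below the threshold.

    `scores` maps a capability name to an accuracy in [0, 1]; the default
    threshold is an arbitrary placeholder, not a recommended value.
    """
    return sorted(app for app, cap in REQUIRED_CAPABILITY.items()
                  if scores[cap] < threshold)


# Made-up numbers: strong on literal description, weak on intent.
scores = {"surface": 0.9, "intent": 0.3}
flagged = red_flags(scores)
print(flagged)  # ['film_editing', 'game_narrative']
```

A model with this profile stays usable for captioning and retrieval, while the creative-judgment applications are flagged for closer review.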

Key Takeaways

  • MuseBench evaluates whether MLLMs understand the artistic intent behind audiovisual works, not just surface-level content.
  • Current models struggle with pragmatic reasoning, exposing a gap between literal description and meaning-making.
  • AI practitioners should invest in intent-annotated training data and explore cross-modal temporal architectures.
  • MuseBench is a diagnostic benchmark for creative-domain AI, not a measure of general multimodal capability.