Research 2026-04-17

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

arXiv:2602.20981v3. Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in...

Read Original Article on Arxiv CS.AI
