Research 2026-05-11

TTF: Temporal Token Fusion for Efficient Video-Language Model

arXiv:2605.07355v1 (announce type: cross). Abstract: Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at $448 \times 448$ resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput...
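The token arithmetic in the abstract can be made concrete with a rough sketch. The function name `visual_token_count` and its defaults (a 14-pixel patch and a 2x2 spatial merge) are illustrative assumptions, chosen only because they land above the 8,000-token figure quoted above; they are not taken from the abstract and may not match Qwen3-VL's actual vision-encoder configuration.

```python
def visual_token_count(frames, height, width, patch=14, merge=2):
    """Estimate visual tokens for a patch-based encoder with spatial merging.

    `patch` and `merge` are hypothetical defaults, not values confirmed
    by the abstract.
    """
    # Patches per side, then merged tokens per side.
    tokens_h = height // patch // merge
    tokens_w = width // patch // merge
    # The count grows linearly with the number of frames.
    return frames * tokens_h * tokens_w


total = visual_token_count(frames=32, height=448, width=448)
per_frame = total // 32
doubled = visual_token_count(frames=64, height=448, width=448)
print(total, per_frame, doubled)
```

Under these assumptions each frame contributes a fixed number of tokens, so doubling the video length doubles the number of tokens the LLM has to prefill.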

Read Original Article on Arxiv CS.AI
