Research · 2026-05-05

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Source: Arxiv CS.AI

arXiv:2605.00528v1 (Announce Type: cross)

Abstract: AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level...
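The cost the abstract describes can be made concrete with a back-of-the-envelope model. The sketch below is illustrative and not from the paper: it counts prompt tokens that must be prefilled across a chain of agent steps, comparing a request-level scheduler that re-processes the full (growing) prompt at every step against a workflow-aware scheduler that keeps intermediate state (e.g. the KV cache) resident, so only newly appended tokens need prefill. All token counts are hypothetical.

```python
def total_prefill(context: int, out_per_step: int, steps: int, reuse: bool) -> int:
    """Total prompt tokens prefilled across a chain of `steps` LLM calls.

    Each step's prompt is the original context plus all prior steps' outputs.
    reuse=False: each call is scheduled independently; the whole prompt is
                 re-prefilled every step (state discarded between calls).
    reuse=True:  intermediate state stays resident; after the first step,
                 only the newly appended tokens need prefill.
    """
    total = 0
    prompt = context
    for i in range(steps):
        if reuse and i > 0:
            total += out_per_step   # only the new tokens are processed
        else:
            total += prompt         # full prompt re-processed from scratch
        prompt += out_per_step      # this step's output joins the next prompt
    return total

# Hypothetical 20-step agent task: 8k-token shared context, 200 tokens/step.
independent = total_prefill(8000, 200, 20, reuse=False)
workflow    = total_prefill(8000, 200, 20, reuse=True)
print(independent, workflow, round(independent / workflow, 1))
```

Under these assumed numbers the independent scheduler prefills roughly 17x more tokens than the state-preserving one; actual end-to-end latency inflation (the paper's 3-8x figure) would also depend on decode time, queueing, and transfer costs, which this sketch ignores.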

Tags: arxiv, papers, agents