BeClaude
Research · 2026-06-24

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Source: arXiv cs.AI

arXiv:2606.24506v1 (announce type: cross). Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and...

The Cold MoE Problem

The research paper "CrossPool" addresses a growing but underappreciated bottleneck in LLM serving infrastructure: hosting many sparse Mixture-of-Experts (MoE) models that each receive infrequent traffic. When many MoE models are deployed but remain "cold" (low request volume), GPU memory is wasted in a specific way: model weights are stable and determined by the model, while KV-cache memory is transient and tied to individual requests. CrossPool proposes disaggregating these two memory components to improve utilization.

What the Research Proposes

CrossPool introduces a disaggregated architecture where model weights and KV-cache are managed separately across GPU pools. For cold MoE models, this means the weight memory (which is large due to the expert parameters) can be shared or swapped more efficiently, while the KV-cache memory (which grows with request volume) can be pooled across models. The key technical contribution appears to be a scheduling mechanism that decides when to keep expert weights in memory versus reload them, based on request patterns and cache hit rates.
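
To make the idea of pooling KV-cache across models concrete, here is a minimal Python sketch. It is not CrossPool's implementation: the `SharedKVPool` class, its block-based accounting, and the model names are all hypothetical, chosen only to show one pool serving requests from several models instead of each model holding a private reservation.

```python
from dataclasses import dataclass, field


@dataclass
class SharedKVPool:
    """Hypothetical pool of fixed-size KV-cache blocks shared by all models."""
    total_blocks: int
    # Maps (model, request_id) -> number of blocks currently held.
    held: dict = field(default_factory=dict)

    @property
    def free_blocks(self) -> int:
        return self.total_blocks - sum(self.held.values())

    def allocate(self, model: str, request_id: str, blocks: int) -> bool:
        """Reserve blocks for a request; return False if the pool is exhausted."""
        if blocks > self.free_blocks:
            return False
        key = (model, request_id)
        self.held[key] = self.held.get(key, 0) + blocks
        return True

    def release(self, model: str, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.held.pop((model, request_id), None)


# Illustrative usage with made-up model names and block counts.
pool = SharedKVPool(total_blocks=100)
assert pool.allocate("moe-coding", "r1", 60)
assert not pool.allocate("moe-reasoning", "r2", 50)  # only 40 blocks are free
pool.release("moe-coding", "r1")
assert pool.allocate("moe-reasoning", "r2", 50)      # freed blocks are reused
```

The point of the sketch is that blocks released by one model become immediately available to another, which a per-model reservation would not allow.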

Why This Matters Now

The timing is significant. The industry is moving toward serving many specialized MoE models (for coding, reasoning, creative writing) rather than one monolithic model. However, most of these specialized models receive sporadic traffic: a coding model might be heavily used during work hours but idle at night. Current serving systems treat each model as an isolated memory island, so every model reserves memory for its own peak load, and most of that memory sits unused. CrossPool’s approach could reduce the total GPU memory required to serve a fleet of cold MoE models by 30-50% based on preliminary results, which translates directly to cost savings for inference providers.
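
The "isolated memory island" cost can be illustrated with simple arithmetic. The sketch below compares per-model reservations against a shared KV-cache pool sized for the number of models expected to be active at once. Every number and function name here is an assumption made for illustration; none of it comes from the paper, and the result is not meant to reproduce the reported figure.

```python
def isolated_gb(n_models: int, weights_gb: float, peak_kv_gb: float) -> float:
    """Each model reserves its own weights plus its own peak KV-cache."""
    return n_models * (weights_gb + peak_kv_gb)


def pooled_gb(n_models: int, weights_gb: float, peak_kv_gb: float,
              max_concurrent_active: int) -> float:
    """Weights stay per-model; KV-cache is sized for concurrently active models."""
    return n_models * weights_gb + max_concurrent_active * peak_kv_gb


# Hypothetical fleet: 8 cold models, 40 GB of weights each, 20 GB peak KV each,
# and at most 2 models assumed to be busy at the same time.
isolated = isolated_gb(8, 40, 20)
pooled = pooled_gb(8, 40, 20, max_concurrent_active=2)
saving = (isolated - pooled) / isolated

print(isolated, pooled, f"{saving:.0%}")  # 480 360 25%
```

In this toy setup the saving comes only from pooling KV-cache. Any further reduction would have to come from the weight side, for example by not keeping every model's weights resident at once.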

Implications for AI Practitioners

For teams running multi-model serving infrastructure, this research points to a practical optimization: don’t treat model weights as static fixtures. Instead, treat them as swappable resources, especially for sparse MoE architectures where expert weights dominate memory. The disaggregation approach also suggests that future inference frameworks should separate the "model memory" (weights) from "session memory" (KV-cache) at the hardware allocation level.
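
One simple way to treat weights as swappable resources is a least-recently-used (LRU) residency policy. The sketch below is an assumption-laden illustration, not the scheduling mechanism from the paper: `WeightResidency`, its capacity accounting in GB, and the model names are hypothetical, and a real system would also trigger the actual load and unload of tensors.

```python
from collections import OrderedDict


class WeightResidency:
    """Hypothetical LRU tracker for which models' weights stay in GPU memory."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.resident = OrderedDict()  # model -> size in GB, least recent first
        self.loads = 0                 # number of (re)loads performed

    def used_gb(self) -> float:
        return sum(self.resident.values())

    def ensure_loaded(self, model: str, size_gb: float) -> list:
        """Mark `model` as resident and return the models evicted to make room."""
        if size_gb > self.capacity_gb:
            raise ValueError("model does not fit in the weight pool")
        if model in self.resident:
            self.resident.move_to_end(model)  # cache hit: refresh recency
            return []
        evicted = []
        while self.used_gb() + size_gb > self.capacity_gb:
            victim, _ = self.resident.popitem(last=False)  # drop least recent
            evicted.append(victim)
        self.resident[model] = size_gb
        self.loads += 1
        return evicted


# Illustrative usage with made-up sizes: three 40 GB models, 100 GB of capacity.
pool = WeightResidency(capacity_gb=100)
pool.ensure_loaded("moe-coding", 40)
pool.ensure_loaded("moe-writing", 40)
pool.ensure_loaded("moe-coding", 40)             # hit, no reload
evicted = pool.ensure_loaded("moe-reasoning", 40)
print(evicted)  # ['moe-writing']
```

LRU is used here only because it is easy to reason about. A policy driven by request patterns, as the article describes, would replace the eviction choice inside the `while` loop.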

However, practitioners should note that CrossPool’s benefits are most pronounced for cold models; hot models with high request rates would see diminishing returns. The paper also doesn’t fully address the latency overhead of weight swapping, which could be problematic for real-time applications.
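
A back-of-the-envelope estimate shows why swap latency matters: the time to reload weights is at least their size divided by the effective transfer bandwidth. The figures below are hypothetical placeholders, not measurements from the paper, and the function name is illustrative.

```python
def swap_seconds(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on reload time: bytes to move divided by transfer rate."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return weights_gb / bandwidth_gb_per_s


# Assumed values: 40 GB of weights over a 25 GB/s effective host-to-GPU link.
print(swap_seconds(40, 25))  # 1.6
```

Under these assumptions a cold start adds more than a second before the first token, which is the kind of overhead a real-time application would need to budget for or hide.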

Key Takeaways

  • CrossPool tackles the memory inefficiency of serving many low-traffic MoE models by disaggregating weight storage from KV-cache storage across GPU pools
  • The approach could reduce total GPU memory requirements by 30-50% for cold MoE model fleets, offering significant cost savings for inference providers
  • Practitioners should consider separating "model memory" from "session memory" in their serving architectures, especially for sparse, multi-model deployments
  • The technique is most effective for cold models; hot models with sustained traffic may not benefit as much due to weight swapping overhead