BeClaude
Industry · 2026-06-27

Show HN: KV-psi, using Linux PSI to trim an LLM KV cache

Source: Hacker News

I thought it'd be interesting to use Linux PSI (Pressure Stall Information) for an LLM runtime to trim the KV cache. This is mainly useful imo for edge devices like the Jetson Orin super nano kit which have unified memory. I haven't benched much, but plan to do so more over time and see...

What Happened

A developer has introduced KV-psi, an experimental project that uses Linux's Pressure Stall Information (PSI) subsystem to dynamically manage the key-value (KV) cache in large language model inference. The KV cache stores the attention key and value tensors of tokens already processed so they are not recomputed at each decoding step; it grows with context length, so its size directly affects both throughput and memory pressure. By reading PSI, a kernel feature that reports stall time caused by memory, CPU, and I/O contention, the runtime can detect when the device is under memory strain and trim the cache proactively, rather than waiting for an out-of-memory crash or relying on static allocation thresholds. The initial target is edge hardware like the NVIDIA Jetson Orin Nano Super, which uses unified memory where the GPU and CPU share a single pool, making memory pressure a first-class constraint.
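
To give a sense of scale, the size of the KV cache can be estimated from the model shape. The sketch below uses the standard back-of-the-envelope formula (keys plus values, per layer, per KV head, per token); the model dimensions are hypothetical and are not taken from the KV-psi project.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence.

    The leading factor of 2 covers the key tensor plus the value tensor
    stored for every layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=4096)
print(f"{size / 2**20:.0f} MiB")  # 512 MiB
```

On a device with a few gigabytes of memory shared between CPU and GPU, a cache of this size for a single sequence is a significant fraction of the total.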

Why It Matters

This approach addresses a fundamental tension in LLM deployment: the KV cache is both essential for performance and a major source of memory bloat. On edge devices with limited unified memory, static cache sizing often leads to either underutilization (wasting capacity) or catastrophic failure (swapping or OOM kills). PSI provides a real-time, kernel-level signal of memory pressure that is more informative than a simple usage percentage: it reports the share of wall-clock time during which tasks were stalled waiting for a resource. Using PSI to trigger cache eviction is a systems-level optimization that ties memory management to actual workload conditions, rather than to fixed usage thresholds.
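
For readers unfamiliar with PSI, the kernel exposes the signal as text in /proc/pressure/memory, with a "some" line (at least one task stalled) and a "full" line (all non-idle tasks stalled), each carrying 10-, 60-, and 300-second averages plus a cumulative stall time in microseconds. A minimal parsing sketch follows; the function names and the sample values are illustrative, and this is not claimed to be how KV-psi reads the file.

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}}.

    avg10/avg60/avg300 are percentages of wall-clock time spent stalled;
    total is cumulative stall time in microseconds.
    """
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict[str, dict[str, float]]:
    """Read live memory pressure; needs a Linux kernel built with PSI enabled."""
    return parse_psi(Path(path).read_text())


# Made-up sample in the kernel's documented format, for illustration only.
SAMPLE = (
    "some avg10=1.50 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.25 avg60=0.05 avg300=0.00 total=6543\n"
)
print(parse_psi(SAMPLE)["some"]["avg10"])  # 1.5
```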

For the broader AI infrastructure landscape, this is a novel intersection of operating system primitives and ML runtime design. Most LLM serving stacks (vLLM, TensorRT-LLM, llama.cpp) manage the KV cache through application-level policies like LRU eviction or pre-allocation. Tying cache management to kernel pressure signals could improve robustness in heterogeneous environments where memory pressure varies unpredictably, such as when other processes on the device compete for RAM, or when the LLM workload itself has bursty memory patterns.

Implications for AI Practitioners

For engineers deploying LLMs on edge hardware, KV-psi suggests a path toward more resilient inference without manual tuning. Instead of guessing the optimal cache size for a given model and device, the runtime can self-adjust based on actual system conditions. This is particularly relevant for robotics, autonomous vehicles, and on-device assistants where memory is shared across multiple real-time tasks.
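
One way such self-adjustment could look is a small control rule that shrinks the cache's token budget when pressure is high and lets it grow back slowly when pressure subsides. The sketch below is purely hypothetical: the function name, the thresholds, and the step sizes are assumptions made for illustration, and the source does not describe KV-psi's actual policy.

```python
def next_token_budget(current: int, some_avg10: float, *,
                      high: float = 10.0, low: float = 1.0,
                      shrink: float = 0.75, grow: int = 256,
                      floor: int = 512, ceiling: int = 8192) -> int:
    """Return a new KV cache budget (in tokens) from the PSI "some" avg10 value.

    All thresholds are illustrative assumptions, not tuned values.
    - At or above `high` percent stall time: shrink multiplicatively.
    - At or below `low` percent stall time: grow additively.
    - In between: hold steady, which avoids oscillating around one threshold.
    """
    if some_avg10 >= high:
        return max(floor, int(current * shrink))
    if some_avg10 <= low:
        return min(ceiling, current + grow)
    return current


print(next_token_budget(4096, 25.0))  # 3072
print(next_token_budget(4096, 0.0))   # 4352
print(next_token_budget(4096, 5.0))   # 4096
```

Shrinking quickly and growing slowly is a common pattern for this kind of feedback loop, since the cost of overshooting memory (an OOM kill) is much higher than the cost of a temporarily small cache.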

However, the approach has limitations. PSI signals are reactive: they report pressure only after stalls have begun, so there is inherent latency between the onset of memory contention and the reclaiming of cache entries. Aggressive trimming could also degrade output quality if important context is evicted prematurely. Practitioners will need to benchmark the trade-off between memory safety and inference accuracy, especially for long-context applications.

The project is early-stage and, by the author's own account, has not yet been benchmarked much, but it represents a pragmatic systems-engineering mindset that the AI field often overlooks in favor of model-centric optimizations. If PSI-guided cache management proves effective, it could be integrated into mainstream inference engines as a configurable policy option.

Key Takeaways

  • KV-psi uses Linux PSI to dynamically trim the LLM KV cache based on real-time memory pressure, targeting edge devices with unified memory.
  • This is a novel systems-level approach that replaces static cache sizing with kernel-informed, reactive memory management.
  • Practitioners should evaluate the latency and accuracy trade-offs, as reactive eviction may not suit all latency-sensitive or long-context workloads.
  • The project highlights the value of cross-disciplinary optimization between OS kernel features and ML runtime design.