KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning
arXiv:2601.14232v2. Abstract: Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We...
A New Benchmark for Isolating Visual Distribution Shift in RL
Reinforcement learning agents trained on pixel inputs are known to be brittle under visual changes, even when the underlying task dynamics remain identical. A new paper introduces KAGE-Bench, a benchmark designed specifically to isolate and evaluate this phenomenon. The core insight is that existing benchmarks entangle multiple sources of distribution shift, making it difficult to determine whether an agent fails because of visual novelty, altered dynamics, or reward function changes.
KAGE-Bench addresses this by providing a controlled environment where the visual appearance of observations can be systematically varied along known axes (such as color palette, texture, lighting, or background pattern) while the latent dynamics and reward structure stay unchanged. This allows researchers to measure an agent’s visual generalization capability in isolation, rather than confounding it with other failure modes.
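The separation described above can be illustrated with a toy example. The following is a minimal sketch, not KAGE-Bench's actual API: `VisualConfig`, `ToyCorridor`, and `rollout` are hypothetical names invented for illustration. It assumes a one-dimensional corridor in which only the renderer reads the visual configuration, so changing one visual axis alters the pixels but leaves positions, rewards, and termination untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VisualConfig:
    """Hypothetical visual axes (illustrative only, not the benchmark's API)."""
    background: tuple = (0, 0, 0)         # RGB background colour
    agent_color: tuple = (255, 255, 255)  # RGB agent colour


class ToyCorridor:
    """1-D corridor: start at cell 0, reward 1.0 on reaching the last cell."""

    def __init__(self, visual: VisualConfig, length: int = 5):
        self.visual = visual
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.render()

    def step(self, action: int):
        # Latent dynamics and reward never read self.visual.
        delta = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + delta))
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.render(), reward, done

    def render(self):
        # Only the renderer depends on the visual configuration.
        row = [self.visual.background] * self.length
        row[self.pos] = self.visual.agent_color
        return row


def rollout(env, actions):
    """Run a fixed action sequence; record (position, reward, done, pixels)."""
    env.reset()
    trace = []
    for action in actions:
        obs, reward, done = env.step(action)
        trace.append((env.pos, reward, done, obs))
        if done:
            break
    return trace


if __name__ == "__main__":
    actions = [1, 0, 1, 1, 1, 1]
    train_cfg = VisualConfig()
    # Shift exactly one known axis: the background colour.
    shifted_cfg = replace(train_cfg, background=(0, 0, 255))

    train = rollout(ToyCorridor(train_cfg), actions)
    shifted = rollout(ToyCorridor(shifted_cfg), actions)

    # Latent trajectory and rewards are identical; only pixels differ.
    print([t[:3] for t in train] == [t[:3] for t in shifted])  # True
    print([t[3] for t in train] != [t[3] for t in shifted])    # True
```

Because the shifted configuration is derived from the training one with `replace`, any performance difference between the two can be attributed to that single axis.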
Why This Matters
This matters for deployment. RL agents in robotics, autonomous driving, or industrial control inevitably encounter visual conditions that differ from training: a robot trained in a well-lit lab must operate in dim warehouses, and a driving agent trained on sunny roads must handle rain or snow. Widely used benchmarks such as DMControl or Atari do not systematically test this, and when agents fail, it is unclear whether the root cause is visual shift or something else.
By providing a clean diagnostic tool, KAGE-Bench enables researchers to answer a question that has been surprisingly difficult to isolate: Does my agent actually understand the task, or is it just memorizing visual features? This distinction is critical for building robust systems.
Implications for AI Practitioners
For those developing pixel-based RL agents, KAGE-Bench offers several concrete benefits. First, it provides a standardized way to compare visual robustness across different architectures and training methods, such as data augmentation, domain randomization, contrastive learning, or latent dynamics models. Second, it can serve as a debugging tool during development: if an agent performs well on standard benchmarks but fails on KAGE-Bench, the practitioner knows to focus on visual invariance rather than reward design or exploration.
The benchmark’s “known-axis” design is particularly valuable. Because the axes of variation are explicitly controlled, researchers can identify which types of visual change are most problematic for their agent. This granular diagnostic capability is far more useful than a single aggregate score.
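A per-axis diagnostic of this kind can be sketched in a few lines. The example below is an assumption about how one might summarize results, not the paper's own metric: `per_axis_gap` and `worst_axis` are hypothetical helpers, and the gap is defined here simply as mean training return minus mean return under a shift on one axis. The numbers are placeholders chosen for illustration, not results from the paper.

```python
from statistics import mean


def per_axis_gap(train_returns, shifted_returns_by_axis):
    """Return {axis: mean train return - mean return under that axis's shift}."""
    baseline = mean(train_returns)
    return {
        axis: baseline - mean(returns)
        for axis, returns in shifted_returns_by_axis.items()
    }


def worst_axis(gaps):
    """Name of the axis with the largest generalization gap."""
    return max(gaps, key=gaps.get)


if __name__ == "__main__":
    # Placeholder episode returns (illustrative, not measured data).
    train_returns = [10.0, 10.0]
    shifted = {
        "background": [4.0, 6.0],
        "lighting": [9.0, 9.0],
    }
    gaps = per_axis_gap(train_returns, shifted)
    print(gaps)              # {'background': 5.0, 'lighting': 1.0}
    print(worst_axis(gaps))  # background
```

Reporting the whole dictionary rather than its average keeps the per-axis information that a single aggregate score would discard.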
However, practitioners should note that KAGE-Bench is not a replacement for full-stack evaluation. Real-world distribution shifts are often compound and unpredictable. The benchmark is best used as a targeted stress test for visual generalization, complementing broader evaluations that include dynamics and reward shifts.
Key Takeaways
- KAGE-Bench isolates visual distribution shift from other confounding factors (dynamics, rewards), enabling precise diagnosis of agent failures.
- The benchmark uses controlled, known axes of visual variation (color, texture, lighting) to measure generalization systematically.
- For practitioners, it provides a standardized tool for comparing visual robustness and debugging whether agents rely on task understanding versus visual memorization.
- The benchmark is a diagnostic complement, not a replacement: real-world shifts remain compound and require broader evaluation strategies.