Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
arXiv:2606.19750v1 (cross-listed). Abstract: Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum...
What Happened
A new arXiv preprint introduces "Manifold Bandits," a method that reframes curriculum learning for reinforcement learning in large language models as a multi-armed bandit problem operating over the latent geometry of LLMs. Instead of sampling training problems randomly or with simple heuristics, the approach dynamically selects problems based on their position in the model's learned representation space. The "bandit" component learns which regions of this latent manifold yield the most efficient learning gains, then prioritizes problems from those areas. This is a Bayesian approach, meaning it maintains uncertainty estimates about which problem types are most valuable and explores accordingly.
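To make the bandit framing concrete, here is a minimal sketch of Thompson sampling over a handful of latent regions. It is not the paper's algorithm, which the abstract excerpt above does not specify. It assumes the representation space has already been split into a fixed number of regions, that each training step returns a scalar "learning gain" for the sampled region, and that gains are Gaussian with unit noise. `RegionBandit`, `simulate`, and `true_gains` are hypothetical names introduced for illustration.

```python
import math
import random


class RegionBandit:
    """Thompson sampling over a fixed set of latent regions (arms).

    Each region keeps a Gaussian posterior over its mean learning gain,
    assuming unit observation noise. Illustrative only, not the paper's method.
    """

    def __init__(self, n_regions, prior_mean=0.0, prior_precision=1.0, seed=0):
        self.mean = [prior_mean] * n_regions
        self.precision = [prior_precision] * n_regions
        self.rng = random.Random(seed)

    def select(self):
        # Draw one plausible gain per region from its posterior and pick
        # the region with the largest draw. Uncertain regions have wide
        # posteriors, so they still win occasionally (exploration).
        draws = [
            self.rng.gauss(m, 1.0 / math.sqrt(p))
            for m, p in zip(self.mean, self.precision)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, region, gain):
        # Conjugate Gaussian update with unit noise precision.
        old_precision = self.precision[region]
        new_precision = old_precision + 1.0
        self.mean[region] = (old_precision * self.mean[region] + gain) / new_precision
        self.precision[region] = new_precision


def simulate(true_gains, steps=2000, seed=0):
    """Run the bandit against synthetic regions with noisy gains."""
    bandit = RegionBandit(len(true_gains), seed=seed)
    noise = random.Random(seed + 1)
    counts = [0] * len(true_gains)
    for _ in range(steps):
        region = bandit.select()
        observed = true_gains[region] + noise.gauss(0.0, 1.0)
        bandit.update(region, observed)
        counts[region] += 1
    return counts, bandit.mean
```

Calling `simulate([0.1, 0.5, 1.0])` should concentrate most pulls on the third region while still sampling the others now and then. In a real pipeline the synthetic gain would be replaced by whatever learning signal the training loop exposes, which the source does not specify.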
Why It Matters
Current RL training for LLMs, whether for math reasoning, code generation, or instruction following, typically uses either uniform sampling or static difficulty-based curricula. Both are suboptimal: uniform sampling wastes compute on problems the model already handles well, while static curricula fail to adapt as the model's capabilities shift during training.
The key insight here is that the model's own latent geometry reveals where learning is happening. Problems that are "close" in the representation space tend to require similar reasoning steps. By treating problem selection as a bandit problem over this manifold, the method automatically balances exploitation (picking problems from high-learning regions) with exploration (checking under-sampled regions that might become valuable as the model improves). This is a principled, Bayesian solution to a problem that has largely been addressed with ad-hoc heuristics.
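One simple way to turn "closeness" in representation space into discrete bandit arms is nearest-centroid assignment. The sketch below assumes each problem already has an embedding vector and that a set of region centroids exists (for example, from clustering); `cosine_similarity`, `assign_region`, and `centroids` are hypothetical names, and the paper may well model the manifold differently, for instance continuously.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def assign_region(embedding, centroids):
    """Return the index of the centroid most similar to the embedding."""
    sims = [cosine_similarity(embedding, c) for c in centroids]
    return max(range(len(centroids)), key=sims.__getitem__)


# Two toy regions in a 2-D embedding space.
centroids = [[1.0, 0.0], [0.0, 1.0]]
region_a = assign_region([0.9, 0.1], centroids)
region_b = assign_region([0.2, 0.8], centroids)
```

Problems that land in the same region then share one bandit arm, so feedback from one problem informs the sampling of its neighbors.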
The paper also addresses a practical bottleneck: recomputing latent representations for every problem at every step is expensive. The authors propose approximations to make the method tractable, which is critical for real-world adoption.
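The source does not say which approximations the paper uses. As one illustration of the general idea, the sketch below caches each problem's embedding and only recomputes it after a fixed number of training steps, trading some staleness for far fewer forward passes. `EmbeddingCache`, `embed_fn`, and `refresh_every` are hypothetical names, and this is a generic caching pattern rather than the authors' technique.

```python
class EmbeddingCache:
    """Reuse a problem's embedding until it is `refresh_every` steps old."""

    def __init__(self, embed_fn, refresh_every):
        self.embed_fn = embed_fn            # callable: problem text -> embedding
        self.refresh_every = refresh_every  # max age of a cached embedding, in steps
        self._store = {}                    # problem_id -> (step_computed, embedding)

    def get(self, problem_id, problem_text, step):
        cached = self._store.get(problem_id)
        if cached is not None and step - cached[0] < self.refresh_every:
            return cached[1]  # still fresh enough: skip the expensive call
        embedding = self.embed_fn(problem_text)
        self._store[problem_id] = (step, embedding)
        return embedding
```

A larger `refresh_every` saves more compute but lets cached embeddings lag further behind the current model, so the right setting would need to be tuned per pipeline.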
Implications for AI Practitioners
For those training or fine-tuning LLMs with RL, this work points to a way to improve sample efficiency. If the method scales, practitioners could reach a given performance target with less training compute, or reach better performance with the same compute. The bandit framework is also interpretable: you can inspect which regions of the latent space are being prioritized and why.
However, there are caveats. The method adds overhead: you need to compute or approximate latent representations for each candidate problem, and maintain a bandit policy. For very large-scale training with billions of tokens, this overhead must be weighed against the gains. The paper's approximations will need validation in production settings.
Additionally, the approach assumes the latent geometry is stable enough to guide curriculum decisions. If the representation space shifts dramatically during training (which it can, especially early on), the bandit may chase a moving target. The Bayesian uncertainty estimates help, but practitioners should monitor for this.
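One hedged way to monitor for this is to track how often a fixed probe set of problems changes region between checkpoints, and to widen the bandit's uncertainty when that churn is high. The sketch below assumes discrete region assignments and per-region posterior precisions; `region_churn`, `inflate_uncertainty`, `keep`, and `floor` are hypothetical names, and the discounting step is a generic non-stationary bandit heuristic, not something the source attributes to the paper.

```python
def region_churn(old_regions, new_regions):
    """Fraction of probe problems whose region changed between checkpoints."""
    if len(old_regions) != len(new_regions):
        raise ValueError("assignment lists must have the same length")
    if not old_regions:
        return 0.0
    changed = sum(1 for o, n in zip(old_regions, new_regions) if o != n)
    return changed / len(old_regions)


def inflate_uncertainty(precisions, keep=0.9, floor=1.0):
    """Shrink posterior precisions so stale evidence counts for less.

    Lower precision means a wider posterior, which pushes a Bayesian
    bandit to re-explore regions it had previously settled on.
    """
    return [max(floor, keep * p) for p in precisions]


# Toy probe set of four problems: one of them moved to a new region.
churn = region_churn([0, 1, 1, 2], [0, 1, 2, 2])
widened = inflate_uncertainty([10.0, 2.0], keep=0.5, floor=1.0)
```

A practitioner might log `churn` at every checkpoint and apply `inflate_uncertainty` only when it crosses a threshold chosen for their pipeline.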
Key Takeaways
- Dynamic curriculum via latent geometry: Manifold Bandits selects training problems by modeling their position in the LLM's representation space, not by static difficulty scores.
- Principled exploration-exploitation: The Bayesian bandit framework naturally balances trying new problem types with focusing on those that yield the fastest learning.
- Potential compute savings: For RL-based LLM training, this could reduce the number of samples needed to reach a target capability level.
- Practical overhead considerations: The method requires latent representations and bandit updates, so practitioners should benchmark the cost-benefit tradeoff for their specific training pipeline.