BeClaude
Research | 2026-07-03

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

Originally published by arXiv cs.AI

arXiv:2607.01897v1 (announce type: cross). Abstract: We introduce Rank-Then-Act (RTA), a framework for learning control policies from expert video demonstrations without environment rewards. RTA trains a Vision-Language Model (VLM) offline as a progress-based ordinal scorer, using a Group Relative...

What Happened

Researchers have introduced Rank-Then-Act (RTA), a framework that lets AI agents learn control policies from expert video demonstrations without environment rewards. The core idea is to train a Vision-Language Model (VLM) offline as a "progress-based ordinal scorer" using a group-relative training approach. Instead of assigning hand-specified numerical rewards, RTA learns to rank frames by how far they have progressed toward task completion, then uses this ranking to guide action selection.

This shifts the learning paradigm from reward maximization to progress ordering. The VLM essentially becomes a judge that can say "this frame is closer to the goal than that frame" without ever needing to quantify how much closer. The policy then acts to increase its progress rank over time.
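The ordering idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical scorer has already produced one scalar per frame of a demonstration, and both function names are invented for this example. The first function checks how often the scores respect frame order. The second turns score changes into a per-step progress signal, which is one assumed way a policy could "act to increase its progress rank".

```python
from itertools import combinations
from typing import List, Sequence


def pairwise_order_accuracy(scores: Sequence[float]) -> float:
    """Fraction of frame pairs (i < j) where the later frame scores higher.

    `scores[t]` is the (hypothetical) scorer output for frame t of one
    expert demonstration, listed in temporal order.
    """
    pairs = list(combinations(range(len(scores)), 2))
    if not pairs:
        return 1.0  # zero or one frame: nothing can be mis-ordered
    correct = sum(1 for i, j in pairs if scores[i] < scores[j])
    return correct / len(pairs)


def progress_signal(scores: Sequence[float]) -> List[float]:
    """Change in score between consecutive frames (assumed policy signal).

    Positive values mean the scorer judges the new frame to be further
    along than the previous one.
    """
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


# Example: the third frame is judged to be a step backwards.
demo_scores = [0.0, 0.5, 0.25, 1.0]
accuracy = pairwise_order_accuracy(demo_scores)
signal = progress_signal(demo_scores)
```

Only the sign of each comparison matters for the accuracy measure, which is the sense in which the scorer never has to quantify how much closer a frame is.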

Why It Matters

The significance of RTA lies in addressing one of reinforcement learning's most persistent bottlenecks: reward engineering. In traditional RL, practitioners spend considerable effort designing reward functions that correctly incentivize desired behaviors. Sparse rewards can make learning very slow, while dense, hand-shaped rewards often lead to reward hacking or unintended behaviors. By replacing the hand-designed environment reward with a progress ordering learned from demonstrations, RTA aims to sidestep these issues.

The use of expert video demonstrations is particularly practical. Video data is abundant—from YouTube tutorials to recorded robot teleoperation sessions—while paired reward signals are rare. This makes RTA potentially applicable to domains where reward specification is difficult, such as surgical robotics, autonomous driving in novel scenarios, or household manipulation tasks.

Furthermore, the VLM scorer is trained offline, so it can be fit once on recorded demonstrations without environment interaction and then reused. To the extent that policy learning relies on this scorer rather than on trial-and-error reward discovery, this can reduce the computational cost and safety risks associated with online RL exploration.

Implications for AI Practitioners

For practitioners, RTA suggests a shift in focus from reward design to demonstration curation. The quality and diversity of expert videos will likely become the primary determinant of policy performance. Practitioners should invest in collecting comprehensive demonstration datasets that cover failure modes, recovery behaviors, and multiple successful strategies.

The framework also raises a question about sample efficiency. An ordinal ranking carries less information than a precise numerical reward, so one might expect slower convergence. On the other hand, a relative signal is insensitive to the scale of the underlying scores, which may reduce variance and improve stability. This trade-off is worth exploring in applied settings.
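One concrete property behind the stability argument is that ranks do not change under any strictly increasing rescaling of the scores, while the raw values can change by orders of magnitude. The snippet below is only an illustration of that property with made-up numbers; the `ranks` helper is hypothetical and not part of RTA.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Return the rank (0 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Made-up raw progress scores for four frames.
raw = [0.2, 1.5, 0.9, 3.0]

# A strictly increasing but very aggressive rescaling of the same scores.
rescaled = [math.exp(10 * v) for v in raw]

raw_ranks = ranks(raw)
rescaled_ranks = ranks(rescaled)
same_ordering = raw_ranks == rescaled_ranks  # ordering survives the rescaling
```

The cost of this invariance is that the magnitudes are discarded, which is the information loss discussed above.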

Practitioners should also note the VLM dependency. RTA's performance will hinge on the VLM's ability to understand visual progress. This may restrict the method to tasks where progress is visually apparent (e.g., assembly, navigation) and make it a poor fit for tasks where progress is abstract (e.g., dialogue management, strategic planning). Domain-specific fine-tuning of the VLM may be necessary.

Finally, the Group Relative approach suggests that batch size and ranking granularity will be important hyperparameters. Too few frames per group may yield noisy rankings; too many may overwhelm the VLM's capacity. Practitioners will need to experiment with these parameters for their specific tasks.
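To make the group-size discussion concrete, here is a hedged sketch of one common form of group-relative scoring. The abstract excerpt is truncated, so the exact objective RTA uses is not stated here; everything below is an assumption for illustration, including the function names, the pairwise-agreement reward, and the within-group standardization. Several candidate orderings of the same sampled frames are rewarded by their agreement with the true frame order, and each reward is then expressed relative to the group mean.

```python
import statistics
from itertools import combinations
from typing import Dict, List, Sequence


def ordering_reward(predicted: Sequence[str], true_pos: Dict[str, int]) -> float:
    """Pairwise agreement in [0, 1] between a predicted order and frame order.

    `predicted` lists frame ids from earliest to latest as predicted;
    `true_pos` maps each frame id to its index in the demonstration.
    """
    rank = {frame_id: r for r, frame_id in enumerate(predicted)}
    pairs = list(combinations(true_pos, 2))
    if not pairs:
        return 1.0
    agree = sum(
        1 for a, b in pairs
        if (rank[a] < rank[b]) == (true_pos[a] < true_pos[b])
    )
    return agree / len(pairs)


def group_relative(rewards: Sequence[float]) -> List[float]:
    """Standardize rewards within one group (assumed normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all candidates equally good
    return [(r - mean) / std for r in rewards]


# Three frames sampled from one demonstration (indices are made up).
true_pos = {"a": 3, "b": 10, "c": 25}
candidates = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
rewards = [ordering_reward(c, true_pos) for c in candidates]
advantages = group_relative(rewards)
```

With three frames there are only three pairs, so the reward can take just four distinct values. This is one way to see why very small groups give a coarse, noisy signal, while larger groups grow the number of pairs quadratically.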

Key Takeaways

  • RTA eliminates the need for reward engineering by learning progress-based ordinal rankings from expert video demonstrations, reducing a major bottleneck in RL deployment.
  • The framework's reliance on readily available video data makes it practical for real-world applications where reward specification is difficult or impossible.
  • Practitioners should prioritize demonstration dataset quality over reward design, and expect to fine-tune the underlying VLM for domain-specific visual progress understanding.
  • Key hyperparameters include group size for relative ranking and batch composition, which will require empirical tuning for optimal performance.