Research · 2026-05-12
REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
Source: Arxiv CS.AI
arXiv:2604.01527v3 (announce type: replace-cross)
Abstract: Production deployment of AI coding agents requires fast, reproducible evaluation signals. Existing industrial practices trade off speed and fidelity: online A/B testing takes weeks and risks user experience, shadow deployment yields signals...
arxiv, papers, agents, benchmark