CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
arXiv:2409.11363v2. Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly...
What Happened
Researchers have released CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark that evaluates AI agents on computational reproducibility: independently verifying published research results by re-running the original code and data. According to the arXiv paper (arXiv:2409.11363), the benchmark comprises 270 tasks based on 90 scientific papers across computer science, social science, and medicine. It presents AI systems with scientific papers and their associated code repositories, then tasks agents with reproducing key figures, tables, or statistical outputs. This moves beyond typical coding benchmarks by requiring agents to navigate real-world research artifacts, handle incomplete documentation, and work through dependency conflicts, all part of the messy reality of computational science.
Why It Matters
The reproducibility crisis in science is well documented: in a 2016 Nature survey, more than 70% of responding researchers said they had tried and failed to reproduce another scientist's experiments. CORE-Bench targets this pain point by creating a standardized, scalable testbed for AI agents that assist in verification. Unlike benchmarks that test isolated skills (e.g., code generation or math reasoning), CORE-Bench measures an agent's ability to execute a complete, multi-step workflow: understanding a paper, setting up an environment, running code, comparing outputs, and diagnosing failures.
This matters because computational reproducibility is a high-stakes, labor-intensive task currently performed by humans—often graduate students or postdocs. If AI agents can reliably automate even a fraction of this work, the scientific community could dramatically scale verification efforts, potentially catching errors, fraud, or methodological flaws before they propagate through the literature. The benchmark's design also forces agents to handle ambiguity and incomplete information, which mirrors the challenges of real-world scientific practice more closely than synthetic benchmarks.
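To make the "comparing outputs" step of that workflow concrete, here is a minimal Python sketch. It assumes the reported and reproduced results are already available as named numbers, and it uses a simple relative tolerance. The function name `compare_results`, the metric names, and the 5% tolerance are illustrative assumptions, not CORE-Bench's actual scoring rule.

```python
import math


def compare_results(reported, reproduced, rel_tol=0.05):
    """Label each reported metric as 'match', 'mismatch', or 'missing'.

    reported / reproduced: dicts mapping a metric name to a float.
    rel_tol: allowed relative difference (an assumed value, not from the paper).
    """
    report = {}
    for name, expected in reported.items():
        if name not in reproduced:
            # The rerun never produced this number at all.
            report[name] = "missing"
        elif math.isclose(expected, reproduced[name], rel_tol=rel_tol):
            report[name] = "match"
        else:
            report[name] = "mismatch"
    return report


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    paper = {"accuracy": 0.912, "f1": 0.88, "auc": 0.95}
    rerun = {"accuracy": 0.905, "f1": 0.70}
    for metric, status in compare_results(paper, rerun).items():
        print(f"{metric}: {status}")
```

A tolerance is used instead of exact equality because reruns of stochastic code rarely reproduce a number digit for digit. What counts as "close enough" is a judgment call that depends on the result being checked.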
Implications for AI Practitioners
For AI engineers and researchers building agentic systems, CORE-Bench highlights several critical gaps. First, current agents struggle with environment setup and dependency management—a mundane but essential skill that existing benchmarks largely ignore. Practitioners should prioritize agents that can interact with package managers, resolve version conflicts, and handle system-level operations. Second, the benchmark reveals that agents need robust error recovery: when a script fails, the agent must diagnose why (missing data, wrong Python version, hardware incompatibility) and adapt, rather than simply retrying the same command.
Third, CORE-Bench underscores the importance of multimodal understanding. Agents must parse LaTeX equations, interpret figure captions, and cross-reference code outputs with paper claims—skills that require integrating vision, language, and code execution. For teams building general-purpose coding agents, this benchmark offers a more realistic stress test than isolated coding challenges. Finally, the benchmark's focus on scientific credibility suggests a growing market for AI tools that can audit research outputs, which could become a standard part of academic publishing workflows.
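The error-recovery point above (diagnose why a script failed, then adapt, instead of retrying the same command) can be sketched roughly as follows. This is a heuristic illustration only: the failure categories, the regular expressions, and the suggested actions are assumptions made for this example and are not taken from the CORE-Bench paper or any specific agent.

```python
import re

# Hypothetical failure categories with rough stderr patterns that suggest them.
# Order matters: the first matching pattern wins.
FAILURE_PATTERNS = [
    ("missing_dependency", re.compile(r"ModuleNotFoundError|ImportError|No module named")),
    ("missing_data", re.compile(r"FileNotFoundError|No such file or directory")),
    ("python_version", re.compile(r"requires Python|SyntaxError")),
    ("hardware", re.compile(r"CUDA|out of memory", re.IGNORECASE)),
]

# Hypothetical next steps an agent might take for each category.
NEXT_ACTION = {
    "missing_dependency": "install the missing package, then rerun",
    "missing_data": "locate or download the expected input files",
    "python_version": "recreate the environment with another interpreter version",
    "hardware": "fall back to CPU or reduce the batch size",
    "unknown": "inspect the full log before attempting anything else",
}


def diagnose(stderr):
    """Return a coarse failure category for a script's stderr text."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(stderr):
            return label
    return "unknown"


if __name__ == "__main__":
    log = "ModuleNotFoundError: No module named 'torch'"
    category = diagnose(log)
    print(category, "->", NEXT_ACTION[category])
```

Pattern matching on log text is brittle, and a real agent would likely combine it with reading the repository's documentation and environment files. The sketch only shows the control-flow idea: each failure is mapped to a different next step rather than to a blind retry.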
Key Takeaways
- CORE-Bench evaluates AI agents on end-to-end computational reproducibility, a high-value but under-benchmarked task that requires navigating real-world research code and documentation.
- The benchmark reveals that current agents struggle most with environment setup, dependency management, and error recovery—not just code generation.
- Success on CORE-Bench would enable scalable verification of published research, directly addressing the reproducibility crisis in computational science.
- AI practitioners should treat this benchmark as a realistic proxy for deploying agents in complex, multi-step scientific workflows, not just isolated coding tasks.