A²utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction
arXiv:2607.02141v1. Abstract: Most LP-from-text benchmarks are static datasets of word problems written and labeled by hand. Once such a dataset is released, its size is fixed, its difficulty is fixed, and every problem can leak into the training data of future LLMs. We present...
The Benchmark That Builds Itself
The release of A²utoLPBench marks a significant departure from how AI benchmarks have traditionally been constructed. Rather than offering yet another static collection of hand-crafted linear programming (LP) word problems, this work introduces a fully automated pipeline that generates benchmark instances using inverse Karush–Kuhn–Tucker (KKT) conditions. The result is a dynamic, scalable, and importantly agent-friendly evaluation suite that addresses several long-standing weaknesses in the current evaluation ecosystem.
What Makes This Different
Conventional LP benchmarks, such as the classic Netlib collection or the more recent LP-from-text datasets, are frozen artifacts. Once published, they cannot grow, their difficulty cannot be adjusted, and every problem text becomes a potential contamination risk for future large language models (LLMs). A²utoLPBench is built to sidestep all three limitations. It constructs problems backwards: the generator starts from a chosen optimal solution and derives the problem's constraints so that the KKT optimality conditions hold at that solution. This lets the system produce an effectively unlimited number of distinct, verifiable LP instances. Each generated problem comes with an answer that is correct by construction, which removes the need for human labeling and avoids the ambiguity that often plagues natural-language math problems.
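To make the backwards construction concrete, here is a minimal sketch of one way an inverse-KKT generator could work. It is not the authors' pipeline, which the abstract excerpt does not describe in detail. It assumes the canonical form "minimize c·x subject to A x >= b, x >= 0", and the names make_lp and kkt_certificate_holds are hypothetical. The idea is to pick the optimal primal point and its dual multipliers first, then back out b and c so that feasibility and complementary slackness hold.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def make_lp(n_vars, n_cons, seed):
    """Build 'min c.x  s.t.  A x >= b, x >= 0' with a known optimum.

    Hypothetical helper: the canonical form and value ranges are assumptions.
    """
    rng = random.Random(seed)
    # Step 1: choose the optimal primal point and dual multipliers up front.
    x_star = [rng.choice([0, rng.randint(1, 9)]) for _ in range(n_vars)]
    y_star = [rng.choice([0, rng.randint(1, 5)]) for _ in range(n_cons)]
    A = [[rng.randint(-5, 5) for _ in range(n_vars)] for _ in range(n_cons)]
    # Step 2: complementary slackness. A row with a positive multiplier must
    # be tight; a variable with a positive value must have zero reduced cost.
    row_slack = [0 if y > 0 else rng.randint(0, 4) for y in y_star]
    reduced_cost = [0 if x > 0 else rng.randint(0, 4) for x in x_star]
    # Step 3: back out the data: b = A x* - slack, c = A^T y* + reduced cost.
    b = [dot(A[i], x_star) - row_slack[i] for i in range(n_cons)]
    c = [dot([A[i][j] for i in range(n_cons)], y_star) + reduced_cost[j]
         for j in range(n_vars)]
    return {"A": A, "b": b, "c": c, "x_star": x_star, "y_star": y_star,
            "optimal_value": dot(c, x_star)}


def kkt_certificate_holds(lp):
    """Check primal feasibility, dual feasibility, and a zero duality gap."""
    A, b, c = lp["A"], lp["b"], lp["c"]
    x, y = lp["x_star"], lp["y_star"]
    primal_ok = all(v >= 0 for v in x) and all(
        dot(A[i], x) >= b[i] for i in range(len(b)))
    dual_ok = all(v >= 0 for v in y) and all(
        dot([A[i][j] for i in range(len(b))], y) <= c[j]
        for j in range(len(c)))
    return primal_ok and dual_ok and dot(c, x) == dot(b, y)


if __name__ == "__main__":
    lp = make_lp(n_vars=4, n_cons=3, seed=0)
    print(lp["c"], lp["b"], lp["optimal_value"])
    print(kkt_certificate_holds(lp))
```

All data are integers, so the certificate check is exact. Note that this construction certifies the optimal value but not uniqueness of the optimal point, so a grader built on it would more safely compare objective values than solution vectors.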
Why It Matters for Evaluation Integrity
The contamination problem in LLM evaluation is now widely acknowledged. Models trained on web-scale data inevitably encounter benchmark problems during training, making it difficult to distinguish genuine reasoning from memorization. A²utoLPBench’s auto-generation capability means that evaluators can always produce fresh, unseen instances. This is not merely a convenience—it is a structural improvement to the reliability of performance measurement. For AI practitioners working on mathematical reasoning or operations research applications, this offers a path toward more trustworthy model comparisons.
Implications for Agentic Workflows
The “agent-friendly” framing is particularly noteworthy. Many existing LP benchmarks require models to parse messy natural language and then formulate the mathematical program. A²utoLPBench appears designed to test the full pipeline: understanding the problem statement, constructing the LP, solving it, and interpreting the result. This aligns with the growing trend of evaluating LLMs not as isolated text generators but as components in multi-step reasoning systems. For developers building AI agents that handle optimization tasks—supply chain planning, resource allocation, scheduling—this benchmark provides a more realistic and reproducible test environment.
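As an illustration of how the final step of such a pipeline could be scored, the sketch below grades an agent's reported solution against an optimum known in advance. The source does not describe the benchmark's actual scoring protocol, so everything here is an assumption: the canonical form "minimize c·x subject to A x >= b, x >= 0", the tolerance, and the hypothetical names Verdict and grade.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    feasible: bool  # reported point satisfies all constraints
    optimal: bool   # feasible and objective matches the known optimum


def grade(A, b, c, known_optimum, x_reported, tol=1e-6):
    """Grade a reported solution for 'min c.x  s.t.  A x >= b, x >= 0'.

    Hypothetical grader: compares objective values, not solution vectors,
    because an LP can have more than one optimal point.
    """
    nonneg = all(v >= -tol for v in x_reported)
    rows_ok = all(
        sum(a * v for a, v in zip(row, x_reported)) >= bi - tol
        for row, bi in zip(A, b))
    feasible = nonneg and rows_ok
    value = sum(ci * v for ci, v in zip(c, x_reported))
    close = abs(value - known_optimum) <= tol * max(1.0, abs(known_optimum))
    return Verdict(feasible=feasible, optimal=feasible and close)


if __name__ == "__main__":
    # Toy instance: minimize x1 + 2*x2 subject to x1 + x2 >= 3, x >= 0.
    # The optimum is 3, attained at (3, 0).
    A, b, c = [[1, 1]], [3], [1, 2]
    print(grade(A, b, c, 3, [3, 0]))  # feasible and optimal
    print(grade(A, b, c, 3, [0, 3]))  # feasible, but objective is 6
    print(grade(A, b, c, 3, [1, 1]))  # violates x1 + x2 >= 3
```

Separating feasibility from optimality gives partial diagnostic signal: an agent that formulates the LP correctly but solves it poorly fails differently from one that misreads a constraint.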
A Word of Caution
While the inverse-KKT construction elegantly ensures correctness, it also means the generated problems may lack the messy, real-world ambiguity that human-written problems often contain. In that sense, the benchmark may be too clean. Practitioners should be aware that strong performance on A²utoLPBench does not necessarily translate to handling poorly specified or incomplete business problems. Additionally, how far the generator can scale without repeating itself will depend on the diversity of the underlying problem templates, a risk the authors will need to address as the benchmark evolves.
Key Takeaways
- A²utoLPBench introduces a fully automated benchmark generation method using inverse-KKT construction, eliminating the need for human labeling and static datasets.
- The auto-generation capability directly mitigates data contamination risks, enabling more reliable and repeatable evaluation of LLMs on LP tasks.
- The benchmark is designed for agentic evaluation, testing the full reasoning pipeline from problem understanding to solution interpretation.
- Practitioners should be aware that generated problems may lack the ambiguity of real-world scenarios, so performance should be validated against messy, human-authored cases.