Research 2026-07-02

Predicting LLM Reasoning Performance with Small Proxy Model

Originally published by arXiv cs.AI

arXiv:2509.21013v4 Announce Type: replace-cross Abstract: Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit...

The Proxy Model Paradox: Predicting Reasoning Performance Without the Full Cost

A new preprint (arXiv:2509.21013) tackles a pressing bottleneck in LLM development: how to tell whether a pre-training dataset will improve a large model's reasoning before paying for the full training run. The researchers propose using a smaller, cheaper "proxy model" to forecast the reasoning performance of a much larger target model, effectively creating a low-cost simulation of scaling outcomes.

The core insight is straightforward but technically demanding. Instead of blindly assembling massive training datasets and hoping they improve reasoning, the method evaluates candidate data on a small proxy model first. The proxy's performance on reasoning benchmarks then serves as a predictor for how a scaled-up version would perform. This flips the traditional "train first, evaluate later" pipeline into an "evaluate first, train selectively" paradigm.
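One common way to check whether a proxy's scores are a usable predictor is to compare how the proxy and the target rank the same candidate datasets. The sketch below computes a Spearman rank correlation using only the standard library. It is an illustration under stated assumptions, not the paper's procedure: the function names and the accuracy figures are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical benchmark accuracies for four candidate data mixes,
# measured once on the proxy and once on the target model.
proxy_scores = [0.21, 0.25, 0.19, 0.30]
target_scores = [0.48, 0.55, 0.50, 0.61]

rho = spearman(proxy_scores, target_scores)
print(f"rank agreement between proxy and target: {rho:.2f}")
```

A value near 1 means the proxy orders the candidates the way the target does, which is what matters for dataset selection; absolute accuracy on the proxy can be far below the target's without hurting its usefulness.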

Why This Matters

The significance here is twofold. First, pre-training a frontier LLM now costs tens of millions of dollars, and reasoning capabilities are particularly expensive to optimize because they require complex chain-of-thought data and multi-step verification. A reliable proxy model could reduce this cost by orders of magnitude: if a proxy can accurately flag which data subsets degrade or improve reasoning, developers avoid wasting compute on dead ends.

Second, the approach addresses a known failure mode: data quality interventions that improve performance on small models often reverse or plateau when scaled. The proxy method aims to detect these scaling inversions early. If the proxy shows diminishing returns on a specific data augmentation, the team can abandon it before committing to full-scale training.
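A scaling inversion can be made concrete as a pair of data interventions whose ordering flips between the small and the large scale. The sketch below lists such pairs for a handful of pilot runs where scores at both scales are already known, which is how one might validate a proxy before trusting it. The function name, intervention names, and scores are all hypothetical.

```python
from itertools import combinations


def find_inversions(small_scale, large_scale):
    """Return pairs of intervention names whose ordering flips between scales."""
    flipped = []
    for a, b in combinations(sorted(small_scale), 2):
        small_gap = small_scale[a] - small_scale[b]
        large_gap = large_scale[a] - large_scale[b]
        # Opposite signs mean the better option at small scale is worse at large scale.
        if small_gap * large_gap < 0:
            flipped.append((a, b))
    return flipped


# Hypothetical reasoning-benchmark accuracies for three data interventions.
small = {"baseline": 0.20, "dedup": 0.24, "synthetic_cot": 0.27}
large = {"baseline": 0.50, "dedup": 0.56, "synthetic_cot": 0.53}

inversions = find_inversions(small, large)
print(inversions)
```

In this made-up example the synthetic chain-of-thought mix looks best at small scale but falls behind deduplication at large scale, so that pair is reported. Ties are not counted as inversions.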

Implications for AI Practitioners

For those building or fine-tuning reasoning models, this research suggests a practical workflow shift. Rather than relying on intuition or small-scale ablation studies that may not generalize, teams can systematically screen candidate datasets using a proxy that is 10 to 100x smaller than the target model. The catch is that the proxy must be architecturally aligned with the target (same tokenizer, attention pattern, and training objective) to produce reliable predictions. A mismatched proxy could yield misleading signals.
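The alignment requirement can be turned into a simple pre-flight check that compares the relevant settings of the proxy and target before any screening run. This is a minimal sketch: the configuration keys and values are hypothetical placeholders rather than fields from any particular training framework.

```python
# Settings the article names as needing to match; the key names are assumptions.
ALIGNMENT_KEYS = ("tokenizer", "attention", "objective")


def alignment_mismatches(proxy_cfg, target_cfg, keys=ALIGNMENT_KEYS):
    """Return {key: (proxy_value, target_value)} for every key that differs."""
    return {
        k: (proxy_cfg.get(k), target_cfg.get(k))
        for k in keys
        if proxy_cfg.get(k) != target_cfg.get(k)
    }


# Hypothetical configurations; parameter count is allowed to differ.
proxy_cfg = {
    "tokenizer": "bpe-32k",
    "attention": "causal",
    "objective": "next-token",
    "params": 1e9,
}
target_cfg = {
    "tokenizer": "bpe-32k",
    "attention": "sliding-window",
    "objective": "next-token",
    "params": 3e10,
}

mismatches = alignment_mismatches(proxy_cfg, target_cfg)
if mismatches:
    print("proxy is not aligned with target:", mismatches)
```

Model size is deliberately left out of the compared keys, since being smaller is the whole point of the proxy.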

There are also limitations to consider. The paper likely focuses on specific reasoning benchmarks (e.g., math word problems, logical deduction), and it remains unclear whether the proxy method generalizes to open-ended reasoning tasks like creative problem-solving or multi-hop commonsense inference. Additionally, the proxy itself requires training and validation, adding overhead that must be weighed against the savings.
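The overhead trade-off reduces to simple arithmetic: the spend avoided by skipping unpromising full-scale runs, minus the cost of building the proxy and screening candidates with it. The sketch below uses arbitrary cost units and made-up figures purely to show the calculation; none of the numbers come from the paper.

```python
def net_savings(full_run_cost, runs_avoided, proxy_setup_cost,
                cost_per_screen, n_screened):
    """Spend avoided minus proxy overhead, all in the same cost unit."""
    overhead = proxy_setup_cost + cost_per_screen * n_screened
    return full_run_cost * runs_avoided - overhead


# Hypothetical figures in arbitrary units: screening 20 candidate datasets
# lets the team skip 2 full-scale runs.
saving = net_savings(
    full_run_cost=100.0,
    runs_avoided=2,
    proxy_setup_cost=15.0,
    cost_per_screen=1.0,
    n_screened=20,
)
print(f"net saving: {saving}")
```

If screening avoids no full-scale runs, the result is negative and equal to the proxy overhead, which is the case the limitation above warns about.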

For AI labs, the strategic takeaway is clear: the era of brute-force data scaling is ending. The winners will be those who can predict outcomes cheaply and iterate on data composition with surgical precision. Proxy model evaluation is not a silver bullet, but it is a necessary tool for the next phase of efficient model development.

Key Takeaways

A small proxy model can predict the reasoning performance of a much larger LLM, enabling cheaper data optimization before expensive full-scale training.
This approach reduces the risk of investing in data strategies that fail to improve reasoning when scaled, saving millions in compute costs.
Practitioners must ensure architectural alignment between proxy and target models for predictions to be reliable; mismatched proxies may produce false signals.
The method is most applicable to structured reasoning benchmarks; its effectiveness on open-ended or creative reasoning tasks remains unproven.
