\textsc{DiARC}: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models
arXiv:2606.26530v1. Abstract: The Abstraction and Reasoning Corpus (ARC;~\citealp{chollet2019measure}) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches have attempted to...
What Happened
A new research paper introduces DiARC, a method built on distinguishing positive and negative samples to improve how large language models handle ARC (Abstraction and Reasoning Corpus) tasks. ARC tasks require models to infer abstract patterns from a small number of input-output grid examples and then apply those patterns to new grids. The core innovation in DiARC is a training strategy that explicitly teaches models to distinguish between correct (positive) and incorrect (negative) solution patterns, rather than simply learning to generate outputs from positive examples alone.
The approach involves augmenting training data with deliberately incorrect grid completions, then training the model to classify and reject these negative samples while reinforcing correct reasoning paths. This contrastive learning technique helps models develop a more robust understanding of the underlying rules, reducing the tendency to latch onto spurious correlations or surface-level grid features.
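As a rough illustration of the data-augmentation step, the sketch below builds labeled positive and negative samples from one ARC-style input/output pair. The corruption rule (recoloring a few random cells) and the names `corrupt_grid` and `build_labeled_samples` are assumptions made for this example; the summary above does not say how the paper actually constructs its negative samples.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]  # ARC grids are 2-D arrays of color indices 0-9

def corrupt_grid(grid: Grid, rng: random.Random, n_cells: int = 1) -> Grid:
    """Return a copy of `grid` with `n_cells` distinct cells recolored.

    Hypothetical corruption rule: each chosen cell gets a color that differs
    from its original one, so the result never equals the input grid.
    """
    corrupted = [row[:] for row in grid]  # copy rows so the original stays intact
    positions = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    for r, c in rng.sample(positions, min(n_cells, len(positions))):
        other_colors = [color for color in range(10) if color != grid[r][c]]
        corrupted[r][c] = rng.choice(other_colors)
    return corrupted

def build_labeled_samples(
    inp: Grid, out: Grid, n_negatives: int, seed: int = 0
) -> List[Tuple[Grid, Grid, int]]:
    """Pair the input with its correct output (label 1) and corrupted outputs (label 0)."""
    rng = random.Random(seed)
    samples = [(inp, out, 1)]
    for _ in range(n_negatives):
        samples.append((inp, corrupt_grid(out, rng), 0))
    return samples

# Usage: one positive and three synthetic negatives for a 2x2 task.
samples = build_labeled_samples([[0, 1], [1, 0]], [[1, 0], [0, 1]], n_negatives=3)
```

Random recoloring yields easy negatives. Harder ones, such as outputs that follow a plausible but wrong rule, would likely be more informative, but generating them depends on task details this sketch does not model.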
Why It Matters
ARC has become a benchmark for measuring genuine abstraction and reasoning capabilities in AI systems, as opposed to pattern matching or memorization. Most current LLM-based approaches to ARC struggle because the tasks require few-shot generalization from minimal data, a setting in which transformer-based models tend to perform poorly.
DiARC addresses a fundamental limitation: LLMs trained solely on correct examples often fail to develop a clear decision boundary between valid and invalid reasoning. By explicitly incorporating negative samples, the model learns not just what a correct answer looks like, but also what distinguishes it from plausible but wrong answers. This mirrors how humans learn, through both positive reinforcement and error correction.
The significance extends beyond ARC. Many real-world reasoning tasks, from code debugging to medical diagnosis, require distinguishing correct from incorrect solutions. DiARC’s methodology could be adapted to improve model performance in any domain where negative examples are available or can be synthetically generated.
Implications for AI Practitioners
For researchers and engineers working on reasoning tasks, DiARC offers a practical, data-efficient technique. Rather than requiring massive new datasets or architectural changes, the method works by modifying the training objective and data composition. This makes it accessible for teams with limited compute budgets.
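One way to picture "modifying the training objective" is a generation loss combined with a term that scores candidate outputs as correct or incorrect. The sketch below is an assumption made for illustration, not the paper's stated objective: `generation_nll` stands in for the usual negative log-likelihood on the correct output, `probs` are hypothetical model-assigned probabilities that each candidate is correct, and `neg_weight` is an invented weighting parameter.

```python
import math
from typing import Sequence

def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def combined_loss(
    generation_nll: float,
    probs: Sequence[float],
    labels: Sequence[int],
    neg_weight: float = 0.5,
) -> float:
    """Generation loss plus a weighted positive/negative discrimination term."""
    return generation_nll + neg_weight * binary_cross_entropy(probs, labels)

# Usage: one correct candidate (label 1) and two corrupted ones (label 0).
loss = combined_loss(generation_nll=1.2, probs=[0.9, 0.2, 0.1], labels=[1, 0, 0])
```

The discrimination term grows when the model assigns high probability to a corrupted candidate, so wrong but plausible outputs are penalized in addition to the usual reward for reproducing the correct one.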
Practitioners should consider three key applications:
- Few-shot learning pipelines: When building models that must generalize from minimal examples, explicitly adding negative samples during fine-tuning can sharpen decision boundaries.
- Safety and robustness: Teaching models to reject incorrect outputs is as important as teaching them to produce correct ones. DiARC’s approach could be applied to reduce hallucination rates in factual question answering.
- Evaluation methodology: The paper implicitly argues that accuracy on positive examples alone is insufficient. Practitioners should also consider measuring false positive rates, that is, how often a model confidently produces wrong answers that look plausible.
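The evaluation point above can be made concrete with a small metric sketch. It assumes each prediction comes with a scalar confidence (for example, a sequence probability), which is a hypothetical setup and not something the summary specifies. The function names and the 0.9 threshold are likewise illustrative.

```python
from typing import List, Sequence, Tuple

Grid = List[List[int]]
Prediction = Tuple[Grid, float, Grid]  # (predicted grid, confidence, target grid)

def exact_match_accuracy(predictions: Sequence[Prediction]) -> float:
    """Fraction of tasks where the predicted grid equals the target exactly."""
    if not predictions:
        return 0.0
    return sum(pred == target for pred, _, target in predictions) / len(predictions)

def confident_error_rate(predictions: Sequence[Prediction], threshold: float = 0.9) -> float:
    """Fraction of tasks answered wrongly with confidence at or above `threshold`."""
    if not predictions:
        return 0.0
    confident_wrong = sum(
        1 for pred, conf, target in predictions if conf >= threshold and pred != target
    )
    return confident_wrong / len(predictions)

# Usage: four toy predictions against the same 1x1 target grid.
predictions = [
    ([[1]], 0.95, [[1]]),  # confident and correct
    ([[2]], 0.97, [[1]]),  # confident but wrong
    ([[3]], 0.40, [[1]]),  # unsure and wrong
    ([[1]], 0.60, [[1]]),  # unsure and correct
]
accuracy = exact_match_accuracy(predictions)           # 0.5
confident_errors = confident_error_rate(predictions)   # 0.25
```

Reporting the two numbers together separates a model that is often wrong but uncertain from one that is wrong and confident. The second case is the costlier failure for downstream use.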
Key Takeaways
- DiARC improves ARC reasoning by training LLMs to distinguish positive (correct) from negative (incorrect) grid solutions, using contrastive learning to sharpen reasoning boundaries.
- The method addresses a core weakness of LLMs, generalizing from minimal examples, by explicitly teaching models what not to do, not just what to do.
- The approach is practical and resource-efficient, requiring no architectural changes, only modified training data and objectives.
- Practitioners should consider negative sample augmentation for any reasoning task where false positives are costly, but must invest in careful curation of negative examples to avoid confusing the model.