Research | 2026-06-29

DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models

Originally published by arXiv cs.AI

arXiv:2606.26530v2. Abstract: The Abstraction and Reasoning Corpus (ARC) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches have attempted to transform it into a...

What Happened

A new paper titled "DiARC" proposes a method to improve how large language models handle the Abstraction and Reasoning Corpus (ARC) by explicitly distinguishing between positive and negative samples during training. The ARC benchmark, originally designed to test fluid intelligence through grid-based pattern completion tasks, has proven notoriously difficult for LLMs. DiARC introduces a training framework where models learn not only from correct grid transformations (positive samples) but also from incorrect ones (negative samples), forcing the model to develop a more nuanced understanding of the underlying rules.

The approach works by generating both valid and invalid output grids for each ARC task, then training the model to discriminate between them. This contrastive learning mechanism helps the model internalize the boundaries of acceptable transformations rather than simply memorizing patterns from limited examples. The authors report significant improvements over baseline methods on ARC-style reasoning tasks.
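The discrimination objective described above can be illustrated with a small sketch. This summary does not state the paper's exact loss, so the code below substitutes a generic InfoNCE-style softmax loss as a stand-in. The function name `contrastive_loss` and the idea that the model emits one scalar score per candidate output grid (for example, a log-likelihood) are assumptions made for illustration.

```python
import math
from typing import Sequence


def contrastive_loss(positive_score: float, negative_scores: Sequence[float]) -> float:
    """InfoNCE-style loss: negative log-probability of the positive candidate
    under a softmax over all candidate scores.

    Assumption: each score is a hypothetical scalar the model assigns to one
    candidate output grid; higher means "more likely correct".
    """
    scores = [positive_score, *negative_scores]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    log_normalizer = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_normalizer - positive_score


# A model that cannot tell the correct grid from two incorrect ones
# pays log(3); a model that ranks the correct grid far higher pays less.
confused = contrastive_loss(0.0, [0.0, 0.0])
confident = contrastive_loss(5.0, [0.0, 0.0])
print(confused > confident)
```

Minimizing a loss of this shape pushes the score of the correct grid above the scores of the incorrect ones, which is one common way to realize "learning to discriminate" between positive and negative samples.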

Why It Matters

ARC has become a critical benchmark for measuring genuine reasoning capabilities in AI systems because it requires abstract pattern recognition from minimal data, a skill that humans find trivial but machines struggle with. Most current LLM approaches to ARC either rely on brute-force search over possible transformations or attempt to generate code that solves the tasks. Both methods have fundamental limitations: search-based approaches don't scale, and code-generation approaches often fail on novel patterns.

DiARC's insight is simple: reasoning about what doesn't work is as important as reasoning about what does. This mirrors how humans learn: we often understand a rule better by seeing its exceptions. For AI practitioners, this suggests that current training paradigms focused exclusively on positive examples may be leaving significant reasoning capability on the table. The paper provides empirical evidence that contrastive learning can be effectively applied to abstract reasoning tasks, not just to classification or generation problems.

Implications for AI Practitioners

For those building reasoning systems, DiARC offers a practical technique that can be integrated into existing training pipelines without architectural changes. The key requirement is generating high-quality negative samples, a non-trivial task that requires domain knowledge about what constitutes a plausible but incorrect answer. Practitioners working on similar reasoning benchmarks (e.g., Raven's Progressive Matrices, Bongard problems) should consider whether their training data could benefit from explicit negative sampling.
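One way to picture "plausible but incorrect" answers is to apply the wrong geometric transformation to the correct output grid. The sketch below is an assumption for illustration only and is not the paper's generation procedure; the helper names (`flip_horizontal`, `plausible_negatives`, and so on) are hypothetical. It also filters out candidates that happen to equal the correct grid, since a "negative" that is actually correct would teach the model a misleading pattern.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # an ARC grid: rows of integer color indices


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    return [list(column) for column in zip(*grid)]


def rotate_180(grid: Grid) -> Grid:
    return [row[::-1] for row in grid[::-1]]


# Hypothetical distractor set; a real pipeline would need task-aware choices.
DISTRACTORS: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "rotate_180": rotate_180,
}


def plausible_negatives(correct_output: Grid) -> Dict[str, Grid]:
    """Return distorted copies of the correct grid, keyed by distractor name.

    Candidates equal to the correct grid (e.g. on symmetric grids) or equal
    to an already accepted negative are dropped.
    """
    negatives: Dict[str, Grid] = {}
    for name, transform in DISTRACTORS.items():
        candidate = transform(correct_output)
        if candidate != correct_output and candidate not in negatives.values():
            negatives[name] = candidate
    return negatives


negatives = plausible_negatives([[1, 2], [1, 2]])
print(sorted(negatives))  # ['flip_horizontal', 'transpose']
```

In this example the vertical flip reproduces the correct grid and the 180-degree rotation duplicates the horizontal flip, so both are discarded. The same filtering idea applies to any negative generator: a candidate is only useful if it is verifiably different from the right answer.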

The approach also has implications for few-shot learning systems. By training models to distinguish correct from incorrect outputs, DiARC effectively compresses more information into each training example. This could be particularly valuable in domains where labeled data is scarce but unlabeled data is abundant, a common scenario in enterprise AI applications.

However, practitioners should note that the paper focuses on grid-based reasoning. Whether the same approach transfers to natural language reasoning tasks remains an open question, though the underlying principle seems broadly applicable.

Key Takeaways

  • DiARC improves LLM reasoning on ARC tasks by training models to distinguish correct from incorrect output grids, using contrastive learning on positive and negative samples
  • The approach addresses a fundamental limitation of current LLM reasoning: over-reliance on positive-only training data that fails to teach models the boundaries of valid transformations
  • For AI practitioners, the technique offers a practical, architecture-agnostic method to enhance reasoning capabilities, particularly useful in few-shot and data-scarce scenarios
  • The main challenge lies in generating high-quality negative samples, which requires domain expertise and careful curation to avoid teaching the model misleading patterns