BeClaude
Research, 2026-07-03

IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

Originally published by arXiv cs.AI

arXiv:2607.01431v1 Announce Type: cross Abstract: We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific...

The IsoSci Benchmark: Untangling Reasoning from Domain Knowledge in LLMs

A new research paper introduces IsoSci, a benchmark designed to separate reasoning ability from domain-specific knowledge retrieval in large language models. Its core idea is to present LLMs with pairs of problems that share an identical logical structure but draw on different scientific domains. For example, a problem about chemical reaction rates might be paired with a structurally identical problem about population growth dynamics. If a model solves one but not the other, the failure is more likely due to a gap in domain knowledge than to a reasoning deficit.
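The pair-level logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's scoring code: the `PairResult` record, the `diagnose` function, and the label strings are all hypothetical names introduced here, and the sketch assumes each problem in a pair has already been graded as correct or incorrect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Graded outcome for one isomorphic pair (hypothetical schema)."""
    pair_id: str
    domain_a: str
    domain_b: str
    correct_a: bool
    correct_b: bool


def diagnose(result: PairResult) -> str:
    """Apply the article's heuristic to a single pair."""
    if result.correct_a and result.correct_b:
        return "solved_both"
    if not result.correct_a and not result.correct_b:
        # Both sides failed: the shared logic is the common suspect,
        # but two separate knowledge gaps cannot be ruled out.
        return "possible_reasoning_deficit"
    # Exactly one side failed: the logic is shared, so suspect the
    # domain facts needed for the failed side.
    failed_domain = result.domain_b if result.correct_a else result.domain_a
    return f"likely_knowledge_gap:{failed_domain}"


example = PairResult("pair-001", "chemistry", "ecology", True, False)
print(diagnose(example))  # likely_knowledge_gap:ecology
```

A single pair is weak evidence either way, so labels like these are best read as hints to be aggregated over many pairs.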

This matters because current LLM evaluations often conflate these two capabilities. A model that scores highly on physics questions may simply have memorized more physics content rather than possessing superior reasoning. Conversely, a model that fails a biology question might be a strong reasoner that lacks the relevant biological facts. IsoSci’s isomorphic design lets researchers pinpoint which capability is lacking.

Why This Matters for AI Practitioners

For those building or deploying LLM-based systems, this benchmark addresses a practical pain point: debugging model failures. When a model gets a question wrong, is it because it doesn’t understand the logic, or because it doesn’t know the facts? The answer points to different remediation strategies: investing in better retrieval-augmented generation (RAG) pipelines in the first case, or fine-tuning the model on reasoning tasks in the second.

IsoSci also has implications for model selection. A model that performs well on IsoSci’s reasoning dimension but poorly on knowledge retrieval might be an excellent candidate for a RAG architecture, where domain knowledge can be supplied externally. Conversely, a model strong on knowledge but weak on reasoning would need different interventions, such as chain-of-thought prompting or reinforcement learning from human feedback (RLHF) focused on logical consistency.
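One way to turn this selection logic into something checkable is to summarize a model's pair outcomes as a failure profile and map it to a rough recommendation. The sketch below is an assumption-laden illustration: `failure_profile`, `suggest`, and the `min_share` cutoff of 0.3 are hypothetical and do not come from the paper, which may define its reasoning and knowledge dimensions differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[bool, bool]  # (correct on domain A, correct on domain B)


def failure_profile(pairs: List[Pair]) -> Dict[str, float]:
    """Share of pairs with both, exactly one, or neither side correct."""
    if not pairs:
        raise ValueError("need at least one pair")
    # sum(pair) counts how many sides were answered correctly: 0, 1, or 2.
    counts = Counter(sum(pair) for pair in pairs)
    n = len(pairs)
    return {"both": counts[2] / n, "one": counts[1] / n, "neither": counts[0] / n}


def suggest(profile: Dict[str, float], min_share: float = 0.3) -> str:
    """Hypothetical rule of thumb; min_share is an arbitrary cutoff."""
    one, neither = profile["one"], profile["neither"]
    if max(one, neither) < min_share:
        return "no dominant failure mode"
    if one > neither:
        return "knowledge-limited: consider external retrieval (e.g., RAG)"
    if neither > one:
        return "reasoning-limited: consider reasoning-focused prompting or training"
    return "mixed: inspect individual pairs"


results = [(True, True), (True, False), (False, True), (True, False)]
print(suggest(failure_profile(results)))
```

Any real cutoff would need to be calibrated against the benchmark's actual score distribution; the value here only makes the example concrete.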

The benchmark’s cross-domain nature further exposes a subtle risk: models may appear to reason well in one domain while failing in another, simply because they have memorized domain-specific patterns. This is particularly dangerous in scientific and medical applications, where a model might correctly answer a familiar problem but fail catastrophically on an unfamiliar one with the same underlying logic.

Implications for AI Practitioners

  • Diagnostic tool: IsoSci can help practitioners identify whether their deployed models need better knowledge retrieval (e.g., improved RAG) or better reasoning (e.g., fine-tuning on logical tasks).
  • Model evaluation: When comparing models, IsoSci scores provide a more granular view than overall accuracy—revealing whether a model’s strength is genuine reasoning or memorized domain knowledge.
  • Training data design: The benchmark highlights the value of training on cross-domain isomorphic examples to encourage true reasoning transfer, rather than domain-specific pattern matching.
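To make the model-evaluation point concrete, here is a small sketch showing how two models with the same overall accuracy can differ sharply at the pair level. The data is made up for illustration, and `overall_accuracy` and `split_rate` are hypothetical helper names, not metrics defined by the paper.

```python
from typing import List, Tuple

Pair = Tuple[bool, bool]  # (correct on domain A, correct on domain B)


def overall_accuracy(pairs: List[Pair]) -> float:
    """Fraction of individual problems answered correctly."""
    answers = [correct for pair in pairs for correct in pair]
    return sum(answers) / len(answers)


def split_rate(pairs: List[Pair]) -> float:
    """Fraction of pairs where exactly one side is correct."""
    return sum(a != b for a, b in pairs) / len(pairs)


# Made-up outcomes: model_x is consistent within pairs, model_y is not.
model_x = [(True, True), (True, True), (False, False), (False, False)]
model_y = [(True, False), (False, True), (True, False), (False, True)]

print(overall_accuracy(model_x), split_rate(model_x))  # 0.5 0.0
print(overall_accuracy(model_y), split_rate(model_y))  # 0.5 1.0
```

Both models answer half the problems correctly, but model_y never solves both sides of a pair, a pattern that overall accuracy alone would hide.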

Key Takeaways

  • IsoSci isolates reasoning from domain knowledge by using problem pairs with identical logical structures across different scientific fields.
  • The benchmark helps practitioners diagnose whether model failures stem from reasoning deficits or knowledge gaps, enabling targeted improvements.
  • Models that appear strong in one domain may simply have memorized patterns, not developed transferable reasoning—a critical risk for deployment in novel contexts.
  • IsoSci provides a more nuanced evaluation metric than overall accuracy, supporting better model selection and training strategies.
arxiv, papers, reasoning, benchmark