Deductive Logic in Language Models: Horizontal vs Vertical Reasoning
arXiv:2510.09340v2. Abstract: Recent language models exhibit significant logical reasoning abilities, yet the mechanisms supporting deductive inference remain poorly understood. This paper studies small transformer-based language models trained from scratch on multi-step...
What Happened
Researchers have released a preprint (arXiv:2510.09340v2) investigating how small transformer-based language models, trained from scratch on multi-step deductive reasoning tasks, internalize logical inference. The study distinguishes between two reasoning modes: horizontal reasoning (processing premises in parallel to derive a conclusion) and vertical reasoning (chaining sequential logical steps). By training models on controlled datasets with explicit multi-step deduction requirements, the authors map how transformer attention patterns and hidden representations encode these distinct forms of logical deduction. The work provides mechanistic evidence that models can learn to perform structured logical operations, not merely mimic surface patterns.
Why It Matters
This research addresses a persistent gap in AI interpretability. While large language models (LLMs) appear to reason logically on benchmarks like GSM8K or LogiQA, it has been unclear whether they genuinely perform deduction or rely on statistical heuristics. By isolating deductive logic in a controlled setting (small transformers trained from scratch), the paper offers a cleaner causal picture than studies on pre-trained LLMs with opaque training data.
The horizontal vs. vertical distinction is particularly significant. Horizontal reasoning (e.g., “All men are mortal; Socrates is a man → Socrates is mortal”) involves integrating multiple premises simultaneously. Vertical reasoning (e.g., chaining “If A then B; if B then C → if A then C”) requires sequential step tracking. The finding that transformers can learn both modes, but with different attention patterns, suggests that standard transformer components (such as attention heads and the residual stream) are sufficient for basic deduction without massive scale. This challenges the assumption that logical reasoning emerges only at the largest model sizes.
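To make the distinction concrete, here is a minimal Python sketch of toy examples for each mode as described above. It is not the paper's dataset or method: the rule format, the function names (`vertical_example`, `horizontal_example`, `rounds_to_derive`), and the atom names are all hypothetical, chosen only for illustration. A simple forward-chaining loop counts how many rounds of rule application are needed to reach the query.

```python
import random

# Hypothetical toy format (an assumption, not the paper's):
# a rule is (antecedents, consequent); facts are atom names.

def vertical_example(depth, rng):
    """Chain p0 -> p1 -> ... -> p_depth; only p0 is given as a fact."""
    atoms = [f"p{i}" for i in range(depth + 1)]
    rules = [((a,), b) for a, b in zip(atoms, atoms[1:])]
    rng.shuffle(rules)  # presentation order should not matter
    return {"mode": "vertical", "facts": [atoms[0]],
            "rules": rules, "query": atoms[-1]}

def horizontal_example(width, rng):
    """One rule whose conclusion needs all `width` facts at once."""
    atoms = [f"q{i}" for i in range(width)]
    facts = atoms[:]
    rng.shuffle(facts)
    return {"mode": "horizontal", "facts": facts,
            "rules": [(tuple(atoms), "r")], "query": "r"}

def rounds_to_derive(example):
    """Forward-chain; return the number of rounds needed, or None."""
    known = set(example["facts"])
    rounds = 0
    while example["query"] not in known:
        # Fire every rule whose antecedents are all known already.
        new = {c for ants, c in example["rules"]
               if c not in known and all(a in known for a in ants)}
        if not new:
            return None  # query is not derivable
        known |= new
        rounds += 1
    return rounds

rng = random.Random(0)
print(rounds_to_derive(vertical_example(4, rng)))    # 4
print(rounds_to_derive(horizontal_example(4, rng)))  # 1
```

In this toy setting the vertical example needs one round per link in the chain, while the horizontal example resolves in a single round that depends on every premise at once. That is one way to picture why the two modes might place different demands on a model.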
Implications for AI Practitioners
For model developers: The study implies that explicit training on structured logical tasks (e.g., synthetic deduction datasets) could improve reasoning reliability in smaller models. If horizontal and vertical reasoning rely on different attention mechanisms, practitioners might design specialized fine-tuning data or architectural modifications (e.g., separate attention heads for premise integration vs. step propagation).
For evaluation: Current benchmarks often conflate reasoning types. A model that excels at horizontal deduction (common in multiple-choice QA) may fail at vertical chaining (required for multi-hop QA). Practitioners should disaggregate evaluation metrics to test each mode separately.
For interpretability: The paper provides a template for mechanistic analysis of reasoning. By training small models on synthetic tasks, researchers can isolate specific cognitive operations, a methodology applicable to studying other capabilities like planning or counterfactual reasoning.
Caveat: The study uses tiny transformers (e.g., 12M parameters) and synthetic data. Scaling to billion-parameter models with natural language may reveal different dynamics. Practitioners should treat these findings as a proof-of-concept, not a definitive guide.
Key Takeaways
- Small transformers can learn both horizontal (parallel premise integration) and vertical (sequential chaining) deductive reasoning when trained from scratch on structured tasks.
- Horizontal and vertical reasoning rely on distinct attention patterns, suggesting different architectural requirements for each mode.
- AI practitioners should evaluate models separately on each reasoning type and consider targeted fine-tuning to address weaknesses.
- The controlled experimental setup offers a replicable template for studying logical mechanisms, but results may not directly transfer to large-scale LLMs with natural language inputs.
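The recommendation to evaluate each reasoning mode separately can be sketched in a few lines of Python. This is a minimal illustration, not an evaluation harness from the paper: the record schema (`mode`, `prediction`, `label`) and the function name `accuracy_by_mode` are assumptions, and the sample records are made up purely to show the bookkeeping.

```python
from collections import defaultdict

def accuracy_by_mode(records):
    """Return accuracy per reasoning mode.

    `records` is an iterable of dicts with the hypothetical keys
    'mode', 'prediction', and 'label'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["mode"]] += 1
        correct[r["mode"]] += int(r["prediction"] == r["label"])
    return {mode: correct[mode] / total[mode] for mode in total}

# Made-up records for illustration only; not results from any model.
records = [
    {"mode": "horizontal", "prediction": True,  "label": True},
    {"mode": "horizontal", "prediction": False, "label": False},
    {"mode": "vertical",   "prediction": True,  "label": True},
    {"mode": "vertical",   "prediction": True,  "label": False},
]
print(accuracy_by_mode(records))  # {'horizontal': 1.0, 'vertical': 0.5}
```

On these illustrative records the pooled accuracy would be 0.75, which hides the gap between the two modes. Reporting the per-mode breakdown makes such a gap visible.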