Research | 2026-06-19

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

arXiv:2606.19538v1. Abstract: Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show...

The paper proposes a single mathematical framework, a learnable integral transform, that can reproduce the core operations of convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. By placing these previously distinct architectures under one operator, the authors claim that ITNet can switch between or combine inductive biases (locality, sequential memory, and pairwise interaction) depending on the task or input.

What happened

The researchers introduce a parameterized integral transform that acts on input sequences or grids. The transform kernel is learned, and by adjusting its structure (e.g., support size, weight-sharing pattern, or positional dependence), the same model can behave like a convolution (local, translation-equivariant), an attention mechanism (content-dependent, global), or a recurrence (stateful, sequential). The paper reports that this single architecture achieves competitive performance on benchmarks spanning image classification, language modeling, and time-series forecasting, tasks that traditionally call for separate specialized models.
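To make the "one operator, three behaviors" idea concrete, here is a minimal sketch in plain Python. It is not the paper's parameterization, which this summary does not describe. It assumes a discretized transform y[i] = sum_j K(i, j, x) * x[j] on a scalar sequence, and all function names (apply_kernel, conv_kernel, attention_kernel, recurrence_kernel) are hypothetical. The recurrence case covers only a linear recurrence; a nonlinear RNN does not reduce to a fixed kernel this way.

```python
import math

def apply_kernel(kernel, xs):
    """Discretized integral transform: y[i] = sum_j kernel(i, j, xs) * xs[j]."""
    n = len(xs)
    return [sum(kernel(i, j, xs) * xs[j] for j in range(n)) for i in range(n)]

def conv_kernel(weights):
    """Kernel that depends only on the offset j - i and has small support.

    Uses the cross-correlation convention with implicit zero padding.
    """
    radius = len(weights) // 2
    def kernel(i, j, xs):
        offset = j - i
        return weights[offset + radius] if abs(offset) <= radius else 0.0
    return kernel

def attention_kernel(scale=1.0):
    """Content-dependent kernel: softmax over j of scale * xs[i] * xs[j].

    Assumption: scalar inputs with query = key = value = x, no learned projections.
    """
    def kernel(i, j, xs):
        scores = [scale * xs[i] * x for x in xs]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        return exps[j] / sum(exps)
    return kernel

def recurrence_kernel(decay):
    """Causal kernel decay**(i - j) for j <= i.

    This is the unrolled form of the linear recurrence h[i] = decay * h[i-1] + x[i].
    """
    def kernel(i, j, xs):
        return decay ** (i - j) if j <= i else 0.0
    return kernel

def linear_recurrence(decay, xs):
    """Reference implementation of the same linear recurrence, step by step."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out

xs = [1.0, 2.0, 0.5, -1.0]
y_conv = apply_kernel(conv_kernel([0.25, 0.5, 0.25]), xs)
y_attn = apply_kernel(attention_kernel(), xs)
y_rec = apply_kernel(recurrence_kernel(0.5), xs)
```

In this sketch the three behaviors differ only in how the kernel depends on its arguments: on the offset alone (convolution), on the input content (attention), or on a causal decaying offset (linear recurrence). The kernel form of the recurrence gives the same output as the step-by-step loop in linear_recurrence.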

Why it matters

This unification is significant for three reasons. First, it challenges the long-held view that CNNs, RNNs, and transformers are mathematically distinct. If a single learnable operator can express all three, the differences between them look less like fundamental divides and more like choices of kernel parametrization. Second, it opens the door to architectures that adapt their inductive bias during training or even per input, for example by using local convolutions for texture-rich image patches and global attention for object boundaries. Third, it could simplify the engineering stack: instead of maintaining separate codebases and hyperparameter search spaces for each architecture type, practitioners might train one ITNet that learns which bias to apply.

Implications for AI practitioners

Model selection may become less binary. Rather than deciding upfront whether to use a transformer or a CNN, practitioners could start with an ITNet and let the training process discover the optimal mix of locality, recurrence, and attention. This could reduce the trial-and-error phase of architecture search.

Transfer learning and fine-tuning could benefit. A pretrained ITNet might adapt to a new domain by shifting its kernel behavior (e.g., becoming more local for medical imaging or more global for natural language) rather than by replacing the underlying architecture.

Computational cost remains a question. The paper does not yet provide a detailed complexity analysis. If the integral transform requires dense kernel evaluations, it may be slower than specialized implementations (e.g., FFT-based convolution or flash attention). Practitioners should benchmark ITNet against optimized baselines before adopting it for production.
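A rough back-of-the-envelope comparison shows why dense kernel evaluation is the concern. The sketch below only counts multiply-accumulates for a full n x n kernel versus a width-k local kernel over d channels. The function names and the sizes (d=64, k=7) are illustrative assumptions, not figures from the paper, and the count ignores the cost of computing the kernel entries themselves, memory traffic, and hardware-specific optimizations.

```python
def dense_kernel_macs(n, d):
    """Multiply-accumulates to apply a full n x n kernel to d channels."""
    return n * n * d

def local_conv_macs(n, d, k):
    """Multiply-accumulates to apply a width-k local kernel to d channels."""
    return n * k * d

# Hypothetical sizes, chosen only to show how the gap grows with sequence length.
for n in (256, 1024, 4096):
    dense = dense_kernel_macs(n, d=64)
    local = local_conv_macs(n, d=64, k=7)
    print(f"n={n:5d}  dense={dense:>13,}  local={local:>10,}  ratio={dense / local:.1f}")
```

Under these assumptions the ratio is simply n / k, so the gap widens linearly with input length. Whether ITNet actually pays this cost depends on implementation details the paper summary does not give.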

Interpretability may improve. Because the kernel’s structure directly encodes the inductive bias, analyzing the learned kernel could reveal whether the model relies on local patterns, long-range dependencies, or sequential memory—offering a clearer picture of how it solves a task.
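As an illustration of what "analyzing the learned kernel" could look like, the sketch below computes two simple diagnostics on a kernel given as a matrix K[i][j]: how much weight sits near the diagonal (locality) and how much sits on or below it (causality). These diagnostics and their names (band_mass, causal_mass) are hypothetical, not tools from the paper, and the example matrices are hand-built stand-ins rather than learned kernels.

```python
def _total_mass(K):
    """Sum of absolute kernel weights."""
    return sum(abs(v) for row in K for v in row)

def band_mass(K, radius):
    """Fraction of absolute weight with |i - j| <= radius. High values suggest a local kernel."""
    total = _total_mass(K)
    near = sum(abs(v) for i, row in enumerate(K)
               for j, v in enumerate(row) if abs(i - j) <= radius)
    return near / total if total else 0.0

def causal_mass(K):
    """Fraction of absolute weight with j <= i. High values suggest a causal, sequential kernel."""
    total = _total_mass(K)
    past = sum(abs(v) for i, row in enumerate(K)
               for j, v in enumerate(row) if j <= i)
    return past / total if total else 0.0

n = 6
# Hand-built stand-ins for three kernel styles (assumptions for illustration only).
local_K = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]
causal_K = [[0.5 ** (i - j) if j <= i else 0.0 for j in range(n)] for i in range(n)]
uniform_K = [[1.0 / n for _ in range(n)] for _ in range(n)]

for name, K in [("local", local_K), ("causal", causal_K), ("uniform", uniform_K)]:
    print(f"{name:8s} band_mass={band_mass(K, 1):.3f} causal_mass={causal_mass(K):.3f}")
```

For a content-dependent kernel the matrix changes with every input, so diagnostics like these would need to be averaged over a representative set of inputs before drawing conclusions.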

Key Takeaways

ITNet introduces a learnable integral transform that can replicate convolution, attention, and recurrence within a single architecture, unifying previously distinct model families.
This unification suggests that inductive biases are not fixed architectural choices but learnable properties, potentially reducing the need for manual architecture search.
Practitioners should look for computational efficiency benchmarks and kernel interpretability tooling before deploying ITNet in resource-constrained or safety-critical applications.
The work may accelerate progress toward adaptive models that dynamically adjust their processing style based on input characteristics.

Source: arXiv cs.AI
