BeClaude
Research, 2026-06-26

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

Source: arXiv cs.AI

arXiv:2606.26290v1 Announce Type: cross Abstract: While parameter-efficient fine-tuning (PEFT) typically targets attention projectors, its efficacy for tasks requiring sequential state accumulation remains under-explored. We examine if PEFT for such tasks can benefit from state space model (SSMs)...

This arXiv paper tackles an overlooked question in parameter-efficient fine-tuning (PEFT): where exactly should you inject the adapter? Most PEFT research has focused on attention projections, largely because transformer architectures dominate. This work argues that for tasks requiring sequential state accumulation, the injection site matters as much as the adapter method itself.

What Happened

The researchers propose an approach called Hankel Reduced-Order Modeling (HROM) to create SSM-based adapters. Rather than adding LoRA-style low-rank matrices to attention projections, they use Hankel matrices, a classical tool from control theory, to derive low-dimensional state-space representations that can be injected into specific layers of a pre-trained model. The critical finding is that the location of these SSM adapters within the network determines their suitability for different long-context tasks. Injecting them at early layers benefits tasks requiring precise token-level recall (e.g., information retrieval from long documents), while later-layer injection favors tasks requiring global state aggregation (e.g., summarization or reasoning over long contexts).
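To make the idea of an "SSM adapter" concrete, here is a minimal NumPy sketch of a residual linear state-space adapter applied to one layer's hidden states. It is an illustration only, not the paper's implementation: the function name `ssm_adapter`, the shapes, the diagonal `A`, and the zero-initialized `C` are all assumptions chosen for clarity.

```python
import numpy as np

def ssm_adapter(x, A, B, C):
    """Residual linear SSM over hidden states x of shape (seq_len, d_model).

    Recurrence (hypothetical adapter form, not taken from the paper):
        h_t = A h_{t-1} + B x_t
        y_t = x_t + C h_t
    """
    seq_len, _ = x.shape
    h = np.zeros(A.shape[0])          # compressed running state
    out = np.empty_like(x)
    for t in range(seq_len):
        h = A @ h + B @ x[t]          # accumulate sequential state
        out[t] = x[t] + C @ h         # add the adapter's contribution
    return out

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 8, 4, 16
A = 0.9 * np.eye(d_state)                      # stable dynamics (assumption)
B = 0.1 * rng.normal(size=(d_state, d_model))
C = np.zeros((d_model, d_state))               # zero init: adapter starts as identity
x = rng.normal(size=(seq_len, d_model))
y = ssm_adapter(x, A, B, C)
```

With `C` initialized to zero the adapter leaves the layer output unchanged, which mirrors the common PEFT convention of starting from the pre-trained model's behavior. "Injection site" then simply means which layer's hidden states are passed through a function like this.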

Why It Matters

This research challenges the prevailing assumption that PEFT is a one-size-fits-all solution. Current practice treats adapters as modular plug-ins that work regardless of placement, but this paper demonstrates that the functional role of a layer (whether it processes local patterns or global semantics) interacts with the adapter’s inductive bias. State-space models are designed to maintain a compressed representation of sequential history, so placing them in layers that already perform a similar function creates a synergy that standard attention-based adapters cannot replicate.

The methodological contribution is also significant. Using Hankel matrices to derive reduced-order models is mathematically elegant: it leverages decades of control theory to compress the dynamics of a full SSM into a smaller, trainable adapter. This is more principled than starting from randomly initialized low-rank factors, as LoRA-style methods do, and could generalize to other sequence modeling architectures.
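For readers unfamiliar with Hankel-based reduction, the following sketch shows the classical Ho-Kalman construction from system identification: build a Hankel matrix from a system's impulse response, truncate its SVD, and read off a small state-space model. This is the textbook technique, offered as an assumption about the general flavor of HROM; the summary above does not confirm that the paper uses exactly this procedure. The function names and the toy single-input, single-output system are hypothetical.

```python
import numpy as np

def markov_parameters(A, B, C, n):
    """Impulse response h_k = C A^k B for k = 0..n-1 (SISO system)."""
    out, Ak = [], np.eye(A.shape[0])
    for _ in range(n):
        out.append((C @ Ak @ B).item())
        Ak = Ak @ A
    return np.array(out)

def hankel_reduce(h, rank):
    """Ho-Kalman-style realization of order `rank` from impulse response h."""
    m = len(h) // 2
    # Hankel matrix H[i, j] = h[i + j] and its one-step shift H1[i, j] = h[i + j + 1]
    H = np.array([[h[i + j] for j in range(m)] for i in range(m)])
    H1 = np.array([[h[i + j + 1] for j in range(m)] for i in range(m)])
    U, s, Vt = np.linalg.svd(H)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]      # keep dominant modes
    sqrt_s = np.sqrt(s)
    # A_r = S^{-1/2} U^T H1 V S^{-1/2}
    A_r = (U / sqrt_s).T @ H1 @ (Vt.T / sqrt_s)
    B_r = (sqrt_s[:, None] * Vt)[:, :1]              # first column of S^{1/2} V^T
    C_r = (U * sqrt_s)[:1, :]                        # first row of U S^{1/2}
    return A_r, B_r, C_r

# Toy 2-state system (illustrative values only)
A = np.diag([0.9, 0.5])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, -0.5]])
h_true = markov_parameters(A, B, C, 20)
A_r, B_r, C_r = hankel_reduce(h_true, rank=2)
h_red = markov_parameters(A_r, B_r, C_r, 20)
```

Choosing `rank` smaller than the true system order gives an approximation whose quality is governed by the discarded singular values of the Hankel matrix, which is what makes the reduction deterministic rather than a matter of random initialization.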

Implications for AI Practitioners

For engineers fine-tuning large language models for long-context applications, this paper offers a practical heuristic: match the adapter’s inductive bias to the layer’s functional role. If your task requires retrieving a specific fact from a 100K-token document, inject SSM adapters into early layers. If you need to synthesize information across the entire context, target later layers. This could reduce the number of experiments needed to find optimal configurations.
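As a starting point for experiments, the heuristic can be written down as a small helper that picks candidate layer indices from the task type. This is a hypothetical sketch: the function name, the task labels, and the default of targeting a quarter of the layers are assumptions for illustration, not values reported by the paper.

```python
def injection_layers(num_layers, task, fraction=0.25):
    """Pick candidate adapter layers: early block for recall, late block for aggregation.

    `fraction` (share of layers to adapt) is an illustrative assumption.
    """
    if task not in ("recall", "aggregation"):
        raise ValueError("task must be 'recall' or 'aggregation'")
    k = max(1, round(num_layers * fraction))
    if task == "recall":
        return list(range(k))                          # earliest k layers
    return list(range(num_layers - k, num_layers))     # latest k layers

recall_sites = injection_layers(32, "recall")
aggregation_sites = injection_layers(32, "aggregation")
```

Treat the returned indices as the first configuration to try, then sweep around them; the right cut-off between "early" and "late" will depend on the model.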

Additionally, the HROM method provides a deterministic way to construct SSM adapters without extensive hyperparameter tuning. Practitioners working with state-space models (e.g., Mamba, S4) may find this approach more stable than training full SSM parameters from scratch.

However, the paper does not address computational overhead during inference. SSM adapters, while parameter-efficient, may introduce additional sequential computation that could negate some of the speed advantages of transformers. Practitioners should benchmark latency before deploying.
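A simple way to check this before deployment is to time a layer with and without the extra recurrence. The sketch below uses a toy NumPy layer and a naive sequential scan as stand-ins; the layer functions, sizes, and repeat counts are all assumptions, and a real benchmark should time the actual model on the target hardware.

```python
import time
import numpy as np

def median_latency(fn, *args, repeats=20, warmup=3):
    """Median wall-clock seconds per call, after a few warm-up runs."""
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return float(np.median(times))

def base_layer(x, W):
    """Stand-in for a frozen layer: a single matrix multiply."""
    return x @ W

def layer_with_scan(x, W, A, B, C):
    """Same layer followed by a naive sequential SSM scan (hypothetical adapter)."""
    y = x @ W
    h = np.zeros(A.shape[0])
    out = np.empty_like(y)
    for t in range(y.shape[0]):
        h = A @ h + B @ y[t]
        out[t] = y[t] + C @ h
    return out

rng = np.random.default_rng(0)
seq_len, d_model, d_state = 256, 64, 8
x = rng.normal(size=(seq_len, d_model))
W = rng.normal(size=(d_model, d_model))
A = 0.9 * np.eye(d_state)
B = 0.1 * rng.normal(size=(d_state, d_model))
C = 0.1 * rng.normal(size=(d_model, d_state))

t_base = median_latency(base_layer, x, W)
t_scan = median_latency(layer_with_scan, x, W, A, B, C)
print(f"base: {t_base * 1e3:.3f} ms, with scan: {t_scan * 1e3:.3f} ms")
```

The median is used instead of the mean so that occasional scheduling hiccups do not dominate the result. Optimized SSM implementations typically replace the Python loop with a parallel scan or fused kernel, so the gap measured here is an upper bound on the idea rather than a prediction.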

Key Takeaways

  • Injection site matters: The performance of SSM-based adapters depends critically on which layer they are inserted into, with early layers favoring recall tasks and late layers favoring aggregation tasks.
  • Hankel reduced-order modeling offers a principled way to derive SSM adapters, in contrast to the randomly initialized low-rank factors used in LoRA-style methods.
  • Practitioners should match adapter inductive bias to layer function rather than treating all PEFT as interchangeable; this is especially relevant for long-context fine-tuning.
  • The approach is complementary to attention-based PEFT, suggesting hybrid strategies (e.g., LoRA in attention layers + HROM in SSM layers) may yield the best results for complex long-context tasks.