WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
arXiv:2604.08558v2. Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed...
The Efficiency Bottleneck in Autoregressive TTS
The arXiv preprint 2604.08558v2 introduces WAND, a method that combines windowed attention with knowledge distillation to address a scaling problem in autoregressive text-to-speech (AR-TTS) models. Current decoder-only AR-TTS architectures can generate remarkably natural speech, but their memory and compute costs grow quadratically with sequence length, because every token attends to every previous token. For long-form speech synthesis, such as audiobooks, podcasts, or extended voice-assistant sessions, this becomes prohibitively expensive.
WAND proposes a two-pronged solution. First, it replaces full self-attention with a windowed attention mechanism that limits each token’s field of view to a fixed local context. This reduces the attention cost from O(L²) to O(L × W), where L is the sequence length and W is the window size. Second, it uses knowledge distillation to recover the quality lost by restricting attention: a full-attention teacher model supervises the windowed student during training, so the student learns to approximate the teacher’s handling of long-range dependencies without attending to distant tokens directly. The result is a model that runs faster and uses less memory while maintaining speech fidelity comparable to its unconstrained counterpart.
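To make the windowed pattern concrete, here is a minimal NumPy sketch of causal attention restricted to the last W positions. It is an illustration under stated assumptions, not the paper's implementation: the function names `causal_window_mask` and `windowed_attention` are hypothetical, the example is single-head and unbatched, and it builds a dense L × L mask for readability, so it shows which positions are visible rather than delivering the O(L × W) savings itself.

```python
import numpy as np


def causal_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Query i sees itself and at most the (window - 1) positions before it.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)


def windowed_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                       window: int) -> np.ndarray:
    """Single-head scaled dot-product attention with a causal local window.

    q, k, v have shape (seq_len, dim). Dense for clarity only; an efficient
    version would compute just the in-window scores.
    """
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    mask = causal_window_mask(seq_len, window)
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax; every row keeps its diagonal entry,
    # so the row maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    L, W = 8, 3
    mask = causal_window_mask(L, W)
    full_causal_pairs = L * (L + 1) // 2
    print("attended pairs, windowed:", int(mask.sum()))
    print("attended pairs, full causal:", full_causal_pairs)
```

For L ≥ W, the number of attended query-key pairs is L × W − W(W − 1)/2, which grows linearly in L for a fixed W, whereas full causal attention attends L(L + 1)/2 pairs.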
Why This Matters
The AR-TTS landscape has been dominated by models that trade efficiency for quality. WAND’s contribution is not a breakthrough in speech quality per se, but in making high-quality AR-TTS practical for deployment. Quadratic scaling has long been a known barrier, and this work offers a concrete, training-time solution that does not require architectural overhauls or specialized hardware.
For AI practitioners, the significance lies in the method’s generality. Windowed attention combined with distillation is not TTS-specific; it could apply to any autoregressive sequence model where long-range context is important but not uniformly critical. Speech has particularly strong local structure: phonemes and prosody are heavily influenced by their immediate neighbors. WAND exploits this property, but similar logic could extend to music generation, gesture synthesis, or even certain code generation tasks.
Implications for AI Practitioners
- Deployment feasibility: WAND makes it realistic to run AR-TTS on edge devices or in real-time streaming scenarios. Practitioners building voice interfaces for low-latency applications should evaluate this approach as an alternative to non-autoregressive models, which often sacrifice naturalness for speed.
- Serving cost reduction: Knowledge distillation adds training overhead, but that cost is paid once, while the inference savings recur on every request. For teams serving many concurrent users, the reduction in per-request compute could translate directly into lower cloud costs or higher throughput.
- Quality-efficiency trade-off: The paper’s results suggest that with careful distillation, the quality gap can be minimal. However, practitioners should test on their own domain—window size and distillation temperature are hyperparameters that may need tuning for specific voices, languages, or speaking styles.
- Model architecture agnosticism: WAND can be applied to existing AR-TTS models without redesigning the core architecture. This lowers the barrier to adoption for teams already using decoder-only TTS pipelines.
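As a reference point for the distillation temperature mentioned above, the following is a minimal sketch of a generic logit-distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by T². This is the common textbook formulation, used here as an assumption; the article does not specify WAND's actual distillation objective, and the function names `log_softmax` and `distillation_loss` are hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable log-softmax over the last axis after dividing by T."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """Mean KL(teacher || student) on temperature-softened distributions.

    Both inputs have shape (num_positions, vocab_size). The T**2 factor is the
    usual convention for keeping gradient magnitudes comparable across
    temperatures; it is an assumption here, not a detail from the paper.
    """
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    kl_per_position = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_position.mean() * temperature ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(5, 16))   # stand-in for full-attention teacher logits
    student = rng.normal(size=(5, 16))   # stand-in for windowed student logits
    print("loss, mismatched:", distillation_loss(student, teacher))
    print("loss, identical:", distillation_loss(teacher, teacher))
```

A higher temperature flattens both distributions, so the student is pushed to match the teacher's relative preferences among less likely tokens as well as its top choice. This is one reason the setting may need tuning per voice, language, or speaking style.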
Key Takeaways
- WAND reduces AR-TTS compute and memory costs from quadratic to linear in sequence length (for a fixed window size) by replacing full self-attention with windowed attention and recovering quality via knowledge distillation.
- The method is particularly relevant for long-form speech synthesis and real-time applications where latency and resource constraints are critical.
- The approach is architecture-agnostic and could generalize to other autoregressive generation tasks with strong local structure.
- Practitioners should validate the quality-efficiency trade-off on their specific data, as optimal window size and distillation settings may vary by domain.