WAND: Efficient Autoregressive TTS with Windowed Attention and Knowledge Distillation

Researchers introduce WAND, a framework that reduces the computational and memory costs of autoregressive text-to-speech models by using windowed attention and knowledge distillation. This approach maintains high-fidelity speech output while making the models more scalable.

WAND targets autoregressive text-to-speech (AR-TTS) models, whose full self-attention scales quadratically in memory and compute with sequence length. The framework combines windowed attention with knowledge distillation. It splits the attention mechanism into two parts: persistent global attention over the conditioning tokens and local sliding-window attention over the generated tokens. Each generated token therefore attends to a bounded set of positions rather than to the entire history.
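To make the attention split concrete, here is a minimal sketch of a mask that combines global attention over conditioning tokens with a causal sliding window over generated tokens. It is an illustration, not the authors' implementation: the function name `wand_style_mask`, the layout (conditioning tokens first, generated tokens after), and the choice to keep the conditioning tokens causal among themselves are all assumptions.

```python
def wand_style_mask(num_cond, num_gen, window):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): positions [0, num_cond) hold conditioning
    tokens, positions [num_cond, num_cond + num_gen) hold generated tokens.
    """
    total = num_cond + num_gen
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):            # causal: never look at future positions
            is_global = k < num_cond      # conditioning tokens stay visible to everyone
            is_local = (q - k) < window   # the last `window` positions, including q itself
            mask[q][k] = is_global or is_local
    return mask


if __name__ == "__main__":
    # 3 conditioning tokens, 6 generated tokens, window of 2 (illustrative sizes).
    for row in wand_style_mask(num_cond=3, num_gen=6, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the first three columns stay filled for every row while a two-wide diagonal band slides along the generated positions. Once generation is past the first step, each row has exactly `num_cond + window` allowed keys.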

This matters for scaling AR-TTS models, which have traditionally been limited by their high computational demands. WAND is reported to maintain high-fidelity speech output while reducing the cost of attention to constant complexity, meaning the work per generation step no longer grows with the amount of audio already generated. That opens up real-time and large-scale applications and could make high-quality text-to-speech more practical for a wider range of uses.
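A rough way to see the constant-complexity claim is to count how many keys a newly generated token attends to at each step. The sketch below assumes the same global-plus-window layout described above; the helper names and the sizes (200 conditioning tokens, a window of 128) are hypothetical and not taken from the paper.

```python
def keys_per_step(step, num_cond, window=None):
    """Keys attended to by the token generated at `step` (1-based).

    window=None models full causal attention; otherwise a sliding window
    over generated tokens plus global attention over conditioning tokens.
    """
    if window is None:
        return num_cond + step            # grows with every generated token
    return num_cond + min(step, window)   # capped once the window is full


def total_keys(num_steps, num_cond, window=None):
    """Sum of attended keys over a whole utterance of `num_steps` tokens."""
    return sum(keys_per_step(s, num_cond, window) for s in range(1, num_steps + 1))


if __name__ == "__main__":
    num_cond, window = 200, 128  # hypothetical sizes
    for step in (100, 1_000, 10_000):
        full = keys_per_step(step, num_cond)
        local = keys_per_step(step, num_cond, window)
        print(f"step {step}: full={full} windowed={local}")
```

Under these assumptions the per-step count for full attention keeps rising, so the total over an utterance grows quadratically, while the windowed count stops at `num_cond + window`. The same bound applies to the key-value cache that must be kept in memory.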

The research community is likely to explore WAND further, particularly in applications that require real-time processing or deployment on edge devices. Future work may focus on optimizing the windowed attention mechanism and integrating WAND with other state-of-the-art TTS models. Open questions include the trade-off between computational efficiency and speech quality and how well WAND adapts to different types of TTS models.