Research · 2026-06-19

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

arXiv:2606.20532v1. Abstract: Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability...

The Black Box of Voice Style Control

A new arXiv paper (2606.20532) tackles a blind spot in modern text-to-speech (TTS) systems: the gap between natural-language style prompts and the acoustic output they produce. Style-captioned TTS lets users describe desired voice characteristics in plain English, such as "speak with a warm, breathy tone" or "add a hint of sarcasm", but how individual words like "warm" or "breathy" influence the generated audio has remained opaque. The researchers introduce a cross-attention attribution method to trace which tokens in a style caption most strongly affect which acoustic features of the output.
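The abstract excerpt does not spell out how the attributions are computed, so what follows is only a minimal sketch of one common way to turn cross-attention into per-word scores: average the attention each caption token receives across layers, heads, and output frames. The nested-list layout `attn[layer][head][frame][token]`, the function name `token_attribution`, and the numbers in the example are all assumptions made for illustration, not the paper's method or data.

```python
from typing import List, Sequence, Tuple

# Assumed layout: attn[layer][head][frame][token], where each innermost row
# is the attention one output frame pays to the caption tokens.
Attention = Sequence[Sequence[Sequence[Sequence[float]]]]


def token_attribution(attn: Attention, tokens: List[str]) -> List[Tuple[str, float]]:
    """Average the attention each caption token receives over all rows."""
    totals = [0.0] * len(tokens)
    rows = 0
    for layer in attn:
        for head in layer:
            for frame in head:
                if len(frame) != len(tokens):
                    raise ValueError("attention row length must match token count")
                for i, weight in enumerate(frame):
                    totals[i] += weight
                rows += 1
    if rows == 0:
        raise ValueError("no attention rows supplied")
    # A list of pairs (not a dict) so repeated caption words stay separate.
    return [(tok, total / rows) for tok, total in zip(tokens, totals)]


# Illustrative input: 1 layer, 1 head, 2 output frames, 3 caption tokens.
tokens = ["warm", "breathy", "tone"]
attn = [[[[0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1]]]]
scores = token_attribution(attn, tokens)
for tok, score in scores:
    print(f"{tok}: {score:.2f}")
```

Plain averaging is the simplest aggregation. It treats every layer and head as equally informative, which may not hold in a real model, so scores like these are best read as a first-pass signal rather than a causal measurement.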

This matters because the current generation of controllable TTS models treats style captions as monolithic embeddings, obscuring whether the system is actually attending to the intended semantic cues or latching onto spurious correlations. For example, a model might learn to associate "whisper" with low volume but fail to capture the spectral characteristics of actual whispered speech, producing a quiet but otherwise normal voice. The attribution method lets practitioners see which words drive changes in pitch, timbre, or speaking rate, revealing when the model is "cheating" by ignoring key terms.

Implications for AI Practitioners

For developers building voice interfaces or assistive technologies, this work offers a diagnostic tool that was previously unavailable. When a style caption fails to produce the expected vocal effect, engineers can now inspect whether the model correctly attended to the relevant word or was distracted by adjacent tokens. This is particularly critical for safety-sensitive applications—a medical voice assistant instructed to "speak calmly and slowly" must actually alter its prosody, not just lower its volume.
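As a rough sketch of what such an inspection might look like, the snippet below takes per-token attribution scores (however they were obtained) and flags style words whose share of the total falls well below what a uniform spread would give them. The function name `flag_ignored_words`, the `ratio` threshold, and the example scores are hypothetical; a real diagnostic would need thresholds calibrated against the model in question.

```python
from typing import List, Tuple


def flag_ignored_words(
    scores: List[Tuple[str, float]],
    style_words: List[str],
    ratio: float = 0.5,
) -> List[str]:
    """Return style words whose attribution share is below ratio * uniform share.

    A style word that does not appear in the caption has share 0 and is flagged.
    """
    if not scores:
        return list(style_words)
    total = sum(score for _, score in scores)
    if total <= 0:
        return list(style_words)
    uniform = 1.0 / len(scores)  # share each token would get if attention were even
    flagged = []
    for word in style_words:
        share = sum(score for tok, score in scores if tok == word) / total
        if share < ratio * uniform:
            flagged.append(word)
    return flagged


# Made-up scores for the caption "speak calmly and slowly please".
example_scores = [
    ("speak", 0.05),
    ("calmly", 0.55),
    ("and", 0.05),
    ("slowly", 0.05),
    ("please", 0.30),
]
ignored = flag_ignored_words(example_scores, ["calmly", "slowly"])
print(ignored)
```

In this made-up case "slowly" receives far less than an even share while the filler word "please" receives much more, which is the kind of pattern that would prompt a closer look at the generated prosody.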

The research also has direct implications for prompt engineering. If attribution analysis reveals that certain style words are consistently ignored or misinterpreted, practitioners can reformulate captions to use more salient vocabulary. For instance, if "authoritative" is frequently overlooked but "commanding" triggers the desired effect, the system's behavior becomes auditable and improvable.
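One hedged way to picture this comparison: compute the attribution share of the candidate word in each caption variant and rank the variants. The helper names `word_share` and `rank_variants` and all scores below are invented for illustration. A high attention share does not by itself prove the acoustic effect is right, so a ranking like this would still need to be checked by listening or by measuring the output.

```python
from typing import Dict, List, Tuple

Scores = List[Tuple[str, float]]


def word_share(scores: Scores, word: str) -> float:
    """Fraction of total attribution assigned to `word` in one caption."""
    total = sum(score for _, score in scores)
    if total <= 0:
        return 0.0
    return sum(score for tok, score in scores if tok == word) / total


def rank_variants(variants: Dict[str, Scores]) -> List[Tuple[str, float]]:
    """Rank candidate style words by their share within their own caption."""
    ranked = [(word, word_share(scores, word)) for word, scores in variants.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up per-token scores for two caption variants.
variants = {
    "authoritative": [("an", 0.3), ("authoritative", 0.1), ("voice", 0.6)],
    "commanding": [("a", 0.2), ("commanding", 0.5), ("voice", 0.3)],
}
ranking = rank_variants(variants)
for word, share in ranking:
    print(f"{word}: {share:.2f}")
```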

Broader Context

This paper sits at the intersection of interpretability and generative audio, a domain where black-box issues are even more acute than in text or image generation. Unlike LLMs, where attention patterns can be visualized for individual tokens, TTS models process continuous acoustic features, making attribution inherently more complex. The cross-attention approach is a practical step toward making these systems transparent, but it also highlights how far we are from truly compositional control over voice synthesis.

Key Takeaways

Researchers developed a cross-attention attribution method to identify which words in a style caption influence specific acoustic features in TTS output, addressing a critical interpretability gap.
The technique enables practitioners to diagnose failures—such as models ignoring key style terms—and refine prompts or training data accordingly.
For safety-critical voice applications, attribution offers a way to verify that style instructions are actually being followed, not just superficially matched.
The work underscores the need for more granular interpretability tools in generative audio, where black-box behavior is harder to analyze than in text or image domains.
