Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov · 2026-06-08, 14:09 UTC



Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature according to where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, l
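For readers unfamiliar with the BatchTopK activation named above, the following is a minimal sketch in plain Python. It is not the authors' code. The function name `batch_topk`, the ReLU applied before ranking, and the toy numbers are assumptions made for illustration. The idea shown is that sparsity is enforced across the whole batch rather than per sample: the `k * batch_size` largest activations are kept and everything else is zeroed, so individual samples may end up with more or fewer than `k` active features.

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest activations across the batch; zero the rest.

    acts: list of rows (one per sample), each a list of feature pre-activations.
    k:    target average number of active features per sample.
    """
    batch_size = len(acts)
    # Assumption: apply ReLU first so only non-negative activations compete.
    relu = [[max(a, 0.0) for a in row] for row in acts]
    # Flatten to (value, row, col) triples and rank across the whole batch.
    flat = [(v, i, j) for i, row in enumerate(relu) for j, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    keep = flat[: k * batch_size]
    # Scatter the surviving activations back into a zero-filled batch.
    out = [[0.0] * len(row) for row in acts]
    for v, i, j in keep:
        out[i][j] = v
    return out


# Toy batch: 2 samples, 4 hypothetical features, k = 2 (so 4 activations survive).
toy = [[0.9, 0.1, 0.8, 0.7],
       [0.2, -0.5, 0.05, 0.0]]
sparse = batch_topk(toy, k=2)
```

In this toy batch the first sample keeps three features and the second keeps one, which averages to `k = 2` per sample. A per-sample TopK would instead force exactly two active features in each row.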

