ZONOS2: High-Fidelity Real-Time TTS Featuring 8B Parameters and Advanced Voice Cloning

Zyphra introduces ZONOS2, a Text-to-Speech (TTS) model with 8B total parameters, of which 900M are active per token, targeting state-of-the-art prosody and real-time performance.

Architectural Overview and Performance

ZONOS2 pairs high model capacity with efficient inference. Of its 8 billion total parameters, a sparse activation strategy uses only 900 million per token. The design aims to keep the representational capacity needed for high-fidelity voice cloning while holding latency low enough for real-time applications.
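To make the sparsity figure concrete, the short sketch below computes the share of parameters active per token from the two numbers quoted above. The function and constant names are illustrative assumptions, not part of Zyphra's code. The result only approximates per-token compute, since it ignores any routing overhead and the memory needed to hold all 8B weights.

```python
# Parameter counts as stated in the article (rounded figures).
TOTAL_PARAMS = 8_000_000_000   # 8B total parameters
ACTIVE_PARAMS = 900_000_000    # 900M active parameters per token


def active_fraction(active: int, total: int) -> float:
    """Return the fraction of parameters used per token (hypothetical helper)."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active per token: {fraction:.2%} of all parameters")
```

By this arithmetic, roughly one parameter in nine participates in any given token.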

Benchmarking and Prosody

According to the released evaluation data, ZONOS2 outperforms several industry-leading models on prosody (the patterns of stress and intonation in speech). On the TTSDS Prosody Score benchmark, ZONOS2 (8B) scored 88.7, ahead of:

  • Qwen 3 TTS 1.7B: 87.6
  • Inworld TTS 2: 87.5
  • Cartesia Sonic 3.5: 87.1
  • Fish S2 Pro: 86.6
  • VoxCPM 2: 86.3
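The gaps between these scores are small, so it can help to see them as margins. The sketch below ranks the scores quoted above and prints how far each model trails the leader. The variable names are illustrative, and the numbers are copied from this article rather than recomputed with ZTTS1-Eval.

```python
# TTSDS Prosody Scores as reported in the article.
scores = {
    "ZONOS2 (8B)": 88.7,
    "Qwen 3 TTS 1.7B": 87.6,
    "Inworld TTS 2": 87.5,
    "Cartesia Sonic 3.5": 87.1,
    "Fish S2 Pro": 86.6,
    "VoxCPM 2": 86.3,
}

# Sort from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_score = ranked[0]

# Margin of the leader over every other model, rounded to one decimal
# to match the precision of the reported scores.
margins = {name: round(leader_score - score, 1) for name, score in ranked[1:]}

print(f"Leader: {leader_name} ({leader_score})")
for name, margin in margins.items():
    print(f"  {name}: trails by {margin:.1f}")
```

The lead over the nearest competitor is 1.1 points, and the spread across all six models is 2.4 points.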

Implementation and Open Access

Zyphra has released the model weights for research and development, along with dedicated inference code and an evaluation framework (ZTTS1-Eval) for verifying the reported metrics.

Technical Resources

  • Official Blog: Detailed project documentation and methodology.
  • Model Weights: Available via Hugging Face for integration into local pipelines.
  • Source Code: Implementation details available on GitHub for both inference and evaluation.

Note: The details of the sparsity mechanism and the training dataset are not covered here; see the official blog.
