Optimizing Local Voice-to-Voice Pipelines: Achieving Near Real-Time Latency with Qwen3.5 and SNAC

A new implementation of a fully local voice-to-voice chatbot demonstrates the feasibility of near real-time, interruptible conversational AI running on consumer-grade 24GB VRAM hardware by leveraging quantized LLMs and optimized TTS decoders.

Technical Architecture and Model Stack

The system utilizes a modular pipeline designed for low-latency interaction, integrating state-of-the-art open-source models for speech-to-text (STT), linguistic processing, and text-to-speech (TTS). The core components include:

LLM: Qwen3.5-397B, utilizing the UD-Q3_K_XL quantization provided by Unsloth to balance model capacity with memory constraints.
STT: Whisper-small, providing the necessary balance between transcription accuracy and processing speed.
TTS: Orpheus Q4_K_XL, paired with a custom SNAC (Neural Audio Codec) decoder implemented via ONNX for accelerated inference.

Performance and Resource Optimization

A critical achievement of this implementation is the optimization of memory overhead. The entire pipeline maintains a VRAM footprint of 21.3 GB or less. This allows the system to operate comfortably on GPUs with 24 GB of VRAM, leaving sufficient headroom for compute graphs and preventing Out-of-Memory (OOM) errors during peak inference loads.

Real-Time Interaction and Interruptibility

To achieve a "close to real-time" user experience, the developer implemented Server-Sent Events (SSE) streaming. This allows the system to stream tokens and audio chunks incrementally rather than waiting for full sequence generation. Furthermore, the system supports interruptibility, allowing the user to break the AI's speech flow while the model preserves the context of the most recent exchange, ensuring conversational coherence.

Hardware and Deployment

The project is 100% local, eliminating the need for external API calls and ensuring data privacy. The use of ONNX for the SNAC decoder suggests a focus on cross-platform runtime efficiency and reduced latency in the audio synthesis stage.

Note: Detailed system RAM specifications were mentioned in the source but not provided in the text.

Original Source

LLM Voice-to-Voice Qwen3.5 Whisper ONNX Quantization Local AI

Techyon

Voice-to-voice chatbot update

Optimizing Local Voice-to-Voice Pipelines: Achieving Near Real-Time Latency with Qwen3.5 and SNAC

Technical Architecture and Model Stack

Performance and Resource Optimization

Real-Time Interaction and Interruptibility

Hardware and Deployment

Voice-to-voice chatbot update

Optimizing Local Voice-to-Voice Pipelines: Achieving Near Real-Time Latency with Qwen3.5 and SNAC

Technical Architecture and Model Stack

Performance and Resource Optimization

Real-Time Interaction and Interruptibility

Hardware and Deployment

Related Articles

GLM 5.2 API is live, weights are on HF, and ollama has it already

Google Stitch vs Claude Design vs Figma — The Future of Design Just Split Into Three Directions

Anthropic "pauses" token-based billing for its Claude Agent SDK

GPT‑NL: a sovereign language model for the Netherlands

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification