Optimizing Local Voice-to-Voice Pipelines: Achieving Near Real-Time Latency with Qwen3.5 and SNAC
A new implementation of a fully local voice-to-voice chatbot demonstrates the feasibility of near real-time, interruptible conversational AI running on consumer-grade 24GB VRAM hardware by leveraging quantized LLMs and optimized TTS decoders.
Technical Architecture and Model Stack
The system utilizes a modular pipeline designed for low-latency interaction, integrating state-of-the-art open-source models for speech-to-text (STT), linguistic processing, and text-to-speech (TTS). The core components include:
- LLM: Qwen3.5-397B, utilizing the UD-Q3_K_XL quantization provided by Unsloth to balance model capacity with memory constraints.
- STT: Whisper-small, providing the necessary balance between transcription accuracy and processing speed.
- TTS: Orpheus Q4_K_XL, paired with a custom SNAC (Neural Audio Codec) decoder implemented via ONNX for accelerated inference.
Performance and Resource Optimization
A critical achievement of this implementation is the optimization of memory overhead. The entire pipeline maintains a VRAM footprint of 21.3 GB or less. This allows the system to operate comfortably on GPUs with 24 GB of VRAM, leaving sufficient headroom for compute graphs and preventing Out-of-Memory (OOM) errors during peak inference loads.
Real-Time Interaction and Interruptibility
To achieve a "close to real-time" user experience, the developer implemented Server-Sent Events (SSE) streaming. This allows the system to stream tokens and audio chunks incrementally rather than waiting for full sequence generation. Furthermore, the system supports interruptibility, allowing the user to break the AI's speech flow while the model preserves the context of the most recent exchange, ensuring conversational coherence.
Hardware and Deployment
The project is 100% local, eliminating the need for external API calls and ensuring data privacy. The use of ONNX for the SNAC decoder suggests a focus on cross-platform runtime efficiency and reduced latency in the audio synthesis stage.
Note: Detailed system RAM specifications were mentioned in the source but not provided in the text.
Original Source