Optimizing Qwen 3.6-35B-A3B: High-Throughput Inference on Intel Arc B70 Pro
Recent benchmarks show the Qwen 3.6-35B-A3B MoE model running on Intel Arc B70 Pro hardware through the SYCL backend at about 977 tokens per second for prompt processing and about 70 tokens per second for token generation.
Performance Benchmarks
The benchmarks cover Qwen 3.6-35B-A3B, a Mixture-of-Experts model with approximately 34.66 billion parameters, deployed on Intel Arc B70 Pro hardware. With 4-bit quantization (Q4_K - Medium), the model occupies 20.81 GiB of VRAM.
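As a quick sanity check on those two figures, the sketch below derives the average number of bits stored per parameter. It is only arithmetic on the reported size and parameter count; the function name is illustrative, and the result says nothing about how individual tensors are quantized.

```python
# Average bits per weight implied by the reported model size and parameter count.
GIB = 1024 ** 3  # bytes in one GiB


def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Convert a model size in GiB and a parameter count into average bits per parameter."""
    return size_gib * GIB * 8 / n_params


# Figures taken from the benchmark report: 20.81 GiB, about 34.66 billion parameters.
bpw = bits_per_weight(20.81, 34.66e9)
print(f"{bpw:.2f} bits per weight")
```

The result comes out at roughly 5.16 bits per weight, somewhat above a flat 4 bits, which is consistent with a mixed-precision "4-bit" format that keeps some tensors at higher precision.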
The SYCL backend produced the following throughput figures:
- Prompt Processing (pp512): 977.40 ± 2.02 t/s
- Token Generation (tg128): 70.54 ± 0.12 t/s
Technical Configuration
The performance was achieved using a specific optimization stack designed for Intel's XPU architecture. Key configuration details include:
- Backend: SYCL
- GPU Offloading: 99 layers offloaded (ngl)
- KV Cache Quantization: q8_0 for both type_k and type_v
- Flash Attention: Enabled (fa: 1)
- Context Window: 262k tokens, large enough to hold extensive datasets or long-form documents in a single prompt.
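The field names above (ngl, type_k, type_v, fa, pp512, tg128) match the column names printed by llama.cpp's llama-bench tool, although the source does not name the tool or its version. Assuming llama-bench was used, a minimal sketch of an equivalent invocation is assembled below. The model path is a hypothetical placeholder, and the SYCL backend is assumed to be selected when llama.cpp is built rather than through a command-line flag.

```python
import shlex


def build_bench_command(model_path: str) -> list[str]:
    """Assemble a llama-bench argument list mirroring the reported settings."""
    return [
        "llama-bench",
        "-m", model_path,   # path to the quantized GGUF file (placeholder)
        "-ngl", "99",       # offload 99 layers to the GPU
        "-ctk", "q8_0",     # KV cache key type
        "-ctv", "q8_0",     # KV cache value type
        "-fa", "1",         # enable flash attention
        "-p", "512",        # prompt-processing test length (pp512)
        "-n", "128",        # token-generation test length (tg128)
    ]


# Hypothetical model path; substitute the real file location.
cmd = build_bench_command("models/qwen-q4_k_m.gguf")
print(shlex.join(cmd))
```

Keeping the arguments in a list makes it straightforward to pass them to subprocess.run without shell quoting issues.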
Analysis of Throughput
The prompt processing speed of 977 t/s, measured on a 512-token prompt, indicates efficient prefill, which matters for applications that need to ingest large contexts quickly. The generation speed of 70.54 t/s, measured over 128 generated tokens, is well above typical reading speed, making the setup suitable for real-time interactive use.
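To put the two rates in terms of response time, the sketch below estimates prefill and decode latency for a given request. It assumes both rates stay constant regardless of context length, which is optimistic: the figures were measured at 512 prompt tokens and 128 generated tokens, and throughput usually falls as the context grows. The function name and the example request size are illustrative.

```python
# Measured rates from the benchmark (tokens per second).
PP_TPS = 977.40  # prompt processing (pp512)
TG_TPS = 70.54   # token generation (tg128)


def estimate_latency(prompt_tokens: int, output_tokens: int,
                     pp_tps: float = PP_TPS, tg_tps: float = TG_TPS):
    """Return (prefill_s, decode_s, total_s), assuming constant throughput."""
    prefill_s = prompt_tokens / pp_tps
    decode_s = output_tokens / tg_tps
    return prefill_s, decode_s, prefill_s + decode_s


# Example request: an 8,192-token prompt followed by a 512-token answer.
prefill, decode, total = estimate_latency(8192, 512)
print(f"prefill {prefill:.1f} s, decode {decode:.1f} s, total {total:.1f} s")
```

Under this simplification, prefill and decode contribute comparable amounts of time for the example request, so long prompts can dominate perceived latency even when generation itself feels fluid.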
Note: The source does not specify the software version or the full environment setup, so the exact runtime version is unknown.