Optimizing Qwen 3.6-35B-A3B: High-Throughput Inference on Intel Arc B70 Pro

Recent benchmarks show the Qwen 3.6-35B-A3B MoE model running on Intel Arc B70 Pro hardware with prompt processing at about 977 tokens per second via the SYCL backend.

Performance Benchmarks

Benchmarks of Qwen 3.6-35B-A3B (a Mixture-of-Experts model with approximately 34.66 billion parameters) show high throughput on Intel Arc B70 Pro GPUs. With 4-bit quantization (Q4_K - Medium), the model weights occupy 20.81 GiB of VRAM.
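As a rough sanity check on the "4-bit" label, the size and parameter count quoted above can be converted into an average bits-per-weight figure. This is a minimal sketch: it assumes the 20.81 GiB figure covers all 34.66 billion parameters and nothing else, and the function name is illustrative only.

```python
# Figures quoted in the text above.
MODEL_SIZE_GIB = 20.81   # reported size of the Q4_K - Medium weights
PARAM_COUNT = 34.66e9    # approximate total parameter count


def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Average number of bits stored per parameter."""
    size_bits = size_gib * 1024**3 * 8  # GiB -> bytes -> bits
    return size_bits / n_params


bpw = bits_per_weight(MODEL_SIZE_GIB, PARAM_COUNT)
print(f"effective bits per weight: {bpw:.2f}")
```

The result lands a little above 5 bits per weight rather than exactly 4, which is what one would expect from a mixed scheme that keeps block scales and stores some tensors at higher precision.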

The SYCL backend produced the following results, where pp512 is the processing of a 512-token prompt and tg128 is the generation of 128 tokens:

Prompt Processing (pp512): 977.40 ± 2.02 t/s
Token Generation (tg128): 70.54 ± 0.12 t/s
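To judge how stable these measurements are, the reported values can be split into a mean and a spread. The sketch below assumes the ± figure is a standard deviation across repeated runs, which the source does not state explicitly; the helper names are hypothetical.

```python
import re

# Result strings exactly as reported above.
RESULTS = {
    "pp512": "977.40 ± 2.02 t/s",
    "tg128": "70.54 ± 0.12 t/s",
}


def parse_rate(text: str) -> tuple[float, float]:
    """Split 'mean ± spread t/s' into two floats."""
    match = re.fullmatch(r"\s*([\d.]+)\s*±\s*([\d.]+)\s*t/s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized result string: {text!r}")
    return float(match.group(1)), float(match.group(2))


def relative_spread(text: str) -> float:
    """Spread as a fraction of the mean."""
    mean, spread = parse_rate(text)
    return spread / mean


for name, text in RESULTS.items():
    print(f"{name}: spread is {relative_spread(text):.2%} of the mean")
```

Both spreads come out at a small fraction of one percent of the mean, so the run-to-run variation is minor compared with the headline figures.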

Technical Configuration

These results were obtained with the following configuration:

Backend: SYCL
GPU Offloading: 99 layers offloaded (ngl)
KV Cache Quantization: q8_0 for both type_k and type_v
Flash Attention: Enabled (fa: 1)
Context Window: 262k tokens supported, which allows long documents or large datasets to fit in a single context. The benchmark figures above were measured at much shorter lengths (a 512-token prompt and a 128-token generation).
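The field names above (ngl, type_k, type_v, fa, pp512, tg128) match the output columns of llama.cpp's llama-bench tool, although the source does not name the tool. Assuming that is what was used, the settings could be reproduced with a command along these lines. The model path is a placeholder, and the SYCL backend is assumed to be selected when the binary is built rather than by a runtime flag.

```python
import shlex


def build_bench_command(model_path: str) -> list[str]:
    """Assemble a llama-bench invocation mirroring the listed settings."""
    return [
        "llama-bench",
        "-m", model_path,   # GGUF model file (placeholder path below)
        "-ngl", "99",       # offload 99 layers to the GPU
        "-ctk", "q8_0",     # K cache type
        "-ctv", "q8_0",     # V cache type
        "-fa", "1",         # enable flash attention
        "-p", "512",        # prompt length for the pp512 test
        "-n", "128",        # generation length for the tg128 test
    ]


# Hypothetical path; the source does not give the actual file name.
cmd = build_bench_command("models/qwen-q4_k_m.gguf")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value, and it can be passed to subprocess.run directly without going through a shell.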

Analysis of Throughput

A prompt processing speed of about 977 t/s means fast prefill, which matters for applications that ingest large contexts. This figure was measured on a 512-token prompt (pp512); the source gives no measurement at longer context lengths. The generation speed of 70.54 t/s is well above typical reading speed, so output streams smoothly in real-time interactive use.
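To make these rates concrete, they can be converted into wall-clock estimates. This is a back-of-the-envelope sketch that assumes both rates stay constant at any length, which is optimistic: throughput usually drops as the context grows, so the numbers for long prompts are best read as lower bounds on latency. The function names are illustrative.

```python
PP_RATE = 977.40  # prompt processing, tokens per second (pp512)
TG_RATE = 70.54   # token generation, tokens per second (tg128)


def prefill_seconds(n_prompt_tokens: int, rate: float = PP_RATE) -> float:
    """Time to ingest a prompt at a constant prefill rate."""
    return n_prompt_tokens / rate


def per_token_ms(rate: float = TG_RATE) -> float:
    """Average delay between generated tokens, in milliseconds."""
    return 1000.0 / rate


def response_seconds(n_prompt_tokens: int, n_generated_tokens: int) -> float:
    """Prefill time plus generation time for one request."""
    return prefill_seconds(n_prompt_tokens) + n_generated_tokens / TG_RATE


print(f"512-token prompt: {prefill_seconds(512):.2f} s to prefill")
print(f"per generated token: {per_token_ms():.1f} ms")
print(f"512 in / 128 out: {response_seconds(512, 128):.2f} s total")
```

At these rates a short prompt is ingested in about half a second and each new token arrives roughly every 14 ms.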

Note: The source does not specify the software version or the full environment setup, so the exact runtime version is unknown.

Original Source

Techyon

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro
