Achieving Ultra-Efficient Edge LLM Inference with Custom 2-Bit Ternary Quantization in Rust
A custom 2-bit ternary inference engine written in Rust has been shown to run GPT-2 XL (1.5B parameters) entirely offline on consumer-grade edge hardware (a Microsoft Surface Pro 7). Rather than relying on established quantization schemes such as AWQ or GPTQ, the engine uses its own ternary weight format to achieve extreme compression and high token throughput.
Architectural Overview: The Ternary Mamba Engine
LLM inference is computationally demanding and often requires high-end, specialized GPUs. This work aims to make efficient LLM deployment practical at the edge through a framework dubbed the "Ternary Mamba Engine." The engine replaces floating-point weights with a strict ternary representation ($\{-1, 0, 1\}$).
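The text does not say which rounding rule maps floating-point weights to $\{-1, 0, 1\}$. As a minimal sketch, the function below assumes absmean scaling (the scheme used by BitNet b1.58): divide by the mean absolute weight, round, and clip. The name `ternarize` and the per-tensor scale are illustrative assumptions, not details taken from the engine.

```rust
/// Map floating-point weights to {-1, 0, 1} plus one per-tensor scale.
/// ASSUMPTION: absmean scaling; the source does not specify the rule.
pub fn ternarize(weights: &[f32]) -> (Vec<i8>, f32) {
    if weights.is_empty() {
        return (Vec::new(), 0.0);
    }
    // Per-tensor scale: mean of the absolute values.
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let eps = 1e-8_f32; // guards against division by zero for all-zero tensors
    let ternary = weights
        .iter()
        .map(|&w| (w / (scale + eps)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (ternary, scale)
}
```

Under this assumption, small weights collapse to 0 and large ones saturate at ±1; the returned scale would be reapplied to the integer result of each layer at inference time.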
The Quantization and Training Pipeline
Standard Post-Training Quantization (PTQ) methods often degrade model quality significantly at very low bit widths. The custom pipeline addresses this through a specialized training phase:
- PyTorch QAT Trainer: Instead of standard PTQ, the system employs a custom HuggingFace patcher that injects a `BitLinear` module using Straight-Through Estimators (STEs). The forward pass sees only weight values in $\{-1, 0, 1\}$, while gradients are computed in FP32, which lets the model "heal" its grammar during fine-tuning and distillation.
- Ternary Packer: A bit-wise compressor stores each weight in 2 bits, packing 16 weights into a single 32-bit integer block. This shrinks the original 6.4 GB GPT-2 XL checkpoint to approximately 375 MB, a 16x compression ratio per weight relative to FP32.
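The packing step can be sketched as follows. The 2-bit code assignment (`0b00` for 0, `0b01` for +1, `0b10` for -1), the low-bits-first ordering, and the function names are assumptions made for illustration; the source only states that 16 weights share one 32-bit integer.

```rust
/// ASSUMED 2-bit codes: 0b00 = 0, 0b01 = +1, 0b10 = -1 (0b11 unused).
fn encode(w: i8) -> u32 {
    match w {
        0 => 0b00,
        1 => 0b01,
        -1 => 0b10,
        _ => panic!("weight {w} is not ternary"),
    }
}

fn decode(code: u32) -> i8 {
    match code & 0b11 {
        0b00 => 0,
        0b01 => 1,
        0b10 => -1,
        _ => panic!("invalid 2-bit code"),
    }
}

/// Pack ternary weights, 16 per u32, lowest bits first.
/// Unused slots in the final word are left as 0b00 (weight 0).
pub fn pack_ternary(weights: &[i8]) -> Vec<u32> {
    weights
        .chunks(16)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u32, |word, (i, &w)| word | (encode(w) << (2 * i)))
        })
        .collect()
}

/// Recover the first `len` weights from packed words.
pub fn unpack_ternary(words: &[u32], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| decode(words[i / 16] >> (2 * (i % 16))))
        .collect()
}
```

At 2 bits per weight, 1.5 billion weights occupy 3 billion bits, or 375 MB, which matches the packed size quoted above.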
Implementation and Performance Benchmarks
The core inference engine is implemented from scratch in Rust and does not depend on floating-point multiplication for its weight matrices. Because every weight is ternary, the matrix products in the forward pass reduce to branchless integer addition and subtraction, which maps well onto CPU integer units.
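To illustrate how a ternary dot product avoids both multiplication and weight-dependent branches, here is a minimal scalar sketch. It assumes 2-bit codes packed 16 per `u32` (`0b00` = 0, `0b01` = +1, `0b10` = -1) and `i32` activations; both choices, and the name `ternary_dot`, are hypothetical rather than taken from the engine.

```rust
/// Dot product of packed ternary weights with integer activations.
/// ASSUMED encoding: 0b00 = 0, 0b01 = +1, 0b10 = -1, 16 codes per u32,
/// lowest bits first. `packed` must hold at least ceil(acts.len() / 16) words.
pub fn ternary_dot(packed: &[u32], acts: &[i32]) -> i32 {
    let mut acc = 0i32;
    for (i, &x) in acts.iter().enumerate() {
        let code = packed[i / 16] >> (2 * (i % 16));
        // Turn each code bit into an all-ones or all-zeros mask.
        let pos_mask = ((code & 1) as i32).wrapping_neg(); // all ones when weight is +1
        let neg_mask = (((code >> 1) & 1) as i32).wrapping_neg(); // all ones when weight is -1
        // Add x for +1, subtract x for -1, contribute nothing for 0.
        acc += (x & pos_mask) - (x & neg_mask);
    }
    acc
}
```

The masks select or discard each activation, so the loop body contains no multiply and no `if` on the weight value.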
SIMD and Hardware Acceleration
The performance gains come from targeted low-level optimizations. The Rust core uses hardware intrinsics, specifically the AVX2 intrinsic `_mm256_maddubs_epi16`, for 8-bit dot products. It also uses the Rayon library to spread work across the CPU's 8 logical cores.
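The following sketch shows one way `_mm256_maddubs_epi16` can drive an 8-bit dot product. The intrinsic multiplies unsigned bytes by signed bytes and sums adjacent pairs into 16-bit lanes, so the sketch assumes activations quantized to `u8` and weights unpacked to `i8` values in $\{-1, 0, 1\}$. The activation format, the function names, and the scalar fallback are assumptions; the engine's actual kernel is not shown in the source.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference, also used for the tail and on CPUs without AVX2.
pub fn dot_scalar(acts: &[u8], weights: &[i8]) -> i32 {
    acts.iter().zip(weights).map(|(&a, &w)| a as i32 * w as i32).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(acts: &[u8], weights: &[i8]) -> i32 {
    let n = acts.len().min(weights.len());
    let chunks = n / 32; // 32 bytes per 256-bit register
    let ones = _mm256_set1_epi16(1);
    let mut acc = _mm256_setzero_si256();
    for i in 0..chunks {
        let a = _mm256_loadu_si256(acts.as_ptr().add(i * 32) as *const __m256i);
        let w = _mm256_loadu_si256(weights.as_ptr().add(i * 32) as *const __m256i);
        // u8 * i8 products, adjacent pairs summed into 16 i16 lanes.
        // With ternary weights each lane is at most 510 in magnitude, so no saturation.
        let pairs = _mm256_maddubs_epi16(a, w);
        // Widen to 8 i32 lanes (multiply by 1, add adjacent pairs) and accumulate.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(pairs, ones));
    }
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    let simd_sum: i32 = lanes.iter().sum();
    simd_sum + dot_scalar(&acts[chunks * 32..n], &weights[chunks * 32..n])
}

/// Dispatch to AVX2 when the CPU supports it, otherwise use the scalar path.
pub fn dot_u8_ternary(acts: &[u8], weights: &[i8]) -> i32 {
    assert_eq!(acts.len(), weights.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { dot_avx2(acts, weights) };
        }
    }
    dot_scalar(acts, weights)
}
```

In a matrix-vector product each output row is an independent call to a kernel like this, which is what makes row-level parallelism with a library such as Rayon straightforward; that part is omitted here to keep the sketch dependency-free.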
Measured Performance on Edge Hardware
Testing was conducted on a Microsoft Surface Pro 7, an ultra-light tablet with no dedicated GPU. The results show a dramatic improvement in inference speed over the prototype implementations:
| Implementation | Throughput (Tokens/sec) |
|---|---|
| Python NumPy Prototype | 0.14 |