2-Bit Ternary Inference Engine

Achieving Ultra-Efficient Edge LLM Inference with Custom 2-Bit Ternary Quantization in Rust

A custom 2-bit ternary inference engine written in Rust runs GPT-2 XL (1.5B parameters) entirely offline on consumer-grade edge hardware, a Microsoft Surface Pro 7. Rather than relying on established quantization methods such as AWQ or GPTQ, the engine uses its own ternary weight format to achieve extreme compression and high token throughput.

Architectural Overview: The Ternary Mamba Engine

LLM inference is computationally demanding and often requires high-end specialized GPUs. This project aims to make efficient LLM deployment practical at the edge through a framework dubbed the "Ternary Mamba Engine." The engine replaces floating-point weights with a strict ternary format ($\{-1, 0, 1\}$).

The Quantization and Training Pipeline

Standard Post-Training Quantization (PTQ) methods often degrade model quality significantly. The custom pipeline addresses this with a dedicated training phase followed by a packing step:

PyTorch QAT Trainer: Instead of standard PTQ, the system uses a custom Hugging Face patcher that injects a BitLinear module built on Straight-Through Estimators (STEs). The forward pass is strictly restricted to weights in $\{-1, 0, 1\}$, while gradients are computed in FP32, which lets the model "heal" its grammar during fine-tuning and distillation.
Ternary Packer: A bit-wise compressor shrinks the original 6.4 GB GPT-2 XL model to approximately 375 MB. Each weight goes from 32 bits to 2 bits, a roughly 16x compression ratio, with 16 weights packed into a single 32-bit integer.
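
The packing step can be sketched as follows. This is a minimal illustration, not the project's actual packer: the 2-bit code assignment (0 as 00, +1 as 01, -1 as 10), the low-bits-first layout, and the function names are all assumptions, since the source only states that 16 weights share one 32-bit integer. At 2 bits per weight, 1.5 billion weights occupy about 375 MB, which matches the figure quoted above.

```rust
/// Number of ternary weights stored in one 32-bit word (2 bits each).
const WEIGHTS_PER_WORD: usize = 16;

/// Map one ternary weight to a 2-bit code.
/// Assumed encoding (not taken from the source): 0 -> 00, +1 -> 01, -1 -> 10.
fn encode(w: i8) -> u32 {
    match w {
        0 => 0b00,
        1 => 0b01,
        -1 => 0b10,
        _ => panic!("weight must be -1, 0, or 1, got {}", w),
    }
}

/// Inverse of `encode`; the code 11 is unused in this sketch.
fn decode(code: u32) -> i8 {
    match code & 0b11 {
        0b00 => 0,
        0b01 => 1,
        0b10 => -1,
        _ => panic!("invalid 2-bit code 0b11"),
    }
}

/// Pack ternary weights into 32-bit words, 16 per word, lowest bits first.
/// A final partial chunk is padded with zero weights.
fn pack_ternary(weights: &[i8]) -> Vec<u32> {
    weights
        .chunks(WEIGHTS_PER_WORD)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u32, |word, (i, &w)| word | (encode(w) << (2 * i)))
        })
        .collect()
}

/// Recover the first `len` weights from packed words.
fn unpack_ternary(words: &[u32], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| decode(words[i / WEIGHTS_PER_WORD] >> (2 * (i % WEIGHTS_PER_WORD))))
        .collect()
}
```

Calling `pack_ternary` on 19 weights yields two words, and `unpack_ternary(&words, 19)` returns the original sequence. Leaving the fourth code unused keeps decoding simple at the cost of some wasted capacity.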

Implementation and Performance Benchmarks

The core inference engine is written from scratch in Rust. Because every weight is -1, 0, or 1, the weight multiplications in the forward pass reduce to branchless integer addition and subtraction, which avoids floating-point multiplication and maps well onto CPU execution.
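
To illustrate why ternary weights remove the multiplications, here is a small sketch of a dot product that uses only masking, addition, and subtraction. The function name and the choice of 8-bit activations with a 32-bit accumulator are assumptions made for the example; the source does not describe the engine's scalar kernel.

```rust
/// Dot product of 8-bit activations with ternary weights (-1, 0, or 1),
/// written without multiplication or data-dependent branches in the source.
fn ternary_dot(activations: &[i8], weights: &[i8]) -> i32 {
    assert_eq!(activations.len(), weights.len());
    activations.iter().zip(weights).fold(0i32, |acc, (&x, &w)| {
        let x = x as i32;
        // All-ones mask (-1) when the weight is +1, all-zeros otherwise.
        let pos = -((w == 1) as i32);
        // All-ones mask (-1) when the weight is -1, all-zeros otherwise.
        let neg = -((w == -1) as i32);
        // Add x for +1 weights, subtract x for -1 weights, skip zeros.
        acc + (x & pos) - (x & neg)
    })
}
```

For activations `[3, -2, 5, 7]` and weights `[1, -1, 0, -1]` the result is 3 + 2 + 0 - 7 = -2, the same value an ordinary multiply-accumulate would give. Whether the compiled machine code is fully branch-free depends on the compiler and target.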

SIMD and Hardware Acceleration

The performance gains come from targeted low-level optimizations. The Rust core uses hardware intrinsics, specifically the AVX2 intrinsic _mm256_maddubs_epi16, for fast 8-bit dot products. The application also uses the Rayon library to spread work across all 8 of the CPU's logical cores.
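
The sketch below shows one way _mm256_maddubs_epi16 can be applied to this problem; it is an illustration under stated assumptions, not the project's kernel. It assumes activations are stored as unsigned 8-bit values and weights as unpacked signed bytes in {-1, 0, 1}, because the intrinsic multiplies unsigned bytes from its first operand by signed bytes from its second. The function names are hypothetical, and a scalar fallback is used when AVX2 is unavailable.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: unsigned 8-bit activations times signed ternary weights.
fn dot_scalar(acts: &[u8], weights: &[i8]) -> i32 {
    acts.iter().zip(weights).map(|(&a, &w)| a as i32 * w as i32).sum()
}

/// AVX2 path: processes 32 byte pairs per iteration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(acts: &[u8], weights: &[i8]) -> i32 {
    let n = acts.len().min(weights.len());
    let chunks = n / 32;
    let ones = _mm256_set1_epi16(1);
    let mut acc = _mm256_setzero_si256();
    for c in 0..chunks {
        let a = _mm256_loadu_si256(acts.as_ptr().add(c * 32) as *const __m256i);
        let w = _mm256_loadu_si256(weights.as_ptr().add(c * 32) as *const __m256i);
        // u8 * i8 products; adjacent pairs are summed into 16-bit lanes.
        // With ternary weights each pair sum is at most 510, so the
        // instruction's 16-bit saturation is never triggered.
        let pairs = _mm256_maddubs_epi16(a, w);
        // Widen to 32 bits: multiply by 1 and add adjacent 16-bit lanes.
        let quads = _mm256_madd_epi16(pairs, ones);
        acc = _mm256_add_epi32(acc, quads);
    }
    // Horizontal sum of the eight 32-bit lanes, plus the scalar tail.
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    let tail = chunks * 32;
    lanes.iter().sum::<i32>() + dot_scalar(&acts[tail..n], &weights[tail..n])
}

/// Dispatch to AVX2 when the running CPU supports it.
fn dot(acts: &[u8], weights: &[i8]) -> i32 {
    assert_eq!(acts.len(), weights.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { dot_avx2(acts, weights) };
        }
    }
    dot_scalar(acts, weights)
}
```

In a full matrix-vector product, a routine like `dot` would presumably be called once per output row, which is the kind of independent work a Rayon parallel iterator can distribute across threads. A production kernel would also decode the packed 2-bit weights directly in registers rather than reading unpacked bytes.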

Measured Performance on Edge Hardware

Testing was conducted on a Microsoft Surface Pro 7, an ultra-light tablet without a dedicated GPU. The results show a dramatic improvement in inference speed over the Python NumPy prototype:

Implementation	Throughput (tokens/sec)
Python NumPy Prototype	0.14
Rust 2-Bit Ternary Engine	115

Techyon - AI News Aggregator

I built a custom 2-Bit Ternary Inference Engine from scratch in Rust + native PyTorch QAT. I'm running GPT-2 XL (1.5B) entirely offline on a Surface Pro 7 at 115 tokens/sec.
