BitCPM-CANN: Native 1.58-Bit Large Language Model Training on Ascend NPU

Researchers introduce BitCPM-CANN, a systematic study of 1.58-bit (ternary) Quantization-Aware Training (QAT) on the Huawei Ascend NPU platform. The work shows that ternary weights retain most of the full-precision performance on complex reasoning tasks while cutting weight memory by 8× at a training throughput cost of about 4.5%.

Abstract and Methodology

BitCPM-CANN is a family-level investigation of 1.58-bit ternary quantization-aware training (QAT) on the Huawei Ascend NPU ecosystem. The study tackles two practical hurdles for ultra-low-bit Large Language Models (LLMs): showing that ternary weights preserve capability on complex reasoning tasks at on-device scales, and establishing native end-to-end 1.58-bit training outside traditional CUDA environments.
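To make "1.58-bit ternary" concrete, here is a minimal sketch of the forward-pass weight quantization step. It assumes the absmean scaling scheme popularized by BitNet b1.58; the summary does not say which quantizer BitCPM-CANN actually uses, and the function names are hypothetical. The "1.58" figure is simply log2(3), the information content of a three-valued weight. In QAT the full-precision weights are typically kept as latent parameters and updated through a straight-through estimator, which this sketch does not show.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Map a flat list of floats to {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (as in BitNet b1.58). The source does
    not confirm that BitCPM-CANN uses this exact quantizer.
    """
    # Scale is the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]

# Three possible values carry log2(3) bits of information per weight.
bits_per_weight = math.log2(3)

# Illustrative values only, not taken from any model.
weights = [0.42, -0.03, -0.77, 0.10, 0.0, 1.30]
ternary, scale = ternary_quantize(weights)
```

Small weights collapse to 0 and large ones saturate at ±1, so all magnitude information lives in the single scale factor. That is why the scale must be stored alongside the ternary values at inference time.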

Implementation and Scale

The researchers ported their existing GPU-based pipeline to CANN and MindSpeed, integrating it with Megatron-LM. They trained four model variants (BitCPM-CANN-0.5B, 1B, 3B, and 8B), each strictly aligned in architecture and pre-training data with its full-precision MiniCPM4 counterpart, so that any performance gap can be attributed to quantization rather than to differences in model or data.

The authors describe this as the first end-to-end 1.58-bit training system on a domestic (Chinese-made) NPU to scale to 8 billion parameters, and as reusable low-bit training infrastructure for the Ascend ecosystem.

Performance Metrics and Efficiency Gains

The models were benchmarked on 11 tasks covering commonsense reasoning, domain knowledge, and mathematics & reasoning. The main results:

  • Performance Retention: The 1B, 3B, and 8B variants retained between 95.7% and 97.2% of the performance of their full-precision counterparts.
  • Task Parity: The 3B variant reached parity on the BBH benchmark, and both the 3B and 8B variants recovered nearly all full-precision performance on the GSM8K mathematical reasoning task.
  • Sub-Billion Scale Bottleneck: The 0.5B variant retained 90.1% of performance, with the residual performance gap concentrated in mathematical tasks. The authors posit that at sub-billion parameter scales, the primary bottleneck is capacity rather than the quantizer itself.

Computational Viability

QAT added only about 4.5% overhead to training throughput (148 vs. 155 TFLOP/s per NPU), which makes ternary training viable as a default configuration. At inference time, the quantization scheme reduces weight memory by 8×, or approximately 6× end-to-end once scaling factors are included.
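The figures above can be checked with a few lines of arithmetic. This sketch assumes a 16-bit full-precision baseline and ternary weights packed into 2 bits each, which is consistent with the stated 8× but is not spelled out in the summary; the function names are hypothetical. The summary also does not explain how the roughly 6× end-to-end figure is accounted for, so the last line only derives the average storage per weight that such a ratio would imply.

```python
def throughput_overhead(baseline_tflops, qat_tflops):
    """Fractional throughput loss of QAT relative to the baseline."""
    return (baseline_tflops - qat_tflops) / baseline_tflops

def weight_compression(full_bits=16, packed_bits=2):
    """Compression ratio for weights alone.

    Assumption: 16-bit baseline weights and 2-bit packed ternary
    storage; neither is stated explicitly in the source.
    """
    return full_bits / packed_bits

# Throughput numbers quoted in the text: 155 vs. 148 TFLOP/s per NPU.
overhead = throughput_overhead(155.0, 148.0)

# Weights only: 16 bits down to 2 bits.
weight_ratio = weight_compression()

# Average bits per weight implied by a 6x end-to-end reduction,
# under the same 16-bit baseline assumption.
implied_bits = 16 / 6

print(f"QAT overhead: {overhead:.1%}")
print(f"Weight compression: {weight_ratio:.0f}x")
print(f"Implied bits per weight at 6x: {implied_bits:.2f}")
```

Under these assumptions, the gap between 2 bits and the implied average per weight is the cost of everything stored besides the packed ternary values.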

Comparative Performance

The BitCPM-CANN 8B model is comparable to Qwen3-8B despite using far less training data: 8 trillion tokens, versus the 36 trillion tokens used to train Qwen3-8B. (MiniCPM4, the full-precision baseline, was released in June 2025; see the MiniCPM4 paper.)
