BitCPM-CANN: Achieving Native 1.58-Bit LLM Training on Huawei Ascend NPU

Researchers introduce BitCPM-CANN, a systematic study of 1.58-bit (ternary) Quantization-Aware Training (QAT) on the Huawei Ascend NPU platform. The work shows that ternary-weight LLMs retain most of their full-precision performance on complex reasoning tasks while substantially reducing memory use.

Abstract and Methodology

BitCPM-CANN presents a family-level investigation into 1.58-bit ternary QAT, in which each weight takes one of three values (-1, 0, or +1, so log2(3) ≈ 1.58 bits of information per weight), optimized for the Huawei Ascend NPU ecosystem. The study tackles two practical hurdles for ultra-low-bit Large Language Models (LLMs): showing that ternary weights preserve capability on complex reasoning tasks at on-device scales, and establishing native end-to-end 1.58-bit training outside traditional CUDA environments.
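To make the ternary mapping concrete, here is a minimal sketch of one common quantizer. This summary does not specify the exact quantizer BitCPM-CANN uses, so the absmean scheme below (in the style of BitNet b1.58) is an assumption, and the names `ternary_quantize` and `dequantize` are hypothetical. In QAT, a function like this is typically applied in the forward pass while gradients update latent full-precision weights; that part is not shown.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantizer (assumed scheme, not confirmed by the source)."""
    # Per-tensor scale: the mean absolute value of the weights.
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    # Round each scaled weight to the nearest integer, then clip to {-1, 0, +1}.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]

# Three states per weight carry log2(3) bits, hence the "1.58-bit" name.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]  # illustrative values only
codes, scale = ternary_quantize(weights)
approx = dequantize(codes, scale)
```

Small weights collapse to 0 and large ones saturate at ±1, so the single scale factor carries all of the magnitude information for the tensor.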

Implementation and Scale

The researchers ported their existing GPU-based pipeline to CANN and MindSpeed, integrated with Megatron-LM, and trained four model variants (BitCPM-CANN-0.5B, 1B, 3B, and 8B). Each variant is strictly aligned with the architecture and pre-training data of its full-precision MiniCPM4 counterpart, which allows a like-for-like comparison between ternary and full-precision models.

The work is described as the first end-to-end 1.58-bit training system on a domestic (Chinese-made) NPU scaled up to 8 billion parameters, and it establishes a reusable low-bit training infrastructure for the Ascend ecosystem.

Performance Metrics and Efficiency Gains

The models were benchmarked on 11 tasks covering commonsense reasoning, domain knowledge, and mathematics and reasoning. The results show strong performance retention:

Performance Retention: The 1B, 3B, and 8B variants retained between 95.7% and 97.2% of the performance of their full-precision counterparts.
Task Parity: The 3B variant reached parity on the BBH benchmark, and both the 3B and 8B variants recovered nearly all performance on the demanding GSM8K mathematical reasoning task.
Sub-Billion Scale Bottleneck: The 0.5B variant retained 90.1% of full-precision performance, with the remaining gap concentrated in mathematical tasks. The authors posit that at sub-billion parameter scales, the primary bottleneck is model capacity rather than the quantizer itself.

Computational Viability

From an operational standpoint, QAT reduced training throughput by only about 4.5% (148 vs. 155 TFLOP/s per NPU), making ternary training a viable default configuration. The quantization scheme also yields an 8× reduction in weight memory, which becomes approximately a 6× end-to-end reduction at inference once scaling factors are included.
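The arithmetic behind these figures can be checked with a short sketch. It assumes that 155 TFLOP/s is the full-precision baseline and 148 TFLOP/s is the QAT run, that ternary weights are stored as 2-bit fields (four per byte), and that the comparison baseline is 16-bit weights. None of these storage details are stated in this summary, and the function names are hypothetical. The approximately 6× end-to-end figure depends on details not given here and is not reproduced.

```python
def throughput_overhead(baseline_tflops, qat_tflops):
    """Fractional throughput loss of the QAT run relative to the baseline."""
    return (baseline_tflops - qat_tflops) / baseline_tflops

def pack_ternary(codes):
    """Pack ternary codes {-1, 0, +1} into bytes, four 2-bit fields per byte."""
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c + 1) << (2 * j)  # map -1, 0, +1 to 0, 1, 2
        packed.append(byte)
    return bytes(packed)

def unpack_ternary(packed, n):
    """Recover the first n ternary codes from the packed bytes."""
    codes = []
    for byte in packed:
        for j in range(4):
            codes.append(((byte >> (2 * j)) & 0b11) - 1)
    return codes[:n]

overhead = throughput_overhead(155.0, 148.0)  # 7 / 155

codes = [1, 0, -1, 1, 0, 0, -1, 1]  # illustrative values only
packed = pack_ternary(codes)
fp16_bytes = 2 * len(codes)          # assumed 16-bit baseline: 2 bytes per weight
ratio = fp16_bytes / len(packed)     # 16 bytes vs. 2 bytes
```

Under these assumptions the packed weights occupy one eighth of the 16-bit footprint, which matches the 8× figure; any per-tensor or per-group scale factors add a small amount on top.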

Comparative Performance

The BitCPM-CANN 8B model is comparable to Qwen3-8B while using far fewer training tokens: about 8 trillion, versus the 36 trillion used to train Qwen3-8B. (MiniCPM4 was released in June 2025; see the MiniCPM4 paper.)

Techyon - AI News Aggregator
