llama.cpp Integrates Architecture Support for Cohere2-MoE: Enabling North-Mini-Code 1.0
llama.cpp has added support for the Cohere2-MoE architecture, enabling local deployment of Cohere's North-Mini-Code 1.0, a Mixture-of-Experts model optimized for software engineering and agentic workflows.
Architectural Expansion in llama.cpp
Recent updates to the llama.cpp repository (Pull Request #24260) introduce architecture support for Cohere2-MoE. With this integration, users can run models built on the architecture locally in the GGUF format, using the llama.cpp inference engine for quantized execution on consumer hardware.
Introducing North-Mini-Code 1.0
The primary beneficiary of this update is the North-Mini-Code 1.0 model. Developed by Cohere and Cohere Labs, this model is an open-weights research release designed specifically for high-performance technical tasks.
Technical Specifications
- Parameter Count: 30B total parameters, with 3B active parameters per token (A3B), utilizing a Mixture-of-Experts (MoE) architecture.
- Primary Optimizations: The model is fine-tuned for code generation, agentic software engineering, and terminal-based tasks.
- Licensing: Distributed under the Apache 2.0 license, permitting broad research and commercial application.
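To put the parameter counts above in perspective, the following is a back-of-envelope sketch in Python. It uses only the 30B total and 3B active figures from the specifications; the bit widths (16, 8, 4) are nominal values chosen for illustration and are not measured file sizes for any published quantization. Real GGUF files will differ because of per-block scales, mixed-precision tensors, and metadata.

```python
# Rough arithmetic only: nominal bit widths, not measured GGUF file sizes.
TOTAL_PARAMS = 30e9   # from the specifications: 30B total parameters
ACTIVE_PARAMS = 3e9   # from the specifications: 3B active per token


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a nominal bit width."""
    return params * bits_per_weight / 8 / 2**30


# Share of the weights that take part in computing any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"active fraction per token: {active_fraction:.0%}")
    for bits in (16, 8, 4):  # illustrative widths (assumption)
        size = weight_gib(TOTAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: ~{size:.1f} GiB")
```

Under these assumptions, the full weight set is roughly 56 GiB at 16 bits and roughly 14 GiB at 4 bits. Only about 10% of the parameters are active for each token, which reduces compute per token, but the full set of expert weights typically still has to be available to the runtime.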
Deployment and Implementation
To use the new support, users need to rebuild llama.cpp from a revision that includes the Cohere2-MoE changes. The model is available both as original weights and in GGUF format:
- Original Weights: Available via CohereLabs on Hugging Face.
- Quantized GGUF: Optimized versions provided by Unsloth for reduced memory footprints.
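The rebuild-and-run workflow might look like the following Python sketch, which only assembles the command lines and prints them rather than executing anything. It assumes an existing git checkout of llama.cpp built with CMake, with the `llama-cli` binary ending up under `build/bin`. The checkout directory `llama.cpp` and the file name `north-mini-code.gguf` are hypothetical placeholders, not the names of the published files.

```python
import shlex
from pathlib import Path


def rebuild_commands(repo_dir: str) -> list[list[str]]:
    """Commands to update and rebuild an existing llama.cpp checkout via CMake."""
    build_dir = (Path(repo_dir) / "build").as_posix()
    return [
        ["git", "-C", repo_dir, "pull"],                   # fetch the latest changes
        ["cmake", "-S", repo_dir, "-B", build_dir],        # configure the build
        ["cmake", "--build", build_dir, "--config", "Release"],
    ]


def inference_command(repo_dir: str, model_path: str, prompt: str,
                      n_predict: int = 128) -> list[str]:
    """Command line for a single prompt using the rebuilt llama-cli binary."""
    cli = (Path(repo_dir) / "build" / "bin" / "llama-cli").as_posix()
    return [cli, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


if __name__ == "__main__":
    repo = "llama.cpp"              # hypothetical checkout location
    model = "north-mini-code.gguf"  # hypothetical local GGUF file name
    for cmd in rebuild_commands(repo):
        print(shlex.join(cmd))
    print(shlex.join(inference_command(repo, model, "Write a binary search in C.")))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`. Keeping the commands as argument lists rather than shell strings avoids quoting problems with prompts that contain spaces or special characters.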
Note: This article is based on a community announcement; detailed benchmark performance and specific quantization metrics for the North-Mini-Code 1.0 model were not provided in the source material.