Analyzing Expert Subset Scaling in Mixture-of-Experts (MoE) Architectures

A technical inquiry into the architectural constraints of Mixture-of-Experts (MoE) models, specifically focusing on the relationship between total parameter counts and the active parameter subset used during inference, using the Qwen3.6 35B A3B model as a case study.

The Dynamics of Active Parameters in MoE Models

In Mixture-of-Experts (MoE) architectures, a critical distinction exists between the total parameter count and the active parameter count. The total parameter count represents the entire capacity of the model, while the active parameter count (the subset) refers to the specific number of parameters engaged during a single forward pass to process a given token.

In the case of the Qwen3.6 35B A3B model, the model has a total of 35 billion parameters, but only about 3 billion are active per token. This design keeps the knowledge capacity of a large model while significantly reducing the computation (FLOPs) required per token at inference. All 35 billion parameters must still be stored, so the saving is in compute and speed rather than in model size. This makes the model practical to run on local hardware via frameworks like llama.cpp.
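The size of the compute saving can be estimated with back-of-the-envelope arithmetic. The sketch below uses the 35B and 3B figures from the text and assumes the common rule of thumb of about 2 FLOPs per active weight per token, ignoring attention over the context. The function names are illustrative and the results are rough estimates, not measurements.

```python
# Rough estimate of how much compute an MoE saves per token.
# Assumption: ~2 FLOPs (one multiply, one add) per active weight per token.

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights that take part in one token's forward pass."""
    return active_params / total_params

def approx_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost; ignores attention over the context."""
    return 2.0 * active_params

TOTAL_PARAMS = 35e9   # total parameters, from the model name
ACTIVE_PARAMS = 3e9   # active parameters per token, from the model name

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
moe_flops = approx_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_flops_per_token(TOTAL_PARAMS)  # hypothetical dense 35B model

print(f"active fraction: {fraction:.1%}")
print(f"approx FLOPs/token (MoE): {moe_flops:.1e}")
print(f"approx FLOPs/token (dense, same total size): {dense_flops:.1e}")
print(f"compute ratio dense/MoE: {dense_flops / moe_flops:.1f}x")
```

Under these assumptions, fewer than one in ten weights is used for any given token, which is where the inference speedup comes from.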

Scaling the Expert Subset: Architectural Considerations

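The question arises whether the size of the active subset (for example, increasing it from 3B to 6B or 8B) can be scaled independently or is dictated by the overall model size. In MoE design, the active parameter count is a design choice rather than a fixed function of total size. It is shaped by the following factors:

  • The Number of Experts: The total pool of available specialized feed-forward networks. This mainly sets the total parameter count, not the active one.
  • Top-k Routing: The routing mechanism that selects the 'k' most relevant experts for each token, so only those k experts run in each MoE layer.
  • Expert Size and Always-Active Parameters: Each selected expert contributes its own parameters, while attention layers and embeddings run for every token regardless of routing.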
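A minimal sketch of top-k routing and its effect on the active parameter count follows. It is a generic illustration, not the Qwen3.6 implementation: the function names, the softmax-then-renormalize gating, and all the numbers are assumptions chosen to keep the example small.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def active_params(always_active, params_per_expert, k, num_moe_layers):
    """Parameters touched per token: fixed part plus k experts in each MoE layer."""
    return always_active + num_moe_layers * k * params_per_expert

# Toy router scores for one token over 4 experts (hypothetical values).
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print([i for i, _ in routes])  # indices of the selected experts

# Toy sizes (hypothetical units): doubling k does not double the active count,
# because the always-active part stays fixed.
small_k = active_params(always_active=100, params_per_expert=10, k=2, num_moe_layers=4)
large_k = active_params(always_active=100, params_per_expert=10, k=4, num_moe_layers=4)
print(small_k, large_k)  # 180 260
```

The total number of experts does not appear in `active_params` at all, which is why the pool can grow (raising total parameters) without changing the per-token cost.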

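Increasing the active subset (e.g., by raising k) increases the computation and the volume of weights read per token, which slows inference. It may improve output quality, but this is not guaranteed: a model is trained with a specific k, and changing it at inference time without retraining may bring no benefit or even degrade results. Memory for the weights is set by the total parameter count, since every expert must be stored, so raising k alone does not materially change VRAM usage. The balance between total parameters and active parameters is a deliberate engineering trade-off intended to optimize the efficiency-to-performance ratio.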
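To separate the two costs involved, the sketch below estimates weight storage (driven by total parameters) and weights read per token (driven by active parameters). It assumes 0.5 bytes per parameter, roughly a 4-bit quantization, and ignores quantization overhead, the KV cache, and activations. The 6B case is hypothetical, taken from the question in the text.

```python
# Storage depends on total parameters; per-token traffic depends on active ones.
# Assumption: 0.5 bytes per parameter (about 4-bit quantization), no overheads.

BYTES_PER_PARAM = 0.5

def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold every weight, whether or not it is used per token."""
    return total_params * bytes_per_param / 1e9

def weights_read_per_token_gb(active_params: float, bytes_per_param: float) -> float:
    """Approximate weight bytes that must be read to process one token."""
    return active_params * bytes_per_param / 1e9

TOTAL_PARAMS = 35e9

for active in (3e9, 6e9):  # 3B as in the text; 6B is a hypothetical larger subset
    storage = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
    traffic = weights_read_per_token_gb(active, BYTES_PER_PARAM)
    print(f"active={active / 1e9:.0f}B  storage={storage:.1f} GB  read/token={traffic:.1f} GB")
```

In this simplified model, doubling the active subset leaves storage unchanged and doubles the weights read per token, which matters because local inference is often limited by memory bandwidth.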

Note: The provided source is a community discussion and does not provide specific technical documentation on other models with larger active subsets or the exact mathematical constraints of the Qwen3.6 architecture. Further empirical data would be required to determine the exact correlation between total size and subset size across different model families.
