Supra-50M Launched: A Compact, High-Performance Causal Language Model for Local Deployment
SupraLabs has introduced Supra-50M, a 50-million-parameter causal language model. Built on a Llama-style architecture and trained on 20 billion tokens of educational web text, the model reports competitive performance across several key benchmarks despite its small parameter count, which positions it as a candidate for local, resource-constrained inference.
Model Overview and Architecture
Supra-50M is presented as the inaugural model in the "SupraLabs Scaling Up Plan." It is available in two configurations, Base and Instruct, which cover both fine-tuning and direct conversational use. The model is a decoder-only transformer built on a Llama-style design.
Technical Specifications
The compact size of Supra-50M follows from its small architectural dimensions. Key architectural details include:
- Architecture: Llama (decoder-only transformer)
- Parameters: Approximately 50M
- Vocab Size: 32,000
- Hidden Size: 512
- Attention Heads: 8
- GQA (Key-Value Heads): 4
- Max Position Embeddings: 1,024
- Precision: bfloat16
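To see how these dimensions relate to the parameter budget, the sketch below counts the weights of a generic Llama-style decoder. The vocabulary size, hidden size, head counts, and 2-byte bfloat16 width come from the list above. The layer count, MLP intermediate size, and tied input/output embeddings are not stated in the article, so the values passed in the usage lines are hypothetical placeholders rather than the published configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    # ASSUMPTION: neither value is given in the article; callers must supply them.
    num_layers: int
    intermediate_size: int
    # Values taken from the specification list above.
    vocab_size: int = 32_000
    hidden_size: int = 512
    num_heads: int = 8
    num_kv_heads: int = 4
    # ASSUMPTION: input and output embeddings share one weight matrix.
    tie_embeddings: bool = True

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads

    def embedding_params(self) -> int:
        return self.vocab_size * self.hidden_size

    def attention_params_per_layer(self) -> int:
        # With GQA, the K and V projections are sized by the KV head count.
        kv_dim = self.num_kv_heads * self.head_dim
        q_and_o = 2 * self.hidden_size * self.hidden_size
        k_and_v = 2 * self.hidden_size * kv_dim
        return q_and_o + k_and_v

    def mlp_params_per_layer(self) -> int:
        # Llama-style gated MLP: gate, up, and down projections, no biases.
        return 3 * self.hidden_size * self.intermediate_size

    def total_params(self) -> int:
        per_layer = (
            self.attention_params_per_layer()
            + self.mlp_params_per_layer()
            + 2 * self.hidden_size  # two RMSNorm weight vectors per layer
        )
        total = self.embedding_params() + self.num_layers * per_layer
        total += self.hidden_size  # final RMSNorm
        if not self.tie_embeddings:
            total += self.embedding_params()  # separate output head
        return total

    def bf16_bytes(self) -> int:
        return 2 * self.total_params()  # bfloat16 stores 2 bytes per weight


# Hypothetical depth and MLP width, chosen only to exercise the function.
shape = LlamaShape(num_layers=12, intermediate_size=1_536)
print(shape.head_dim, shape.embedding_params(), shape.attention_params_per_layer())
print(shape.total_params(), shape.bf16_bytes())
```

Two points hold regardless of the assumed depth: the 32,000 x 512 embedding table alone accounts for about 16.4M weights, and using 4 key-value heads instead of 8 halves the size of the K and V projections in every layer.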
Training Methodology and Data
The training regimen emphasized high-quality educational content. The model was trained on 20 billion tokens drawn from the `sample-100BT` subset of HuggingFaceFW/fineweb-edu.
Data and Tokenization
The training process used a custom Byte-Level BPE tokenizer, trained from scratch on a sample of 500,000 documents from the fineweb-edu dataset. Training the tokenizer on the same corpus is intended to give a vocabulary that fits the educational text the model sees.
| Property | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (`sample-100BT`) |
| Total Tokens | 20 Billion |
| Sequence Length | 1,024 tokens |
| Tokenizer Type | ByteLevelBPETokenizer |
| Special Tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
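The article does not say how documents were turned into 1,024-token training sequences. One common approach, shown here purely as an assumption, is to concatenate tokenized documents with an end-of-sequence separator and cut the stream into fixed-length blocks. The function name `pack_sequences` and the `eos_id` value are hypothetical.

```python
from typing import Iterable, Iterator, List

SEQ_LEN = 1_024  # sequence length from the table above


def pack_sequences(
    docs: Iterable[List[int]], eos_id: int, seq_len: int = SEQ_LEN
) -> Iterator[List[int]]:
    """Concatenate token-id lists, separated by eos_id, into fixed-size blocks."""
    buffer: List[int] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # A trailing partial block is dropped in this sketch.


# Toy example with a short block length so the result is easy to inspect.
# eos_id=2 is an ASSUMPTION (it would be the id of </s> only if the special
# tokens were assigned ids in the order listed in the table).
toy_docs = [[10, 11, 12, 13, 14], [20, 21, 22]]
blocks = list(pack_sequences(toy_docs, eos_id=2, seq_len=4))
print(blocks)

# At 1,024 tokens per sequence, 20 billion tokens fill this many full blocks:
print(20_000_000_000 // SEQ_LEN)
```

Under this scheme, the 20 billion training tokens correspond to 19,531,250 full sequences of 1,024 tokens.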
Performance Benchmarks and Efficiency
A critical aspect of Supra-50M is its efficiency. The model demonstrates competitive or superior results on several academic and logical benchmarks when compared to significantly larger open-source models.
Benchmark Comparison
The table below compares Supra-50M against larger models such as GPT-2 (124M), SmolLM-135M, and OpenELM-270M:
| Benchmark | Supra-50M (Ours) | GPT-2 (124M) | SmolLM-135M | OpenELM-270M |
|---|---|---|---|---|