Supra-50M Launched: A Compact, High-Performance Causal Language Model for Local Deployment

SupraLabs has introduced Supra-50M, a 50-million-parameter causal language model. Built on a Llama-style architecture and trained on 20 billion tokens of educational web text, the model is reported to perform competitively on several benchmarks despite its small parameter count, which makes it a candidate for local, resource-constrained inference.

Model Overview and Architecture

Supra-50M is presented as the inaugural model in the "SupraLabs Scaling Up Plan." It is available in two configurations: Base, intended for fine-tuning, and Instruct, intended for direct conversational use. The model is a decoder-only transformer that follows the Llama design.

Technical Specifications

The compact size of Supra-50M follows from its small architectural dimensions. Key details include:

Architecture: Llama (decoder-only transformer)
Parameters: Approximately 50M
Vocab Size: 32,000
Hidden Size: 512
Attention Heads: 8
GQA (Key-Value Heads): 4
Max Position Embeddings: 1,024
Precision: bfloat16
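
As a rough sanity check on the "approximately 50M" figure, the parameter count of a Llama-style decoder can be estimated from the dimensions listed above. The sketch below is only an estimate and not SupraLabs' code. The source does not state the number of layers or the MLP intermediate size, so `num_layers=12` and `intermediate_size=1408` are purely hypothetical placeholders; tied input/output embeddings and bias-free projections are also assumptions.

```python
def llama_param_count(vocab_size, hidden_size, num_heads, num_kv_heads,
                      num_layers, intermediate_size, tie_embeddings=True):
    """Estimate the parameter count of a Llama-style decoder (no biases)."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim  # width of the K and V projections under GQA

    # Attention: q_proj and o_proj are hidden x hidden,
    # k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate_proj, up_proj and down_proj.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embeddings + lm_head + num_layers * (attention + mlp + norms) + final_norm


if __name__ == "__main__":
    total = llama_param_count(
        vocab_size=32_000,       # from the article
        hidden_size=512,         # from the article
        num_heads=8,             # from the article
        num_kv_heads=4,          # from the article
        num_layers=12,           # hypothetical, not stated in the source
        intermediate_size=1408,  # hypothetical, not stated in the source
    )
    print(f"Estimated parameters: {total / 1e6:.1f}M")
```

With 4 key-value heads instead of 8, the key and value projections are half as wide as the query projection (256 versus 512 columns), which is where grouped-query attention saves parameters and KV-cache memory.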

Training Methodology and Data

The training regimen emphasized high-quality educational content. The model was trained on 20 billion tokens drawn from HuggingFaceFW/fineweb-edu (specifically, the `sample-100BT` subset).

Data and Tokenization

The training process used a custom Byte-Level BPE tokenizer, trained from scratch on a sample of 500,000 documents from the fineweb-edu dataset. Because the vocabulary is learned from the same educational corpus the model is trained on, it is fitted to that text rather than inherited from another model.
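
The tokenizer setup described above could be reproduced along the following lines with the Hugging Face `tokenizers` library. This is a minimal sketch, not the authors' script: the helper names `take_texts` and `train_tokenizer` are hypothetical, `min_frequency=2` is an assumed value the source does not give, and the records are assumed to be dictionaries with a `"text"` field, as in fineweb-edu.

```python
from itertools import islice

# Special tokens and vocabulary size as listed in the article.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
VOCAB_SIZE = 32_000


def take_texts(records, limit):
    """Yield the "text" field of the first `limit` records of an iterable."""
    for record in islice(records, limit):
        yield record["text"]


def train_tokenizer(records, num_documents=500_000, min_frequency=2):
    """Train a byte-level BPE tokenizer on a sample of documents.

    `min_frequency` is an assumption; the source does not state it.
    """
    # Imported lazily so the sampling helper works without the library installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        take_texts(records, num_documents),
        vocab_size=VOCAB_SIZE,
        min_frequency=min_frequency,
        special_tokens=SPECIAL_TOKENS,
    )
    return tokenizer
```

In practice, `records` could be a streamed dataset, for example `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-100BT", split="train", streaming=True)` from the `datasets` library, so that only the first 500,000 documents are ever downloaded.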

Property	Value
Dataset	HuggingFaceFW/fineweb-edu (`sample-100BT`)
Total Tokens	20 Billion
Sequence Length	1,024 tokens
Tokenizer Type	ByteLevelBPETokenizer
Special Tokens	<s>, <pad>, </s>, <unk>, <mask>
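
To put these figures in proportion, the arithmetic below converts the token budget into training sequences and tokens per parameter. It is a back-of-the-envelope sketch that assumes exactly 20 billion tokens, fully packed 1,024-token sequences with no padding, and a round 50 million parameters; the batch size of 256 sequences per step is a hypothetical value, not one reported in the source.

```python
TOTAL_TOKENS = 20_000_000_000  # "20 Billion", taken as an exact figure
SEQ_LEN = 1_024                # sequence length from the table
APPROX_PARAMS = 50_000_000     # "approximately 50M"


def training_budget(total_tokens, seq_len, params):
    """Return (number of full sequences, tokens seen per parameter)."""
    sequences = total_tokens // seq_len
    tokens_per_param = total_tokens / params
    return sequences, tokens_per_param


def optimizer_steps(sequences, sequences_per_step):
    """Steps needed to consume all sequences once (ceiling division)."""
    return -(-sequences // sequences_per_step)


if __name__ == "__main__":
    sequences, ratio = training_budget(TOTAL_TOKENS, SEQ_LEN, APPROX_PARAMS)
    print(f"Packed sequences: {sequences:,}")
    print(f"Tokens per parameter: {ratio:.0f}")
    # 256 sequences per step is a hypothetical batch size.
    print(f"Steps at 256 sequences/step: {optimizer_steps(sequences, 256):,}")
```

Under these assumptions the model sees roughly 400 tokens per parameter, which is a large data budget relative to its size.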

Performance Benchmarks and Efficiency

Efficiency is the central claim for Supra-50M. According to SupraLabs, the model achieves competitive or better results on several academic and reasoning benchmarks when compared with considerably larger open-source models.

Benchmark Comparison

The table below compares Supra-50M against larger models such as GPT-2 (124M), SmolLM-135M, and OpenELM-270M:

Benchmark	Supra-50M (Ours)	GPT-2 (124M)	SmolLM-135M	OpenELM-270M

Techyon - AI News Aggregator