I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size


I trained a small language model from scratch called KeyLM. It is a 75M-parameter, decoder-only model, available as a pretrained base, an instruction-tuned version, and a GGUF. On IFEval (instruction following), the 75M instruct model scores slightly higher than the original SmolLM-135M-Instruct with about half the parameters and a fraction of the training data: SmolLM was pretrained on 600B tokens and SmolLM2 on 2T tokens, while KeyLM was pretrained on only 18B tokens.
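To put "about half the parameters and a fraction of the training data" into numbers, here is a minimal sketch that computes the ratios from the figures quoted above. The dictionary layout and variable names are my own for illustration, and I am assuming the 600B and 2T token counts apply to the 135M-class SmolLM and SmolLM2 models being compared.

```python
# Figures as quoted in the post; names and layout are illustrative only.
KEYLM = {"params": 75e6, "tokens": 18e9}
SMOLLM = {"params": 135e6, "tokens": 600e9}
SMOLLM2_TOKENS = 2e12  # assumption: token count for the comparable SmolLM2 model

# Parameter ratio: KeyLM relative to SmolLM-135M.
param_ratio = KEYLM["params"] / SMOLLM["params"]

# Pretraining-data ratios: KeyLM relative to SmolLM and SmolLM2.
token_ratio_smollm = KEYLM["tokens"] / SMOLLM["tokens"]
token_ratio_smollm2 = KEYLM["tokens"] / SMOLLM2_TOKENS

# Tokens seen per parameter during pretraining.
keylm_tokens_per_param = KEYLM["tokens"] / KEYLM["params"]
smollm_tokens_per_param = SMOLLM["tokens"] / SMOLLM["params"]

print(f"params vs SmolLM-135M: {param_ratio:.1%}")
print(f"tokens vs SmolLM:      {token_ratio_smollm:.1%}")
print(f"tokens vs SmolLM2:     {token_ratio_smollm2:.1%}")
print(f"tokens per param:      KeyLM {keylm_tokens_per_param:.0f}, "
      f"SmolLM {smollm_tokens_per_param:.0f}")
```

By these figures KeyLM has roughly 56% of the parameters and saw 3% of SmolLM's pretraining tokens, or under 1% of SmolLM2's.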
