
Sub-500K Parameters on WikiText-103 Char-Level

What can 423,000 parameters achieve on a standard language modeling benchmark? 2.92 bits-per-character in 20 minutes, trained on a laptop.

April 10, 2026 · Efficient ML · Samir Awuapara

Key Results

  • Val BPC: 2.92
  • Parameters: 423K
  • Wall Time: 20 min
  • Hardware: M1 Pro

Motivation

Most language model research optimizes for scale: more parameters, more data, more compute. The efficiency question — how much can you learn with how little? — gets less attention, particularly at the character level.

WikiText-103 character-level language modeling is a well-defined benchmark: predict the next character in a stream of ~538 million training characters drawn from Wikipedia's verified Good and Featured articles. The metric is bits-per-character (BPC) — lower is better, with the theoretical minimum determined by the true entropy of English text.

We trained a novel neural network built from scratch in C++ and Metal on this benchmark, with the goal of seeing how far sub-500K parameters can go on consumer hardware. No frameworks were used. No pre-training. No pre-existing architecture components.

Setup

  • Task: Character-level language modeling (98-token ASCII vocab)
  • Dataset: WikiText-103 raw character stream (538M train tokens, 1.1M val tokens)
  • Hardware: Apple M1 Pro, 16 GB unified memory, single GPU
  • Implementation: Custom C++ engine with Metal compute shaders. No frameworks.

Learning Curve

The network starts from random initialization and trains for 20 minutes of wall time. The architecture grows during training: the 423K figure is the final parameter count, not the starting one (the run begins at roughly 39K).

| Wall Time | Train BPC | Val BPC | Parameters |
|-----------|-----------|---------|------------|
| 0s        | 6.64      | 6.62    | 39K        |
| 1m 12s    | 4.19      | 4.22    | ~45K       |
| 3m 02s    | 3.71      | 3.74    | ~55K       |
| 5m 18s    | 3.48      | 3.51    | ~80K       |
| 7m 45s    | 3.32      | 3.36    | ~120K      |
| 10m 10s   | 3.19      | 3.23    | ~170K      |
| 12m 38s   | 3.10      | 3.14    | ~230K      |
| 15m 05s   | 3.02      | 3.06    | ~300K      |
| 17m 30s   | 2.95      | 2.99    | ~370K      |
| 19m 53s   | 2.89      | 2.93    | 423K       |

Best validation BPC: 2.92 (measured at final evaluation checkpoint). Parameter counts interpolated from topology snapshots.

Context

Character-level BPC on WikiText-103 is not a heavily benchmarked task — most modern language models operate at the subword (BPE) level. Published results in this space tend to come from character-aware Transformers or recurrent models with parameter counts in the millions to hundreds of millions.

What makes this result notable is the efficiency ratio:

  • 423K parameters — orders of magnitude smaller than typical language models
  • 20 minutes of training — on a single laptop, not a GPU cluster
  • No pre-training, no transfer learning — trained from random initialization
  • No existing framework — every component written from scratch

The model is not competing with state-of-the-art BPE language models. It's demonstrating that a carefully built small model can learn meaningful structure from raw characters in a fraction of the time and compute typically associated with language modeling research.

Final Network Statistics

  • Edges: 225,749
  • Hidden Units: 2,841
  • Depth: 5 layers
  • Parameters: 423K

Tags: Efficient ML · Character-Level LM · WikiText-103 · Small Models