Sub-500K Parameters on WikiText-103 Char-Level
What can 423,000 parameters achieve on a standard language modeling benchmark? 2.92 bits-per-character in 20 minutes, trained on a laptop.
Key Results
| Val BPC | Parameters | Wall Time | Hardware |
|---|---|---|---|
| 2.92 | 423K | 20 min | M1 Pro |
Motivation
Most language model research optimizes for scale: more parameters, more data, more compute. The efficiency question — how much can you learn with how little? — gets less attention, particularly at the character level.
WikiText-103 character-level language modeling is a well-defined benchmark: predict the next character in a stream of ~538 million training tokens drawn from verified Good and Featured articles on Wikipedia. The metric is bits-per-character (BPC) — lower is better, with the theoretical minimum determined by the true entropy of English text.
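The BPC metric is just the model's average negative log-probability of the observed characters, in base 2. A minimal helper, independent of any particular architecture:

```cpp
#include <cmath>
#include <vector>

// Bits-per-character from per-character model probabilities:
//   BPC = -(1/N) * sum_i log2 p(c_i | context_i)
// Each entry is the probability the model assigned to the character
// that actually occurred; lower BPC means better prediction.
double bits_per_char(const std::vector<double>& probs) {
    double total_bits = 0.0;
    for (double p : probs) total_bits -= std::log2(p);
    return total_bits / static_cast<double>(probs.size());
}
```

As a sanity check, a model that assigns a uniform 1/98 to every character over a 98-symbol vocabulary scores log2(98) ≈ 6.61 BPC, in line with the ~6.6 untrained starting point in the learning curve below.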
We trained a novel neural network, built from scratch in C++ and Metal, on this benchmark to see how far sub-500K parameters can go on consumer hardware. No frameworks. No pre-training. No pre-existing architecture components.
Setup
| | |
|---|---|
| **Task** | Character-level language modeling (98-token ASCII vocab) |
| **Dataset** | WikiText-103 raw character stream (538M train tokens, 1.1M val tokens) |
| **Hardware** | Apple M1 Pro, 16 GB unified memory, single GPU |
| **Implementation** | Custom C++ engine with Metal compute shaders. No frameworks. |
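The 98-token vocabulary might look like the following sketch. The exact symbol set is not specified in this write-up, so the split into 95 printable ASCII characters plus newline, tab, and one unknown slot is an assumption made purely for illustration:

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical 98-token ASCII character vocabulary:
// 95 printable ASCII chars (0x20-0x7E) + '\n' + '\t' + an <unk> slot.
// The engine's actual symbol set is not documented; this is illustrative.
struct CharVocab {
    static constexpr int UNK = 97;   // last id reserved for out-of-vocab bytes
    std::array<int, 256> to_id{};

    CharVocab() {
        to_id.fill(UNK);
        int next = 0;
        to_id[static_cast<unsigned char>('\n')] = next++;
        to_id[static_cast<unsigned char>('\t')] = next++;
        for (int c = 0x20; c <= 0x7E; ++c) to_id[c] = next++;  // 95 printable
    }

    // Map a raw byte stream to token ids; unknown bytes collapse to UNK.
    std::vector<int> encode(const std::string& text) const {
        std::vector<int> ids;
        ids.reserve(text.size());
        for (unsigned char c : text) ids.push_back(to_id[c]);
        return ids;
    }
};
```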
Learning Curve
The network starts from random initialization and trains for 20 minutes of wall time. The architecture grows during training — the parameter count at the end (423K) is the final count, not the starting count.
| Wall Time | Train BPC | Val BPC | Parameters |
|---|---|---|---|
| 0s | 6.64 | 6.62 | 39K |
| 1m 12s | 4.19 | 4.22 | ~45K |
| 3m 02s | 3.71 | 3.74 | ~55K |
| 5m 18s | 3.48 | 3.51 | ~80K |
| 7m 45s | 3.32 | 3.36 | ~120K |
| 10m 10s | 3.19 | 3.23 | ~170K |
| 12m 38s | 3.10 | 3.14 | ~230K |
| 15m 05s | 3.02 | 3.06 | ~300K |
| 17m 30s | 2.95 | 2.99 | ~370K |
| 19m 53s | 2.89 | 2.93 | 423K |
Best validation BPC: 2.92 (measured at final evaluation checkpoint). Parameter counts interpolated from topology snapshots.
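The "~" parameter counts above can be recovered with simple linear interpolation between logged (time, parameter-count) topology snapshots; the exact scheme is not spelled out here, so this sketch assumes straight linear interpolation:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Linearly interpolate a parameter count at time t (seconds) from
// (time, count) topology snapshots sorted by time. Times outside the
// snapshot range clamp to the nearest endpoint.
long interp_params(const std::vector<std::pair<double, long>>& snaps, double t) {
    assert(!snaps.empty());
    if (t <= snaps.front().first) return snaps.front().second;
    for (std::size_t i = 1; i < snaps.size(); ++i) {
        if (t <= snaps[i].first) {
            double frac = (t - snaps[i - 1].first) /
                          (snaps[i].first - snaps[i - 1].first);
            return snaps[i - 1].second +
                   static_cast<long>(frac * (snaps[i].second - snaps[i - 1].second));
        }
    }
    return snaps.back().second;
}
```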
Context
Character-level BPC on WikiText-103 is not a heavily benchmarked task — most modern language models operate at the subword (BPE) level. Published results in this space tend to come from character-aware Transformers or recurrent models with parameter counts in the millions to hundreds of millions.
What makes this result notable is the efficiency ratio:
- 423K parameters — orders of magnitude smaller than typical language models
- 20 minutes of training — on a single laptop, not a GPU cluster
- No pre-training, no transfer learning — trained from random initialization
- No existing framework — every component written from scratch
The model is not competing with state-of-the-art BPE language models. It's demonstrating that a carefully built small model can learn meaningful structure from raw characters in a fraction of the time and compute typically associated with language modeling research.
Final Network Statistics
| Edges | Hidden Units | Depth (Layers) | Parameters |
|---|---|---|---|
| 225,749 | 2,841 | 5 | 423K |