Dynamic Topology: Growing a Sparse Neural Network from Scratch
A self-assembling sparse neural network that grows its own topology during training. It goes from 39K to 3M parameters in about 20 minutes on a single laptop, with 81–87% GPU cache-line utilization on a dynamically changing sparse graph.
Key Results
- Cache-line utilization: 81–87%
- Dynamic growth: 39K → 3M parameters
- Wall time: 20 min
- Hardware: M1 Pro
The Problem with Static Dense Architectures
Most major AI models today are built on static, dense matrices. We guess the architecture size beforehand, allocate a massive grid of parameters, and spend millions of dollars on dense matrix multiplications to train it. The architecture is fixed before the first gradient is computed. If the model is too small, it underfits. If it is too large, you waste compute on parameters that contribute nothing.
But biological brains don't start at full size and just adjust weights. They grow dynamically, wiring new sparse synapses exactly where they are needed, pruning connections that carry no signal, and deepening hierarchies only when shallow logic can't explain the data.
The goal of this project was a systems engineering challenge: Could we build a bare-metal C++/Metal engine capable of training a Dynamically Growing Sparse Directed Acyclic Graph (DAG) on consumer hardware without choking the GPU?
Setup
- Task: Character-level language modeling (98-token ASCII vocab)
- Dataset: WikiText-103 raw character stream (538M train tokens, 1.1M val tokens)
- Hardware: Apple M1 Pro, 16 GB unified memory, single GPU
- Implementation: Custom C++ engine with Metal compute shaders. No frameworks.
What the Network Grows
The network starts as 98 independent sub-DAGs, one for each character in the vocabulary. Each sub-DAG begins as a single hidden neuron connected to 2 input positions with randomly initialized weights. From that embryonic state, the system autonomously performs four types of structural mutations:
- Widening — adding new hidden neurons within an existing layer, increasing the representational bandwidth of that stage
- Deepening — inserting entirely new hidden layers, building deeper hierarchical feature extraction when shallow paths saturate
- Context Expansion — wiring new input positions, allowing the model to look further into the past to capture longer-range dependencies
- Memory Activation — equipping select neurons with LSTM-style gated memory cells, enabling stateful processing where recurrence is needed
An RL-based controller evaluates candidate mutations and probabilistically accepts or rejects each one. Accepted mutations that make validation loss worse are rolled back. The result is a network that builds only the structure it can empirically justify.
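The propose, accept, and roll-back cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: `Topology`, `apply`, and `try_mutation` are hypothetical names, a fixed `accept_prob` stands in for the RL controller's learned policy, and the rollback is modeled by mutating a copy and committing it only if validation loss does not get worse.

```cpp
#include <cassert>
#include <functional>
#include <random>

// The four structural mutations named in the text.
enum class Mutation { Widen, Deepen, ExpandContext, ActivateMemory };

// Hypothetical summary of one sub-DAG's shape (the real engine state is not shown).
struct Topology {
    int hidden_neurons = 1;
    int depth = 1;
    int context = 2;
    int memory_neurons = 0;
};

// Apply a mutation to a copy of the topology. The increments are illustrative only.
inline Topology apply(Topology t, Mutation m) {
    switch (m) {
        case Mutation::Widen:          t.hidden_neurons += 1; break;
        case Mutation::Deepen:         t.depth += 1; t.hidden_neurons += 1; break;
        case Mutation::ExpandContext:  t.context += 1; break;
        case Mutation::ActivateMemory: t.memory_neurons += 1; break;
    }
    return t;
}

// One growth step. Returns true if the mutation was kept.
// accept_prob is an assumed stand-in for the controller's decision;
// val_loss is a caller-supplied evaluation of a candidate topology.
inline bool try_mutation(Topology& current, double& best_val_loss, Mutation m,
                         double accept_prob, std::mt19937& rng,
                         const std::function<double(const Topology&)>& val_loss) {
    assert(accept_prob >= 0.0 && accept_prob <= 1.0);
    std::bernoulli_distribution accept(accept_prob);
    if (!accept(rng)) return false;            // controller rejected the candidate

    Topology candidate = apply(current, m);    // mutate a copy, not the live graph
    const double loss = val_loss(candidate);
    if (loss > best_val_loss) return false;    // rollback: current is left untouched

    current = candidate;                       // commit the accepted mutation
    best_val_loss = loss;
    return true;
}
```

Working on a copy keeps the rollback trivial in this sketch. An engine that mutates GPU buffers in place would need an explicit undo path instead.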
This Is Not a Multi-Layer Perceptron
A standard MLP is a uniform stack: every neuron in layer n connects to every neuron in layer n+1. The topology is a rectangle. Every pathway through the network has the same depth.
The Spark DAG violates every one of those properties:
- Asymmetric depth. Because the network builds itself, the logic is highly asymmetric. Some pathways remain shallow 1-layer logic gates (the space character learned a nearly flat input→output mapping), while the network autonomously built 5-layer deep hierarchical structures specifically for complex character combinations like punctuation sequences and rare bigrams.
- Dynamic context. The model started out looking only 2 characters into the past. By the end of 20 minutes, it had grown sparse connections reaching up to 21 characters back to capture longer dependencies. Each sub-DAG chose its own context window independently: the space character wired 21 input positions, while rare symbols stopped at 6.
- Sparse, irregular connectivity. There are no dense layers. Edges are individually learned, individually gated, and individually prunable. The final graph has 1.8M edges across 2,466 hidden neurons, but the distribution is wildly non-uniform: the busiest neuron has 2,254 incoming edges, while some have just 1.
- Per-character specialization. Each of the 98 output characters has its own independent sub-DAG with its own depth, width, context window, and memory allocation. The network is not a single shared trunk with 98 output heads. It is 98 independent computational graphs that share only the input embedding.
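To make the contrast with a dense layer concrete, here is a minimal sketch of a sub-DAG whose neurons each own a variable-length list of gated incoming edges. The `Edge`, `Node`, and `SubDag` layouts, the scalar gate, and the `tanh` activation are all assumptions for illustration; the source does not describe the engine's actual storage format or nonlinearity.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One individually weighted, individually gated edge (hypothetical layout).
struct Edge {
    int src;       // index of the source: an input position or an earlier neuron
    float weight;
    float gate;    // 0 = effectively pruned, 1 = fully open
};

// A neuron owns its incoming edges, so fan-in can differ per neuron.
struct Node {
    std::vector<Edge> in;
};

// Neurons are stored in topological order. Neuron k has global index
// n_inputs + k, and its edges may only point to smaller indices.
struct SubDag {
    int n_inputs;
    std::vector<Node> nodes;
};

// Evaluate every neuron once, in order. Returns inputs followed by activations.
inline std::vector<float> forward(const SubDag& g, const std::vector<float>& inputs) {
    assert(static_cast<int>(inputs.size()) == g.n_inputs);
    std::vector<float> act(inputs);
    act.reserve(inputs.size() + g.nodes.size());
    for (const Node& n : g.nodes) {
        float sum = 0.0f;
        for (const Edge& e : n.in) {
            assert(e.src >= 0 && e.src < static_cast<int>(act.size()));
            sum += e.gate * e.weight * act[e.src];
        }
        act.push_back(std::tanh(sum));  // assumed activation
    }
    return act;
}
```

Because an edge can point at any earlier index, one neuron may read raw inputs while another reads a chain of hidden neurons, which is how paths of different depth can coexist in the same graph.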
Actual DAG Topology: Character 'e' at Step 1800
5 hidden layers, 14 hidden neurons, context window of 10. Edges are individually weighted and gated. This is one of 98 independent sub-DAGs.
The GPU Problem: Sparse DAGs on Dense Hardware
Training a sparse, dynamically growing DAG on a GPU is historically a hardware nightmare. GPUs love dense matrices; they hate random memory pointers. Standard sparse formats (CSR, COO) scatter data across memory, killing cache locality and leaving SIMD lanes idle. This is why virtually all practical neural networks use dense layers despite the theoretical appeal of sparsity.
To reach 20-minute training on an M1 Pro, we abandoned standard frameworks like PyTorch entirely and wrote a custom sparse matrix-vector multiplication (SpMV) engine in Metal from scratch. By carefully managing array allocations as the graph grows, the engine achieves 81–87% L1 cache-line utilization on sparse lookups, letting the hardware evaluate this irregular structure at near-dense-matrix speeds.
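The source does not say how cache-line utilization was measured. One plausible definition, sketched below, is the fraction of bytes in every fetched cache line that a gather actually reads. The function name, the line size passed by the caller, and the assumption that the array starts on a line boundary are all illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Fraction of fetched cache-line bytes that a sparse gather actually uses.
// indices:    element indices read from one array (duplicates count once)
// elem_bytes: size of one element, e.g. 4 for a 32-bit float
// line_bytes: cache-line size, supplied by the caller as an assumption
// Assumes the array base address is aligned to a cache line.
inline double cache_line_utilization(const std::vector<std::uint32_t>& indices,
                                     std::size_t elem_bytes, std::size_t line_bytes) {
    assert(elem_bytes > 0 && line_bytes >= elem_bytes && line_bytes % elem_bytes == 0);
    if (indices.empty()) return 0.0;

    const std::size_t per_line = line_bytes / elem_bytes;
    std::unordered_set<std::uint32_t> elems(indices.begin(), indices.end());
    std::unordered_set<std::uint32_t> lines;
    for (std::uint32_t i : elems) {
        lines.insert(static_cast<std::uint32_t>(i / per_line));
    }
    // Useful bytes divided by bytes transferred in whole lines.
    return static_cast<double>(elems.size() * elem_bytes) /
           static_cast<double>(lines.size() * line_bytes);
}
```

With 4-byte elements and an assumed 128-byte line, 32 contiguous indices give a utilization of 1.0, while four indices spaced 32 elements apart give 0.03125. Keeping each neuron's sources packed close together in memory is what pushes this ratio up.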
- Cache-line utilization: 81–87%
- Peak GPU memory: 145 MB
- Implementation: C++ / Metal, zero frameworks
Watching the Organism Grow
The network starts with 39K parameters, 98 hidden neurons (one per character), a maximum depth of 1, and a context window of 2. Watch the graph build itself:
| Wall Time | Val BPC | Params | Hidden Neurons | Max Depth | Context Window |
|---|---|---|---|---|---|
| 0s | 6.62 | 39K | 98 | 1 | 2 |
| 11s | 3.80 | 157K | 436 | 2 | 4 |
| 37s | 3.32 | 397K | 802 | 3 | 7 |
| 1m 33s | 3.24 | 620K | 1,123 | 4 | 9 |
| 2m 36s | 3.15 | 774K | 1,351 | 5 | 10 |
| 6m 21s | 3.03 | 1.3M | 1,716 | 5 | 14 |
| 13m 14s | 2.94 | 2.5M | 2,133 | 5 | 19 |
| 19m 53s | 2.89 | 3.0M | 2,297 | 5 | 21 |
Context window shows the maximum across all 98 sub-DAGs. Max depth is capped at 5 by design: once the cap was reached, the algorithm kept growing by widening layers and extending context instead.
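For readers unfamiliar with the metric, bits per character (BPC) is the mean cross-entropy per character expressed in base 2. The helpers below are a small sketch with hypothetical names, and they assume the loss is averaged in nats; the source does not say how the engine computes it.

```cpp
#include <cassert>
#include <cmath>

// Convert mean cross-entropy in nats per character to bits per character.
inline double bpc_from_nats(double mean_nll_nats) {
    assert(mean_nll_nats >= 0.0);
    return mean_nll_nats / std::log(2.0);
}

// BPC of a model that assigns equal probability to every token.
inline double uniform_bpc(int vocab_size) {
    assert(vocab_size > 0);
    return std::log2(static_cast<double>(vocab_size));
}
```

For the 98-token vocabulary, `uniform_bpc(98)` is about 6.61, close to the 6.62 in the first row of the table, which suggests the freshly initialized network starts near uniform guessing.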
What This Result Is — and What It Isn’t
2.92 BPC on character-level WikiText-103 is not a competitive language modeling result. Established architectures — including simple recurrent models — achieve substantially better perplexity on this benchmark at comparable parameter counts.
That is not what this project set out to demonstrate.
The contribution is a systems engineering result: a custom sparse GPU engine that enables stable, continuous network morphism training with near-dense cache efficiency on consumer hardware. The network grew from 39K to 3M parameters over 20 minutes without memory fragmentation, without training instability, and without manual architecture decisions. The 81–87% L1 cache-line utilization on a dynamically mutating sparse DAG is the metric that matters.
The language modeling task is a proving ground — a way to stress-test the engine under realistic workload. The BPC number confirms the network is learning; the systems metrics confirm the engine works.
Final Network Statistics
- Edges: 1,812,885
- Hidden units: 2,466
- Max depth (layers): 5
- Memory neurons: 89
- Max context window: 23
- Independent sub-DAGs: 98
- Training steps: 4,181
- Total training time: 21m 40s