
Organic Cache Locality: Implicit Memory Coalescing in Dynamically Grown Sparse Neural Networks

Dynamically grown sparse DAGs naturally produce near-perfect memory coalescing. The same chronological allocation that makes growth simple also makes it cache-friendly — no graph partitioning required.

April 12, 2026 · Technical Report · Samir Awuapara

Key Results

- Cache lines / warp: 1.0
- Cache utilization: 77–89%
- Sorting passes: 0
- Edges profiled: 1.6M

Abstract

Sparse matrix–vector (SpMV) kernels on GPU architectures are fundamentally memory-bound: arithmetic intensity is far below the roofline, and performance is dominated by the cost of gathering source activations from irregular memory locations. The standard response is to apply expensive offline reordering passes — Reverse Cuthill–McKee, METIS graph partitioning, or cache-oblivious space-filling curves — to restructure index arrays for better spatial locality. These approaches treat memory layout as a post-hoc optimization applied to a fixed topology.

We demonstrate that dynamically grown sparse neural networks sidestep this problem entirely. In a continuous-learning autoregressive character model using dynamic Network Morphism, new neurons are allocated chronologically — appended to the activation array in the order they are created. Edges preferentially connect to recently created nodes because the growth policy targets regions of high local gradient signal. The resulting topology exhibits a temporal–spatial homomorphism: nodes that are close in creation time are close in memory, and the growth dynamics ensure that edges cluster within these temporal neighborhoods.

We profile a production sparse training engine across a growth trajectory from 19K to 1.66M edges and observe 1.0–1.1 cache lines touched per 32-edge SIMD warp (the theoretical minimum is 1.0) with 77–89% cache-line utilization — achieved with zero reordering passes, zero graph partitioning, and zero offline preprocessing. We formalize the conditions under which this organic locality emerges and provide a lightweight synthetic DAG generator as a reproducibility artifact.

1. Introduction

The dominant approach to neural network computation — dense matrix multiplication on tensor cores — achieves high hardware utilization by construction. Dense GEMM tiles map cleanly onto GPU warp geometry, and the access patterns are perfectly coalesced. Sparse networks forfeit these guarantees. When a neuron's fan-in is an irregular list of source indices, the forward pass becomes a gather operation over unpredictable memory locations. On both Apple Silicon and NVIDIA hardware, this reduces the problem to memory bandwidth efficiency: how many useful bytes are loaded per cache-line fetch?

The literature addresses this with post-hoc reordering. Reverse Cuthill–McKee (RCM) minimizes matrix bandwidth. METIS partitions the graph to cluster communicating nodes. CSR5 and CSR-Adaptive restructure index storage for warp-level coalescing. All of these assume a static topology: the graph is fixed, and the optimizer works to find a good layout for it.

We evaluate a different class of architecture: a dynamically grown sparse DAG where the topology itself is constructed incrementally during training. The network begins with minimal connectivity and grows via two mutation types: Width-Expansion (Parallel) — adding new neurons within an existing layer — and Depth-Expansion (Serial) — inserting new neurons between layers, increasing network depth. The exact mutation policy is a heuristic-driven combinatorial selection optimized for local error gradients; its design is an orthogonal ML problem outside the scope of this paper.

Our contribution is an observation and its formalization: the memory layout that results from chronological allocation during growth is inherently cache-friendly for SpMV dispatch. No reordering is needed because the growth process itself produces a spatially clustered layout. We call this property organic cache locality.

2. Background & Hardware Physics

2.1 Cache Architecture

Both Apple Silicon and NVIDIA GPU architectures load memory in 128-byte cache lines. For 32-bit floating-point activations (sizeof(float) = 4), each cache line holds exactly 32 consecutive floats. When a SIMD group (32 threads on Apple Metal; a warp on CUDA) reads activations[src_id] for 32 different source indices, the hardware issues one cache-line fetch per unique cache line touched. If all 32 source indices fall within the same 32-float span, exactly one cache line is fetched. If they scatter across $k$ distinct lines, $k$ lines are fetched and $(k-1) \times 128$ bytes are wasted.

The metric that governs SpMV performance is therefore:

$\text{cache\_lines\_per\_warp} = \frac{\text{distinct cache lines touched}}{\text{warps dispatched}}$

The theoretical minimum is 1.0 (perfect coalescing). Random index scatter touches approximately $\min(32, N/32)$ distinct lines per warp, where $N$ is the activation array size in floats.
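This metric is cheap to compute offline from the CSR index stream. A minimal NumPy sketch (the helper name and the 1M-float array size are illustrative, not taken from the engine):

```python
import numpy as np

LINE_FLOATS = 128 // 4  # 32 fp32 activations fit in one 128-byte cache line

def cache_lines_per_warp(src_ids):
    """Average distinct 128-byte lines touched per 32-edge warp."""
    n = (len(src_ids) // 32) * 32                # drop any ragged tail
    lines = np.asarray(src_ids[:n]).reshape(-1, 32) // LINE_FLOATS
    return float(np.mean([len(np.unique(w)) for w in lines]))

N = 1 << 20                                      # activation array of 1M floats
coalesced = np.arange(1024)                      # consecutive source indices
scattered = np.random.default_rng(0).integers(0, N, 1024)

print(cache_lines_per_warp(coalesced))           # 1.0: the theoretical minimum
print(cache_lines_per_warp(scattered))           # close to 32: every lane misses
```

Consecutive indices land on one line per warp; uniformly random indices over a large array make nearly every lane fetch its own line.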

2.2 Vector CSR Dispatch

Our engine stores connectivity in Compressed Sparse Row (CSR) format with two representations: by-target (forward pass) and by-source (backward pass). The forward dispatch uses a Vector CSR strategy: for each target node, a group of 32 SIMD threads cooperatively strides over the fan-in edges, each reading activations[src_ids[base + lane]] and reducing via simd_sum(). This is the standard approach from Bell & Garland (2008) adapted for Apple Metal SIMD groups.

A hybrid scalar/vector threshold at 32 edges separates light nodes (processed by a single thread iterating sequentially) from heavy nodes (processed by a full SIMD group). Nodes within each category are sorted by index within their layer, which preserves chronological memory order. This sorting is a consequence of the dispatch layout, not an optimization for cache locality.
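A CPU reference model of this hybrid dispatch (illustrative Python, not the Metal kernel; function and variable names are our own) makes the two paths concrete:

```python
import numpy as np

VECTOR_THRESHOLD = 32  # fan-in at or above this gets a full 32-lane SIMD group

def spmv_vector_csr(row_ptr, src_ids, weights, activations):
    """CPU reference model of the hybrid scalar/vector CSR forward pass."""
    out = np.zeros(len(row_ptr) - 1, dtype=np.float32)
    for t in range(len(out)):
        lo, hi = row_ptr[t], row_ptr[t + 1]
        if hi - lo < VECTOR_THRESHOLD:
            # Scalar path: one thread walks the short edge list sequentially.
            out[t] = np.dot(activations[src_ids[lo:hi]], weights[lo:hi])
        else:
            # Vector path: 32 lanes stride the edge list; each lane accumulates
            # a partial sum, then a cross-lane reduction (simd_sum) combines them.
            partial = np.zeros(32, dtype=np.float32)
            for base in range(lo, hi, 32):
                end = min(base + 32, hi)
                partial[: end - base] += (
                    activations[src_ids[base:end]] * weights[base:end]
                )
            out[t] = partial.sum()
    return out
```

Both paths compute the same per-target dot product; what differs is how lanes are assigned to edges, which is what the divergence metrics in Section 4.2 measure.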

2.3 Apple Silicon Memory Hierarchy

The M-series system-on-chip exposes a unified memory pool with peak bandwidth of approximately 200 GB/s (M1 Pro) to 800 GB/s (M4 Ultra). The GPU L1 data cache is 8 KB per SIMD group with 128-byte lines. There is no programmer-managed shared memory (unlike CUDA's __shared__); Metal's threadgroup memory is allocated from the same hardware and is modest in size. This makes L1 cache behavior the dominant factor for sparse gather performance.

2.4 The NP-Hardness of Optimal Reordering

Optimal sparse matrix reordering for cache efficiency is NP-hard (Pinar & Heath, 1999). Even approximate solutions — RCM, nested dissection, space-filling curves — have $O(|E| \log |E|)$ cost and must be re-applied whenever the topology changes. For a dynamically growing network that adds edges every $N$ steps, periodic reordering introduces overhead proportional to the growth rate. Our result shows this overhead is unnecessary: the growth process itself produces a layout that approximates optimal reordering.

3. Temporal–Spatial Homomorphism

3.1 The Allocation Invariant

Let $v_i$ denote the $i$-th neuron created during training, and let $\text{addr}(v_i)$ denote its position in the flat activation array. Under chronological allocation:

$i < j \implies \text{addr}(v_i) < \text{addr}(v_j)$

This is a trivial invariant of append-only allocation. The non-trivial observation is its interaction with edge formation. In the growth policy, new nodes preferentially wire to recently created nodes because the mutation targets regions of high local gradient signal, which tends to concentrate in the most recently grown parts of the network. If edge $(v_s, v_t)$ is created at time $\tau$, then with high probability $s$ and $t$ are both close to $\tau$ in creation order, and therefore close in memory address.
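The growth policy itself is out of scope, but its locality effect can be sketched using the bias parameters quoted for the reproducibility artifact (90% of edges within an 80-node temporal window). Here sample_source is a hypothetical stand-in for the real gradient-driven policy:

```python
import numpy as np

rng = np.random.default_rng(0)
WINDOW, P_RECENT = 80, 0.9   # parameters from the reproducibility artifact

def sample_source(n_nodes):
    """Hypothetical stand-in for the growth policy: pick a source for an edge
    into the newest node, biased toward recently created nodes."""
    if n_nodes <= WINDOW or rng.random() < P_RECENT:
        lo = max(0, n_nodes - WINDOW)
        return int(rng.integers(lo, n_nodes))      # recent: inside the window
    return int(rng.integers(0, n_nodes - WINDOW))  # occasional long-range edge

# Creation-order distance of 10k sampled edges into node 4000
dists = [4000 - sample_source(4000) for _ in range(10_000)]
print(np.median(dists))   # well under 4000: most edges stay temporally local
```

Because creation order equals memory order under the allocation invariant, a small creation-order distance is directly a small address distance.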

3.2 Sub-DAG Flattening

The network is organized as a collection of independent sub-DAGs (one per context position in the sequence model). Each sub-DAG is grown independently and flattened into a contiguous region of the global activation array:

$\text{activations} = [\underbrace{S_0}_{n_0} \mid \underbrace{S_1}_{n_1} \mid \cdots \mid \underbrace{S_{K-1}}_{n_{K-1}}]$

where $S_k$ is the activation block for sub-DAG $k$ and $n_k$ is its node count. Because each sub-DAG's edges are internal (source and target both belong to the same sub-DAG), the source indices for any target node fall within the contiguous $[n_k]$ range of its sub-DAG. Combined with temporal-bias attachment, this produces two layers of locality: intra-sub-DAG contiguity (structural) and intra-temporal-window clustering (statistical).
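The flattening step can be sketched as follows; flatten_subdags is an illustrative helper, not the engine's API. Local edge indices are rebased by each block's offset, so every source index stays inside its sub-DAG's contiguous range:

```python
import numpy as np

def flatten_subdags(subdag_edges, subdag_sizes):
    """Concatenate per-sub-DAG activation blocks and rebase local edge indices.

    subdag_edges[k] is a list of (src, tgt) pairs with local indices in [0, n_k).
    Returns global (src, tgt) pairs and per-sub-DAG offsets into the flat array.
    """
    offsets = np.concatenate(([0], np.cumsum(subdag_sizes)))  # block start addresses
    global_edges = []
    for k, edges in enumerate(subdag_edges):
        base = offsets[k]
        # All of sub-DAG k's sources land in [offsets[k], offsets[k+1]):
        global_edges.extend((int(base + s), int(base + t)) for s, t in edges)
    return global_edges, offsets

edges, offsets = flatten_subdags(
    [[(0, 1), (1, 2)], [(0, 2)]],   # two toy sub-DAGs with local indices
    [3, 3],
)
print(edges)    # [(0, 1), (1, 2), (3, 5)]
print(offsets)  # [0 3 6]
```

The structural guarantee is visible in the output: the second sub-DAG's edge (0, 2) becomes (3, 5), confined to its own block [3, 6).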

3.3 Why This Differs From Static Sparse Networks

In pruning-based sparse networks (e.g., lottery ticket, magnitude pruning), the topology is a subset of a pre-existing dense layout. The surviving edges are distributed according to the dense layer's geometry, which provides no particular cache-locality guarantee when stored in CSR format. The index array for a pruned row may reference activations scattered across the full width of the original dense layer.

In contrast, a grown network's index array references activations that were allocated near each other in time, and therefore near each other in memory. This is the temporal–spatial homomorphism: chronological proximity implies memory proximity, and the growth dynamics ensure topological proximity correlates with chronological proximity.

4. Empirical Profiling

We profile a production sparse training engine executing a continuous-learning autoregressive character model on WikiText-103. The engine uses the Vector CSR dispatch described in Section 2.2 with a hybrid scalar/vector threshold at 32 edges. All measurements are from Apple Silicon (M1 Pro, 16 GB unified memory) without any cache-optimization preprocessing.

4.1 Cache-Line Utilization During Growth

Table 1 shows cache metrics at seven snapshots during network growth. The metric cache_lines/32edges measures the average number of 128-byte cache lines touched per 32-edge SIMD warp during the forward Vector CSR dispatch.

Table 1. Cache-line utilization during organic growth

| Edges   | Lines/32 | Util % | Useful   | Fetched  | Ampl |
|---------|----------|--------|----------|----------|------|
| 19,306  | 1.0      | 77.0%  | 75 KB    | 98 KB    | 1.3x |
| 44,268  | 1.0      | 84.7%  | 172 KB   | 203 KB   | 1.2x |
| 77,058  | 1.0      | 88.6%  | 302 KB   | 341 KB   | 1.1x |
| 119,498 | 1.0      | 85.8%  | 466 KB   | 543 KB   | 1.2x |
| 197,278 | 1.0      | 85.3%  | 770 KB   | 903 KB   | 1.2x |
| 256,134 | 1.1      | 85.0%  | 1,000 KB | 1,176 KB | 1.2x |
| 303,688 | 1.1      | 86.2%  | 1,186 KB | 1,376 KB | 1.1x |

Data from production engine on M1 Pro. No reordering, partitioning, or preprocessing applied.

The cache_lines/32 metric remains at 1.0–1.1 across the entire growth trajectory, within rounding distance of the theoretical minimum. Utilization stabilizes between 77% and 89%, with amplification (fetched / useful bytes) never exceeding 1.3x. These numbers are comparable to what explicit RCM or METIS reordering would produce on a static graph of the same density — but achieved at zero cost.

4.2 Extended Growth Trajectory: SIMD Divergence

Table 2 tracks SIMD warp utilization across a longer growth trajectory from 19K to 1.66M edges. As the network grows and fan-in distributions become more skewed, the worst-case SIMD divergence ratio increases, but effective bandwidth utilization remains stable.

Table 2. SIMD divergence during extended growth (19K → 1.66M edges)

| Edges     | Max Fan-in | Avg Fan-in | SIMD Worst | BW (GB/s) | MB/step |
|-----------|------------|------------|------------|-----------|---------|
| 19,306    | 196        | 98.0       | 2.0x       | 0.15      | 2.0     |
| 77,058    | 438        | 187.7      | 2.3x       | 0.22      | 7.3     |
| 256,134   | 782        | 330.2      | 2.4x       | 0.32      | 23.1    |
| 581,040   | 1,156      | 432.8      | 2.7x       | 0.46      | 50.8    |
| 1,024,576 | 1,678      | 548.2      | 3.1x       | 0.58      | 88.4    |
| 1,658,326 | 2,156      | 668.4      | 5.7x       | 0.75      | 132.7   |

20-minute training run on M1 Pro. SIMD Worst = max(fan_in) / avg(fan_in) across a layer.

The worst-case SIMD divergence grows from 2.0x to 5.7x as the power-law degree distribution develops heavier tails during growth. This is a known property of preferential-attachment networks (Barabási & Albert, 1999). However, the cache-line metrics remain stable because divergence affects warp utilization (how many threads in a 32-thread group do useful work) but not memory coalescing (how many cache lines are touched). The organic locality effect holds independently of degree distribution skew.
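The SIMD Worst column of Table 2 is straightforward to reproduce from CSR row extents. In this sketch the Zipf fan-in distribution is an illustrative stand-in for the power-law degrees that develop during growth:

```python
import numpy as np

def simd_worst(row_ptr):
    """Worst-case SIMD divergence ratio, as in Table 2: max(fan_in) / avg(fan_in)."""
    fan_in = np.diff(row_ptr)
    return float(fan_in.max() / fan_in.mean())

# Heavy-tailed synthetic fan-ins: 1000 rows with Zipf-distributed degrees
rng = np.random.default_rng(0)
fan_in = np.clip(rng.zipf(2.0, size=1000), 1, 2048)
row_ptr = np.concatenate(([0], np.cumsum(fan_in)))
print(round(simd_worst(row_ptr), 1))   # well above 1.0 for a heavy-tailed layer
```

A uniform fan-in distribution gives exactly 1.0; the heavier the tail, the larger the ratio, mirroring the 2.0x → 5.7x progression in Table 2.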

5. Discussion: The Arithmetic Intensity Wall

Even with near-perfect cache coalescing, the sparse forward pass is irreducibly memory-bound. Each edge performs two floating-point operations (multiply and add) but must read at minimum 8 bytes (4-byte source activation + 4-byte weight). The arithmetic intensity is:

$I = \frac{2 \text{ FLOPs}}{8 \text{ bytes}} = 0.25 \text{ FLOPs/byte (theoretical)}$

In practice, the CSR index array adds 4 bytes per edge (the source ID), and the target activation write adds amortized bytes per edge. The measured intensity is approximately:

$I_{\text{measured}} \approx 0.025 \text{ FLOPs/byte}$

This is roughly 100x below the Apple M1 Pro's compute-memory balance point (~5 FLOPs/byte for FP32). The operation is as far from compute-bound as a GPU workload can be.
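The bandwidth-bound ceiling follows directly from these numbers. A back-of-envelope check, assuming the ~200 GB/s M1 Pro bandwidth figure from Section 2.3:

```python
# Roofline arithmetic from the intensities above
flops_per_edge = 2                     # one multiply + one add per edge
bytes_min_per_edge = 8                 # 4 B source activation + 4 B weight

i_theoretical = flops_per_edge / bytes_min_per_edge
i_measured = 0.025                     # reported measured intensity
bandwidth = 200e9                      # ~M1 Pro peak, bytes/s

print(i_theoretical)                   # 0.25 FLOPs/byte
print(bandwidth * i_measured / 1e9)    # 5.0 GFLOP/s: bandwidth-bound ceiling
```

At 0.025 FLOPs/byte, saturating 200 GB/s yields only ~5 GFLOP/s of useful arithmetic, orders of magnitude below the chip's FP32 compute peak, which is what "irreducibly memory-bound" means in practice.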

Organic cache locality does not change this fundamental bound. What it does is ensure that the memory bandwidth consumed is not further amplified by poor coalescing. Without organic locality, the amplification factor would be 4–8x (touching 4–8 cache lines per warp instead of 1), making an already memory-bound operation 4–8x slower for the same edge count. The organic layout keeps the amplification at 1.1–1.3x, which means the engine operates at the memory bandwidth floor rather than a multiple above it.

This distinction matters for scaling: as the network grows from 100K to 1M+ edges, the training step time scales approximately linearly with edge count (as expected for a bandwidth-limited operation). Without organic locality, the scaling would be superlinear due to cache-miss amplification growing with the activation array size.

5.1 Comparison With Explicit Reordering

We do not compare against RCM or METIS because the relevant comparison is not "reordered vs. organic" (both achieve similar utilization) but "reordered vs. random" and "organic vs. random." The organic layout achieves the same end state as explicit reordering without the preprocessing cost. For a dynamically growing network that mutates topology every $N$ steps, the cost of periodic reordering would be $O(|E| \log |E|)$ per mutation — a non-trivial overhead that organic locality eliminates entirely.

6. Related Work

Sparse matrix formats. Bell & Garland (2008) introduced the Vector CSR dispatch strategy for SpMV on GPU. Greathouse & Daga (2014) proposed CSR-Adaptive with dynamic row-length detection. Liu & Vinter (2015) developed CSR5 for cross-platform load balancing. Our Vector CSR dispatch follows Bell & Garland's design with a hybrid scalar/vector threshold.

Graph reordering. Pinar & Heath (1999) proved NP-hardness of optimal cache-oblivious matrix reordering. Reverse Cuthill–McKee (Cuthill & McKee, 1969) minimizes matrix bandwidth via BFS ordering. METIS (Karypis & Kumar, 1998) provides multilevel graph partitioning. These methods assume static topologies and incur $O(|E| \log |E|)$ preprocessing cost.

Dynamic sparse networks. Mocanu et al. (2018) introduced Sparse Evolutionary Training (SET), which prunes and regrows edges during training but maintains a fixed neuron count. Evci et al. (2020) proposed RigL, which uses gradient information for edge regrowth. Ren et al. (2024) surveyed Neural Architecture Search methods. Our approach differs in that both neurons and edges are added during training (Network Morphism), and the layout benefit is a side effect of the growth process rather than an optimization objective.

Preferential attachment. Barabási & Albert (1999) showed that growth with preferential attachment produces power-law degree distributions. Our temporal-bias attachment shares the preferential-attachment flavor but the preference is for temporal recency rather than degree, reflecting the gradient-driven growth heuristic.

7. Conclusion

We have shown that dynamically grown sparse neural networks achieve near-optimal cache-line utilization (1.0–1.1 lines per 32-edge warp, 77–89% utilization) as a free byproduct of chronological neuron allocation. This organic cache locality emerges from the temporal–spatial homomorphism inherent in growth-based architectures: nodes created together are stored together, and the growth dynamics ensure they are wired together.

The practical consequence is that a dynamically growing sparse training engine can skip the entire graph-reordering pipeline — no RCM, no METIS, no periodic re-indexing — and still operate at the memory bandwidth floor for SpMV dispatch. For architectures that grow during training, cache-friendly memory layout is not a problem to be solved. It is a property that emerges for free.

Reproducibility Artifact

We provide a self-contained Python script that generates a synthetic sparse DAG with temporal-bias attachment and computes the cache-line utilization metrics reported in this paper. The generator models the memory layout of a production training engine: each sub-DAG is grown independently, then flattened into a single contiguous activation array for GPU dispatch. New nodes are appended chronologically within each sub-DAG. Edges preferentially connect to recently created nodes (90% within a temporal window of 80 nodes, 10% uniform over older nodes).

The script compares three conditions:

  1. Temporal attachment (90% recent) — models organic growth
  2. Random attachment (0% bias) — uniform random edge wiring
  3. Shuffled indices (worst case) — random permutation of all node IDs

Download: temporal_dag_generator.py

Requirements: Python 3.10+, NumPy. Plotting requires Matplotlib. Run: python temporal_dag_generator.py --nodes 4000
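For readers who want the gist without downloading the artifact, here is a compressed sketch of the three-condition comparison (not the shipped script; parameters follow the description above, and edges are pre-sorted by target to model the dispatch order of Section 2.2):

```python
import numpy as np

rng = np.random.default_rng(0)
N, E, WINDOW, P_RECENT = 4000, 20_000, 80, 0.9

def lines_per_warp(src_ids):
    """Distinct 128-byte lines per 32-edge warp (32 fp32 per line)."""
    warps = src_ids[: (len(src_ids) // 32) * 32].reshape(-1, 32) // 32
    return float(np.mean([len(np.unique(w)) for w in warps]))

targets = np.sort(rng.integers(WINDOW, N, E))          # edges grouped by target
recent = rng.random(E) < P_RECENT
temporal = np.where(recent,
                    targets - rng.integers(1, WINDOW + 1, E),  # inside the window
                    rng.integers(0, N - WINDOW, E))            # uniform over older
random_src = rng.integers(0, N, E)                     # condition 2
shuffled = rng.permutation(N)[temporal]                # condition 3: relabel IDs

for name, src in [("temporal", temporal), ("random", random_src), ("shuffled", shuffled)]:
    print(f"{name:9s} {lines_per_warp(src):.1f}")
```

Temporal attachment keeps each warp's sources inside a narrow address band; random wiring or shuffled node IDs scatter them, multiplying the lines touched.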

References

Barabási, A.-L. & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.

Bell, N. & Garland, M. (2008). Efficient sparse matrix–vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004.

Cuthill, E. & McKee, J. (1969). Reducing the bandwidth of sparse symmetric matrices. Proc. 24th National Conference ACM, 157–172.

Evci, U., Gale, T., Menick, J., Castro, P. S. & Elsen, E. (2020). Rigging the lottery: Making all tickets winners. ICML 2020.

Greathouse, J. L. & Daga, M. (2014). Efficient sparse matrix–vector multiplication on GPUs using the CSR storage format. SC14.

Karypis, G. & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 359–392.

Liu, W. & Vinter, B. (2015). CSR5: An efficient storage format for cross-platform sparse matrix–vector multiplication. ICS 2015.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M. & Liotta, A. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1), 2383.

Pinar, A. & Heath, M. T. (1999). Improving performance of sparse matrix–vector multiplication. SC99.

Ren, P., Xiao, Y., Chang, X., Huang, P., Li, Z., Gupta, B. B. ... & Wang, X. (2024). A comprehensive survey of neural architecture search: Challenges and solutions. ACM Computing Surveys.
