
Why Atomic-Free Sparse Backward Passes Are Slower on Metal

Eliminating atomic operations from a sparse backward pass sounds like a clear win. On Apple Metal, it's a 2.8x slowdown. Here's what we found and why.

April 10, 2026 · Samir Awuapara · GPU Performance

Key Finding

2.8x slower backward pass without atomics: 174.9 ms (atomic-free) vs 62.7 ms (fused + atomics).

The Idea

When training a sparse neural network on GPU, the backward pass computes weight gradients by iterating over edges. With a by-source compressed sparse row (CSR) layout and (source, batch) parallelism, multiple threads accumulate gradients into the same weight locations. The standard solution is atomic_fetch_add.
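
A minimal sketch in Metal Shading Language makes the collision concrete. The kernel name, buffer layout, and bindings here are illustrative, not the production kernel; it assumes device-memory float atomics (atomic_float, MSL 2.4+ on Apple silicon):

    #include <metal_stdlib>
    using namespace metal;

    // Illustrative fused backward kernel: one thread per (source, batch) pair.
    // Activations are laid out [batch][neuron]; grad_w has one slot per edge.
    kernel void backward_fused(
        device const uint   *row_ptr  [[buffer(0)]], // CSR: edges of s are row_ptr[s]..row_ptr[s+1]
        device const uint   *col_idx  [[buffer(1)]], // destination neuron per edge
        device const float  *weights  [[buffer(2)]],
        device const float  *act_in   [[buffer(3)]], // [batch][n_in]
        device const float  *grad_out [[buffer(4)]], // [batch][n_out]
        device atomic_float *grad_w   [[buffer(5)]], // [n_edges], shared across batch
        device float        *grad_in  [[buffer(6)]], // [batch][n_in], exclusive per thread
        constant uint       &n_in     [[buffer(7)]],
        constant uint       &n_out    [[buffer(8)]],
        uint2 gid [[thread_position_in_grid]])       // gid.x = source s, gid.y = batch b
    {
        const uint  s = gid.x, b = gid.y;
        const float x = act_in[b * n_in + s];
        float din = 0.0f;
        for (uint e = row_ptr[s]; e < row_ptr[s + 1]; ++e) {
            const uint  d  = col_idx[e];
            const float gy = grad_out[b * n_out + d];
            // Every batch index at this source lands on the same grad_w[e]:
            // this is the write collision that forces the atomic.
            atomic_fetch_add_explicit(&grad_w[e], x * gy, memory_order_relaxed);
            din += weights[e] * gy; // input gradient is exclusive: plain accumulate
        }
        grad_in[b * n_in + s] = din;
    }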

The atomic-free alternative restructures the backward pass into two passes with inverted loop nesting:

  • Pass 1 (edge-outer, batch-inner): Each thread owns a unique edge and loops over the batch dimension. No two threads write to the same weight gradient. Zero atomics.
  • Pass 2: Compute input gradients with a similar exclusive-ownership decomposition.

This eliminates 100% of atomic_fetch_add calls in the backward pass. On paper, it should be faster.
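
Sketched in the same illustrative style, Pass 1 looks like this; the per-edge source array edge_src (a COO-style companion to the CSR arrays) is an assumption of the sketch:

    // Illustrative Pass 1 of the atomic-free variant: one thread per edge,
    // batch in the inner loop. edge_src is an assumed per-edge source index.
    kernel void backward_weights_atomic_free(
        device const uint  *edge_src [[buffer(0)]], // source neuron per edge
        device const uint  *col_idx  [[buffer(1)]], // destination neuron per edge
        device const float *act_in   [[buffer(2)]], // [batch][n_in]
        device const float *grad_out [[buffer(3)]], // [batch][n_out]
        device float       *grad_w   [[buffer(4)]], // [n_edges], one writer each
        constant uint      &n_in     [[buffer(5)]],
        constant uint      &n_out    [[buffer(6)]],
        constant uint      &batch    [[buffer(7)]],
        uint e [[thread_position_in_grid]])         // one thread per edge
    {
        const uint s = edge_src[e];
        const uint d = col_idx[e];
        float acc = 0.0f;
        for (uint b = 0; b < batch; ++b) {
            // Batch-strided reads: consecutive iterations are n_in (or n_out)
            // floats apart, so each load tends to pull a fresh cache line.
            acc += act_in[b * n_in + s] * grad_out[b * n_out + d];
        }
        grad_w[e] = acc; // exclusive ownership: no atomic needed
    }

Pass 2 would compute grad_in much as the fused kernel already does, each (neuron, batch) thread owning its output slot, so it needs no atomics either. Note the inner loop's read pattern, though; it returns below.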

The Reality

We implemented both approaches in Apple Metal compute shaders and ran identical training configurations on an M1 Pro (16 GB unified memory). The results were unambiguous.

Metric            Fused + Atomics   Atomic-Free (2-pass)
Run duration      10 min            20 min
Steps completed   13,468            9,206
Best val BPC      3.02              3.23
Final step time   121.1 ms          200.9 ms
Final bwd time    62.7 ms           174.9 ms
Final fwd time    28.9 ms           13.6 ms

Hardware: Apple M1 Pro, 16 GB unified memory. Dataset: WikiText-103 char-level.

The fused kernel with atomics completed 13,468 steps in 10 minutes. The atomic-free kernel managed only 9,206 steps in 20 minutes — roughly 3x fewer steps per second. The backward pass itself was 2.8x slower (174.9 ms vs 62.7 ms at equivalent network sizes).

The BPC gap is a direct consequence: with half the wall time and 46% more steps, the fused kernel reaches 3.02 BPC while the atomic-free approach stalls at 3.23.

Backward Pass Scaling

The gap widens as the network grows. At matched edge counts, the atomic-free backward is consistently 7.5-10.4x slower:

Approx. Edges   Fused bwd (ms)   Atomic-Free bwd (ms)   Ratio
~22K            10.8             81.0                   7.5x
~30K            12.6             100.0                  7.9x
~39K            12.8             117.3                  9.2x
~42K            12.0             124.6                  10.4x
~61K            17.3             159.5                  9.2x

Fused times from revert-fused-10min run; atomic-free times from zero-atomic-20min run. Compared at similar topology sizes.

Why It's Slower

The atomic-free approach trades atomic write contention for degraded memory access patterns. Three factors compound on Metal:

1. Cache thrashing from batch-strided reads. The edge-outer/batch-inner loop means each thread reads activations at batch-strided offsets. On Metal's unified memory architecture, this scatters reads across the activation buffer rather than reading contiguous batch elements. The GPU cache, optimized for coalesced access patterns, thrashes badly.
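
To put illustrative numbers on it (sizes assumed for the example, not taken from the runs above): with batch = 32 and n_in = 4,096, the two loop orders generate very different address streams:

    // Fused, one SIMD-group (32 lanes, same b, consecutive s):
    //   act_in[b*4096 + s], act_in[b*4096 + s+1], ..., act_in[b*4096 + s+31]
    //   -> 128 contiguous bytes, one or two cache lines for the whole group.
    //
    // Edge-owner, one thread's inner loop over b:
    //   act_in[0*4096 + s], act_in[1*4096 + s], ..., act_in[31*4096 + s]
    //   -> consecutive loads 16 KB apart, spanning 512 KB and touching
    //      32 distinct cache lines for the same 32 values.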

2. Two dispatches instead of one. The fused kernel computes weight gradients and input gradients in a single dispatch. The atomic-free approach requires two separate compute dispatches with a pipeline barrier between them. On Metal, each dispatch carries fixed overhead: command encoding, GPU scheduling, and barrier synchronization. For small-to-medium kernels, this overhead is significant relative to compute time.
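
At the encoder level the difference is one dispatch versus two plus a barrier. A host-side sketch, assuming Apple's metal-cpp wrapper (pipeline setup and buffer bindings omitted; names illustrative):

    #include <Metal/Metal.hpp>

    // Encoder-level difference between the two variants. Pipeline states,
    // grids, and threadgroup size are created elsewhere.
    void encode_backward(MTL::ComputeCommandEncoder *enc,
                         MTL::ComputePipelineState *fused,
                         MTL::ComputePipelineState *pass1Weights,
                         MTL::ComputePipelineState *pass2Inputs,
                         MTL::Size neuronGrid, MTL::Size edgeGrid,
                         MTL::Size tg, bool atomicFree)
    {
        if (!atomicFree) {
            // Fused: weight and input gradients from a single dispatch.
            enc->setComputePipelineState(fused);
            enc->dispatchThreads(neuronGrid, tg);
        } else {
            // Atomic-free: two dispatches with a buffer barrier so the
            // second pass orders after the first. Each dispatch pays
            // encode/schedule overhead, and the barrier serializes the passes.
            enc->setComputePipelineState(pass1Weights);
            enc->dispatchThreads(edgeGrid, tg);
            enc->memoryBarrier(MTL::BarrierScopeBuffers);
            enc->setComputePipelineState(pass2Inputs);
            enc->dispatchThreads(neuronGrid, tg);
        }
    }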

3. Metal atomics are fast. Apple Silicon's unified memory means atomic_fetch_add_explicit on device memory does not require cross-bus synchronization. The atomic is hitting unified DRAM that both CPU and GPU share. In practice, atomic contention on sparse weight gradients is low because the fan-in per neuron is modest, so most atomics complete without stalling.

Takeaway

The conventional wisdom from CUDA — that atomics are expensive and should be eliminated — does not transfer cleanly to Apple Metal. On unified memory architectures with low atomic contention, the cost of atomics is negligible compared to the cost of destroying memory coalescing.

For sparse ML training on Apple Silicon: keep the fused kernel, keep the atomics. The memory access pattern matters more than atomic-free purity.

This finding is specific to Metal on Apple Silicon. CUDA on discrete GPUs with separate VRAM and higher atomic latency may show different tradeoffs. We haven't benchmarked the equivalent experiment on CUDA yet.

Tags: Apple Metal · GPU Performance · Sparse Training · Compute Shaders