The Case of the 0.3% Slower Node

A customer opened a ticket with a frustrating symptom: some of their training runs were about 0.3% slower than others, with no obvious pattern. Same code, same dataset, same GPU count. Sometimes fast, sometimes very slightly slow.

A third of a percent sounds like noise. At the scale these jobs run, it is real money. And "intermittent and tiny" is exactly the kind of bug that hides for months.

Reproducing the unreproducible

The first job was to make it happen on demand. We pinned the customer's job shape to a fixed set of nodes and ran a stripped-down all-reduce benchmark in a loop. The slowdown followed a specific node — whenever that node was in the placement, the job was slow.

# Bisect: run the benchmark across rotating subsets, hold one node constant.
for trial in $(seq 1 50); do
  nodes=$(pick_nodes --count 8 --include node-1147)
  bench_allreduce --nodes "$nodes" --size 256MB --iters 200 \
    | jq '{trial: '"$trial"', busbw_gbps: .busbw}'
done

The bandwidth histogram was bimodal: most trials clustered tightly, a minority sat a hair lower. The low cluster always contained node-1147.

Following the bandwidth down

Within the node, NCCL prefers NVLink for intra-node communication. We dumped the per-lane link state and found one lane on one GPU running at a reduced width — the hardware had silently downtrained it after correctable errors, exactly as designed. The link still worked. It just had less bandwidth, and NCCL routed around it in a way that added a hop.

The kicker: every health check we had was pass/fail. The link was up, so every dashboard was green. Nothing was watching link width.

The fix, and the better fix

The immediate fix was to drain node-1147, reseat the affected component, and confirm the lane retrained to full width. The job's bandwidth returned to the fast cluster.

The durable fix was to stop treating links as binary. We now export link width and lane error counters per GPU and alert when a node's aggregate NVLink bandwidth drops below spec, before anyone files a ticket:

ALERT  NVLinkDegraded
  node-1147  gpu3  link2  width=x8 (expected x16)  corr_errors=4213
  effective intra-node bw 11% below spec — node cordoned for service

Takeaways

"Up" is not a performance guarantee. For hardware that degrades gracefully, the interesting signal is the gradient between working and optimal, not the binary.
Small and persistent beats large and rare. A 0.3% tax that applies to every step of every run on a node adds up faster than an occasional crash.
Make the invisible visible. The bug existed for as long as it took us to start graphing the right number. Most of debugging at this scale is deciding what to measure.