The Case of the 0.3% Slower Node
A single node was quietly dragging down multi-node training jobs. The root cause was one degraded NVLink lane and a metric nobody was watching.
A customer opened a ticket with a frustrating symptom: some of their training runs were about 0.3% slower than others, with no obvious pattern. Same code, same dataset, same GPU count. Sometimes fast, sometimes very slightly slow.
A third of a percent sounds like noise. At the scale these jobs run, it is real money. And "intermittent and tiny" is exactly the kind of bug that hides for months.
Reproducing the unreproducible
The first job was to make it happen on demand. We pinned the customer's job shape to a fixed set of nodes and ran a stripped-down all-reduce benchmark in a loop. The slowdown followed a specific node — whenever that node was in the placement, the job was slow.
# Bisect: run the benchmark across rotating subsets, hold one node constant.
for trial in $(seq 1 50); do
nodes=$(pick_nodes --count 8 --include node-1147)
bench_allreduce --nodes "$nodes" --size 256MB --iters 200 \
| jq '{trial: '"$trial"', busbw_gbps: .busbw}'
done
The bandwidth histogram was bimodal: most trials clustered tightly, a minority sat a hair
lower. The low cluster always contained node-1147.
Following the bandwidth down
Within the node, NCCL prefers NVLink for intra-node communication. We dumped the per-lane link state and found one lane on one GPU running at a reduced width — the hardware had silently downtrained it after correctable errors, exactly as designed. The link still worked. It just had less bandwidth, and NCCL routed around it in a way that added a hop.
The kicker: every health check we had was pass/fail. The link was up, so every dashboard was green. Nothing was watching link width.
The fix, and the better fix
The immediate fix was to drain node-1147, reseat the affected component, and confirm the
lane retrained to full width. The job's bandwidth returned to the fast cluster.
The durable fix was to stop treating links as binary. We now export link width and lane error counters per GPU and alert when a node's aggregate NVLink bandwidth drops below spec, before anyone files a ticket:
ALERT NVLinkDegraded
node-1147 gpu3 link2 width=x8 (expected x16) corr_errors=4213
effective intra-node bw 11% below spec — node cordoned for service
Takeaways
- "Up" is not a performance guarantee. For hardware that degrades gracefully, the interesting signal is the gradient between working and optimal, not the binary.
- Small and persistent beats large and rare. A 0.3% tax that applies to every step of every run on a node adds up faster than an occasional crash.
- Make the invisible visible. The bug existed for as long as it took us to start graphing the right number. Most of debugging at this scale is deciding what to measure.