The Case of the 0.3% Slower Node
A single node was quietly dragging down multi-node training jobs. The root cause was one degraded NVLink lane and a metric nobody was watching.