Placing Jobs on 10,000 GPUs Without Stalling the Fleet
How we think about topology-aware scheduling when a single bad placement can cost a training run hours of throughput.
A single node was quietly dragging down multi-node training jobs. The root cause was one degraded NVLink lane and a metric nobody was watching.