Building the infrastructure for AI at scale.

June 18, 2026 · 3 min read
Placing Jobs on 10,000 GPUs Without Stalling the Fleet
How we think about topology-aware scheduling when a single bad placement can cost a training run hours of throughput.
By Priya Nair and Marcus Feld
scheduling, performance, infrastructure
June 10, 2026 · 2 min read
The Case of the 0.3% Slower Node
A single node was quietly dragging down multi-node training jobs. The root cause was one degraded NVLink lane and a metric nobody was watching.
By Sasha Romanov
debugging, hardware, networking

Placing Jobs on 10,000 GPUs Without Stalling the Fleet