
Placing Jobs on 10,000 GPUs Without Stalling the Fleet

How we think about topology-aware scheduling when a single bad placement can cost a training run hours of throughput.

A training job that asks for 512 GPUs does not just want 512 free GPUs. It wants 512 GPUs that can talk to each other quickly. Put them in the wrong place and the collective operations that synchronize gradients every step will spend more time waiting on the network than computing. The job still runs. It just runs slowly, and slow at this scale is expensive.

This post is about how our scheduler reasons about that, and a few things that surprised us along the way.

The cost of a bad placement

Consider all-reduce, the operation that averages gradients across workers. Its latency is bounded by the slowest link in the group. If most of your workers share a rail-optimized fabric but two of them sit across an oversubscribed boundary, the whole group runs at the speed of those two.
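To make the slowest-link effect concrete, here is a minimal sketch using the standard bandwidth-only model of a ring all-reduce, in which each worker moves 2(N-1)/N times the payload over its link. The helper name `ring_allreduce_seconds`, the link speeds, and the payload size are all assumptions made for illustration, not measurements from our fleet, and the model ignores per-hop latency.

```python
def ring_allreduce_seconds(payload_bytes: float, link_gbps: list[float]) -> float:
    """Bandwidth-only estimate for a ring all-reduce (ignores latency).

    link_gbps[i] is the bandwidth, in gigabits per second, of the link
    from worker i to the next worker in the ring.
    """
    n = len(link_gbps)
    if n < 2:
        return 0.0
    # Every step of the ring waits for the slowest link to finish.
    slowest_bytes_per_s = min(link_gbps) * 1e9 / 8
    bytes_per_worker = 2 * (n - 1) / n * payload_bytes
    return bytes_per_worker / slowest_bytes_per_s


payload = 1e9                              # 1 GB of gradients (made-up)
uniform = [400.0] * 8                      # every link at 400 Gb/s (made-up)
two_slow = [400.0] * 6 + [100.0] * 2       # two links cross a slower boundary

t_fast = ring_allreduce_seconds(payload, uniform)
t_slow = ring_allreduce_seconds(payload, two_slow)
print(f"slowdown: {t_slow / t_fast:.1f}x")
```

In this toy model, six of the eight links are untouched, yet the collective slows down by the full ratio between the fast and slow links (about 4x here).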

We measured the gap on a representative 256-GPU job:

| Placement strategy        | Step time | Throughput vs. ideal |
|---------------------------|-----------|----------------------|
| Topology-aware (one pod)  | 412 ms    | 100%                 |
| First-fit (any free GPUs) | 631 ms    | 65%                  |
| Worst case (split pods)   | 905 ms    | 46%                  |

On this job, a naive first-fit placement left about 35% of the achievable throughput on the floor, more than a third, without anything looking "broken."
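The throughput column follows directly from the step times, assuming every step processes the same amount of work, so that throughput is inversely proportional to step time. A short check of that arithmetic, using only the numbers from the table:

```python
# Step times in milliseconds, copied from the table above.
step_ms = {
    "Topology-aware (one pod)": 412,
    "First-fit (any free GPUs)": 631,
    "Worst case (split pods)": 905,
}

ideal_ms = step_ms["Topology-aware (one pod)"]

# Same work per step, so relative throughput = ideal step time / step time.
relative = {name: ideal_ms / ms for name, ms in step_ms.items()}

for name, fraction in relative.items():
    print(f"{name}: {fraction:.0%}")
```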

Modeling the fleet as a tree

We represent the datacenter as a tree of failure-and-bandwidth domains: GPU → host → rack → pod → spine. Each level has a cost to cross. Scheduling a job becomes a search for the cheapest subtree that satisfies the request.

def placement_cost(candidate: list[GPU], topo: Topology) -> int:
    """Lower is better. Penalize every domain boundary the group spans."""
    cost = 0
    for level in topo.levels:           # host, rack, pod, spine
        domains = {topo.domain_of(gpu, level) for gpu in candidate}
        # Spanning more domains at a level costs more the higher up you go.
        cost += (len(domains) - 1) * level.crossing_weight
    return cost

The weights are not guesses. We fit them to measured collective bandwidth between domains, so the scheduler's notion of "expensive" matches the hardware's.
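To see the cost function behave on a concrete case, here is a self-contained toy version. The `GPU`, `Level`, and `Topology` classes are hypothetical stand-ins written for this example, not our production types, and the weights 1, 4, 16, 64 are made-up values rather than the fitted ones. `placement_cost` is repeated from above so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPU:
    # Hypothetical coordinates; a real inventory would carry much more.
    spine: int
    pod: int
    rack: int
    host: int
    slot: int


@dataclass(frozen=True)
class Level:
    name: str
    crossing_weight: int  # illustrative only, not a fitted value


class Topology:
    # Number of leading coordinates that identify a domain at each level.
    _DEPTH = {"spine": 1, "pod": 2, "rack": 3, "host": 4}

    def __init__(self, levels: list[Level]) -> None:
        self.levels = levels

    def domain_of(self, gpu: GPU, level: Level) -> tuple[int, ...]:
        # Qualify each domain by its ancestors so that host 0 in rack 0
        # and host 0 in rack 1 count as different domains.
        path = (gpu.spine, gpu.pod, gpu.rack, gpu.host)
        return path[: self._DEPTH[level.name]]


def placement_cost(candidate: list[GPU], topo: Topology) -> int:
    """Lower is better. Penalize every domain boundary the group spans."""
    cost = 0
    for level in topo.levels:
        domains = {topo.domain_of(gpu, level) for gpu in candidate}
        cost += (len(domains) - 1) * level.crossing_weight
    return cost


topo = Topology([Level("host", 1), Level("rack", 4),
                 Level("pod", 16), Level("spine", 64)])

# Eight GPUs on two hosts in the same rack.
same_rack = [GPU(spine=0, pod=0, rack=0, host=h, slot=s)
             for h in (0, 1) for s in range(4)]
# Eight GPUs split evenly across two pods.
split_pods = [GPU(spine=0, pod=p, rack=0, host=0, slot=s)
              for p in (0, 1) for s in range(4)]

same_rack_cost = placement_cost(same_rack, topo)    # 1
split_pods_cost = placement_cost(split_pods, topo)  # 21
print(same_rack_cost, split_pods_cost)
```

Note that a group spanning two pods necessarily spans two racks and two hosts as well, so the penalties stack (1 + 4 + 16 = 21 in this example). That stacking is a property of the formula as written, and any fitted weights have to account for it.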

What surprised us

  • Perfect packing is the wrong goal. Always choosing the single cheapest subtree fragments the fleet, so the next large job cannot be placed at all. We now bias toward leaving large contiguous regions intact, even when a tighter placement exists.
  • Preemption beats waiting. A short, low-priority job sitting in the middle of an otherwise-ideal pod is worth preempting for a large high-priority run. The math almost always favors it once the large job is big enough.
  • Operators want to see the reasoning. The single most requested feature was not faster scheduling — it was an explanation of why a job landed where it did.
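The preemption trade-off can be sketched with back-of-the-envelope arithmetic. This is a deliberately simplified model, not the policy our scheduler runs: it compares the GPU-hours a large job wastes when it runs at a fraction of ideal throughput against the GPU-hours a small job loses when it is preempted and restarted from its last checkpoint. The function names, the job durations, and the small job's size are hypothetical. The 0.65 throughput fraction and the 256-GPU job size are taken from the first-fit row of the measurement above.

```python
def wasted_gpu_hours(gpus: int, ideal_hours: float,
                     throughput_fraction: float) -> float:
    """GPU-hours lost when a job runs below its ideal throughput."""
    # The same work takes ideal_hours / throughput_fraction in total,
    # so the excess over the ideal run is the factor (1/f - 1).
    return gpus * ideal_hours * (1 / throughput_fraction - 1)


def preemption_gpu_hours(gpus: int, hours_since_checkpoint: float,
                         restart_hours: float) -> float:
    """GPU-hours a preempted job must redo or spend restarting."""
    return gpus * (hours_since_checkpoint + restart_hours)


# Large job: 256 GPUs at 65% of ideal (from the table); 24 h is assumed.
large_waste = wasted_gpu_hours(gpus=256, ideal_hours=24.0,
                               throughput_fraction=0.65)

# Small job: all three numbers are made up for illustration.
small_loss = preemption_gpu_hours(gpus=8, hours_since_checkpoint=2.0,
                                  restart_hours=0.25)

print(f"waste if we do not preempt: {large_waste:.0f} GPU-hours")
print(f"loss if we do preempt:      {small_loss:.0f} GPU-hours")
print("preempt" if large_waste > small_loss else "wait")
```

Under these assumptions the large job's waste is two orders of magnitude bigger than the small job's loss. The comparison only flips when the large job is small or short, or when the placement penalty is mild.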

Where we are headed

The current scheduler optimizes one job at a time against the current fleet state. The obvious next step is to plan across the queue: look at the next N jobs together and place them to minimize fragmentation over the whole batch. That is a harder problem, and a good topic for a future post.
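A toy example shows why planning across the queue can help. The sketch below is not our scheduler: it assumes each job must fit inside a single pod, models the fleet as a list of free GPU counts per pod, and uses a brute-force search that only works for tiny inputs. All names and numbers are hypothetical.

```python
from itertools import product
from typing import Optional


def best_fit_one_at_a_time(free: list[int],
                           jobs: list[int]) -> list[Optional[int]]:
    """Place each job in the tightest pod that fits, in queue order."""
    free = list(free)
    placement: list[Optional[int]] = []
    for size in jobs:
        fits = [pod for pod, f in enumerate(free) if f >= size]
        if not fits:
            placement.append(None)  # job cannot be placed
            continue
        pod = min(fits, key=lambda p: free[p] - size)
        free[pod] -= size
        placement.append(pod)
    return placement


def plan_batch(free: list[int], jobs: list[int]) -> list[Optional[int]]:
    """Try every job-to-pod assignment; keep one that places the most jobs.

    Exponential in the number of jobs, so for illustration only.
    """
    best: list[Optional[int]] = [None] * len(jobs)
    best_placed = -1
    options: list[Optional[int]] = [None, *range(len(free))]
    for assignment in product(options, repeat=len(jobs)):
        used = [0] * len(free)
        for size, pod in zip(jobs, assignment):
            if pod is not None:
                used[pod] += size
        if any(u > f for u, f in zip(used, free)):
            continue  # over-commits at least one pod
        placed = sum(pod is not None for pod in assignment)
        if placed > best_placed:
            best, best_placed = list(assignment), placed
    return best


free_gpus = [8, 5]   # free GPUs in pod 0 and pod 1
queue = [4, 5, 4]    # job sizes in arrival order

greedy = best_fit_one_at_a_time(free_gpus, queue)  # [1, 0, None]
batch = plan_batch(free_gpus, queue)               # [0, 1, 0]
print(greedy, batch)
```

Placing the first job in the tightest pod strands one GPU there and leaves the third job with nowhere to go. Looking at all three jobs together finds the assignment that fills both pods exactly.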

If this is the kind of thing you want to work on, we're hiring.

Priya Nair
Software Engineer, Scheduling
Marcus Feld
Software Engineer, Control Plane