2 Comments
Rainbow Roxy:

This clarifies a lot. Musubi truly explains why data alignment is key for AI training.

Bjorn Runaker:

Thanks — and I’m glad the “moving bottleneck” + musubi framing landed.

What you described on feature extraction is the same pattern I keep seeing in training stacks: the step that changes everything isn’t “more hardware,” it’s making state and coordination cheap. Once you parallelise, the dominant cost often shifts from compute to how processes communicate progress, ownership, and failure—and if that layer is sloppy, you get the convoy effect even if your GPUs/cores are sitting there ready to work.
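To make the convoy effect concrete, here's a toy back-of-the-envelope sketch (pure Python, the function name is mine): with a synchronous barrier, every step costs as much as the slowest worker, so a single straggler drags the whole fleet's utilisation down even though the other devices are idle and ready.

```python
def sync_step_utilisation(worker_times):
    """With a barrier, each step takes as long as the slowest worker.
    Utilisation = useful work performed / capacity the fleet spent."""
    step = max(worker_times)             # everyone waits for the straggler
    useful = sum(worker_times)           # work actually performed this step
    capacity = step * len(worker_times)  # what the fleet could have done
    return useful / capacity

# Eight identical workers: no convoy, perfect utilisation.
print(sync_step_utilisation([1.0] * 8))          # → 1.0

# One worker 4x slower (say, a mis-selected NIC on all-reduce)
# drags every other worker down with it: 11/32 ≈ 0.34.
print(sync_step_utilisation([1.0] * 7 + [4.0]))
```

The numbers are illustrative, but the shape is the point: the fix is rarely adding capacity, it's removing the reason everyone waits.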

In my case, it showed up as:

• rank/device binding mistakes silently collapsing utilisation,

• NCCL/NIC selection turning all‑reduce into the “slow truck,”

• and framework quirks (like auto‑batch probing under DDP) creating a serial choke point at startup.
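On the first bullet: an explicit guard at startup is cheap insurance. Here's a minimal sketch (pure Python; the helper names are mine, not a real framework API) of the idea: derive the device index from the launcher-provided local rank and fail loudly on a collision, instead of letting several ranks silently pile onto device 0.

```python
import os

def device_for_local_rank(local_rank: int, num_devices: int) -> int:
    """Map a worker's local rank to a GPU index, failing fast on misbinding
    instead of letting ranks silently over-subscribe one device."""
    if num_devices <= 0:
        raise RuntimeError("no devices visible to this process")
    if local_rank >= num_devices:
        raise RuntimeError(
            f"local_rank {local_rank} but only {num_devices} devices visible; "
            "check CUDA_VISIBLE_DEVICES / launcher config"
        )
    return local_rank  # 1:1 binding; anything fancier should be just as explicit

def assert_unique_bindings(bindings: dict) -> None:
    """bindings: rank -> device index on one host. A collision here is
    exactly the 'silent utilisation collapse' failure mode."""
    seen = {}
    for rank, dev in bindings.items():
        if dev in seen:
            raise RuntimeError(f"ranks {seen[dev]} and {rank} both bound to device {dev}")
        seen[dev] = rank

# Typical launchers (e.g. torchrun) export LOCAL_RANK; default to 0 elsewhere.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(device_for_local_rank(local_rank, num_devices=8))
```

A one-time check like this at startup is far cheaper than discovering the collapse from a flat GPU-utilisation graph an hour into a run.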

Your point about state communication is also why I’m biased toward explicit, observable contracts between workers (who owns what shard, what’s the global step, how to tear down cleanly) rather than hoping the framework gets it right by default.
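By "explicit, observable contracts" I mean something as small as a typed record each worker owns, logs, and lets peers compare. A sketch (field and method names are mine, purely illustrative): shard ownership, the global step, a teardown flag, and an idempotent advance so a retried progress message can't double-count.

```python
from dataclasses import dataclass

@dataclass
class WorkerContract:
    """Explicit, observable state each worker owns and reports.
    Idempotency: advancing to a step already recorded is a no-op,
    so a retry after a flaky heartbeat cannot double-count progress."""
    rank: int
    shard_ids: tuple          # which data shards this rank owns
    global_step: int = 0
    shutting_down: bool = False

    def advance_to(self, step: int) -> bool:
        """Record progress; return True only if this call moved the clock."""
        if step <= self.global_step:   # duplicate or retried message: ignore
            return False
        self.global_step = step
        return True

    def begin_teardown(self) -> None:
        self.shutting_down = True      # peers observe this; they never infer it

w = WorkerContract(rank=0, shard_ids=(0, 4, 8))
assert w.advance_to(1) and not w.advance_to(1)  # a retry is a clean no-op
```

The value isn't the dataclass; it's that ownership, progress, and shutdown are stated in one place you can log and diff across workers, rather than implied by whatever the framework happens to do.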

Curious: In your extraction pipeline, was the breakthrough more about reducing coordination frequency (fewer barriers / larger chunks) or making coordination more reliable (better state model, idempotency, retries)?