The Ghost Watch
A short story about processes that wouldn’t die
The dashboard said 100%.
All eight bars lit up, green and confident. I should have felt relief—the overnight run was burning through epochs, the fleet working as one. Musubi achieved.
But something was wrong.
I noticed it in the silence. The server room fans were spinning, but the logs had stopped. No new checkpoints. No loss values scrolling past. The training script’s last heartbeat was forty minutes old.
I pulled up nvidia-smi:
GPU  Temp  Power  Memory          Util
0    38°C  98W    0 MiB / 96 GB   100%
1    39°C  103W   0 MiB / 96 GB   100%
2    37°C  100W   0 MiB / 96 GB   100%
...  ...   ...    ...             ...
Eight GPUs reporting 100% utilisation. Zero memory used. No processes listed.
It was the GPU equivalent of a car revving at redline with no driver in the seat.
The Revenant
I’ve learned to distrust round numbers from machines. A GPU at 87% utilisation is working. A GPU at 100% with nothing in memory is lying—or haunted.
The training job had crashed sometime in the night. Segfault, maybe. Or a stray OOM that the handler missed. The CUDA context was gone—memory released, processes terminated. But the GPUs themselves were stuck in a wedged state, their utilisation counters frozen mid-stride.
In The Fleet Problem, I wrote about musubi—the Japanese concept of connection that creates. Eight processors bound together, gradients flowing in the dark. The binding that makes the fleet move as one.
But musubi has a shadow. When the connection breaks badly, the parts don’t always know they’re alone. They keep signalling readiness to a collective that no longer exists.
Ghost processes. Zombie GPUs. The binding dissolved, but the habit remained.
The Investigation
I ran through the checklist I’ve since committed to memory:
First: confirm it’s not a stale read.
nvidia-smi dmon -s u
The utilisation stayed pinned at 100% across multiple sampling windows. Memory stayed at zero. This wasn’t a display glitch.
Second: check for hidden clients.
sudo lsof /dev/nvidia* | head -200
Only nvidia-persistenced holding the device nodes. No training processes. No stragglers.
Third: check the kernel log.
sudo dmesg -T | grep -iE "nvrm|xid|gpu"
Xid errors. A cascade of them, timestamped around 2:47 AM. The GPUs weren’t just stuck—they were sick. Something had gone wrong below the driver level, where my tools couldn’t reach.
Fourth: attempt a reset.
sudo nvidia-smi --gpu-reset -i 1
“Unable to reset GPU... GPU is in use by another client.”
But there was no client. The ghost was holding the door closed from the inside.
I stopped the persistence daemon. Tried again—same error.
The Only Fix
When the GPU is hung below the driver level, driver-level tools can’t convince it back into reality.
The only fix was a reboot.
I scheduled it, waited for the node to come back up, and ran the environment check. Eight GPUs, all showing 0% utilisation. Zero memory. Ready to work.
The ghosts were gone.
What I Changed
This incident cost me four hours of training time and a night’s sleep. A small price for a lesson on teardown.
First: absolute teardown paths. My training scripts now wrap everything in try/finally. On exit, each rank synchronises CUDA and calls dist.destroy_process_group() to tear down the collective. If a process is going to die, it should die cleanly, releasing its grip on the hardware.
Second: a supervisor who watches for ghosts. The new preflight check runs before every multi-hour job: are any GPUs showing high utilisation with zero memory? Are there lingering holders on /dev/nvidia*? Does the kernel log show fresh Xid errors? If so, we abort before wasting compute on a wedged node.
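The core heuristic of that preflight check is the one the ghosts taught me: high utilisation with empty memory is a contradiction. A sketch, assuming the stats have already been parsed from `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits` (the thresholds are illustrative):

```python
def find_ghost_gpus(stats, util_floor=90, mem_ceiling_mib=64):
    """Return indices of GPUs that look wedged.

    stats: list of (util_percent, memory_used_mib) tuples, one per GPU.
    A healthy busy GPU has high util AND real memory; a ghost has
    pinned util with (near-)zero memory.
    """
    return [
        i for i, (util, mem) in enumerate(stats)
        if util >= util_floor and mem <= mem_ceiling_mib
    ]

# GPU 0 is busy with real work, GPU 1 is idle, GPU 2 is the ghost.
print(find_ghost_gpus([(87, 41210), (0, 0), (100, 0)]))  # [2]
```

If this returns anything, the supervisor aborts the launch and flags the node instead of feeding it a multi-hour job.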
Third: isolation. On shared boxes, I scope CUDA_VISIBLE_DEVICES so a crashing experiment can only poison a subset of GPUs. The blast radius is contained.
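The scoping itself is one environment variable. A sketch (train.py and the device list are placeholders for your own launcher and allocation):

```shell
# Scope this experiment to the first four GPUs; a crash here cannot
# wedge GPUs 4-7, which stay visible to other jobs on the box.
export CUDA_VISIBLE_DEVICES=0,1,2,3

# The process only ever sees the devices it was given:
python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
# prints: 0,1,2,3

# Then launch as usual, e.g.:
# python3 train.py
```

Inside the process, CUDA renumbers the visible devices from zero, so the training code itself never needs to know which physical GPUs it was handed.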
torch.cuda.empty_cache() isn't a reset button. It releases unused cached blocks from PyTorch's caching allocator back to the driver, nothing more. It helps with fragmentation. It doesn't exorcise ghosts.
The Lesson
Musubi requires a clean entry and a clean exit.
The binding between eight GPUs isn’t just about launching them together—it’s about ensuring they can release together. A process that crashes without being torn down leaves a residue. Enough residue, and the GPUs forget how to be idle.
The fleet problem isn’t just coordination. It’s also dissolution. Knowing when the convoy has ended. Letting the ships return to port.
I think about that image now whenever I write shutdown logic: the server room at 3 AM, fans spinning, utilisation bars green and lying. Eight GPUs doing deadlifts in a dark room with no one watching.
The next morning, after the reboot, I queued the training run again. Defined “solved” before starting. Wrote down the one change I was testing. Noted when I would check results.
Then I closed the laptop and went to sleep.
The computer can run the night shift. But only if I teach it how to stop.
I made a song about the ghost watch. About processes that wouldn’t die, and the lesson they taught me about teardown.
The catchy part, "teach it how to stop", kept running through my head while I was debugging at 3 AM. Binding isn't just about starting together. It's about releasing together. Musubi needs an exit as much as an entry.
Try. Finally. Then sleep.
This is a companion piece to The Fleet Problem, which explores scaling to eight GPUs and the Japanese concept of musubi—a connection that creates.