-
InfiniBand Is Losing the Fabric War. Here’s What That Changes for Your Architecture.
The InfiniBand vs RoCEv2 decision has been settled at the hyperscaler level — and the answer is Ethernet. Broadcom’s March 2026 earnings confirmed what most AI infrastructure architects had already suspected: roughly 70% of new AI infrastructure deployments are now choosing Ethernet-based fabrics over InfiniBand. That number is worth sitting with for a moment —…
-
Deterministic Networking: The Missing Layer in AI-Ready Infrastructure
Deterministic Networking for AI Infrastructure: Engineering the System Backplane. Deterministic networking is the infrastructure requirement that most AI cluster designs get wrong, not because the concept is misunderstood, but because it gets treated as a networking problem when it is actually a systems problem. In the legacy data center, networking was a best-effort transport…
-
GPU Fabric Physics 2026: Why 800G Isn’t Enough for 100k-GPU Training
The NCCL Timeout Nightmare. GPU fabric physics is where $50 million clusters go to die. You wired up 800G OSFP optics, fired up your 100,000-GPU cluster for the Big Run, and six hours in, the loss curve flatlines. Logs start screaming: NCCL_WATCHDOG_TIMEOUT. It’s not a bad GPU. It’s not a driver crash. Honestly, it’s…
-
GPU Cluster Architecture: Engineering the Hardware Stack for Private LLM Training
Private AI infrastructure is systems engineering, not optimization. If you treat a GPU cluster like a standard virtualization farm, you will fail. I have seen deployments where millions of dollars in H100s sat idle 40% of the time because the architect underestimated the network fabric or the storage controller’s ability to swallow a checkpoint. The…
