AI Infrastructure: Learning Path
Operational · Maturity Stage 02

FABRIC ARCHITECTURE

Fabric topology, execution locality, and east-west congestion — the network layer that determines where AI workloads can run.

East-west fabric constraints determine where AI workloads can physically execute — before the scheduler is ever consulted.

>_ Architecture Maturity Position

Current Stage

Operational — Maturity Stage 02 of 07

Primary Architectural Concern

East-west fabric constraints that determine where execution, data, models, and context can physically move at scale.

Primary Architectural Tension

Execution locality (performance and efficiency) vs. execution mobility (flexibility and resource utilization). Optimizing for one degrades the other; most AI cluster failures at scale trace back to a design that never made this tension explicit.

Primary Failure Mode

Fabric-Blind Architecture — cluster designs that model GPU capacity, storage capacity, and scheduling behavior independently while treating east-west networking as shared utility infrastructure. Locality collapse, congestion amplification, and stranded GPU capacity follow at scale.

Stage Outcome

Ability to evaluate fabric topology against AI workload demands, identify the Execution Locality Boundary (#116), and specify network requirements before cluster design or procurement decisions are made.

Next Stage

A3 — AI Storage & Data Pipeline Architecture → /ai-architecture-learning-path/ai-storage-data-pipeline-architecture/

AI fabric architecture is the constraint layer that precedes every placement, scheduling, and locality decision in an AI cluster. Before the scheduler runs, before workloads are admitted, before inference routes are evaluated — the fabric has already determined what is physically possible. The east-west bandwidth envelope, the oversubscription ratio, the congestion control model, and the topology design are all decided at infrastructure time, not runtime. Those decisions propagate silently into every workload outcome that follows.

Teams that treat east-west networking as plumbing don’t encounter the gap at design time — they encounter it when training jobs stall at scale, inference latency spikes under load, or congestion collapses throughput in ways that present as GPU failures or scheduler inefficiency. By then the architectural decision has already been made, often at procurement, often without any analysis of the Execution Locality Boundary that governs where execution can physically occur. This stage exists to move that analysis to where it belongs — before the cluster is designed.

>_ Why This Stage Exists

Named Failure State — Fabric-Blind Architecture

Fabric-Blind Architecture is the condition where network constraints are treated as implementation details rather than architectural constraints. Cluster designs model GPU capacity, storage capacity, and scheduling behavior as independent variables while east-west networking is provisioned as a shared utility service — sized for average load, not for the amplified east-west demand that AI workloads generate at scale.

Once the Execution Locality Boundary is crossed, the fabric becomes the dominant workload-placement authority regardless of scheduler intent. The scheduler can route correctly — the fabric still determines whether execution is feasible. This ties Framework #103 (Infrastructure Authority Migration) and Framework #116 (Execution Locality Boundary) together as the core architectural identity of this stage: the network stops being passive infrastructure and starts making placement decisions whether or not the architecture acknowledges it.

Framework #116 — Execution Locality Boundary

The Execution Locality Boundary is the point at which moving data, models, or context costs more than moving execution, causing network architecture to become the dominant workload-placement constraint. It is not a threshold that can be calculated from a spec sheet — it emerges from the intersection of workload access patterns, model size, context window requirements, and east-west bandwidth capacity. Identifying it before cluster procurement is the primary architectural outcome of this stage.
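
The boundary cannot be read off a spec sheet, but a first-order comparison can show which side of it a proposed placement is likely to fall on. The sketch below is a minimal illustration, assuming movement cost can be approximated as payload size divided by effective east-west bandwidth and that the cost of relocating execution is supplied by the caller. The names `MovementEstimate` and `estimate_movement` and all numbers are hypothetical, not part of Framework #116.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovementEstimate:
    data_move_s: float       # time to bring data/model/context to the execution site
    execution_move_s: float  # time to relocate execution next to the data (caller-supplied)

    @property
    def boundary_crossed(self) -> bool:
        # Past the boundary when moving the payload costs more than moving execution.
        return self.data_move_s > self.execution_move_s


def estimate_movement(payload_bytes: float,
                      effective_gbps: float,
                      execution_move_s: float) -> MovementEstimate:
    """First-order estimate: payload size over effective east-west bandwidth."""
    if effective_gbps <= 0:
        raise ValueError("effective_gbps must be positive")
    data_move_s = (payload_bytes * 8) / (effective_gbps * 1e9)
    return MovementEstimate(data_move_s, execution_move_s)


# Illustrative numbers only: 140 GB of weights over 100 Gb/s of effective bandwidth,
# against an assumed 5 s cost to reschedule execution next to the data.
estimate = estimate_movement(140e9, 100.0, 5.0)
print(f"data move: {estimate.data_move_s:.1f} s, boundary crossed: {estimate.boundary_crossed}")
```

A real assessment would replace the single bandwidth figure with measured behavior under contention, which is why the boundary has to be modeled per workload rather than computed once.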

Three Failure Patterns This Stage Prevents

  • 01 Fabric saturation misdiagnosed as GPU underperformance or scheduler inefficiency: the root cause remains invisible until workload density increases
  • 02 InfiniBand vs. RoCEv2 selection made on throughput spec alone without topology requirements or congestion control analysis: a procurement decision that cannot be corrected at runtime
  • 03 Cluster procurement finalized before the Execution Locality Boundary is identified: placement constraints discovered post-deployment when architectural correction is no longer economically viable
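
The first pattern is easier to catch when step time is split into compute and exposed communication instead of being read from GPU utilization alone. The sketch below is one hedged way to do that, assuming per-step timings are already available from a profiler. The function name `classify_step` and the 0.3 threshold are illustrative assumptions, not values from this stage.

```python
def classify_step(step_s: float,
                  compute_s: float,
                  exposed_comm_s: float,
                  comm_threshold: float = 0.3) -> str:
    """Label a training step by where its wall-clock time goes.

    exposed_comm_s is communication time NOT overlapped with compute,
    i.e. time the GPUs spend waiting on the fabric.
    """
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    comm_fraction = exposed_comm_s / step_s
    compute_fraction = compute_s / step_s
    if comm_fraction >= comm_threshold:
        return "fabric-bound"
    if compute_fraction >= 1.0 - comm_threshold:
        return "compute-bound"
    # Neither dominates: look elsewhere (input pipeline, storage, host stalls).
    return "other"


# Low GPU utilization in the first case comes from waiting on the network,
# not from a slow accelerator.
print(classify_step(step_s=1.0, compute_s=0.55, exposed_comm_s=0.40))
print(classify_step(step_s=1.0, compute_s=0.90, exposed_comm_s=0.05))
```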

>_ What This Stage Is Not

01 — Not a Networking Fundamentals Primer

This stage assumes working familiarity with switching, routing, and basic topology. The concern is architectural constraint modeling at the AI workload layer — not introductory networking concepts.

02 — Not a Vendor Selection Guide

InfiniBand vs. RoCEv2 is an architectural tradeoff analysis, not a product comparison. This stage is not about switch SKU selection — it is about understanding which fabric model fits which workload topology and why the decision cannot be deferred.

03 — Not a Kubernetes Networking Tutorial

CNI plugins, service mesh, and overlay networking belong to the cluster orchestration layer. Those are A4 concerns. A2 operates at the physical and logical fabric layer that underlies every workload container regardless of how the scheduler addresses it.

04 — Not a Substitute for A4

Fabric constraints define what is physically possible. The scheduler and placement authority layer in A4 decides what actually happens within those constraints. A2 answers: what constrains execution movement? A4 answers: given those constraints, who decides where execution occurs? Different layer. Different architectural concern.

Stage: 2 of 7 · Articles in stage: 5 · Estimated depth: 3–4 hrs · Stage sequencing last reviewed: June 2026

>_ Where to Enter This Stage

The default entry point for this stage is completion of A1 — Accelerated Compute Architecture — or equivalent working vocabulary: VRAM constraints, interconnect topology, GPU scheduling primitives, and the distinction between compute-bound and memory-bound workloads. A1 establishes how accelerated compute behaves. A2 establishes what constrains its movement.

Architects who already hold fabric-layer vocabulary — east-west amplification, oversubscription physics, RoCEv2 congestion control mechanisms — can enter directly at Cluster 02. The Cluster 01 articles are still recommended as calibration for how those concepts interact specifically with AI workload demand patterns, but they are not required prerequisites for experienced network architects.

A4 — AI Runtime & Cluster Orchestration — should not be entered without completing this stage. Placement and scheduling decisions made without a fabric constraint model produce cluster designs that are architecturally unsound before the first workload runs. The scheduler assumes a fabric. A2 is where that assumption gets examined.

>_ Where This Stage Sits

AI Infrastructure Architecture Path — Maturity Progression

Stage | Architectural Question | Maturity Level
A1 | How does accelerated compute behave? | Foundation
A2 ← YOU ARE HERE | What constrains execution movement? | Operational
A3 | What constrains data movement? | Operational
A4 | Who decides where execution occurs? | Strategic
A5 | How is execution operated? | Strategic
A6 | Who governs execution authority? | Strategic
A7 | How does execution survive failure? | Resilient
Stage 02 of 07 in the AI Infrastructure Architecture Path — Operational maturity, fabric constraint layer.

>_ Stage Reading Sequence

Each cluster below is organized by architectural problem. Every cluster answers: what becomes architecturally unstable if this discipline is misunderstood?

Cluster 01 — Fabric Physics

How AI fabric behaves at scale

Published
C1 · Prerequisite Condition

Deterministic Networking: The Missing Layer in AI-Ready Infrastructure

Non-deterministic networking is an architectural liability before any other fabric decision is made. This article establishes why fabric behavior must be specified and validated — not assumed — and what the operational consequences are when latency and delivery guarantees are left to chance at AI scale.

1 article · ~45 min
Published
C1 · Scale Physics

GPU Fabric Physics 2026: Why 800G Isn’t Enough for 100k-GPU Training

East-west traffic amplification, oversubscription physics, and the scale failure modes that emerge when fabric is sized for throughput rather than topology. This article makes the Execution Locality Boundary concrete — demonstrating at what point movement costs exceed compute costs and how that threshold shifts with cluster scale.

1 article · ~50 min

Cluster 02 — Congestion & Locality

Where and why execution movement breaks down

Published
C2 · Protocol Constraint

InfiniBand Is Losing the Fabric War. Here’s What That Changes for Your Architecture.

The architectural tradeoff between guaranteed delivery and commodity scale — congestion control requirements, topology fit, and why the InfiniBand vs. RoCEv2 decision is not a vendor preference question. The article surfaces the protocol-level constraints that make this a pre-procurement architectural decision, not a post-deployment optimization.

1 article · ~50 min

>_ AI Fabric Architecture Failure Patterns

01 Execution Locality Boundary exceeded — data, model, or context movement costs exceed compute costs; the fabric becomes the de facto scheduling bottleneck regardless of scheduler intent
02 Fabric saturation misread as GPU underperformance — congestion presents as accelerator failure; root cause remains invisible until workload density increases past the next threshold
03 RoCEv2 deployed without PFC/ECN — congestion control absent at the protocol layer; lossless behavior assumed but not enforced; fabric degrades under load in ways that are difficult to attribute
04 InfiniBand selected on throughput spec, not topology — scale failure mode latent at procurement; topology requirements discovered post-deployment when architectural correction is no longer economically viable

Cluster 03 — Network Authority

The network layer as architectural decision-maker

Published
C3 · Control Plane Emergence · Framework #103

The Network Is Becoming the AI Control Plane

Framework #103 — Infrastructure Authority Migration. As AI systems scale, execution feasibility becomes increasingly governed by network constraints rather than scheduler intent. The fabric layer evolves from transport mechanism to control-plane authority — making placement decisions whether or not the architecture acknowledges it. This article is the bridge between A2’s fabric constraint model and A4’s placement authority layer.

1 article · ~45 min
Published
C3 · Placement Consequence

AI Placement Decisions Are Architecture — Not Optimization

How fabric constraints propagate into placement economics — the cost and latency consequences of decisions made without a locality model. This article closes A2 by making the downstream propagation explicit: the constraints established in this stage are the starting assumptions for A3 (data locality) and the boundary conditions for A4 (placement authority).

1 article · ~45 min

>_ Stage Graduates Can Now

You can now operate at the fabric layer with architectural intent. A1 established how accelerated compute behaves; A2 establishes what constrains its movement. The next stages add the data and decision-authority layers: A3 asks what constrains data movement across that same fabric, and A4 asks, given these constraints, who controls where execution occurs and under what enforcement model.

  • Evaluate InfiniBand vs. RoCEv2 selection against workload topology and congestion control requirements — not throughput specifications
  • Identify the Execution Locality Boundary (#116) in a planned or existing cluster design before procurement decisions are made
  • Diagnose Fabric-Blind Architecture failure modes — fabric saturation events that present as GPU underperformance or scheduler inefficiency
  • Specify east-west bandwidth, oversubscription ratios, and congestion control requirements as first-class cluster design inputs
  • Recognize when fabric constraints have become the dominant execution authority within a platform — and identify where scheduler decisions are no longer the primary determinant of workload placement
  • Upstream bridge: fabric constraint vocabulary established here propagates directly into A3 storage locality decisions and A4 placement authority design — both stages assume this constraint model as their starting point

>_ Where Do You Go From Here?

← Previous Stage
A1 — Accelerated Compute Architecture. How GPUs, TPUs, and accelerators behave — VRAM constraints, interconnect physics, and execution locality at the compute layer.
Open Stage →
→ Next Stage
A3 — AI Storage & Data Pipeline Architecture. What constrains data movement across the fabric — data locality, pipeline latency, tiering, and checkpoint architecture.
Open Stage →
AI Infrastructure Architecture Path
The full seven-stage path — accelerated compute through AI system survivability. Return to the domain path for the complete maturity spine.
Open Domain Path →
AI Fabric Pressure Analyzer
Model east-west saturation, identify congestion thresholds, and validate fabric architecture against AI workload demand profiles.
Open Tool →
AI Inference Saturation Analyzer
Fabric constraint awareness from A2 feeds directly into inference saturation modeling — throughput, queue collapse, and the cost of ignoring locality at runtime.
Open Tool →
Virtualization Architecture Path
Cross-domain: network integration at the Operational stage — how east-west fabric design applies to hypervisor-based workload environments.
Open Domain Path →
Learning Paths
All five domain paths and the full maturity spine — return to the top-level reading architecture.
Open Learning Paths →

>_ Frequently Asked Questions

Q: What is the Execution Locality Boundary and why does it determine AI workload placement?

A: The Execution Locality Boundary (Framework #116) is the point at which moving data, models, or context costs more than moving execution — causing network architecture to become the dominant workload-placement constraint. Below this boundary, schedulers can route workloads freely across the cluster. Above it, the fabric’s east-west bandwidth and topology constraints become the primary determinant of where execution can actually occur. The significance for architects is that this boundary exists in every AI cluster but is only discovered at procurement time if it is explicitly modeled. Post-deployment discovery means the architectural correction is no longer economically viable.

Q: When does fabric architecture become the binding constraint on AI cluster performance?

A: Fabric architecture becomes the binding constraint when east-west traffic demand (driven by gradient synchronization, activation exchange, model parallelism, or context distribution) exceeds the bandwidth actually available between compute nodes once oversubscription is accounted for. At that point, the fabric is making placement decisions that override scheduler intent. The threshold is not fixed; it depends on model size, parallelism strategy, batch size, and topology. The Cluster 01 articles establish the physics of where this threshold sits for different cluster configurations.
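
The comparison in this answer can be sketched for a single leaf switch. This is a simplified model under stated assumptions: a two-tier leaf/spine fabric, uniform per-node demand, and a known fraction of traffic that must leave the leaf. The function names and the example port counts are hypothetical.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing to spine-facing capacity on one leaf switch."""
    if uplink_gbps <= 0:
        raise ValueError("uplink_gbps must be positive")
    return downlink_gbps / uplink_gbps


def fabric_is_binding(nodes_per_leaf: int,
                      per_node_demand_gbps: float,
                      cross_leaf_fraction: float,
                      uplink_gbps: float) -> bool:
    """True when traffic that must leave the leaf exceeds its uplink capacity."""
    if not 0.0 <= cross_leaf_fraction <= 1.0:
        raise ValueError("cross_leaf_fraction must be between 0 and 1")
    cross_leaf_demand = nodes_per_leaf * per_node_demand_gbps * cross_leaf_fraction
    return cross_leaf_demand > uplink_gbps


# Assumed leaf: 32 nodes at 400 Gb/s each, 16 uplinks at 400 Gb/s each.
ratio = oversubscription_ratio(32 * 400, 16 * 400)
spread_out = fabric_is_binding(32, 300, 0.8, 16 * 400)  # job spans many leaves
localized = fabric_is_binding(32, 300, 0.5, 16 * 400)   # more traffic stays local
print(ratio, spread_out, localized)
```

The same hardware flips from non-binding to binding purely because of how much traffic crosses the leaf boundary, which is the sense in which the threshold depends on parallelism strategy and topology rather than on link speed alone.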

Q: InfiniBand vs. RoCEv2 — how should an architect frame that decision beyond throughput specifications?

A: The decision should be framed around three variables: topology requirements (fat-tree vs. dragonfly vs. rail-optimized), congestion control architecture (RDMA losslessness via PFC/ECN vs. InfiniBand’s native flow control), and operational complexity tolerance. Throughput specifications are a starting point, not the decision. RoCEv2 at scale without proper PFC/ECN configuration degrades under load in ways that are difficult to attribute and harder to correct. InfiniBand provides native lossless delivery but constrains topology choices and introduces vendor concentration risk. The architectural question is not which delivers more bandwidth — it is which fabric model the workload topology can actually use.
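
To make "proper PFC/ECN configuration" more concrete, the sketch below shows a RED-style ECN marking curve of the kind commonly used for RoCEv2 congestion signaling: no marking below a minimum queue depth, a linear ramp up to a maximum probability, and marking of every packet beyond the maximum threshold. The parameter names (`k_min_kb`, `k_max_kb`, `p_max`) and the values are illustrative assumptions, not any vendor's defaults, and PFC pause behavior is a separate mechanism not modeled here.

```python
def ecn_mark_probability(queue_kb: float,
                         k_min_kb: float,
                         k_max_kb: float,
                         p_max: float) -> float:
    """Probability that a packet is ECN-marked at a given egress queue depth."""
    if not 0 <= k_min_kb < k_max_kb:
        raise ValueError("require 0 <= k_min_kb < k_max_kb")
    if not 0.0 <= p_max <= 1.0:
        raise ValueError("p_max must be between 0 and 1")
    if queue_kb <= k_min_kb:
        return 0.0  # queue is shallow: no congestion signal
    if queue_kb > k_max_kb:
        return 1.0  # queue is deep: mark every packet
    # Linear ramp from 0 at k_min up to p_max at k_max.
    return p_max * (queue_kb - k_min_kb) / (k_max_kb - k_min_kb)


for depth_kb in (50, 250, 400, 500):
    print(depth_kb, ecn_mark_probability(depth_kb, 100, 400, 0.2))
```

If the thresholds are set too high, senders get the signal only after queues are already deep and pause frames start to spread; if they are left unset, there is no signal at all, which matches the hard-to-attribute degradation described above.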

Q: What is east-west traffic amplification and why does it behave differently in AI clusters than general compute?

A: In general compute environments, east-west traffic is driven by microservice communication: relatively low bandwidth, high message rate, short duration. In AI clusters, east-west traffic is driven by distributed training operations that require synchronized parameter updates across every GPU involved in a training run. A single all-reduce operation in a 512-GPU training job generates aggregate traffic that scales with the size of the gradients being synchronized multiplied by the number of participating GPUs, and every participant must finish before the step can proceed. That amplification means AI clusters require fabric bandwidth and topology designed for sustained, high-bandwidth collective communication, not the bursty low-bandwidth patterns that general compute fabrics handle well.
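
The amplification can be worked through for one common case. The sketch below assumes a ring all-reduce, where each GPU sends 2(N-1)/N times the payload, so aggregate traffic grows as roughly 2(N-1) times the payload. The source does not name a specific algorithm, so treat the ring choice, the function name, and the 10 GB gradient size as assumptions for illustration.

```python
def ring_allreduce_traffic(payload_bytes: float, num_gpus: int) -> tuple[float, float]:
    """Return (bytes sent per GPU, aggregate bytes on the fabric) for one ring all-reduce."""
    if num_gpus < 2:
        raise ValueError("all-reduce needs at least 2 participants")
    # Reduce-scatter plus all-gather: each phase moves (N-1)/N of the payload per GPU.
    per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    aggregate = per_gpu * num_gpus
    return per_gpu, aggregate


# Assumed example: 10 GB of gradients synchronized across 512 GPUs.
per_gpu, aggregate = ring_allreduce_traffic(10e9, 512)
print(f"per GPU: {per_gpu / 1e9:.2f} GB, aggregate: {aggregate / 1e12:.2f} TB")
```

Per-GPU traffic stays close to twice the payload regardless of scale, but the aggregate the fabric must carry on every step grows linearly with the number of participants.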

Q: What is Fabric-Blind Architecture and what failure modes does it produce at scale?

A: Fabric-Blind Architecture is the condition where a cluster design models GPU capacity, storage capacity, and scheduling behavior as independent variables while treating east-west networking as a shared utility service. The failure modes that follow are: locality collapse (execution placed without respect to data proximity, causing movement costs to dominate runtime), congestion amplification (bandwidth contention that presents as GPU idleness or scheduler inefficiency), and stranded GPU capacity (accelerators that cannot be efficiently utilized because the fabric cannot support the communication patterns the workloads require). The failure state is architecturally preventable — it emerges from treating fabric as an implementation detail rather than a first-class architectural constraint.

Q: How do the fabric constraints established in this stage affect placement and scheduling decisions in A4?

A: A4 — AI Runtime & Cluster Orchestration — inherits the fabric constraint model from A2 as its boundary condition. The scheduler in A4 operates within the feasibility space defined by the fabric topology, east-west bandwidth capacity, and the Execution Locality Boundary. A scheduler that is unaware of these constraints will produce placement decisions that are logically correct but physically suboptimal — routing workloads to nodes that the fabric cannot efficiently connect at the required bandwidth. A4’s placement authority layer is only architecturally sound if it is built on the constraint model that A2 establishes.
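
A minimal sketch of what "built on the constraint model" could mean in practice: candidate placements are filtered by the bandwidth the fabric can sustain between the selected GPUs, not only by free GPU count. The `Placement` type, its fields, and the candidate values are hypothetical and do not represent any scheduler's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    name: str
    free_gpus: int
    sustained_gbps: float  # bandwidth the fabric can sustain between these GPUs


def feasible_placements(candidates: list[Placement],
                        gpus_needed: int,
                        min_gbps: float) -> list[Placement]:
    """Keep only placements that satisfy both the GPU count and the fabric requirement."""
    return [
        p for p in candidates
        if p.free_gpus >= gpus_needed and p.sustained_gbps >= min_gbps
    ]


candidates = [
    Placement("same-leaf", free_gpus=8, sustained_gbps=400.0),
    Placement("cross-spine", free_gpus=64, sustained_gbps=100.0),
]

# A fabric-blind check on GPU count alone would accept both candidates.
by_gpu_count = [p.name for p in candidates if p.free_gpus >= 8]
fabric_aware = [p.name for p in feasible_placements(candidates, 8, 200.0)]
print(by_gpu_count, fabric_aware)
```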

>_ Related Systems

A1 — Accelerated Compute Architecture

Foundation stage — GPU and accelerator mechanics that A2 fabric constraints are designed around. Required context for the Execution Locality Boundary.

Open Stage →
A3 — AI Storage & Data Pipeline Architecture

Next stage — data locality and pipeline constraints that inherit the fabric model established here. A2 and A3 together define the full movement constraint envelope.

Open Stage →
A4 — AI Runtime & Cluster Orchestration

Strategic stage — placement authority and scheduling decisions operate within the constraint boundary A2 defines. The scheduler’s feasibility space is determined here.

Open Stage →
The Network Is Becoming the AI Control Plane

Framework #103 — Infrastructure Authority Migration. The doctrinal anchor for Cluster 03 — how the fabric layer acquires control plane authority as AI workload complexity scales.

Open Article →
AI Fabric Pressure Analyzer

Model east-west saturation thresholds and validate fabric architecture against AI workload demand profiles — apply the constraint model from this stage interactively.

Open Tool →
Virtualization — Networking Architecture Track

Cross-domain: east-west fabric design principles as applied to hypervisor-based environments — the constraint model overlaps significantly at the physical layer.

Open Track →
NVIDIA Quantum-2 InfiniBand Architecture

NVIDIA’s architecture documentation for Quantum-2 — reference for the topology and congestion control characteristics covered in the InfiniBand analysis in this stage.

Open Reference →
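RFC 8257 (DCTCP) and RoCEv2 Congestion Management

RFC 8257 is the IETF's informational description of Data Center TCP (DCTCP), an ECN-based congestion control scheme, and is useful background for the ECN signaling that RoCEv2 congestion management depends on. RoCEv2 itself is specified by the InfiniBand Trade Association and PFC by IEEE 802.1Qbb; those are the sources for the requirements that determine whether RoCEv2 delivers lossless behavior under load.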

Open Reference →