AI training fabric design

Pick the right fabric for your training cluster

InfiniBand or RoCE? We design, build, and tune the interconnect for large-scale GPU training — from intra-server NVLink all the way to cross-site WAN, for on-prem, colocated, or ApeTops-hosted clusters.

Three planes of communication

A training cluster moves traffic across three distinct planes — each with different bandwidth, latency, and topology constraints.

Inside the server

NVLink, NVSwitch, and PCIe traffic between GPUs, CPUs, and NICs within a single chassis — where raw bandwidth dominates.

Inside the cluster

Server-to-server traffic across the training fabric — where collective-op latency and non-blocking bisection bandwidth determine throughput.
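For intuition on why collective latency and fabric bandwidth both matter, here is a minimal back-of-envelope sketch using the standard ring all-reduce cost model; the message size, rank count, link rate, and latency figures are illustrative assumptions, not measurements from a specific fabric.

```python
# Ring all-reduce estimate: 2*(N-1) steps, each moving S/N bytes per rank.
# All inputs below are illustrative assumptions.

def ring_allreduce_time_s(message_bytes: float,
                          num_ranks: int,
                          link_bw_GBps: float,
                          per_hop_latency_us: float) -> float:
    steps = 2 * (num_ranks - 1)
    bw_time = steps * (message_bytes / num_ranks) / (link_bw_GBps * 1e9)
    lat_time = steps * per_hop_latency_us * 1e-6
    return bw_time + lat_time

# Example: a 1 GB gradient bucket across 256 ranks on a 25 GB/s (HDR) fabric.
t = ring_allreduce_time_s(1e9, 256, 25.0, 5.0)
print(f"~{t * 1e3:.0f} ms per all-reduce")  # ~82 ms under these assumptions
```

The bandwidth term approaches 2·S/B as rank count grows, while the latency term grows linearly with ranks, which is why both non-blocking bisection bandwidth and per-hop latency show up in training throughput.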

Across clusters

WAN traffic between sites for federated training, data staging, and DR — where deterministic latency and throughput guarantees matter.
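As a rough illustration of why WAN throughput guarantees matter for data staging, a back-of-envelope transfer-time sketch; the checkpoint size, link rate, and efficiency factor are assumptions for illustration only.

```python
# Bulk cross-site transfer estimate; `efficiency` hedges protocol and
# encapsulation overhead on the WAN path.

def transfer_time_hours(data_TB: float, wan_gbps: float,
                        efficiency: float = 0.85) -> float:
    bits = data_TB * 1e12 * 8
    return bits / (wan_gbps * 1e9 * efficiency) / 3600

# Example: staging a 3 TB checkpoint over a 100 Gbps inter-site link.
print(f"~{transfer_time_hours(3, 100) * 60:.0f} min")  # roughly 5 minutes
```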

Reference: A100 server interconnect

What a single DGX-class node looks like under the hood — the baseline we design fabrics around.

Per-server components

  • 2× CPU sockets
  • 8× A100 GPUs
  • 6× NVSwitch chips
  • 4× PCIe Gen4 switch chips
  • 8× InfiniBand compute NICs
  • 2× InfiniBand storage NICs (BlueField-3 DPU)

Link bandwidths

Link                      Bandwidth
NVLink (A100, per GPU)    600 GB/s
NVLink (A800, per GPU)    400 GB/s
GPU ↔ NIC over PCIe       32 GB/s
InfiniBand HDR            200 Gbps (25 GB/s)
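Putting the component counts and the table together, a quick sketch of the resulting bandwidth hierarchy; the figures are taken from the table above, and the point is the ratio rather than the exact numbers.

```python
# Bandwidth hierarchy in the reference node (values from the table above).
NVLINK_A100_GBPS = 600   # per-GPU NVLink aggregate
PCIE_GEN4_GBPS   = 32    # GPU <-> NIC path over PCIe
IB_HDR_GBPS      = 25    # one 200 Gbps HDR port
COMPUTE_NICS     = 8     # one HDR NIC per GPU in the reference design

server_egress = COMPUTE_NICS * IB_HDR_GBPS          # 200 GB/s for the whole chassis
nvlink_vs_fabric = NVLINK_A100_GBPS / IB_HDR_GBPS   # ~24x per GPU

print(f"Per-server fabric egress : {server_egress} GB/s")
print(f"NVLink vs fabric per GPU : {nvlink_vs_fabric:.0f}x")
# The ~24x gap is why collectives are arranged to keep as much traffic as
# possible on NVLink, and why rail-optimized fabric topologies pay off.
```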

InfiniBand vs. RoCE

Both are viable at scale. We pick based on cluster size, budget envelope, and operational maturity.

InfiniBand

  • Closed, vertically integrated architecture
  • A collective-op performance edge of roughly 20% or more over RoCE
  • Higher CapEx — component pricing materially above Ethernet
  • Best fit: small-to-medium training clusters where every percent matters

RoCE

  • Open Ethernet-based ecosystem with broad vendor choice
  • Significantly lower cost per port at scale
  • Fast-moving technology curve — 800G fabrics now shipping
  • Best fit: mid-to-large training clusters with strong network ops teams (see the NCCL transport sketch below)
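As a taste of the operational difference, a hedged sketch of NCCL transport settings for an IB versus RoCE fabric; the HCA name, GID index, and interface prefix are site-specific assumptions and will differ on any real cluster.

```python
import os

# Illustrative transport selection for a job launcher; every value here is a
# placeholder to adapt, not a recommendation for a specific cluster.
FABRIC = "roce"   # or "ib"

env = {
    "NCCL_DEBUG": "WARN",
    "NCCL_IB_HCA": "mlx5",            # restrict NCCL to the RDMA-capable HCAs
}
if FABRIC == "roce":
    env.update({
        "NCCL_IB_GID_INDEX": "3",     # RoCEv2 GID index (site-specific assumption)
        "NCCL_SOCKET_IFNAME": "eth",  # out-of-band bootstrap interface prefix
    })
os.environ.update(env)
# Launch as usual afterwards (torchrun, mpirun, ...); NCCL reads these at init.
```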

What we deliver

End-to-end fabric engagement — from napkin sketch to production-ready cluster.

  • Topology design: fat-tree, dragonfly+, rail-optimized
  • Non-blocking 400G IB and 800G RoCE fabric builds
  • Rack-and-stack, cabling, and optical budget validation
  • Congestion-control and QoS tuning (ECN, DCQCN, adaptive routing)
  • NCCL- and SHARP-aware collective optimization
  • End-to-end benchmarking: all-reduce, all-to-all, MFU targets (estimation sketch below)
  • Cross-site WAN interconnect for federated training
  • Private and public deployments — on your floor or in ours
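For the MFU targets mentioned above, a minimal estimation sketch using the common 6 × params × tokens/sec FLOP approximation for dense transformers; the model size, throughput, and GPU count are illustrative assumptions, and the peak figure is the A100 dense BF16 rating.

```python
# Model FLOPs utilization: achieved FLOPs/s over aggregate peak FLOPs/s.
# 312 TFLOPS is the A100 dense BF16 peak; swap in your GPU's figure.

def mfu(params_billion: float, tokens_per_sec: float, num_gpus: int,
        peak_tflops_per_gpu: float = 312.0) -> float:
    achieved = 6 * params_billion * 1e9 * tokens_per_sec
    peak = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved / peak

# Illustration only: a 70B-parameter model at 400k tokens/s on 1,024 GPUs.
print(f"MFU ~ {mfu(70, 4.0e5, 1024):.0%}")   # ~53% under these assumed numbers
```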

Planning a training cluster?

Share your GPU count, model size, and training targets — we'll come back within two business days with a fabric recommendation, BOM, and deployment timeline.