Pick the right fabric for your training cluster
InfiniBand or RoCE? We design, build, and tune the interconnect for large-scale GPU training — from intra-server NVLink all the way to cross-site WAN, for on-prem, colocated, or ApeTops-hosted clusters.
Three planes of communication
A training cluster moves traffic across three distinct planes — each with different bandwidth, latency, and topology constraints.
Inside the server
NVLink, NVSwitch, and PCIe traffic between GPUs, CPUs, and NICs within a single chassis — where raw bandwidth dominates.
Inside the cluster
Server-to-server traffic across the training fabric — where collective-op latency and non-blocking bisection bandwidth determine throughput.
Across clusters
WAN traffic between sites for federated training, data staging, and DR — where deterministic latency and throughput guarantees matter.
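To make the first two planes concrete, here is a minimal sketch of the traffic pattern a training step generates: a single NCCL all-reduce launched across two 8-GPU servers (an assumed layout) moves data over NVLink/NVSwitch inside each chassis and over the IB/RoCE fabric between servers. NCCL selects the transport per hop automatically; the node count, bucket size, and launch flags below are illustrative.

```python
# Sketch: one all-reduce that exercises both the in-server and the
# in-cluster plane. Assumes 2 nodes x 8 GPUs, launched on every node via
# torchrun (--nnodes=2 --nproc-per-node=8 plus a rendezvous endpoint).
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL: NVLink inside a node,
    local_rank = int(os.environ["LOCAL_RANK"])   # IB/RoCE between nodes
    torch.cuda.set_device(local_rank)

    # One 1 GiB gradient bucket; the reduction crosses both planes.
    bucket = torch.ones(256 * 1024 * 1024, device="cuda")
    dist.all_reduce(bucket)                      # default op: sum over all ranks
    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```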
Reference: A100 server interconnect
What a single DGX-class node looks like under the hood — the baseline we design fabrics around.
Per-server components
- 2× CPU sockets
- 8× A100 GPUs
- 6× NVSwitch chips
- 4× PCIe Gen4 switch chips
- 8× InfiniBand compute NICs
- 2× InfiniBand storage NICs (BlueField-3 DPU)
Link bandwidths
| Link | Bandwidth |
|---|---|
| NVLink (A100, per GPU) | 600 GB/s |
| NVLink (A800, per GPU) | 400 GB/s |
| GPU ↔ NIC over PCIe Gen4 ×16 | 32 GB/s |
| InfiniBand HDR | 200 Gbps (25 GB/s) |
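These figures drive a simple per-GPU bottleneck check, sketched below. It only uses the numbers from the table and assumes one HDR NIC per GPU (the 8-NIC layout above); real throughput also depends on message size, collective algorithm, and congestion, so the ratio matters more than the exact values.

```python
# Back-of-the-envelope per-GPU bandwidth comparison, figures from the table.
nvlink_gbs = 600   # GB/s per A100 GPU over NVLink/NVSwitch (in-chassis)
pcie_gbs   = 32    # GB/s per GPU <-> NIC hop over PCIe Gen4 x16
ib_hdr_gbs = 25    # GB/s per 200 Gbps HDR InfiniBand NIC

# Cross-server traffic per GPU is capped by the slower of the PCIe hop
# and the NIC itself.
cross_server_gbs = min(pcie_gbs, ib_hdr_gbs)

print(f"In-chassis:   {nvlink_gbs} GB/s per GPU")
print(f"Cross-server: {cross_server_gbs} GB/s per GPU "
      f"({nvlink_gbs / cross_server_gbs:.0f}x less)")
```

The roughly 24:1 gap between in-chassis and cross-server bandwidth is exactly why fabric topology and collective tuning dominate scale-out efficiency.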
InfiniBand vs. RoCE
Both are viable at scale. We pick based on cluster size, budget envelope, and operational maturity.
InfiniBand
- Closed, vertically integrated architecture
- Collective-op performance edge of roughly 20% or more over RoCE
- Higher CapEx — component pricing materially above Ethernet
- Best fit: small-to-medium training clusters where every percent matters
RoCE
- Open Ethernet-based ecosystem with broad vendor choice
- Significantly lower cost per port at scale
- Fast-moving technology curve — 800G fabrics now shipping
- Best fit: mid-to-large training clusters with strong network ops teams
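From the training job's point of view, both fabrics sit behind the same NCCL transport, so switching between them is mostly an environment and fabric-tuning exercise rather than a code change. The sketch below shows the idea; the HCA names, interface name, and GID index are host-specific placeholders, not defaults.

```python
# Sketch: pointing an NCCL-based job at an IB or a RoCE fabric.
# mlx5_*, eth0, and the GID index are assumptions for illustration.
import os

FABRIC = "roce"   # or "infiniband"

common = {
    "NCCL_DEBUG": "INFO",             # log which transport NCCL selects
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",   # NICs to use (IB verbs or RoCE)
    "NCCL_SOCKET_IFNAME": "eth0",     # bootstrap / control-plane interface
}

fabric_env = {}
if FABRIC == "roce":
    # RoCE v2 runs over routable Ethernet; pin the matching GID entry.
    fabric_env["NCCL_IB_GID_INDEX"] = "3"   # commonly the RoCE v2 entry

os.environ.update({**common, **fabric_env})
# ...then launch the distributed training job as usual.
```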
What we deliver
End-to-end fabric engagement — from napkin sketch to production-ready cluster.
Planning a training cluster?
Share your GPU count, model size, and training targets — we'll come back within two business days with a fabric recommendation, BOM, and deployment timeline.
Other services
High-Performance Compute
Elite GPU horsepower for large-scale model training.
Inference Compute
Cost-efficient GPUs tuned for production inference.
Server Colocation
Host your own GPU servers in our Tier 3+ facilities.
GPU Repair & Maintenance
Keep your accelerators alive and under warranty.
Private Network
Dedicated point-to-point connectivity for secure workloads.
Managed Operations
24/7 NOC and on-site remote hands.
Hardware & Appliances
Ready-to-deploy GPU servers and turnkey appliances.