Network Topology
Network topology describes the physical and logical arrangement of interconnections between nodes in a GPU cluster. The topology determines the communication bandwidth, latency, and fault tolerance available to distributed workloads. Non-blocking fat-tree (Clos) topologies provide full bisection bandwidth — any node can communicate with any other node at maximum rate simultaneously — but require large numbers of expensive switches. Rail-optimised and dragonfly topologies reduce switch requirements but introduce performance trade-offs for certain communication patterns.
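The full-bisection property can be made concrete with the standard oversubscription ratio. The sketch below is illustrative (the function names and the 2:1 example figures are assumptions, not from this entry): a leaf switch with equal downlink and uplink capacity is non-blocking, and any oversubscription divides the worst-case bisection bandwidth proportionally.

```python
def oversubscription(downlinks: int, uplinks: int) -> float:
    """Leaf-switch oversubscription ratio; 1.0 means non-blocking."""
    return downlinks / uplinks

def bisection_bandwidth_gbps(nodes: int, link_gbps: float, oversub: float) -> float:
    """Approximate worst-case bisection bandwidth: half the nodes
    sending across the fabric midpoint, divided by oversubscription."""
    return (nodes / 2) * link_gbps / oversub

# A leaf with 32 server ports but only 16 uplinks is 2:1 oversubscribed,
# halving the bandwidth available when traffic must cross the spine.
ratio = oversubscription(32, 16)
```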
In a fat-tree topology, leaf switches connect to servers and spine switches interconnect the leaves, creating multiple equal-cost paths between any two nodes; a 3-tier design adds a core layer interconnecting the spines. A 3-tier fat tree for 1,000 nodes might require 40+ InfiniBand switches. Rail-optimised topologies exploit the structure of 8-GPU nodes: each GPU's InfiniBand adapter connects to a dedicated "rail" switch, reducing cross-rail traffic. This works well for training workloads that naturally align with rails but degrades for workloads requiring arbitrary all-to-all communication.
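As a rough sizing sketch, the textbook k-ary fat tree built from identical k-port switches has closed-form host and switch counts, and a single-tier rail-optimised fabric needs one switch group per GPU position. Both helpers below are illustrative assumptions (the function names, the 64-port default, and the single-tier rail model are not from this entry):

```python
def fat_tree_counts(k: int) -> dict:
    """Host and switch counts for a classic 3-tier k-ary fat tree
    built entirely from identical k-port switches."""
    assert k % 2 == 0, "port count must be even"
    return {
        "hosts": k ** 3 // 4,
        "edge_switches": k * k // 2,
        "agg_switches": k * k // 2,
        "core_switches": (k // 2) ** 2,
    }

def rail_switch_count(nodes: int, gpus_per_node: int = 8,
                      ports_per_switch: int = 64) -> int:
    """Switches for a single-tier rail-optimised fabric: one rail per
    GPU position, each rail sized to give every node one port."""
    switches_per_rail = -(-nodes // ports_per_switch)  # ceiling division
    return gpus_per_node * switches_per_rail
```

With 16-port switches, `fat_tree_counts(16)` yields 1,024 hosts from 320 switches, illustrating why full fat trees are switch-hungry, while a single-tier rail design for 128 eight-GPU nodes on 64-port switches needs only 16 rail switches (at the cost of constrained cross-rail paths).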
Network topology choices have significant cost and performance implications. We evaluate cluster designs in deployment advisory to ensure the chosen topology matches the target workload profile — an incorrect topology choice can cost millions in unnecessary switch hardware or deliver suboptimal training performance.
This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.