Training Cluster
A training cluster is a tightly coupled array of GPU nodes connected via high-bandwidth interconnects — typically InfiniBand — configured for distributed deep learning. Modern training clusters range from 64 GPUs (single-rack) to 100,000+ GPUs (hyperscale). The defining characteristic is that all nodes must communicate during training, making network bandwidth and topology as important as raw GPU compute power. Building and operating training clusters is capital-intensive: a 1,000-GPU H100 cluster costs approximately $30-40M in hardware alone, excluding facility, cooling, and networking infrastructure.
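The capex figure above can be sanity-checked with a back-of-envelope model. The sketch below is illustrative only: the $35,000 blended per-GPU figure is an assumed midpoint of the $30-40M range for 1,000 GPUs, bundling the GPU with its pro-rata share of the host server (CPUs, RAM, NVMe, NICs), not a market quote.

```python
def cluster_hardware_cost(num_gpus: int, cost_per_gpu_slot: float = 35_000.0) -> float:
    """Estimate hardware capex in USD.

    cost_per_gpu_slot is an assumed blended figure: the GPU plus its
    pro-rata share of the host server. Excludes facility, cooling,
    and networking, consistent with the hardware-only range above.
    """
    return num_gpus * cost_per_gpu_slot


total = cluster_hardware_cost(1_000)
print(f"1,000-GPU cluster hardware: ${total / 1e6:.0f}M")  # -> $35M, midpoint of $30-40M
```

Varying `cost_per_gpu_slot` across the plausible range recovers the $30-40M bracket; in practice networking and facility costs add materially on top.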
Training clusters use collective communication operations — primarily all-reduce — to synchronise gradients across GPUs during distributed training. The network must support simultaneous communication from all nodes without congestion. Fat-tree topologies provide full bisection bandwidth, but switch and cabling costs grow steeply with cluster size as additional switching tiers are added. Rail-optimised topologies reduce switch requirements but introduce communication bottlenecks for certain workload patterns. Storage architecture is equally critical: clusters need high-throughput parallel file systems (GPFS, Lustre, WekaFS) to feed training data at sufficient rates.
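The bandwidth demand of all-reduce can be quantified. In the standard ring algorithm, each GPU transmits roughly 2(N−1)/N times the gradient payload per step, which gives a lower bound on communication time. The sketch below illustrates this; the 7B-parameter fp16 model and the 50 GB/s effective per-GPU bandwidth are assumed example inputs, not measurements.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    # Ring all-reduce: each GPU sends (N-1)/N of the payload in the
    # reduce-scatter phase and again in the all-gather phase.
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_time_s(payload_bytes: float, num_gpus: int, bandwidth_Bps: float) -> float:
    # Bandwidth-only lower bound; ignores latency and congestion.
    return ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus) / bandwidth_Bps


# Assumed example: 7B parameters, fp16 gradients (~14 GB payload),
# 1,024 GPUs, 50 GB/s effective per-GPU network bandwidth.
grad_bytes = 7e9 * 2
t = allreduce_time_s(grad_bytes, 1024, 50e9)
print(f"Per-step all-reduce lower bound: {t * 1000:.0f} ms")
```

Because the per-GPU volume approaches 2× the payload regardless of N, scaling out does not reduce each node's network load — which is why per-GPU bandwidth and congestion-free topology matter as much as aggregate compute.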
Training cluster economics are a core competency. We model fleet revenue across multiple GPU configurations, utilisation levels, and contract structures. Our work for consulting firms advising on GPU infrastructure acquisitions has involved benchmarking management assumptions for clusters against live market data.
This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.