Distributed Training
Distributed training is the practice of splitting a deep learning training workload across multiple GPUs, both within a node and across nodes, to reduce training time and to train models that exceed single-GPU memory capacity. It typically combines data parallelism (replicating the model on each GPU and feeding each replica a different batch of data), tensor parallelism (splitting individual layers across GPUs), and pipeline parallelism (splitting the model into sequential stages, each assigned to its own GPUs). The choice of parallelism strategy depends on model size, cluster topology, and interconnect bandwidth.
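The three parallelism degrees compose multiplicatively. A minimal sketch, assuming parameter count dominates per-GPU footprint and ignoring optimizer state and activations; the function name and the example degrees are illustrative, not drawn from any particular framework:

```python
def parallel_layout(total_params: int, dp: int, tp: int, pp: int):
    """Toy model of a combined parallelism layout.

    Tensor parallelism (tp) and pipeline parallelism (pp) shard the model,
    so each GPU holds roughly total_params / (tp * pp) parameters.
    Data parallelism (dp) replicates that shard, so the cluster needs
    dp * tp * pp GPUs in total.
    """
    total_gpus = dp * tp * pp
    params_per_gpu = total_params // (tp * pp)  # dp replicates, it does not shard
    return total_gpus, params_per_gpu

# A 70B-parameter model with dp=8, tp=4, pp=4 occupies 128 GPUs,
# each holding roughly 4.4B parameters before optimizer state.
gpus, shard = parallel_layout(70_000_000_000, dp=8, tp=4, pp=4)
```

The sketch shows why strategy choice tracks model size: raising tp and pp shrinks the per-GPU shard, while raising dp only adds replicas.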
Communication libraries such as NVIDIA NCCL, and training frameworks built on top of them such as DeepSpeed and Megatron-LM, handle the collective communication and synchronisation required for distributed training. Scaling efficiency, the ratio of actual speedup to theoretical linear speedup, is the key metric: a 1,000-GPU cluster with 90% scaling efficiency delivers the effective throughput of 900 GPUs. Efficiency depends on the communication-to-computation ratio, network bandwidth, software optimisation, and workload characteristics. Current state-of-the-art achieves 85-95% scaling efficiency for large transformer training on InfiniBand-connected clusters.
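The definition above reduces to one line of arithmetic. A minimal sketch, assuming throughput (e.g. samples or tokens per second) is the speedup measure; the function name and throughput figures are illustrative:

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Ratio of actual speedup to ideal linear speedup.

    throughput_n: measured training throughput on n_gpus GPUs
    throughput_1: measured training throughput on a single GPU
    """
    actual_speedup = throughput_n / throughput_1
    return actual_speedup / n_gpus  # 1.0 would be perfect linear scaling

# 1,000 GPUs delivering 900x single-GPU throughput -> 0.9, i.e. 90%.
eff = scaling_efficiency(throughput_n=45_000.0, throughput_1=50.0, n_gpus=1000)
```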
Scaling efficiency directly affects the economics of large training deployments. We use scaling efficiency benchmarks to validate operator performance claims and to model the effective cost per GPU-hour for different cluster configurations.
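One way that modelling can be sketched: divide the quoted price by scaling efficiency to get the price per hour of useful compute. A simplified illustration, assuming efficiency is the only loss factor; the function name and prices are hypothetical:

```python
def effective_cost_per_gpu_hour(list_price: float, efficiency: float) -> float:
    """Cost per hour of *useful* GPU compute under a given scaling efficiency."""
    return list_price / efficiency

# At a hypothetical $2.00/GPU-hour list price, 90% efficiency implies
# roughly $2.22 per effective GPU-hour; 80% implies $2.50.
cost_a = effective_cost_per_gpu_hour(2.00, 0.90)
cost_b = effective_cost_per_gpu_hour(2.00, 0.80)
```

The spread between the two figures is why a few points of scaling efficiency can matter more than a headline price difference when comparing cluster configurations.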
This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.