Network Latency
Network latency is the time delay for data to travel between two points in a network, measured in microseconds (µs) or milliseconds (ms). In GPU clusters, latency sets a floor on the time a collective operation takes, a floor that additional bandwidth cannot remove. InfiniBand achieves sub-microsecond port-to-port latency (approximately 0.6 µs), while Ethernet-based solutions typically achieve 2-5 µs. For training workloads, latency matters most for small synchronisation messages; for inference, latency directly affects response time and user experience.
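The split between latency-bound small messages and bandwidth-bound large ones can be sketched with the standard alpha-beta cost model for a ring all-reduce. The parameter values below (0.6 µs per-step latency, 400 Gb/s links) are illustrative assumptions, not measurements of any particular fabric.

```python
# Sketch of the alpha-beta cost model for a ring all-reduce: the alpha
# (latency) term dominates for small messages, the beta (bandwidth) term
# for large ones. All parameter values here are illustrative assumptions.

def ring_allreduce_time_us(n_gpus, message_bytes, alpha_us, link_gbps):
    """Rough time estimate for a ring all-reduce, in microseconds."""
    steps = 2 * (n_gpus - 1)                 # reduce-scatter + all-gather phases
    beta_us_per_byte = 8 / (link_gbps * 1e9) * 1e6  # transmit time per byte
    latency_term = steps * alpha_us
    bandwidth_term = steps * (message_bytes / n_gpus) * beta_us_per_byte
    return latency_term + bandwidth_term

# A 1 KiB synchronisation message across 8 GPUs is latency-bound
# (almost all of the time is the per-step alpha cost):
small = ring_allreduce_time_us(8, 1024, 0.6, 400)
# A 1 GiB gradient all-reduce on the same ring is bandwidth-bound:
large = ring_allreduce_time_us(8, 2**30, 0.6, 400)
print(round(small, 2), round(large))
```

This is why shaving a few microseconds of fabric latency matters for the many small synchronisation messages in training, while link bandwidth governs the large gradient exchanges.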
Latency has three components: propagation delay (speed of light through the medium), serialisation delay (time to transmit the message at line rate), and switching delay (processing time at each network hop). InfiniBand's advantage comes from hardware-level RDMA support and cut-through switching, which minimise switching delay. For inference endpoints, the relevant latency metric is end-to-end — from API request to first generated token — which includes model loading, preprocessing, and generation time in addition to network transit.
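The three components above can be summed in a back-of-the-envelope estimate. The scenario in the example (50 m of fibre, a 1 KiB message, three 400 Gb/s hops at 0.6 µs each) is a hypothetical illustration, not a benchmark.

```python
# Sketch: decomposing one-way network latency into its three components
# (propagation, serialisation, switching). Figures are illustrative only.

def one_way_latency_us(distance_m, message_bytes, link_gbps, hops, per_hop_us):
    """Estimate one-way latency in microseconds from its three components."""
    # Propagation: light in fibre travels at roughly 2/3 c (~2e8 m/s).
    propagation = distance_m / 2e8 * 1e6
    # Serialisation: time to clock the message onto the wire at line rate.
    serialisation = message_bytes * 8 / (link_gbps * 1e9) * 1e6
    # Switching: per-hop processing delay summed over every hop.
    switching = hops * per_hop_us
    return propagation + serialisation + switching

# A 1 KiB message over 50 m of fibre, crossing three 400 Gb/s switch
# hops at an assumed 0.6 µs cut-through delay per hop:
print(round(one_way_latency_us(50, 1024, 400, 3, 0.6), 2))  # ≈ 2.07 µs
```

Note that at cluster scale the switching term dominates, which is why cut-through switching and hardware RDMA, rather than shorter cables, are where InfiniBand earns its latency advantage.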
Latency requirements differ significantly between training and inference workloads, which affects data centre location strategy. Training clusters can tolerate geographic distance between nodes within a site, while inference endpoints benefit from proximity to end users — a consideration we factor into deployment advisory.
This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.