Supply Chain Concentration & NVIDIA Ecosystem Dominance
GPU infrastructure exhibits acute supply-chain concentration risk. NVIDIA controls three critical dependencies: GPU accelerator design, with manufacturing concentrated at TSMC for cutting-edge process nodes; the InfiniBand networking fabric and switches (via the 2020 Mellanox acquisition); and NVSwitch interconnect switching for intra-rack GPU communication.
No viable alternative stack exists for current-generation high-performance training clusters. AMD's MI300X offers competitive compute but an incompatible ecosystem (the CUDA migration tax, collective-communication library differences such as NCCL versus RCCL, and driver maturity deficits), creating 18-32 month adoption barriers.
Intel's Gaudi accelerator remains nascent, with minimal ecosystem adoption. This concentration yields NVIDIA pricing power: GPUs command an 18-28% premium over competitive benchmarks, and InfiniBand switching carries 35-52% gross margins versus 22-35% for standard Ethernet. The lock-in extends beyond hardware: CUDA developer velocity, software-stack maturity, and neural-network library optimisations create switching costs exceeding $2.4M per 1,000-GPU cluster. Customers adopting AMD or Intel require a 12-24 month runway before productivity recovers.
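To make the lock-in arithmetic concrete, the minimal sketch below estimates the hardware discount an alternative vendor would need to offer just to offset switching costs. The switching cost is from the text; the GPU street price, per-GPU revenue, and ramp productivity are illustrative assumptions, not sourced figures.

```python
# Sketch: minimum hardware discount an alternative accelerator vendor must
# offer to offset ecosystem switching costs. Only SWITCHING_COST comes from
# the text; all other parameters are illustrative assumptions.

GPUS_PER_CLUSTER = 1_000
SWITCHING_COST = 2_400_000          # $2.4M per 1,000-GPU cluster (from text)
NVIDIA_GPU_PRICE = 35_000           # assumed street price per GPU, USD
RAMP_MONTHS = 18                    # within the 12-24 month runway in the text
MONTHLY_VALUE_PER_GPU = 1_500       # assumed revenue per GPU-month at full productivity
RAMP_PRODUCTIVITY = 0.6             # assumed average productivity during the ramp

# Output lost while the software stack matures on the new platform.
ramp_loss = (1 - RAMP_PRODUCTIVITY) * RAMP_MONTHS * MONTHLY_VALUE_PER_GPU * GPUS_PER_CLUSTER

total_switching_cost = SWITCHING_COST + ramp_loss
nvidia_cluster_price = NVIDIA_GPU_PRICE * GPUS_PER_CLUSTER

# Discount (fraction of the NVIDIA cluster price) that just breaks even.
breakeven_discount = total_switching_cost / nvidia_cluster_price
print(f"Ramp productivity loss: ${ramp_loss:,.0f}")
print(f"Break-even hardware discount: {breakeven_discount:.1%}")
```

Under these assumptions the required discount comes out near 38%, well above the 18-28% premium NVIDIA actually charges; that gap is the economic core of the lock-in.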
Power Density as Primary Infrastructure Constraint
Power consumption emerged in 2024-2025 as the binding constraint on GPU infrastructure, ahead of compute density or networking topology. GB200 NVL72 configurations (72 GPUs per rack) consume 120 kilowatts per rack at full utilisation, roughly 60% more power per GPU than the H100 generation.
A 10,000-GPU cluster (139 racks at 72 GPUs per rack) draws roughly 17MW of IT load; at a typical PUE of 1.3-1.5, that is 22-25MW of peak facility demand, beyond many data-centre utility feeds. Power distribution losses accumulate to 8-12% across the conversion cascade, so a facility delivering 50MW must procure 56-57MW from the utility.
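The sizing arithmetic can be checked directly; rack power and the loss range are from the text, while the PUE value is an assumed midpoint.

```python
import math

# Sketch: facility power sizing for a GB200 NVL72 cluster. Rack power and
# the distribution-loss range are from the text; PUE is an assumption.

GPUS = 10_000
GPUS_PER_RACK = 72
RACK_POWER_KW = 120          # GB200 NVL72 at full utilisation (from text)
PUE = 1.35                   # assumed power usage effectiveness
DISTRIBUTION_LOSS = 0.10     # midpoint of the 8-12% cascade loss (from text)

racks = math.ceil(GPUS / GPUS_PER_RACK)              # 139 racks
it_load_mw = racks * RACK_POWER_KW / 1_000           # ~16.7 MW of IT load
facility_peak_mw = it_load_mw * PUE                  # cooling + overheads
utility_feed_mw = facility_peak_mw / (1 - DISTRIBUTION_LOSS)

print(f"Racks: {racks}")
print(f"IT load: {it_load_mw:.1f} MW")
print(f"Facility peak: {facility_peak_mw:.1f} MW")
print(f"Utility procurement: {utility_feed_mw:.1f} MW")
```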
Peak power pricing in North America averages $1,200-$2,400 per megawatt per year; adding peak-shaving capability costs $180,000-$340,000 per megawatt in capex. Renewable energy sourcing (solar, wind) adds 8-15% to facility capex but locks operators into long-duration power purchase agreements (typically $35-$55/MWh, versus grid peak pricing of $120-$180/MWh). Cooling infrastructure must reject 90-95% of compute power as waste heat, so thermal rejection capacity must nearly match the facility's IT load, constrained by ambient water availability and climate zone.
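To see why operators accept the PPA lock-in despite the capex adder, a simple annual energy-spend comparison helps. Only the $/MWh rates come from the text; the facility size and load factor are assumptions, and grid cost is shown at peak rates purely for contrast (blended rates would be lower).

```python
# Sketch: annual energy cost for a 25MW facility under a renewable PPA
# versus grid peak pricing. Prices from the text; facility size and load
# factor are assumptions; real grid bills blend peak and off-peak rates.

FACILITY_MW = 25.0
LOAD_FACTOR = 0.80                  # assumed average utilisation of peak demand
HOURS_PER_YEAR = 8_760

PPA_PRICE = 45.0                    # $/MWh, midpoint of $35-$55 (from text)
GRID_PEAK_PRICE = 150.0             # $/MWh, midpoint of $120-$180 (from text)

annual_mwh = FACILITY_MW * LOAD_FACTOR * HOURS_PER_YEAR
ppa_cost = annual_mwh * PPA_PRICE
grid_cost = annual_mwh * GRID_PEAK_PRICE

print(f"Annual consumption: {annual_mwh:,.0f} MWh")
print(f"PPA cost:  ${ppa_cost/1e6:.1f}M/yr")
print(f"Grid cost: ${grid_cost/1e6:.1f}M/yr")
print(f"Savings:   ${(grid_cost - ppa_cost)/1e6:.1f}M/yr")
```

At these illustrative rates the PPA saves roughly $18M a year, which would repay an 8-15% capex adder on a $300M-$500M facility within a few years.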
Networking Topology: Fat-Tree vs Rail-Optimised Architectures
The choice of networking topology materially drives a GPU cluster's capex, opex, and application performance. Traditional fat-tree topology (a spine-leaf architecture with redundant paths) provides full-mesh connectivity, enabling any-to-any communication.
For a 32-rack cluster, fat-tree requires 8 leaf switches ($124K-$156K each) and 4 spine switches ($156K-$204K each), plus 2,048+ direct-attach cables. Total networking capex: $1.6M-$2.1M for the compute fabric alone.
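Tallying the quoted component counts reproduces that range; only the cable unit price below is an assumption.

```python
# Sketch: fat-tree fabric capex for a 32-rack cluster, using the text's
# switch counts and unit prices. Cable unit price is an assumption.

LEAF_COUNT, SPINE_COUNT, CABLE_COUNT = 8, 4, 2_048
LEAF_PRICE = (124_000, 156_000)     # $ per leaf switch (from text)
SPINE_PRICE = (156_000, 204_000)    # $ per spine switch (from text)
CABLE_PRICE = 150                   # assumed $ per direct-attach cable

switch_low = LEAF_COUNT * LEAF_PRICE[0] + SPINE_COUNT * SPINE_PRICE[0]
switch_high = LEAF_COUNT * LEAF_PRICE[1] + SPINE_COUNT * SPINE_PRICE[1]
cable_cost = CABLE_COUNT * CABLE_PRICE

print(f"Switches: ${switch_low/1e6:.2f}M - ${switch_high/1e6:.2f}M")  # ~$1.6M-$2.1M
print(f"Cables (assumed pricing): ${cable_cost/1e6:.2f}M")
```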
Rail-optimised topologies (emerging in late 2024) trade latency variance for lower capex, optimising for training clusters where communication follows predictable all-reduce patterns. Rail-optimised implementations reduce switch count by 40-55%, dropping networking capex to $720K-$990K for an equivalent cluster. The trade-off: rail-optimised designs perform poorly for applications with random remote-access patterns or bursty all-to-all communication. The decision locks in the architecture for the 5+ year life of the capex.
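Applying the quoted 40-55% switch reduction to the fat-tree baseline brackets the rail-optimised range; this sketch assumes the saving applies uniformly across the fabric, which the text does not specify.

```python
# Sketch: rail-optimised fabric capex derived from the fat-tree baseline,
# assuming the 40-55% switch-count reduction (from text) scales capex
# uniformly across the fabric.

FAT_TREE_CAPEX = (1_600_000, 2_100_000)   # from text
REDUCTION = (0.40, 0.55)

rail_low = FAT_TREE_CAPEX[0] * (1 - REDUCTION[1])   # best case: 55% fewer switches
rail_high = FAT_TREE_CAPEX[1] * (1 - REDUCTION[0])  # worst case: 40% fewer

print(f"Rail-optimised fabric capex: ${rail_low/1e6:.2f}M - ${rail_high/1e6:.2f}M")
```

The low end lands exactly on the text's $720K; the high end depends on which ends of the two ranges pair up, so the quoted $990K ceiling implies the deeper reductions apply to the pricier builds.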
Build vs Lease: Facility Capex & Operational Control
GPU infrastructure operators face a binary strategic choice: build a proprietary data centre (greenfield) or lease colocation capacity. Greenfield construction of a 50MW facility costs $300M-$500M in total capex, including land acquisition, permitting, facility construction, and infrastructure systems.
This capex is amortised over 10-12 years at an 8-10% WACC, adding a $32M-$54M annual depreciation burden. Greenfield builds return capital only at 70%+ utilisation and stable pricing, and reaching target utilisation takes an 18-36 month ramp given customer churn.
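One way to sanity-check the annual burden is a standard capital-recovery (annuity) calculation at the quoted WACC; the result depends on the amortisation convention, and the text's $32M-$54M range sits between plain straight-line depreciation and the full annuity figures below.

```python
# Sketch: annualised cost of greenfield capex via a standard
# capital-recovery (annuity) factor. Capex, term, and WACC from the text.

def capital_recovery(capex: float, wacc: float, years: int) -> float:
    """Annual payment that amortises `capex` over `years` at rate `wacc`."""
    return capex * wacc / (1 - (1 + wacc) ** -years)

for capex, wacc, years in [(300e6, 0.08, 12), (500e6, 0.10, 10)]:
    annual = capital_recovery(capex, wacc, years)
    print(f"${capex/1e6:.0f}M @ {wacc:.0%} over {years}y -> ${annual/1e6:.1f}M/yr")
```

This prints roughly $40M and $81M per year; folding in the cost of capital pushes the annual burden above the straight-line view, which is why thin utilisation is so punishing for greenfield builds.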
A colocation lease strategy caps upfront capex at the GPU and networking equipment ($10.8M-$13.2M per 16-rack cluster) but surrenders cost control: colocation runs $450-$650/kW annually versus a modelled $120-$180/kW all-in cost at owned property. Over a 10-year lifecycle, colocation and owned economics break even at 55% utilisation, assuming 4% annual pricing escalation. Operators with captive workloads (15%+ of revenue from proprietary use) favour greenfield; those serving a volatile customer base favour colocation's flexibility.
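A minimal sketch of the break-even mechanism: ownership costs are borne on the full built capacity whether or not it is sold, while colocation is paid only on utilised kilowatts. The colocation rate and escalation are from the text; the owned all-in rate is an assumed calibration that includes capital recovery (the text's $120-$180/kW reads as operating cost only), chosen so the mechanism is visible.

```python
# Sketch: utilisation break-even between colocation and ownership over a
# 10-year lifecycle. Ownership cost is fixed per kW of built capacity;
# colocation is paid only on utilised kW and escalates 4%/yr (from text).
# The owned all-in rate is an assumed calibration, not a sourced figure.

YEARS = 10
COLO_RATE_Y1 = 450.0        # $/kW-yr, low end of $450-$650 (from text)
ESCALATION = 0.04           # 4% annual pricing escalation (from text)
OWNED_ALL_IN = 300.0        # assumed $/kW-yr incl. capital recovery

colo_per_utilised_kw = sum(COLO_RATE_Y1 * (1 + ESCALATION) ** t for t in range(YEARS))
owned_per_capacity_kw = OWNED_ALL_IN * YEARS

breakeven_utilisation = owned_per_capacity_kw / colo_per_utilised_kw
print(f"10-yr colo cost per utilised kW: ${colo_per_utilised_kw:,.0f}")
print(f"10-yr owned cost per built kW:   ${owned_per_capacity_kw:,.0f}")
print(f"Break-even utilisation: {breakeven_utilisation:.0%}")
```

With these inputs the crossover lands near 56%, close to the text's 55% figure; the point is the mechanism, not the exact rate, since the true break-even shifts with the owned cost basis and lease escalation.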
Key Takeaways
- NVIDIA controls the GPU, InfiniBand, and NVSwitch ecosystem; pricing runs 18-28% above competitive benchmarks, and switching to AMD or Intel costs more than $2.4M per 1,000-GPU cluster.
- GB200 NVL72 racks consume 120kW each; a 10,000-GPU cluster draws roughly 17MW of IT load (22-25MW peak facility demand), creating cooling and electrical-plant constraints.
- Fat-tree networking topology costs $1.6M-$2.1M in capex; rail-optimised is 40-55% cheaper but locks in the application workload mix for 5+ years.
- Greenfield capex runs $300M-$500M for a 50MW facility; a colocation lease breaks even at 55% utilisation over 10 years; operators with captive workloads favour building.