Supply Chain Concentration & NVIDIA Ecosystem Dominance
GPU infrastructure exhibits acute supply-chain concentration risk. NVIDIA controls three critical dependencies: GPU accelerator design, with manufacturing concentrated at TSMC for cutting-edge process nodes; the InfiniBand networking fabric and switches (via the 2020 Mellanox acquisition); and NVSwitch interconnect switching for intra-rack GPU communication.
No viable alternative stack exists for current-generation high-performance training clusters. AMD's MI300X offers competitive compute but an incompatible ecosystem (the CUDA migration tax, collective-communication library differences such as NCCL versus RCCL, and driver maturity deficits), creating 18-32 month adoption barriers.
Intel's Gaudi accelerator remains nascent, with minimal ecosystem adoption. This concentration yields NVIDIA pricing power: GPUs command an 18-28% premium over competitive benchmarks, and InfiniBand switching carries 35-52% gross margins versus 22-35% for standard Ethernet. The lock-in extends beyond hardware: CUDA developer velocity, software-stack maturity, and neural-network library optimisations create switching costs exceeding $2.4M per 1,000-GPU cluster. Customers adopting AMD or Intel require a 12-24 month runway before productivity recovers.
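To make the lock-in arithmetic concrete, the minimal sketch below estimates the hardware discount an alternative vendor would need to offer just to offset switching costs. The switching cost is from the text; the GPU street price, per-GPU revenue, and ramp productivity are illustrative assumptions, not sourced figures.

```python
# Sketch: minimum hardware discount an alternative accelerator vendor must
# offer to offset ecosystem switching costs. Only SWITCHING_COST comes from
# the text; all other parameters are illustrative assumptions.

GPUS_PER_CLUSTER = 1_000
SWITCHING_COST = 2_400_000          # $2.4M per 1,000-GPU cluster (from text)
NVIDIA_GPU_PRICE = 35_000           # assumed street price per GPU, USD
RAMP_MONTHS = 18                    # within the 12-24 month runway in the text
MONTHLY_VALUE_PER_GPU = 1_500       # assumed revenue per GPU-month at full productivity
RAMP_PRODUCTIVITY = 0.6             # assumed average productivity during the ramp

# Output lost while the software stack matures on the new platform.
ramp_loss = (1 - RAMP_PRODUCTIVITY) * RAMP_MONTHS * MONTHLY_VALUE_PER_GPU * GPUS_PER_CLUSTER

total_switching_cost = SWITCHING_COST + ramp_loss
nvidia_cluster_price = NVIDIA_GPU_PRICE * GPUS_PER_CLUSTER

# Discount (fraction of the NVIDIA cluster price) that just breaks even.
breakeven_discount = total_switching_cost / nvidia_cluster_price
print(f"Ramp productivity loss: ${ramp_loss:,.0f}")
print(f"Break-even hardware discount: {breakeven_discount:.1%}")
```

Under these assumptions the required discount comes out near 38%, well above the 18-28% premium NVIDIA actually charges; that gap is the economic core of the lock-in.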
Power Density as Primary Infrastructure Constraint
Power consumption emerged in 2024-2025 as the binding constraint on GPU infrastructure, ahead of compute density or networking topology. GB200 NVL72 configurations (72 GPUs per rack) consume 120 kilowatts per rack at full utilisation, roughly 60% more power per GPU than the H100 generation.
A 10,000-GPU cluster (139 racks at 72 GPUs per rack) draws roughly 17MW of IT load; at a typical PUE of 1.3-1.5, that is 22-25MW of peak facility demand, beyond many data-centre utility feeds. Power distribution losses accumulate to 8-12% across the conversion cascade, so a facility delivering 50MW must procure 56-57MW from the utility.
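The sizing arithmetic can be checked directly; rack power and the loss range are from the text, while the PUE value is an assumed midpoint.

```python
import math

# Sketch: facility power sizing for a GB200 NVL72 cluster. Rack power and
# the distribution-loss range are from the text; PUE is an assumption.

GPUS = 10_000
GPUS_PER_RACK = 72
RACK_POWER_KW = 120          # GB200 NVL72 at full utilisation (from text)
PUE = 1.35                   # assumed power usage effectiveness
DISTRIBUTION_LOSS = 0.10     # midpoint of the 8-12% cascade loss (from text)

racks = math.ceil(GPUS / GPUS_PER_RACK)              # 139 racks
it_load_mw = racks * RACK_POWER_KW / 1_000           # ~16.7 MW of IT load
facility_peak_mw = it_load_mw * PUE                  # cooling + overheads
utility_feed_mw = facility_peak_mw / (1 - DISTRIBUTION_LOSS)

print(f"Racks: {racks}")
print(f"IT load: {it_load_mw:.1f} MW")
print(f"Facility peak: {facility_peak_mw:.1f} MW")
print(f"Utility procurement: {utility_feed_mw:.1f} MW")
```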
Peak power pricing in North America averages $1,200-$2,400 per megawatt per year; adding peak-shaving capability costs $180,000-$340,000 per megawatt in capex. Renewable energy sourcing (solar, wind) adds 8-15% to facility capex but locks operators into long-duration power purchase agreements (typically $35-$55/MWh, versus grid peak pricing of $120-$180/MWh). Cooling infrastructure must reject 90-95% of compute power as waste heat, so thermal rejection capacity must nearly match the facility's IT load, constrained by ambient water availability and climate zone.
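To see why operators accept the PPA lock-in despite the capex adder, a simple annual energy-spend comparison helps. Only the $/MWh rates come from the text; the facility size and load factor are assumptions, and grid cost is shown at peak rates purely for contrast (blended rates would be lower).

```python
# Sketch: annual energy cost for a 25MW facility under a renewable PPA
# versus grid peak pricing. Prices from the text; facility size and load
# factor are assumptions; real grid bills blend peak and off-peak rates.

FACILITY_MW = 25.0
LOAD_FACTOR = 0.80                  # assumed average utilisation of peak demand
HOURS_PER_YEAR = 8_760

PPA_PRICE = 45.0                    # $/MWh, midpoint of $35-$55 (from text)
GRID_PEAK_PRICE = 150.0             # $/MWh, midpoint of $120-$180 (from text)

annual_mwh = FACILITY_MW * LOAD_FACTOR * HOURS_PER_YEAR
ppa_cost = annual_mwh * PPA_PRICE
grid_cost = annual_mwh * GRID_PEAK_PRICE

print(f"Annual consumption: {annual_mwh:,.0f} MWh")
print(f"PPA cost:  ${ppa_cost/1e6:.1f}M/yr")
print(f"Grid cost: ${grid_cost/1e6:.1f}M/yr")
print(f"Savings:   ${(grid_cost - ppa_cost)/1e6:.1f}M/yr")
```

At these illustrative rates the PPA saves roughly $18M a year, which would repay an 8-15% capex adder on a $300M-$500M facility within a few years.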
Networking Topology: Fat-Tree vs Rail-Optimised Architectures
The choice of networking topology materially drives a GPU cluster's capex, opex, and application performance. Traditional fat-tree topology (a spine-leaf architecture with redundant paths) provides full-mesh connectivity, enabling any-to-any communication.
For a 32-rack cluster, fat-tree requires 8 leaf switches ($124K-$156K each) and 4 spine switches ($156K-$204K each), plus 2,048+ direct-attach cables. Total networking capex: $1.6M-$2.1M for the compute fabric alone.
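Tallying the quoted component counts reproduces that range; only the cable unit price below is an assumption.

```python
# Sketch: fat-tree fabric capex for a 32-rack cluster, using the text's
# switch counts and unit prices. Cable unit price is an assumption.

LEAF_COUNT, SPINE_COUNT, CABLE_COUNT = 8, 4, 2_048
LEAF_PRICE = (124_000, 156_000)     # $ per leaf switch (from text)
SPINE_PRICE = (156_000, 204_000)    # $ per spine switch (from text)
CABLE_PRICE = 150                   # assumed $ per direct-attach cable

switch_low = LEAF_COUNT * LEAF_PRICE[0] + SPINE_COUNT * SPINE_PRICE[0]
switch_high = LEAF_COUNT * LEAF_PRICE[1] + SPINE_COUNT * SPINE_PRICE[1]
cable_cost = CABLE_COUNT * CABLE_PRICE

print(f"Switches: ${switch_low/1e6:.2f}M - ${switch_high/1e6:.2f}M")  # ~$1.6M-$2.1M
print(f"Cables (assumed pricing): ${cable_cost/1e6:.2f}M")
```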
Rail-optimised topologies (emerging in late 2024) trade latency variance for lower capex, optimising for training clusters where communication follows predictable all-reduce patterns. Rail-optimised implementations reduce switch count by 40-55%, dropping networking capex to $720K-$990K for an equivalent cluster. The trade-off: rail-optimised designs perform poorly for applications with random remote-access patterns or bursty all-to-all communication. The decision locks in the architecture for the 5+ year life of the capex.
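Applying the quoted 40-55% switch reduction to the fat-tree baseline brackets the rail-optimised range; this sketch assumes the saving applies uniformly across the fabric, which the text does not specify.

```python
# Sketch: rail-optimised fabric capex derived from the fat-tree baseline,
# assuming the 40-55% switch-count reduction (from text) scales capex
# uniformly across the fabric.

FAT_TREE_CAPEX = (1_600_000, 2_100_000)   # from text
REDUCTION = (0.40, 0.55)

rail_low = FAT_TREE_CAPEX[0] * (1 - REDUCTION[1])   # best case: 55% fewer switches
rail_high = FAT_TREE_CAPEX[1] * (1 - REDUCTION[0])  # worst case: 40% fewer

print(f"Rail-optimised fabric capex: ${rail_low/1e6:.2f}M - ${rail_high/1e6:.2f}M")
```

The low end lands exactly on the text's $720K; the high end depends on which ends of the two ranges pair up, so the quoted $990K ceiling implies the deeper reductions apply to the pricier builds.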
Build vs Lease: Facility Capex & Operational Control
GPU infrastructure operators face a binary strategic choice: build a proprietary data centre (greenfield) or lease colocation capacity. Greenfield construction of a 50MW facility costs $300M-$500M in total capex, including land acquisition, permitting, facility construction, and infrastructure systems.
This capex is amortised over 10-12 years at an 8-10% WACC, adding a $32M-$54M annual depreciation burden. Greenfield builds return capital only at 70%+ utilisation and stable pricing, and reaching target utilisation takes an 18-36 month ramp given customer churn.
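One way to sanity-check the annual burden is a standard capital-recovery (annuity) calculation at the quoted WACC; the result depends on the amortisation convention, and the text's $32M-$54M range sits between plain straight-line depreciation and the full annuity figures below.

```python
# Sketch: annualised cost of greenfield capex via a standard
# capital-recovery (annuity) factor. Capex, term, and WACC from the text.

def capital_recovery(capex: float, wacc: float, years: int) -> float:
    """Annual payment that amortises `capex` over `years` at rate `wacc`."""
    return capex * wacc / (1 - (1 + wacc) ** -years)

for capex, wacc, years in [(300e6, 0.08, 12), (500e6, 0.10, 10)]:
    annual = capital_recovery(capex, wacc, years)
    print(f"${capex/1e6:.0f}M @ {wacc:.0%} over {years}y -> ${annual/1e6:.1f}M/yr")
```

This prints roughly $40M and $81M per year; folding in the cost of capital pushes the annual burden above the straight-line view, which is why thin utilisation is so punishing for greenfield builds.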
A colocation lease strategy caps upfront capex at the GPU and networking equipment ($10.8M-$13.2M per 16-rack cluster) but surrenders cost control: colocation runs $450-$650/kW annually versus a modelled $120-$180/kW all-in cost at owned property. Over a 10-year lifecycle, colocation and owned economics break even at 55% utilisation, assuming 4% annual pricing escalation. Operators with captive workloads (15%+ of revenue from proprietary use) favour greenfield; those serving a volatile customer base favour colocation's flexibility.
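A minimal sketch of the break-even mechanism: ownership costs are borne on the full built capacity whether or not it is sold, while colocation is paid only on utilised kilowatts. The colocation rate and escalation are from the text; the owned all-in rate is an assumed calibration that includes capital recovery (the text's $120-$180/kW reads as operating cost only), chosen so the mechanism is visible.

```python
# Sketch: utilisation break-even between colocation and ownership over a
# 10-year lifecycle. Ownership cost is fixed per kW of built capacity;
# colocation is paid only on utilised kW and escalates 4%/yr (from text).
# The owned all-in rate is an assumed calibration, not a sourced figure.

YEARS = 10
COLO_RATE_Y1 = 450.0        # $/kW-yr, low end of $450-$650 (from text)
ESCALATION = 0.04           # 4% annual pricing escalation (from text)
OWNED_ALL_IN = 300.0        # assumed $/kW-yr incl. capital recovery

colo_per_utilised_kw = sum(COLO_RATE_Y1 * (1 + ESCALATION) ** t for t in range(YEARS))
owned_per_capacity_kw = OWNED_ALL_IN * YEARS

breakeven_utilisation = owned_per_capacity_kw / colo_per_utilised_kw
print(f"10-yr colo cost per utilised kW: ${colo_per_utilised_kw:,.0f}")
print(f"10-yr owned cost per built kW:   ${owned_per_capacity_kw:,.0f}")
print(f"Break-even utilisation: {breakeven_utilisation:.0%}")
```

With these inputs the crossover lands near 56%, close to the text's 55% figure; the point is the mechanism, not the exact rate, since the true break-even shifts with the owned cost basis and lease escalation.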
Key Takeaways
- NVIDIA controls the GPU, InfiniBand, and NVSwitch ecosystem; pricing runs 18-28% above competitive benchmarks, and switching to AMD or Intel costs more than $2.4M per 1,000-GPU cluster.
- GB200 NVL72 racks consume 120kW each; a 10,000-GPU cluster draws roughly 17MW of IT load (22-25MW peak facility demand), creating cooling and electrical-plant constraints.
- Fat-tree networking topology costs $1.6M-$2.1M in capex; rail-optimised is 40-55% cheaper but locks in the application workload mix for 5+ years.
- Greenfield capex runs $300M-$500M for a 50MW facility; a colocation lease breaks even at 55% utilisation over 10 years; operators with captive workloads favour building.