Research

Single-Tenant vs Multi-Tenant GPU Infrastructure

Why the real margin game is multi-tenant inference optimization

[01]

The Single-Tenant Training Reality

Single-tenant training is your baseline offering. A customer buys 8 A100s or 8 B200s, trains their model for weeks or months, pays a flat or hourly rate, then leaves. Utilization across the fleet runs 60-75% because training workloads are bursty.

Not every customer starts or stops at convenient times, so idle gaps between leases are unavoidable. Margins compress because customers shop on price and have clear alternatives: hyperscalers, in-house clusters, and open-source tooling. The market treats single-tenant training as a commodity.

Single-tenant training is essential nonetheless. It attracts customers, builds relationships, and creates data about workload patterns. The real margin expansion happens post-training.

[02]

Multi-Tenant Inference: The Margin Multiplier

Multi-tenant inference inverts the economics. One GPU serves ten different customers' inference requests (fine-tuned models, LoRA adapters, different tensor-parallel configs) through intelligent batching and scheduling.
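
A back-of-envelope VRAM check suggests why this tenant density is plausible. The numbers below are illustrative assumptions (a 192 GB B200, one shared 7B base model in FP16, roughly 100 MB per LoRA adapter, and a fixed KV-cache budget), not measurements:

```python
# Back-of-envelope VRAM math (illustrative; assumes a 192 GB B200,
# 2 bytes/param for FP16 weights, ~100 MB per LoRA adapter).
hbm_gb = 192
base_7b_fp16_gb = 7e9 * 2 / 1e9        # ~14 GB shared base model
adapter_mb = 100                        # assumed per-tenant LoRA adapter size
kv_cache_gb = 60                        # assumed budget reserved for KV cache

tenants = 10
used_gb = base_7b_fp16_gb + tenants * adapter_mb / 1e3 + kv_cache_gb
print(f"VRAM used for {tenants} LoRA tenants: {used_gb:.0f} GB of {hbm_gb} GB")
```

Because the base model and KV-cache budget are shared, each additional tenant costs only an adapter's worth of memory, which is why ten customers can fit where one full model per customer could not.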

A B200 generates $2,000/month as a single-tenant training lease. The same B200 running vLLM with dynamic batching generates $8,000-12,000/month in inference revenue across 6-8 concurrent customers.
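
The implied revenue multiple can be sanity-checked with the article's own figures; the per-customer rate below is an assumed midpoint chosen to land inside the quoted $8,000-12,000 range:

```python
# Revenue-per-GPU comparison using the figures from the text.
single_tenant_monthly = 2_000          # flat training lease, $/month

concurrent_customers = 7               # midpoint of the 6-8 range
revenue_per_customer = 1_500           # assumed $/month per inference tenant
multi_tenant_monthly = concurrent_customers * revenue_per_customer

multiple = multi_tenant_monthly / single_tenant_monthly
print(f"multi-tenant revenue: ${multi_tenant_monthly:,}/month")
print(f"revenue multiple vs training lease: {multiple:.2f}x")
```

The result lands in the 4-6x range cited in the takeaways below.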

Gross margin expands because you amortize capex across more revenue. This requires orchestration (routing requests to available capacity), batching (combining requests to maximize throughput), memory management (fitting multiple models in GPU VRAM through quantization and paging), and isolation (ensuring one customer's latency spike doesn't violate another's SLA). Software, not hardware, determines your success.
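
As a sketch of what orchestration and isolation mean in practice, here is a toy scheduler (hypothetical, not a real vLLM or Triton API) that round-robins across per-tenant queues and packs requests into a batch under a fixed token budget, so no single tenant can monopolize the GPU:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tenant: str
    tokens: int  # tokens this request will occupy in the batch

class BatchScheduler:
    """Toy multi-tenant batcher: round-robins across tenant queues and
    packs requests into one batch up to a token budget (crude isolation)."""

    def __init__(self, token_budget: int = 4096):
        self.token_budget = token_budget
        self.queues: dict[str, deque[Request]] = {}

    def submit(self, req: Request) -> None:
        self.queues.setdefault(req.tenant, deque()).append(req)

    def next_batch(self) -> list[Request]:
        batch, used = [], 0
        progressed = True
        while progressed:  # keep cycling tenants until nothing fits
            progressed = False
            for queue in self.queues.values():
                if queue and used + queue[0].tokens <= self.token_budget:
                    req = queue.popleft()
                    batch.append(req)
                    used += req.tokens
                    progressed = True
        return batch
```

Because admission alternates between tenants, a customer with a deep queue cannot starve one with a shallow queue; production schedulers layer priorities, preemption, and SLA-aware deadlines on the same idea.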

[03]

The Software Stack Requirement

vLLM's token-level batching, TensorRT-LLM's quantization and kernel fusion, and Triton Inference Server's multi-model management are not optional. They are operational necessities for profitable multi-tenant inference.

A naive inference server (a load balancer in front of requests queued per GPU) achieves 20-30% utilization in multi-tenant scenarios because whole batches wait for the slowest request. Modern stacks achieve 70-85% utilization because they interleave tokens from different users and preempt low-priority requests. The operators who win multi-tenant inference are those who invest in software faster than competitors, not those with access to cheaper GPUs.
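
The utilization gap can be illustrated with a toy simulation. Static batching holds the GPU until the slowest request in a batch finishes; continuous (token-level) batching refills a finished request's slot on the next step. The workload mix below is an illustrative assumption, not a benchmark:

```python
def static_batch_utilization(lengths, batch_size):
    """Naive serving: each batch occupies the GPU until its longest
    request finishes, so short requests leave their slots idle."""
    busy = capacity = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        busy += sum(batch)                   # useful token-steps
        capacity += max(batch) * batch_size  # slot-steps reserved
    return busy / capacity

def continuous_batch_utilization(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled from
    the backlog on the next step, so slots idle only as the queue drains."""
    pending, slots = list(lengths), []
    busy = capacity = 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))     # refill free slots
        capacity += batch_size               # one decode step on all slots
        busy += len(slots)
        slots = [s - 1 for s in slots if s > 1]
    return busy / capacity

mixed = [10, 100] * 8  # short and long decode lengths interleaved
naive = static_batch_utilization(mixed, batch_size=4)
smart = continuous_batch_utilization(mixed, batch_size=4)
print(f"static: {naive:.0%}  continuous: {smart:.0%}")
```

Even in this crude model, static batching lands near the low end quoted above while continuous batching lands near the high end; real schedulers add chunked prefill and priority preemption on top.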

[04]

The Transition Strategy

Successful operators funnel single-tenant training customers into multi-tenant inference. The pitch: train your model here, then serve it at lower cost on shared infrastructure.

This creates a 'sticky' customer (training plus inference on one platform) and reduces churn: shared infrastructure lowers the customer's cost per token, and keeping both workloads on one platform raises switching costs.

Margins expand. The challenge: training customers must be convinced inference SLAs are sufficient for their production workload. This requires not just software capability but operational track record and SLA credibility.

Key Takeaways
01

Single-tenant training is table stakes and typically margin-compressed due to commoditization and hyperscaler competition

02

Multi-tenant inference can generate 4-6x revenue per GPU by amortizing capex across multiple customers and workloads

03

Margin expansion depends on software (vLLM, TensorRT-LLM, Triton) and orchestration, not hardware generation or cost

04

Utilization improves from 60-75% (training) to 70-85% (well-optimized inference) through intelligent scheduling and batching

05

Successful operators treat training as a customer acquisition channel for higher-margin inference products

Next Steps

This analysis is produced by Disintermediate, drawing on data from The GPU intelligence platform, which tracks 2,800+ companies across 72 categories and real-time GPU pricing from 70+ providers, and on advisory engagement experience across the GPU infrastructure value chain.