Study overview
This benchmark compares NVIDIA Multi-Process Service (MPS) GPU sharing
against dedicated GPU allocation for large language model inference. The
target workload is the Qwen3-4B-FP8 model, deployed on Google Kubernetes
Engine (GKE) with NVIDIA A100-80GB GPUs.
The problem: GPU underutilization in LLM inference
LLM inference workloads routinely leave GPU memory and compute idle. A typical scenario: a model with a ~15 GB serving footprint on an 80 GB GPU leaves more than 80% of the memory unused while still incurring the full GPU's power draw and cost. Multiplied across a fleet, that idle capacity adds up to millions of dollars and megawatt-hours of waste.
How MPS works
MPS lets multiple CUDA processes share one GPU through a common server, so the device is shared spatially rather than time-sliced: kernels from different processes execute concurrently on different Streaming Multiprocessors (SMs) instead of alternating on the whole GPU, which eliminates the context-switching overhead of time-sliced sharing.
Deployment configurations
- MPS shared: 60% SM allocation per process, 50% GPU memory per process, 2 processes per GPU (see the launch sketch after this list)
- Dedicated: 80% GPU memory utilization, 1 process per GPU
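As a concrete illustration of the shared configuration, here is a minimal launch sketch. It assumes a generic Python entrypoint (`serve_model.py` and its `--port` flag are hypothetical stand-ins for the inference server); the environment variables are NVIDIA's documented MPS controls, and on GKE the MPS daemon is normally managed by the GPU device plugin rather than started by hand:

```python
import os
import subprocess

env = os.environ.copy()

# Start the MPS control daemon (one per node). On GKE this is handled
# by the GPU device plugin when MPS sharing is enabled; it is shown
# here only to make the sketch self-contained.
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

procs = []
for i in range(2):  # two inference processes sharing GPU 0
    proc_env = env.copy()
    # Cap each client at 60% of the SMs. These are limits, not hard
    # partitions: 2 x 60% deliberately oversubscribes so either client
    # can absorb slack left by the other.
    proc_env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "60"
    # Cap pinned device memory at 40 GB on GPU 0 (50% of an A100-80GB).
    proc_env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = "0=40G"
    procs.append(subprocess.Popen(
        ["python", "serve_model.py", f"--port={8000 + i}"],
        env=proc_env,
    ))

for p in procs:
    p.wait()
```

The dedicated baseline is the same launch without the MPS variables: one process, whole GPU.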
Performance metrics
| Metric | Dedicated GPU | MPS Shared | Δ (vs. dedicated) |
|---|---|---|---|
| Output throughput | 1,557.1 tokens/s | 1,448.2 tokens/s | −7.0% |
| Request throughput | 6.17 req/s | 5.73 req/s | −7.1% |
| TTFT, time to first token (median) | 497.5 ms | 504.7 ms | +1.4% |
| TPOT, time per output token (median) | 52.3 ms | 62.5 ms | +19.5% |
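As a quick consistency check, the Δ column follows directly from the measured values, computed relative to the dedicated baseline:

```python
# Percentage change of each metric, MPS relative to dedicated.
metrics = {
    "output throughput (tok/s)": (1557.1, 1448.2),
    "request throughput (req/s)": (6.17, 5.73),
    "TTFT median (ms)": (497.5, 504.7),
    "TPOT median (ms)": (52.3, 62.5),
}
for name, (dedicated, mps) in metrics.items():
    print(f"{name}: {(mps - dedicated) / dedicated:+.1%}")
# output throughput: -7.0%, request throughput: -7.1%,
# TTFT: +1.4%, TPOT: +19.5%
```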
Cost-efficiency trade-off
Cost reduction: 50% (one shared GPU replaces two dedicated GPUs).
Performance loss: 7.0% output throughput.
Each percentage point of throughput given up buys roughly a 7% cost reduction (50 ÷ 7.0 ≈ 7.1).
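Another way to read the trade-off is relative cost per output token, which needs no absolute GPU price. This assumes the throughput figures above are per workload and that each shared workload is billed half a GPU, consistent with the one-vs-two-GPU framing; the benchmark does not state per-process accounting explicitly, so treat this as a sketch:

```python
# Relative cost per output token; cost scales as GPU_share / throughput.
dedicated_tok_s = 1557.1  # one workload on a full GPU
mps_tok_s = 1448.2        # one workload on half a shared GPU (assumed)

cost_ratio = (0.5 / mps_tok_s) / (1.0 / dedicated_tok_s)
print(f"MPS cost per token: {cost_ratio:.1%} of dedicated")
# -> 53.8%: roughly 46% cheaper per output token
```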
Latency insights
MPS overhead lands almost entirely in the decode phase (TPOT +19.5%) rather than in prefill (TTFT +1.4%), pointing to resource contention during iterative token generation rather than on the first-token path. For interactive UX, where TTFT dominates perceived responsiveness, the MPS penalty is close to negligible.
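For reference, TTFT and TPOT here follow the usual streaming definitions: TTFT is the wait for the first token (prefill), TPOT the average gap between subsequent tokens (decode). A minimal sketch of computing both from per-token arrival timestamps:

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT and TPOT (in the same time unit as the inputs).

    token_times: arrival timestamp of each streamed output token,
    in order; must be non-empty.
    """
    # Prefill cost: time until the first token arrives.
    ttft = token_times[0] - request_start
    # Decode cost: mean inter-token gap over the remaining tokens.
    # This is the phase where MPS contention shows up in this study.
    tpot = (
        (token_times[-1] - token_times[0]) / (len(token_times) - 1)
        if len(token_times) > 1
        else 0.0
    )
    return ttft, tpot
```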
Recommendations
- Cost-sensitive deployments and smaller models: default to MPS sharing.
- Mixed traffic: take a hybrid approach, with MPS for batch/async tiers and dedicated GPUs for latency-critical tiers.
- Performance-critical production: reserve dedicated GPUs and use MPS as a burst-capacity tier.