Study overview
This benchmark compares NVIDIA Multi-Process Service (MPS) GPU sharing
against dedicated GPU allocation for large language model inference. The
target workload is the Qwen3-4B-FP8 model, deployed on Google Kubernetes
Engine (GKE) with NVIDIA A100-80GB GPUs.
The problem: GPU underutilization in LLM inference
LLM inference workloads routinely leave GPU memory and compute idle. A typical scenario: a model with a ~15 GB serving footprint on an 80 GB GPU leaves more than 80% of the memory unused while still incurring the full GPU's power draw and cost. Multiplied across a fleet, that idle capacity adds up to millions of dollars and megawatt-hours of waste.
How MPS works
MPS lets multiple CUDA processes share one GPU through a common server, so the device is shared spatially rather than time-sliced: kernels from different processes execute concurrently on different Streaming Multiprocessors (SMs) instead of alternating on the whole GPU, which eliminates the context-switching overhead of time-sliced sharing.
Deployment configurations
- MPS shared: 60% SM allocation per process, 50% GPU memory per process, 2 processes per GPU (see the launch sketch after this list)
- Dedicated: 80% GPU memory utilization, 1 process per GPU
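As a concrete illustration of the shared configuration, here is a minimal launch sketch. It assumes a generic Python entrypoint (`serve_model.py` and its `--port` flag are hypothetical stand-ins for the inference server); the environment variables are NVIDIA's documented MPS controls, and on GKE the MPS daemon is normally managed by the GPU device plugin rather than started by hand:

```python
import os
import subprocess

env = os.environ.copy()

# Start the MPS control daemon (one per node). On GKE this is handled
# by the GPU device plugin when MPS sharing is enabled; it is shown
# here only to make the sketch self-contained.
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

procs = []
for i in range(2):  # two inference processes sharing GPU 0
    proc_env = env.copy()
    # Cap each client at 60% of the SMs. These are limits, not hard
    # partitions: 2 x 60% deliberately oversubscribes so either client
    # can absorb slack left by the other.
    proc_env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "60"
    # Cap pinned device memory at 40 GB on GPU 0 (50% of an A100-80GB).
    proc_env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = "0=40G"
    procs.append(subprocess.Popen(
        ["python", "serve_model.py", f"--port={8000 + i}"],
        env=proc_env,
    ))

for p in procs:
    p.wait()
```

The dedicated baseline is the same launch without the MPS variables: one process, whole GPU.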
Performance metrics
| Metric | Dedicated GPU | MPS Shared | Δ (vs. dedicated) |
|---|---|---|---|
| Output throughput | 1,557.1 tokens/s | 1,448.2 tokens/s | −7.0% |
| Request throughput | 6.17 req/s | 5.73 req/s | −7.1% |
| TTFT, time to first token (median) | 497.5 ms | 504.7 ms | +1.4% |
| TPOT, time per output token (median) | 52.3 ms | 62.5 ms | +19.5% |
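As a quick consistency check, the Δ column follows directly from the measured values, computed relative to the dedicated baseline:

```python
# Percentage change of each metric, MPS relative to dedicated.
metrics = {
    "output throughput (tok/s)": (1557.1, 1448.2),
    "request throughput (req/s)": (6.17, 5.73),
    "TTFT median (ms)": (497.5, 504.7),
    "TPOT median (ms)": (52.3, 62.5),
}
for name, (dedicated, mps) in metrics.items():
    print(f"{name}: {(mps - dedicated) / dedicated:+.1%}")
# output throughput: -7.0%, request throughput: -7.1%,
# TTFT: +1.4%, TPOT: +19.5%
```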
Cost-efficiency trade-off
Cost reduction: 50% (one shared GPU replaces two dedicated GPUs).
Performance loss: 7.0% output throughput.
Each percentage point of throughput given up buys roughly a 7% cost reduction (50 ÷ 7.0 ≈ 7.1).
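Another way to read the trade-off is relative cost per output token, which needs no absolute GPU price. This assumes the throughput figures above are per workload and that each shared workload is billed half a GPU, consistent with the one-vs-two-GPU framing; the benchmark does not state per-process accounting explicitly, so treat this as a sketch:

```python
# Relative cost per output token; cost scales as GPU_share / throughput.
dedicated_tok_s = 1557.1  # one workload on a full GPU
mps_tok_s = 1448.2        # one workload on half a shared GPU (assumed)

cost_ratio = (0.5 / mps_tok_s) / (1.0 / dedicated_tok_s)
print(f"MPS cost per token: {cost_ratio:.1%} of dedicated")
# -> 53.8%: roughly 46% cheaper per output token
```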
Latency insights
MPS overhead lands almost entirely in the decode phase (TPOT +19.5%) rather than in prefill (TTFT +1.4%), pointing to resource contention during iterative token generation rather than on the first-token path. For interactive UX, where TTFT dominates perceived responsiveness, the MPS penalty is close to negligible.
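For reference, TTFT and TPOT here follow the usual streaming definitions: TTFT is the wait for the first token (prefill), TPOT the average gap between subsequent tokens (decode). A minimal sketch of computing both from per-token arrival timestamps:

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT and TPOT (in the same time unit as the inputs).

    token_times: arrival timestamp of each streamed output token,
    in order; must be non-empty.
    """
    # Prefill cost: time until the first token arrives.
    ttft = token_times[0] - request_start
    # Decode cost: mean inter-token gap over the remaining tokens.
    # This is the phase where MPS contention shows up in this study.
    tpot = (
        (token_times[-1] - token_times[0]) / (len(token_times) - 1)
        if len(token_times) > 1
        else 0.0
    )
    return ttft, tpot
```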
Recommendations
- Cost-sensitive deployments and smaller models: default to MPS sharing.
- Mixed traffic: take a hybrid approach, with MPS for batch/async tiers and dedicated GPUs for latency-critical tiers.
- Performance-critical production: reserve dedicated GPUs and use MPS as a burst-capacity tier.