- −50% GPU cost (1 vs. 2 GPUs)
- −7.5% throughput trade-off
- 6.6× cost savings per percentage point of performance lost

Study overview

This benchmark compares NVIDIA Multi-Process Service (MPS) GPU sharing against dedicated GPU allocation for large language model inference. The target workload is the Qwen3-4B-FP8 model, deployed on Google Kubernetes Engine (GKE) with NVIDIA A100-80GB GPUs.

The problem: GPU underutilization in LLM inference

LLM inference workloads consistently leave GPU memory and compute on the table. A typical scenario: a 15 GB model running on an 80 GB GPU, leaving more than 80% of memory idle while still consuming the full GPU's power and cost. At fleet scale, this is millions of dollars and megawatt-hours of waste.
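
To put a rough number on that, here is a back-of-the-envelope sketch. The fleet size and hourly GPU price are illustrative assumptions, not figures from this study:

```python
# Back-of-the-envelope for the waste described above.
# Fleet size and GPU price are illustrative assumptions, not study data.
MODEL_GB, GPU_GB = 15, 80
idle_fraction = 1 - MODEL_GB / GPU_GB        # 0.8125 -> >80% of memory idle

FLEET_GPUS = 1_000
PRICE_PER_GPU_HOUR = 3.67                    # assumed on-demand A100-80GB rate
HOURS_PER_YEAR = 24 * 365

idle_spend = FLEET_GPUS * PRICE_PER_GPU_HOUR * HOURS_PER_YEAR * idle_fraction
print(f"{idle_fraction:.0%} of memory idle ≈ ${idle_spend:,.0f}/year of stranded spend")
```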

How MPS works

MPS lets multiple CUDA processes share a single GPU context, spatially sharing the device rather than time-slicing it: kernels from different clients execute concurrently on different Streaming Multiprocessors (SMs), which eliminates the context-switching overhead of time-sliced sharing.
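
As a concrete (if simplified) illustration, the sketch below stands up MPS on one GPU and co-locates two inference processes behind it. The `nvidia-cuda-mps-control` daemon and the `CUDA_MPS_*` environment variables are standard MPS machinery; the directories, the 50% SM cap, and the `my_inference_server` module are assumptions for this example. On GKE, MPS sharing is typically enabled at the node-pool level rather than wired up by hand like this.

```python
import os
import subprocess

# Start the MPS control daemon (one per GPU node). The pipe/log paths are
# arbitrary, but every client must see the same CUDA_MPS_PIPE_DIRECTORY.
mps_env = {
    **os.environ,
    "CUDA_VISIBLE_DEVICES": "0",
    "CUDA_MPS_PIPE_DIRECTORY": "/tmp/nvidia-mps",
    "CUDA_MPS_LOG_DIRECTORY": "/tmp/nvidia-mps-log",
}
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=mps_env, check=True)

# Launch two inference servers as MPS clients. Capping each client's SM
# share is optional; without a cap, clients contend for all SMs.
for port in (8000, 8001):
    client_env = {**mps_env, "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE": "50"}
    subprocess.Popen(
        ["python", "-m", "my_inference_server", "--port", str(port)],  # hypothetical server module
        env=client_env,
    )
```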

Deployment configurations

Dedicated baseline: two model replicas, each allocated a full A100-80GB (2 GPUs total). MPS shared: the same two replicas co-located on a single A100-80GB (1 GPU total), which is where the 50% cost reduction below comes from.

Performance metrics

| Metric | Dedicated GPU | MPS Shared | Δ |
|---|---|---|---|
| Output throughput | 1,557.1 tokens/s | 1,448.2 tokens/s | −7.5% |
| Request throughput | 6.17 req/s | 5.73 req/s | −7.1% |
| TTFT, time to first token (median) | 497.5 ms | 504.7 ms | +1.4% |
| TPOT, time per output token (median) | 52.3 ms | 62.5 ms | +19.5% |

Cost-efficiency trade-off

Cost reduction: 50% (1 GPU vs. 2 GPUs).
Performance loss: 7.5% output throughput.
Each percentage point of throughput sacrificed buys roughly 6.6 percentage points of cost reduction (50 / 7.5 ≈ 6.7).
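
A minimal sketch of where that trade-off lands in per-token terms, using the measured throughput. The hourly price is an assumed A100-80GB rate and only scales the absolute numbers; the MPS-vs-dedicated ratio is price-independent:

```python
# Cost per million output tokens for each configuration.
PRICE_PER_GPU_HOUR = 3.67  # assumed A100-80GB rate; cancels out of the ratio

configs = {
    "dedicated (2 GPUs)": (2, 1557.1),   # (gpus, output tokens/s)
    "mps shared (1 GPU)": (1, 1448.2),
}

cost_per_m = {
    name: gpus * PRICE_PER_GPU_HOUR / (tps * 3600) * 1_000_000
    for name, (gpus, tps) in configs.items()
}
for name, cost in cost_per_m.items():
    print(f"{name}: ${cost:.2f} per 1M output tokens")

ratio = cost_per_m["mps shared (1 GPU)"] / cost_per_m["dedicated (2 GPUs)"]
print(f"MPS per-token cost is {ratio:.0%} of dedicated (~{1 - ratio:.0%} cheaper)")
```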

Latency insights

MPS overhead lands almost entirely in the decode phase (TPOT +19.5%) rather than the initial prefill (TTFT +1.4%), pointing to resource contention during iterative token generation rather than on the first-token path. For interactive UX, where time to first token dominates perceived responsiveness, MPS is essentially free.
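
For reference, a minimal sketch of how these two metrics are typically computed from a streamed response, assuming you record the request send time and an arrival timestamp for each output token:

```python
def ttft_and_tpot_ms(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT and TPOT in milliseconds, from per-token arrival timestamps.

    TTFT (time to first token) covers the prefill phase; TPOT (time per
    output token) averages the inter-token gaps of the decode phase,
    which is where the MPS contention measured above shows up.
    """
    assert len(token_times) >= 2, "need at least two tokens to measure decode"
    ttft = (token_times[0] - request_sent) * 1000.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000.0
    return ttft, tpot
```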

Recommendations

- Prefer MPS sharing when per-token cost dominates and the traffic is interactive: TTFT is essentially unchanged and GPU cost halves.
- Keep dedicated GPUs when decode speed is the binding constraint: the +19.5% TPOT penalty compounds over long generations.

Want this analysis on your inference fleet?

Pebble's optimizer can model MPS, MIG, and dedicated configurations against your actual traffic, then automate the rollout.