Compute·May 2, 2026·5 min read

GPU utilization is the only metric that matters

Throughput, latency, queue depth — useful, sure. But if your GPUs are sitting at 38% you don't have a serving problem, you have a billing problem.

Vertotech Research

Every team we audit shows us a beautiful Grafana dashboard tracking queue depth, p99 latency, request volume, and token throughput. Almost none of them put GPU utilization at the top, and on the ones that do, the number is usually below 50%.

Utilization is the single number that compounds into every other cost question. On the same hardware, doubling utilization halves the cost per token without touching the model. Halving it doubles the bill for the same traffic while you keep tuning prompts.
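To make the arithmetic concrete, here is a minimal sketch. It assumes delivered throughput scales linearly with utilization, which is a simplification, and the function name, the hourly price, and the peak throughput are hypothetical placeholders rather than measured figures.

```python
def cost_per_million_tokens(gpu_hourly_usd, peak_tokens_per_sec, utilization):
    """Cost of one million tokens on a single GPU.

    Assumption: delivered throughput = peak throughput * utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: a $4.00/hour GPU that peaks at 2,500 tokens/s.
at_38_percent = cost_per_million_tokens(4.00, 2500, 0.38)
at_76_percent = cost_per_million_tokens(4.00, 2500, 0.76)

# Under the linear assumption, doubling utilization halves the unit cost.
ratio = at_38_percent / at_76_percent
```

The absolute dollar values depend entirely on the placeholder inputs; the point is the ratio between the two utilization levels.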

Why utilization is so often low

Single-tenant deployment for a workload that doesn't fill a GPU.
Static batch sizes tuned for the long tail, leaving slack on common cases.
Idle instances kept warm "just in case," with no queue evidence to justify them.
Routing rules that pin traffic to specific models and leave their GPUs under-loaded.
No bin-packing across model families on the same hardware.
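As an illustration of the last point, bin-packing can be sketched with a first-fit-decreasing heuristic. This is a simplified example, not a production scheduler: it packs on memory footprint alone, and the model names and sizes are hypothetical. Real co-location also has to account for compute contention and latency targets.

```python
def pack_models(model_mem_gb, gpu_mem_gb):
    """First-fit-decreasing packing of models onto identical GPUs.

    model_mem_gb: dict of model name -> memory footprint in GB (hypothetical).
    gpu_mem_gb:   memory capacity of each GPU in GB.
    Returns a list of GPUs, each a list of the model names placed on it.
    """
    gpus = []  # each entry: {"free": remaining GB, "models": [names]}
    # Place the largest models first; ties keep their input order.
    by_size = sorted(model_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, mem in by_size:
        if mem > gpu_mem_gb:
            raise ValueError(f"{name} does not fit on a single GPU")
        for gpu in gpus:
            if gpu["free"] >= mem:
                gpu["models"].append(name)
                gpu["free"] -= mem
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append({"free": gpu_mem_gb - mem, "models": [name]})
    return [gpu["models"] for gpu in gpus]


# Four hypothetical models that would take four GPUs if deployed one per GPU.
placement = pack_models({"a": 40, "b": 30, "c": 30, "d": 20}, gpu_mem_gb=80)
```

In this example the four models fit on two 80 GB GPUs instead of four.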

What to do about it

Start by instrumenting it. On NVIDIA hardware, DCGM exposes SM activity and memory-bandwidth utilization, and most serving stacks can scrape those metrics alongside their own. Plot p50 and p99 utilization next to your p99 latency and watch the gap.
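A minimal sketch of the p50/p99 summary follows. It assumes you have already collected utilization samples as fractions between 0 and 1 (for example, from whatever scrapes your DCGM metrics); the function names and the sample values are hypothetical, and the nearest-rank method is one of several reasonable percentile definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_utilization(samples):
    """Return the p50 and p99 of a series of utilization samples."""
    return {"p50": percentile(samples, 50), "p99": percentile(samples, 99)}


# Hypothetical samples: mostly idle, with short bursts near saturation.
samples = [0.30] * 90 + [0.95] * 10
summary = summarize_utilization(samples)
```

A large gap between the two values, as in this made-up series, suggests hardware sized for the bursts and idle the rest of the time.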

Adopt continuous batching (vLLM, TGI, TRT-LLM) if you haven't.
Co-locate compatible workloads with priority routing.
Use spot + on-demand mixes for stateless inference paths.
Treat utilization as an SLO, not just a metric.
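Treating utilization as an SLO could look something like the sketch below. The target (60% utilization) and the objective (95% of windows meeting it) are hypothetical defaults chosen for illustration, not recommendations, and the function name is likewise an assumption.

```python
def utilization_slo_met(window_utilizations, target=0.60, objective=0.95):
    """Check a utilization SLO over a series of measurement windows.

    window_utilizations: average utilization per window, as fractions in [0, 1].
    target:    minimum utilization for a window to count as "good" (hypothetical).
    objective: required fraction of good windows (hypothetical).
    Returns (met, compliance), where compliance is the fraction of good windows.
    """
    if not window_utilizations:
        raise ValueError("need at least one window")
    good = sum(1 for u in window_utilizations if u >= target)
    compliance = good / len(window_utilizations)
    return compliance >= objective, compliance


# Hypothetical day of hourly windows: half busy, half mostly idle.
met, compliance = utilization_slo_met([0.70] * 12 + [0.30] * 12)
```

Framing it this way gives you an error budget for under-utilization, the same way you would budget for latency violations.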

If you remember nothing else: a dashboard without GPU utilization on the front page is a dashboard built for the wrong audience.
