In 2025, "GPU cost" was a problem for OpenAI, Anthropic, and four labs in Mountain View. In 2026, it's a line item on every Series B SaaS company's board deck.
The FinOps Foundation's 2026 State of FinOps Report calls AI cost management the single most desired skillset across organisations of all sizes. 98% of FinOps teams now manage AI spend, up from 63% last year. The reason is simple arithmetic: a single H100 costs $2–4 per GPU-hour on-demand, your product needs 8 of them behind the "Ask AI" button, and at 24×7 uptime (roughly 730 hours a month) that's roughly $12,000–23,000 a month for one inference pool.
Here's the playbook we run when a client comes to us with a runaway LLM inference bill. Realistic expectation: 50–70% savings in 4–8 weeks without re-architecting the model.
1. Look at actual GPU utilisation first
Every GPU cost engagement we've run starts the same way: we SSH into a production node mid-workday and run nvidia-smi dmon. Almost without fail, average utilisation is 20–40%. We've seen 8%.
The reason isn't that your app isn't busy; it's that inference traffic is bursty by nature. KV cache warm-up, model loading, tokenisation overhead, batch formation: they all create idle gaps. A GPU at 40% average utilisation is telling you something very specific: you can put more workload on it. You're paying for an entire H100 to do the work of half an H100.
First action: before any fancy optimisation, just measure. Prometheus + NVIDIA DCGM exporter + a Grafana dashboard on DCGM_FI_DEV_GPU_UTIL per node. If your average is under 50%, you have easy wins and you haven't started yet.
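The first pass over those metrics doesn't need anything fancy. A minimal sketch, assuming you've scraped DCGM_FI_DEV_GPU_UTIL samples (percentages, 0–100, per GPU) from the DCGM exporter, or parsed the sm column of nvidia-smi dmon:

```python
# Sketch: summarise GPU utilisation samples and flag headroom.
# Assumes samples are utilisation percentages (0-100) from the DCGM
# exporter's DCGM_FI_DEV_GPU_UTIL metric or `nvidia-smi dmon`.

def utilisation_report(samples: list[float], threshold: float = 50.0) -> dict:
    """Average utilisation plus a simple 'is there headroom?' verdict."""
    avg = sum(samples) / len(samples)
    # Under 50% average => easy wins available before any optimisation work.
    return {"avg_util_pct": round(avg, 1), "headroom": avg < threshold}

# A node that looks busy in bursts but averages only 30%:
print(utilisation_report([8.0, 35.0, 60.0, 22.0, 15.0, 40.0]))
# -> {'avg_util_pct': 30.0, 'headroom': True}
```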
2. Fractional GPUs: MIG, time-slicing, MPS
NVIDIA's Multi-Instance GPU (MIG) on H100s and A100s lets you split one physical GPU into up to seven isolated slices. Kubernetes treats each slice as a discrete allocatable resource — your workloads request nvidia.com/mig-1g.10gb just like they request CPU.
For inference workloads that don't need a full GPU, MIG is the single biggest lever. A 7-way split of a $2/hr H100 gives you seven $0.29/hr slices. A 7B-parameter model quantised to fit comfortably in 10GB, served on a MIG slice, is almost free money compared to a dedicated card.
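The slice arithmetic is worth writing down. A sketch with illustrative pricing (not a quote) and a deliberately rough fit check:

```python
def mig_slice_cost(gpu_hourly: float, slices: int = 7) -> float:
    """Effective $/hr for one slice of an evenly split GPU."""
    return gpu_hourly / slices

def fits_slice(model_gb: float, slice_gb: float = 10.0,
               kv_headroom_gb: float = 2.0) -> bool:
    """Rough check: weights plus some KV-cache headroom within the slice."""
    return model_gb + kv_headroom_gb <= slice_gb

print(round(mig_slice_cost(2.00, 7), 2))  # -> 0.29 ($/hr per 1g.10gb slice)
print(fits_slice(7.0))                    # -> True (INT8 7B model, ~7 GB)
print(fits_slice(14.0))                   # -> False (FP16 7B model, ~14 GB)
```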
If MIG doesn't fit (small batch sizes, tight latency, non-MIG-capable card), time-slicing or MPS (Multi-Process Service) let multiple containers share one GPU. Less isolation, more flexibility. For dev and staging clusters, time-slicing alone typically cuts GPU spend in half.
3. Spot and preemptible GPUs for the right workloads
AWS, GCP, and Azure all offer spot GPU instances at 50–70% off on-demand pricing. The catch: they can be reclaimed with 30–120 seconds of warning.
For user-facing real-time inference, spot is risky. But for everything else (batch embedding generation, nightly classification, fine-tuning and eval jobs, A/B model evaluation, async summarisation) spot is dramatic savings with manageable risk. We pair spot with Karpenter (on EKS) or GKE node auto-provisioning, keep a minimum on-demand node pool as the floor, set a PodDisruptionBudget so reclamations can't drain a whole deployment at once, and route critical traffic to the floor when spot gets reclaimed.
Rule of thumb: if the workload has a retry loop and no human waiting on the other end, it belongs on spot.
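As a sanity check on the savings, here's a sketch of the blended GPU rate for a spot/on-demand split; the discount and share figures are illustrative, not provider quotes:

```python
def blended_gpu_hourly(on_demand: float, spot_discount: float,
                       spot_share: float) -> float:
    """Blended $/GPU-hr when spot_share of hours run at the discounted rate."""
    spot_price = on_demand * (1 - spot_discount)
    return spot_share * spot_price + (1 - spot_share) * on_demand

# 60% of GPU-hours on spot at 65% off a $2/hr on-demand rate:
print(round(blended_gpu_hourly(2.00, 0.65, 0.60), 2))  # -> 1.22
```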
4. Continuous batching and KV caching
Almost every LLM inference bill has 20–30% waste from not batching. If you serve one request at a time, your GPU runs a forward pass for each token of each request in isolation. Batch 16 requests together — even with padding — and you get near-linear throughput improvement at the cost of a small latency penalty.
vLLM (continuous batching), TensorRT-LLM (in-flight batching), and SGLang (continuous batching plus RadixAttention prefix reuse) all solve this. Pick one. The delta from naive request-at-a-time serving to continuous batching is typically 3–4x throughput on the same hardware.
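To see why that throughput delta matters on the bill, convert it into $ per million tokens. The throughput figures below are illustrative placeholders, not benchmarks:

```python
def cost_per_million_tokens(gpu_hourly: float, tokens_per_sec: float) -> float:
    """$ per 1M generated tokens at a sustained throughput."""
    return gpu_hourly / (tokens_per_sec * 3600) * 1_000_000

naive = cost_per_million_tokens(2.00, 400)     # request-at-a-time serving
batched = cost_per_million_tokens(2.00, 1600)  # ~4x with continuous batching
print(f"${naive:.2f} -> ${batched:.2f} per 1M tokens")  # $1.39 -> $0.35
```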
KV cache reuse across requests (prefix caching) is the other big lever. If your users send similar system prompts, and most production LLM apps do, the cache hit rate can save 30–60% of compute on the prompt-processing (prefill) phase. vLLM and SGLang both expose this out of the box.
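To first order, the prefill saving is just hit rate times shared-prefix fraction. A sketch with illustrative numbers:

```python
def prefill_saved(hit_rate: float, shared_prefix_frac: float) -> float:
    """Fraction of prompt-phase (prefill) compute skipped via KV prefix reuse.

    hit_rate: fraction of requests whose prefix is already cached.
    shared_prefix_frac: shared prefix length / total prompt length.
    """
    return hit_rate * shared_prefix_frac

# 80% of requests share a system prompt that is half their prompt tokens:
print(prefill_saved(0.80, 0.50))  # -> 0.4 (40% of prefill compute skipped)
```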
5. Quantise before you scale
Every production LLM inference pipeline we've touched in the last year runs the full FP16 model by default and panics when the bill arrives. FP8 and INT8 quantisation — via TensorRT-LLM, AWQ, GPTQ, or Marlin kernels — typically give you:
- 2x throughput on the same hardware
- 2x lower memory footprint, letting you fit on smaller GPUs or larger MIG slices
- <1% quality degradation on most benchmarks (measure, don't assume)
If you haven't quantised, you're running on 2023 defaults. After continuous batching, this is the single biggest lever.
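The memory side of the argument is simple arithmetic: roughly one GB of weights per billion parameters per byte of precision. A sketch, ignoring KV cache and activations:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory only; KV cache and activations come on top."""
    return params_billions * bytes_per_param

print(weight_memory_gb(7, 2.0))  # -> 14.0 (FP16: too big for a 10GB MIG slice)
print(weight_memory_gb(7, 1.0))  # -> 7.0  (INT8: fits, with room for KV cache)
```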
6. The full stack impact
Here's what a realistic optimisation stack looks like, cumulatively:
- Baseline (FP16 7B model, 1 full H100 on-demand, avg util 25%): $2,160/mo
- + Continuous batching (3x throughput on same hardware): effective $720/mo
- + INT8 quantisation (2x throughput, 2x memory): effective $360/mo
- + MIG 2-way slice (quantised model fits): $180/mo per workload
- + Spot for 60% of non-SLA traffic: ~$100/mo
From $2,160 to ~$100 for the same logical workload is real, but it takes a tuning pass, a runtime change, and a model recompilation. Most teams get 50–70% without the full stack — which is plenty to buy a quarter of runway.
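The cumulative stack can be reproduced in a few lines; every figure is the illustrative one from the list above, not a quote:

```python
cost = 3.00 * 720   # baseline: 1 H100 on-demand at $3/hr -> $2,160/mo
cost /= 3           # continuous batching, ~3x throughput -> $720 effective
cost /= 2           # INT8 quantisation, ~2x throughput   -> $360 effective
cost /= 2           # 2-way MIG slice per workload        -> $180
spot_share, spot_discount = 0.60, 0.70
cost = cost * (1 - spot_share) + cost * spot_share * (1 - spot_discount)
print(round(cost))  # -> 104, i.e. the ~$100/mo figure above
```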
The takeaway
Cloud consultants love to say "it depends." For GPU cost, it really does — the same model can cost 10x different amounts depending on how you serve it.
Before you escalate your next GPU budget request, run nvidia-smi dmon on a prod node for an hour and see what you find. If it's anywhere near 25%, you have a 50% saving waiting for you and you haven't even opened a ticket yet.
GPU bill out of control? This is exactly the kind of FinOps work we do. Book a free 30-minute GPU cost review — we'll look at your utilisation dashboards and your inference stack and tell you honestly where the 50% is.
Related: Cloud Consulting & FinOps · DevOps for Fintech · Kubernetes 1.33 in-place pod resize