The hidden cost of cold starts in inference
Autoscale-to-zero looks cheap until you measure p99 latency at 14 seconds and learn what your users actually feel.
Autoscale-to-zero is the most over-prescribed optimization in production AI. It looks great on the cost dashboard, terrible on the latency dashboard, and worse on the support queue.
What a cold start actually costs
A cold start for a 70B-parameter model is rarely the millisecond-class spin-up you get from a stateless web container. The replica has to pull the image, mount and load the weights, warm the CUDA context, JIT-compile kernels, and run a few warmup requests so the KV-cache and graphs settle. End to end, this often takes 10 to 90 seconds.
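To make that breakdown concrete, here is a rough sketch that adds up the phases. Every duration below is an illustrative assumption, not a measurement; the phase names simply mirror the steps listed above, and the sum assumes the phases run one after another.

```python
# Hypothetical per-phase timings in seconds. Replace with your own measurements.
COLD_START_PHASES = {
    "pull_image": 12.0,
    "load_weights": 25.0,
    "warm_cuda_context": 3.0,
    "compile_kernels": 8.0,
    "warmup_requests": 4.0,
}

def cold_start_seconds(phases):
    """Total cold-start time, assuming the phases run strictly in sequence."""
    return sum(phases.values())

def slowest_phase(phases):
    """Name of the phase that contributes the most time."""
    return max(phases, key=phases.get)

total = cold_start_seconds(COLD_START_PHASES)
print(f"cold start: {total:.0f}s, largest phase: {slowest_phase(COLD_START_PHASES)}")
```

Timing each phase separately in your own stack shows which one is worth attacking first.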
If even one in twenty user-facing requests hits a cold start, your p99 latency is owned by the cold start, no matter how fast warm requests are. In fact, anything above one in a hundred is enough.
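The percentile arithmetic is easy to check with a synthetic sample. This is a minimal sketch under stated assumptions: a fixed 0.4-second warm latency (made up for illustration), a fixed 14-second cold latency, and a nearest-rank percentile. Real latency distributions are noisier, but the threshold behaves the same way.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def synthetic_latencies(n, cold_fraction, warm_s=0.4, cold_s=14.0):
    """n requests where a fixed fraction pays the cold-start penalty (assumed constants)."""
    n_cold = round(n * cold_fraction)
    return [cold_s] * n_cold + [warm_s] * (n - n_cold)

for cold_fraction in (0.005, 0.02, 0.05):
    lat = synthetic_latencies(10_000, cold_fraction)
    print(
        f"cold fraction {cold_fraction:.1%}: "
        f"p50={percentile(lat, 50)}s p99={percentile(lat, 99)}s"
    )
```

With these assumptions, p50 stays at the warm latency in every case, while p99 jumps to the cold-start latency as soon as the cold fraction exceeds 1%.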
Patterns that work
- Warm pools: keep N replicas warm and overflow to autoscaling for spikes.
- Snapshot weights and kernels for fast restore.
- Predictive scaling on diurnal traffic, not reactive scaling on queue depth.
- Route cold-start-eligible requests to a longer-running cohort.
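The first and third patterns combine naturally: forecast the next hour from the same hour on previous days, then bring up that many replicas ahead of time, never dropping below a warm floor. The sketch below is one possible shape for this, not a production scaler. The function names, the 20% headroom, the per-replica throughput, and the traffic history are all hypothetical.

```python
import math

def diurnal_forecast(history_rps_by_hour, hour):
    """Naive forecast: mean requests/second seen at this hour of day on previous days."""
    samples = history_rps_by_hour[hour]
    return sum(samples) / len(samples)

def desired_replicas(forecast_rps, per_replica_rps, warm_floor, headroom=1.2):
    """Replicas to have warm before the traffic arrives, never below the warm floor."""
    needed = math.ceil(forecast_rps * headroom / per_replica_rps)
    return max(warm_floor, needed)

# Hypothetical history: requests/second at 03:00 and 14:00 over three days.
history = {3: [2.0, 3.0, 1.0], 14: [80.0, 95.0, 110.0]}

for hour in (3, 14):
    forecast = diurnal_forecast(history, hour)
    replicas = desired_replicas(forecast, per_replica_rps=10.0, warm_floor=2)
    print(f"{hour:02d}:00 forecast={forecast:.0f} rps -> {replicas} warm replicas")
```

At 03:00 the warm floor decides the answer; at 14:00 the forecast does. Reactive autoscaling can still sit on top of this for spikes the forecast misses.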
When cold starts are actually fine
For batch jobs, internal tools with relaxed SLOs, and overflow capacity on cheap GPUs, autoscale-to-zero is genuinely great. The mistake is applying it to your main user-facing endpoint because the cost dashboard told you to.
Choose the policy based on the latency budget, not on the cloud provider's blog post.
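One way to turn the latency budget into a decision is a small check like the one below. It is a sketch under simplifying assumptions: warm requests already meet the budget, every cold request takes the full cold-start time, and the SLO is stated as a single percentile. The function name, parameters, and workload numbers are hypothetical.

```python
def can_scale_to_zero(latency_budget_s, cold_start_s, cold_fraction, slo_percentile=99.0):
    """True if cold starts cannot push the SLO percentile over the latency budget.

    Assumes warm requests already fit the budget.
    """
    if cold_start_s <= latency_budget_s:
        return True  # even a request that hits a cold start fits the budget
    tail_allowed = 1.0 - slo_percentile / 100.0
    # Cold requests must stay inside the tail that the SLO does not cover.
    return cold_fraction < tail_allowed

# Hypothetical workloads: (latency budget, cold-start time, fraction of cold requests).
workloads = {
    "nightly batch": (600.0, 60.0, 0.50),
    "internal tool": (30.0, 14.0, 0.20),
    "user-facing chat": (2.0, 14.0, 0.05),
}

for name, (budget, cold, fraction) in workloads.items():
    verdict = "scale to zero" if can_scale_to_zero(budget, cold, fraction) else "keep a warm pool"
    print(f"{name}: {verdict}")
```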