Telemetry-Driven Capacity Forecasting for On-Premises GPU Clusters
How to use real-time telemetry and historical usage patterns to forecast GPU capacity needs, avoid over-provisioning, and plan infrastructure investments with confidence.
The problem with gut-feel GPU procurement
Most on-premises AI teams procure GPUs the same way enterprises used to buy servers in 2005: somebody estimates peak demand, adds a safety margin, and submits a purchase order. Six months later the cluster is either sitting 30 percent idle or buckling under workloads nobody anticipated. The lead times for enterprise GPU hardware — often 12 to 20 weeks for NVIDIA H100 or B200 nodes — make correcting a bad forecast expensive and slow.
Telemetry-driven capacity forecasting replaces guesswork with data. By instrumenting every layer of the inference and training stack, you build a living model of how your cluster is actually consumed. That model lets you forecast demand weeks or months ahead, plan procurement around real growth curves, and defend your hardware budget with numbers your CFO can audit.
What to measure: the telemetry stack for GPU infrastructure
Effective forecasting requires telemetry at four distinct layers, each contributing signals that raw GPU utilization alone cannot provide:
GPU hardware metrics: SM (streaming multiprocessor) utilization, memory bandwidth saturation, GPU memory occupancy, thermal throttling events, and NVLink or PCIe traffic. NVIDIA DCGM (Data Center GPU Manager), through its DCGM Exporter, publishes these in Prometheus format. Collection intervals can be configured down to about one second, although the granularity you actually store also depends on your Prometheus scrape interval.
Serving-layer metrics: requests per second, queue depth, time-to-first-token, inter-token latency, batch sizes achieved by the continuous batcher, KV cache utilization, and prefix-cache hit rates. These come from your inference engine: vLLM, TGI, and TensorRT-LLM each expose most of them, though metric names differ between engines.
Workload-level metrics: tokens processed per model per team, request type distribution (interactive versus batch versus agent), prompt and completion length distributions, and error or timeout rates. These typically live in your AI gateway or routing layer.
Scheduler and orchestration metrics: pending job queue lengths, scheduling wait times, preemption counts, and GPU time-sharing ratios from Kubernetes device plugins or SLURM accounting logs.
The key insight is that GPU utilization on its own is an incomplete, and often lagging, indicator of capacity. A cluster running at 70 percent GPU utilization might already be at capacity if queue depths are climbing and tail latencies are degrading. Conversely, 90 percent utilization with flat queue depths and healthy latencies means you have more headroom than the headline number suggests.
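As a rough illustration of that point, the sketch below classifies a cluster from the trend in queue depth and tail latency rather than from utilization. The function names, the sample windows, and the slope tolerances are hypothetical choices for this example, not values taken from any monitoring tool.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of values against sample index (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def assess_headroom(queue_depths, p99_latencies_ms,
                    queue_tolerance=0.5, latency_tolerance_ms=5.0):
    """Return 'saturating' when queue depth and p99 latency both trend upward.

    GPU utilization is deliberately not an input. The tolerances are assumed
    values and would need tuning per workload.
    """
    queue_rising = slope(queue_depths) > queue_tolerance
    latency_degrading = slope(p99_latencies_ms) > latency_tolerance_ms
    return "saturating" if queue_rising and latency_degrading else "headroom"


# Hypothetical samples, one per 5-minute window.
cluster_at_70_pct = assess_headroom([2, 4, 7, 11, 16], [180, 200, 240, 300, 380])
cluster_at_90_pct = assess_headroom([3, 3, 2, 3, 3], [150, 152, 149, 151, 150])
print(cluster_at_70_pct, cluster_at_90_pct)
```

In this made-up data the cluster at 70 percent utilization is the one that needs attention, because its queues and latencies are both climbing.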
Building a demand model from historical patterns
Raw telemetry is noise until you decompose it into patterns. Three techniques consistently deliver useful forecasts for GPU workloads:
Seasonal decomposition: most enterprise AI workloads have strong daily and weekly cycles. Document processing peaks during business hours; training jobs run overnight or on weekends when interactive demand drops. Use a seasonal-trend decomposition method (STL or similar) to separate trend, seasonal, and residual components. The trend component is your growth signal; the seasonal component sizes your peak-to-trough swing.
Workload segmentation: aggregate GPU demand is hard to forecast because it mixes fundamentally different workloads. Segment by model, team, request type, and priority tier. A fine-tuning workload from the NLP team may be growing at 15 percent per month while the computer vision inference load is flat. Forecasting each segment independently and summing produces far better results than forecasting the aggregate.
Saturation curve detection: GPU clusters exhibit non-linear degradation as utilization approaches capacity. Response latencies remain stable until a tipping point — typically around 75 to 85 percent sustained utilization — then degrade rapidly. Your demand model should flag when projected utilization crosses this threshold, not when it crosses 100 percent. By the time you hit 100 percent, your users have been suffering for weeks.
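To make the decomposition step concrete, here is a minimal sketch of classical additive decomposition using a centered moving average. It is a simpler stand-in for STL, which uses LOESS smoothing and is available in libraries such as statsmodels. The function name, the synthetic series, and the weekly profile are assumptions made for illustration.

```python
from statistics import mean


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual lists (additive model).

    Assumes an odd period (for example 7 daily samples per week) so the moving
    average window is symmetric. Positions without a full window get None.
    """
    if period % 2 == 0 or len(series) < 2 * period:
        raise ValueError("need an odd period and at least two full cycles")
    half = period // 2
    n = len(series)

    # Trend: centered moving average over exactly one full period.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = mean(series[i - half:i + half + 1])

    # Seasonal: average detrended value per phase, shifted to sum to zero.
    by_phase = [[] for _ in range(period)]
    for i in range(half, n - half):
        by_phase[i % period].append(series[i] - trend[i])
    raw = [mean(values) for values in by_phase]
    offset = mean(raw)
    profile = [r - offset for r in raw]
    seasonal = [profile[i % period] for i in range(n)]

    # Residual: whatever trend and seasonality do not explain.
    residual = [None if trend[i] is None else series[i] - trend[i] - seasonal[i]
                for i in range(n)]
    return trend, seasonal, residual


# Synthetic daily GPU-hours: linear growth plus a weekday/weekend pattern.
weekly_profile = [30, 40, 40, 35, 25, -80, -90]
series = [1000 + 5 * day + weekly_profile[day % 7] for day in range(28)]

trend, seasonal, residual = decompose_additive(series, period=7)
peak_to_trough = max(seasonal) - min(seasonal)
print(peak_to_trough)
```

The trend list carries the growth signal and `peak_to_trough` sizes the weekly swing. Real telemetry will leave non-zero residuals, and large ones are worth inspecting before trusting the forecast.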
From forecast to procurement: translating demand curves into hardware plans
A demand forecast is only useful if it maps to actionable procurement decisions. The translation requires three inputs: the demand curve itself, the hardware options available, and the lead time for each option.
Start by defining capacity units that match your actual workloads. For inference, a useful unit might be "concurrent 14B-parameter chat sessions at p95 latency under 200ms." For training, it might be "GPU-hours per week at BF16 on 8xH100 nodes." These units let you convert demand forecasts into hardware quantities without getting lost in raw FLOPS comparisons that rarely reflect real-world performance.
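One way to turn a forecast expressed in capacity units into a node count is sketched below. The per-node figure would come from your own benchmarks; the numbers here, the function name, and the 80 percent planning ceiling are illustrative assumptions.

```python
import math


def nodes_required(peak_demand_units, units_per_node, utilization_ceiling=0.80):
    """Nodes needed to serve peak demand while staying under a planning ceiling.

    units_per_node is the benchmarked capacity of one node in the chosen
    capacity unit (for example, concurrent chat sessions within the latency target).
    """
    if units_per_node <= 0 or not 0 < utilization_ceiling <= 1:
        raise ValueError("units_per_node must be positive and ceiling in (0, 1]")
    usable_units_per_node = units_per_node * utilization_ceiling
    return math.ceil(peak_demand_units / usable_units_per_node)


# Hypothetical inputs: forecast peak of 1,200 concurrent sessions,
# and a benchmark showing one node sustains 96 sessions within the latency target.
forecast_peak_sessions = 1200
sessions_per_node = 96
total_nodes = nodes_required(forecast_peak_sessions, sessions_per_node)
print(total_nodes)
```

Planning against a ceiling below 100 percent keeps the resulting fleet out of the region where latency degrades.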
Next, build a lead-time-adjusted procurement timeline. If your forecast shows you will cross the saturation threshold in 16 weeks and hardware lead time is 14 weeks, you have two weeks of decision margin — not 16. Many teams discover they needed to order hardware three months ago. The forecast makes this visible early enough to explore alternatives: rebalancing workloads across models, offloading low-priority batch work to spot cloud capacity, or accelerating model compression projects to reduce per-request GPU cost.
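The arithmetic above is simple enough to automate. The sketch below computes the decision margin and the latest date an order can be placed; the function names and the forecast date are hypothetical.

```python
from datetime import date, timedelta


def decision_margin_weeks(weeks_to_saturation, lead_time_weeks):
    """Weeks left to decide before an order would arrive too late.

    A negative result means the order is already overdue.
    """
    return weeks_to_saturation - lead_time_weeks


def latest_order_date(forecast_date, weeks_to_saturation, lead_time_weeks):
    """Last date on which ordering still lands hardware before saturation."""
    margin = decision_margin_weeks(weeks_to_saturation, lead_time_weeks)
    return forecast_date + timedelta(weeks=margin)


# The example from the text: saturation in 16 weeks, 14-week lead time.
margin = decision_margin_weeks(16, 14)
deadline = latest_order_date(date(2025, 1, 6), 16, 14)  # assumed forecast date

# A team that is already late: saturation in 10 weeks, same lead time.
overdue = decision_margin_weeks(10, 14)
print(margin, deadline, overdue)
```

A negative margin is the signal to look at the alternatives listed above rather than wait for new hardware.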
Finally, run scenario analysis against your demand model. What happens if the new product feature doubles inference traffic? What if the research team starts a large fine-tuning run that consumes 40 percent of training capacity for three weeks? Parameterizing these scenarios lets you stress-test your procurement plan before committing capital.
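A minimal sketch of parameterized scenarios follows, using the two questions above as inputs. The `Scenario` fields, the baseline and capacity figures, and the 80 percent ceiling are assumptions for illustration, and scenario duration is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    inference_multiplier: float = 1.0   # scales baseline inference demand
    extra_training_share: float = 0.0   # fraction of training capacity taken by a new run


def evaluate(scenario, baseline, capacity, ceiling=0.80):
    """Return {pool: (projected utilization, breaches ceiling?)} for one scenario."""
    demand = {
        "inference": baseline["inference"] * scenario.inference_multiplier,
        "training": baseline["training"]
        + capacity["training"] * scenario.extra_training_share,
    }
    result = {}
    for pool, value in demand.items():
        utilization = value / capacity[pool]
        result[pool] = (round(utilization, 3), utilization > ceiling)
    return result


# Hypothetical weekly GPU-hours per pool.
baseline = {"inference": 500.0, "training": 300.0}
capacity = {"inference": 800.0, "training": 600.0}

scenarios = [
    Scenario("baseline"),
    Scenario("feature launch doubles inference", inference_multiplier=2.0),
    Scenario("large fine-tuning run", extra_training_share=0.40),
]
results = {s.name: evaluate(s, baseline, capacity) for s in scenarios}
for name, pools in results.items():
    print(name, pools)
```

Keeping scenarios as plain data makes it easy to add new ones during planning reviews and rerun them against each updated forecast.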
Tooling and architecture for a forecasting pipeline
You do not need a custom ML platform to build a forecasting pipeline. A practical architecture uses components most infrastructure teams already operate:
Collection: NVIDIA DCGM Exporter and application-level Prometheus exporters expose metrics, which Prometheus (or a compatible agent) scrapes into a time-series database. Retain raw data for at least 90 days and downsampled data for 12 to 18 months to capture seasonal patterns.
Storage: Prometheus with Thanos or Cortex for long-term retention, or VictoriaMetrics for teams that prefer a single-binary deployment. The critical requirement is the ability to query across months of data without hitting cardinality limits.
Forecasting: Prophet, NeuralProphet, or even simple Holt-Winters models applied to the decomposed demand segments. Run these as scheduled jobs — weekly is sufficient for most procurement cycles — and store forecast outputs alongside actuals so you can measure forecast accuracy over time.
Visualization and alerting: Grafana dashboards that overlay forecast bands on actual utilization, with alerts when actuals consistently exceed the upper forecast bound. A "procurement horizon" panel that shows weeks remaining before saturation at current growth rate is the single most valuable view for infrastructure leadership.
The entire pipeline can run on a modest VM. The investment is not in compute but in the discipline of defining workload segments, choosing appropriate capacity units, and reviewing forecast accuracy monthly.
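The "procurement horizon" figure can be derived from per-segment forecasts. The sketch below uses a constant weekly growth rate per segment as a stand-in for real forecast output; the segment names, demand figures, growth rates, and the 80 percent ceiling are hypothetical.

```python
def weeks_to_saturation(segments, capacity, ceiling=0.80, horizon_weeks=104):
    """Weeks until summed segment demand reaches the planning ceiling.

    segments maps a segment name to (current weekly demand, weekly growth rate).
    Returns None if the ceiling is not reached within horizon_weeks.
    """
    limit = capacity * ceiling
    for week in range(horizon_weeks + 1):
        # Project each segment independently, then sum.
        total = sum(demand * (1 + growth) ** week
                    for demand, growth in segments.values())
        if total >= limit:
            return week
    return None


# Assumed weekly GPU-hours and growth rates per segment.
segments = {
    "nlp-finetune": (4000.0, 0.03),   # growing 3 percent per week
    "cv-inference": (6000.0, 0.00),   # flat
}
horizon = weeks_to_saturation(segments, capacity=16000.0)
print(horizon)
```

In a real pipeline the compound-growth projection would be replaced by the stored forecast for each segment, with the same summation and threshold check on top.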
Common pitfalls and how to avoid them
Several failure modes recur across organizations building GPU forecasting capabilities:
Ignoring the KV cache: GPU memory utilization driven by KV cache growth behaves differently from utilization driven by model weight loading. Weights are a roughly fixed cost per loaded model replica, while KV cache usage scales with concurrent requests and context length. If your telemetry does not distinguish between the two, your memory forecasts will be unreliable. Most modern serving engines expose KV cache metrics separately, so use them.
Forecasting averages instead of peaks: procurement must accommodate peak demand, not average demand. Always forecast the p95 or p99 of your demand distribution, and verify that your seasonal decomposition captures weekly and monthly peaks accurately.
Treating all GPUs as interchangeable: a forecast that says "we need 16 more GPUs" is useless if your workloads require specific memory sizes, interconnect topologies, or hardware generations. Forecast at the level of your capacity units, which should encode these constraints.
Never validating forecast accuracy: a forecast you never check against actuals is a guess with extra steps. Track mean absolute percentage error (MAPE) per segment monthly. If a segment's MAPE consistently exceeds 20 percent, the model is not capturing something important about that workload — investigate before trusting it for procurement.
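The accuracy check in the last item can be sketched as follows. Treating "consistently" as two consecutive months above the threshold is an assumption for this example, as are the function names and the sample data.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, skipping periods where the actual is zero."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def segments_to_investigate(monthly_history, threshold_pct=20.0, consecutive_months=2):
    """Names of segments whose MAPE exceeded the threshold in each recent month.

    monthly_history maps a segment name to a list of (actuals, forecasts)
    pairs, one pair per month, oldest first.
    """
    flagged = []
    for name, months in monthly_history.items():
        recent = months[-consecutive_months:]
        if len(recent) == consecutive_months and all(
            mape(actuals, forecasts) > threshold_pct
            for actuals, forecasts in recent
        ):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical weekly GPU-hours, grouped by month.
monthly_history = {
    "nlp-finetune": [([100, 120, 150], [80, 90, 100]), ([160, 180], [120, 130])],
    "cv-inference": [([200, 210], [195, 205]), ([205, 200], [200, 204])],
}
flagged = segments_to_investigate(monthly_history)
print(flagged)
```

Here the fast-growing fine-tuning segment is under-forecast in both months, which is the kind of pattern that should trigger an investigation before the next procurement decision.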
Telemetry-driven forecasting is not a one-time project. It is a practice: instrument, measure, forecast, procure, verify, and refine. The teams that adopt it stop arguing about GPU budgets and start having evidence-based conversations about infrastructure investment.