GPU Resource Scheduling and Orchestration for On-Premises AI Workloads
How to maximize GPU utilization on-premises with effective scheduling strategies, multi-tenancy patterns, and orchestration tools for AI inference and training.
The GPU Utilization Problem
GPUs are the most expensive component in on-premises AI infrastructure, yet they are frequently the most underutilized. Studies of enterprise GPU clusters consistently show average utilization rates between 30% and 50%. The reasons are familiar: teams reserve GPUs for exclusive use but only run inference during business hours. Training jobs claim entire nodes but spend significant time on data loading and preprocessing. Batch workloads run overnight but leave GPUs idle during the day.
In cloud environments, you pay only for what you use and can scale elastically. On-premises, you have already paid for the hardware — every idle GPU hour is wasted capital. Effective GPU scheduling and orchestration can push utilization above 80% without degrading the performance of individual workloads, fundamentally changing the economics of on-premises AI.
Scheduling Strategies for Mixed Workloads
On-premises AI clusters typically serve three types of workloads with different scheduling requirements:
Real-time inference needs guaranteed GPU access with predictable latency. These workloads serve user-facing applications and cannot tolerate queuing delays. They require reserved GPU capacity that is always available, making them the highest-priority tenant on the cluster.
Batch inference processes large volumes of data (embedding generation, document classification, image analysis) where throughput matters more than latency for any individual request. These workloads can tolerate queuing and can be interrupted and resumed without loss.
Training and fine-tuning jobs consume the most GPU resources but are typically the most flexible in timing. Training can run during off-peak hours and can be checkpointed and resumed if preempted by higher-priority work.
The scheduling strategy should reflect these priorities. Reserve a guaranteed allocation for real-time inference, allow batch workloads to use remaining capacity on a best-effort basis, and schedule training jobs to fill gaps — particularly overnight and weekend hours. This layered approach mirrors how electricity grids manage base load, intermediate load, and peak demand.
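The layered strategy can be sketched as a simple waterfall allocation: inference holds its reservation first, batch takes what is left, and training fills the remaining gaps. The function and numbers below are illustrative, not tied to any particular scheduler's API:

```python
def allocate_gpus(total, inference_reserved, batch_demand, training_demand):
    """Layered allocation: inference first, then batch, training fills the rest."""
    # Real-time inference always holds its reserved slice, even if currently idle.
    inference = min(inference_reserved, total)
    spare = total - inference
    # Batch runs best-effort on whatever the reservation leaves over.
    batch = min(batch_demand, spare)
    spare -= batch
    # Training fills the remaining gaps (e.g. overnight and weekends).
    training = min(training_demand, spare)
    return {"inference": inference, "batch": batch,
            "training": training, "idle": spare - training}

# Daytime: heavy batch demand squeezes training out entirely.
day = allocate_gpus(total=16, inference_reserved=6, batch_demand=10, training_demand=8)
# Overnight: the batch queue drains and training absorbs the freed capacity.
night = allocate_gpus(total=16, inference_reserved=6, batch_demand=2, training_demand=8)
```

The key property is that training demand never displaces the inference reservation; it only consumes capacity the higher tiers have declined.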
Multi-Tenancy and GPU Sharing
Dedicating entire GPUs to single workloads is the simplest approach but leads to the worst utilization. Modern GPU hardware and software support several sharing mechanisms that allow multiple workloads to share GPU resources safely:
NVIDIA Multi-Instance GPU (MIG). Available on A100, H100, and newer GPUs, MIG partitions a single GPU into up to seven independent instances, each with its own compute cores, memory, and cache. Each partition is hardware-isolated, preventing one tenant from affecting another's performance. MIG is ideal for running multiple small inference models on a single high-end GPU.
NVIDIA Multi-Process Service (MPS). MPS allows multiple CUDA processes to share a GPU concurrently by time-slicing the GPU's compute resources. Unlike MIG, MPS does not provide hardware-level isolation, so one workload's memory errors or crashes can affect others. MPS works well for trusted workloads from the same team or organization where isolation is less critical than utilization.
Time-slicing. Kubernetes with the NVIDIA device plugin supports time-slicing, where multiple pods share a GPU by alternating access in short intervals. This is the simplest sharing mechanism to configure but provides the least performance predictability. It works best for light inference workloads that do not saturate the GPU.
Choose the sharing mechanism based on your isolation requirements and GPU hardware. For production inference serving multiple teams, MIG provides the strongest guarantees. For development and testing environments, time-slicing or MPS offers adequate sharing at lower complexity.
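The guidance above amounts to a small decision table. A rule-of-thumb sketch (the function and its inputs are illustrative, not a real scheduler policy):

```python
def choose_gpu_sharing(isolation_needed: bool, mig_capable: bool,
                       saturates_gpu: bool) -> str:
    """Rule-of-thumb chooser for a GPU sharing mechanism (illustrative only)."""
    if isolation_needed:
        # Hardware isolation between tenants requires MIG; without
        # MIG-capable hardware, fall back to dedicating the GPU.
        return "MIG" if mig_capable else "dedicated GPU"
    if saturates_gpu:
        # A workload that already saturates the GPU gains little from sharing.
        return "dedicated GPU"
    # Trusted, light workloads: MPS for concurrency, time-slicing for simplicity.
    return "MPS or time-slicing"
```

For example, production inference for multiple teams on an H100 maps to MIG, while a developer's notebook on a shared workstation maps to MPS or time-slicing.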
Orchestration with Kubernetes and Beyond
Kubernetes has become the standard orchestration platform for on-premises AI, but GPU workloads require extensions beyond the default Kubernetes scheduler:
NVIDIA GPU Operator. Automates the management of GPU drivers, container runtimes, device plugins, and monitoring tools across the cluster. It ensures that every node has a consistent, working GPU software stack without manual driver installation — a significant operational burden in bare-metal GPU clusters.
Volcano and Kueue. The default Kubernetes scheduler is designed for microservices, not GPU-intensive AI workloads. Volcano adds gang scheduling (ensuring all pods in a distributed training job start simultaneously), fair-share scheduling across teams, and preemption policies. Kueue, a newer Kubernetes-native project, provides workload queuing and quota management specifically designed for batch and AI workloads.
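Gang scheduling's core rule is all-or-nothing admission: either every worker pod in a distributed job can be placed, or none are started. A minimal sketch of that admission check (a greedy placement heuristic, not Volcano's or Kueue's actual algorithm):

```python
def gang_admit(pods_needed, free_gpus_per_node):
    """All-or-nothing admission: either every worker in a distributed job
    gets a GPU, or none are scheduled, avoiding deadlocked partial starts.
    Returns a list of node indices, or None if the gang cannot be placed."""
    free = list(free_gpus_per_node)
    placement = []
    for _ in range(pods_needed):
        # Greedy choice: the node with the most free GPUs (illustrative heuristic).
        node = max(range(len(free)), key=lambda i: free[i])
        if free[node] == 0:
            return None  # cannot place the whole gang, so admit nothing
        free[node] -= 1
        placement.append(node)
    return placement

# A 4-worker job fits across two nodes with 2 free GPUs each;
# a 5-worker job does not, and is rejected outright rather than half-started.
ok = gang_admit(4, [2, 2])
rejected = gang_admit(5, [2, 2])
```

Without this rule, a default scheduler can start three of four workers, hold their GPUs while the fourth queues, and leave the whole job blocked at zero progress.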
Run:ai and similar platforms. Commercial solutions like Run:ai add a higher-level abstraction for GPU scheduling, providing fractional GPU allocation, automatic preemption and resume for training jobs, and utilization dashboards that help administrators identify waste. These platforms sit on top of Kubernetes and simplify GPU management for organizations that lack deep Kubernetes expertise.
Regardless of which tools you choose, implement resource quotas per team or project. Without quotas, one team's exploratory training job can consume the entire cluster, starving production inference workloads. Quotas should cover both guaranteed minimums (what each team can always use) and burst limits (what they can use when capacity is available).
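The guaranteed-minimum-plus-burst-limit model can be sketched as a simple admission check. The names and numbers below are hypothetical, not drawn from any specific quota system:

```python
def admit(team, requested, usage, quotas, cluster_free):
    """Quota check with a guaranteed floor and a burst ceiling (sketch)."""
    q = quotas[team]
    projected = usage.get(team, 0) + requested
    if projected <= q["guaranteed"]:
        return True  # within the always-available floor
    if projected <= q["burst"] and requested <= cluster_free:
        return True  # bursting, but only into currently spare capacity
    return False

quotas = {"nlp": {"guaranteed": 4, "burst": 8}}

admit("nlp", 2, {"nlp": 1}, quotas, cluster_free=10)  # within the floor
admit("nlp", 6, {"nlp": 1}, quotas, cluster_free=2)   # burst denied: no spare GPUs
```

The floor protects each team from its neighbors; the ceiling protects the cluster from any one team, even when capacity is momentarily free.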
Preemption and Priority: Avoiding Conflicts
When GPU demand exceeds supply — and on-premises, it inevitably will — the scheduler must decide which workloads to preempt. A well-designed preemption policy prevents the two most common failure modes: production inference degradation and wasted training compute.
Define at least three priority tiers:
Critical (non-preemptible): Production inference endpoints, compliance-critical batch processing. These workloads are never preempted. Reserve enough cluster capacity to guarantee they always run.
Standard (preemptible with checkpointing): Training jobs, large batch inference, embedding generation. These workloads save checkpoints regularly and can be stopped and resumed when higher-priority work needs their GPUs. Checkpoint frequency should balance the cost of lost compute (time since last checkpoint) against the overhead of checkpointing itself.
Best-effort (preemptible immediately): Development experiments, hyperparameter searches, ad hoc analysis. These workloads run only when spare capacity exists and are the first to be evicted.
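The checkpoint-frequency tradeoff in the standard tier has a classic rule of thumb: the Young/Daly approximation, which sets the interval to the square root of twice the checkpoint cost times the mean time between interruptions. It was derived for hardware failures, but treating preemptions like failures gives a reasonable starting point (values below are illustrative):

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mean_time_between_preemptions_s):
    """Young/Daly approximation: the interval that balances checkpoint
    overhead against expected recomputation after an interruption."""
    return math.sqrt(2 * checkpoint_cost_s * mean_time_between_preemptions_s)

# 60 s to write a checkpoint, preempted roughly every 8 hours:
interval = optimal_checkpoint_interval(60, 8 * 3600)  # ~1859 s, about 31 minutes
```

Checkpointing much more often than this wastes GPU time on I/O; much less often, and each preemption discards a long stretch of completed training steps.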
Communicate these tiers clearly to all teams. When a data scientist understands that their training job will be checkpointed and resumed — not killed — they are far more willing to accept a shared scheduling model rather than demanding dedicated hardware.
Measuring and Improving Utilization
You cannot improve what you do not measure. Deploy GPU monitoring that captures utilization at the workload level, not just the node level:
GPU compute utilization measures what percentage of time the GPU's streaming multiprocessors are active. But high compute utilization does not always mean productive work — a poorly optimized model can keep the GPU busy while processing requests slowly.
GPU memory utilization shows how much of the GPU's memory is allocated. High memory utilization with low compute utilization often indicates a model that is loaded but idle — a prime candidate for sharing or consolidation.
Throughput per GPU-hour is the most meaningful business metric. Measure the actual work accomplished (requests served, documents processed, training steps completed) per unit of GPU time. This metric reveals whether scheduling changes translate into more work done, not just higher utilization numbers.
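The metric itself is a one-line calculation; the example below shows why it is more informative than raw utilization when evaluating a consolidation (the workload numbers are hypothetical):

```python
def throughput_per_gpu_hour(work_units, gpus, wall_clock_hours):
    """Work accomplished per unit of GPU time: the metric that shows whether
    scheduling changes produce more output, not just busier GPUs."""
    return work_units / (gpus * wall_clock_hours)

# Before consolidation: 4 GPUs serving 120k requests over an 8-hour day.
before = throughput_per_gpu_hour(120_000, gpus=4, wall_clock_hours=8)  # 3750.0
# After packing the same models onto 2 MIG-partitioned GPUs: same work,
# half the GPU-hours, so throughput per GPU-hour doubles.
after = throughput_per_gpu_hour(120_000, gpus=2, wall_clock_hours=8)   # 7500.0
```

The same 120,000 requests were served in both cases; only this metric, not a utilization percentage, makes the freed-up capacity of the two idle GPUs visible as a gain.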
Review utilization dashboards weekly with infrastructure and data science teams together. Identify patterns: GPUs that are consistently underutilized during specific hours, workloads that could be consolidated onto fewer GPUs, and training jobs that would benefit from off-peak scheduling. Iterating on your scheduling configuration based on observed patterns is what turns a GPU cluster from an expensive asset into a productive one.
Featured image by GAMERCOMP.RU on Unsplash.