GPU and inference autoscaling

This guide shows how to configure Horizontal Pod Autoscaler (HPA) and KEDA scaling inside a vCluster with Private Nodes. The metrics pipeline uses NVIDIA DCGM Exporter, Prometheus, and Prometheus Adapter to expose real GPU hardware metrics through the Kubernetes custom metrics API. The same Prometheus pipeline can also expose model-serving metrics such as queue depth, concurrency, and latency.

Use cases

GPU and inference autoscaling help when your workloads have variable GPU utilization or request load:

  • Inference serving: Scale model-serving Deployments up during traffic spikes and down during quiet periods to reduce idle GPU cost.
  • Batch processing: Run multiple GPU jobs with a shared pool of replicas that grows and shrinks based on actual GPU load.
  • Development clusters: Give data-science teams a vCluster with autoscaling GPU workloads while maintaining hard limits through maxReplicas.

How it works

With Private Nodes, vCluster workloads run directly on dedicated physical nodes attached to the vCluster rather than as pods synced to the control plane cluster. This means:

  • DCGM Exporter DaemonSet pods are scheduled directly on the private GPU nodes and have access to real GPU hardware through the NVIDIA Management Library (NVML).
  • GPU metrics reflect actual hardware utilization, not virtualized or approximated values.
  • HPA scaling decisions are based on real hardware load, making autoscaling reliable for GPU-intensive workloads.

For hardware metrics, the pipeline flows through four components:

GPU Node (Private)
└── DCGM Exporter DaemonSet       ← scrapes GPU hardware metrics
    └── Prometheus                ← collects via ServiceMonitor
        └── Prometheus Adapter    ← exposes as custom metrics API
            └── HPA               ← scales pods on GPU utilization

For model-serving metrics, the runtime exposes Prometheus metrics from the inference pod. Prometheus scrapes those metrics, and one of two paths then delivers them to the autoscaler:

  • Prometheus Adapter exposes them to HPA as custom metrics.
  • KEDA queries Prometheus directly and scales the Deployment from the query result.

Use hardware metrics for capacity signals such as GPU utilization and memory pressure. Use serving metrics for demand signals such as request concurrency, queue depth, time to first token, tokens per second, and p95 or p99 latency. Inference providers usually need both.

If you are building an inference provider platform, see Inference Provider: Managed Model Serving for the full production path including tenancy model, endpoint routing, and Day 2 operations.

Prerequisites

  • A vCluster with Private Nodes enabled and at least one GPU node attached
  • NVIDIA GPU drivers and the NVIDIA device plugin installed on the private GPU nodes, usually through the NVIDIA GPU Operator
  • helm and kubectl connected to the vCluster

Step 1: Install DCGM exporter

Install the NVIDIA DCGM Exporter inside the vCluster. It runs as a DaemonSet on the private GPU nodes and exposes per-GPU metrics in Prometheus format on port 9400 at the /metrics path.

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --create-namespace \
  --set kubernetes.enablePodLabels=true
Why enablePodLabels matters

The kubernetes.enablePodLabels=true flag instructs DCGM Exporter to attach pod, namespace, and container labels to each metric by mapping GPU usage to the consuming pod. Without this flag, Prometheus Adapter cannot expose per-pod custom metrics, and HPA cannot query them.

Verify the DaemonSet is running on your GPU node:

kubectl get daemonset dcgm-exporter -n monitoring

Confirm the metrics endpoint is reachable and returning GPU data:

DCGM_POD=$(kubectl get pods -l app.kubernetes.io/name=dcgm-exporter -n monitoring -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "$DCGM_POD" 9400:9400 -n monitoring &
sleep 2 # give the background port-forward time to start listening
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
# Expected: DCGM_FI_DEV_GPU_UTIL{gpu="0",...,pod="<your-pod>",namespace="<your-ns>",...} 42
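
If you would rather check the pod and namespace labels programmatically than read the grep output, a minimal Python sketch such as the following parses the exposition text and reports samples without pod attribution. The sample input is illustrative, the pod name `vllm-0` is hypothetical, and the parser assumes sample lines without trailing timestamps.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return (labels, value) pairs for one metric from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == metric:
            labels = dict(LABEL_RE.findall(match.group("labels")))
            samples.append((labels, float(match.group("value"))))
    return samples


def missing_pod_attribution(samples):
    """Return samples that lack a non-empty pod or namespace label."""
    return [s for s in samples
            if not s[0].get("pod") or not s[0].get("namespace")]


# Illustrative input; real DCGM output carries more labels per sample.
example = '''
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="vllm-0",namespace="inference"} 42
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="",namespace=""} 0
'''

samples = parse_samples(example)
for labels, value in missing_pod_attribution(samples):
    print(f"GPU {labels.get('gpu')} has no pod attribution (value {value})")
```

A GPU that no pod is using is expected to appear without pod attribution. It only indicates a problem when a workload is known to be running on that GPU.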

Step 2: Install Prometheus

Install the kube-prometheus-stack Helm chart inside the vCluster. This deploys Prometheus, the Prometheus Operator, and Alertmanager.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
serviceMonitorSelectorNilUsesHelmValues

Setting this to false tells Prometheus to discover all ServiceMonitor resources in the cluster, not only those created by the Helm release. This is required for Prometheus to find the DCGM Exporter ServiceMonitor created in the next step.

Verify that Prometheus pods are running:

kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

Step 3: Configure a ServiceMonitor for DCGM exporter

Create a ServiceMonitor so Prometheus automatically discovers and scrapes the DCGM Exporter:

servicemonitor-dcgm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Apply it:

kubectl apply -f servicemonitor-dcgm.yaml

Confirm Prometheus is scraping DCGM by checking the Prometheus targets UI:

kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets and verify dcgm-exporter shows State: UP
Service name

The Prometheus service name kube-prometheus-kube-prome-prometheus is generated from the Helm release name kube-prometheus. If you used a different release name, run kubectl get svc -n monitoring to find the correct service name.
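
To see why the name looks truncated, the sketch below approximates how the chart appears to build it: the release name and chart name are joined, cut to 26 characters, and suffixed with `-prometheus`. The truncation length and the join rule are assumptions about the chart's templating and may differ between chart versions, so treat the output of kubectl get svc as authoritative.

```python
def prometheus_service_name(release, chart="kube-prometheus-stack"):
    """Approximate the Prometheus service name for a given Helm release.

    Assumption: the chart joins release and chart name, truncates the
    result to 26 characters, trims one trailing dash, and then appends
    "-prometheus". This is not guaranteed for every chart version.
    """
    if chart in release:
        fullname = release
    else:
        fullname = f"{release}-{chart}"
    fullname = fullname[:26]
    if fullname.endswith("-"):
        fullname = fullname[:-1]
    return f"{fullname}-prometheus"


print(prometheus_service_name("kube-prometheus"))
```

With the release name used in this guide, the function reproduces the service name shown above.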

Step 4: Install Prometheus adapter

The Prometheus Adapter translates Prometheus metrics into the Kubernetes custom metrics API, which HPA can query.

Create a values file for the adapter that maps DCGM_FI_DEV_GPU_UTIL to a custom metric named gpu_utilization:

prometheus-adapter-values.yaml
prometheus:
  url: http://kube-prometheus-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Install the adapter:

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-values.yaml

Verify the custom metric is available:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name' | grep gpu
# Expected: "pods/gpu_utilization"
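
It can help to see roughly what the adapter sends to Prometheus when the HPA asks for gpu_utilization. The sketch below fills the metricsQuery template from the values file by plain string substitution. The pod names are hypothetical and the exact matcher syntax the adapter generates may differ, so read the result as an approximation, not as the adapter's literal output.

```python
def expand_metrics_query(template, namespace, pods, group_by="pod"):
    """Fill the adapter's template placeholders with concrete values."""
    pod_regex = "|".join(pods)
    matchers = f'namespace="{namespace}",pod=~"{pod_regex}"'
    return (template
            .replace("<<.LabelMatchers>>", matchers)
            .replace("<<.GroupBy>>", group_by))


# Template copied from prometheus-adapter-values.yaml
TEMPLATE = "avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)"

# Hypothetical pods belonging to the scaled Deployment
query = expand_metrics_query(
    TEMPLATE, "default", ["gpu-workload-a", "gpu-workload-b"]
)
print(query)
```

Pasting the printed query into the Prometheus UI is a quick way to check that the rule returns one series per pod.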

Available GPU metrics

The default DCGM Exporter profile exposes several metrics. You can use any of these as HPA targets by adding entries under rules.custom in the Prometheus Adapter config:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_POWER_USAGE | GPU power draw (W) |
| DCGM_FI_DEV_SM_CLOCK | Streaming multiprocessor clock (MHz) |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock (MHz) |

For memory-bound workloads such as large language model inference, DCGM_FI_DEV_FB_USED may be a better scaling signal than GPU core utilization.

Scale on serving metrics

GPU utilization is useful, but it is often a lagging signal for inference. A model server can have a growing request queue while GPU utilization is still low, or a saturated key-value cache while core utilization looks healthy. Add serving metrics from your runtime and use them alongside DCGM metrics.
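
When one HPA lists several metrics, it computes a replica proposal for each metric and acts on the largest. The sketch below illustrates that rule with the two metric names used in this guide and made-up readings. It ignores the HPA tolerance band and any stabilization window.

```python
import math


def proposal(current_replicas, current_avg, target_avg):
    """Replica count one metric asks for: ceil(replicas * current / target)."""
    return math.ceil(current_replicas * current_avg / target_avg)


def combine(current_replicas, metrics, min_replicas, max_replicas):
    """Take the largest per-metric proposal, clamped to the HPA bounds."""
    proposals = {
        name: proposal(current_replicas, current, target)
        for name, (current, target) in metrics.items()
    }
    chosen = max(proposals.values())
    return max(min_replicas, min(max_replicas, chosen)), proposals


# Illustrative readings as (current average, target average) per pod
readings = {
    "gpu_utilization": (42, 50),        # GPU still looks comfortable
    "inference_queue_depth": (12, 5),   # but requests are queueing
}

replicas, per_metric = combine(2, readings, min_replicas=1, max_replicas=8)
print(per_metric, "->", replicas)
```

In this example the queue metric drives the scale-up while GPU utilization alone would have kept the replica count unchanged.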

Common serving metrics include:

| Signal | Use when |
| --- | --- |
| Request concurrency | You want to keep active requests per replica below a service-level target. |
| Queue depth | You want to scale before queued requests increase latency. |
| Time to first token | You serve streaming responses and need to maintain low interactive latency. |
| Tokens per second | You need throughput-based scaling for generated-token workloads. |
| p95 or p99 latency | You scale based on customer-visible latency, often with smoothing to avoid flapping. |
| GPU memory or cache pressure | You need to protect model cache, KV cache, or batch admission behavior. |

Expose runtime metrics to Prometheus

Most serving runtimes expose Prometheus metrics on an HTTP endpoint such as /metrics. Add a port and a ServiceMonitor for the runtime. The exact port and metric names depend on the runtime.

servicemonitor-vllm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm-inference
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Apply it:

kubectl apply -f servicemonitor-vllm.yaml

Confirm Prometheus contains the runtime metric you want to scale on before configuring HPA or KEDA:

kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
# Query your runtime metric in Prometheus, for example vllm:num_requests_waiting
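
The same check can be scripted against the Prometheus HTTP API through the port-forward. The following is a minimal sketch that uses only the Python standard library. It builds an instant-query URL for `/api/v1/query` and turns the JSON response into a per-pod mapping. The canned response body and the pod name in it are illustrative, and `query_by_pod` is defined but not called here because it needs a reachable Prometheus.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(body):
    """Map each sample's pod label to its current value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        sample["metric"].get("pod", "<no pod label>"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def query_by_pod(base_url, promql):
    """Run the query against a live Prometheus, e.g. http://localhost:9090."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_instant_vector(resp.read())


# Canned response in the shape Prometheus returns for an instant vector
canned = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {"__name__": "vllm:num_requests_waiting",
                       "namespace": "inference",
                       "pod": "vllm-inference-abc"},
            "value": [1700000000.0, "3"],
        }],
    },
})

url = build_query_url("http://localhost:9090", "vllm:num_requests_waiting")
waiting = parse_instant_vector(canned)
print(url)
print(waiting)
```

An empty mapping means Prometheus has no series for the metric yet, in which case the ServiceMonitor selector and port name are the first things to check.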

HPA with Prometheus adapter

Add Prometheus Adapter rules for the runtime metric. Merge the serving rules into the same adapter values file you use for GPU metrics. This example maps vLLM waiting requests to a custom pod metric named inference_queue_depth.

prometheus-adapter-serving-values.yaml
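prometheus:
  url: http://kube-prometheus-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  custom:
    # GPU rule from Step 4, kept so the upgrade does not drop gpu_utilization
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    # Serving rule: vLLM waiting requests exposed as inference_queue_depth
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "vllm:num_requests_waiting"
        as: "inference_queue_depth"
      metricsQuery: 'avg(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'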

Upgrade the adapter with the merged GPU and serving metric rules:

helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-serving-values.yaml

Create an HPA that scales when average queue depth exceeds five waiting requests per pod:

inference-queue-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "5"

Apply it:

kubectl apply -f inference-queue-hpa.yaml
kubectl get hpa vllm-inference-hpa -n inference -w
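
To reason about what this HPA will do for a given queue depth, the sketch below applies the documented HPA rule, desiredReplicas = ceil(currentReplicas × currentValue / targetValue), with the bounds from the manifest above. The 10% tolerance is the Kubernetes default and is an assumption about your cluster configuration. The sketch does not model the 300-second scale-down stabilization window.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """Replica count the HPA would propose for a Pods/AverageValue metric."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    proposed = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_replicas, min(max_replicas, proposed))


# Target is 5 waiting requests per pod, as in inference-queue-hpa.yaml
print(desired_replicas(3, 9.0, 5))   # queue above target: scale up
print(desired_replicas(3, 5.2, 5))   # within tolerance: unchanged
print(desired_replicas(4, 30.0, 5))  # proposal capped by maxReplicas
print(desired_replicas(4, 1.0, 5))   # quiet period: scale down
```

Scale-down proposals such as the last one take effect only after the stabilization window has passed without a higher recommendation.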

KEDA with Prometheus

KEDA is often a good fit when your serving signal is naturally expressed as a Prometheus query. Install KEDA in the tenant cluster, then create a ScaledObject that queries Prometheus.

vllm-keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://kube-prometheus-kube-prome-prometheus.monitoring.svc:9090
        metricName: inference_queue_depth
        threshold: "5"
        query: sum(vllm:num_requests_waiting{namespace="inference",pod=~"vllm-inference-.*"})

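Use KEDA for query-driven scaling or scale-to-zero patterns. Use HPA with Prometheus Adapter when you want Kubernetes-native custom metrics and HPA behavior. In both cases, set conservative scale-down windows for model servers because loading large models is slow and repeated cold starts degrade response latency. Note that KEDA's cooldownPeriod only delays the final scale-down to zero replicas; scale-down between non-zero replica counts follows the HPA that KEDA manages, which you can tune through advanced.horizontalPodAutoscalerConfig.behavior in the ScaledObject spec.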
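
The ScaledObject above uses sum() in its query because, with KEDA's default AverageValue metric type, the threshold is treated as a per-replica target: the total is divided across replicas. A rough steady-state sketch, assuming that default and ignoring the HPA tolerance band, is:

```python
import math


def keda_desired_replicas(query_result, threshold,
                          min_replicas=1, max_replicas=8):
    """Approximate replica count for a summed query and per-replica threshold."""
    proposed = math.ceil(query_result / threshold)
    return max(min_replicas, min(max_replicas, proposed))


# threshold "5" and replica bounds taken from vllm-keda-scaledobject.yaml
print(keda_desired_replicas(23, 5))   # 23 waiting requests in total
print(keda_desired_replicas(0, 5))    # idle: held at minReplicaCount
print(keda_desired_replicas(100, 5))  # burst: capped at maxReplicaCount
```

If the query used avg() instead of sum(), the result would already be per pod and dividing it again would under-scale the Deployment.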

Step 5: Create an HPA targeting GPU utilization

Create an HPA that scales your GPU workload when average GPU utilization exceeds 50%. Replace gpu-workload with the name of your Deployment:

gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "50" # DCGM_FI_DEV_GPU_UTIL is 0–100; this targets 50% utilization

Apply the HPA:

kubectl apply -f gpu-hpa.yaml

Monitor scaling behavior:

kubectl get hpa gpu-workload-hpa -w
# NAME               REFERENCE                 TARGETS       MINPODS   MAXPODS   REPLICAS
# gpu-workload-hpa   Deployment/gpu-workload   42/50 (avg)   1         4         1

Troubleshoot common issues

Custom metric not found

If kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" returns an error or doesn't list gpu_utilization:

  1. Verify the Prometheus Adapter pod is running: kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-adapter
  2. Check Prometheus Adapter logs for configuration errors: kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter
  3. Confirm Prometheus contains DCGM data by port-forwarding to Prometheus and querying DCGM_FI_DEV_GPU_UTIL directly.

HPA shows unknown targets

This usually means the metric is registered but no data is available for the target pods:

  1. Verify your GPU workload pods are running and consuming GPU resources.
  2. Check that DCGM Exporter metrics include pod and namespace labels. If these labels are missing, confirm kubernetes.enablePodLabels=true is set.
  3. Wait 1–2 minutes for the metrics pipeline to propagate data from DCGM through Prometheus to the adapter.
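
To see which pods the adapter currently reports values for, you can fetch the metric directly, for example with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gpu_utilization", and inspect the returned list. The sketch below parses such a response. The canned body and pod name are illustrative, and the quantity parser handles only plain numbers and the milli suffix (`m`), which is an intentional simplification.

```python
import json


def parse_quantity(quantity):
    """Convert a Kubernetes quantity such as "42" or "42500m" to a float."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000  # milli-units
    return float(quantity)


def pods_with_metric(body):
    """Map pod name to metric value from a MetricValueList response."""
    payload = json.loads(body)
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in payload.get("items", [])
    }


# Canned response in the shape of the v1beta1 custom metrics API
canned = json.dumps({
    "kind": "MetricValueList",
    "apiVersion": "custom.metrics.k8s.io/v1beta1",
    "metadata": {},
    "items": [{
        "describedObject": {"kind": "Pod", "namespace": "default",
                            "name": "gpu-workload-abc", "apiVersion": "/v1"},
        "metricName": "gpu_utilization",
        "timestamp": "2024-01-01T00:00:00Z",
        "value": "42500m",
    }],
})

values = pods_with_metric(canned)
print(values)
```

An empty result while the metric is listed by the API matches the unknown-target symptom: the rule is registered, but no series carries the pod and namespace labels of the target pods.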