GPU and inference autoscaling
This guide shows how to configure Horizontal Pod Autoscaler (HPA) and KEDA scaling inside a vCluster with Private Nodes. The metrics pipeline uses NVIDIA DCGM Exporter, Prometheus, and Prometheus Adapter to expose real GPU hardware metrics through the Kubernetes custom metrics API. The same Prometheus pipeline can also expose model-serving metrics such as queue depth, concurrency, and latency.
Use cases
GPU and inference autoscaling help when your workloads have variable GPU utilization or request load:
- Inference serving: Scale model-serving Deployments up during traffic spikes and down during quiet periods to reduce idle GPU cost.
- Batch processing: Run multiple GPU jobs with a shared pool of replicas that grows and shrinks based on actual GPU load.
- Development clusters: Give data-science teams a vCluster with autoscaling GPU workloads while maintaining hard limits through maxReplicas.
How it works
With Private Nodes, vCluster workloads run directly on dedicated physical nodes rather than inside control plane cluster pods. This means:
- DCGM Exporter DaemonSet pods are scheduled directly on the private GPU nodes and have access to real GPU hardware through the NVIDIA Management Library (NVML).
- GPU metrics reflect actual hardware utilization, not virtualized or approximated values.
- HPA scaling decisions are based on real hardware load, making autoscaling reliable for GPU-intensive workloads.
For hardware metrics, the pipeline flows through four components:
GPU Node (Private)
└── DCGM Exporter DaemonSet          ← scrapes GPU hardware metrics
    └── Prometheus                   ← collects via ServiceMonitor
        └── Prometheus Adapter       ← exposes as custom metrics API
            └── HPA                  ← scales pods on GPU utilization
For model-serving metrics, the runtime exposes Prometheus metrics from the inference pod. Prometheus scrapes those metrics, and one of two components turns them into a scaling signal:
- Prometheus Adapter exposes them to HPA as custom metrics.
- KEDA queries Prometheus directly and scales the Deployment from the query result.
Use hardware metrics for capacity signals such as GPU utilization and memory pressure. Use serving metrics for demand signals such as request concurrency, queue depth, time to first token, tokens per second, and p95 or p99 latency. Inference providers usually need both.
If you are building an inference provider platform, see Inference Provider: Managed Model Serving for the full production path including tenancy model, endpoint routing, and Day 2 operations.
Prerequisites
- A vCluster with Private Nodes enabled and at least one GPU node attached
- NVIDIA GPU drivers and the NVIDIA device plugin installed on the private GPU nodes, usually through the NVIDIA GPU Operator
- helm and kubectl connected to the vCluster
Step 1: Install DCGM exporter
Install the NVIDIA DCGM Exporter inside the vCluster. It runs as a DaemonSet on the private GPU nodes and exposes per-GPU metrics in Prometheus format on port 9400 at the /metrics path.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --create-namespace \
  --set kubernetes.enablePodLabels=true
The kubernetes.enablePodLabels=true flag instructs DCGM Exporter to attach pod, namespace, and container labels to each metric by mapping GPU usage to the consuming pod. Without this flag, Prometheus Adapter cannot expose per-pod custom metrics, and HPA cannot query them.
Verify the DaemonSet is running on your GPU node:
kubectl get daemonset dcgm-exporter -n monitoring
Confirm the metrics endpoint is reachable and returning GPU data:
DCGM_POD=$(kubectl get pods -l app.kubernetes.io/name=dcgm-exporter -n monitoring -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "$DCGM_POD" 9400:9400 -n monitoring &
sleep 2  # wait for the background port-forward to start listening
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
# Expected: DCGM_FI_DEV_GPU_UTIL{gpu="0",...,pod="<your-pod>",namespace="<your-ns>",...} 42
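To check the pod labels without reading the raw output by eye, the scrape can be summarized with a small filter. The following is a minimal sketch and not part of DCGM Exporter: count_pod_labels is a hypothetical helper, and the sample input is made up for illustration. It reads Prometheus text format on standard input and counts how many DCGM_FI_DEV_GPU_UTIL samples carry a non-empty pod label.

```shell
# Hypothetical helper: count DCGM_FI_DEV_GPU_UTIL samples with a non-empty
# pod label. Reads Prometheus text exposition format on stdin.
count_pod_labels() {
  awk '
    /^DCGM_FI_DEV_GPU_UTIL\{/ {
      total++
      if ($0 ~ /[{,]pod="[^"]+"/) labeled++
    }
    END { printf "labeled=%d total=%d\n", labeled, total }
  '
}

# Sample input for illustration only; real exporter output has more labels.
count_pod_labels <<'EOF'
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",namespace="default",pod="gpu-workload-abc"} 42
DCGM_FI_DEV_GPU_UTIL{gpu="1",namespace="",pod=""} 0
EOF
# prints: labeled=1 total=2
```

Against a live exporter, pipe the scrape into the helper: curl -s http://localhost:9400/metrics | count_pod_labels. A result of labeled=0 suggests that either no pod is currently using a GPU or the pod label mapping is not enabled.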
Step 2: Install Prometheus
Install the kube-prometheus-stack Helm chart inside the vCluster. This deploys Prometheus, the Prometheus Operator, and Alertmanager.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
Setting serviceMonitorSelectorNilUsesHelmValues to false tells Prometheus to discover all ServiceMonitor resources in the cluster, not only those created by the Helm release. This is required for Prometheus to find the DCGM Exporter ServiceMonitor created in the next step.
Verify that Prometheus pods are running:
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
Step 3: Configure a ServiceMonitor for DCGM exporter
Create a ServiceMonitor so Prometheus automatically discovers and scrapes the DCGM Exporter:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
Save the manifest as servicemonitor-dcgm.yaml and apply it:
kubectl apply -f servicemonitor-dcgm.yaml
Confirm Prometheus is scraping DCGM by checking the Prometheus targets UI:
kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets and verify dcgm-exporter shows State: UP
The Prometheus service name kube-prometheus-kube-prome-prometheus is generated from the Helm release name kube-prometheus. If you used a different release name, run kubectl get svc -n monitoring to find the correct service name.
Step 4: Install Prometheus adapter
The Prometheus Adapter translates Prometheus metrics into the Kubernetes custom metrics API, which HPA can query.
Create a values file for the adapter, named prometheus-adapter-values.yaml, that maps DCGM_FI_DEV_GPU_UTIL to a custom metric named gpu_utilization:
prometheus:
  url: http://kube-prometheus-kube-prome-prometheus.monitoring.svc
  port: 9090
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Install the adapter:
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-values.yaml
Verify the custom metric is available:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name' | grep gpu
# Expected: "pods/gpu_utilization"
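Listing the metric name only shows that the adapter registered it. To read the values that HPA would see, the custom metrics API can also be queried per namespace. The sketch below assumes the workload runs in the default namespace; custom_metric_path is a hypothetical helper that only builds the request path.

```shell
# Hypothetical helper: build the custom metrics API path for a pod metric.
# $1 = namespace, $2 = metric name as exposed by Prometheus Adapter.
custom_metric_path() {
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/pods/*/%s\n' "$1" "$2"
}

custom_metric_path default gpu_utilization
# prints: /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gpu_utilization

# Against a live cluster (requires kubectl and jq):
# kubectl get --raw "$(custom_metric_path default gpu_utilization)" \
#   | jq '.items[] | {pod: .describedObject.name, value: .value}'
```

If the items list comes back empty, the metric is registered but no pod in that namespace currently has a matching series.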
Available GPU metrics
The default DCGM Exporter profile exposes several metrics. You can use any of these as HPA targets by adding entries under rules.custom in the Prometheus Adapter config:
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_POWER_USAGE | GPU power draw (W) |
| DCGM_FI_DEV_SM_CLOCK | Streaming multiprocessor clock (MHz) |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock (MHz) |
For memory-bound workloads such as large language model inference, DCGM_FI_DEV_FB_USED may be a better scaling signal than GPU core utilization.
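As an illustration of such an entry, the following sketch writes an adapter rule for framebuffer memory to a scratch file. The file name gpu-memory-rule.yaml and the custom metric name gpu_memory_used_mib are assumptions chosen for this example. The rule mirrors the structure of the gpu_utilization rule and would need to be added under rules.custom in the adapter values file.

```shell
# Write an example rules.custom entry for DCGM_FI_DEV_FB_USED.
# The quoted 'EOF' keeps the shell from expanding anything inside the block.
cat > gpu-memory-rule.yaml <<'EOF'
- seriesQuery: 'DCGM_FI_DEV_FB_USED{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace:
        resource: namespace
      pod:
        resource: pod
  name:
    matches: "DCGM_FI_DEV_FB_USED"
    as: "gpu_memory_used_mib"   # hypothetical metric name
  metricsQuery: 'avg(DCGM_FI_DEV_FB_USED{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
EOF
```

An HPA would then target gpu_memory_used_mib with an averageValue expressed in MiB, the unit of the underlying DCGM metric.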
Scale on serving metrics
GPU utilization is useful, but it is often a lagging signal for inference. A model server can have a growing request queue while GPU utilization is still low, or a saturated key-value cache while core utilization looks healthy. Add serving metrics from your runtime and use them alongside DCGM metrics.
Common serving metrics include:
| Signal | Use when |
|---|---|
| Request concurrency | You want to keep active requests per replica below a service-level target. |
| Queue depth | You want to scale before queued requests increase latency. |
| Time to first token | You serve streaming responses and need to maintain low interactive latency. |
| Tokens per second | You need throughput-based scaling for generated-token workloads. |
| p95 or p99 latency | You scale based on customer-visible latency, often with smoothing to avoid flapping. |
| GPU memory or cache pressure | You need to protect model cache, KV cache, or batch admission behavior. |
Expose runtime metrics to Prometheus
Most serving runtimes expose Prometheus metrics on an HTTP endpoint such as /metrics. Add a port and a ServiceMonitor for the runtime. The exact port and metric names depend on the runtime.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm-inference
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Save the manifest as servicemonitor-vllm.yaml and apply it:
kubectl apply -f servicemonitor-vllm.yaml
Confirm Prometheus contains the runtime metric you want to scale on before configuring HPA or KEDA:
kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
# Query your runtime metric in Prometheus, for example vllm:num_requests_waiting
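The same check can be scripted against the Prometheus HTTP API instead of the web UI. This is a sketch that assumes the port-forward above is still running on localhost:9090. prom_query is a hypothetical wrapper, and the PROM_URL variable is introduced only for this example.

```shell
# Hypothetical wrapper around the Prometheus instant-query endpoint.
# PROM_URL defaults to the local port-forward used in this guide.
prom_query() {
  curl -s --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage with the port-forward running:
# prom_query 'vllm:num_requests_waiting'
```

An empty result array in the JSON response means Prometheus holds no samples for that series yet, so an HPA or KEDA trigger built on it would have nothing to act on.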
HPA with Prometheus adapter
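Add Prometheus Adapter rules for the runtime metric. The example below shows only the serving rule, which maps vLLM waiting requests to a custom pod metric named inference_queue_depth. In the file you pass to Helm, keep the gpu_utilization rule from Step 4 in the same rules.custom list, because Helm replaces list values from a values file rather than merging them.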
prometheus:
  url: http://kube-prometheus-kube-prome-prometheus.monitoring.svc
  port: 9090
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "vllm:num_requests_waiting"
        as: "inference_queue_depth"
      metricsQuery: 'avg(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Upgrade the adapter with the merged GPU and serving metric rules:
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-serving-values.yaml
Create an HPA that scales when average queue depth exceeds five waiting requests per pod:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
Save the manifest as inference-queue-hpa.yaml, apply it, and watch the HPA:
kubectl apply -f inference-queue-hpa.yaml
kubectl get hpa vllm-inference-hpa -n inference -w
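To see how the averageValue target turns into a replica count, the documented HPA formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), can be worked through by hand. The sketch below uses integer shell arithmetic and clamps the result to the minReplicas and maxReplicas values from the manifest. It deliberately ignores the HPA tolerance band and the stabilization window, and desired_replicas is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: approximate the HPA replica calculation for an
# AverageValue target using integer inputs.
# $1 = current replicas, $2 = current average metric per pod,
# $3 = target average, $4 = minReplicas, $5 = maxReplicas
desired_replicas() {
  # Integer ceiling of (replicas * current / target).
  want=$(( ($1 * $2 + $3 - 1) / $3 ))
  [ "$want" -lt "$4" ] && want=$4
  [ "$want" -gt "$5" ] && want=$5
  echo "$want"
}

desired_replicas 2 12 5 1 8   # 2 pods averaging 12 waiting requests -> prints 5
desired_replicas 4 30 5 1 8   # ceil(4 * 30 / 5) = 24, capped by maxReplicas -> prints 8
```

Because the real controller also applies the 300-second scale-down stabilization window from the manifest, replica counts fall more slowly than this arithmetic alone suggests.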
KEDA with Prometheus
KEDA is often a good fit when your serving signal is naturally expressed as a Prometheus query. Install KEDA inside the vCluster, then create a ScaledObject that queries Prometheus.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://kube-prometheus-kube-prome-prometheus.monitoring.svc:9090
        metricName: inference_queue_depth
        threshold: "5"
        query: sum(vllm:num_requests_waiting{namespace="inference",pod=~"vllm-inference-.*"})
Use KEDA for query-driven scaling or scale-to-zero patterns. Use HPA with Prometheus Adapter when you want Kubernetes-native custom metrics and HPA behavior. In both cases, set conservative scale-down windows for model servers because loading large models is slow and repeated cold starts degrade response latency.
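One difference between the two approaches is visible in the queries: the adapter rule averages per pod, while the KEDA query returns a sum across all matching pods. Assuming the trigger uses KEDA's default metricType of AverageValue, the HPA that KEDA manages divides that total by the threshold, so the target replica count is roughly ceil(queryResult / threshold). A minimal sketch with a hypothetical helper named keda_replicas, which leaves out tolerance and stabilization:

```shell
# Hypothetical helper: replicas implied by a summed query result.
# $1 = query result (total waiting requests), $2 = threshold per replica,
# $3 = minReplicaCount, $4 = maxReplicaCount
keda_replicas() {
  want=$(( ($1 + $2 - 1) / $2 ))   # integer ceiling of total / threshold
  [ "$want" -lt "$3" ] && want=$3
  [ "$want" -gt "$4" ] && want=$4
  echo "$want"
}

keda_replicas 23 5 1 8   # 23 waiting requests in total -> prints 5
keda_replicas 0 5 1 8    # empty queue stays at minReplicaCount -> prints 1
```

If the query used avg instead of sum, the same threshold would behave very differently, so it is worth checking which aggregation a trigger uses before reusing a threshold from an HPA.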
Step 5: Create an HPA targeting GPU utilization
Create an HPA that scales your GPU workload when average GPU utilization exceeds 50%. Replace gpu-workload with the name of your Deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "50" # DCGM_FI_DEV_GPU_UTIL is 0–100; this targets 50% utilization
Save the manifest as gpu-hpa.yaml and apply the HPA:
kubectl apply -f gpu-hpa.yaml
Monitor scaling behavior:
kubectl get hpa gpu-workload-hpa -w
# NAME               REFERENCE                 TARGETS       MINPODS   MAXPODS   REPLICAS
# gpu-workload-hpa   Deployment/gpu-workload   42/50 (avg)   1         4         1
Troubleshoot common issues
Custom metric not found
If kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" returns an error or doesn't list gpu_utilization:
- Verify the Prometheus Adapter pod is running:
  kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-adapter
- Check Prometheus Adapter logs for configuration errors:
  kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter
- Confirm Prometheus contains DCGM data by port-forwarding to Prometheus and querying DCGM_FI_DEV_GPU_UTIL directly.
HPA shows unknown targets
This usually means the metric is registered but no data is available for the target pods:
- Verify your GPU workload pods are running and consuming GPU resources.
- Check that DCGM Exporter metrics include pod and namespace labels. If these labels are missing, confirm kubernetes.enablePodLabels=true is set.
- Wait 1–2 minutes for the metrics pipeline to propagate data from DCGM through Prometheus to the adapter.