GPU and inference autoscaling

Supported configurations: running the control plane as a container with Private Nodes.

This guide shows how to configure Horizontal Pod Autoscaler (HPA) and KEDA scaling inside a vCluster with Private Nodes. The metrics pipeline uses NVIDIA DCGM Exporter, Prometheus, and Prometheus Adapter to expose real GPU hardware metrics through the Kubernetes custom metrics API. The same Prometheus pipeline can also expose model-serving metrics such as queue depth, concurrency, and latency.

Use cases

GPU and inference autoscaling help when your workloads have variable GPU utilization or request load:

Inference serving: Scale model-serving Deployments up during traffic spikes and down during quiet periods to reduce idle GPU cost.
Batch processing: Run multiple GPU jobs with a shared pool of replicas that grows and shrinks based on actual GPU load.
Development clusters: Give data-science teams a vCluster with autoscaling GPU workloads while maintaining hard limits through maxReplicas.

How it works

With Private Nodes, vCluster workloads run directly on dedicated physical nodes rather than inside control plane cluster pods. This means:

DCGM Exporter DaemonSet pods are scheduled directly on the private GPU nodes and have access to real GPU hardware through the NVIDIA Management Library (NVML).
GPU metrics reflect actual hardware utilization, not virtualized or approximated values.
HPA scaling decisions are based on real hardware load, making autoscaling reliable for GPU-intensive workloads.

For hardware metrics, the pipeline flows through four components:

GPU Node (Private)
  └── DCGM Exporter DaemonSet    ← scrapes GPU hardware metrics
        └── Prometheus             ← collects via ServiceMonitor
              └── Prometheus Adapter  ← exposes as custom metrics API
                    └── HPA            ← scales pods on GPU utilization

For model-serving metrics, the runtime exposes Prometheus metrics from the inference pod. Prometheus scrapes those metrics, and they reach the autoscaler in one of two ways:

Prometheus Adapter exposes them to HPA as custom metrics.
KEDA queries Prometheus directly and scales the Deployment from the query result.

Use hardware metrics for capacity signals such as GPU utilization and memory pressure. Use serving metrics for demand signals such as request concurrency, queue depth, time to first token, tokens per second, and p95 or p99 latency. Inference providers usually need both.

If you are building an inference provider platform, see Inference Provider: Managed Model Serving for the full production path including tenancy model, endpoint routing, and Day 2 operations.

Prerequisites

A vCluster with Private Nodes enabled and at least one GPU node attached
NVIDIA GPU drivers and the NVIDIA device plugin installed on the private GPU nodes, usually through the NVIDIA GPU Operator
helm and kubectl connected to the vCluster

Step 1: Install DCGM exporter

Install the NVIDIA DCGM Exporter inside the vCluster. It runs as a DaemonSet on the private GPU nodes and exposes per-GPU metrics in Prometheus format on port 9400 at the /metrics path.

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --create-namespace \
  --set kubernetes.enablePodLabels=true

Why enablePodLabels matters

The kubernetes.enablePodLabels=true flag instructs DCGM Exporter to attach pod, namespace, and container labels to each metric by mapping GPU usage to the consuming pod. Without this flag, Prometheus Adapter cannot expose per-pod custom metrics, and HPA cannot query them.

Verify the DaemonSet is running on your GPU node:

kubectl get daemonset dcgm-exporter -n monitoring

Confirm the metrics endpoint is reachable and returning GPU data:

DCGM_POD=$(kubectl get pods -l app.kubernetes.io/name=dcgm-exporter -n monitoring -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "$DCGM_POD" 9400:9400 -n monitoring &
PF_PID=$!
sleep 2   # give the port-forward a moment to start listening
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
# Expected: DCGM_FI_DEV_GPU_UTIL{gpu="0",...,pod="<your-pod>",namespace="<your-ns>",...} 42
kill "$PF_PID"   # stop the background port-forward
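
To check the pod and namespace labels programmatically rather than by eye, a small parser over the exporter output can help. The following is a minimal sketch, assuming each sample is a single labeled line without a trailing timestamp; the helper names and the sample label values are made up for illustration.

```python
import re

# Matches one labeled sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one labeled sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labeled sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def missing_pod_labels(line, required=("pod", "namespace")):
    """List the required labels that are absent or empty on a sample line."""
    _, labels, _ = parse_sample(line)
    return [key for key in required if not labels.get(key)]


# Illustrative sample lines; the label values are invented for this sketch.
attributed = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="trainer-0",namespace="default"} 42'
unattributed = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="",namespace=""} 0'

print(missing_pod_labels(attributed))    # []
print(missing_pod_labels(unattributed))  # ['pod', 'namespace']
```

A sample with empty pod or namespace labels usually means no pod is currently using that GPU, so Prometheus Adapter has nothing to attribute it to.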

Step 2: Install Prometheus

Install the kube-prometheus-stack Helm chart inside the vCluster. This deploys Prometheus, the Prometheus Operator, and Alertmanager.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

serviceMonitorSelectorNilUsesHelmValues

Setting this to false tells Prometheus to discover all ServiceMonitor resources in the cluster, not only those created by the Helm release. This is required for Prometheus to find the DCGM Exporter ServiceMonitor created in the next step.

Verify that Prometheus pods are running:

kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

Step 3: Configure a ServiceMonitor for DCGM exporter

Create a ServiceMonitor so Prometheus automatically discovers and scrapes the DCGM Exporter:

servicemonitor-dcgm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Apply it:

kubectl apply -f servicemonitor-dcgm.yaml

Confirm Prometheus is scraping DCGM by checking the Prometheus targets UI:

kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets and verify dcgm-exporter shows State: UP

Service name

The Prometheus service name kube-prometheus-kube-prome-prometheus is generated from the Helm release name kube-prometheus. If you used a different release name, run kubectl get svc -n monitoring to find the correct service name.
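
The same check can be scripted against the Prometheus HTTP API, which serves target state at /api/v1/targets. The sketch below only parses a response that has already been fetched (for example with curl through the port-forward); the function name and the sample payload, including the IP addresses and the job label value, are illustrative assumptions.

```python
import json


def unhealthy_targets(payload, job_substring):
    """Return scrape URLs of active targets for a job whose health is not 'up'."""
    targets = payload["data"]["activeTargets"]
    return [
        target["scrapeUrl"]
        for target in targets
        if job_substring in target["labels"].get("job", "")
        and target["health"] != "up"
    ]


# Trimmed, made-up example of a /api/v1/targets response body.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "dcgm-exporter", "instance": "10.0.0.5:9400"},
       "scrapeUrl": "http://10.0.0.5:9400/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "dcgm-exporter", "instance": "10.0.0.6:9400"},
       "scrapeUrl": "http://10.0.0.6:9400/metrics",
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

print(unhealthy_targets(sample, "dcgm-exporter"))
# ['http://10.0.0.6:9400/metrics']
```

An empty list for the DCGM job corresponds to every target showing State: UP in the targets UI.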

Step 4: Install Prometheus adapter

The Prometheus Adapter translates Prometheus metrics into the Kubernetes custom metrics API, which HPA can query.

Create a values file for the adapter that maps DCGM_FI_DEV_GPU_UTIL to a custom metric named gpu_utilization:

prometheus-adapter-values.yaml
prometheus:
  url: http://kube-prometheus-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Install the adapter:

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-values.yaml

Verify the custom metric is available:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name' | grep gpu
# Expected: "pods/gpu_utilization"
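
It can help to see roughly what PromQL the adapter sends once it fills in the metricsQuery template. The sketch below approximates that substitution for a set of pods in one namespace. It is not the adapter's implementation: the function name is hypothetical, and the real adapter may order or format the label matchers differently.

```python
def render_metrics_query(template, namespace, pods, resource_label="pod"):
    """Approximate how the metricsQuery template placeholders are filled in."""
    # Several pods are matched with a single regex matcher (assumption of this sketch).
    pod_regex = "|".join(sorted(pods))
    label_matchers = f'namespace="{namespace}",{resource_label}=~"{pod_regex}"'
    return (
        template.replace("<<.LabelMatchers>>", label_matchers)
        .replace("<<.GroupBy>>", resource_label)
    )


TEMPLATE = "avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)"

# Pod names are placeholders for the pods behind the HPA's target Deployment.
query = render_metrics_query(TEMPLATE, "default", ["gpu-workload-a", "gpu-workload-b"])
print(query)
# avg(DCGM_FI_DEV_GPU_UTIL{namespace="default",pod=~"gpu-workload-a|gpu-workload-b"}) by (pod)
```

Pasting the rendered query into the Prometheus UI is a quick way to see whether the adapter would get any data back for those pods.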

Available GPU metrics

The default DCGM Exporter profile exposes several metrics. You can use any of these as HPA targets by adding entries under rules.custom in the Prometheus Adapter config:

Metric	Description
`DCGM_FI_DEV_GPU_UTIL`	GPU core utilization (%)
`DCGM_FI_DEV_FB_USED`	GPU framebuffer memory used (MiB)
`DCGM_FI_DEV_POWER_USAGE`	GPU power draw (W)
`DCGM_FI_DEV_SM_CLOCK`	Streaming multiprocessor clock (MHz)
`DCGM_FI_DEV_MEM_CLOCK`	Memory clock (MHz)

For memory-bound workloads such as large language model inference, DCGM_FI_DEV_FB_USED may be a better scaling signal than GPU core utilization.

Scale on serving metrics

GPU utilization is useful, but it is often a lagging signal for inference. A model server can have a growing request queue while GPU utilization is still low, or a saturated key-value cache while core utilization looks healthy. Add serving metrics from your runtime and use them alongside DCGM metrics.

Common serving metrics include:

Signal	Use when
Request concurrency	You want to keep active requests per replica below a service-level target.
Queue depth	You want to scale before queued requests increase latency.
Time to first token	You serve streaming responses and need to maintain low interactive latency.
Tokens per second	You need throughput-based scaling for generated-token workloads.
p95 or p99 latency	You scale based on customer-visible latency, often with smoothing to avoid flapping.
GPU memory or cache pressure	You need to protect model cache, KV cache, or batch admission behavior.

Expose runtime metrics to Prometheus

Most serving runtimes expose Prometheus metrics on an HTTP endpoint such as /metrics. Expose that endpoint as a named port on the runtime's Service (http in this example) and create a ServiceMonitor that selects the Service by label. The exact port and metric names depend on the runtime.

servicemonitor-vllm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm-inference
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Apply it:

kubectl apply -f servicemonitor-vllm.yaml

Confirm Prometheus contains the runtime metric you want to scale on before configuring HPA or KEDA:

kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
# Query your runtime metric in Prometheus, for example vllm:num_requests_waiting
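
If you prefer to script this check, the Prometheus HTTP API accepts instant queries at /api/v1/query. The following sketch builds the request URL and reads per-pod values out of a response; the helper names and the sample payload (pod names, timestamps, values) are assumptions for illustration, and the sketch does not perform the HTTP request itself.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def values_by_pod(payload):
    """Map each pod label to its current sample value from an instant query."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return {
        series["metric"].get("pod", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


url = instant_query_url("http://localhost:9090", "vllm:num_requests_waiting")
print(url)
# http://localhost:9090/api/v1/query?query=vllm%3Anum_requests_waiting

# Made-up response body in the shape returned for a vector result.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"namespace": "inference", "pod": "vllm-inference-abc"},
       "value": [1700000000.0, "3"]},
      {"metric": {"namespace": "inference", "pod": "vllm-inference-def"},
       "value": [1700000000.0, "9"]}
    ]
  }
}
""")

print(values_by_pod(sample))
# {'vllm-inference-abc': 3.0, 'vllm-inference-def': 9.0}
```

An empty result means Prometheus is not scraping the runtime yet, or the metric name differs for your runtime version.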

HPA with Prometheus adapter

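Add a Prometheus Adapter rule for the runtime metric. Helm replaces the whole rules.custom list on upgrade, so keep the GPU rule from Step 4 in the same values file; otherwise the upgrade removes gpu_utilization. This example keeps the GPU rule and maps vLLM waiting requests to a custom pod metric named inference_queue_depth.

prometheus-adapter-serving-values.yaml
prometheus:
  url: http://kube-prometheus-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  custom:
    # GPU rule from Step 4, kept so gpu_utilization stays available
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    # Serving rule: vLLM waiting requests per pod
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "vllm:num_requests_waiting"
        as: "inference_queue_depth"
      metricsQuery: 'avg(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'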

Upgrade the adapter with the merged GPU and serving metric rules:

helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-serving-values.yaml

Create an HPA that scales when average queue depth exceeds five waiting requests per pod:

inference-queue-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "5"

Apply it:

kubectl apply -f inference-queue-hpa.yaml
kubectl get hpa vllm-inference-hpa -n inference -w
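
The stabilizationWindowSeconds: 300 setting is what keeps this HPA from removing replicas the moment the queue drains. For scale-down, the HPA looks at the replica counts it recommended during the window and acts on the highest one. The sketch below models only that rule; the function name and the recommendation history are hypothetical, and the real controller also applies scaling policies and tolerance.

```python
def stabilized_replicas(history, now, window_seconds, desired):
    """Pick the highest recommendation seen within the stabilization window.

    history is a list of (timestamp_seconds, recommended_replicas) pairs.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    return max(recent + [desired])


# Made-up history: load fell from 6 recommended replicas to 2 over four minutes.
history = [(0, 6), (120, 4), (240, 2)]

# At t=290s the 6-replica recommendation is still inside the 300s window.
print(stabilized_replicas(history, now=290, window_seconds=300, desired=2))  # 6

# At t=310s it has aged out, so the next-highest recommendation applies.
print(stabilized_replicas(history, now=310, window_seconds=300, desired=2))  # 4
```

A longer window trades idle GPU time for fewer model reloads, which is often the right trade for large models.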

KEDA with Prometheus

KEDA is often a good fit when your serving signal is naturally expressed as a Prometheus query. Install KEDA in the tenant cluster, then create a ScaledObject that queries Prometheus.

vllm-keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://kube-prometheus-kube-prome-prometheus.monitoring.svc:9090
        metricName: inference_queue_depth
        threshold: "5"
        query: sum(vllm:num_requests_waiting{namespace="inference",pod=~"vllm-inference-.*"})

Use KEDA for query-driven scaling or scale-to-zero patterns. Use HPA with Prometheus Adapter when you want Kubernetes-native custom metrics and HPA behavior. In both cases, set conservative scale-down windows for model servers because loading large models is slow and repeated cold starts degrade response latency.
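
Because the query in the ScaledObject returns the sum of waiting requests across all pods, the threshold acts as a per-replica target. As a rough model, and assuming KEDA's default AverageValue metric type, the replica count follows from dividing the query result by the threshold. This sketch ignores HPA tolerance, activation thresholds, and the cooldown period; the function name is hypothetical.

```python
import math


def keda_desired_replicas(query_result, threshold, min_replicas, max_replicas):
    """Approximate replica count for a summed metric and a per-replica threshold."""
    raw = math.ceil(query_result / threshold)
    # Clamp to the bounds set by minReplicaCount and maxReplicaCount.
    return max(min_replicas, min(max_replicas, raw))


# 23 waiting requests in total with a threshold of 5 per replica.
print(keda_desired_replicas(23, 5, min_replicas=1, max_replicas=8))   # 5

# An empty queue falls back to the minimum replica count.
print(keda_desired_replicas(0, 5, min_replicas=1, max_replicas=8))    # 1

# A very deep queue is capped at the maximum.
print(keda_desired_replicas(100, 5, min_replicas=1, max_replicas=8))  # 8
```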

Step 5: Create an HPA targeting GPU utilization

Create an HPA that scales your GPU workload when average GPU utilization exceeds 50%. Replace gpu-workload with the name of your Deployment:

gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "50"   # DCGM_FI_DEV_GPU_UTIL is 0–100; this targets 50% utilization

Apply the HPA:

kubectl apply -f gpu-hpa.yaml

Monitor scaling behavior:

kubectl get hpa gpu-workload-hpa -w
# NAME               REFERENCE                  TARGETS        MINPODS   MAXPODS   REPLICAS
# gpu-workload-hpa   Deployment/gpu-workload    42/50 (avg)    1         4         1
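
To interpret the TARGETS column, it helps to recall how the HPA turns a metric reading into a replica count: it multiplies the current replica count by the ratio of the current value to the target, rounds up, and skips changes when the ratio is within a tolerance (10% by default). The sketch below is a simplified model of that calculation using this guide's bounds; the function name is hypothetical, and it leaves out pod readiness handling and stabilization.

```python
import math


def hpa_desired_replicas(current_replicas, current_avg, target_avg,
                         min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation for a Pods metric with an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 42/50 with one replica: below target, but already at minReplicas.
print(hpa_desired_replicas(1, 42, 50, min_replicas=1, max_replicas=4))  # 1

# 90% average utilization across two replicas: scale up.
print(hpa_desired_replicas(2, 90, 50, min_replicas=1, max_replicas=4))  # 4

# 52/50 is within the 10% tolerance, so nothing changes.
print(hpa_desired_replicas(2, 52, 50, min_replicas=1, max_replicas=4))  # 2
```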

Troubleshoot common issues

Custom metric not found

If kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" returns an error or doesn't list gpu_utilization:

Verify the Prometheus Adapter pod is running: kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-adapter
Check Prometheus Adapter logs for configuration errors: kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter
Confirm Prometheus contains DCGM data by port-forwarding to Prometheus and querying DCGM_FI_DEV_GPU_UTIL directly.

HPA shows unknown targets

This usually means the metric is registered but no data is available for the target pods:

Verify your GPU workload pods are running and consuming GPU resources.
Check that DCGM Exporter metrics include pod and namespace labels. If these labels are missing, confirm kubernetes.enablePodLabels=true is set.
Wait 1–2 minutes for the metrics pipeline to propagate data from DCGM through Prometheus to the adapter.
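
To see exactly which pods the adapter has data for, you can read the metric directly with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gpu_utilization" and inspect the returned list. The sketch below parses such a response; the helper names and the sample payload are illustrative assumptions, and the quantity parser handles only plain numbers and the milli suffix (m).

```python
import json


def quantity_to_float(quantity):
    """Convert a Kubernetes quantity such as '42' or '37500m' to a float."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def pods_with_values(metric_value_list):
    """Map pod name to metric value from a custom metrics API response."""
    return {
        item["describedObject"]["name"]: quantity_to_float(item["value"])
        for item in metric_value_list["items"]
    }


# Made-up response body; pod names and values are placeholders.
sample = json.loads("""
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "items": [
    {"describedObject": {"kind": "Pod", "namespace": "default",
                         "name": "gpu-workload-abc"},
     "metricName": "gpu_utilization", "value": "42"},
    {"describedObject": {"kind": "Pod", "namespace": "default",
                         "name": "gpu-workload-def"},
     "metricName": "gpu_utilization", "value": "37500m"}
  ]
}
""")

print(pods_with_values(sample))
# {'gpu-workload-abc': 42.0, 'gpu-workload-def': 37.5}
```

An empty items list for the HPA's target pods matches the unknown targets symptom: the metric is registered, but no series carries those pods' labels.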
