Model serving runtimes
Model-serving runtimes run inside the tenant cluster. vCluster provides the tenant Kubernetes API, GPU access through the selected worker node model, resource sync, and templates for installing supporting components. The serving runtime owns model loading, request handling, batching, tokenizer behavior, and runtime-specific metrics.
Use this page as a tenant-runtime pattern for inference endpoints. The examples show vLLM because it is easy to recognize, but the same structure applies to NVIDIA Triton, KServe, SGLang, Ray Serve, or a provider-owned runtime.
This page focuses on the private-node endpoint pattern, where the tenant cluster owns its worker-node software stack. Provider-owned shared serving pools can also run behind a product API, but they need a different tenancy and routing design.
Responsibility boundary
| Layer | Responsible component |
|---|---|
| Tenant Kubernetes API, object lifecycle, and sync | vCluster |
| GPU node provisioning and reclaim | Private Nodes, Auto Nodes, and optionally vMetal |
| GPU driver, device plugin, GPU Operator, MIG, vGPU, or Dynamic Resource Allocation | Node image and vendor components |
| Model server, batching, model loading, and inference API | Runtime such as vLLM, NVIDIA Triton, KServe, SGLang, or Ray Serve |
| External endpoint routing | Gateway API, ingress, service mesh, or provider traffic layer |
| Product API and customer lifecycle | Your inference provider platform |
Prerequisites
- A tenant cluster with Private Nodes enabled.
- At least one GPU node attached to the tenant cluster.
- A vendor GPU Operator, device plugin, or Dynamic Resource Allocation driver installed where your GPU stack requires it.
- A Gateway API controller, ingress controller, or provider traffic layer for external endpoint routing.
- A model source, registry credential, or object storage path your runtime can reach.
For GPU setup details, see GPU and accelerator support. For bare metal GPU node provisioning, see the vMetal GPU Quickstart.
Preflight the tenant cluster
Before installing a runtime, confirm the tenant cluster can see GPU capacity and the routing resources your template expects.
Check the GPU resource advertised by the private node:
kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
kubectl describe node <gpu-node-name> | grep -A5 Allocatable
For AMD or another accelerator vendor, replace nvidia.com/gpu with the resource name exposed by that vendor's device plugin or DRA driver.
If your endpoint uses an imported Gateway, confirm the tenant cluster can see it before creating routes:
kubectl get gateway -A
kubectl describe gateway shared-inference -n shared-gateways
If the Gateway is missing, fix the tenant cluster template before deploying the runtime. For imported Gateways, the template must enable sync.fromHost.gateways and sync.toHost.gatewayApi.httpRoutes. For tenant-owned Gateways, the template must enable the Gateway API sync resources your controller requires.
Deploy a smoke-test vLLM runtime
The following example shows a small vLLM Deployment and Service for validating GPU scheduling, model loading, and in-cluster serving. It uses a small public model to keep the example concise. Treat it as a smoke test, not as a production endpoint template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.5
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - facebook/opt-125m
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          env:
            - name: HF_HOME
              value: /models/cache
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: http
            failureThreshold: 60
            periodSeconds: 10
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /models/cache
      volumes:
        - name: model-cache
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: inference
spec:
  selector:
    app: vllm-inference
  ports:
    - name: http
      port: 80
      targetPort: http
Apply the manifest inside the tenant cluster:
kubectl create namespace inference
kubectl apply -f vllm-inference.yaml
kubectl rollout status deployment/vllm-inference -n inference
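Validate the Service through the tenant cluster API. The port-forward command keeps running in the foreground, so run the request from a second terminal:
# Terminal 1: forward local port 8000 to port 80 of the Service
kubectl port-forward svc/vllm-inference 8000:80 -n inference
# Terminal 2: list the models the runtime is serving
curl http://localhost:8000/v1/models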
If the pod stays pending, check node capacity, GPU resource names, project quotas, allowed node types, node selectors, and taints. If the pod starts but the model does not load, check model registry credentials, outbound network policy, model cache storage, and runtime logs.
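When a pod stays pending because of node selectors or taints, the scheduling fields in the pod template are a good first place to look. The following fragment is a minimal sketch of those fields, not a definitive configuration. It assumes GPU nodes carry the label nvidia.com/gpu.present=true and a NoSchedule taint with the key nvidia.com/gpu. Both are assumptions, so substitute the labels and taints your node type actually uses.

```yaml
# Fragment of the Deployment pod template (spec.template.spec), not a complete manifest.
# The node label and taint key below are assumptions. Inspect the real values with:
#   kubectl get nodes --show-labels
#   kubectl describe node <gpu-node-name> | grep -i taints
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # assumed GPU node label
      tolerations:
        - key: nvidia.com/gpu            # assumed taint key on GPU nodes
          operator: Exists
          effect: NoSchedule
```

If the selector matches no node, or the taint is not tolerated, the pod remains pending even when GPU capacity is free.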
Plan model storage
Large models make storage and startup behavior part of the endpoint architecture. Decide where model weights live, how access credentials are managed, how weights are cached, and how endpoint readiness behaves while the runtime loads them.
Common patterns include:
| Pattern | Use when | Tradeoff |
|---|---|---|
| Object storage, such as S3, GCS, Azure Blob, or MinIO | Models are shared across regions, endpoints, or customers and downloaded at startup. | Simple source of truth, but large models can make cold starts slow unless you add cache warming. |
| PersistentVolume model cache | A tenant or endpoint repeatedly loads the same model. | Reduces repeated downloads, but requires capacity planning, cleanup, and access controls. |
| Local NVMe or node image cache | You sell low-latency tiers for a small set of popular models. | Fastest startup after scheduling, but ties models to node images or node-local cache lifecycle. |
| Shared filesystem, such as NFS or a parallel filesystem | Multiple replicas need a shared model store inside the same environment. | Avoids repeated downloads, but can become a throughput bottleneck for large model loads. |
| Runtime image with model baked in | Small, fixed models change rarely. | Simple deployment, but large images slow pulls and make model rollout the same as image rollout. |
For production templates, make the model source, model version, credentials secret, cache volume, and cache size explicit parameters. The runtime template should mount registry or object-storage credentials, prepare the cache with an init container or sidecar when needed, and report readiness only after the model is loaded.
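As one illustration of the PersistentVolume model cache pattern, the sketch below declares a claim that could back the cache directory instead of the emptyDir volume used in the smoke test. The claim name, storage class, and size are hypothetical placeholders that a template would expose as parameters.

```yaml
# Hypothetical model cache claim. Name, storage class, and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache          # hypothetical claim name
  namespace: inference
spec:
  accessModes:
    - ReadWriteOnce               # one node at a time; shared caches need a ReadWriteMany-capable class
  storageClassName: fast-ssd      # hypothetical; use a class available in the tenant cluster
  resources:
    requests:
      storage: 100Gi              # assumed cache size; size it to the model weights
```

To use the claim, replace the emptyDir entry of the model-cache volume in the Deployment with a persistentVolumeClaim entry whose claimName is vllm-model-cache. With ReadWriteOnce, replicas scheduled on different nodes cannot share the claim, so multi-replica endpoints need either one claim per replica or a shared filesystem.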
For large models, plan warmup separately from pod scheduling. A pod can be Running while the endpoint is still downloading weights, building a cache, compiling kernels, or loading the model into GPU memory. Surface that state in your product API.
Build a production runtime template
Inference providers usually package the runtime into a Platform template or a provider-owned deployment workflow. A production endpoint template should include:
- pinned runtime images and a rollout policy
- model registry, object storage, or artifact repository credentials
- persistent or pre-warmed model cache storage
- model source, model version, cache size, and credential parameters
- resource requests and limits for CPU, memory, GPU, and ephemeral storage
- node selectors, affinities, tolerations, or runtime classes that match the endpoint tier
- readiness, startup, and liveness probes tuned to model load time
- PodDisruptionBudget and graceful shutdown for draining traffic
- NetworkPolicies and service account permissions
- metrics scraping, logs, traces, and runtime-specific dashboards
- authentication and authorization at the provider traffic layer
- route, DNS, and certificate policy
Keep model identity and customer-specific values as template parameters. Keep driver configuration, GPU presentation mode, and node image details in the GPU stack and node type instead of embedding them into every endpoint manifest.
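The PodDisruptionBudget item in the list above can be sketched as follows. This is a minimal example that reuses the app: vllm-inference label from the smoke-test Deployment and assumes the endpoint runs at least two replicas.

```yaml
# Minimal PodDisruptionBudget sketch for the vllm-inference pods.
# Assumes the Deployment runs two or more replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-inference
  namespace: inference
spec:
  maxUnavailable: 1               # allow voluntary eviction of one pod at a time
  selector:
    matchLabels:
      app: vllm-inference
```

With a single replica, maxUnavailable: 1 still allows the only pod to be evicted, so the budget protects availability only when the endpoint has spare replicas. Pair it with a terminationGracePeriodSeconds value long enough for in-flight requests to finish.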
For template guidance, see Templates, Quotas, and Allowed node types.
Expose an inference endpoint
For new HTTP endpoint deployments, prefer Gateway API. The most common provider model is:
- The platform team owns Gateway API CRDs, the Gateway controller, DNS, certificates, and shared Gateway resources in the control plane cluster.
- The tenant cluster imports approved Gateways with sync.fromHost.gateways.
- The endpoint template creates HTTPRoute resources in the tenant cluster.
- vCluster syncs the tenant route to the control plane cluster and enforces the Gateway attachment policy.
The tenant cluster template needs Gateway sync enabled. The following example maps a platform-owned Gateway to a tenant-facing Gateway and allows routes from selected tenant namespaces:
sync:
  fromHost:
    gatewayClasses:
      enabled: true
      selector:
        matchLabels:
          inference.example.com/sync: "yes"
    gateways:
      enabled: true
      selector:
        matchLabels:
          inference.example.com/sync: "yes"
      mappings:
        byName:
          "platform-gateways/public-inference": "shared-gateways/shared-inference"
      allowedRoutes:
        overrides:
          - hostNamespace: platform-gateways
            name: public-inference
            allowedHostnames:
              - "*.inference.example.com"
            virtualNamespacePolicy:
              from: Selector
              selector:
                matchLabels:
                  inference.example.com/route-access: "allowed"
  toHost:
    gatewayApi:
      httpRoutes:
        enabled: true
Label the namespace that may attach routes:
kubectl label namespace inference inference.example.com/route-access=allowed
For the full Gateway API model, see Gateway API, Gateway API sync, and imported Gateways and GatewayClasses.
Create an HTTPRoute in the tenant cluster that attaches to the imported Gateway and forwards traffic to the runtime Service:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-inference
  namespace: inference
spec:
  parentRefs:
    - name: shared-inference
      namespace: shared-gateways
  hostnames:
    - customer-a.inference.example.com
  rules:
    - backendRefs:
        - name: vllm-inference
          port: 80
Apply the route inside the tenant cluster:
kubectl apply -f vllm-route.yaml
kubectl describe httproute vllm-inference -n inference
Look for Accepted=True and ResolvedRefs=True conditions. If the route fails to become ready, check the imported Gateway, the allowed hostname list, listener policy, and Gateway API sync configuration.
After DNS points to the Gateway address, validate the endpoint externally:
curl -H "Host: customer-a.inference.example.com" http://<gateway-address>/v1/models
In production, handle TLS and enforce customer authentication in your provider traffic layer, API gateway, service mesh, or runtime sidecar. Do not expose unauthenticated model endpoints directly to the public internet.
Add autoscaling and observability
Start with runtime metrics and GPU hardware metrics. GPU and inference autoscaling shows how to expose NVIDIA DCGM metrics and model-serving metrics to HPA or KEDA.
For inference workloads, GPU utilization alone is rarely enough. Add runtime metrics such as:
- request concurrency
- queue depth
- time to first token
- tokens per second
- p95 and p99 latency
- GPU memory usage
- cache pressure
Use these signals with Horizontal Pod Autoscaler, KEDA, a runtime-specific autoscaler, or your provider control plane.
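As a hedged illustration of the Horizontal Pod Autoscaler option, the sketch below scales the smoke-test Deployment on a per-pod queue depth metric. The metric name inference_queue_depth is hypothetical, as are the replica bounds and the target value. The sketch also assumes a custom metrics adapter already serves that metric through the custom.metrics.k8s.io API.

```yaml
# HPA sketch that scales on a per-pod custom metric.
# The metric name, replica bounds, and target value are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4                          # each replica requests one GPU
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth     # hypothetical metric exposed by your adapter
        target:
          type: AverageValue
          averageValue: "5"               # assumed target of queued requests per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300     # wait before scaling down to avoid repeated model reloads
```

Because every new replica needs a GPU and has to load the model before it becomes ready, scale-up is only as fast as node provisioning and model warmup allow. Keep maxReplicas within the tenant quota and the capacity of the allowed node types.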
For the first production endpoint, verify that each signal is visible before enabling autoscaling:
kubectl top pods -n inference
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
kubectl logs deployment/vllm-inference -n inference
Validate provider readiness
Before exposing the endpoint to customers, validate the full provider path:
- Product automation creates or selects the project, template, quota, and allowed node type.
- The tenant cluster reaches Ready and the private GPU node joins.
- The GPU device plugin, GPU Operator, or DRA driver exposes the expected resource.
- The runtime pod schedules on the intended GPU node type.
- The model loads from the approved source and reaches readiness.
- The route reports Accepted=True and ResolvedRefs=True.
- The endpoint accepts authorized traffic and rejects unauthorized traffic.
- Logs, metrics, and alerts include endpoint, tenant, model, and node tier labels.
- Delete or scale-to-zero workflows drain traffic and reclaim GPU capacity.