Version: main 🚧

Inference Provider: Managed Model Serving

Build managed inference endpoints on your GPU infrastructure. Customers interact with your product API and model endpoints. Platform and vCluster form the operations layer your team runs behind it.

Typical stack:

  • Platform as the management plane.
  • If you already operate Kubernetes, run vCluster control planes as pods on that cluster. Use Standalone (HA) when you need vCluster to bootstrap the control plane on bare metal or VMs.
  • Private GPU nodes for dedicated inference customers.
  • vMetal when you own the bare metal GPU lifecycle; cloud GPU instances, manually joined nodes, or Auto Nodes when your capacity comes from a cloud GPU provider.
  • vNode, a separate vCluster Labs product, as optional runtime isolation for custom containers, adapters, plugins, or other untrusted code.
[Diagram: Inference provider architecture. Customers call a product API and inference endpoints. The provider product uses vCluster Platform to manage isolated inference environments backed by GPU capacity. Customer API boundary: customer apps (SDKs, API clients, UI), product API (create, scale, update, delete), traffic layer (auth, routing, rate limits), and endpoint (/v1/chat, /v1/models). Operator environment, hidden from customers: Endpoint Tier A and Endpoint Tier B (each a tenant cluster with a model runtime template such as vLLM, Triton, or KServe, plus model cache, secrets, metrics, and GPU nodes), an optional shared serving pool for trusted provider models, and vCluster Platform on a control plane cluster (EKS, AKS, GKE, cloud VMs, or vCluster Standalone) that manages templates, quotas, allowed node types, access keys, and endpoint lifecycle. GPU capacity comes from cloud instances, Auto Nodes, manually joined nodes, or vMetal.]
Product API and inference endpoint boundary backed by isolated runtime environments and GPU capacity

What makes this path different: Customers don't use Platform directly. They call your product API or inference endpoint. Your product uses Platform and vCluster APIs to create tenant clusters, attach GPU capacity, apply templates, enforce quotas, publish routes, and reclaim capacity when endpoints are deleted.

Internal inference platform?

If your users are internal teams that should provision and manage their own environments in Platform, start with Enterprise AI Factory. Use this inference provider path when your product hides Platform and exposes an inference endpoint lifecycle instead.

Day 0: Design decisions

| Decision | Read next | Outcome |
| --- | --- | --- |
| Choose the customer-facing product boundary | Platform API, access keys, Projects | Decide which actions your product exposes, such as endpoint creation, model selection, GPU tier selection, quota changes, endpoint updates, and deletion. Customers shouldn't need Platform access. |
| Choose the inference tenancy model | Choose an inference tenancy model, Architecture | Decide whether each customer, model family, or endpoint tier gets a shared serving pool, a tenant cluster, private GPU nodes, or vNode runtime isolation. |
| Plan GPU capacity classes | Node providers, vMetal inference fleet capacity, GPU and accelerators | Define node types by GPU model, GPU count, region, provider, rack, and reservation model. Use vMetal for owned bare metal fleets, or cloud GPU instances and Auto Nodes for cloud capacity. Keep MIG, vGPU, and Dynamic Resource Allocation decisions in the vendor stack. |
| Define serving runtime templates | Model serving runtimes, Templates, Certified Stacks | Standardize the components each endpoint receives, such as GPU Operator, scheduler, model server, ingress or Gateway API routes, metrics, and secrets. |
| Plan model storage and warmup | Model storage architecture, Model serving runtimes | Decide where model weights live, how they reach each endpoint, how caches are warmed, and how readiness is reported while large models load. |
| Plan endpoint routing | Model serving runtimes, Gateway API | Decide whether tenant clusters attach routes to platform-owned shared Gateways, tenant-owned Gateways, or another ingress layer controlled by your product. |
| Plan autoscaling signals | GPU and inference autoscaling, Monitoring overview | Combine hardware metrics, such as GPU utilization and memory, with serving metrics, such as request concurrency, queue depth, latency, tokens per second, and time to first token. |
| Plan billing and metering | vBilling, Fleet monitoring | Decide how you attribute GPU-hours, node-hours, storage, egress, and request or token usage to customers. vBilling is an experimental vCluster Labs project for metering tenant clusters and streaming usage events to a billing adapter. |
| Plan endpoint readiness and SLA | Endpoint readiness and warm pools, Model serving runtimes | Decide whether endpoints are cold-provisioned, pre-warmed, or backed by warm pools. Account for node provisioning time, image pulls, model downloads, and model load time. |
| Confirm product tiers and licensing | Open Source vs Free tier, vCluster pricing | Confirm which parts of the stack require Platform activation or a paid tier. Sidebar badges show feature tier requirements, but commercial planning usually happens before implementation. |
| Plan durability and operations | Backing store, Platform HA, Platform backup | Choose data stores, backup policy, control plane availability, and recovery procedures before customers depend on endpoints. |

Choose an inference tenancy model

Inference providers usually need more than one tenancy model. Use the model that matches the customer's trust level, performance expectation, and isolation requirement.

| Model | Use when | Tradeoff |
| --- | --- | --- |
| Shared serving tier | Customers use provider-owned models and don't run custom containers or plugins. | Highest utilization, but weaker customer isolation. Use this for trusted workloads and low-risk endpoint tiers. |
| Tenant cluster per customer | Customers need their own Kubernetes API boundary, custom resources, secrets, and lifecycle. | Strong control plane isolation without automatically dedicating every GPU node. |
| Private GPU nodes | Customers need dedicated worker-node infrastructure, predictable performance, separate CNI/CSI, or compliance isolation. | Stronger isolation and performance predictability, with more capacity planning. |
| vNode runtime isolation | Customers run untrusted code, custom containers, privileged workloads, adapters, or dynamic execution paths on shared nodes. | Adds a workload sandbox boundary, but doesn't replace private nodes when customers require separate infrastructure. |

Tenant clusters are a strong default when each customer or endpoint tier needs its own Kubernetes API boundary, custom resources, secrets, lifecycle, or compliance requirements. A single tenant cluster can host multiple model-serving Deployments behind separate Services and routes, so "tenant cluster" does not have to mean one model per cluster. Cost-sensitive providers can start with a shared serving tier for trusted provider-owned models, then add tenant clusters and private GPU nodes for dedicated or higher-isolation tiers. Add vNode when custom inference workloads need a stronger runtime boundary without dedicating every node.
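The selection logic above can be written down as a small decision helper so that tier definitions stay consistent. This is a minimal sketch, not a Platform feature: the `TenantProfile` fields and the returned labels are hypothetical names chosen for illustration, and the rules are a simplification of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantProfile:
    """Hypothetical inputs a provider might collect per customer or tier."""
    runs_custom_code: bool          # custom containers, adapters, plugins
    needs_own_kubernetes_api: bool  # own API boundary, CRDs, secrets, lifecycle
    needs_dedicated_nodes: bool     # compliance, separate CNI/CSI, predictable performance


def choose_tenancy(profile: TenantProfile) -> list[str]:
    """Return the tenancy building blocks suggested by the tradeoff table."""
    needs_isolation = (
        profile.runs_custom_code
        or profile.needs_own_kubernetes_api
        or profile.needs_dedicated_nodes
    )
    if not needs_isolation:
        # Trusted, provider-owned models only: optimize for utilization.
        return ["shared-serving-tier"]

    layers = ["tenant-cluster"]
    if profile.needs_dedicated_nodes:
        layers.append("private-gpu-nodes")
    elif profile.runs_custom_code:
        # Untrusted code on shared nodes: add a runtime sandbox boundary.
        layers.append("vnode-runtime-isolation")
    return layers
```

For example, `choose_tenancy(TenantProfile(True, True, False))` suggests a tenant cluster plus vNode runtime isolation, while setting `needs_dedicated_nodes=True` switches the second layer to private GPU nodes.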

Define the endpoint contract

Before building automation, define the customer-facing contract separately from the Platform resources that implement it.

| Product concept | Platform or tenant resource |
| --- | --- |
| Customer account, organization, or workspace | Platform project, quotas, allowed templates, and allowed node types |
| Endpoint tier, such as h100-dedicated or l40s-shared | Template parameters, node type constraints, and quota policy |
| Endpoint instance | Tenant cluster, namespace, model-serving Deployment, Service, route, secrets, and metrics resources |
| Endpoint URL and API key | Gateway API or ingress route, DNS entry, certificate, and provider authentication layer |
| Usage and billing record | GPU-hour, node-hour, storage, egress, request, or token usage events mapped to customer, endpoint, model, and tier |
| Endpoint lifecycle event | Platform API call, template update, tenant workload rollout, drain, scale, or delete operation |

Keep this mapping in your product control plane. Customers should see endpoint status, model status, URL, usage, and errors. Your operators should see the backing project, tenant cluster, GPU node claim, template version, route, and runtime rollout state.
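One way to hold this mapping is a single record per endpoint with two projections: one for customers and one for operators. The sketch below assumes a hypothetical `EndpointRecord` schema; none of the field names come from the Platform API.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EndpointRecord:
    """One row in the product control plane (all field names are hypothetical)."""
    # Customer-visible fields.
    endpoint_id: str
    tier: str            # for example "h100-dedicated"
    status: str
    model_status: str
    url: str
    # Operator-only fields that identify the backing resources.
    project: str
    tenant_cluster: str
    node_type: str
    template_version: str
    route: str


CUSTOMER_FIELDS = ("endpoint_id", "tier", "status", "model_status", "url")


def customer_view(record: EndpointRecord) -> dict:
    """Project only the fields a customer should see."""
    return {name: getattr(record, name) for name in CUSTOMER_FIELDS}


def operator_view(record: EndpointRecord) -> dict:
    """Return the full record, including backing Platform resources."""
    return asdict(record)
```

Using an explicit allowlist for the customer view means a newly added operator field stays hidden until someone deliberately exposes it.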

Size the product automation work

The product API is not a thin wrapper around one Platform call. It is the control plane for your commercial inference service. At minimum, it needs to:

  • authenticate the customer and check entitlement
  • map the requested model, region, and endpoint tier to approved Platform projects, templates, quotas, and node types
  • call the Platform API using an internal access key
  • create or select the tenant cluster and runtime template for the endpoint
  • watch cluster, node, route, and model rollout status
  • return endpoint URL, readiness, error, and usage state to the customer
  • reconcile scale, update, drain, delete, and capacity-reclaim workflows

Build a small internal provisioning path first, then decide which parts become customer-facing product API operations. Treat this as product engineering, not as a one-time installation step.
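As an illustration of the entitlement check and tier mapping in the list above, the following sketch resolves a customer request into a provisioning plan before any Platform call is made. The catalog contents, entitlement table, and every name in it (`TierPolicy`, `plan_endpoint`, the template and node type strings) are assumptions made for the example, not Platform objects.

```python
from dataclasses import dataclass


class NotEntitled(Exception):
    """Raised when a customer requests a tier they have not purchased."""


@dataclass(frozen=True)
class TierPolicy:
    project_prefix: str
    template: str
    template_version: str
    node_type: str
    max_gpus: int


# Hypothetical catalog keyed by (endpoint tier, region).
TIER_CATALOG = {
    ("h100-dedicated", "us-east"): TierPolicy(
        "dedicated", "inference-private-nodes", "1.2.0", "h100-8x", 16),
    ("l40s-shared", "us-east"): TierPolicy(
        "shared", "inference-shared", "1.0.3", "l40s-4x", 4),
}

# Hypothetical product-level entitlements per customer.
ENTITLEMENTS = {
    "acme": {"h100-dedicated", "l40s-shared"},
    "globex": {"l40s-shared"},
}


def plan_endpoint(customer: str, model: str, region: str, tier: str, gpus: int) -> dict:
    """Map a customer request to approved project, template, and node type."""
    if tier not in ENTITLEMENTS.get(customer, set()):
        raise NotEntitled(f"{customer} is not entitled to {tier}")
    policy = TIER_CATALOG.get((tier, region))
    if policy is None:
        raise ValueError(f"tier {tier} is not offered in {region}")
    if gpus > policy.max_gpus:
        raise ValueError(f"{gpus} GPUs exceeds the tier quota of {policy.max_gpus}")
    return {
        "project": f"{policy.project_prefix}-{customer}",
        "template": policy.template,
        "template_version": policy.template_version,
        "node_type": policy.node_type,
        "parameters": {"model": model, "gpus": gpus},
    }
```

The returned plan is what the automation would then hand to its Platform client. Keeping this step pure (no network calls) makes it easy to unit test the commercial rules separately from provisioning.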

Endpoint readiness and warm pools

Endpoint readiness depends on more than Kubernetes pod readiness. A new endpoint might wait for GPU capacity, node bootstrap, GPU Operator reconciliation, image pulls, model download, model load, cache warmup, route attachment, DNS, and certificate issuance.

Plan which endpoint tiers are cold-provisioned and which use warm capacity:

| Strategy | Use when | Tradeoff |
| --- | --- | --- |
| Cold provision | Low-cost tiers or infrequent endpoint creation | Lowest idle cost, but customers wait for node provisioning and model load. |
| Warm GPU nodes | You can keep spare GPU nodes joined to tenant clusters or pools | Faster scheduling, but idle GPU capacity costs money. |
| Warm model cache | Large models are reused across endpoints | Reduces model download time, but requires cache storage and invalidation policy. |
| Shared serving pool | Provider-owned models serve many customers | Fastest utilization path, but weaker customer isolation and more careful traffic/auth design. |

On bare metal, capacity is never available instantly. Provisioning a server includes BMC verification, hardware inspection, PXE boot, and OS install before the node joins. See vMetal inference fleet capacity for the tradeoffs between warm pools and on-demand provisioning at the hardware layer.

Expose readiness stages in your product, not just a single Pending state. Customers should know whether an endpoint is waiting for capacity, loading a model, warming cache, publishing a route, or ready for traffic.
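A minimal sketch of how staged readiness could be derived from lower-level signals follows. The signal keys (`gpu_node_ready`, `model_loaded`, and so on) are hypothetical; in practice they would come from whatever your automation observes about nodes, pods, model servers, and routes.

```python
from enum import Enum


class Stage(Enum):
    WAITING_FOR_CAPACITY = "waiting_for_capacity"
    LOADING_MODEL = "loading_model"
    WARMING_CACHE = "warming_cache"
    PUBLISHING_ROUTE = "publishing_route"
    READY = "ready"


# Ordered checks: the first signal that is not yet true decides the stage.
# Signal names are assumptions for this example.
CHECKS = [
    ("gpu_node_ready", Stage.WAITING_FOR_CAPACITY),
    ("model_loaded", Stage.LOADING_MODEL),
    ("cache_warm", Stage.WARMING_CACHE),
    ("route_published", Stage.PUBLISHING_ROUTE),
]


def readiness_stage(signals: dict) -> Stage:
    """Return the earliest unfinished stage, or READY when all checks pass."""
    for key, stage in CHECKS:
        if not signals.get(key, False):
            return stage
    return Stage.READY
```

For example, `readiness_stage({"gpu_node_ready": True})` reports `Stage.LOADING_MODEL`, which is more useful to a customer than a generic Pending state. Missing signals are treated as not ready, so the endpoint never reports ready by omission.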

Track metering and billing

Commercial inference providers need a usage model before launch, even if invoicing comes later. At minimum, decide how you attribute GPU capacity, model runtime usage, storage, network egress, and support costs to a customer, endpoint, model, and product tier.

For dedicated private-node tiers, the simplest unit is node-hours or GPU-hours per SKU. For shared serving pools, you usually need finer-grained signals such as requests, generated tokens, model runtime hours, or pod-level GPU allocation. Keep billing labels stable across tenant clusters, routes, model-serving Deployments, and metrics so usage can be attributed to the product object the customer sees.
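For the dedicated-tier case, GPU-hours per SKU reduce to a simple aggregation once usage events carry stable labels. The sketch below assumes a hypothetical `UsageEvent` shape; it is not the vBilling event format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    """Hypothetical usage sample carrying stable billing labels."""
    customer: str
    endpoint: str
    sku: str        # GPU SKU or node type, for example "h100-8x"
    gpus: int       # GPUs allocated during the interval
    seconds: float  # length of the interval


def gpu_hours(events: list[UsageEvent]) -> dict:
    """Sum GPU-hours per (customer, endpoint, sku)."""
    totals = defaultdict(float)
    for event in events:
        key = (event.customer, event.endpoint, event.sku)
        totals[key] += event.gpus * event.seconds / 3600.0
    return dict(totals)
```

Two 30-minute samples of 8 GPUs each for the same endpoint add up to 8.0 GPU-hours. If a label such as the endpoint ID changes between samples, the usage splits across two keys, which is why the labels need to stay stable.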

vBilling is an experimental vCluster Labs project for this space. It meters tenant clusters, including dedicated-node GPU capacity by SKU, and streams usage events to a billing adapter. Treat it as an option to evaluate for usage-event collection; pricing, plans, invoices, and customer billing logic still belong in your billing system or product backend.

Day 1: Stand up the first production inference endpoint

Note: Steps 3 and 4 configure Platform for your platform engineering team and automation. Customers should provision endpoints through your product.

  1. Install vCluster Platform. If building from bare metal, deploy vCluster Standalone first, then move to Standalone HA before production traffic.
  2. Configure backing store, Platform HA, and Platform backup.
  3. Configure SSO, permissions, and access keys for the internal automation that provisions endpoints.
  4. Create projects, templates, quotas, and allowed node types for each customer tier.
  5. Configure GPU capacity for the inference tiers you sell. If you own bare metal, set up vMetal and the Metal3 node provider. If you use cloud GPU instances, create node types and automation for those instances through node providers, Auto Nodes, or your existing provisioning system.
  6. Create a tenant cluster template for the endpoint tier. The template should enable Private Nodes, restrict GPU node selection, install the GPU Operator or device plugin, enable any required sync settings, and deploy your scheduler and metrics stack.
  7. Create a runtime template or post-create workflow for the endpoint. It should install the model-serving Deployment, Service, secrets, storage, metrics resources, and Gateway API or ingress resources. See Model serving runtimes for the tenant-runtime pattern.
  8. Configure endpoint routing with Gateway API, ingress, or your product's traffic layer. Prefer Gateway API for new HTTP endpoint deployments, and decide whether the route attaches to a platform-owned imported Gateway or a tenant-owned Gateway.
  9. Configure autoscaling with hardware metrics, serving metrics, or both. Use GPU and inference autoscaling for the hardware-metric baseline and serving-metric patterns such as queue depth, request concurrency, latency, and tokens per second.
  10. Build the product automation path. For an endpoint create request, your product should choose the project and template version, pass tier parameters, provision the tenant cluster, wait for GPU capacity, deploy the runtime template, publish the route, and return endpoint status and URL. Start with an internal operator workflow before exposing the API to customers.
  11. Validate the first endpoint from outside the tenant cluster. Confirm the endpoint accepts authorized traffic, rejects unauthorized traffic, reports healthy model status, emits metrics, and can be traced back to the tenant cluster, node type, template version, and route.
  12. Validate isolation and quota enforcement. Confirm the customer can't see Platform internals, other tenant clusters, disallowed templates, disallowed GPU node types, or control plane cluster resources.
  13. Test the full endpoint lifecycle through your product API: create, warm, scale, update model or adapter, rotate credentials, drain, delete, and reclaim GPU capacity.
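To make step 9 concrete, the sketch below combines a hardware metric and a serving metric into one replica decision. It borrows the ratio rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target), taking the largest proposal across metrics). That choice is an assumption for illustration, not the method this guide prescribes, and the metric names are hypothetical per-replica averages.

```python
import math


def desired_replicas(
    current: int,
    metrics: dict,
    targets: dict,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Propose a replica count from several per-replica average metrics."""
    proposals = []
    for name, target in targets.items():
        if name in metrics and target > 0:
            # Scale proportionally to how far the metric is from its target.
            proposals.append(math.ceil(current * metrics[name] / target))
    if not proposals:
        # No usable signal: hold the current size rather than guess.
        return current
    # The most demanding signal wins, clamped to the allowed range.
    return max(min_replicas, min(max_replicas, max(proposals)))
```

With 2 replicas, GPU utilization at 0.45 against a 0.70 target, and queue depth at 30 against a target of 10, the GPU signal proposes 2 replicas and the queue signal proposes 6, so the result is 6. This is the situation where hardware metrics alone would under-scale a saturated model server.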

Endpoint provisioning flow

Use this flow as the acceptance test for the first provider-managed endpoint:

  1. Customer calls your product API with model, region, endpoint tier, and scaling parameters.
  2. Product API authenticates the customer and checks product-level entitlement.
  3. Product automation calls Platform with an internal access key.
  4. Platform creates or selects the project, tenant cluster template, runtime template, quota, and allowed node types for the customer's tier.
  5. Private Nodes or Auto Nodes attach GPU capacity from the selected node type.
  6. The tenant cluster installs the GPU stack, metrics stack, model-serving runtime, Service, and route from the tenant cluster template, runtime template, or a post-create Helm/GitOps workflow managed by your product automation.
  7. Gateway API, ingress, or your traffic layer publishes the endpoint hostname.
  8. Product API returns endpoint status, URL, model state, and operational identifiers.
  9. Day 2 automation watches rollout, route, GPU, and serving metrics and reconciles endpoint status back to your product.
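When the flow is used as an acceptance test, it helps to know which step failed, not only that provisioning failed. The sketch below runs named steps in order and reports the first failure. The step functions are stubs with hypothetical names; real steps would call your product API, Platform, and the traffic layer.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], None]]


def run_flow(steps: list[Step], ctx: dict) -> dict:
    """Run steps in order and stop at the first one that raises."""
    completed = []
    for name, step in steps:
        try:
            step(ctx)
        except Exception as exc:
            return {"ok": False, "failed_step": name,
                    "error": str(exc), "completed": completed}
        completed.append(name)
    return {"ok": True, "failed_step": None, "error": None, "completed": completed}


# Stub steps for illustration only.
def authenticate(ctx: dict) -> None:
    if not ctx.get("api_key"):
        raise PermissionError("missing API key")


def provision_tenant_cluster(ctx: dict) -> None:
    ctx["tenant_cluster"] = f"tc-{ctx['customer']}"


def attach_gpu_capacity(ctx: dict) -> None:
    if ctx.get("available_gpus", 0) < ctx["gpus"]:
        raise RuntimeError("no GPU capacity for requested node type")


EXAMPLE_FLOW: list[Step] = [
    ("authenticate", authenticate),
    ("provision_tenant_cluster", provision_tenant_cluster),
    ("attach_gpu_capacity", attach_gpu_capacity),
]
```

Running `EXAMPLE_FLOW` with a context that has too few available GPUs returns `failed_step` set to `attach_gpu_capacity` and lists the two completed steps, which maps directly onto the numbered flow above.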

Day 2: Operate

| Operation | Read next |
| --- | --- |
| Manage GPU capacity and machine lifecycle | vMetal inference fleet capacity, Metal3 node provider, Manage private nodes |
| Expose and troubleshoot endpoint routing | Model serving runtimes, Gateway API, Gateway API sync troubleshooting |
| Scale model-serving workloads | GPU and inference autoscaling, Fleet monitoring |
| Meter customer usage | vBilling, Fleet monitoring |
| Roll model runtimes and endpoint templates | Model serving runtimes, Templates, GPU and inference autoscaling |
| Roll drivers, vCluster versions, and OS images | Deploy changes, vMetal GPU fleet operations, Upgrade vCluster |
| Enforce customer capacity boundaries | Quotas, Allowed node types, Projects |
| Back up and restore tenant clusters and Platform | Snapshots, restore, Platform backup |
| Manage vNode compatibility during upgrades | vNode limitations, vNode configuration |