Inference Provider: Managed Model Serving
Build managed inference endpoints on your GPU infrastructure. Customers interact with your product API and model endpoints. Platform and vCluster form the operations layer your team runs behind it.
Typical stack:
- Platform as the management plane.
- If you already operate Kubernetes, run vCluster control planes as pods on that cluster. Use Standalone (HA) when you need vCluster to bootstrap the control plane on bare metal or VMs.
- Private GPU nodes for dedicated inference customers.
- vMetal when you own the bare metal GPU lifecycle. Cloud GPU instances, manually joined nodes, or Auto Nodes when your capacity comes from a cloud GPU provider.
- vNode, a separate vCluster Labs product, is optional runtime isolation for custom containers, adapters, plugins, or other untrusted code.
What makes this path different: Customers don't use Platform directly. They call your product API or inference endpoint. Your product uses Platform and vCluster APIs to create tenant clusters, attach GPU capacity, apply templates, enforce quotas, publish routes, and reclaim capacity when endpoints are deleted.
If your users are internal teams that should provision and manage their own environments in Platform, start with Enterprise AI Factory. Use this inference provider path when your product hides Platform and exposes an inference endpoint lifecycle instead.
Day 0: Design decisions
| Decision | Read next | Outcome |
|---|---|---|
| Choose the customer-facing product boundary | Platform API, access keys, Projects | Decide which actions your product exposes, such as endpoint creation, model selection, GPU tier selection, quota changes, endpoint updates, and deletion. Customers shouldn't need Platform access. |
| Choose the inference tenancy model | Choose an inference tenancy model, Architecture | Decide whether each customer, model family, or endpoint tier gets a shared serving pool, a tenant cluster, private GPU nodes, or vNode runtime isolation. |
| Plan GPU capacity classes | Node providers, vMetal inference fleet capacity, GPU and accelerators | Define node types by GPU model, GPU count, region, provider, rack, and reservation model. Use vMetal for owned bare metal fleets, or cloud GPU instances and Auto Nodes for cloud capacity. Keep MIG, vGPU, and Dynamic Resource Allocation decisions in the vendor stack. |
| Define serving runtime templates | Model serving runtimes, Templates, Certified Stacks | Standardize the components each endpoint receives, such as GPU Operator, scheduler, model server, ingress or Gateway API routes, metrics, and secrets. |
| Plan model storage and warmup | Model storage architecture, Model serving runtimes | Decide where model weights live, how they reach each endpoint, how caches are warmed, and how readiness is reported while large models load. |
| Plan endpoint routing | Model serving runtimes, Gateway API | Decide whether tenant clusters attach routes to platform-owned shared Gateways, tenant-owned Gateways, or another ingress layer controlled by your product. |
| Plan autoscaling signals | GPU and inference autoscaling, Monitoring overview | Combine hardware metrics, such as GPU utilization and memory, with serving metrics, such as request concurrency, queue depth, latency, tokens per second, and time to first token. |
| Plan billing and metering | vBilling, Fleet monitoring | Decide how you attribute GPU-hours, node-hours, storage, egress, and request or token usage to customers. vBilling is an experimental vCluster Labs project for metering tenant clusters and streaming usage events to a billing adapter. |
| Plan endpoint readiness and SLA | Endpoint readiness and warm pools, Model serving runtimes | Decide whether endpoints are cold-provisioned, pre-warmed, or backed by warm pools. Account for node provisioning time, image pulls, model downloads, and model load time. |
| Confirm product tiers and licensing | Open Source vs Free tier, vCluster pricing | Confirm which parts of the stack require Platform activation or a paid tier. Sidebar badges show feature tier requirements, but commercial planning usually happens before implementation. |
| Plan durability and operations | Backing store, Platform HA, Platform backup | Choose data stores, backup policy, control plane availability, and recovery procedures before customers depend on endpoints. |
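The autoscaling row above combines hardware and serving signals. As a minimal sketch of one way to combine them, the following applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current value / target value)) to each signal and takes the largest recommendation. The signal names, target values, and replica bounds are hypothetical placeholders, not Platform defaults.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    current: float  # observed average per replica
    target: float   # desired average per replica (hypothetical value)


def desired_replicas(current_replicas: int, signals: list[Signal],
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Scale proportionally per signal; the largest recommendation wins."""
    recommendations = [
        math.ceil(current_replicas * s.current / s.target) for s in signals
    ]
    wanted = max(recommendations, default=current_replicas)
    # Clamp to the bounds the endpoint tier allows.
    return max(min_replicas, min(max_replicas, wanted))


signals = [
    Signal("gpu_utilization_pct", current=60.0, target=80.0),  # hardware metric
    Signal("queue_depth", current=12.0, target=4.0),           # serving metric
]
print(desired_replicas(2, signals))  # 6: queue depth dominates GPU utilization
```

In this example GPU utilization alone would keep two replicas, while queue depth asks for six, which is why serving metrics are worth tracking alongside hardware metrics.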
Choose an inference tenancy model
Inference providers usually need more than one tenancy model. Use the model that matches the customer's trust level, performance expectation, and isolation requirement.
| Model | Use when | Tradeoff |
|---|---|---|
| Shared serving tier | Customers use provider-owned models and don't run custom containers or plugins. | Highest utilization, but weaker customer isolation. Use this for trusted workloads and low-risk endpoint tiers. |
| Tenant cluster per customer | Customers need their own Kubernetes API boundary, custom resources, secrets, and lifecycle. | Strong control plane isolation without automatically dedicating every GPU node. |
| Private GPU nodes | Customers need dedicated worker-node infrastructure, predictable performance, separate CNI/CSI, or compliance isolation. | Stronger isolation and performance predictability, with more capacity planning. |
| vNode runtime isolation | Customers run untrusted code, custom containers, privileged workloads, adapters, or dynamic execution paths on shared nodes. | Adds a workload sandbox boundary, but doesn't replace private nodes when customers require separate infrastructure. |
Tenant clusters are a strong default when each customer or endpoint tier needs its own Kubernetes API boundary, custom resources, secrets, lifecycle, or compliance requirements. A single tenant cluster can host multiple model-serving Deployments behind separate Services and routes, so "tenant cluster" does not have to mean one model per cluster. Cost-sensitive providers can start with a shared serving tier for trusted provider-owned models, then add tenant clusters and private GPU nodes for dedicated or higher-isolation tiers. Add vNode when custom inference workloads need a stronger runtime boundary without dedicating every node.
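The guidance above can be expressed as a small decision helper in your product control plane. This is a sketch under stated assumptions: the `Requirements` fields are hypothetical, and the rules that custom code always gets a tenant cluster and that vNode is skipped when nodes are dedicated are simplifications you might not want.

```python
from dataclasses import dataclass
from enum import Enum


class Tenancy(Enum):
    SHARED_SERVING = "shared serving tier"
    TENANT_CLUSTER = "tenant cluster per customer"
    PRIVATE_NODES = "private GPU nodes"
    VNODE = "vNode runtime isolation"


@dataclass(frozen=True)
class Requirements:
    # Hypothetical flags your product would derive from the endpoint tier.
    needs_own_kubernetes_api: bool = False
    needs_dedicated_nodes: bool = False
    runs_untrusted_code: bool = False


def choose_tenancy(req: Requirements) -> list[Tenancy]:
    """Return the tenancy layers to combine for one endpoint tier."""
    layers: list[Tenancy] = []
    if (req.needs_own_kubernetes_api or req.needs_dedicated_nodes
            or req.runs_untrusted_code):
        layers.append(Tenancy.TENANT_CLUSTER)
    if req.needs_dedicated_nodes:
        layers.append(Tenancy.PRIVATE_NODES)
    elif req.runs_untrusted_code:
        # Sandbox untrusted workloads that stay on shared nodes.
        layers.append(Tenancy.VNODE)
    # With no isolation requirement, fall back to the shared serving tier.
    return layers or [Tenancy.SHARED_SERVING]


print(choose_tenancy(Requirements(runs_untrusted_code=True)))
```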
Define the endpoint contract
Before building automation, define the customer-facing contract separately from the Platform resources that implement it.
| Product concept | Platform or tenant resource |
|---|---|
| Customer account, organization, or workspace | Platform project, quotas, allowed templates, and allowed node types |
| Endpoint tier, such as h100-dedicated or l40s-shared | Template parameters, node type constraints, and quota policy |
| Endpoint instance | Tenant cluster, namespace, model-serving Deployment, Service, route, secrets, and metrics resources |
| Endpoint URL and API key | Gateway API or ingress route, DNS entry, certificate, and provider authentication layer |
| Usage and billing record | GPU-hour, node-hour, storage, egress, request, or token usage events mapped to customer, endpoint, model, and tier |
| Endpoint lifecycle event | Platform API call, template update, tenant workload rollout, drain, scale, or delete operation |
Keep this mapping in your product control plane. Customers should see endpoint status, model status, URL, usage, and errors. Your operators should see the backing project, tenant cluster, GPU node claim, template version, route, and runtime rollout state.
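One way to keep the customer view and the operator view apart is to store both in a single record and project only the customer-facing fields into API responses. The field names and example values below are hypothetical; they illustrate the split described above rather than any Platform schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointRecord:
    # Customer-facing fields.
    endpoint_id: str
    status: str
    model_status: str
    url: str
    # Operator-only fields that identify the backing resources.
    project: str
    tenant_cluster: str
    node_claim: str
    template_version: str
    route: str


CUSTOMER_FIELDS = ("endpoint_id", "status", "model_status", "url")


def customer_view(record: EndpointRecord) -> dict[str, str]:
    """Project a record down to the fields a customer is allowed to see."""
    return {name: getattr(record, name) for name in CUSTOMER_FIELDS}


record = EndpointRecord(
    endpoint_id="ep-123", status="ready", model_status="loaded",
    url="https://ep-123.example.invalid",
    project="customer-a", tenant_cluster="customer-a-inference",
    node_claim="claim-1", template_version="v3", route="ep-123-route",
)
print(customer_view(record))
```

An allowlist of customer fields, rather than a denylist of operator fields, means a newly added internal field stays hidden by default.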
Size the product automation work
The product API is not a thin wrapper around one Platform call. It is the control plane for your commercial inference service. At minimum, it needs to:
- authenticate the customer and check entitlement
- map the requested model, region, and endpoint tier to approved Platform projects, templates, quotas, and node types
- call the Platform API using an internal access key
- create or select the tenant cluster and runtime template for the endpoint
- watch cluster, node, route, and model rollout status
- return endpoint URL, readiness, error, and usage state to the customer
- reconcile scale, update, drain, delete, and capacity-reclaim workflows
Build a small internal provisioning path first, then decide which parts become customer-facing product API operations. Treat this as product engineering, not as a one-time installation step.
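The entitlement and mapping steps from the list above can be sketched as a lookup against a catalog your product owns. The catalog contents, project and template names, and exception types are hypothetical; only the tier names come from the endpoint contract examples. The sketch stops before any Platform API call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPlan:
    project: str
    template: str
    node_type: str
    max_gpus: int


# Hypothetical catalog keyed by (region, tier); maintained by your product.
CATALOG = {
    ("eu-west", "h100-dedicated"): TierPlan(
        "inference-dedicated", "dedicated-endpoint-v3", "h100-8x", 16),
    ("eu-west", "l40s-shared"): TierPlan(
        "inference-shared", "shared-endpoint-v2", "l40s-4x", 4),
}


class NotEntitled(Exception):
    """The customer's plan does not include the requested tier."""


class UnknownTier(Exception):
    """The tier is not offered in the requested region."""


def resolve_plan(entitled_tiers: set[str], region: str, tier: str,
                 gpus: int) -> TierPlan:
    """Map a customer request to approved Platform-side resources."""
    if tier not in entitled_tiers:
        raise NotEntitled(tier)
    plan = CATALOG.get((region, tier))
    if plan is None:
        raise UnknownTier(f"{tier} in {region}")
    if gpus > plan.max_gpus:
        raise ValueError(f"requested {gpus} GPUs, tier allows {plan.max_gpus}")
    return plan


print(resolve_plan({"l40s-shared"}, "eu-west", "l40s-shared", gpus=2))
```

Rejecting a request here, before provisioning starts, keeps entitlement errors separate from capacity or rollout failures in the status you report to customers.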
Endpoint readiness and warm pools
Endpoint readiness depends on more than Kubernetes pod readiness. A new endpoint might wait for GPU capacity, node bootstrap, GPU Operator reconciliation, image pulls, model download, model load, cache warmup, route attachment, DNS, and certificate issuance.
Plan which endpoint tiers are cold-provisioned and which use warm capacity:
| Strategy | Use when | Tradeoff |
|---|---|---|
| Cold provision | Low-cost tiers or infrequent endpoint creation | Lowest idle cost, but customers wait for node provisioning and model load. |
| Warm GPU nodes | You can keep spare GPU nodes joined to tenant clusters or pools | Faster scheduling, but idle GPU capacity costs money. |
| Warm model cache | Large models are reused across endpoints | Reduces model download time, but requires cache storage and invalidation policy. |
| Shared serving pool | Provider-owned models serve many customers | Fastest path to utilization, but weaker customer isolation and a need for more careful traffic and authentication design. |
On bare metal, capacity is not available instantly. Provisioning a server includes BMC verification, hardware inspection, PXE boot, and OS install before the node joins. See vMetal inference fleet capacity for how warm pools compare with on-demand provisioning at the hardware layer.
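To compare strategies for a tier, it can help to add up the stages a customer actually waits for. The sketch below assumes that warm GPU nodes remove the node provisioning wait and that a warm model cache removes the model download. Every duration is a made-up placeholder, so replace them with measurements from your own fleet.

```python
# Placeholder stage durations in seconds. These are NOT measurements.
STAGE_SECONDS = {
    "node_provisioning": 900,
    "image_pull": 120,
    "model_download": 600,
    "model_load": 180,
}

# Stages each strategy is assumed to take off the customer's critical path.
SKIPPED_STAGES = {
    "cold": set(),
    "warm_nodes": {"node_provisioning"},
    "warm_nodes_and_cache": {"node_provisioning", "model_download"},
}


def estimated_wait_seconds(strategy: str,
                           stage_seconds: dict[str, int] = STAGE_SECONDS) -> int:
    """Sum the stages that remain on the critical path for a strategy."""
    skipped = SKIPPED_STAGES[strategy]
    return sum(seconds for stage, seconds in stage_seconds.items()
               if stage not in skipped)


for strategy in SKIPPED_STAGES:
    print(strategy, estimated_wait_seconds(strategy))
```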
Expose readiness stages in your product, not just a single Pending state. Customers should know whether an endpoint is waiting for capacity, loading a model, warming cache, publishing a route, or ready for traffic.
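A minimal sketch of staged readiness: evaluate an ordered list of checks and report the first one that has not passed. The observation keys are hypothetical; your automation would fill them in from node, rollout, and route status.

```python
from enum import Enum


class Stage(Enum):
    WAITING_FOR_CAPACITY = "waiting for capacity"
    LOADING_MODEL = "loading model"
    WARMING_CACHE = "warming cache"
    PUBLISHING_ROUTE = "publishing route"
    READY = "ready"


# Ordered checks: the first unmet condition determines the reported stage.
CHECKS = [
    ("gpu_node_ready", Stage.WAITING_FOR_CAPACITY),
    ("model_loaded", Stage.LOADING_MODEL),
    ("cache_warm", Stage.WARMING_CACHE),
    ("route_published", Stage.PUBLISHING_ROUTE),
]


def readiness_stage(observed: dict[str, bool]) -> Stage:
    """Return the stage a customer should see for the observed state."""
    for key, stage in CHECKS:
        if not observed.get(key, False):
            return stage
    return Stage.READY


print(readiness_stage({"gpu_node_ready": True}).value)  # loading model
```

Treating a missing observation as "not yet true" keeps the endpoint from being reported as ready before every check has been confirmed.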
Track metering and billing
Commercial inference providers need a usage model before launch, even if invoicing comes later. At minimum, decide how you attribute GPU capacity, model runtime usage, storage, network egress, and support costs to a customer, endpoint, model, and product tier.
For dedicated private-node tiers, the simplest unit is node-hours or GPU-hours per SKU. For shared serving pools, you usually need finer-grained signals such as requests, generated tokens, model runtime hours, or pod-level GPU allocation. Keep billing labels stable across tenant clusters, routes, model-serving Deployments, and metrics so usage can be attributed to the product object the customer sees.
vBilling is an experimental vCluster Labs project for this space. It meters tenant clusters, including dedicated-node GPU capacity by SKU, and streams usage events to a billing adapter. Treat it as an option to evaluate for usage-event collection; pricing, plans, invoices, and customer billing logic still belong in your billing system or product backend.
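For dedicated tiers, attributing GPU-hours comes down to summing node usage intervals under stable labels. The event shape below is a hypothetical illustration for your own billing backend. It is not the vBilling event format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NodeUsage:
    # Stable billing labels that match what the customer sees.
    customer: str
    endpoint: str
    sku: str
    gpus: int
    start: datetime
    end: datetime


def gpu_hours(events: list[NodeUsage]) -> dict[tuple[str, str, str], float]:
    """Sum GPU-hours per (customer, endpoint, SKU)."""
    totals: dict[tuple[str, str, str], float] = defaultdict(float)
    for event in events:
        hours = (event.end - event.start).total_seconds() / 3600
        totals[(event.customer, event.endpoint, event.sku)] += event.gpus * hours
    return dict(totals)


def at(hour: int, minute: int = 0) -> datetime:
    """Build a UTC timestamp on an arbitrary example day."""
    return datetime(2025, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    NodeUsage("customer-a", "ep-123", "h100-dedicated", 8, at(0), at(1, 30)),
    NodeUsage("customer-a", "ep-123", "h100-dedicated", 8, at(2), at(2, 30)),
]
print(gpu_hours(events))  # 8 GPUs x 2.0 hours = 16.0 GPU-hours
```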
Day 1: Stand up the first production inference endpoint
Steps 3 and 4 configure Platform for your platform engineering team and automation. Customers should provision endpoints through your product.
1. Install vCluster Platform. If building from bare metal, deploy vCluster Standalone first, then move to Standalone HA before production traffic.
2. Configure backing store, Platform HA, and Platform backup.
3. Configure SSO, permissions, and access keys for the internal automation that provisions endpoints.
4. Create projects, templates, quotas, and allowed node types for each customer tier.
5. Configure GPU capacity for the inference tiers you sell. If you own bare metal, set up vMetal and the Metal3 node provider. If you use cloud GPU instances, create node types and automation for those instances through node providers, Auto Nodes, or your existing provisioning system.
6. Create a tenant cluster template for the endpoint tier. The template should enable Private Nodes, restrict GPU node selection, install the GPU Operator or device plugin, enable any required sync settings, and deploy your scheduler and metrics stack.
7. Create a runtime template or post-create workflow for the endpoint. It should install the model-serving Deployment, Service, secrets, storage, metrics resources, and Gateway API or ingress resources. See Model serving runtimes for the tenant-runtime pattern.
8. Configure endpoint routing with Gateway API, ingress, or your product's traffic layer. Prefer Gateway API for new HTTP endpoint deployments, and decide whether the route attaches to a platform-owned imported Gateway or a tenant-owned Gateway.
9. Configure autoscaling with hardware metrics, serving metrics, or both. Use GPU and inference autoscaling for the hardware-metric baseline and serving-metric patterns such as queue depth, request concurrency, latency, and tokens per second.
10. Build the product automation path. For an endpoint create request, your product should choose the project and template version, pass tier parameters, provision the tenant cluster, wait for GPU capacity, deploy the runtime template, publish the route, and return endpoint status and URL. Start with an internal operator workflow before exposing the API to customers.
11. Validate the first endpoint from outside the tenant cluster. Confirm the endpoint accepts authorized traffic, rejects unauthorized traffic, reports healthy model status, emits metrics, and can be traced back to the tenant cluster, node type, template version, and route.
12. Validate isolation and quota enforcement. Confirm the customer can't see Platform internals, other tenant clusters, disallowed templates, disallowed GPU node types, or control plane cluster resources.
13. Test the full endpoint lifecycle through your product API: create, warm, scale, update model or adapter, rotate credentials, drain, delete, and reclaim GPU capacity.
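The external validation step above can be captured as a small, repeatable check. This sketch assumes the endpoint answers authorized requests with HTTP 200 and unauthorized ones with 401 or 403. The `send` callable is a hypothetical hook that you would back with your own HTTP client, so the sketch makes no assumption about how your endpoint authenticates requests.

```python
from typing import Callable, Optional

# send(url, api_key) returns the HTTP status code of one probe request.
Send = Callable[[str, Optional[str]], int]


def validate_endpoint(send: Send, url: str, valid_key: str) -> dict[str, bool]:
    """Probe an endpoint from outside the tenant cluster."""
    return {
        "accepts_authorized": send(url, valid_key) == 200,
        "rejects_missing_key": send(url, None) in (401, 403),
        "rejects_wrong_key": send(url, valid_key + "-invalid") in (401, 403),
    }


def fake_send(url: str, api_key: Optional[str]) -> int:
    """Stand-in for a real HTTP client, used only to show the call shape."""
    return 200 if api_key == "secret" else 401


results = validate_endpoint(fake_send, "https://ep-123.example.invalid", "secret")
print(results)
print(all(results.values()))
```

Running the same check after every template or route change turns the acceptance test into a regression test.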
Endpoint provisioning flow
Use this flow as the acceptance test for the first provider-managed endpoint:
- Customer calls your product API with model, region, endpoint tier, and scaling parameters.
- Product API authenticates the customer and checks product-level entitlement.
- Product automation calls Platform with an internal access key.
- Platform creates or selects the project, tenant cluster template, runtime template, quota, and allowed node types for the customer's tier.
- Private Nodes or Auto Nodes attach GPU capacity from the selected node type.
- The tenant cluster installs the GPU stack, metrics stack, model-serving runtime, Service, and route from the tenant cluster template, runtime template, or a post-create Helm/GitOps workflow managed by your product automation.
- Gateway API, ingress, or your traffic layer publishes the endpoint hostname.
- Product API returns endpoint status, URL, model state, and operational identifiers.
- Day 2 automation watches rollout, route, GPU, and serving metrics and reconciles endpoint status back to your product.
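The flow above can be driven by a small runner that executes steps in order and records how far provisioning got, which gives your product a precise failure point to report. The step functions here are hypothetical stubs; none of them calls a real Platform API.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], None]]


def run_flow(steps: list[Step], request: dict) -> dict:
    """Run provisioning steps in order and stop at the first failure."""
    context = dict(request)  # steps share state through this dict
    completed: list[str] = []
    for name, step in steps:
        try:
            step(context)
        except Exception as exc:
            return {"status": "failed", "failed_step": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"status": "ready", "completed": completed,
            "url": context.get("url")}


# Hypothetical stubs standing in for real automation.
def check_entitlement(ctx: dict) -> None:
    if ctx["tier"] not in ctx["entitled_tiers"]:
        raise PermissionError(f"tier {ctx['tier']} not allowed")


def provision_cluster(ctx: dict) -> None:
    ctx["tenant_cluster"] = f"{ctx['customer']}-inference"


def publish_route(ctx: dict) -> None:
    ctx["url"] = f"https://{ctx['tenant_cluster']}.example.invalid"


STEPS: list[Step] = [
    ("check_entitlement", check_entitlement),
    ("provision_cluster", provision_cluster),
    ("publish_route", publish_route),
]

print(run_flow(STEPS, {"customer": "customer-a", "tier": "h100-dedicated",
                       "entitled_tiers": {"h100-dedicated"}}))
```

A production version would also make each step idempotent so that a retry after a partial failure does not create duplicate tenant clusters or routes.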
Day 2: Operate
| Operation | Read next |
|---|---|
| Manage GPU capacity and machine lifecycle | vMetal inference fleet capacity, Metal3 node provider, Manage private nodes |
| Expose and troubleshoot endpoint routing | Model serving runtimes, Gateway API, Gateway API sync troubleshooting |
| Scale model-serving workloads | GPU and inference autoscaling, Fleet monitoring |
| Meter customer usage | vBilling, Fleet monitoring |
| Roll model runtimes and endpoint templates | Model serving runtimes, Templates, GPU and inference autoscaling |
| Roll drivers, vCluster versions, and OS images | Deploy changes, vMetal GPU fleet operations, Upgrade vCluster |
| Enforce customer capacity boundaries | Quotas, Allowed node types, Projects |
| Back up and restore tenant clusters and Platform | Snapshots, restore, Platform backup |
| Manage vNode compatibility during upgrades | vNode limitations, vNode configuration |