Model Serving with KServe¶
KServe provides scalable, standards-based model serving on OpenShift AI. It deploys each model as an
InferenceService resource with dedicated resources and independent scaling, and handles
auto-scaling, canary rollouts, and request batching behind an OpenAI-compatible API. It supports
popular serving frameworks (vLLM, TensorRT, Triton), and RHOAI ships a validated vLLM runtime for
GPU-accelerated LLM serving.
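Once a model is deployed, its OpenAI-compatible endpoint can be queried like any other OpenAI-style server. A minimal sketch; the hostname and model name are placeholders, not values from this repository:

```shell
# Query the OpenAI-compatible chat completions endpoint exposed by the vLLM runtime.
# Substitute the route created for your InferenceService (oc get route -n <namespace>).
curl -s https://my-model-my-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```

The same endpoint works with any OpenAI client library by pointing its base URL at the route.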
Dependencies¶
| Requirement | Type | Path |
|---|---|---|
| RHOAI Operator | Operator | components/operators/rhoai-operator/ |
| cert-manager | Operator | components/operators/cert-manager/ |
| DSC kserve: Managed | DSC component | components/instances/rhoai-instance/ |
| GPU Infrastructure | Operator + Instance | See gpu-infrastructure.md |
cert-manager is required because KServe uses Knative Serving, which needs TLS certificates for internal routing.
GPU not always required
KServe can serve CPU-only models. GPU infrastructure (NFD + GPU Operator) is only needed for GPU-accelerated inference.
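For CPU-only serving, the predictor simply omits the GPU resource request. An illustrative sketch of the resources stanza (values are examples, not tuned recommendations):

```yaml
# Illustrative CPU-only predictor resources; with no nvidia.com/gpu entry,
# the pod can schedule on any CPU node.
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "16Gi"
```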
Enable It¶
Deploy¶
KServe is enabled automatically when the rhoai-instance ArgoCD Application points to the serving, full, or dev overlay.
# 1. Install required operators
oc apply -k components/operators/cert-manager/
oc apply -k components/operators/rhoai-operator/
# Wait for CSVs
oc get csv -A | grep -E "cert-manager|rhods"
# 2. Install GPU infrastructure (if serving GPU models)
oc apply -k components/operators/nfd/
oc apply -k components/operators/gpu-operator/
oc apply -k components/instances/nfd-instance/
oc apply -k components/instances/gpu-instance/
# 3. Create DSC with serving overlay
oc apply -k components/instances/rhoai-instance/overlays/serving/
# 4. Wait for DSC to be ready
oc wait --for=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'=True \
datasciencecluster/default-dsc --timeout=600s
Verify¶
# KServe controller should be running
oc get pods -n redhat-ods-applications -l app=kserve-controller-manager
# Knative Serving should be ready
oc get knativeserving -n knative-serving
Example: Deploy a Model¶
A minimal ServingRuntime and InferenceService using the RHOAI vLLM image:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: my-model-runtime
namespace: my-namespace
spec:
supportedModelFormats:
- name: vLLM
autoSelect: true
multiModel: false
containers:
- name: kserve-container
image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
args:
- --port=8080
- --model=/mnt/models
- --served-model-name={{.Name}}
ports:
- containerPort: 8080
protocol: TCP
name: http
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: my-model
namespace: my-namespace
labels:
opendatahub.io/dashboard: "true"
annotations:
serving.kserve.io/deploymentMode: RawDeployment
security.opendatahub.io/enable-auth: "false"
spec:
predictor:
minReplicas: 1
maxReplicas: 1
model:
modelFormat:
name: vLLM
runtime: my-model-runtime
storageUri: "pvc://my-model-weights/model-dir"
resources:
requests:
cpu: "1"
memory: "8Gi"
nvidia.com/gpu: "1"
limits:
cpu: "4"
memory: "24Gi"
nvidia.com/gpu: "1"
Key fields:
- serving.kserve.io/deploymentMode: RawDeployment -- bypasses Knative for simpler
networking (recommended for GPU models)
- storageUri: "pvc://..." -- loads model weights from a PVC
- opendatahub.io/dashboard: "true" -- makes the model visible in the RHOAI Dashboard
For complete working examples, see usecases/models/orchestrator-8b/manifests/ and usecases/models/gpt-oss-120b/manifests/.
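After applying the manifests, readiness can be checked with standard oc commands. A sketch, assuming the names from the example above:

```shell
# Wait for the InferenceService to report Ready
oc wait --for=condition=Ready inferenceservice/my-model -n my-namespace --timeout=600s

# The predictor pod should be Running
oc get pods -n my-namespace -l serving.kserve.io/inferenceservice=my-model

# Print the in-cluster URL published in the InferenceService status
oc get inferenceservice my-model -n my-namespace -o jsonpath='{.status.url}'
```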
Disable It¶
Set kserve.managementState to Removed in the DSC. The RHOAI operator will then remove the KServe controller and Knative Serving components.
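A sketch of the change as a patch, assuming the DSC is named default-dsc as in the deploy steps:

```shell
# Set the KServe component to Removed; the operator reconciles the deletion.
oc patch datasciencecluster default-dsc --type merge \
  -p '{"spec":{"components":{"kserve":{"managementState":"Removed"}}}}'

# Confirm the controller pods are gone
oc get pods -n redhat-ods-applications -l app=kserve-controller-manager
```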