ModelMesh Multi-Model Serving

ModelMesh packs multiple models into shared serving pods. Use it when you need to serve many smaller models cost-effectively on shared GPU resources instead of dedicating a full pod to each model, as KServe does.

Dependencies

Requirement	Type	Path
RHOAI Operator	Operator	`components/operators/rhoai-operator/`
DSC `modelmeshserving: Managed`	DSC component	`components/instances/rhoai-instance/`
GPU Infrastructure (optional)	Operator + Instance	See gpu-infrastructure.md

ModelMesh does not require cert-manager or Knative; it uses its own routing.

Enable It

Overlay

Use the pre-built serving overlay (enables both KServe and ModelMesh):

oc apply -k components/instances/rhoai-instance/overlays/serving/

DSC Patch

Or set the component to Managed directly in the DSC spec:

spec:
  components:
    modelmeshserving:
      managementState: Managed
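
If the DataScienceCluster already exists, the same setting can probably be applied in place with a merge patch. The sketch below assumes the DSC is named default-dsc (the name used in the wait command under Deploy); the dsc_patch helper is hypothetical and only builds the patch body.

```shell
# Hypothetical helper: prints a merge-patch body that sets the
# modelmeshserving management state to the value given as $1.
dsc_patch() {
  printf '{"spec":{"components":{"modelmeshserving":{"managementState":"%s"}}}}' "$1"
}

# Assumption: the DSC is named default-dsc; adjust if yours differs.
# The guard only skips the call on machines where oc is not installed.
if command -v oc >/dev/null 2>&1; then
  oc patch datasciencecluster default-dsc --type=merge -p "$(dsc_patch Managed)"
fi
```

A merge patch touches only the listed field, so other components in the DSC keep their current state.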

Deploy

GitOps

ModelMesh is enabled automatically when the rhoai-instance ArgoCD Application points to the serving, full, or dev overlay.

Manual

# 1. Install the RHOAI operator
oc apply -k components/operators/rhoai-operator/
oc get csv -A | grep rhods

# 2. Create DSC with serving overlay
oc apply -k components/instances/rhoai-instance/overlays/serving/

# 3. Wait for DSC
oc wait --for=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'=True \
  datasciencecluster/default-dsc --timeout=600s

Verify

# ModelMesh controller should be running
oc get pods -n redhat-ods-applications -l app=modelmesh-controller
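
As an extra check, it may help to confirm that the DSC itself records the component as enabled. This is a minimal sketch that assumes the DSC name default-dsc from the Deploy steps; the variable names are illustrative only.

```shell
# Assumption: the DSC is named default-dsc, as in the Deploy steps.
DSC_NAME="default-dsc"
# JSONPath to the field set by the DSC patch.
STATE_PATH='{.spec.components.modelmeshserving.managementState}'

if command -v oc >/dev/null 2>&1; then
  # A value of Managed indicates that ModelMesh is enabled in the DSC.
  oc get datasciencecluster "$DSC_NAME" -o jsonpath="$STATE_PATH"
  echo
fi
```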

Example: Deploy a Model with ModelMesh

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
  namespace: my-namespace
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/sklearn-model"

The key difference from a KServe deployment is the serving.kserve.io/deploymentMode: ModelMesh annotation, which routes the model to the shared ModelMesh pool instead of creating a dedicated pod.
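
One way to apply and check the example is sketched below. It assumes the manifest was saved as my-sklearn-model.yaml (a hypothetical filename) and that the my-namespace namespace already exists; the 300s timeout is an arbitrary choice.

```shell
# Assumptions: hypothetical manifest filename; name and namespace are
# taken from the example InferenceService.
MANIFEST="my-sklearn-model.yaml"
NAMESPACE="my-namespace"
ISVC="my-sklearn-model"

if command -v oc >/dev/null 2>&1 && [ -f "$MANIFEST" ]; then
  oc apply -f "$MANIFEST"
  # Block until the InferenceService reports the Ready condition.
  oc wait --for=condition=Ready "inferenceservice/$ISVC" \
    -n "$NAMESPACE" --timeout=300s
  # Show the resulting status.
  oc get inferenceservice "$ISVC" -n "$NAMESPACE"
fi
```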

When to Use ModelMesh vs KServe

Factor	KServe	ModelMesh
Model isolation	Dedicated pod per model	Shared pod pool
Scale-to-zero	Yes (via Knative)	No
GPU efficiency	One GPU per model	Multiple models per GPU
Best for	LLMs, large models	Many small/medium models
Protocol	OpenAI-compatible (vLLM)	gRPC / REST

Disable It

Set spec.components.modelmeshserving.managementState to Removed in the DSC.
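
On a running cluster, this can likely be done with a merge patch rather than by editing the overlay. The sketch assumes the DSC is named default-dsc, as in the Deploy steps. If the DSC is managed through GitOps, a manual patch may be reverted on the next sync, so change the overlay in Git instead.

```shell
# Merge-patch body that turns the ModelMesh component off.
PATCH='{"spec":{"components":{"modelmeshserving":{"managementState":"Removed"}}}}'

# Assumption: the DSC is named default-dsc.
if command -v oc >/dev/null 2>&1; then
  oc patch datasciencecluster default-dsc --type=merge -p "$PATCH"
  # Read the field back to confirm the change was recorded.
  oc get datasciencecluster default-dsc \
    -o jsonpath='{.spec.components.modelmeshserving.managementState}'
  echo
fi
```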