Skip to content

Capabilities Guide

Red Hat OpenShift AI (RHOAI) is modular. Pick the capabilities you need, understand their dependencies, and deploy only what matters for your use case.

Capability Map

Capability DataScienceCluster (DSC) Component Required Operators Required Instances Guide
KServe Model Serving kserve rhoai-operator, cert-manager rhoai-instance model-serving.md
ModelMesh Serving modelmeshserving rhoai-operator rhoai-instance modelmesh.md
Distributed Training ray, trainingoperator rhoai-operator, cert-manager, kueue-operator, jobset-operator rhoai-instance, kueue-instance, jobset-instance training.md
Data Science Pipelines datasciencepipelines rhoai-operator rhoai-instance pipelines.md
Workbenches workbenches rhoai-operator rhoai-instance workbenches.md
Model Registry modelregistry rhoai-operator rhoai-instance, external MySQL 5.x+, S3 storage model-registry.md
MLflow mlflowoperator rhoai-operator rhoai-instance, mlflow-instance mlflow.md
GPU Infrastructure N/A nfd, gpu-operator nfd-instance, gpu-instance, gpu-workers gpu-infrastructure.md
Kueue (GPU Quotas) kueue (Unmanaged) kueue-operator, cert-manager kueue-instance, kueue-config kueue.md

Dependency Diagram

graph TD
  CertMgr["cert-manager"] -->|"TLS certificates"| KServe["KServe"]
  CertMgr -->|"required"| KueueInst["Kueue Instance + ClusterQueue"]
  CertMgr -->|"required"| Training["Training"]
  CertMgr -->|"required"| LlamaStackOp["LlamaStack Operator"]
  ServiceMesh["ServiceMesh Operator"] -->|"required"| LlamaStackOp
  NFD["NFD Operator"] --> GPU["GPU Operator"]
  KueueOp["Kueue Operator"] --> KueueInst
  JobSetOp["JobSet Operator"] --> JobSetInst["JobSet Instance"]
  RHOAI["RHOAI Operator"] --> DSC["DSC (components)"]
  GPU --> GPUWorkers["GPU Workers"]
  DSC --> KServe
  DSC --> ModelMesh["ModelMesh"]
  DSC --> Pipelines["Pipelines"]
  DSC --> Workbenches["Workbenches"]
  DSC --> Registry["Registry"]
  DSC --> TrustyAI["TrustyAI"]
  DSC --> CodeFlare["CodeFlare"]
  DSC --> MLflow["MLflow Operator"]
  MLflow --> MLflowInst["MLflow Instance"]
  DSC --> LlamaStackOp
  GPUWorkers --> ModelServing["Model Serving"]
  KServe --> ModelServing
  ModelMesh --> ModelServing
  DSC --> Ray["Ray"]
  DSC --> TrainOp["Training Operator"]
  Ray --> Training
  TrainOp --> Training
  KueueInst --> Training
  JobSetInst --> Training

Key takeaways:

  • Every capability requires the RHOAI operator and a DataScienceCluster (DSC)
  • GPU Infrastructure (NFD + GPU Operator) is required for any GPU workload (model serving, training)
  • Kueue is required for training workloads that need GPU quota management
  • JobSet is required for distributed training (TrainJob depends on it)
  • cert-manager is required for KServe (TLS via Knative), Kueue-based workloads (training), distributed inference (llm-d), and LlamaStack
  • ServiceMesh Operator 3.x is required for LlamaStack
  • Model Registry requires an external MySQL database (5.x+) and S3-compatible object storage
  • Capabilities without GPU needs (Pipelines, Workbenches, Registry) can run on CPU-only clusters

Additional RHOAI 3.3 capabilities not covered in this repo

The official RHOAI 3.3 documentation lists additional DSC components that this repository does not deploy or document in detail:

  • advancedkserve (Distributed Inference with llm-d) -- enables distributed model inference using the llm-d framework. Requires cert-manager, Red Hat Connectivity Link Operator, Red Hat Leader Worker Set Operator, and OpenShift 4.20+. Not included in this repo's manifests.
  • feastoperator (Feature Store) -- present in our base DSC as Removed. The Feast Operator provides a feature store for ML workloads. Enable it by setting feastoperator.managementState: Managed if needed.

DSC Overlays -- Pick Your Profile

Instead of editing the DSC YAML directly, use a pre-built overlay:

Overlay Components Enabled Use Case
overlays/minimal/ Dashboard only Exploration, start here
overlays/serving/ Dashboard, KServe, ModelMesh Model serving only
overlays/training/ Dashboard, Ray, TrainingOperator Distributed training only
overlays/full/ All 10 DSC components (see below) Full platform
overlays/dev/ All 10 DSC components (same as full) Development (current default)

What 'All components' means

The full and dev overlays enable: workbenches, kserve, ray, trainingoperator, modelregistry, trustyai, datasciencepipelines, modelmeshserving, codeflare, and llamastackoperator. The base DSC always keeps dashboard Managed and kueue Unmanaged (Red Hat Build of Kueue is deployed as a standalone operator).

Deploy with an overlay

# GitOps: point the rhoai-instance ArgoCD app at your chosen overlay
# Manual:
oc apply -k components/instances/rhoai-instance/overlays/serving/

Composing a Custom Profile

If the pre-built overlays don't match your needs, compose your own by stacking JSON patches from the capability overlays.

Example: serving + pipelines

Create components/instances/rhoai-instance/overlays/my-profile/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - path: ../serving/patch-serving.yaml
    target:
      kind: DataScienceCluster
  - path: patch-pipelines.yaml
    target:
      kind: DataScienceCluster

And patch-pipelines.yaml:

- op: replace
  path: /spec/components/datasciencepipelines/managementState
  value: Managed

Each capability overlay's patch file can be referenced from any custom overlay, making profiles fully composable without duplication.

Manual Installation Order

When deploying without ArgoCD, install in this order. The four phases must be completed sequentially -- each phase depends on the previous one.

Phase 1 -- Pre-RHOAI Operators

oc apply -k components/operators/cert-manager/       # Required for KServe, training, Kueue, LlamaStack
oc apply -k components/operators/servicemesh/         # Required for LlamaStack
oc apply -k components/operators/nfd/                 # Required for GPU
oc apply -k components/operators/gpu-operator/        # Required for GPU
oc apply -k components/operators/kueue-operator/      # Required for training
oc apply -k components/operators/jobset-operator/     # Required for training
oc apply -k components/operators/rhoai-operator/      # Always required

# Wait for all CSVs to reach Succeeded (re-run until all show Succeeded)
watch "oc get csv -A | grep -E 'cert-manager|servicemesh|nfd|gpu-operator|kueue|jobset|rhods'"

# IMPORTANT: Do NOT proceed until every CSV shows "Succeeded".

Phase 2 -- Pre-DSC Instances (order matters)

oc apply -k components/instances/nfd-instance/        # NFD first (GPU depends on it)
oc apply -k components/instances/gpu-instance/         # GPU ClusterPolicy
oc apply -k components/instances/gpu-workers/examples/aws/  # GPU MachineSets (cloud-specific)
oc apply -k components/instances/cluster-autoscaler/   # Auto-scaling
oc apply -k components/instances/kueue-instance/       # Kueue
oc apply -k components/instances/kueue-config/         # GPU ResourceFlavors + ClusterQueue
oc apply -k components/instances/jobset-instance/      # JobSet

Phase 3 -- DSC + Post-DSC Instances

# RHOAI DSC -- pick your overlay
oc apply -k components/instances/rhoai-instance/overlays/serving/

# Wait for DSC to be Ready before applying post-DSC instances
oc wait --for=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'=True \
  datasciencecluster/default-dsc --timeout=600s

# Post-DSC instances (target the redhat-ods-applications namespace created by DSC)
oc apply -k components/instances/dashboard-config/     # Enables GenAI Studio (Tech Preview, not enabled by default)
oc apply -k components/instances/mcp-servers/           # Registers MCP servers in GenAI Studio
oc apply -k components/instances/mlflow-instance/       # MLflow tracking server

Phase 4 -- Use Cases (models before services)

# Deploy models first
oc apply -k usecases/models/orchestrator-8b/profiles/tier1-minimal/
oc apply -k usecases/models/qwen-math-7b/profiles/tier1-minimal/

# Deploy services (depend on model endpoints being reachable)
oc apply -k usecases/services/toolorchestra-app/profiles/tier1-minimal/

Minimal Installs by Goal

"I just want to serve a model" -- install cert-manager, RHOAI operator, then use the serving overlay. See model-serving.md.

"I just want notebooks" -- install RHOAI operator, use the dev or full overlay (includes Dashboard + Workbenches). The minimal overlay only enables Dashboard without Workbenches. See workbenches.md.

"I need training" -- install RHOAI, Kueue, JobSet, NFD, GPU operators, their instances, then use the training overlay. See training.md.

"I want everything" -- follow the Quick Start with the full or dev overlay.