Skip to content

Distributed Training with Ray and Training Operator

Distributed training enables fine-tuning and training of large models across multiple GPU nodes. This capability uses KubeRay for Ray-based distributed workloads and the Kubeflow Training Operator for PyTorchJob and TrainJob resources. Use this when your model training needs more GPU memory or compute than a single node can provide. RHOAI provides distributed training through two components:

  • Ray (KubeRay) -- distributed compute framework for RayJob workloads (used for GRPO reinforcement learning in this repo)
  • Training Operator -- Kubeflow Training Operator for PyTorchJob, TrainJob, and other framework-specific distributed training jobs

Both integrate with Kueue for GPU quota management and JobSet for multi-pod job orchestration.

Dependencies

Requirement Type Path
RHOAI Operator Operator components/operators/rhoai-operator/
cert-manager Operator Operator components/operators/cert-manager/
Kueue Operator Operator components/operators/kueue-operator/
JobSet Operator Operator components/operators/jobset-operator/
DSC ray: Managed DSC component components/instances/rhoai-instance/
DSC trainingoperator: Managed DSC component components/instances/rhoai-instance/
Kueue Instance + Config Instance components/instances/kueue-instance/, kueue-config/
JobSet Instance Instance components/instances/jobset-instance/
GPU Infrastructure Operator + Instance See gpu-infrastructure.md

cert-manager is required

The official RHOAI 3.3 documentation lists cert-manager as a dependency for Kueue-based workloads (training, Ray). Install the cert-manager Operator before deploying training workloads.

Enable It

Use the pre-built training overlay:

oc apply -k components/instances/rhoai-instance/overlays/training/
spec:
  components:
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed

Note

Kueue is set to Unmanaged in the DSC because it is managed by the standalone Red Hat Build of Kueue Operator. See Kueue.

Deploy

Training components are enabled automatically when the rhoai-instance ArgoCD Application points to the training, full, or dev overlay. The Kueue and JobSet operators are installed via their own ApplicationSet-discovered Applications.

# 1. Install all required operators
oc apply -k components/operators/cert-manager/
oc apply -k components/operators/rhoai-operator/
oc apply -k components/operators/kueue-operator/
oc apply -k components/operators/jobset-operator/
oc apply -k components/operators/nfd/
oc apply -k components/operators/gpu-operator/

# Wait for all CSVs to reach Succeeded before proceeding (re-run until all show Succeeded)
watch "oc get csv -A | grep -E 'cert-manager|rhods|kueue|jobset|nfd|gpu'"

# IMPORTANT: Do NOT proceed until every CSV shows "Succeeded".

# 2. Install GPU infrastructure
oc apply -k components/instances/nfd-instance/
oc apply -k components/instances/gpu-instance/
oc apply -k components/instances/gpu-workers/examples/aws/  # cloud-specific

# 3. Install Kueue and JobSet instances
oc apply -k components/instances/kueue-instance/
oc apply -k components/instances/kueue-config/
oc apply -k components/instances/jobset-instance/

# 4. Create DSC with training overlay
oc apply -k components/instances/rhoai-instance/overlays/training/

# 5. Wait for DSC
oc wait --for=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'=True \
  datasciencecluster/default-dsc --timeout=600s

Verify

# KubeRay operator should be running
oc get pods -n redhat-ods-applications -l app.kubernetes.io/name=kuberay-operator

# Training operator should be running
oc get pods -n redhat-ods-applications -l control-plane=kubeflow-training-operator

GPU and Kueue required

Distributed training requires GPU infrastructure (NFD + GPU Operator) and Kueue for quota management. Deploy these first. See GPU Infrastructure and Kueue.

Example: RayJob for GRPO Training

This repo includes a complete GRPO training pipeline. To run it:

# Via ArgoCD
argocd app sync usecase-toolorchestra-training

# Or manually (deploys both training infra and workloads)
oc apply -k usecases/services/toolorchestra-app/manifests/training/

The training pipeline uses sync waves: - Wave 0: Download jobs fetch the base model and dataset - Wave 1: RayJob starts GRPO training (1 head + 3 GPU workers)

Monitor progress:

oc get rayjob grpo-training -n orchestrator-rhoai -w
oc logs -f -l app.kubernetes.io/name=grpo-head -n orchestrator-rhoai

Training infrastructure resources

Training infrastructure must be deployed separately before running training workloads. These resources live in usecases/services/toolorchestra-app/manifests/training/infra/:

  • LocalQueue (training-queue) -- namespaced Kueue queue
  • PVC (training-checkpoints, 100Gi) -- model + dataset + checkpoint storage
  • ConfigMap (grpo-training-config) -- GRPO hyperparameters

Deploy infra and workloads together:

oc apply -k usecases/services/toolorchestra-app/manifests/training/

Example: Minimal RayJob

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-training-job
  namespace: my-namespace
  labels:
    kueue.x-k8s.io/queue-name: training-queue
spec:
  entrypoint: "python train.py"
  runtimeEnvYAML: |
    pip:
      - torch
  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.40.0-py311-gpu
              resources:
                requests:
                  cpu: "2"
                  memory: "8Gi"
    workerGroupSpecs:
      - replicas: 2
        groupName: gpu-workers
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.40.0-py311-gpu
                resources:
                  requests:
                    cpu: "2"
                    memory: "16Gi"
                    nvidia.com/gpu: "1"

The kueue.x-k8s.io/queue-name label routes the job through Kueue for quota management. See kueue.md for configuring queues and quotas.

Disable It

Set ray.managementState and trainingoperator.managementState to Removed in the DSC. Clean up any running jobs first:

oc delete rayjob --all -A
oc delete pytorchjob --all -A