ToolOrchestra Multi-Model Orchestrator

NVIDIA ToolOrchestra is a multi-model orchestrator that coordinates specialized AI models for complex reasoning tasks.

Components

Component	Description
`orchestrator-8b`	Nemotron-Orchestrator-8B -- orchestrates tool calls across specialist models
`qwen-math-7b`	Qwen2.5-Math-7B-Instruct -- math reasoning specialist
`orchestrator-ui`	Web UI for interactive orchestration with SSE streaming

Profiles

Profile	Models	Training	Use Case
`tier1-minimal`	orchestrator-8b, qwen-math-7b	Not included	Development, demos
`training`	orchestrator-8b, qwen-math-7b	Full (infra + workloads)	Training runs

Adding profiles

Create additional profiles (e.g. tier2-standard with more models) by adding a new directory under profiles/. See CONTRIBUTING.md for conventions.
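One way to start a new profile is to copy an existing one and edit the copy. The sketch below assumes profiles live under usecases/services/toolorchestra-app/profiles/ (the path used in the deploy commands) and that copying tier1-minimal is an acceptable starting point; the helper name new_profile is hypothetical, and CONTRIBUTING.md remains the authority on conventions.

```shell
# new_profile SRC_PROFILE NEW_PROFILE [BASE_DIR]
# Copies an existing profile directory to a new name. Refuses to overwrite.
# BASE_DIR defaults to an assumed location; adjust it for your repository layout.
new_profile() {
  src="$1"
  new="$2"
  base="${3:-usecases/services/toolorchestra-app/profiles}"

  if [ ! -d "$base/$src" ]; then
    echo "source profile not found: $base/$src" >&2
    return 1
  fi
  if [ -e "$base/$new" ]; then
    echo "profile already exists: $base/$new" >&2
    return 1
  fi

  cp -R "$base/$src" "$base/$new"
  echo "created $base/$new"
}

# Example: new_profile tier1-minimal tier2-standard
```

After copying, edit the new directory to add the extra models before committing.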

Prerequisites

ToolOrchestra deploys InferenceServices that require a fully configured RHOAI platform. Ensure the following are installed and ready before deploying:

Requirement	Why	Guide
RHOAI Operator + DSC with `kserve: Managed`	InferenceServices require the KServe serving platform	Quick Start or Model Serving
cert-manager Operator	KServe requires TLS certificates via Knative	Capabilities
GPU infrastructure (NFD + GPU Operator + GPU workers)	Models require NVIDIA L4 or L40S GPUs	GPU Infrastructure
Kueue + JobSet (for training only)	Training workloads need GPU quota management	Training

GPU MachineSet customization

GPU worker provisioning is cloud-specific. Example manifests for AWS are in components/instances/gpu-workers/examples/aws/. Customize them for your cluster or create your own. See GPU Infrastructure.
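The prerequisites in the table above can be spot-checked from the command line before deploying. This is a rough sketch, not a full validation: the function name check_prereqs is hypothetical, the DataScienceCluster field path and the nvidia.com/gpu.present node label are assumptions about a typical RHOAI and GPU Operator installation, and the presence of a CRD only shows that an operator was installed, not that it is healthy.

```shell
# check_prereqs: prints one OK/FAIL line per prerequisite, returns 1 if any failed.
check_prereqs() {
  status=0

  # KServe should be Managed in the DataScienceCluster (assumed field path).
  kserve=$(oc get datasciencecluster \
    -o jsonpath='{.items[*].spec.components.kserve.managementState}' 2>/dev/null)
  if [ "$kserve" = "Managed" ]; then
    echo "OK    kserve: Managed"
  else
    echo "FAIL  kserve managementState: ${kserve:-unknown}"
    status=1
  fi

  # CRDs installed by cert-manager, Kueue, and JobSet.
  # The Kueue and JobSet checks matter only for training.
  for crd in certificates.cert-manager.io \
             clusterqueues.kueue.x-k8s.io \
             jobsets.jobset.x-k8s.io; do
    if oc get crd "$crd" >/dev/null 2>&1; then
      echo "OK    crd: $crd"
    else
      echo "FAIL  crd missing: $crd"
      status=1
    fi
  done

  # At least one node should carry the GPU label (assumed label name).
  gpu_nodes=$(oc get nodes -l nvidia.com/gpu.present=true -o name 2>/dev/null \
    | wc -l | tr -d ' ')
  if [ "$gpu_nodes" -gt 0 ]; then
    echo "OK    GPU nodes: $gpu_nodes"
  else
    echo "FAIL  no GPU nodes found"
    status=1
  fi

  return "$status"
}
```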

Deploy

Model dependencies required

ToolOrchestra requires the orchestrator-8b and qwen-math-7b models to be deployed and Ready. The ToolOrchestra UI connects to these model endpoints at runtime; without them, the UI loads but inference calls fail.

GitOps

ToolOrchestra is auto-deployed by the cluster-services ApplicationSet when using the tier1-minimal profile.

After bootstrapping the cluster, the service-toolorchestra-app Application is created automatically. Model dependencies (orchestrator-8b, qwen-math-7b) are deployed separately by the cluster-models ApplicationSet.

Manual

# Deploy models first (services depend on these endpoints)
oc apply -k usecases/models/orchestrator-8b/profiles/tier1-minimal/
oc apply -k usecases/models/qwen-math-7b/profiles/tier1-minimal/

# Wait for models to download and become Ready
oc wait --for=condition=Ready inferenceservice/orchestrator-8b \
  -n orchestrator-8b --timeout=1800s
oc wait --for=condition=Ready inferenceservice/qwen-math-7b \
  -n qwen-math-7b --timeout=1800s

# Deploy the ToolOrchestra service
oc apply -k usecases/services/toolorchestra-app/profiles/tier1-minimal/
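After the manual steps, it can help to confirm that each model endpoint reports Ready and to see its URL before opening the UI. The helper below is a minimal sketch: the name isvc_status is hypothetical, and it assumes the InferenceService exposes a Ready condition and a status.url field, as KServe normally does.

```shell
# isvc_status NAME NAMESPACE
# Prints the endpoint URL when the InferenceService reports Ready=True,
# otherwise prints the observed state to stderr and returns 1.
isvc_status() {
  name="$1"
  ns="$2"

  ready=$(oc get inferenceservice "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)

  if [ "$ready" = "True" ]; then
    url=$(oc get inferenceservice "$name" -n "$ns" -o jsonpath='{.status.url}')
    echo "$name ready at $url"
  else
    echo "$name not ready (Ready=${ready:-unknown})" >&2
    return 1
  fi
}

# Example:
#   isvc_status orchestrator-8b orchestrator-8b
#   isvc_status qwen-math-7b qwen-math-7b
```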

Sync Wave Ordering

Sync waves ensure correct resource ordering across the related ArgoCD Applications:

model-orchestrator-8b and model-qwen-math-7b apps:

Wave	Resources	Purpose
-1 (default)	Namespace, PVCs, ServingRuntimes, Service, Route	Infrastructure created first
0	`download-orchestrator-8b`, `download-qwen-math-7b` Jobs	Model downloads run before predictors start
1	`orchestrator-8b`, `qwen-math-7b` InferenceServices	Predictors created after models are downloaded

Download jobs are idempotent (they check for a .download_complete marker) and have no TTL, so completed jobs persist as Synced/Healthy in ArgoCD.
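The marker-file idea behind an idempotent download can be sketched as follows. This is an illustration of the pattern only, not the repository's actual job script: download_once is a hypothetical helper, and the download command is whatever the caller passes in.

```shell
# download_once TARGET_DIR COMMAND [ARGS...]
# Runs COMMAND only if TARGET_DIR/.download_complete is absent, and writes
# the marker only when COMMAND succeeds, so a failed run is retried next time.
download_once() {
  target="$1"
  shift
  marker="$target/.download_complete"

  if [ -f "$marker" ]; then
    echo "marker found, skipping download: $target"
    return 0
  fi

  mkdir -p "$target"
  if "$@"; then
    touch "$marker"
    echo "download complete: $target"
  else
    echo "download failed: $target" >&2
    return 1
  fi
}
```

Because the marker is written last, re-running the job after a partial download repeats the download instead of silently accepting incomplete files.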

service-toolorchestra-app app: Deploys Namespace, RBAC, ConfigMaps, NetworkPolicy, and the ToolOrchestra UI Deployment at the default wave. No sync wave ordering is needed since it contains no download jobs or InferenceServices.

Training Pipeline

The repository includes a GRPO training pipeline using KubeRay for distributed training and Kueue for GPU quota management.

Training Infrastructure (deploy separately)

Training infrastructure lives in usecases/services/toolorchestra-app/manifests/training/infra/ and must be deployed before running training workloads:

- LocalQueue (training-queue) -- namespaced queue pointing to training-cluster-queue
- PVC (training-checkpoints, 100Gi) -- stores base model, dataset, and checkpoints
- ConfigMap (grpo-training-config) -- GRPO hyperparameters adapted for L4 GPUs
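Before starting a training run, the three resources listed above can be checked for existence. The sketch below uses a hypothetical helper, check_training_infra, and assumes the resources live in the orchestrator-rhoai namespace (the namespace used by the monitoring commands); pass a different namespace if yours differs.

```shell
# check_training_infra [NAMESPACE]
# Reports whether each training infrastructure resource exists.
# Returns non-zero if any resource is missing.
check_training_infra() {
  ns="${1:-orchestrator-rhoai}"   # assumed namespace
  missing=0

  for res in localqueue/training-queue \
             pvc/training-checkpoints \
             configmap/grpo-training-config; do
    if oc get "$res" -n "$ns" >/dev/null 2>&1; then
      echo "found   $res"
    else
      echo "missing $res"
      missing=$((missing + 1))
    fi
  done

  [ "$missing" -eq 0 ]
}
```

If anything is reported missing, applying the infra directory with `oc apply -k usecases/services/toolorchestra-app/manifests/training/infra/` is the likely fix, assuming that directory carries its own kustomization.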

Not included in tier1-minimal

The tier1-minimal profile deploys only the ToolOrchestra UI. Training infrastructure and workloads require separate deployment.

Training Workloads (on-demand)

Training workloads are managed by the usecase-toolorchestra-training Application (defined explicitly in clusters/overlays/dev/) and are synced manually:

Download Jobs (sync-wave 0):
- download-qwen3-8b -- downloads Qwen/Qwen3-8B base model
- download-training-data -- downloads nvidia/ToolScale dataset
RayJob (sync-wave 1):
- grpo-training -- 1 head node (no GPU) + 3 GPU worker nodes (1xL4 each)
- Uses verl framework with GRPO algorithm

Running Training

ArgoCD CLI

argocd app sync usecase-toolorchestra-training

ArgoCD processes sync waves: download jobs first (wave 0), then RayJob (wave 1).

ArgoCD UI

Navigate to usecase-toolorchestra-training and click Sync.

Manual

# Deploys both training infra (LocalQueue, PVC, ConfigMap) and workloads (download jobs, RayJob)
oc apply -k usecases/services/toolorchestra-app/manifests/training/

Monitor Training

oc get rayjob grpo-training -n orchestrator-rhoai -w
oc logs -f -l app.kubernetes.io/name=grpo-head -n orchestrator-rhoai
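For unattended runs, the watch command above can be replaced by a polling loop that exits when the job finishes. This is a sketch under assumptions: wait_for_rayjob is a hypothetical helper, and it relies on the RayJob reporting its state in status.jobStatus with values such as SUCCEEDED, FAILED, and STOPPED, as KubeRay does in the versions I am aware of.

```shell
# wait_for_rayjob NAME NAMESPACE [INTERVAL_SECONDS] [MAX_POLLS]
# Returns 0 on SUCCEEDED, 1 on FAILED or STOPPED, 2 on timeout.
wait_for_rayjob() {
  name="$1"
  ns="$2"
  interval="${3:-30}"
  max_polls="${4:-120}"
  polls=0

  while [ "$polls" -lt "$max_polls" ]; do
    state=$(oc get rayjob "$name" -n "$ns" \
      -o jsonpath='{.status.jobStatus}' 2>/dev/null)
    case "$state" in
      SUCCEEDED)
        echo "rayjob $name succeeded"
        return 0 ;;
      FAILED|STOPPED)
        echo "rayjob $name ended with status $state" >&2
        return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done

  echo "timed out waiting for rayjob $name" >&2
  return 2
}

# Example: wait_for_rayjob grpo-training orchestrator-rhoai
```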

GPU Worker Node Scaling

GPU worker nodes can be managed via Git (cloud-specific, see components/instances/gpu-workers/examples/aws/).

Manual Scaling via Git

# Edit components/instances/gpu-workers/examples/aws/gpu-machineset-l4.yaml
#   spec.replicas: 5
git commit -am "Scale L4 GPU workers to 5" && git push

Auto-scaling

Resource	Config	Effect
ClusterAutoscaler	max 20 nodes, max 8 GPUs	Cluster-wide scaling limits
MachineAutoscaler (L4)	min: 1, max: 6	Auto-scales `g6.2xlarge` nodes
MachineAutoscaler (L40S)	min: 0, max: 4	Auto-scales `g6e.2xlarge` nodes

When a pod requests nvidia.com/gpu and no capacity is available, the ClusterAutoscaler automatically adds GPU nodes. Idle nodes are removed after 10 minutes.
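The limits in the table interact: the two MachineAutoscalers together could add up to 6 + 4 = 10 GPU nodes, but the cluster-wide cap is 8 GPUs. A small worked calculation, assuming one GPU per node for both instance types and reading "max 8 GPUs" as a hard cluster-wide ceiling (effective_gpu_cap is a hypothetical helper, not part of the repository):

```shell
# effective_gpu_cap CLUSTER_GPU_CAP MAX_NODES...
# Sums the per-pool node maximums (one GPU per node assumed) and
# caps the result at the cluster-wide GPU limit.
effective_gpu_cap() {
  cluster_cap="$1"
  shift
  total=0
  for max in "$@"; do
    total=$((total + max))
  done
  if [ "$total" -lt "$cluster_cap" ]; then
    echo "$total"
  else
    echo "$cluster_cap"
  fi
}

effective_gpu_cap 8 6 4   # L4 max 6, L40S max 4, cluster cap 8: prints 8
```

Under these assumptions, the autoscaler stops adding GPU nodes at 8 GPUs even though the per-pool maximums would allow 10.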