GPU Infrastructure (NFD, GPU Operator, MachineSets)¶
GPU infrastructure provides the foundation for all GPU-accelerated workloads on OpenShift. This includes Node Feature Discovery (NFD) for detecting GPU hardware, the NVIDIA GPU Operator for installing drivers and container toolkit, and MachineSets for provisioning GPU worker nodes. Deploy this before any capability that requires GPU acceleration.
Dependencies¶
| Requirement | Type | Path |
|---|---|---|
| NFD Operator | Operator | components/operators/nfd/ |
| GPU Operator | Operator | components/operators/gpu-operator/ |
| NFD Instance | Instance | components/instances/nfd-instance/ |
| GPU Instance (ClusterPolicy) | Instance | components/instances/gpu-instance/ |
| GPU Workers (MachineSets) | Instance | components/instances/gpu-workers/ |
| Cluster Autoscaler | Instance | components/instances/cluster-autoscaler/ |
NFD must be installed and running before the GPU Operator, as the GPU Operator relies on NFD node labels to identify GPU hardware.
**Cluster-specific configuration required.** GPU worker provisioning is cloud-specific. Example MachineSet manifests are provided in `components/instances/gpu-workers/examples/aws/`. Copy and customize them for your cluster; see Customizing for your cluster below.
Deploy¶
GPU infrastructure is deployed automatically via ApplicationSet-discovered Applications:
- `instance-nfd-instance` -- NFD NodeFeatureDiscovery CR
- `instance-gpu-instance` -- GPU ClusterPolicy CR
- GPU MachineSets + MachineAutoscalers -- cloud-specific, not auto-deployed; see examples
- `instance-cluster-autoscaler` -- ClusterAutoscaler
The operators (`operator-nfd`, `operator-gpu-operator`) are also auto-discovered. To deploy manually instead, apply the components in order:
# 1. Install NFD operator and wait
oc apply -k components/operators/nfd/
oc get csv -n openshift-nfd | grep nfd
# 2. Create NFD instance
oc apply -k components/instances/nfd-instance/
oc wait --for=condition=Available \
  nodefeaturediscovery/nfd-instance -n openshift-nfd --timeout=300s
# 3. Install GPU operator and wait
oc apply -k components/operators/gpu-operator/
oc get csv -n nvidia-gpu-operator | grep gpu
# 4. Create GPU ClusterPolicy
oc apply -k components/instances/gpu-instance/
oc wait --for=jsonpath='{.status.state}'=ready \
clusterpolicy/gpu-cluster-policy --timeout=600s
# 5. Create GPU worker MachineSets (cloud-specific, use your cloud's example)
oc apply -k components/instances/gpu-workers/examples/aws/
# 6. (Optional) Create ClusterAutoscaler for auto-scaling
oc apply -k components/instances/cluster-autoscaler/
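The CRs created in steps 2 and 4 live in the kustomize directories above. In outline they look roughly like the sketch below; the field values are illustrative, not the repo's actual manifests:

```yaml
# NodeFeatureDiscovery CR (step 2) -- illustrative sketch
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}              # defaults suffice for GPU detection; NFD labels nodes with PCI vendor IDs
---
# GPU ClusterPolicy CR (step 4) -- illustrative sketch
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true     # operator installs the NVIDIA driver on labeled nodes
  toolkit:
    enabled: true     # operator installs the NVIDIA container toolkit
  dcgmExporter:
    enabled: true     # export GPU metrics for monitoring
```

The GPU Operator only acts on nodes carrying the NFD label `feature.node.kubernetes.io/pci-10de.present=true` (PCI vendor `10de` is NVIDIA), which is why NFD must be running first.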
Verify¶
# NFD labels on nodes
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# GPU operator pods running
oc get pods -n nvidia-gpu-operator
# GPU devices available on nodes
oc describe node <gpu-node> | grep nvidia.com/gpu
# MachineSet status
oc get machinesets -n openshift-machine-api | grep gpu
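For an end-to-end check, schedule a pod that requests a GPU and runs `nvidia-smi`. This is a common smoke test, not part of this repo; the image tag is an assumption, so substitute any image that ships `nvidia-smi`:

```yaml
# gpu-smoke-test.yaml -- illustrative smoke test; image tag is an assumption
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # requests one GPU; triggers autoscaling if no capacity exists
```

Apply it with `oc apply -f gpu-smoke-test.yaml`, check `oc logs pod/gpu-smoke-test` for the GPU table, then delete the pod.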
GPU Worker Nodes¶
This repo provisions two types of GPU MachineSets on AWS:
| MachineSet | Instance Type | GPU | Use Case |
|---|---|---|---|
| L4 workers | g6.2xlarge | NVIDIA L4 (24GB) | Inference, light training |
| L40S workers | g6e.2xlarge | NVIDIA L40S (48GB) | Heavy training, large models |
Customizing for your cluster¶
The MachineSet manifests contain cluster-specific values. Update these fields
in components/instances/gpu-workers/examples/aws/gpu-machineset-*.yaml:
metadata.name-- replaceocp-2qkbkwith your cluster's infra IDspec.template.spec.providerSpec.value.ami.id-- your RHCOS AMIspec.template.spec.providerSpec.value.iamInstanceProfile.id-- your IAM profilesubnet,securityGroups,tags-- your cluster's networking config
Scaling¶
Manual scaling via Git:
# Edit components/instances/gpu-workers/examples/aws/gpu-machineset-l4.yaml, set spec.replicas: 5
git commit -am "Scale L4 GPU workers to 5" && git push
Auto-scaling is configured via:
- ClusterAutoscaler -- cluster-wide limits (max 20 nodes, max 8 GPUs)
- MachineAutoscaler (L4) -- min: 1, max: 6 nodes
- MachineAutoscaler (L40S) -- min: 0, max: 4 nodes
When a pod requests nvidia.com/gpu and no capacity exists, nodes are
auto-provisioned. Idle nodes are removed after 10 minutes.
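With the limits above, the autoscaler CRs look approximately like this sketch. Resource names and the scale target are assumptions; the repo's actual manifests live under `components/instances/cluster-autoscaler/` and the gpu-workers examples:

```yaml
# ClusterAutoscaler -- cluster-wide limits (sketch)
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default               # the ClusterAutoscaler is a singleton named "default"
spec:
  resourceLimits:
    maxNodesTotal: 20         # max 20 nodes
    gpus:
      - type: nvidia.com/gpu
        min: 0
        max: 8                # max 8 GPUs cluster-wide
  scaleDown:
    enabled: true
    unneededTime: 10m         # idle nodes removed after 10 minutes
---
# MachineAutoscaler for the L4 MachineSet (sketch; name and target are assumptions)
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-l4
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <infra-id>-gpu-l4-us-east-2a
```

Each MachineAutoscaler targets one MachineSet, while the ClusterAutoscaler caps totals across all of them; both must exist for GPU-triggered scale-up to work.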
Disable It¶
Remove GPU workers and instances in reverse order:
oc delete -k components/instances/cluster-autoscaler/
oc delete -k components/instances/gpu-workers/examples/aws/
oc delete clusterpolicy gpu-cluster-policy
oc delete nodefeaturediscovery nfd-instance -n openshift-nfd
oc delete sub gpu-operator-certified -n nvidia-gpu-operator
oc delete sub nfd -n openshift-nfd