GPU Infrastructure (NFD, GPU Operator, MachineSets)

GPU infrastructure provides the foundation for all GPU-accelerated workloads on OpenShift. It consists of Node Feature Discovery (NFD) for detecting GPU hardware, the NVIDIA GPU Operator for installing the drivers and container toolkit, and MachineSets for provisioning GPU worker nodes. Deploy it before any capability that requires GPU acceleration.

Dependencies

Requirement                  | Type     | Path
NFD Operator                 | Operator | components/operators/nfd/
GPU Operator                 | Operator | components/operators/gpu-operator/
NFD Instance                 | Instance | components/instances/nfd-instance/
GPU Instance (ClusterPolicy) | Instance | components/instances/gpu-instance/
GPU Workers (MachineSets)    | Instance | components/instances/gpu-workers/
Cluster Autoscaler           | Instance | components/instances/cluster-autoscaler/

NFD must be installed and running before the GPU Operator, as the GPU Operator relies on NFD node labels to identify GPU hardware.

Cluster-specific configuration required

GPU worker provisioning is cloud-specific. Example MachineSet manifests are provided in components/instances/gpu-workers/examples/aws/. Copy and customize them for your cluster. See Customizing for your cluster below.

Deploy

GPU infrastructure is deployed automatically via ApplicationSet-discovered Applications, except for the cloud-specific GPU MachineSets:

  • instance-nfd-instance -- NFD NodeFeatureDiscovery CR
  • instance-gpu-instance -- GPU ClusterPolicy CR
  • GPU MachineSets + MachineAutoscalers (cloud-specific, not auto-deployed; see examples)
  • instance-cluster-autoscaler -- ClusterAutoscaler

The operators (operator-nfd, operator-gpu-operator) are also auto-discovered. To deploy manually instead, apply the components in this order:

# 1. Install the NFD operator, then re-run the check until its CSV shows Succeeded
oc apply -k components/operators/nfd/
oc get csv -n openshift-nfd | grep nfd

# 2. Create NFD instance
oc apply -k components/instances/nfd-instance/
oc wait --for=condition=Available \
  nodefeaturediscovery/nfd-instance -n openshift-nfd --timeout=300s

# 3. Install the GPU operator, then re-run the check until its CSV shows Succeeded
oc apply -k components/operators/gpu-operator/
oc get csv -n nvidia-gpu-operator | grep gpu

# 4. Create GPU ClusterPolicy
oc apply -k components/instances/gpu-instance/
oc wait --for=jsonpath='{.status.state}'=ready \
  clusterpolicy/gpu-cluster-policy --timeout=600s

# 5. Create GPU worker MachineSets (cloud-specific, use your cloud's example)
oc apply -k components/instances/gpu-workers/examples/aws/

# 6. (Optional) Create ClusterAutoscaler for auto-scaling
oc apply -k components/instances/cluster-autoscaler/
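
Steps 1 and 3 above list the operator CSV once rather than blocking until it is installed. The following is a minimal sketch of a polling helper that could fill that gap. The function name wait_for_csv and its timeout and interval defaults are hypothetical and not part of this repo; it only assumes that oc is logged in to the cluster.

```shell
# wait_for_csv NAMESPACE PATTERN [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Hypothetical helper: polls until a ClusterServiceVersion whose name
# contains PATTERN reports phase "Succeeded" in NAMESPACE.
wait_for_csv() {
  local namespace="$1"
  local pattern="$2"
  local timeout="${3:-300}"
  local interval="${4:-10}"
  local elapsed=0
  local phases

  while [ "$elapsed" -le "$timeout" ]; do
    # One "<name> <phase>" line per CSV. An empty list is expected while
    # OLM is still resolving the Subscription, so errors are ignored here.
    phases="$(oc get csv -n "$namespace" \
      -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
      2>/dev/null || true)"
    if printf '%s\n' "$phases" | grep -- "$pattern" | grep -q ' Succeeded$'; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "timed out waiting for a CSV matching '$pattern' in $namespace" >&2
  return 1
}
```

For example, `wait_for_csv openshift-nfd nfd` could run after step 1 and `wait_for_csv nvidia-gpu-operator gpu 600` after step 3, so that the instance CRs are only applied once their operators are ready.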

Verify

# NFD labels on nodes
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# GPU operator pods running
oc get pods -n nvidia-gpu-operator

# GPU devices available on nodes
oc describe node <gpu-node> | grep nvidia.com/gpu

# MachineSet status
oc get machinesets -n openshift-machine-api | grep gpu
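
The per-node check above can be summarized across the cluster. This is a rough sketch that reuses the NFD label and the nvidia.com/gpu resource name from the commands above; the function names gpu_capacity and total_gpus are hypothetical.

```shell
# gpu_capacity prints "<node><TAB><allocatable GPUs>" for every node that
# NFD has labelled as carrying an NVIDIA PCI device (vendor ID 10de).
gpu_capacity() {
  oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# total_gpus sums the second column. A node whose driver is not ready yet
# has an empty value there and counts as 0.
total_gpus() {
  gpu_capacity | awk -F '\t' '{ sum += $2 } END { print sum + 0 }'
}
```

A node that shows up in gpu_capacity with an empty GPU column has been detected by NFD but is not yet advertising devices, which usually points at the GPU operator pods on that node.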

GPU Worker Nodes

This repo provisions two types of GPU MachineSets on AWS:

MachineSet   | Instance Type | GPU                 | Use Case
L4 workers   | g6.2xlarge    | NVIDIA L4 (24 GB)   | Inference, light training
L40S workers | g6e.2xlarge   | NVIDIA L40S (48 GB) | Heavy training, large models

Customizing for your cluster

The MachineSet manifests contain cluster-specific values. Update these fields in components/instances/gpu-workers/examples/aws/gpu-machineset-*.yaml:

  • metadata.name -- replace ocp-2qkbk with your cluster's infra ID
  • spec.template.spec.providerSpec.value.ami.id -- your RHCOS AMI
  • spec.template.spec.providerSpec.value.iamInstanceProfile.id -- your IAM profile
  • subnet, securityGroups, tags -- your cluster's networking config
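
Most of these values can be read from the running cluster. The sketch below assumes the cluster already has at least one worker MachineSet whose providerSpec is a suitable template, which is typical for installer-provisioned clusters but may not hold for yours. The function names are hypothetical.

```shell
# cluster_infra_id prints the infrastructure ID of the current cluster.
cluster_infra_id() {
  oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'
}

# worker_provider_value FIELD prints one providerSpec field (for example
# ami.id or iamInstanceProfile.id) from the first MachineSet returned.
# Assumption: that MachineSet is a regular worker set in the same region.
worker_provider_value() {
  oc get machinesets -n openshift-machine-api \
    -o jsonpath="{.items[0].spec.template.spec.providerSpec.value.$1}"
}

# replace_infra_id NEW_ID rewrites the example infra ID (ocp-2qkbk) read
# from stdin and writes the result to stdout.
replace_infra_id() {
  sed "s/ocp-2qkbk/$1/g"
}
```

One possible use is `replace_infra_id "$(cluster_infra_id)" < gpu-machineset-l4.yaml > my-gpu-machineset-l4.yaml`, followed by `worker_provider_value ami.id` and `worker_provider_value iamInstanceProfile.id` to look up the values to paste in. Review the result by hand before committing it, since subnet, security group, and tag entries still need checking.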

Scaling

Manual scaling via Git:

# Edit components/instances/gpu-workers/examples/aws/gpu-machineset-l4.yaml, set spec.replicas: 5
git commit -am "Scale L4 GPU workers to 5" && git push

Auto-scaling is configured via:

  • ClusterAutoscaler -- cluster-wide limits (max 20 nodes, max 8 GPUs)
  • MachineAutoscaler (L4) -- min: 1, max: 6 nodes
  • MachineAutoscaler (L40S) -- min: 0, max: 4 nodes

When a pod requests nvidia.com/gpu and no capacity exists, nodes are auto-provisioned. Idle nodes are removed after 10 minutes.
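
One way to observe this behavior is to schedule a throwaway pod that requests a single GPU. The following is only an illustrative sketch: the pod name gpu-scale-probe, the sleep command, and the choice of the ubi-minimal image are assumptions, and the pod does no GPU work. If your GPU MachineSets taint their nodes, the pod also needs a matching toleration.

```shell
# gpu_probe_manifest prints a hypothetical Pod that requests one GPU and
# then sleeps, which is enough to make the scheduler look for GPU capacity.
gpu_probe_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scale-probe
spec:
  restartPolicy: Never
  containers:
    - name: probe
      image: registry.access.redhat.com/ubi9/ubi-minimal
      command: ["sleep", "300"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

Apply it with `gpu_probe_manifest | oc apply -f -`, watch `oc get machines -n openshift-machine-api` for a new GPU machine if the pod stays Pending, and clean up with `oc delete pod gpu-scale-probe`.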

Disable It

Remove the autoscaler, GPU workers, instances, and operator subscriptions in reverse deployment order:

oc delete -k components/instances/cluster-autoscaler/
oc delete -k components/instances/gpu-workers/examples/aws/
oc delete clusterpolicy gpu-cluster-policy
oc delete nodefeaturediscovery nfd-instance -n openshift-nfd
oc delete sub gpu-operator-certified -n nvidia-gpu-operator
oc delete sub nfd -n openshift-nfd
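
Deleting a Subscription stops further updates but does not by itself remove the operator's installed ClusterServiceVersion. A small sketch for spotting what is left behind, with a hypothetical function name:

```shell
# leftover_csvs NAMESPACE PATTERN prints the resource names of any CSVs in
# NAMESPACE whose name contains PATTERN. It prints nothing when none remain.
leftover_csvs() {
  oc get csv -n "$1" -o name 2>/dev/null | grep -- "$2" || true
}
```

Run `leftover_csvs nvidia-gpu-operator gpu` and `leftover_csvs openshift-nfd nfd`, then remove each printed entry with `oc delete <name> -n <namespace>` if you want the operators fully uninstalled.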