Deployments advanced ⏱ 15 minutes K8s 1.28+

GitOps Bootstrap for Bare-Metal GPU Clusters

Bootstrap bare-metal GPU clusters with ArgoCD and Kustomize in air-gapped environments with NVIDIA GPU and Network Operators.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Use Ansible for the initial bare-metal handshake, then install OpenShift GitOps (ArgoCD) to reconcile the cluster against Git, which is the single source of truth. A root App-of-Apps pattern manages the GPU Operator, Network Operator, SR-IOV, storage classes, and tenant overlays, all from Git.

The Problem

Bootstrapping a bare-metal GPU cluster involves dozens of interdependent components: GPU Operator, Network Operator, SR-IOV, storage, RBAC, quotas, and tenant configs. Manual setup is error-prone, non-reproducible, and impossible to audit. In air-gapped environments, you also need local mirrors, custom CatalogSources, and CA trust chains.

The Solution

GitOps-first approach: the Git repository is the single source of truth, and ArgoCD syncs the desired state to the cluster. Every change is a PR, and every rollback is a git revert.

Repository Structure

gitops/
├── cluster-config/
│   ├── base/
│   │   ├── operators/
│   │   │   ├── gpu-operator.yaml        # NVIDIA GPU Operator subscription
│   │   │   ├── network-operator.yaml     # NVIDIA Network Operator
│   │   │   ├── sriov-operator.yaml       # SR-IOV Network Operator
│   │   │   └── kustomization.yaml
│   │   ├── infra/
│   │   │   ├── storageclasses.yaml       # PowerScale NFS, local NVMe
│   │   │   ├── network-attachments.yaml  # Multus networks
│   │   │   ├── machineconfigs.yaml       # Kernel params, modules
│   │   │   └── kustomization.yaml
│   │   └── kustomization.yaml
│   └── overlays/
│       └── prod/
│           ├── patches/
│           │   ├── gpu-operator-config.yaml
│           │   ├── quotas-tenant-alpha.yaml
│           │   ├── quotas-tenant-beta.yaml
│           │   └── oauth-config.yaml
│           └── kustomization.yaml
├── applications/                          # Helm apps per environment
│   ├── monitoring/
│   ├── logging/
│   └── values-prod.yaml
├── argocd/
│   ├── root-app.yaml                     # App-of-Apps entry point
│   ├── cluster-config-app.yaml
│   └── applicationsets.yaml              # Per-tenant ApplicationSets
└── README.md
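
The recipe does not show how the prod overlay ties base/ and patches/ together. A minimal sketch of what cluster-config/overlays/prod/kustomization.yaml could look like, assuming every file under patches/ is a strategic-merge patch for a resource that base/ already defines (a file that introduces a new object, such as a brand-new tenant quota, would belong under resources instead):

```yaml
# cluster-config/overlays/prod/kustomization.yaml
# Illustrative sketch only; the author's actual file is not shown in the recipe.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base            # base/kustomization.yaml is assumed to list operators/ and infra/

patches:
  # Each patch file must carry apiVersion, kind, and metadata.name
  # so Kustomize can match it to its target in base/.
  - path: patches/gpu-operator-config.yaml
  - path: patches/quotas-tenant-alpha.yaml
  - path: patches/quotas-tenant-beta.yaml
  - path: patches/oauth-config.yaml
```

Rendering the overlay locally with `oc kustomize cluster-config/overlays/prod` (or `kustomize build`) before opening a PR shows exactly what ArgoCD would apply.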

Step 1: Ansible Initial Bootstrap

# ansible/bootstrap-gitops.yaml
- name: Bootstrap OpenShift GitOps
  hosts: bastion
  tasks:
    - name: Install OpenShift GitOps operator
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: operators.coreos.com/v1alpha1
          kind: Subscription
          metadata:
            name: openshift-gitops-operator
            namespace: openshift-operators
          spec:
            channel: latest
            name: openshift-gitops-operator
            source: redhat-operators
            sourceNamespace: openshift-marketplace

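    - name: Wait for ArgoCD instance
      kubernetes.core.k8s_info:
        # api_version is needed for custom resources (the module defaults to v1);
        # older OpenShift GitOps releases serve argoproj.io/v1alpha1 instead
        api_version: argoproj.io/v1beta1
        kind: ArgoCD
        namespace: openshift-gitops
        name: openshift-gitops
      register: argocd
      until: argocd.resources | length > 0
      retries: 30
      delay: 10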

    - name: Configure Git repository
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Secret
          metadata:
            name: gpu-cluster-repo
            namespace: openshift-gitops
            labels:
              argocd.argoproj.io/secret-type: repository
          stringData:
            url: "https://git.internal.example.com/platform/gpu-gitops.git"
            username: argocd
            password: "{{ git_token }}"

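    - name: Apply root application
      kubernetes.core.k8s:
        state: present
        # Load the manifest from the Ansible controller; a bare `src:` path is
        # read on the target host. Assumes ansible/ sits next to argocd/ in the repo.
        definition: "{{ lookup('file', playbook_dir ~ '/../argocd/root-app.yaml') | from_yaml }}"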

Step 2: Root App-of-Apps

# argocd/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-cluster-root
  namespace: openshift-gitops
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  project: default
  source:
    repoURL: https://git.internal.example.com/platform/gpu-gitops.git
    targetRevision: main
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: openshift-gitops
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Step 3: Cluster Config Application

# argocd/cluster-config-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-config
  namespace: openshift-gitops
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  source:
    repoURL: https://git.internal.example.com/platform/gpu-gitops.git
    targetRevision: main
    path: cluster-config/overlays/prod
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
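
The repository structure lists argocd/applicationsets.yaml for per-tenant ApplicationSets, but the recipe does not show its contents. A minimal sketch using the list generator, assuming the two tenants implied by the quota patch names (tenant-alpha, tenant-beta) and a hypothetical tenants/<name> directory that is not part of the tree shown earlier:

```yaml
# argocd/applicationsets.yaml
# Illustrative sketch only; tenant paths and names are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: tenants
  namespace: openshift-gitops
  annotations:
    # The root app applies this ApplicationSet in wave 3
    argocd.argoproj.io/sync-wave: "3"
spec:
  generators:
    - list:
        elements:
          - tenant: tenant-alpha
          - tenant: tenant-beta
  template:
    metadata:
      name: '{{tenant}}'
    spec:
      project: default
      source:
        repoURL: https://git.internal.example.com/platform/gpu-gitops.git
        targetRevision: main
        path: 'tenants/{{tenant}}'     # hypothetical directory per tenant
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{tenant}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

The sync-wave annotation sits on the ApplicationSet itself because that is the object the root app applies; the generated Applications are created by the ApplicationSet controller, outside the root app's wave ordering.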

Air-Gap Configuration

# cluster-config/base/infra/imagedigestmirrorset.yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: gpu-operator-mirror
spec:
  imageDigestMirrors:
    - source: nvcr.io/nvidia
      mirrors:
        - quay.internal.example.com/nvidia-mirror
    - source: registry.k8s.io
      mirrors:
        - quay.internal.example.com/k8s-mirror
---
# Local CatalogSources
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: nvidia-gpu-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.internal.example.com/nvidia-mirror/gpu-operator-bundle-catalog:latest
  displayName: NVIDIA GPU Operator
  publisher: NVIDIA
  updateStrategy:
    registryPoll:
      interval: 30m
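
Once the local CatalogSource exists, the operator Subscription has to point at it instead of a public catalog. A sketch of what cluster-config/base/operators/gpu-operator.yaml might contain; the namespace, package name, and channel below follow common NVIDIA GPU Operator conventions on OpenShift but are assumptions here, so check them against the packages your mirrored catalog actually serves:

```yaml
# cluster-config/base/operators/gpu-operator.yaml
# Illustrative sketch only; namespace, package, and channel are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                        # assumed channel; pin one from your version matrix
  name: gpu-operator-certified           # assumed package name inside the mirrored catalog
  source: nvidia-gpu-operator-catalog    # the local CatalogSource defined above
  sourceNamespace: openshift-marketplace
```

Running `oc get packagemanifests -n openshift-marketplace` lists the packages and channels the local catalog exposes, which is a quick way to confirm the values before committing them.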

Bootstrap Flow

# Full bootstrap sequence.
# Steps 1-3 all run inside the Ansible playbook from Step 1:
#   1. Install the OpenShift GitOps operator
#   2. Register the Git repository with ArgoCD
#   3. Apply the root App-of-Apps
ansible-playbook -i inventory bootstrap-gitops.yaml

# 4. ArgoCD auto-syncs:
#    Wave 0: Root app
#    Wave 1: Cluster config (operators, infra)
#    Wave 2: Operator configs (ClusterPolicy, NicClusterPolicy)
#    Wave 3: Tenant namespaces, RBAC, quotas
#    Wave 4: Applications (monitoring, logging)

# Verify sync status
oc get applications -n openshift-gitops

graph TD
    A[Ansible Bootstrap] --> B[Install GitOps Operator]
    B --> C[Register Git Repo]
    C --> D[Apply Root App]
    D --> E[ArgoCD Syncs]
    
    E --> F[Wave 1: Operators]
    E --> G[Wave 2: Operator Config]
    E --> H[Wave 3: Tenants]
    E --> I[Wave 4: Applications]
    
    F --> J[GPU Operator]
    F --> K[Network Operator]
    F --> L[SR-IOV Operator]
    
    G --> M[ClusterPolicy]
    G --> N[NicClusterPolicy]
    
    H --> O[NS + RBAC + Quotas per tenant]
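
Wave 2 applies custom resources such as ClusterPolicy and NicClusterPolicy, whose CRDs exist only after the wave 1 operators have finished installing. A sketch of a wave 2 child Application that tolerates this, assuming a hypothetical file argocd/operator-config-app.yaml and a hypothetical operator-config/overlays/prod path (neither appears in the repository structure above):

```yaml
# argocd/operator-config-app.yaml
# Illustrative sketch only; file name and source path are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: operator-config
  namespace: openshift-gitops
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  project: default
  source:
    repoURL: https://git.internal.example.com/platform/gpu-gitops.git
    targetRevision: main
    path: operator-config/overlays/prod   # hypothetical path
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      # Skip the dry run for kinds whose CRD is not installed yet
      - SkipDryRunOnMissingResource=true
    retry:
      # Retry with backoff while the operators finish registering their CRDs
      limit: 5
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m
```

The retry block is a design choice rather than a requirement: without it, a sync that starts before the operator is ready fails once and waits for the next reconciliation.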

Common Issues

  • ArgoCD can't reach air-gapped Git: configure the internal Git URL and credentials in the ArgoCD repository secret, then verify network connectivity from the ArgoCD pod
  • CatalogSource image pull fails: ensure IDMS mirrors are configured before any CatalogSource references them, and check that the cluster trusts the Quay CA
  • Sync order wrong: use sync waves (argocd.argoproj.io/sync-wave) to order operator install before operator config
  • Self-heal reverts manual changes: this is intentional; all changes must go through Git
  • GPU Operator subscription pending: the CatalogSource may not have synced; check with oc get catalogsource -n openshift-marketplace
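
One cause of wrong sync order in an App-of-Apps setup is worth spelling out. Since Argo CD 1.8, the health of child Application objects is not assessed by default, so the root app may move on to the next wave before the previous wave's Application is actually healthy. The upstream documentation describes restoring that check with a Lua health customization. A sketch of how this might be expressed on the OpenShift GitOps instance, assuming your operator version exposes spec.resourceHealthChecks on the ArgoCD resource and does not already define this check:

```yaml
# Fragment of the ArgoCD custom resource for the openshift-gitops instance.
# Illustrative sketch only; verify the field against your operator version.
apiVersion: argoproj.io/v1beta1
kind: ArgoCD
metadata:
  name: openshift-gitops
  namespace: openshift-gitops
spec:
  resourceHealthChecks:
    - group: argoproj.io
      kind: Application
      check: |
        -- Report the child Application's own health to its parent
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.health ~= nil then
            hs.status = obj.status.health.status
            if obj.status.health.message ~= nil then
              hs.message = obj.status.health.message
            end
          end
        end
        return hs
```

With this in place, a child Application stays Progressing until its own resources are healthy, which is what lets the wave annotations on the child Applications act as real gates.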

Best Practices

  • Use Ansible only for the initial bootstrap; everything after that is GitOps
  • Use sync waves to order: operators → operator configs → tenants → apps
  • Air-gap: mirror all images to local Quay before bootstrap
  • Store a known-good version matrix in Git alongside the configs
  • Enable auto-prune and self-heal so drift is corrected automatically
  • Use Kustomize overlays for environment-specific patches (dev/staging/prod)
  • Test changes in a dev overlay before promoting to prod

Key Takeaways

  • Git commit → PR → ArgoCD sync = reproducible, auditable cluster state
  • Ansible handles the chicken-and-egg bootstrap; GitOps handles everything after
  • App-of-Apps pattern scales to manage operators, infra, and tenant configs
  • Air-gapped clusters need IDMS, CatalogSources, and CA trust before operator install
  • Rollback = git revert → ArgoCD auto-syncs previous known-good state
#gitops #argocd #bare-metal #gpu #air-gap #kustomize
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
