Security • advanced • 20 minutes • K8s 1.28+

AI/ML Security and Compliance on Kubernetes

Secure AI and ML workloads on Kubernetes with model encryption, data governance, audit logging, and network isolation for training jobs.

By Luca Berton • 5 min read

Quick Answer: Secure AI workloads with: (1) network isolation for training jobs (no egress except storage), (2) RBAC per ML team with GPU quotas, (3) model encryption at rest and in transit, (4) audit logging of model access and data downloads, (5) image scanning for ML framework CVEs.

The Problem

AI/ML workloads handle sensitive data (medical records, financial models, PII) and create valuable IP (trained models). In regulated industries (healthcare, finance, defense), you must demonstrate compliance with data governance, access control, and audit requirements, all while running on shared Kubernetes infrastructure.

The Solution

Network Isolation for Training Jobs

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      workload: training
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              workload: training
      ports:
        - port: 29500
          protocol: TCP
  egress:
    - to:
        - podSelector:
            matchLabels:
              workload: training
      ports:
        - port: 29500
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
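      ports:
        # DNS uses UDP by default and falls back to TCP for large responses
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    - to:
        - ipBlock:
            # narrow this range to the storage endpoint's actual address block
            cidr: 10.0.0.0/8
      ports:
        - port: 9000
          protocol: TCP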

Training pods can reach only each other on TCP port 29500 (the default PyTorch distributed rendezvous port, used to bootstrap NCCL jobs), cluster DNS, and the model storage endpoint on port 9000.
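
A NetworkPolicy restricts only the pods its podSelector matches, so pods in ml-training that lack the workload: training label are not covered by the policy above. One common complement, sketched here as an assumption and not part of the original recipe, is a namespace-wide default-deny policy. The name default-deny-all is illustrative.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # illustrative name, not from the original recipe
  namespace: ml-training
spec:
  podSelector: {}          # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
  # unless another policy (such as training-isolation) allows it.
```

Because NetworkPolicies are additive, training-isolation still opens its listed ports for training pods, while unlabeled pods in the namespace get no connectivity until a policy grants it.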

RBAC for ML Teams

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-engineer
  namespace: team-alpha
rules:
  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs", "tfjobs", "notebooks"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["create", "get", "list", "update"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-alpha
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    persistentvolumeclaims: "20"
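
A Role grants nothing until it is bound to a subject. The following is a minimal sketch of a RoleBinding for the ml-engineer Role; the binding name and the group name team-alpha-ml-engineers are hypothetical and would come from your own identity provider.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-engineer-binding          # hypothetical name
  namespace: team-alpha
subjects:
  - kind: Group
    name: team-alpha-ml-engineers    # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-engineer                  # the Role defined above
  apiGroup: rbac.authorization.k8s.io
```

Binding to a group instead of individual users keeps membership changes in the identity provider and out of cluster manifests.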

Model Encryption at Rest

apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
    - name: inference
      image: registry.example.com/ml/inference:1.0  # hypothetical image; the original omits it
      volumeMounts:
        - name: encrypted-models
          mountPath: /models
          readOnly: true
  volumes:
    - name: encrypted-models
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: model-vault
---
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: model-vault
spec:
  provider: vault
  parameters:
    vaultAddress: "https://vault.example.com"
    roleName: "model-reader"
    objects: |
      - objectName: "model-encryption-key"
        secretPath: "secret/data/ml/encryption"
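
Note that the manifests above deliver an encryption key from Vault into the pod (mounted at /models), not the model files themselves; the inference container would still have to decrypt the artifacts with that key. Where models live on PersistentVolumes, volume-level encryption can complement this. The sketch below assumes the AWS EBS CSI driver, which the original recipe does not name; the StorageClass name is hypothetical and other providers use different parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-models-sc        # hypothetical name
provisioner: ebs.csi.aws.com       # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3
  encrypted: "true"                # encrypt volumes provisioned from this class
  # kmsKeyId: <your KMS key ARN>   # optional; omit to use the account default key
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain              # keep model volumes when the claim is deleted
```

PersistentVolumeClaims that reference this class get encrypted volumes without any change to the pod spec.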

Audit Logging for Model Access

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    resources:
      - group: "serving.kserve.io"
        resources: ["inferenceservices"]
      - group: "kubeflow.org"
        resources: ["pytorchjobs", "notebooks"]
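    # Audit policy namespaces are matched literally; wildcards such as "ml-*" are not supported
    namespaces: ["ml-training", "team-alpha"]
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
    namespaces: ["ml-training", "team-alpha"]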
graph TD
    subgraph Security Layers
        RBAC[RBAC<br/>Per-team access control] --> NETPOL[NetworkPolicy<br/>Training isolation]
        NETPOL --> ENCRYPT[Encryption<br/>Models at rest + transit]
        ENCRYPT --> AUDIT[Audit Logging<br/>Who accessed what]
        AUDIT --> SCAN[Image Scanning<br/>ML framework CVEs]
    end
    
    subgraph Compliance
        HIPAA[HIPAA<br/>Healthcare] --> CONTROLS[Controls]
        SOC2[SOC2<br/>Cloud service] --> CONTROLS
        GDPR[GDPR<br/>EU data protection] --> CONTROLS
    end

Common Issues

Training pods can't communicate after applying the NetworkPolicy

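Ensure the rendezvous port (29500 by default for PyTorch distributed) is allowed between training pods. NCCL's socket transport can also open dynamically assigned TCP ports between workers, so a policy that permits only 29500 may let jobs start and then stall; if that happens, allow all ports between training pods. If you use RDMA (InfiniBand or RoCE), check whether that traffic crosses the pod network at all, and allow the relevant ports if it does.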
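
If training hangs even though port 29500 is open, one possible cause is that workers open additional, dynamically assigned ports to each other. The sketch below is an additive policy that allows all ports between pods labeled workload: training; the policy name is hypothetical, and it assumes you accept unrestricted traffic among training pods in the namespace.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-peer-all-ports   # hypothetical name
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      workload: training
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              workload: training
      # no ports listed: every port and protocol from peers is allowed
  egress:
    - to:
        - podSelector:
            matchLabels:
              workload: training
      # no ports listed: every port and protocol to peers is allowed
```

NetworkPolicies are additive, so this can sit alongside training-isolation without editing it. Traffic to anything other than training pods stays governed by the existing rules.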

Audit logs too verbose, storage filling up

Use level: Metadata instead of RequestResponse for most resources. Reserve RequestResponse for model-related resources such as InferenceServices and training jobs, and keep secrets at Metadata so that secret values never end up in the audit log.
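
Log rotation on the API server also bounds disk usage. The excerpt below is a sketch that assumes a kubeadm-style static pod manifest; the file paths and retention values are illustrative, and managed services (EKS, GKE, AKS) configure audit logging through their own settings instead.

```yaml
# Excerpt from /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout assumed)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --audit-policy-file=/etc/kubernetes/audit/policy.yaml   # assumed path
        - --audit-log-path=/var/log/kubernetes/audit/audit.log    # assumed path
        - --audit-log-maxage=7       # days to retain rotated log files
        - --audit-log-maxbackup=5    # number of rotated files to keep
        - --audit-log-maxsize=100    # megabytes per file before rotation
```

The policy file and the log directory must also be mounted into the API server pod, for example with hostPath volumes, or the API server will fail to start.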

Best Practices

  • Network isolation for every training namespace: prevent data exfiltration
  • RBAC + ResourceQuota per team: limit GPU access and prevent resource monopolization
  • Encrypt models at rest: models are IP, treat them like secrets
  • Audit all model access: who deployed, who queried, when
  • Scan ML images for CVEs: PyTorch and TensorFlow ship frequent security patches
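
The recipe does not name a scanner. As one possible approach, the sketch below runs Trivy as a one-off Job against a training image; the choice of Trivy, the Job name, and the image registry.example.com/ml/pytorch-train:1.0 are all assumptions made for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: scan-training-image        # hypothetical name
spec:
  backoffLimit: 0                  # do not retry a failed scan
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trivy
          image: aquasec/trivy:latest   # pin a specific version in practice
          args:
            - image
            - --severity
            - HIGH,CRITICAL
            - --exit-code
            - "1"                       # fail the Job when findings match the severity filter
            - registry.example.com/ml/pytorch-train:1.0   # hypothetical image to scan
```

The Job needs egress to the image registry and to Trivy's vulnerability database, so run it outside namespaces that restrict egress. In many setups the same command runs in CI before an image is pushed.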

Key Takeaways

  • AI/ML workloads need network isolation: training jobs should only reach storage and each other
  • RBAC per ML team with GPU quotas prevents resource monopolization
  • Encrypt models at rest and in transit: models are valuable IP
  • Audit logging of model access is typically required in regulated industries
  • Scan images for ML framework CVEs: PyTorch and TensorFlow release frequent patches
#ai-security #ml-compliance #model-encryption #data-governance
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
