Security Best Practices

This guide provides security best practices for deploying and operating Chaos Mesh in production environments.

Installation Security

1. Enable Security Mode

Always enable security features in production:

# Controller security
controllerManager:
  env:
    SECURITY_MODE: "true"  # Enable authorization validation
  chaosdSecurityMode: true  # Secure controller-chaosd communication (mTLS)

# Dashboard security
dashboard:
  securityMode: true  # Require user authentication

# Chaos daemon mTLS
chaosDaemon:
  mtls:
    enabled: true  # Secure controller-daemon communication

Relevant code:

pkg/config/controller.go:96
pkg/config/dashboard.go:42-43
helm/chaos-mesh/values.yaml:142,174-176,294-295
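Before an upgrade it can help to confirm that none of these switches has been turned off in the values file you are about to apply. The following is a rough sketch, not part of Chaos Mesh: the function name check_security_values is hypothetical, and it matches lines with grep rather than parsing YAML, so it only catches settings written on a single line.

```bash
#!/usr/bin/env bash
# Report security switches from this section that are explicitly set to false
# in a Helm values file. Grep-based sketch; it does not parse YAML structure.
check_security_values() {
  local file="$1" status=0 key
  for key in SECURITY_MODE securityMode chaosdSecurityMode; do
    # Match "key: false" or "key: "false"" preceded by start of line or whitespace.
    if grep -Eq "(^|[[:space:]])${key}:[[:space:]]*\"?false" "$file"; then
      echo "disabled: ${key}"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_security_values my-values.yaml prints one line per disabled switch and returns a non-zero status if any is found, so it can gate a CI step.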

2. Use Namespace-Scoped Mode

Limit the blast radius with a namespace-scoped deployment:

clusterScoped: false

controllerManager:
  targetNamespace: "staging"

Benefits:

Prevents accidental chaos in production namespaces
Reduces required RBAC permissions
Easier compliance and audit
Defense in depth

3. Enable Namespace Filtering

Require explicit opt-in for chaos injection:

controllerManager:
  enableFilterNamespace: true

Then annotate allowed namespaces:

kubectl annotate namespace staging chaos-mesh.org/inject=enabled
kubectl annotate namespace dev chaos-mesh.org/inject=enabled

Relevant code: pkg/config/controller.go:78-80
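When several namespaces need to opt in or out, the annotate command can be wrapped in small helpers. This is a minimal sketch that assumes kubectl already points at the target cluster; the function names allow_chaos_in and revoke_chaos_in are hypothetical.

```bash
#!/usr/bin/env bash
# Opt each namespace given as an argument in to chaos injection.
allow_chaos_in() {
  local ns
  for ns in "$@"; do
    # --overwrite keeps the call idempotent if the annotation already exists.
    kubectl annotate namespace "$ns" chaos-mesh.org/inject=enabled --overwrite
  done
}

# Withdraw the opt-in; a trailing "-" after the key removes an annotation.
revoke_chaos_in() {
  local ns
  for ns in "$@"; do
    kubectl annotate namespace "$ns" chaos-mesh.org/inject-
  done
}
```

For example, allow_chaos_in staging dev reproduces the two commands above, and revoke_chaos_in dev removes the opt-in from dev.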

Privileged vs Non-Privileged Mode

Chaos daemon can run in two modes:

Non-Privileged Mode (Recommended)

Uses specific Linux capabilities instead of full privileged mode:

chaosDaemon:
  privileged: false
  
  capabilities:
    add:
      - SYS_PTRACE    # Attach to processes
      - NET_ADMIN     # Network manipulation
      - NET_RAW       # Raw socket access
      - MKNOD         # Create device nodes
      - SYS_CHROOT    # Change root directory
      - SYS_ADMIN     # System administration
      - KILL          # Send signals
      - IPC_LOCK      # Lock memory

Source: helm/chaos-mesh/values.yaml:191-202

Benefits:

Reduced attack surface
Compliance with security policies
Suitable for most chaos types

Limitations:

Some kernel-level chaos may not work
BPFKI (eBPF) features require privileged mode

Privileged Mode

Grants full system access:

chaosDaemon:
  privileged: true

Use only when:

You need kernel-level chaos (KernelChaos)
You’re using BPFKI features
You’ve assessed the security implications

Source: helm/chaos-mesh/templates/chaos-daemon-rbac.yaml:69-118

RBAC Configuration

Implement Least Privilege

Create granular roles for different user types:

# Chaos engineers (full access to specific namespace)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineers
  namespace: staging
roleRef:
  kind: ClusterRole
  name: chaos-mesh-chaos-controller-manager-target-namespace
subjects:
  - kind: Group
    name: chaos-engineers
---
# Chaos viewers (read-only)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-viewer
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
# Network chaos only (limited chaos types)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: network-chaos-only
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["networkchaos", "dnschaos"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]

Avoid Cluster-Admin

Never grant chaos users cluster-admin permissions. Use the built-in roles:

chaos-controller-manager-target-namespace: For creating chaos experiments
chaos-dashboard-target-namespace: For dashboard access
Custom roles: For specific chaos types
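After binding roles, it is worth checking that each subject can do what you intended and nothing more. The sketch below wraps kubectl auth can-i. It relies on impersonation (--as), so the caller needs permission to impersonate, and the function name check_chaos_access is hypothetical.

```bash
#!/usr/bin/env bash
# Print whether a subject may perform a verb on a Chaos Mesh resource
# in a namespace, using impersonation.
check_chaos_access() {
  local subject="$1" namespace="$2" verb="$3" resource="$4"
  local answer
  # "kubectl auth can-i" prints yes or no and exits non-zero on "no",
  # so the exit status is ignored and only the printed answer is used.
  answer=$(kubectl auth can-i "$verb" "${resource}.chaos-mesh.org" \
    --as="$subject" --namespace="$namespace" 2>/dev/null) || true
  echo "${subject} ${verb} ${resource} in ${namespace}: ${answer:-unknown}"
}
```

For example, check_chaos_access system:serviceaccount:chaos-mesh:chaos-user staging create networkchaos reports whether that service account may create NetworkChaos objects in staging. Running the same check with a different resource, such as podchaos, confirms that a role limited to network chaos does not grant more.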

Network Security

1. mTLS for Component Communication

Enable mutual TLS between components:

chaosDaemon:
  mtls:
    enabled: true

controllerManager:
  chaosdSecurityMode: true

Chaos Mesh auto-generates certificates during installation. Certificates are stored as secrets in the chaos-mesh namespace.

Relevant code: helm/chaos-mesh/values.yaml:172-176,142

2. Webhook Configuration

Configure appropriate timeouts and failure policies:

webhook:
  timeoutSeconds: 5       # Fast failure for availability
  FailurePolicy: Fail     # Deny on webhook failure (secure default)

For high availability, consider:

webhook:
  FailurePolicy: Ignore   # Allow creation if webhook is down

Security trade-off: Ignore improves availability but may allow unauthorized experiments if the webhook is down.

Source: helm/chaos-mesh/values.yaml:554-558

3. Dashboard Network Exposure

Use secure methods to expose the dashboard:

Option 1: NodePort (Development)

dashboard:
  service:
    type: NodePort

Option 2: ClusterIP + Ingress (Production)

dashboard:
  service:
    type: ClusterIP
  
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    hosts:
      - name: chaos.example.com
        tls: true
        tlsSecret: chaos-dashboard-tls

Option 3: kubectl port-forward (Most Secure)

kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Access at http://localhost:2333

Runtime Security

1. Disable Host Network Testing

Prevent chaos on host-network pods:

controllerManager:
  allowHostNetworkTesting: false  # Default: false

Host-network pods have elevated privileges and can affect cluster networking. Only enable if absolutely necessary.

Relevant code: helm/chaos-mesh/values.yaml:60-61,104-105

2. Configure Resource Limits

Prevent resource exhaustion:

controllerManager:
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 25m
      memory: 256Mi

chaosDaemon:
  resourceProfile: "standard"  # or "intensive" for production
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 250m
      memory: 512Mi

dashboard:
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 25m
      memory: 256Mi

Source: helm/chaos-mesh/values.yaml:103-113,178-245,326-337

3. Enable Profiling Selectively

Disable pprof in production:

enableProfiling: false

Relevant code: helm/chaos-mesh/values.yaml:41-42

Authentication & Authorization

1. Use Service Account Tokens

Generate time-limited tokens for dashboard users:

# Create service account
kubectl create serviceaccount chaos-user -n chaos-mesh

# Bind to appropriate role
kubectl create rolebinding chaos-user \
  --clusterrole=chaos-mesh-chaos-controller-manager-target-namespace \
  --serviceaccount=chaos-mesh:chaos-user \
  --namespace=staging

# Generate 24-hour token
kubectl create token chaos-user -n chaos-mesh --duration=24h

2. Configure GCP OAuth (GKE)

For GKE clusters, use OAuth:

dashboard:
  gcpSecurityMode:
    enabled: true
    existingSecret: gcp-oauth-secret  # Create secret with GCP_CLIENT_ID and GCP_CLIENT_SECRET

Create the secret:

kubectl create secret generic gcp-oauth-secret -n chaos-mesh \
  --from-literal=GCP_CLIENT_ID="your-client-id.apps.googleusercontent.com" \
  --from-literal=GCP_CLIENT_SECRET="your-client-secret"

Relevant code: helm/chaos-mesh/values.yaml:297-302

3. Implement Token Rotation

Rotate service account tokens regularly:

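# Revoke outstanding tokens by recreating the service account
kubectl delete serviceaccount chaos-user -n chaos-mesh
kubectl create serviceaccount chaos-user -n chaos-mesh
# Role bindings reference the account by name and remain in place;
# verify them, then issue a fresh token with kubectl create token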

Audit and Monitoring

1. Enable Kubernetes Audit Logging

Configure API server to log chaos operations:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    omitStages: ["RequestReceived"]
    resources:
      - group: "chaos-mesh.org"
        resources: ["*"]

2. Monitor Authorization Events

Watch for authorization failures:

# Controller authorization logs
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=controller-manager \
  | grep "auth validate"

# Dashboard authorization logs
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-dashboard \
  | grep "forbidden"

3. Track Chaos Experiments

Monitor active chaos:

# List PodChaos, NetworkChaos, IOChaos, and StressChaos experiments
kubectl get podchaos,networkchaos,iochaos,stresschaos --all-namespaces

# List events emitted for Chaos Mesh objects
kubectl get events --all-namespaces --field-selector involvedObject.apiVersion=chaos-mesh.org/v1alpha1
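The listing above names four kinds explicitly, so experiments of any other kind would not appear. One way to cover every namespaced resource in the chaos-mesh.org API group is to ask the API server which kinds are registered. This is a sketch under the assumption that the Chaos Mesh CRDs are installed; the function name list_all_chaos is hypothetical.

```bash
#!/usr/bin/env bash
# List objects of every namespaced resource type in the chaos-mesh.org group.
list_all_chaos() {
  local kinds
  # "-o name" prints one resource per line, e.g. podchaos.chaos-mesh.org;
  # paste joins the lines into the comma-separated form kubectl get expects.
  kinds=$(kubectl api-resources --api-group=chaos-mesh.org --namespaced=true -o name | paste -sd, -)
  if [ -z "$kinds" ]; then
    echo "no chaos-mesh.org resources registered" >&2
    return 1
  fi
  kubectl get "$kinds" --all-namespaces
}
```

The output also includes non-experiment resources in the group, such as schedules and workflows, which is usually what you want when checking for anything that could inject chaos.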

Data Security

1. Secure Dashboard Database

Use secrets for database credentials:

dashboard:
  databaseSecretName: "chaos-dashboard-db-secret"
  env:
    DATABASE_DRIVER: "mysql"
    # DATABASE_DATASOURCE omitted - loaded from secret

Create the secret:

kubectl create secret generic chaos-dashboard-db-secret -n chaos-mesh \
  --from-literal=DATABASE_DATASOURCE="user:password@tcp(mysql:3306)/chaos?parseTime=true"

Relevant code: helm/chaos-mesh/values.yaml:270-272,379-383

2. Configure Data Retention

Limit data retention:

dashboard:
  env:
    CLEAN_SYNC_PERIOD: "12h"
    TTL_EVENT: "168h"       # 7 days
    TTL_EXPERIMENT: "336h"  # 14 days
    TTL_SCHEDULE: "336h"    # 14 days
    TTL_WORKFLOW: "336h"    # 14 days

Source: helm/chaos-mesh/values.yaml:385-394
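The TTL values are written in hours, which makes policies stated in days easy to get wrong. A small sketch of the conversion in both directions follows; the function names are hypothetical, and it assumes whole-hour values with an "h" suffix as shown above.

```bash
#!/usr/bin/env bash
# Convert an hour-based TTL such as "168h" to whole days (integer division).
hours_to_days() {
  local hours="${1%h}"   # strip the trailing "h" unit
  echo $(( hours / 24 ))
}

# Convert a retention period in days to the hour-based TTL form.
days_to_ttl() {
  echo "$(( $1 * 24 ))h"
}
```

For example, hours_to_days 336h prints 14, matching the comment on TTL_EXPERIMENT, and days_to_ttl 30 prints 720h for a 30-day policy.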

3. Secure Persistent Volumes

Encrypt dashboard storage:

dashboard:
  persistentVolume:
    enabled: true
    size: 8Gi
    storageClassName: encrypted-ssd  # Use encrypted storage class

Container Security

1. Use Security Contexts

Configure security contexts:

controllerManager:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault

dashboard:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault

Source: helm/chaos-mesh/values.yaml:56-57,168-169,275-276

2. Use Private Container Registries

Pull images from private registries:

images:
  registry: "your-registry.example.com"
  tag: "v2.6.0"

imagePullSecrets:
  - name: regcred

Create the secret:

kubectl create secret docker-registry regcred -n chaos-mesh \
  --docker-server=your-registry.example.com \
  --docker-username=your-user \
  --docker-password=your-password

Source: helm/chaos-mesh/values.yaml:44-53

3. Scan Images for Vulnerabilities

Regularly scan Chaos Mesh images:

# Using Trivy
trivy image ghcr.io/chaos-mesh/chaos-mesh:latest
trivy image ghcr.io/chaos-mesh/chaos-daemon:latest
trivy image ghcr.io/chaos-mesh/chaos-dashboard:latest
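Scanning the latest tag does not necessarily scan what is deployed. A sketch that scans the three images at a pinned tag and fails when high or critical findings are reported is shown below. The function name scan_chaos_images is hypothetical, and the image names are taken from the commands above.

```bash
#!/usr/bin/env bash
# Scan the Chaos Mesh images at one pinned tag. Returns non-zero if Trivy
# reports HIGH or CRITICAL findings (or fails) for any image.
scan_chaos_images() {
  local tag="$1" image failed=0
  for image in chaos-mesh chaos-daemon chaos-dashboard; do
    # --exit-code 1 makes Trivy exit non-zero when matching findings exist.
    trivy image --severity HIGH,CRITICAL --exit-code 1 \
      "ghcr.io/chaos-mesh/${image}:${tag}" || failed=1
  done
  return "$failed"
}
```

For example, scan_chaos_images v2.6.0 scans the tag used in the registry example above. All three images are scanned even if the first one fails, so one run produces a complete report.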

Multi-Tenancy

1. Isolate Chaos Mesh Instances

Deploy separate Chaos Mesh instances per team:

# Team A
helm install chaos-mesh-team-a chaos-mesh/chaos-mesh -n chaos-mesh-team-a \
  --set clusterScoped=false \
  --set controllerManager.targetNamespace=team-a-apps

# Team B
helm install chaos-mesh-team-b chaos-mesh/chaos-mesh -n chaos-mesh-team-b \
  --set clusterScoped=false \
  --set controllerManager.targetNamespace=team-b-apps

2. Use NetworkPolicies

Restrict network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-mesh-isolation
  namespace: chaos-mesh
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: chaos-mesh
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: chaos-mesh  # label set automatically by Kubernetes 1.21+
  egress:
    - to:
      - namespaceSelector: {}
    - to:
      - podSelector: {}
    - ports:
      - protocol: TCP
        port: 53
      - protocol: UDP
        port: 53

Compliance and Governance

1. Implement Change Approval

Require approval for chaos experiments:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-production
  namespace: production
  annotations:
    approval-ticket: "JIRA-1234"
    approved-by: "senior-sre@example.com"
    approval-date: "2026-03-04"
spec:
  # ...

2. Document Chaos Experiments

Use labels and annotations:

metadata:
  labels:
    chaos-type: "network"
    severity: "high"
    team: "platform"
  annotations:
    description: "Simulates network partition between services"
    runbook: "https://wiki.example.com/chaos/network-partition"
    oncall: "platform-team@example.com"

3. Schedule Chaos During Business Hours

Use schedules to control when chaos runs:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekday-chaos
spec:
  schedule: "0 10 * * 1-5"  # 10 AM weekdays
  type: PodChaos
  # ...

Incident Response

1. Prepare Rollback Procedures

Document how to stop chaos:

# Pause all PodChaos experiments in the namespace
kubectl annotate podchaos --all experiment.chaos-mesh.org/pause=true -n production

# Delete all PodChaos, NetworkChaos, IOChaos, and StressChaos experiments
kubectl delete podchaos,networkchaos,iochaos,stresschaos --all -n production

# Restart affected pods
kubectl rollout restart deployment/my-app -n production
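The pause command above targets a single kind. During an incident you may want to pause several kinds in one step; the sketch below applies the same annotation to each kind passed as an argument. The function name pause_chaos is hypothetical.

```bash
#!/usr/bin/env bash
# Pause every experiment of the given kinds in one namespace.
# Usage: pause_chaos NAMESPACE KIND [KIND...]
pause_chaos() {
  local namespace="$1" kind
  shift
  for kind in "$@"; do
    # --overwrite also covers experiments that already carry the annotation.
    kubectl annotate "$kind" --all experiment.chaos-mesh.org/pause=true \
      --namespace "$namespace" --overwrite
  done
}
```

For example, pause_chaos production podchaos networkchaos iochaos stresschaos pauses the same four kinds that the delete command removes, while leaving the objects in place for later review.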

2. Emergency Dashboard Access

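Keep an emergency (break-glass) service account. Because it is bound to cluster-admin, which this guide otherwise advises against, restrict who can retrieve the token and audit every use: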

# Create break-glass account
kubectl create serviceaccount chaos-admin -n chaos-mesh
kubectl create clusterrolebinding chaos-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=chaos-mesh:chaos-admin

# Store token securely (e.g., in a password manager)
kubectl create token chaos-admin -n chaos-mesh --duration=168h

3. Monitor Blast Radius

Set up alerts for unexpected chaos:

# Prometheus alert example
groups:
  - name: chaos-mesh
    rules:
      - alert: UnexpectedChaosExperiment
        expr: count(chaos_mesh_experiments{namespace="production"}) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Unexpected chaos in production namespace"

Security Checklist

Before deploying to production, review each practice described in this guide.

Related Configuration

See Authorization for authentication and authorization details
See RBAC for role-based access control configuration
