This guide provides security best practices for deploying and operating Chaos Mesh in production environments.
Installation Security
1. Enable Security Mode
Always enable security features in production:
# Controller security
controllerManager:
  env:
    SECURITY_MODE: "true" # Enable authorization validation
  chaosdSecurityMode: true # Secure controller-chaosd communication (mTLS)
# Dashboard security
dashboard:
  securityMode: true # Require user authentication
# Chaos daemon mTLS
chaosDaemon:
  mtls:
    enabled: true # Secure controller-daemon communication
Relevant code:
pkg/config/controller.go:96
pkg/config/dashboard.go:42-43
helm/chaos-mesh/values.yaml:142,174-176,294-295
2. Use Namespace-Scoped Mode
Limit the blast radius by using namespace-scoped deployment:
clusterScoped: false
controllerManager:
  targetNamespace: "staging"
Benefits:
- Prevents accidental chaos in production namespaces
- Reduces required RBAC permissions
- Easier compliance and audit
- Defense in depth
3. Enable Namespace Filtering
Require explicit opt-in for chaos injection:
controllerManager:
  enableFilterNamespace: true
Then annotate allowed namespaces:
kubectl annotate namespace staging chaos-mesh.org/inject=enabled
kubectl annotate namespace dev chaos-mesh.org/inject=enabled
Relevant code: pkg/config/controller.go:78-80
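If namespaces are managed declaratively (for example through GitOps), the same opt-in can be kept in the Namespace manifest rather than applied by hand. A minimal sketch, reusing the `staging` namespace and the annotation from the commands above:

```yaml
# Declarative equivalent of:
#   kubectl annotate namespace staging chaos-mesh.org/inject=enabled
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  annotations:
    chaos-mesh.org/inject: enabled
```

Keeping the annotation in version control makes the list of opted-in namespaces reviewable alongside other cluster changes.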
Privileged vs Non-Privileged Mode
Chaos daemon can run in two modes:
Non-Privileged Mode (Recommended)
Uses specific Linux capabilities instead of full privileged mode:
chaosDaemon:
  privileged: false
  capabilities:
    add:
      - SYS_PTRACE # Attach to processes
      - NET_ADMIN # Network manipulation
      - NET_RAW # Raw socket access
      - MKNOD # Create device nodes
      - SYS_CHROOT # Change root directory
      - SYS_ADMIN # System administration
      - KILL # Send signals
      - IPC_LOCK # Lock memory
Source: helm/chaos-mesh/values.yaml:191-202
Benefits:
- Reduced attack surface
- Compliance with security policies
- Suitable for most chaos types
Limitations:
- Some kernel-level chaos may not work
- BPFKI (eBPF) features require privileged mode
Privileged Mode
Grants full system access:
chaosDaemon:
  privileged: true
Use only when:
- You need kernel-level chaos (KernelChaos)
- You’re using BPFKI features
- You’ve assessed the security implications
Source: helm/chaos-mesh/templates/chaos-daemon-rbac.yaml:69-118
RBAC Configuration
Implement Least Privilege
Create granular roles for different user types:
# Chaos engineers (full access to specific namespace)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineers
  namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io # Required field on roleRef
  kind: ClusterRole
  name: chaos-mesh-chaos-controller-manager-target-namespace
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: chaos-engineers
---
# Chaos viewers (read-only)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-viewer
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
# Network chaos only (limited chaos types)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: network-chaos-only
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["networkchaos", "dnschaos"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
Avoid Cluster-Admin
Never grant chaos users cluster-admin permissions. Use the built-in roles:
- chaos-controller-manager-target-namespace: For creating chaos experiments
- chaos-dashboard-target-namespace: For dashboard access
- Custom roles: For specific chaos types
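A custom Role grants nothing until it is bound to a subject. As a sketch, the read-only `chaos-viewer` Role defined earlier could be bound to a group as follows; the group name `chaos-viewers` is hypothetical and should match whatever your identity provider reports:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-viewers
  namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-viewer # Role from the least-privilege examples
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: chaos-viewers # hypothetical group name
```

Because this is a RoleBinding rather than a ClusterRoleBinding, the access stays confined to the `staging` namespace.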
Network Security
1. mTLS for Component Communication
Enable mutual TLS between components:
chaosDaemon:
  mtls:
    enabled: true
controllerManager:
  chaosdSecurityMode: true
Chaos Mesh auto-generates certificates during installation. Certificates are stored as secrets in the chaos-mesh namespace.
Relevant code: helm/chaos-mesh/values.yaml:172-176,142
2. Webhook Configuration
Configure appropriate timeouts and failure policies:
webhook:
  timeoutSeconds: 5 # Fast failure for availability
  FailurePolicy: Fail # Deny on webhook failure (secure default)
For high availability, consider:
webhook:
  FailurePolicy: Ignore # Allow creation if webhook is down
Security trade-off: Ignore improves availability but may allow unauthorized experiments if the webhook is down.
Source: helm/chaos-mesh/values.yaml:554-558
3. Dashboard Network Exposure
Use secure methods to expose the dashboard:
Option 1: NodePort (Development)
dashboard:
  service:
    type: NodePort
Option 2: ClusterIP + Ingress (Production)
dashboard:
  service:
    type: ClusterIP
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    hosts:
      - name: chaos.example.com
        tls: true
        tlsSecret: chaos-dashboard-tls
Option 3: kubectl port-forward (Most Secure)
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
Access at http://localhost:2333
Runtime Security
1. Disable Host Network Testing
Prevent chaos on host-network pods:
controllerManager:
  allowHostNetworkTesting: false # Default: false
Host-network pods have elevated privileges and can affect cluster networking. Only enable if absolutely necessary.
Relevant code: helm/chaos-mesh/values.yaml:60-61,104-105
2. Set Resource Limits
Prevent resource exhaustion:
controllerManager:
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 25m
      memory: 256Mi
chaosDaemon:
  resourceProfile: "standard" # or "intensive" for production
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 250m
      memory: 512Mi
dashboard:
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 25m
      memory: 256Mi
Source: helm/chaos-mesh/values.yaml:103-113,178-245,326-337
3. Enable Profiling Selectively
Disable pprof in production:
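A minimal sketch, assuming the chart exposes profiling through a top-level `enableProfiling` value; confirm the key name against the values.yaml of the chart version you deploy:

```yaml
# Assumption: enableProfiling is the chart-level switch for the pprof
# endpoints of the controller manager and chaos daemon.
enableProfiling: false
```

If profiling is needed for troubleshooting, it can be re-enabled temporarily in a non-production environment.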
Relevant code: helm/chaos-mesh/values.yaml:41-42
Authentication & Authorization
1. Use Service Account Tokens
Generate time-limited tokens for dashboard users:
# Create service account
kubectl create serviceaccount chaos-user -n chaos-mesh
# Bind to appropriate role
kubectl create rolebinding chaos-user \
  --clusterrole=chaos-mesh-chaos-controller-manager-target-namespace \
  --serviceaccount=chaos-mesh:chaos-user \
  --namespace=staging
# Generate 24-hour token
kubectl create token chaos-user -n chaos-mesh --duration=24h
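The service account and its binding can also be kept as manifests so that access is reviewed like any other change. The following sketch mirrors the two `kubectl create` commands above, using the same names and namespaces; the token itself is still issued on demand with `kubectl create token`:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-user
  namespace: chaos-mesh
---
# Grants the service account chaos permissions in the staging namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-user
  namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-mesh-chaos-controller-manager-target-namespace
subjects:
  - kind: ServiceAccount
    name: chaos-user
    namespace: chaos-mesh
```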
2. Use GCP OAuth on GKE
For GKE clusters, use OAuth:
dashboard:
  gcpSecurityMode:
    enabled: true
    existingSecret: gcp-oauth-secret # Create secret with GCP_CLIENT_ID and GCP_CLIENT_SECRET
Create the secret:
kubectl create secret generic gcp-oauth-secret -n chaos-mesh \
  --from-literal=GCP_CLIENT_ID="your-client-id.apps.googleusercontent.com" \
  --from-literal=GCP_CLIENT_SECRET="your-client-secret"
Relevant code: helm/chaos-mesh/values.yaml:297-302
3. Implement Token Rotation
Rotate service account tokens regularly:
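# Revoke old tokens by recreating the service account
kubectl delete serviceaccount chaos-user -n chaos-mesh
kubectl create serviceaccount chaos-user -n chaos-mesh
# Role bindings reference the service account by name, so they apply to the
# recreated account; verify them, then issue a new token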
Audit and Monitoring
1. Enable Kubernetes Audit Logging
Configure API server to log chaos operations:
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    omitStages: ["RequestReceived"]
    resources:
      - group: "chaos-mesh.org"
        resources: ["*"]
2. Monitor Authorization Events
Watch for authorization failures:
# Controller authorization logs
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=controller-manager \
  | grep "auth validate"
# Dashboard authorization logs
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-dashboard \
  | grep "forbidden"
3. Track Chaos Experiments
Monitor active chaos:
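# List experiments of the most common chaos kinds (add other kinds as needed)
kubectl get podchaos,networkchaos,iochaos,stresschaos --all-namespaces
# List events recorded for Chaos Mesh objects
# (events can be filtered by involvedObject.apiVersion, not by an apiGroup field)
kubectl get events --all-namespaces --field-selector involvedObject.apiVersion=chaos-mesh.org/v1alpha1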
Data Security
1. Secure Dashboard Database
Use secrets for database credentials:
dashboard:
  databaseSecretName: "chaos-dashboard-db-secret"
  env:
    DATABASE_DRIVER: "mysql"
    # DATABASE_DATASOURCE omitted - loaded from secret
Create the secret:
kubectl create secret generic chaos-dashboard-db-secret -n chaos-mesh \
  --from-literal=DATABASE_DATASOURCE="user:password@tcp(mysql:3306)/chaos?parseTime=true"
Relevant code: helm/chaos-mesh/values.yaml:270-272,379-383
2. Limit Data Retention
Limit data retention:
dashboard:
  env:
    CLEAN_SYNC_PERIOD: "12h"
    TTL_EVENT: "168h" # 7 days
    TTL_EXPERIMENT: "336h" # 14 days
    TTL_SCHEDULE: "336h" # 14 days
    TTL_WORKFLOW: "336h" # 14 days
Source: helm/chaos-mesh/values.yaml:385-394
3. Secure Persistent Volumes
Encrypt dashboard storage:
dashboard:
  persistentVolume:
    enabled: true
    size: 8Gi
    storageClassName: encrypted-ssd # Use encrypted storage class
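The `encrypted-ssd` class has to exist in the cluster, and how encryption is enabled depends on the storage provider. As an illustration only, assuming a cluster that uses the AWS EBS CSI driver, such a class might look like this; other providers use different provisioners and parameters:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-ssd
provisioner: ebs.csi.aws.com # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3
  encrypted: "true" # volumes are encrypted at rest with the default KMS key
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```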
Container Security
1. Use Security Contexts
Configure security contexts:
controllerManager:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
dashboard:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
Source: helm/chaos-mesh/values.yaml:56-57,168-169,275-276
2. Use Private Container Registries
Pull images from private registries:
images:
  registry: "your-registry.example.com"
  tag: "v2.6.0"
imagePullSecrets:
  - name: regcred
Create the secret:
kubectl create secret docker-registry regcred -n chaos-mesh \
  --docker-server=your-registry.example.com \
  --docker-username=your-user \
  --docker-password=your-password
Source: helm/chaos-mesh/values.yaml:44-53
3. Scan Images for Vulnerabilities
Regularly scan Chaos Mesh images:
# Using Trivy; scan the tag you actually deploy rather than a floating "latest"
trivy image ghcr.io/chaos-mesh/chaos-mesh:v2.6.0
trivy image ghcr.io/chaos-mesh/chaos-daemon:v2.6.0
trivy image ghcr.io/chaos-mesh/chaos-dashboard:v2.6.0
Multi-Tenancy
1. Isolate Chaos Mesh Instances
Deploy separate Chaos Mesh instances per team:
# Team A
helm install chaos-mesh-team-a chaos-mesh/chaos-mesh -n chaos-mesh-team-a \
  --set clusterScoped=false \
  --set controllerManager.targetNamespace=team-a-apps
# Team B
helm install chaos-mesh-team-b chaos-mesh/chaos-mesh -n chaos-mesh-team-b \
  --set clusterScoped=false \
  --set controllerManager.targetNamespace=team-b-apps
2. Use NetworkPolicies
Restrict network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-mesh-isolation
  namespace: chaos-mesh
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: chaos-mesh
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically by Kubernetes on every namespace
              kubernetes.io/metadata.name: chaos-mesh
  egress:
    - to:
        - namespaceSelector: {}
    - to:
        - podSelector: {}
    - ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
Compliance and Governance
1. Implement Change Approval
Require approval for chaos experiments:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-production
  namespace: production
  annotations:
    approval-ticket: "JIRA-1234"
    approved-by: "senior-sre@example.com"
    approval-date: "2026-03-04"
spec:
  # ...
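Annotations on their own are only a convention; nothing stops an experiment from being created without them. One way to enforce the convention, sketched here under the assumption that the cluster serves the `admissionregistration.k8s.io/v1` ValidatingAdmissionPolicy API (Kubernetes 1.30 or later), is to reject Chaos Mesh objects in `production` that lack the `approval-ticket` annotation. The policy and binding names are hypothetical:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-chaos-approval # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["chaos-mesh.org"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
  validations:
    # Reject objects that carry no approval-ticket annotation.
    - expression: "has(object.metadata.annotations) && 'approval-ticket' in object.metadata.annotations"
      message: "Chaos experiments in this namespace require an approval-ticket annotation."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-chaos-approval-production # hypothetical name
spec:
  policyName: require-chaos-approval
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: production
```

This checks only that the annotation is present, not that the ticket is valid; verifying the ticket would need an external admission webhook or a pipeline check.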
2. Document Chaos Experiments
Use labels and annotations:
metadata:
  labels:
    chaos-type: "network"
    severity: "high"
    team: "platform"
  annotations:
    description: "Simulates network partition between services"
    runbook: "https://wiki.example.com/chaos/network-partition"
    oncall: "platform-team@example.com"
3. Schedule Chaos During Business Hours
Use schedules to control when chaos runs:
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekday-chaos
spec:
  schedule: "0 10 * * 1-5" # 10 AM weekdays
  type: PodChaos
  # ...
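For reference, a fuller version of such a schedule might look like the sketch below. The target namespace, the `app: my-app` label, and the `pod-kill` action are illustrative assumptions, not values taken from this guide:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekday-chaos
  namespace: staging # illustrative namespace
spec:
  schedule: "0 10 * * 1-5" # 10 AM on weekdays
  concurrencyPolicy: Forbid # do not start a run while the previous one is active
  historyLimit: 2 # keep only the two most recent runs
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one # affect a single matching pod per run
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: my-app # illustrative label
```

Setting `concurrencyPolicy: Forbid` and `mode: one` keeps each scheduled run small, which fits the goal of limiting blast radius.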
Incident Response
1. Prepare Rollback Procedures
Document how to stop chaos:
# Pause all PodChaos experiments (repeat for other chaos kinds in use)
kubectl annotate podchaos --all experiment.chaos-mesh.org/pause=true -n production
# Delete all chaos experiments
kubectl delete podchaos,networkchaos,iochaos,stresschaos --all -n production
# Restart affected pods
kubectl rollout restart deployment/my-app -n production
2. Emergency Dashboard Access
Keep an emergency service account:
# Create break-glass account
kubectl create serviceaccount chaos-admin -n chaos-mesh
kubectl create clusterrolebinding chaos-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=chaos-mesh:chaos-admin
# Store token securely (e.g., in a password manager)
kubectl create token chaos-admin -n chaos-mesh --duration=168h
3. Monitor Blast Radius
Set up alerts for unexpected chaos:
# Prometheus alert example
groups:
  - name: chaos-mesh
    rules:
      - alert: UnexpectedChaosExperiment
        expr: count(chaos_mesh_experiments{namespace="production"}) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Unexpected chaos in production namespace"
Security Checklist
Before deploying to production:
- See Authorization for authentication and authorization details
- See RBAC for role-based access control configuration