Chaos Mesh is built from three primary components that work together to orchestrate chaos experiments in Kubernetes environments.
Chaos Controller Manager
The Chaos Controller Manager is the brain of Chaos Mesh, running as a Kubernetes deployment that orchestrates all chaos experiments.
Responsibilities
CRD Management
Watches and reconciles 14+ chaos CRD types and orchestration resources
Experiment Scheduling
Manages lifecycle, timing, and state transitions of chaos experiments
Target Selection
Processes selectors to identify which pods/resources to target
Status Tracking
Maintains experiment status, conditions, and records
Architecture
Entry Point: cmd/controller-manager/main.go
Core Logic: controllers/
The controller manager runs multiple controllers built on the Kubernetes controller-runtime library:
Controller Types
Chaos Type Controllers (controllers/chaosimpl/):
- awschaos/ - AWS fault injection (EC2, EBS)
- azurechaos/ - Azure fault injection (VM, Disks)
- blockchaos/ - Block device I/O faults
- dnschaos/ - DNS resolution errors
- gcpchaos/ - GCP fault injection (Compute Engine, Disks)
- httpchaos/ - HTTP request/response manipulation
- iochaos/ - File system I/O faults
- jvmchaos/ - JVM-level fault injection
- kernelchaos/ - Kernel fault injection via BPF
- networkchaos/ - Network faults (delay, loss, partition, etc.)
- physicalmachinechaos/ - Physical/VM machine faults
- podchaos/ - Pod lifecycle faults
- stresschaos/ - CPU and memory stress
- timechaos/ - Clock skew simulation
Orchestration Controllers:
- Workflow Controller - Multi-step chaos scenarios
- Scheduler Controller - Scheduled/recurring experiments
- StatusCheck Controller - Health validation
Design Principles
Controllers in Chaos Mesh follow strict design principles (controllers/README.md):
One Controller Per Field
Each field is controlled by at most one controller to avoid conflicts. This prevents race conditions and makes the system predictable.
Example: The pause annotation and duration are handled by the same controller because both affect the desiredPhase field.
Standalone Operation
Controllers work independently without depending on other controllers. Each controller can be understood and debugged in isolation.
Simple Behavior
Controller logic should be describable in ~100 words. Complex logic should be split into multiple controllers or a new CRD.
Error Handling with Backoff
For retriable errors, return ctrl.Result{Requeue: true}, nil to leverage exponential backoff:
- Start delay: 5ms
- Max delay: 1000s
- Overall rate: 10 qps
Key Packages
Selector Processing: pkg/selector/
- Evaluates label selectors, namespace selectors, field selectors
- Determines which pods match the experiment criteria
- Implements selector modes (one, all, fixed, fixed-percent, random-max-percent)
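The selector modes listed above can be illustrated with a small sketch. This is not the pkg/selector API: selectTargets is a hypothetical helper, pods are plain strings, and rounding percentages down is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectTargets picks a subset of pods according to mode.
// value is a count for "fixed" and a percentage (0-100) for the percent modes.
func selectTargets(pods []string, mode string, value int) ([]string, error) {
	n := len(pods)
	count := 0
	switch mode {
	case "one":
		count = 1
	case "all":
		count = n
	case "fixed":
		if value < 0 {
			return nil, fmt.Errorf("fixed value must be >= 0, got %d", value)
		}
		count = value
	case "fixed-percent", "random-max-percent":
		if value < 0 || value > 100 {
			return nil, fmt.Errorf("percentage must be in 0..100, got %d", value)
		}
		pct := value
		if mode == "random-max-percent" {
			pct = rand.Intn(value + 1) // random percentage up to the maximum
		}
		count = n * pct / 100 // assumption: round down
	default:
		return nil, fmt.Errorf("unknown selector mode %q", mode)
	}
	if count > n {
		count = n
	}
	// Shuffle a copy so the caller's slice is left untouched.
	shuffled := append([]string(nil), pods...)
	rand.Shuffle(n, func(i, j int) { shuffled[i], shuffled[j] = shuffled[j], shuffled[i] })
	return shuffled[:count], nil
}

func main() {
	pods := []string{"web-0", "web-1", "web-2", "web-3"}
	for _, mode := range []string{"one", "all", "fixed-percent"} {
		got, err := selectTargets(pods, mode, 50)
		fmt.Println(mode, got, err)
	}
}
```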
Metrics: pkg/metrics/
- Prometheus metrics for experiments and controller operations
- Tracks injection counts, failures, and durations
Webhooks: api/v1alpha1/*_webhook.go
- Validates CRD objects before admission
- Applies defaults and performs semantic validation
- Ensures experiments are well-formed
Controller Reconciliation Flow
Chaos Daemon
The Chaos Daemon is the execution engine that performs actual fault injection on target pods. It runs as a DaemonSet on every Kubernetes node.
Responsibilities
Fault Injection
Executes low-level chaos actions in pod namespaces
Namespace Access
Enters target pod namespaces to manipulate network, filesystem, and processes
Runtime Integration
Interfaces with Docker, containerd, and CRI-O container runtimes
gRPC Server
Exposes fault injection APIs to Controller Manager
Architecture
Entry Point: cmd/chaos-daemon/main.go
Core Logic: pkg/chaosdaemon/
Protocol: pkg/chaosdaemon/pb/chaosdaemon.proto
Privileged Operations
Fault Injection Mechanisms
Network Chaos
Implementation: pkg/chaosdaemon/tc_server.go, pkg/chaosdaemon/netem/
Technologies:
- tc (traffic control): Linux kernel’s traffic control subsystem
- netem: Network emulation using tc qdisc
- iptables: Packet filtering and manipulation
- ipset: IP address set management for efficient filtering
Actions:
- Delay: Add latency with optional jitter and correlation
- Loss: Drop packets with configurable percentage
- Duplicate: Duplicate packets
- Corrupt: Corrupt packet data
- Partition: Block traffic between pods/services
- Bandwidth: Limit bandwidth with token bucket
Related: pkg/chaosdaemon/iptables_server.go, pkg/chaosdaemon/ipset_server.go
I/O Chaos
Implementation: pkg/chaosdaemon/iochaos_server.go
Technology: FUSE (Filesystem in Userspace)
Actions:
- Latency: Delay I/O operations
- Fault: Return errors on I/O operations
- AttrOverride: Modify file attributes
- Mistake: Inject incorrect data
How it works:
- Mount FUSE filesystem over target volume path
- Intercept I/O operations
- Inject faults based on configuration
- Pass through or modify operations
HTTP Chaos
Implementation: pkg/chaosdaemon/httpchaos_server.go
Technology: Transparent proxy with TLS interception
Actions:
- Abort: Return error responses
- Delay: Add latency to requests/responses
- Replace: Modify request/response data
- Patch: Add/modify headers
Configuration: pkg/chaosdaemon/tproxyconfig/config.go
DNS Chaos
Implementation: pkg/chaosdaemon/dns_server.go
Technology: Custom DNS server + /etc/resolv.conf manipulation
Actions:
- Error: Return DNS resolution errors
- Random: Return random IP addresses
Stress Chaos
Implementation: pkg/chaosdaemon/stress_server_linux.go
Technology: stress-ng (stress test tool)
Actions:
- CPU stress with configurable workers and load percentage
- Memory stress with configurable size and workers
Time Chaos
Implementation: pkg/chaosdaemon/time_server_linux.go
Technology: Clock offset via vDSO manipulation
Actions: Offset system clocks (CLOCK_REALTIME, CLOCK_MONOTONIC, etc.)
How it works:
- Inject library into target container
- Intercept clock_gettime calls
- Add offset to returned time values
JVM Chaos
Implementation: pkg/chaosdaemon/jvm_server.go
Technology: Byteman (JVM bytecode manipulation)
Actions:
- Latency: Add delay to method invocations
- Return: Override return values
- Exception: Throw exceptions
- Stress: CPU/Memory stress within JVM
- GC: Trigger garbage collection
- MySQL: Inject faults into MySQL JDBC operations
How it works:
- Attach Byteman agent to target JVM process
- Inject Byteman rules into target classes/methods
- Rules execute at specified injection points
Kernel Chaos
Implementation: Leverages BPF (via bpfki library)
Actions: Inject faults into kernel functions
- slab allocation failures
- page allocation failures
- bio (block I/O) failures
Parameters:
- FailType: What to fail (0=slab, 1=alloc_page, 2=bio)
- Callchain: Specific call chain to target
- Probability: Percentage of calls to fail
- Times: Maximum number of failures
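The interaction between FailType, Probability, and Times can be sketched as follows. The numeric FailType values come from the list above; the kernelFault type, the shouldFail method, and the choice to treat Times = 0 as unlimited are assumptions made for illustration.

```go
package main

import "fmt"

type failType int

const (
	failSlab      failType = 0 // slab allocation failures
	failAllocPage failType = 1 // page allocation failures
	failBio       failType = 2 // block I/O failures
)

func (f failType) String() string {
	switch f {
	case failSlab:
		return "slab"
	case failAllocPage:
		return "alloc_page"
	case failBio:
		return "bio"
	}
	return fmt.Sprintf("unknown(%d)", int(f))
}

// kernelFault is a hypothetical in-memory model of one fault definition.
type kernelFault struct {
	Type        failType
	Probability int // percentage of calls to fail, 0-100
	Times       int // maximum number of failures; 0 means unlimited (assumption)
	injected    int
}

// shouldFail decides whether the current call fails. roll is a value in 0..99.
func (k *kernelFault) shouldFail(roll int) bool {
	if k.Times > 0 && k.injected >= k.Times {
		return false // budget exhausted
	}
	if roll >= k.Probability {
		return false
	}
	k.injected++
	return true
}

func main() {
	f := &kernelFault{Type: failBio, Probability: 100, Times: 2}
	for i := 0; i < 3; i++ {
		fmt.Println(f.Type, "call", i, "fails:", f.shouldFail(0))
	}
}
```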
Block Chaos
Implementation: pkg/chaosdaemon/blockchaos_server_linux.go
Technology: Block device I/O interception
Actions:
- Delay: Add latency to block I/O operations
Container Runtime Support
Location: pkg/chaosdaemon/crclients/
Chaos Daemon abstracts container runtime operations through a common interface:
- Docker: pkg/chaosdaemon/crclients/docker/client.go
- containerd: pkg/chaosdaemon/crclients/containerd/client.go
- CRI-O: pkg/chaosdaemon/crclients/crio/client.go
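One way to picture the common interface is a lookup keyed by the runtime prefix of a Kubernetes container ID (for example containerd://<id>). The sketch below is not the crclients API: runtimeClient, namedClient, and clientFor are hypothetical, and the interface is reduced to a single method.

```go
package main

import (
	"fmt"
	"strings"
)

// runtimeClient is a hypothetical stand-in for the common runtime interface.
type runtimeClient interface {
	Runtime() string
}

// namedClient is a placeholder implementation used only in this sketch.
type namedClient string

func (c namedClient) Runtime() string { return string(c) }

var clients = map[string]runtimeClient{
	"docker":     namedClient("docker"),
	"containerd": namedClient("containerd"),
	"cri-o":      namedClient("cri-o"),
}

// clientFor splits "<runtime>://<id>" and returns the matching client and bare ID.
func clientFor(containerID string) (runtimeClient, string, error) {
	runtime, id, ok := strings.Cut(containerID, "://")
	if !ok || id == "" {
		return nil, "", fmt.Errorf("malformed container ID %q", containerID)
	}
	c, found := clients[runtime]
	if !found {
		return nil, "", fmt.Errorf("unsupported runtime %q", runtime)
	}
	return c, id, nil
}

func main() {
	c, id, err := clientFor("containerd://0123abcd")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.Runtime(), id)
}
```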
Task Management
Location: pkg/chaosdaemon/tasks/
Manages ongoing chaos tasks with proper lifecycle:
- TaskManager: Tracks active chaos injections (task_manager.go)
- ProcessGroupHandler: Manages stress processes (process_group_handler.go)
- PodContainerHandler: Manages container-level tasks (pod_container_handler.go)
Helper Utilities
Location: pkg/chaosdaemon/util/, pkg/chaosdaemon/helper/
- Container inspection and manipulation
- Namespace operations
- Process management
- File system operations
Chaos Dashboard
The Chaos Dashboard provides a web-based UI for designing, managing, and monitoring chaos experiments.
Responsibilities
Experiment Design
Visual interface for creating chaos experiments without YAML
Workflow Builder
Design complex multi-step chaos scenarios
Monitoring
Real-time status and metrics visualization
Access Control
RBAC integration for secure multi-user environments
Architecture
Frontend: ui/
- Technology: React with TypeScript
- Build: pnpm-based build system
- Components: Material-UI based components
Backend: cmd/chaos-dashboard/ and pkg/dashboard/
- Technology: Go HTTP server
- API: RESTful + WebSocket for real-time updates
- Authentication: Kubernetes RBAC integration
Key Features
Visual Experiment Creation
The dashboard provides forms for each chaos type with:
- Selector configuration
- Action-specific parameters
- Duration and scheduling
- Preview of generated YAML
Workflow Editor
Drag-and-drop interface for creating multi-step workflows:
- Serial and parallel execution
- Conditional execution
- Status checks between steps
- Template support
Monitoring Dashboard
- Experiment list with status filtering
- Real-time event stream
- Metrics and graphs
- Detailed experiment history
RBAC Integration
Respects Kubernetes RBAC for:
- Namespace isolation
- Resource permissions
- Role-based experiment management
API Endpoints
Experiments:
- GET /api/experiments - List experiments
- POST /api/experiments - Create experiment
- GET /api/experiments/{uid} - Get experiment details
- DELETE /api/experiments/{uid} - Delete experiment
- PUT /api/experiments/{uid} - Update experiment
Workflows:
- GET /api/workflows - List workflows
- POST /api/workflows - Create workflow
Events:
- GET /api/events - Get experiment events
- WebSocket /api/events/ws - Real-time event stream
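A client for the experiment endpoints could start from a request builder like the sketch below. The paths come from the list above; the newExperimentRequest helper and the base URL http://localhost:2333 are assumptions, and authentication is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newExperimentRequest builds a request against /api/experiments.
// An empty uid targets the collection; otherwise a single experiment.
func newExperimentRequest(baseURL, method, uid string) (*http.Request, error) {
	target := strings.TrimRight(baseURL, "/") + "/api/experiments"
	if uid != "" {
		target += "/" + url.PathEscape(uid)
	}
	return http.NewRequest(method, target, nil)
}

func main() {
	// Assumed dashboard address; adjust to wherever the dashboard is exposed.
	base := "http://localhost:2333"

	list, err := newExperimentRequest(base, http.MethodGet, "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(list.Method, list.URL)

	del, err := newExperimentRequest(base, http.MethodDelete, "example-uid")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(del.Method, del.URL)
}
```

The returned request can be passed to http.DefaultClient.Do once any required token has been added to its headers.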
Development
Frontend Development:
Component Interactions
Experiment Execution Flow
Cross-Component Communication
Dashboard → API Server:
- Protocol: REST/HTTPS
- Authentication: Kubernetes RBAC tokens
- Operations: CRUD on CRDs
Controller Manager → API Server:
- Protocol: Kubernetes client-go
- Operations: Watch, Update CRDs
- Leader Election: For HA deployments
Controller Manager → Chaos Daemon:
- Protocol: gRPC over TLS
- Operations: Fault injection commands
- Connection: Per-node daemon discovery
Chaos Daemon → Container Runtime:
- Protocol: Runtime-specific (Docker API, CRI, etc.)
- Operations: Container inspection, kill, stats
Deployment Architecture
Standard Deployment
Multi-Cluster Support
Chaos Mesh supports multi-cluster chaos experiments.
Remote Cluster CRDs: api/v1alpha1/remote_cluster_types.go
Experiments can target pods in remote clusters by specifying the remoteCluster field.
High Availability
Controller Manager:
- Leader election ensures only one active controller
- Multiple replicas for resilience
- Graceful leadership transition
Chaos Daemon:
- DaemonSet ensures coverage of all nodes
- Node failure automatically handled by Kubernetes
Dashboard:
- Stateless design allows horizontal scaling
- Session state stored in Kubernetes