KernelChaos allows you to inject faults at the Linux kernel level using BPF (Berkeley Packet Filter) to simulate low-level system failures such as memory allocation failures, page allocation failures, and bio (block I/O) failures.
Supported Fault Types
KernelChaos supports three types of kernel-level fault injection:
- slab allocation failure (should_failslab) - failtype 0
- page allocation failure (should_fail_alloc_page) - failtype 1
- bio (block I/O) failure (should_fail_bio) - failtype 2
Configuration
Basic Example
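A minimal manifest sketch, following the commonly documented pattern of failing slab allocations on the mount system call path. The resource name, the chaos-mesh namespace, and the chaos-mount target namespace are illustrative assumptions, and __x64_sys_mount applies to x86-64 kernels only:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: kernel-chaos-example   # hypothetical name
  namespace: chaos-mesh        # assumed namespace of the Chaos Mesh install
spec:
  mode: one                    # pick a single pod from the selector
  selector:
    namespaces:
      - chaos-mount            # hypothetical target namespace
  failKernRequest:
    failtype: 0                # 0 = slab allocation failure (should_failslab)
    callchain:
      - funcname: '__x64_sys_mount'   # x86-64 symbol; differs on other architectures
```

Applied with kubectl, this would be expected to make slab allocations fail only when they are reached through the mount system call in the selected pod.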
Spec Fields
- selector: Specifies the target pods for the chaos experiment. See the PodChaos documentation for selector details.
- mode: Selection mode: one, all, fixed, fixed-percent, or random-max-percent.
- failKernRequest: Defines the kernel fault injection configuration.
- duration: Duration of the chaos action. Format: “300ms”, “1.5h”, “2h45m”.
- containerNames: List of container names to inject chaos into. If not set, chaos is injected at pod level.
- remoteCluster: Remote cluster where the chaos will be deployed.
Examples
Slab Allocation Failure
Inject kmalloc failures with 10% probability.
Page Allocation Failure
Inject page allocation failures.
Bio Failure
Inject block I/O failures.
Targeted Call Chain
Inject failures only in a specific call chain.
Use Cases
Testing Memory Allocation Failures
Use failtype: 0 (slab) or failtype: 1 (page) to test how applications handle memory allocation failures, which can occur under memory pressure.
Storage Layer Testing
Use failtype: 2 (bio) to test application resilience to block I/O failures, simulating disk errors or storage subsystem issues.
Resource Exhaustion Scenarios
Simulate kernel-level resource exhaustion that may occur in production during high load or memory pressure.
Low-Level Error Path Testing
Test error handling in kernel code paths that are difficult to trigger through normal application behavior.
Best Practices
- Understand Kernel Internals: KernelChaos requires knowledge of Linux kernel internals and function call paths
- Start with Low Probability: Begin with 1-5% probability to avoid overwhelming the system
- Use Times Limit: Set times to limit the total number of injected faults and prevent cascading failures
- Reference Documentation:
- Test in Isolation: Start with mode: one and test on non-critical pods first
- Monitor System Stability: Watch for:
  - OOM killer activations
  - Kernel panics or warnings
  - Application crashes
  - System performance degradation
- Use Call Chains: Specify call chains to target specific code paths and avoid system-wide impact
- Headers Selection: Include appropriate kernel headers based on the failtype:
  - failtype 0: linux/slab.h, linux/mm.h
  - failtype 1: linux/mmzone.h, linux/gfp.h
  - failtype 2: linux/blkdev.h, linux/bio.h
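The probability, times, and mode recommendations above might combine as in the following sketch. The metadata, label selector, and specific values are hypothetical, and probability is assumed to be expressed as a percentage (1 meaning 1%):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: conservative-slab-failure   # hypothetical name
  namespace: chaos-mesh             # assumed namespace
spec:
  mode: one                         # test in isolation: a single pod
  selector:
    labelSelectors:
      app: non-critical-demo        # hypothetical label on a non-critical workload
  failKernRequest:
    failtype: 0                     # slab allocation failure
    headers:                        # headers suggested above for failtype 0
      - linux/slab.h
      - linux/mm.h
    probability: 1                  # start low (assumed to mean 1%)
    times: 10                       # cap on the total number of injected faults
  duration: '30s'                   # keep the first run short
```

Raising probability or times in small steps while watching for OOM kills and kernel warnings keeps each run easy to attribute to a single change.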
Understanding Failtype
Failtype 0 - should_failslab
Causes kmalloc, kmem_cache_alloc, and related slab allocations to fail. This simulates memory allocation failures at the slab allocator level. Common in memory-constrained scenarios.
Failtype 1 - should_fail_alloc_page
Causes page allocations to fail. This affects larger memory allocations and can simulate severe memory pressure scenarios.
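As a rough sketch, a page allocation fault could be configured as below. The names, labels, and values are illustrative assumptions rather than recommended settings:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: page-alloc-failure     # hypothetical name
  namespace: chaos-mesh        # assumed namespace
spec:
  mode: one
  selector:
    labelSelectors:
      app: memory-test         # hypothetical label
  failKernRequest:
    failtype: 1                # 1 = page allocation failure (should_fail_alloc_page)
    headers:
      - linux/mmzone.h
      - linux/gfp.h
    probability: 5             # assumed to mean 5%
    times: 20                  # stop after 20 injected faults
  duration: '1m'
```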
Failtype 2 - should_fail_bio
Causes block I/O operations to fail. This simulates storage device failures, I/O errors, or storage subsystem issues.
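A comparable sketch for block I/O faults follows. Only failtype and the headers differ from the memory cases, and the names, labels, and values are again hypothetical:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: bio-failure            # hypothetical name
  namespace: chaos-mesh        # assumed namespace
spec:
  mode: one
  selector:
    labelSelectors:
      app: storage-test        # hypothetical label
  failKernRequest:
    failtype: 2                # 2 = bio failure (should_fail_bio)
    headers:
      - linux/blkdev.h
      - linux/bio.h
    probability: 5             # assumed to mean 5%
    times: 20                  # stop after 20 injected faults
  duration: '1m'
```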
Call Chain Predicates
Predicates allow fine-grained control over when faults are injected.
Notes
- KernelChaos uses eBPF (extended Berkeley Packet Filter) to inject faults at kernel level
- Requires kernel support for fault injection and BPF
- The chaos-mesh/bpfki project provides the BPF kernel injection infrastructure
- Kernel functions and their parameters can be found using:
  - /proc/kallsyms for function addresses
  - Kernel source code for function signatures
  - bpftrace or bcc tools for function tracing
- Call chains must match the actual kernel call path for the fault to be injected
- Predicates use C-style syntax and have access to function parameters
- This is an advanced chaos type - improper use can cause system instability
- Always test in non-production environments first
- Some kernel versions may not support all failtype options
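The notes on call chains and predicates can be illustrated with a single-frame sketch. It assumes each call chain frame accepts funcname, parameters, and predicate fields, that a STRNCMP helper is available inside predicates (as in the bpfki examples), and that d_alloc_parallel has the signature shown on your kernel. Verify all three against your kernel source before use; the resource names and the file name are hypothetical:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: predicate-example      # hypothetical name
  namespace: chaos-mesh        # assumed namespace
spec:
  mode: one
  selector:
    labelSelectors:
      app: fs-test             # hypothetical label
  failKernRequest:
    failtype: 0                # slab allocation failure
    callchain:
      - funcname: d_alloc_parallel
        # Parameter list must match the kernel's actual function signature (assumed here)
        parameters: 'struct dentry *parent, const struct qstr *name'
        # C-style expression over the parameters; STRNCMP is an assumed helper
        predicate: 'STRNCMP(name->name, "bananas", 8)'
    probability: 1             # assumed to mean 1%
    times: 5                   # stop after 5 injected faults
```

Leaving predicate unset would be expected to inject faults on every path that reaches the named function, so a predicate is the narrower and safer starting point.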