Overview

StatusCheck provides automated health validation during chaos experiments. It enables you to verify that your system remains functional or meets specific criteria while chaos is being injected, making experiments safer and more informative.

Use Cases

Experiment Validation

Verify application endpoints remain responsive during chaos injection

SLA Monitoring

Ensure service level objectives are maintained under fault conditions

Continuous Health Check

Monitor system health throughout workflow execution

Failure Detection

Automatically detect when experiments cause unacceptable degradation

StatusCheck Types

HTTP Status Check

Chaos Mesh currently supports only HTTP-based status checks, which validate endpoint availability and response codes. A complete manifest looks like this:
apiVersion: chaos-mesh.org/v1alpha1
kind: StatusCheck
metadata:
  name: api-health-check
spec:
  mode: Continuous
  type: HTTP
  duration: "5m"
  timeoutSeconds: 5
  intervalSeconds: 10
  failureThreshold: 3
  successThreshold: 1
  http:
    url: http://my-service.default.svc.cluster.local:8080/health
    method: GET
    criteria:
      statusCode: "200"

Execution Modes

mode
StatusCheckMode
default:"Synchronous"
Defines how the status check executes.
  • Synchronous: Exits immediately after success or failure threshold is reached.
  • Continuous: Continues checking until duration expires or failure threshold is exceeded.

Synchronous Mode

Best for validating a specific condition before proceeding:
spec:
  mode: Synchronous
  type: HTTP
  duration: "1m"
  successThreshold: 3  # Exit after 3 consecutive successes
  failureThreshold: 5  # Fail after 5 consecutive failures
  http:
    url: http://app/ready
    criteria:
      statusCode: "200"

Continuous Mode

Best for ongoing monitoring during experiments:
spec:
  mode: Continuous
  type: HTTP
  duration: "10m"
  failureThreshold: 3  # Abort if 3 consecutive failures
  http:
    url: http://app/health
    criteria:
      statusCode: "200-299"  # Accept any 2xx status
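
The difference between the two modes can be illustrated with a small simulation. This is a minimal sketch, not the Chaos Mesh controller logic: `run_check` and its return labels are hypothetical names, the end of the input sequence stands in for the duration expiring, and a threshold is assumed to trip as soon as the consecutive count equals it (as the inline comments in the examples above suggest).

```python
from typing import Iterable


def run_check(results: Iterable[bool], mode: str,
              failure_threshold: int = 3, success_threshold: int = 1) -> str:
    """Simulate one status check over a sequence of probe outcomes.

    Illustrative only. `results` holds one boolean per probe; running out
    of results models the configured duration expiring.
    """
    consecutive_failures = 0
    consecutive_successes = 0
    for ok in results:
        if ok:
            consecutive_successes += 1
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            consecutive_successes = 0
        # Both modes stop once enough consecutive probes have failed.
        if consecutive_failures >= failure_threshold:
            return "failed"
        # Only Synchronous mode stops early on success.
        if mode == "Synchronous" and consecutive_successes >= success_threshold:
            return "succeeded"
    return "duration-expired"
```

With the same probe outcomes, a Synchronous check returns at the first qualifying success, while a Continuous check keeps probing and only stops early on failure.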

Configuration Parameters

type
StatusCheckType
default:"HTTP"
required
Type of status check to perform. Currently only HTTP is supported.
duration
string
Maximum duration for the status check execution.
Format: Duration string (e.g., “5m”, “30s”, “1h30m”)
  • Synchronous: Maximum time to wait for success threshold
  • Continuous: Total monitoring duration
timeoutSeconds
int
default:"1"
Timeout in seconds for each individual status check execution. Must be ≥ 1.
intervalSeconds
int
default:"10"
Interval in seconds between status check executions. Must be ≥ 1.
failureThreshold
int
default:"3"
Minimum consecutive failures before status check is considered failed. Must be ≥ 1. When exceeded, the status check terminates with failure status.
successThreshold
int
default:"1"
Minimum consecutive successes before status check is considered successful. Must be ≥ 1. Only applies to Synchronous mode. When exceeded, the check terminates with success status.
recordsHistoryLimit
int
default:"100"
Number of status check execution records to retain. Range: 1-1000. Controls memory usage and history depth.
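
The documented defaults and bounds can be checked before applying a manifest. The following is a rough sketch under stated assumptions: `validate_status_check` is a hypothetical helper (not part of Chaos Mesh), it takes the `spec` section as an already-parsed mapping, and it only encodes the constraints listed on this page. The real admission validation may be stricter or differ.

```python
def validate_status_check(spec: dict) -> list[str]:
    """Return human-readable problems found in a StatusCheck spec mapping."""
    problems = []
    # Integer fields documented as "Must be >= 1", with their documented defaults.
    minimum_one = {
        "timeoutSeconds": 1,
        "intervalSeconds": 10,
        "failureThreshold": 3,
        "successThreshold": 1,
    }
    for field, default in minimum_one.items():
        if spec.get(field, default) < 1:
            problems.append(f"{field} must be >= 1")
    # recordsHistoryLimit is documented with a range of 1-1000 (default 100).
    if not 1 <= spec.get("recordsHistoryLimit", 100) <= 1000:
        problems.append("recordsHistoryLimit must be between 1 and 1000")
    if spec.get("mode", "Synchronous") not in ("Synchronous", "Continuous"):
        problems.append("mode must be Synchronous or Continuous")
    if spec.get("type", "HTTP") != "HTTP":
        problems.append("type must be HTTP")
    # Fields marked as required for HTTP checks.
    http = spec.get("http") or {}
    if not http.get("url"):
        problems.append("http.url is required")
    if not (http.get("criteria") or {}).get("statusCode"):
        problems.append("http.criteria.statusCode is required")
    return problems
```

An empty list means the spec satisfies every constraint described above; it does not guarantee that the cluster will accept the resource.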

HTTP Status Check Configuration

http.url
string
required
Full URL to check, including protocol and path.
Examples:
  • http://my-service:8080/health
  • https://api.example.com/status
  • http://10.0.0.1:3000/ready
http.method
string
default:"GET"
HTTP method to use. Supported values: GET, POST
http.headers
map[string][]string
HTTP headers to include in the request.
Example:
headers:
  Authorization: ["Bearer token123"]
  Content-Type: ["application/json"]
http.body
string
Request body for POST requests.
Example:
body: '{"check": "health"}'
http.criteria.statusCode
string
required
Expected HTTP status code(s).
Formats:
  • Single code: "200"
  • Range (inclusive): "200-299"
  • Multiple codes: Use multiple checks or ranges
Examples:
  • "200" - Exact match
  • "200-204" - Range including both endpoints
  • "2xx" - Not supported, use "200-299"
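
To make the two accepted formats concrete, here is a minimal sketch of how a criteria string could be evaluated against a response code. `matches_status_code` is a hypothetical function written for illustration; it mirrors the rules listed above (single code, inclusive range, no "2xx" wildcard) and is not taken from the Chaos Mesh source.

```python
def matches_status_code(criteria: str, status: int) -> bool:
    """Return True if `status` satisfies a criteria string like "200" or "200-299"."""
    low, sep, high = criteria.partition("-")
    # Reject anything that is not a plain code or a "low-high" pair, e.g. "2xx".
    if not low.isdigit() or (sep and not high.isdigit()):
        raise ValueError(f"unsupported statusCode criteria: {criteria!r}")
    if not sep:
        return status == int(low)          # single code: exact match
    return int(low) <= status <= int(high)  # range: both endpoints included
```

For example, `matches_status_code("200-204", 204)` is true because the range is inclusive, while `matches_status_code("2xx", 200)` raises `ValueError`.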

Status Check in Workflows

Status checks are most powerful when integrated into workflows:
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: chaos-with-validation
spec:
  entry: the-entry
  templates:
    - name: the-entry
      templateType: Parallel
      deadline: 240s
      children:
        - continuous-health-check
        - network-chaos-experiment
    
    - name: continuous-health-check
      templateType: StatusCheck
      deadline: 120s
      abortWithStatusCheck: true  # Abort workflow if check fails
      statusCheck:
        mode: Continuous
        type: HTTP
        intervalSeconds: 5
        failureThreshold: 3
        http:
          url: http://my-app.default.svc:8080/health
          method: GET
          criteria:
            statusCode: "200"
    
    - name: network-chaos-experiment
      templateType: NetworkChaos
      deadline: 120s
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            "app": "my-app"
        delay:
          latency: "100ms"
abortWithStatusCheck
bool
default:"false"
When used in a workflow template, determines whether to abort the entire workflow if the status check’s failure threshold is exceeded.
  • true: Workflow is aborted on check failure
  • false: Status check failure is recorded but workflow continues

Status and Conditions

status
StatusCheckStatus
Current state of the status check.

Advanced Examples

POST Request with Headers

apiVersion: chaos-mesh.org/v1alpha1
kind: StatusCheck
metadata:
  name: api-post-check
spec:
  mode: Synchronous
  type: HTTP
  duration: "30s"
  intervalSeconds: 5
  successThreshold: 2
  http:
    url: http://api.default.svc:8080/validate
    method: POST
    headers:
      Content-Type: ["application/json"]
      X-API-Key: ["secret-key-123"]
    body: '{"check": true}'
    criteria:
      statusCode: "200-201"

Status Range Validation

http:
  url: http://service/endpoint
  criteria:
    statusCode: "200-299"  # Any successful 2xx response

High-Frequency Monitoring

spec:
  mode: Continuous
  type: HTTP
  duration: "2m"
  intervalSeconds: 1      # Check every second
  timeoutSeconds: 1       # 1 second timeout
  failureThreshold: 5     # Fail after 5 consecutive failures
  recordsHistoryLimit: 200
  http:
    url: http://critical-service/health
    criteria:
      statusCode: "200"
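
The settings in this example interact, and a little arithmetic shows how. The sketch below is a back-of-the-envelope estimate under stated assumptions: probes are assumed to start once every `intervalSeconds`, every failing probe is assumed to take the full `timeoutSeconds`, and `plan` is a hypothetical helper name. Actual scheduling in the controller may differ.

```python
def plan(duration_s: int, interval_s: int, timeout_s: int, failure_threshold: int) -> dict:
    """Estimate probe count and worst-case failure detection time, in seconds."""
    # Approximate number of probes that fit into the monitoring window.
    probes = duration_s // interval_s
    # Time from the start of the first failing probe until the last of
    # `failure_threshold` consecutive probes has timed out.
    worst_case_detection_s = (failure_threshold - 1) * interval_s + timeout_s
    return {"probes": probes, "worst_case_detection_s": worst_case_detection_s}


# Values from the high-frequency example: 2m duration, 1s interval, 1s timeout, threshold 5.
high_frequency = plan(duration_s=120, interval_s=1, timeout_s=1, failure_threshold=5)
```

Under these assumptions the example runs roughly 120 probes, which is more than the default `recordsHistoryLimit` of 100 but fits within the 200 configured above, and a sustained outage would be flagged after about 5 seconds.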

Monitoring Status Checks

View Status Check Details

kubectl get statuscheck my-check -o yaml

Check Conditions

kubectl get statuscheck my-check -o jsonpath='{.status.conditions}' | jq
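
The JSON printed by the command above can also be summarized in a script. This is a small sketch that assumes each condition is an object with `type` and `status` string fields, following the common Kubernetes condition shape; that shape, the helper name `true_conditions`, and the sample payload are assumptions for illustration and are not copied from a real cluster.

```python
import json


def true_conditions(conditions_json: str) -> list[str]:
    """Return the `type` of every condition whose `status` is "True"."""
    # jsonpath prints nothing when the field is unset, so treat that as empty.
    if not conditions_json.strip():
        return []
    conditions = json.loads(conditions_json) or []
    return [c["type"] for c in conditions if c.get("status") == "True"]


# Hypothetical sample payload; the condition type names are placeholders.
sample = '[{"type": "Completed", "status": "True"}, {"type": "FailureThresholdExceed", "status": "False"}]'
active = true_conditions(sample)
```

Compare the field names against the output of `kubectl get statuscheck my-check -o yaml` before relying on this in automation.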

Best Practices

Configure failure thresholds based on your application’s expected behavior:
  • Transient failures: Higher threshold (e.g., 5-10)
  • Critical endpoints: Lower threshold (e.g., 2-3)
Choose intervalSeconds to balance responsiveness against overhead:
  • Critical checks: 1-5 seconds
  • Standard checks: 10-30 seconds
  • Long-running experiments: 30-60 seconds
In parallel workflow executions, use Continuous mode to monitor health throughout the entire chaos injection period.
Keep timeoutSeconds no higher than intervalSeconds to prevent overlapping checks.
For long-running or high-frequency checks, adjust recordsHistoryLimit to manage memory usage.

Troubleshooting

Check URL accessibility:
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://your-service/health
Verify timeout settings:
  • Ensure timeoutSeconds is sufficient for response time
  • Check network latency to the target service
Increase timeout if legitimate responses take longer:
timeoutSeconds: 10  # Increase from default 1
Check actual response:
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -w "\nStatus: %{http_code}\n" http://your-service/health
Use ranges for flexibility:
criteria:
  statusCode: "200-299"  # Accept any 2xx
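
When curl is not available, a single probe can be reproduced with the Python standard library. This is a rough sketch, not a reimplementation of the Chaos Mesh HTTP executor: `probe` is a hypothetical helper, its `timeout_seconds` argument only approximates `timeoutSeconds`, and `urllib` follows redirects by default, which may not match how the status check treats them.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout_seconds: float = 1.0, method: str = "GET") -> Optional[int]:
    """Send one request and return the HTTP status code, or None if no response arrived."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions but still carry a status code.
        return error.code
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, or a malformed response.
        return None
```

A return value of `None` points to a connectivity or timeout problem, while an unexpected integer points to a mismatch with `http.criteria.statusCode`.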

Future Enhancements

Status checks may be extended to support:
  • Response body validation
  • Custom command execution
  • Kubernetes resource checks
  • Prometheus query validation
  • gRPC health checks
