Overview

Chaos Mesh exposes Prometheus metrics for monitoring chaos experiments, workflows, schedules, and system health. These metrics enable observability, alerting, and performance analysis of your chaos engineering practice.

Metrics Components

Chaos Mesh exposes metrics from three main components:
  1. Chaos Controller Manager - Experiment and orchestration metrics
  2. Chaos Dashboard - API and UI request metrics
  3. Chaos Daemon - Agent-level injection metrics

Controller Manager Metrics

The controller manager exposes metrics on port 10080 at the /metrics endpoint.

Experiment Metrics

chaos_controller_manager_chaos_experiments
gauge
Number of chaos experiments, broken down by namespace, kind, and current phase.
Labels:
  • namespace: Experiment namespace
  • kind: Chaos type (PodChaos, NetworkChaos, etc.)
  • phase: Current phase (Running, Paused, Finished, Failed)
Example:
chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3
chaos_controller_manager_chaos_experiments{namespace="production",kind="NetworkChaos",phase="Finished"} 15
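To make the label structure concrete, the following minimal sketch parses sample lines like the ones above and sums the running experiments per kind. It assumes the plain Prometheus text exposition format; the helper names `parse_samples` and `running_by_kind` are hypothetical and not part of Chaos Mesh.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def running_by_kind(text):
    """Sum experiments in the Running phase per chaos kind."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if (name == "chaos_controller_manager_chaos_experiments"
                and labels.get("phase") == "Running"):
            totals[labels.get("kind", "")] += value
    return dict(totals)


sample = '''
chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3
chaos_controller_manager_chaos_experiments{namespace="production",kind="NetworkChaos",phase="Finished"} 15
'''
print(running_by_kind(sample))
```

In practice Prometheus does this aggregation for you; the sketch only illustrates how the `kind` and `phase` labels partition the gauge.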

Workflow Metrics

chaos_controller_manager_chaos_workflows
gauge
Total number of workflows by namespace.
Labels:
  • namespace: Workflow namespace
Example:
chaos_controller_manager_chaos_workflows{namespace="default"} 5
chaos_controller_manager_chaos_workflows{namespace="staging"} 12

Schedule Metrics

chaos_controller_manager_chaos_schedules
gauge
Total number of active schedules by namespace.
Labels:
  • namespace: Schedule namespace
Example:
chaos_controller_manager_chaos_schedules{namespace="default"} 8
chaos_controller_manager_chaos_schedules{namespace="production"} 20

Sidecar Injection Metrics

chaos_mesh_injections_total
counter
Total number of sidecar injections performed via the webhook.
Labels:
  • namespace: Target pod namespace
  • config: Injection configuration name
Example:
chaos_mesh_injections_total{namespace="default",config="default"} 156
chaos_controller_manager_chaos_mesh_inject_required_total
counter
Total number of injection requests received.
Labels:
  • namespace: Target namespace
  • config: Configuration name

Template Metrics

chaos_controller_manager_chaos_mesh_templates
gauge
Total number of injection templates configured.
chaos_controller_manager_chaos_mesh_config_templates
gauge
Total number of configuration templates.
Labels:
  • namespace: Template namespace
  • template: Template name
chaos_controller_manager_chaos_mesh_injection_configs
gauge
Total number of active injection configs.
Labels:
  • namespace: Config namespace
  • template: Associated template

Error Metrics

chaos_controller_manager_chaos_mesh_template_not_exist_total
counter
Total number of template-not-found errors.
Labels:
  • namespace: Request namespace
  • template: Missing template name
chaos_controller_manager_chaos_mesh_template_load_failed_total
counter
Total template rendering failures.
chaos_controller_manager_chaos_mesh_config_name_duplicate_total
counter
Total configuration name duplication errors.
Labels:
  • namespace: Config namespace
  • config: Duplicate config name

Event Metrics

chaos_controller_manager_emitted_event_total
counter
Total number of Kubernetes events emitted by the controller.
Labels:
  • type: Event type (Normal, Warning)
  • reason: Event reason code
  • namespace: Event namespace
Example:
chaos_controller_manager_emitted_event_total{type="Normal",reason="ChaosCRCreated",namespace="default"} 45
chaos_controller_manager_emitted_event_total{type="Warning",reason="ChaosCRCreateFailed",namespace="default"} 2

Dashboard Metrics

The dashboard exposes metrics on the same port as the API (default 2333) at /metrics.
chaos_dashboard_http_request_duration_seconds
histogram
HTTP request latency histogram for dashboard API endpoints.
Labels:
  • path: API endpoint path
  • method: HTTP method (GET, POST, PUT, DELETE)
  • status: HTTP status code
Buckets: Standard Prometheus histogram buckets
Example:
chaos_dashboard_http_request_duration_seconds_bucket{path="/api/experiments",method="GET",status="200",le="0.1"} 1543
chaos_dashboard_http_request_duration_seconds_bucket{path="/api/workflows",method="POST",status="201",le="0.5"} 89
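Because the `_bucket` series are cumulative, a latency percentile has to be estimated by interpolating between bucket bounds. The sketch below is a simplified version of that interpolation, not the exact PromQL implementation, and the bucket counts in the usage line are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified form of the linear interpolation that PromQL's
    histogram_quantile() applies; edge cases are handled more loosely here.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError('buckets must include an le="+Inf" entry')
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Nothing to interpolate: report the highest known bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for le="0.1", "0.5", "1" and "+Inf".
p95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)])
print(round(p95, 3))
```

The estimate can never be more precise than the bucket layout, so a p95 that lands in a wide bucket should be read as a range rather than an exact value.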

Scraping Configuration

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-controller-manager
  namespace: chaos-mesh
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Prometheus Configuration

For standalone Prometheus without the operator:
scrape_configs:
  - job_name: 'chaos-controller-manager'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - chaos-mesh
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
        regex: controller-manager
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
  
  - job_name: 'chaos-dashboard'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - chaos-mesh
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
        regex: chaos-dashboard
        action: keep

Example Queries

Experiment Monitoring

sum by (kind) (chaos_controller_manager_chaos_experiments{phase="Running"})

Workflow Monitoring

sum(chaos_controller_manager_chaos_workflows)

Schedule Monitoring

sum(chaos_controller_manager_chaos_schedules)

Dashboard Performance

histogram_quantile(0.95, 
  rate(chaos_dashboard_http_request_duration_seconds_bucket[5m])
)

Injection Monitoring

rate(chaos_mesh_injections_total[5m])
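As an illustration of what `rate()` reports for a counter such as `chaos_mesh_injections_total`, here is a simplified sketch of a per-second rate with counter-reset handling. It deliberately omits the window-boundary extrapolation that PromQL performs, and the function name `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_seconds, value) counter samples.

    Handles counter resets, but omits the boundary extrapolation that
    PromQL's rate() performs, so results can differ slightly.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero, so count the full new value.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes taken 60 seconds apart.
print(simple_rate([(0, 150), (60, 153), (120, 156)]))
```

The reset handling is why `rate()` is only meaningful on counters: applied to a gauge, every decrease would be misread as a restart.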

Grafana Dashboard

Create comprehensive Grafana dashboards to visualize Chaos Mesh metrics:
1. Import Data Source

Add your Prometheus instance as a data source in Grafana.

2. Create Panels

Recommended panels:
  • Active experiments by type (pie chart)
  • Experiment timeline (time series)
  • Workflow execution count (gauge)
  • Schedule execution rate (graph)
  • Dashboard API latency (heatmap)
  • Error rate by type (graph)

3. Set Up Variables

Create dashboard variables:
  • $namespace - Namespace selector
  • $chaos_type - Chaos experiment type
  • $interval - Time interval for rate calculations

Example Panel Queries

sum by (phase, kind) (
  chaos_controller_manager_chaos_experiments{namespace=~"$namespace"}
)

Alerting Rules

Define Prometheus alerting rules for chaos operations:
groups:
  - name: chaos-mesh
    interval: 30s
    rules:
      - alert: HighChaosExperimentFailureRate
        expr: |
          sum(chaos_controller_manager_chaos_experiments{phase="Failed"})
          / sum(chaos_controller_manager_chaos_experiments)
          > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High chaos experiment failure rate"
          description: "More than 10% of chaos experiments are in the Failed phase"
      
      - alert: DashboardHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(chaos_dashboard_http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chaos Dashboard high API latency"
          description: "95th percentile latency is above 1 second"
      
      - alert: InjectionErrors
        expr: |
          rate(chaos_controller_manager_chaos_mesh_template_not_exist_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chaos injection template errors"
          description: "Template not found errors detected"
      
      - alert: NoScheduleExecutions
        expr: |
          sum(chaos_controller_manager_chaos_schedules) > 0
          and sum(rate(chaos_controller_manager_emitted_event_total{reason="ScheduleCreated"}[30m])) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Schedules not executing"
          description: "Active schedules exist but no executions detected"

Best Practices

Recommended scrape intervals:
  • Controller metrics: 30-60 seconds
  • Dashboard metrics: 15-30 seconds
  • Balance observability with cardinality
Pre-compute frequently used queries with recording rules:
- record: chaos:experiments:running:total
  expr: sum(chaos_controller_manager_chaos_experiments{phase="Running"})
Track unique label combinations to prevent cardinality explosion, especially with namespace and kind labels.
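One way to track label combinations is to count the distinct series per metric in a scrape. The sketch below assumes the text exposition format without trailing timestamps, which is typical for a `/metrics` endpoint; the function name `series_per_metric` is hypothetical.

```python
from collections import Counter


def series_per_metric(text):
    """Count distinct label combinations (series) per metric name."""
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Everything before the last space is the series identity (name plus labels).
        # Assumes sample lines carry no trailing timestamp.
        series, _, _value = line.rpartition(" ")
        if series:
            seen.add(series)
    return Counter(series.split("{", 1)[0] for series in seen)


scrape = '''
chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3
chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Finished"} 7
chaos_controller_manager_chaos_experiments{namespace="production",kind="NetworkChaos",phase="Finished"} 15
chaos_controller_manager_chaos_workflows{namespace="default"} 5
'''
for name, count in series_per_metric(scrape).most_common():
    print(name, count)
```

Comparing these counts between scrapes shows which metric is growing as namespaces and chaos kinds are added.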
Combine Chaos Mesh metrics with application performance metrics to measure experiment impact.

Troubleshooting

Check metrics endpoint:
kubectl port-forward -n chaos-mesh svc/chaos-controller-manager 10080:10080
curl http://localhost:10080/metrics
Verify ServiceMonitor:
kubectl get servicemonitor -n chaos-mesh
kubectl describe servicemonitor chaos-controller-manager -n chaos-mesh
Some metrics may be missing if no resources of that type exist. Create a test experiment to populate metrics.
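To see which metrics are absent from a scrape, a small check along these lines can help. It is a sketch that assumes the port-forward above is active when `fetch_metrics` is called; the function names and the `EXPECTED` list are illustrative and use only metric names from this page.

```python
from urllib.request import urlopen

# Metric names taken from this page; adjust the list to what you need.
EXPECTED = [
    "chaos_controller_manager_chaos_experiments",
    "chaos_controller_manager_chaos_workflows",
    "chaos_controller_manager_chaos_schedules",
]


def fetch_metrics(url="http://localhost:10080/metrics", timeout=5):
    """Download the raw metrics text (requires the port-forward to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def missing_metrics(text, expected=EXPECTED):
    """Return the expected metric names that have no sample in the scrape output."""
    present = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or space.
        present.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(set(expected) - present)


sample_scrape = 'chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3\n'
print(missing_metrics(sample_scrape))
```

Against a live controller, call `missing_metrics(fetch_metrics())`; an empty list means every expected metric has at least one sample.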
If you run into high-cardinality issues, consider:
  • Reducing namespace diversity
  • Aggregating less frequently used labels
  • Using recording rules

Next Steps

  • Dashboard: Visualize metrics through the Chaos Dashboard
  • Workflows: Monitor complex workflow executions
  • Scheduling: Track scheduled experiment metrics
  • Status Checks: Validate experiments with health checks