Monitoring - Chaos Mesh

Overview

Chaos Mesh exposes Prometheus metrics for monitoring chaos experiments, workflows, schedules, and system health. These metrics enable observability, alerting, and performance analysis of your chaos engineering practice.

Metrics Components

Chaos Mesh exposes metrics from three main components:

Chaos Controller Manager - Experiment and orchestration metrics
Chaos Dashboard - API and UI request metrics
Chaos Daemon - Agent-level injection metrics

Controller Manager Metrics

The controller manager exposes metrics on port 10080 at the /metrics endpoint.

Experiment Metrics

chaos_controller_manager_chaos_experiments

gauge

Number of chaos experiments, broken down by namespace, kind, and current phase.

Labels:

namespace: Experiment namespace
kind: Chaos type (PodChaos, NetworkChaos, etc.)
phase: Current phase (Running, Paused, Finished, Failed)

Example:

chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3
chaos_controller_manager_chaos_experiments{namespace="production",kind="NetworkChaos",phase="Finished"} 15
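To make the label structure concrete, here is a minimal Python sketch that parses sample lines like the ones above and totals experiments per phase. The helper names (`parse_sample`, `experiments_by_phase`) and the simplified regular expressions are illustrative assumptions. They cover only simple lines of this shape, not the full Prometheus text format (no escaping, timestamps, or comment lines).

```python
import re
from collections import defaultdict

# Simplified pattern for lines like: name{k="v",k2="v2"} 3
# Assumption: no escaped quotes, no timestamps, no comment lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one simple sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def experiments_by_phase(text):
    """Sum chaos_controller_manager_chaos_experiments values per phase."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, labels, value = parse_sample(line)
        if name == "chaos_controller_manager_chaos_experiments":
            totals[labels.get("phase", "")] += value
    return dict(totals)


sample = '''
chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3
chaos_controller_manager_chaos_experiments{namespace="production",kind="NetworkChaos",phase="Finished"} 15
'''
print(experiments_by_phase(sample))
```

Because the metric is a gauge, each value is the current count for that label combination, so summing across namespaces and kinds gives a per-phase total at a single point in time.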

Workflow Metrics

chaos_controller_manager_chaos_workflows

gauge

Total number of workflows by namespace.

Labels:

namespace: Workflow namespace

Example:

chaos_controller_manager_chaos_workflows{namespace="default"} 5
chaos_controller_manager_chaos_workflows{namespace="staging"} 12

Schedule Metrics

chaos_controller_manager_chaos_schedules

gauge

Total number of active schedules by namespace.

Labels:

namespace: Schedule namespace

Example:

chaos_controller_manager_chaos_schedules{namespace="default"} 8
chaos_controller_manager_chaos_schedules{namespace="production"} 20

Sidecar Injection Metrics

chaos_mesh_injections_total

counter

Total number of sidecar injections performed via the webhook.

Labels:

namespace: Target pod namespace
config: Injection configuration name

Example:

chaos_mesh_injections_total{namespace="default",config="default"} 156

chaos_controller_manager_chaos_mesh_inject_required_total

counter

Total number of injection requests received.

Labels:

namespace: Target namespace
config: Configuration name

Template Metrics

chaos_controller_manager_chaos_mesh_templates

gauge

Total number of injection templates configured.

chaos_controller_manager_chaos_mesh_config_templates

gauge

Total number of configuration templates.

Labels:

namespace: Template namespace
template: Template name

chaos_controller_manager_chaos_mesh_injection_configs

gauge

Total number of active injection configs.

Labels:

namespace: Config namespace
template: Associated template

Error Metrics

chaos_controller_manager_chaos_mesh_template_not_exist_total

counter

Total number of template-not-found errors.

Labels:

namespace: Request namespace
template: Missing template name

chaos_controller_manager_chaos_mesh_template_load_failed_total

counter

Total template rendering failures.

chaos_controller_manager_chaos_mesh_config_name_duplicate_total

counter

Total number of configuration name duplication errors.

Labels:

namespace: Config namespace
config: Duplicate config name

Event Metrics

chaos_controller_manager_emitted_event_total

counter

Total number of Kubernetes events emitted by the controller.

Labels:

type: Event type (Normal, Warning)
reason: Event reason code
namespace: Event namespace

Example:

chaos_controller_manager_emitted_event_total{type="Normal",reason="ChaosCRCreated",namespace="default"} 45
chaos_controller_manager_emitted_event_total{type="Warning",reason="ChaosCRCreateFailed",namespace="default"} 2

Dashboard Metrics

The dashboard exposes metrics on the same port as the API (default 2333) at /metrics.

chaos_dashboard_http_request_duration_seconds

histogram

HTTP request latency histogram for dashboard API endpoints.

Labels:

path: API endpoint path
method: HTTP method (GET, POST, PUT, DELETE)
status: HTTP status code

Buckets: Standard Prometheus histogram buckets

Example:

chaos_dashboard_http_request_duration_seconds_bucket{path="/api/experiments",method="GET",status="200",le="0.1"} 1543
chaos_dashboard_http_request_duration_seconds_bucket{path="/api/workflows",method="POST",status="201",le="0.5"} 89

Scraping Configuration

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-controller-manager
  namespace: chaos-mesh
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Prometheus Configuration

For standalone Prometheus without the operator:

scrape_configs:
  - job_name: 'chaos-controller-manager'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - chaos-mesh
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
        regex: controller-manager
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
  
  - job_name: 'chaos-dashboard'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - chaos-mesh
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
        regex: chaos-dashboard
        action: keep
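The `keep` rules above act as a filter over discovered targets. The following Python sketch imitates that filter, relying on two relabeling behaviours: source label values are joined with `;`, and the regex must match the whole joined string. The function name `keep_target` and the sample targets are hypothetical, and Python's `re` stands in for the RE2 syntax Prometheus uses (the two agree for simple patterns like these).

```python
import re


def keep_target(labels, rules):
    """Return True if a target survives every `keep` rule.

    Each rule is (source_labels, regex). Source label values are joined
    with ';' and the regex must match the entire joined string. Missing
    labels count as empty strings.
    """
    for source_labels, regex in rules:
        joined = ";".join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, joined) is None:
            return False
    return True


# Rules mirroring the chaos-controller-manager job above.
controller_rules = [
    (["__meta_kubernetes_service_label_app_kubernetes_io_component"], "controller-manager"),
    (["__meta_kubernetes_endpoint_port_name"], "http"),
]

# Hypothetical discovered targets, for illustration only.
targets = [
    {"__meta_kubernetes_service_label_app_kubernetes_io_component": "controller-manager",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_service_label_app_kubernetes_io_component": "controller-manager",
     "__meta_kubernetes_endpoint_port_name": "webhook"},
    {"__meta_kubernetes_service_label_app_kubernetes_io_component": "chaos-dashboard",
     "__meta_kubernetes_endpoint_port_name": "http"},
]

kept = [t for t in targets if keep_target(t, controller_rules)]
print(len(kept))
```

Because the match is anchored, a port named `http-metrics` would not satisfy the `http` rule. If no targets appear for a job, comparing the actual service labels and port names against these patterns is a reasonable first check.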

Example Queries

Experiment Monitoring

sum by (kind) (chaos_controller_manager_chaos_experiments{phase="Running"})

Workflow Monitoring

sum(chaos_controller_manager_chaos_workflows)

Schedule Monitoring

sum(chaos_controller_manager_chaos_schedules)

Dashboard Performance

histogram_quantile(0.95, 
  rate(chaos_dashboard_http_request_duration_seconds_bucket[5m])
)
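To show what the quantile query estimates, the Python sketch below applies the same linear-interpolation idea to a set of cumulative buckets. It is a simplified model, not the Prometheus implementation: the function and the bucket counts are made up for illustration, and edge cases such as NaN values are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes a classic histogram: counts are cumulative and the last
    bucket has an upper bound of +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return upper
    # Interpolate linearly inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for le = 0.1, 0.5, 1, +Inf.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, example_buckets))
```

The result is only as precise as the bucket layout: a quantile that falls into a wide bucket is an interpolated estimate, which matters when choosing alert thresholds close to a bucket boundary.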

Injection Monitoring

rate(chaos_mesh_injections_total[5m])
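As a rough model of what `rate()` reports for a counter such as `chaos_mesh_injections_total`, the Python sketch below computes a per-second increase from raw samples and tolerates counter resets. It deliberately omits the window extrapolation that PromQL performs, so treat `simple_rate` as a hypothetical approximation with made-up sample values.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplification: handles counter resets but skips the window
    extrapolation that PromQL's rate() applies.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += current if current < previous else current - previous
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes 60 seconds apart; the third follows a restart.
scrapes = [(0, 156), (60, 160), (120, 4), (180, 10)]
print(simple_rate(scrapes))
```

The reset handling is why a restart of the controller manager does not show up as a negative injection rate.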

Grafana Dashboard

Create comprehensive Grafana dashboards to visualize Chaos Mesh metrics:

Import Data Source

Add your Prometheus instance as a data source in Grafana.

Create Panels

Recommended panels:

Active experiments by type (pie chart)
Experiment timeline (time series)
Workflow execution count (gauge)
Schedule execution rate (graph)
Dashboard API latency (heatmap)
Error rate by type (graph)

Set Up Variables

Create dashboard variables:

$namespace - Namespace selector
$chaos_type - Chaos experiment type
$interval - Time interval for rate calculations

Example Panel Queries

sum by (phase, kind) (
  chaos_controller_manager_chaos_experiments{namespace=~"$namespace"}
)
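The `=~` matcher treats the variable's value as a regular expression that must match the entire label value. As an illustration, assume Grafana expands a multi-value `$namespace` selection into an alternation such as `default|staging` (the exact expansion depends on the variable's settings). The Python sketch below then filters and aggregates hypothetical series the same way the panel query does.

```python
import re


def select_series(series, namespace_regex):
    """Keep series whose namespace label fully matches the regex.

    PromQL regex matchers are anchored, so re.fullmatch is the closest
    equivalent (PromQL uses RE2 syntax; simple alternations behave the same).
    """
    pattern = re.compile(namespace_regex)
    return [s for s in series if pattern.fullmatch(s["namespace"])]


def sum_by_phase_kind(series):
    """Equivalent of: sum by (phase, kind) (...) over the selected series."""
    totals = {}
    for s in series:
        key = (s["phase"], s["kind"])
        totals[key] = totals.get(key, 0) + s["value"]
    return totals


# Hypothetical samples, for illustration only.
series = [
    {"namespace": "default", "kind": "PodChaos", "phase": "Running", "value": 3},
    {"namespace": "staging", "kind": "PodChaos", "phase": "Running", "value": 2},
    {"namespace": "default-test", "kind": "PodChaos", "phase": "Running", "value": 7},
    {"namespace": "production", "kind": "NetworkChaos", "phase": "Finished", "value": 15},
]

selected = select_series(series, "default|staging")
print(sum_by_phase_kind(selected))
```

Note that `default-test` is excluded: anchoring means a selected namespace never matches other namespaces that merely share its prefix.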

Alerting Rules

Define Prometheus alerting rules for chaos operations:

groups:
  - name: chaos-mesh
    interval: 30s
    rules:
      - alert: HighChaosExperimentFailureRate
        expr: |
          sum(chaos_controller_manager_chaos_experiments{phase="Failed"})
          / sum(chaos_controller_manager_chaos_experiments)
          > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High chaos experiment failure rate"
          description: "More than 10% of chaos experiments are in the Failed phase"
      
      - alert: DashboardHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(chaos_dashboard_http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chaos Dashboard high API latency"
          description: "95th percentile latency is above 1 second"
      
      - alert: InjectionErrors
        expr: |
          rate(chaos_controller_manager_chaos_mesh_template_not_exist_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chaos injection template errors"
          description: "Template not found errors detected"
      
      - alert: NoScheduleExecutions
        expr: |
          sum(chaos_controller_manager_chaos_schedules) > 0
          and sum(rate(chaos_controller_manager_emitted_event_total{reason="ScheduleCreated"}[30m])) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Schedules not executing"
          description: "Active schedules exist but no executions detected"
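Each rule above pairs an expression with a `for` duration, so the condition has to hold across consecutive evaluations before the alert fires. The Python sketch below models that pending-to-firing transition using the group's 30-second interval and a 5-minute `for`. `alert_states` is a hypothetical helper, not part of Prometheus, and it ignores notification routing and resolution.

```python
def alert_states(condition_results, interval_seconds, for_seconds):
    """Return the alert state after each rule evaluation.

    condition_results holds one boolean per evaluation: whether the
    rule expression returned a result at that step.
    """
    states = []
    active_since = None
    for step, is_true in enumerate(condition_results):
        now = step * interval_seconds
        if not is_true:
            # Any evaluation without a result resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Condition true on 11 consecutive evaluations, 30 seconds apart.
print(alert_states([True] * 11, interval_seconds=30, for_seconds=300))
```

With these settings the alert stays pending for ten evaluations and fires on the eleventh. A single evaluation where the expression returns nothing restarts the wait, which is worth keeping in mind for bursty signals.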

Best Practices

Set Appropriate Scrape Intervals

Controller metrics: 30-60 seconds
Dashboard metrics: 15-30 seconds
Balance data freshness against scrape load and storage cost

Use Recording Rules

Pre-compute frequently used queries:

- record: chaos:experiments:running:total
  expr: sum(chaos_controller_manager_chaos_experiments{phase="Running"})

Monitor Metric Cardinality

Track unique label combinations to prevent cardinality explosion, especially with namespace and kind labels.
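A quick way to reason about cardinality is to compare the series actually observed with the worst case implied by the label values. The Python sketch below does both. The helper names and the label values are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import prod


def series_count(samples):
    """Number of distinct label sets actually observed."""
    return len({tuple(sorted(labels.items())) for labels in samples})


def worst_case_series(label_values):
    """Upper bound: product of the distinct value counts of each label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label values, for illustration only.
label_values = {
    "namespace": ["default", "staging", "production"],
    "kind": ["PodChaos", "NetworkChaos", "StressChaos", "IOChaos"],
    "phase": ["Running", "Paused", "Finished", "Failed"],
}
print(worst_case_series(label_values))  # 3 * 4 * 4 = 48
```

The bound grows multiplicatively, so adding namespaces multiplies the potential series for every metric that carries the namespace label.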

Correlate with Application Metrics

Combine Chaos Mesh metrics with application performance metrics to measure experiment impact.

Troubleshooting

Metrics Not Appearing

Check metrics endpoint:

kubectl port-forward -n chaos-mesh svc/chaos-controller-manager 10080:10080
curl http://localhost:10080/metrics
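The endpoint check above can also be scripted. The Python sketch below lists which of the documented metric names are absent from a /metrics response. It assumes the port-forward above is running when `fetch_metrics` is called. The helper names are hypothetical, and the parsing is a simplification of the Prometheus text format.

```python
import urllib.request

EXPECTED_METRICS = [
    "chaos_controller_manager_chaos_experiments",
    "chaos_controller_manager_chaos_workflows",
    "chaos_controller_manager_chaos_schedules",
]


def metric_names(exposition_text):
    """Collect metric names from sample lines, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or space (no labels).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected metric names that are absent from the output."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


def fetch_metrics(url="http://localhost:10080/metrics", timeout=5):
    """Fetch the endpoint exposed by the port-forward (not called here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline demonstration with a hypothetical response body.
response_text = (
    '# HELP chaos_controller_manager_chaos_experiments example help text\n'
    'chaos_controller_manager_chaos_experiments{namespace="default",kind="PodChaos",phase="Running"} 3\n'
    'chaos_controller_manager_chaos_workflows{namespace="default"} 5\n'
)
print(missing_metrics(response_text))
```

Against a live cluster, `missing_metrics(fetch_metrics())` performs the same check on the real response. A name in the result does not necessarily indicate a fault, since a metric can be absent simply because no resource of that type exists yet.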

Verify ServiceMonitor:

kubectl get servicemonitor -n chaos-mesh
kubectl describe servicemonitor chaos-controller-manager -n chaos-mesh

Missing Labels

A metric, or a particular label combination, may be absent until at least one resource of that type exists. Create a test experiment to populate the series.

High Cardinality

If you run into high-cardinality issues, consider:

Limiting the number of namespaces that run experiments
Aggregating away rarely used labels
Using recording rules to query pre-aggregated series

Next Steps

Dashboard

Visualize metrics through the Chaos Dashboard

Workflows

Monitor complex workflow executions

Scheduling

Track scheduled experiment metrics

Status Checks

Validate experiments with health checks
