  • monitoring without alerting is equivalent to no monitoring
  • alerting with too much noise is equivalent to no alerting
  • despite all efforts, your alert rules may still produce false positives (noise) and false negatives (uncaught incidents). To build a robust system, review monitoring and alerting in every postmortem, identify what your monitoring missed, and fill in the gaps.
- name: ElasticSearch
  rules:
  - alert: UnassignedShards
    expr: elasticsearch_cluster_health_unassigned_shards > 0
  - alert: ClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
  - alert: JVMUsage
    expr: (elasticsearch_jvm_memory_used_bytes/elasticsearch_jvm_memory_max_bytes) > 0.9
  - alert: HealthyNodes
    expr: elasticsearch_cluster_health_number_of_nodes < 3
  - alert: NumberOfPendingTasks
    expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
  - alert: ElasticSearchUsedFS
    expr: (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes) / elasticsearch_filesystem_data_size_bytes * 100 > 90
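To make the two ratio-based rules (JVMUsage and ElasticSearchUsedFS) concrete, the arithmetic can be reproduced offline. This is only an illustrative sketch: the function names and the sample byte counts are invented for the example, and in practice Prometheus evaluates these expressions itself.

```python
GIB = 1024 ** 3


def jvm_heap_ratio(used_bytes: float, max_bytes: float) -> float:
    """Mirror of used / max from the JVMUsage rule (a fraction, not a percent)."""
    return used_bytes / max_bytes


def fs_used_percent(size_bytes: float, free_bytes: float) -> float:
    """Mirror of (size - free) / size * 100 from the ElasticSearchUsedFS rule."""
    return (size_bytes - free_bytes) / size_bytes * 100


# Hypothetical samples: 7.5 GiB of an 8 GiB heap, 40 GiB free on a 500 GiB disk.
print(jvm_heap_ratio(7.5 * GIB, 8 * GIB) > 0.9)    # True: the rule would fire
print(fs_used_percent(500 * GIB, 40 * GIB) > 90)   # True: the rule would fire
```

Note that the two rules use different scales: the JVM rule compares a fraction against 0.9, while the filesystem rule multiplies by 100 and compares against 90.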
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
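The three commands above form a natural triage sequence: check health, explain why shards are unassigned, then retry failed allocations. A minimal sketch of that decision logic follows, assuming the cluster is reachable at `http://localhost:9200` as in the curl commands. `plan_triage`, `fetch_health`, and `BASE_URL` are hypothetical names introduced for illustration; the sketch uses the explicit `retry_failed=true` form of the reroute parameter and reads the `unassigned_shards` field of the `_cluster/health` response.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: same host and port as the curl commands


def plan_triage(health: dict) -> list[tuple[str, str]]:
    """Given a parsed _cluster/health response, return the (method, path)
    calls worth making next. Nothing is returned when no shard is unassigned."""
    steps = []
    if health.get("unassigned_shards", 0) > 0:
        # First find out why allocation failed, then ask the cluster to retry.
        steps.append(("GET", "/_cluster/allocation/explain?pretty"))
        steps.append(("POST", "/_cluster/reroute?retry_failed=true"))
    return steps


def fetch_health(base_url: str = BASE_URL) -> dict:
    """Fetch and parse the cluster health document (performs a network call)."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)
```

A typical use would be `for method, path in plan_triage(fetch_health()): print(method, path)`, reviewing the explain output before actually issuing the reroute. Retrying only helps once the underlying cause (for example a full disk) has been fixed.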



