Elasticsearch Administration

2 min readOct 27, 2021

I have been using Graylog (community edition) for many years. The backend of Graylog is Elasticsearch DB, which stores the logs.

Here are some Prometheus alerts to monitor Elasticsearch clusters, and some useful troubleshooting commands.

Monitor Elasticsearch with Prometheus

This shall not be confused with ‘collecting Prometheus data with Elasticsearch’. I’m referring to monitoring the Elasticsearch service with Prometheus:

My philosophies about monitoring are:

  • monitoring without alerting is equivalent to no monitoring
  • alerting with too much noise is equivalent to no alerting
  • despite all efforts, there may still be false positives (noise), true negatives (uncaught incidents) in your alert rules. To build a robust system, include monitoring/alerting in your postmortems, identify what’s missing in your monitoring, and fill in the gap.

Note that sometimes the metrics may be renamed after a version change. Which will invalidate your alert rules, and perhaps leaving some broken graphs on Grafana too. It’s good to test the new version of Prometheus exporters before upgrading.

Alert rules for Elasticsearch (good as of 2021):

- name: ElasticSearch
- alert: UnassignedShards
expr: elasticsearch_cluster_health_unassigned_shards > 0
- alert: ClusterRed
expr: elasticsearch_cluster_health_status{color="red"} == 1
- alert: JVMUsage
expr: (elasticsearch_jvm_memory_used_bytes/elasticsearch_jvm_memory_max_bytes) > 0.9
- alert: HealthyNodes
expr: elasticsearch_cluster_health_number_of_nodes < 3
- alert: NumberOfPendingTasks
expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
- alert: ElasticSearchUsedFS
expr: (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes) / elasticsearch_filesystem_data_size_bytes * 100 > 90

Some common commands for troubleshooting.

In the node-in-question, get the cluster status by:

curl -X GET "localhost:9200/_cluster/health?pretty"

If you see an alert for unassigned shards, look into

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

Resolve any underlying issues you see from above output (for example, disk became read-only), then retry the failed jobs:

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'