Elasticsearch Administration

Oct 27, 2021

I have been using Graylog (community edition) for many years. Graylog uses Elasticsearch as its backend to store the logs.

Here are some Prometheus alerts to monitor Elasticsearch clusters, and some useful troubleshooting commands.

Monitor Elasticsearch with Prometheus

This should not be confused with ‘collecting Prometheus data with Elasticsearch’. I’m referring to monitoring the Elasticsearch service itself with Prometheus, using this exporter:

GitHub - prometheus-community/elasticsearch_exporter: Elasticsearch stats exporter for Prometheus (github.com)

My philosophy about monitoring:

monitoring without alerting is equivalent to no monitoring
alerting with too much noise is equivalent to no alerting
despite all efforts, your alert rules may still produce false positives (noise) and false negatives (uncaught incidents). To build a robust system, include monitoring and alerting in your postmortems, identify what your monitoring missed, and fill in the gap.

Note that metrics are sometimes renamed between versions of an exporter. A rename silently invalidates your alert rules and may leave broken graphs in Grafana too. It’s good practice to test a new version of a Prometheus exporter before upgrading.
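One way to test an exporter upgrade is to compare the metric names your alert rules rely on against what the new exporter version actually exposes. The following is a minimal sketch, not a definitive tool: the function names are hypothetical, the list of required metrics is copied from the alert rules in this post, and the default URL assumes the exporter listens on localhost:9114, which you should adjust to your setup.

```python
import re
import urllib.request
from typing import Iterable, List, Set

# Metric names referenced by the alert rules in this post.
REQUIRED_METRICS = {
    "elasticsearch_cluster_health_unassigned_shards",
    "elasticsearch_cluster_health_status",
    "elasticsearch_jvm_memory_used_bytes",
    "elasticsearch_jvm_memory_max_bytes",
    "elasticsearch_cluster_health_number_of_nodes",
    "elasticsearch_cluster_health_number_of_pending_tasks",
    "elasticsearch_filesystem_data_size_bytes",
    "elasticsearch_filesystem_data_free_bytes",
}

# A metric name is the leading token of each sample line.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def exposed_metric_names(metrics_text: str) -> Set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = METRIC_NAME.match(line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(metrics_text: str,
                    required: Iterable[str] = REQUIRED_METRICS) -> List[str]:
    """Return the required metric names that are absent from the output."""
    return sorted(set(required) - exposed_metric_names(metrics_text))


def fetch_metrics(url: str = "http://localhost:9114/metrics") -> str:
    """Download the exporter output. The URL is an assumption; adjust it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Running `missing_metrics(fetch_metrics())` against the new exporter version before rolling it out should return an empty list. Any name it returns is a metric that an alert rule would no longer match.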

Alert rules for Elasticsearch (good as of 2021):

groups:
- name: ElasticSearch
  rules:
  # Some shards have no node assigned
  - alert: UnassignedShards
    expr: elasticsearch_cluster_health_unassigned_shards > 0
  # Cluster status is red (at least one primary shard is unassigned)
  - alert: ClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
  # JVM memory usage above 90% of the maximum
  - alert: JVMUsage
    expr: (elasticsearch_jvm_memory_used_bytes / elasticsearch_jvm_memory_max_bytes) > 0.9
  # Fewer than 3 nodes in the cluster (adjust the threshold to your cluster size)
  - alert: HealthyNodes
    expr: elasticsearch_cluster_health_number_of_nodes < 3
  # Cluster-level changes are waiting to be executed
  - alert: NumberOfPendingTasks
    expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
  # Data filesystem more than 90% full
  - alert: ElasticSearchUsedFS
    expr: (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes) / elasticsearch_filesystem_data_size_bytes * 100 > 90
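To make the two ratio-based rules concrete, here is the same arithmetic worked through in Python. This is only an illustration: the helper names and the node figures are hypothetical, not taken from a real cluster.

```python
def used_fraction(used_bytes: float, max_bytes: float) -> float:
    """JVMUsage rule: used / max, compared against 0.9."""
    return used_bytes / max_bytes


def used_fs_percent(size_bytes: float, free_bytes: float) -> float:
    """ElasticSearchUsedFS rule: (size - free) / size * 100, compared against 90."""
    return (size_bytes - free_bytes) / size_bytes * 100


GIB = 1024 ** 3

# Hypothetical node: 500 GiB data disk with 40 GiB free,
# and an 8 GiB JVM memory limit with 7.5 GiB in use.
disk_alert = used_fs_percent(500 * GIB, 40 * GIB) > 90  # about 92% used: fires
jvm_alert = used_fraction(7.5 * GIB, 8 * GIB) > 0.9     # 0.9375: fires
```

Note the different scales: the JVM rule compares a fraction against 0.9, while the filesystem rule multiplies by 100 and compares a percentage against 90.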

Some common commands for troubleshooting.

On the node in question, get the cluster status with:

curl -X GET "localhost:9200/_cluster/health?pretty"
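If you prefer to check the health response in a script rather than by eye, something like the sketch below could work. It mirrors the cluster health alert rules from this post. The function names are hypothetical, the default of 3 expected nodes comes from the HealthyNodes rule, and the base URL assumes Elasticsearch answers on localhost:9200 without authentication.

```python
import json
import urllib.request
from typing import List


def fetch_cluster_health(base_url: str = "http://localhost:9200") -> dict:
    """Call GET /_cluster/health and decode the JSON body."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def health_problems(health: dict, expected_nodes: int = 3) -> List[str]:
    """List human-readable problems found in a cluster health response."""
    problems = []
    if health.get("status") == "red":
        problems.append("cluster status is red")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shard(s)")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} expected node(s) present")
    pending = health.get("number_of_pending_tasks", 0)
    if pending > 0:
        problems.append(f"{pending} pending task(s)")
    return problems
```

Calling `health_problems(fetch_cluster_health())` returns an empty list for a cluster that passes all four checks, and one line per problem otherwise.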

If you see an alert for unassigned shards, ask the cluster why they cannot be allocated:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
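The allocation explanation can be long, so it may help to boil it down to the lines that matter. The sketch below is one possible way to do that, assuming you have already parsed the JSON output into a dict (for example with `json.loads`). The function name is hypothetical, and the field names (`unassigned_info`, `node_allocation_decisions`, `deciders`) reflect my understanding of the response format; verify them against the output of your Elasticsearch version.

```python
from typing import List


def explain_unassigned(explanation: dict) -> List[str]:
    """Summarize an allocation explain response as short text lines."""
    shard_kind = "primary" if explanation.get("primary") else "replica"
    lines = [
        f"{explanation.get('index')} shard {explanation.get('shard')} "
        f"({shard_kind}): {explanation.get('current_state')}"
    ]
    # Why the shard became unassigned in the first place.
    reason = explanation.get("unassigned_info", {}).get("reason")
    if reason:
        lines.append(f"unassigned because: {reason}")
    # Why each node refuses to take the shard now.
    for node in explanation.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            lines.append(
                f"{node.get('node_name')}: [{decider.get('decider')}] "
                f"{decider.get('explanation')}"
            )
    return lines
```

Printing the result with `print("\n".join(explain_unassigned(response)))` gives one line for the shard, one for the original cause, and one per node and decider that blocks the allocation.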

Resolve any underlying issue reported by the allocation explanation (for example, a disk that became read-only), then retry the failed shard allocations:

curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"


Written by One9twO