elasticsearch: How to interpret log file (cluster went to yellow status)?

Elasticsearch 1.7.2 on CentOS, 8GB RAM, 2 node cluster.
We posted the whole log here: http://pastebin.com/zc2iG2q4
When we look at /_cluster/health , we see 2 unassigned shards:
{
"cluster_name" : "elasticsearch-prod",
"status" : "yellow", <--------------------------
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 5,
"active_shards" : 8,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2, <--------------------------
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
In the log, we see:
marking and sending shard failed due to [failed to create shard]
java.lang.OutOfMemoryError: Java heap space
And other errors.
The only memory related config value we have is:
indices.fielddata.cache.size: 75%
We are looking to:
understand the log more completely
understand what action we need to take to address the situation now (recover) and prevent it in the future
Additional details:
1) ES_HEAP_SIZE is stock, no changes. (Further, looking around, it is not clear where best to change it.... /etc/init.d/elasticsearch ?)
2) Our jvm stats are below. (And please note, as a test, I modded "/etc/init.d/elasticsearch" and added export ES_HEAP_SIZE=4g [in place of the existing "export ES_HEAP_SIZE" line] and restarted ES. Comparing two identical nodes, one with the changed elasticsearch file and one stock, the values below appear identical.)
"jvm" : {
"timestamp" : 1448395039780,
"uptime_in_millis" : 228297,
"mem" : {
"heap_used_in_bytes" : 81418872,
"heap_used_percent" : 7,
"heap_committed_in_bytes" : 259522560,
"heap_max_in_bytes" : 1037959168,
"non_heap_used_in_bytes" : 50733680,
"non_heap_committed_in_bytes" : 51470336,
"pools" : {
"young" : {
"used_in_bytes" : 52283368,
"max_in_bytes" : 286326784,
"peak_used_in_bytes" : 71630848,
"peak_max_in_bytes" : 286326784
},
"survivor" : {
"used_in_bytes" : 2726824,
"max_in_bytes" : 35782656,
"peak_used_in_bytes" : 8912896,
"peak_max_in_bytes" : 35782656
},
"old" : {
"used_in_bytes" : 26408680,
"max_in_bytes" : 715849728,
"peak_used_in_bytes" : 26408680,
"peak_max_in_bytes" : 715849728
}
}
},
"threads" : {
"count" : 81,
"peak_count" : 81
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 250,
"collection_time_in_millis" : 477
},
"old" : {
"collection_count" : 1,
"collection_time_in_millis" : 22
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 112,
"used_in_bytes" : 20205138,
"total_capacity_in_bytes" : 20205138
},
"mapped" : {
"count" : 0,
"used_in_bytes" : 0,
"total_capacity_in_bytes" : 0
}
}
},
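(For reference, per-node JVM stats like the above can be pulled with the nodes stats API, roughly:)
curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'
Note that heap_max_in_bytes above is roughly 1 GB, i.e. the default heap is still in effect.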

Solved.
The key here is the error "java.lang.OutOfMemoryError: Java heap space"
Another day, another gem from the ES docs:
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
says (emphasis mine):
The default installation of Elasticsearch is configured with a 1 GB heap. For just about every deployment, this number is far too small. If you are using the default heap values, your cluster is probably configured incorrectly.
Resolution:
Edit: /etc/sysconfig/elasticsearch
Set ES_HEAP_SIZE=4g // this system has 8GB RAM
Restart ES
And tada.... the unassigned shards are magically assigned, and the cluster goes green.
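As shell commands, the fix is roughly this (a sketch; paths assume the stock CentOS package install):
# /etc/sysconfig/elasticsearch -- give ES about half of the 8GB RAM
ES_HEAP_SIZE=4g
# then restart and re-check
sudo service elasticsearch restart
curl -XGET 'localhost:9200/_cluster/health?pretty'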

Related

Elasticsearch get snapshot size

I'm looking for a way to get the storage size of a specific Elasticsearch snapshot. The snapshots are located on a shared filesystem.
It seems there is no API for this?
To get the size or status of an Elasticsearch snapshot, run the snapshot status API:
curl -X GET "localhost:9200/_snapshot/my_repository/my_snapshot/_status?pretty"
Note: substitute your own repository and snapshot names in the curl above.
Sample Output:
"snapshots" : [
{
"snapshot" : "index-01",
"repository" : "my_repository",
"uuid" : "OKHNDHSKENGHLEWNALWEERTJNS",
"state" : "SUCCESS",
"include_global_state" : true,
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 2,
"failed" : 0,
"total" : 2
},
"stats" : {
"incremental" : {
"file_count" : 149,
"size_in_bytes" : 8229187919
},
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 149,
"processed_files" : 149,
"total_size_in_bytes" : 8229187919,
"processed_size_in_bytes" : 8229187919
},
"indices" : {
"graylog_130" : {
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 2,
"failed" : 0,
"total" : 2
},
"stats" : {
"incremental" : {
"file_count" : 149,
"size_in_bytes" : 8229187919
},
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 149,
"processed_files" : 149,
"total_size_in_bytes" : 8229187919,
"processed_size_in_bytes" : 8229187919
},
"shards" : {
"0" : {
"stage" : "DONE",
"stats" : {
"incremental" : {
"file_count" : 97,
"size_in_bytes" : 1807163337
},
"total" : {
"file_count" : 271,
"size_in_bytes" : 84885391182
},
"start_time_in_millis" : 1631622334048,
"time_in_millis" : 49607,
"number_of_files" : 97,
"processed_files" : 97,
"total_size_in_bytes" : 1807163337,
"processed_size_in_bytes" : 1807163337
}
},
"1" : {
"stage" : "DONE",
"stats" : {
"incremental" : {
"file_count" : 52,
"size_in_bytes" : 6422024582
},
"total" : {
"file_count" : 192,
"size_in_bytes" : 84515939637
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 52,
"processed_files" : 52,
"total_size_in_bytes" : 6422024582,
"processed_size_in_bytes" : 6422024582
}
}
}
}
In the above output, look for
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
}
Now convert size_in_bytes to GB and you will get the exact size of the snapshot in GB.
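For example, a quick shell one-liner on the total above:
awk 'BEGIN { printf "%.2f GB\n", 169401330819 / 1024 / 1024 / 1024 }'
# prints 157.77 GB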
You could get the storage used by an index using the _cat API (primary store size); the first snapshot should be around the index size.
For incremental snapshots, it depends. This is because snapshots are taken at the segment level ( index-.. ), so they may be much smaller depending on your indexing. Merges could cause new segments to form, etc.
https://www.elastic.co/blog/found-elasticsearch-snapshot-and-restore gives a nice overview.
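For example, something like this shows the primary store size that a first snapshot should roughly match (my_index is just a placeholder):
curl -XGET 'localhost:9200/_cat/indices/my_index?v&h=index,pri,rep,docs.count,pri.store.size'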
I need the exact size used on the storage.
For now I use the following approach: separate directories at the index/snapshot level, so I can get the used storage size at the system level (du command) for a specific index or snapshot.
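For example (the repository layout here is hypothetical):
du -sh /backups/es-repo/my_index/*    # one line per snapshot directory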

unassigned_shards how to solve it

curl -XGET 'http://localhost:9200/_cluster/health?level=shards'
{
"cluster_name" : "elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 1486,
"active_shards" : 1486,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1486,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 50.0
}
How do I solve the unassigned shards and the active_shards_percent_as_number?
(I am using ES 2.4, latest.)
Thank you.
It's simply because you have a single node in your cluster but all your indices are configured to have one replica per shard (the default setting).
If you run the following command you'll get rid of the replicas and your cluster will turn green:
curl -XPUT 'localhost:9200/_settings' -d '{
"index" : {
"number_of_replicas" : 0
}
}'
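You can then confirm with the health endpoint; unassigned_shards should drop to 0 and active_shards_percent_as_number should read 100:
curl -XGET 'localhost:9200/_cluster/health?pretty'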

Elastic Search 2.3.4 Stops allocating shards with no obvious reason

I am attempting to upgrade our Elastic Search cluster from 1.6 to 2.3.4. The upgrade seems to work, and I can see shard allocation starting to happen within Kopf - but at some point the shard allocation appears to stop with many shards left unallocated, and no errors being reported in the logs. Typically I'm left with 1200 / 3800 shards unallocated.
We have a typical 3-node cluster, and I am trialing this standalone with all 3 nodes running on my local machine.
I have seen similar symptoms reported - see https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html
. The solution there seemed to be to manually allocate the shards, which I've tried (and it works), but I'm at a loss to explain the behaviour of Elasticsearch here. I'd prefer not to go down this route, as I want my cluster to spin up automatically without intervention.
There is also https://github.com/elastic/elasticsearch/pull/14494 which seems to be resolved with the latest ES version, so shouldn't be a problem.
There are no errors in the log files - I have upped the root-level logging to 'DEBUG' in order to see what I can. What I can see are lines like the below for each unallocated shard (this is from the master node's logs):
[2016-07-26 09:18:04,859][DEBUG][gateway ] [germany] [index][4] found 0 allocations of [index][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-07-26T08:05:04.447Z]], highest version: [-1]
[2016-07-26 09:18:04,859][DEBUG][gateway ] [germany] [index][4]: not allocating, number_of_allocated_shards_found [0]
Config file (with comments removed):
cluster.name: elasticsearch-jm-2.3.4
node.name: germany
script.inline: true
script.indexed: true
If I query the cluster health after reallocation has stopped - I get the response below:
http://localhost:9200/_cluster/health?pretty
cluster_name : elasticsearch-jm-2.3.4
status : red
timed_out : False
number_of_nodes : 3
number_of_data_nodes : 3
active_primary_shards : 1289
active_shards : 2578
relocating_shards : 0
initializing_shards : 0
unassigned_shards : 1264
delayed_unassigned_shards : 0
number_of_pending_tasks : 0
number_of_in_flight_fetch : 0
task_max_waiting_in_queue_millis : 0
active_shards_percent_as_number : 67.10046850598647
Further querying for shards, filtered to one index with unallocated shards. As can be seen, shards 0 and 4 are unassigned whereas shards 1, 2 and 3 have been allocated:
http://localhost:9200/_cat/shards
cs-payment-warn-2016.07.20 3 p STARTED 106 92.4kb 127.0.0.1 germany
cs-payment-warn-2016.07.20 3 r STARTED 106 92.4kb 127.0.0.1 switzerland
cs-payment-warn-2016.07.20 4 p UNASSIGNED
cs-payment-warn-2016.07.20 4 r UNASSIGNED
cs-payment-warn-2016.07.20 2 r STARTED 120 74.5kb 127.0.0.1 cyprus
cs-payment-warn-2016.07.20 2 p STARTED 120 74.5kb 127.0.0.1 germany
cs-payment-warn-2016.07.20 1 r STARTED 120 73.8kb 127.0.0.1 cyprus
cs-payment-warn-2016.07.20 1 p STARTED 120 73.8kb 127.0.0.1 germany
cs-payment-warn-2016.07.20 0 p UNASSIGNED
cs-payment-warn-2016.07.20 0 r UNASSIGNED
Manually rerouting an unassigned shard appears to work (stripped-back result set below):
http://localhost:9200/_cluster/reroute
POST:
{
"dry_run": true,
"commands": [
{
"allocate": {
"index": "cs-payment-warn-2016.07.20",
"shard": 4,
"node": "switzerland" ,
"allow_primary": true
}
}
]
}
Response:
{
"acknowledged" : true,
"state" : {
"version" : 722,
"state_uuid" : "Vw2vPoCMQk2ZosjzviD4TQ",
"master_node" : "yhL7XXy-SKu_WAM-C33dzA",
"blocks" : {},
"nodes" : {},
"routing_table" : {
"indices" : {
"cs-payment-warn-2016.07.20" : {
"shards" : {
"3" : [{
"state" : "STARTED",
"primary" : true,
"node" : "yhL7XXy-SKu_WAM-C33dzA",
"relocating_node" : null,
"shard" : 3,
"index" : "cs-payment-warn-2016.07.20",
"version" : 22,
"allocation_id" : {
"id" : "x_Iq88hmTqiasrjW09hVuw"
}
}, {
"state" : "STARTED",
"primary" : false,
"node" : "1a8dgBscTUS3c7Pv4mN9CQ",
"relocating_node" : null,
"shard" : 3,
"index" : "cs-payment-warn-2016.07.20",
"version" : 22,
"allocation_id" : {
"id" : "DF-EUEy_SpeUElnZI6cgsQ"
}
}
],
"4" : [{
"state" : "INITIALIZING",
"primary" : true,
"node" : "1a8dgBscTUS3c7Pv4mN9CQ",
"relocating_node" : null,
"shard" : 4,
"index" : "cs-payment-warn-2016.07.20",
"version" : 1,
"allocation_id" : {
"id" : "1tw7C7YPQsWwm_O-8mYHRg"
},
"unassigned_info" : {
"reason" : "INDEX_CREATED",
"at" : "2016-07-26T14:20:15.395Z",
"details" : "force allocation from previous reason CLUSTER_RECOVERED, null"
}
}, {
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 4,
"index" : "cs-payment-warn-2016.07.20",
"version" : 1,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-07-26T11:24:11.868Z"
}
}
],
"2" : [{
"state" : "STARTED",
"primary" : false,
"node" : "rlRQ2u0XQRqxWld-wSrOug",
"relocating_node" : null,
"shard" : 2,
"index" : "cs-payment-warn-2016.07.20",
"version" : 22,
"allocation_id" : {
"id" : "eQ-_vWNbRp27So0iGSitmA"
}
}, {
"state" : "STARTED",
"primary" : true,
"node" : "yhL7XXy-SKu_WAM-C33dzA",
"relocating_node" : null,
"shard" : 2,
"index" : "cs-payment-warn-2016.07.20",
"version" : 22,
"allocation_id" : {
"id" : "O1PU1_NVS8-uB2yBrG76MA"
}
}
],
"1" : [{
"state" : "STARTED",
"primary" : false,
"node" : "rlRQ2u0XQRqxWld-wSrOug",
"relocating_node" : null,
"shard" : 1,
"index" : "cs-payment-warn-2016.07.20",
"version" : 24,
"allocation_id" : {
"id" : "ZmxtOvorRVmndR15OJMkMA"
}
}, {
"state" : "STARTED",
"primary" : true,
"node" : "yhL7XXy-SKu_WAM-C33dzA",
"relocating_node" : null,
"shard" : 1,
"index" : "cs-payment-warn-2016.07.20",
"version" : 24,
"allocation_id" : {
"id" : "ZNgzePThQxS-iqhRSXzZCw"
}
}
],
"0" : [{
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "cs-payment-warn-2016.07.20",
"version" : 0,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-07-26T11:24:11.868Z"
}
}, {
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "cs-payment-warn-2016.07.20",
"version" : 0,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-07-26T11:24:11.868Z"
}
}
]
}
}
},
"routing_nodes" : {
"unassigned" : [{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 4,
"index" : "cs-payment-warn-2016.07.20",
"version" : 1,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-07-26T11:24:11.868Z"
}
}, {
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "cs-payment-warn-2016.07.20",
"version" : 0,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-07-26T11:24:11.868Z"
}
}, {
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "cs-payment-warn-2016.07.20",
"version" : 0,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-07-26T11:24:11.868Z"
}
}
]
},
"nodes" : {
"rlRQ2u0XQRqxWld-wSrOug" : [{
"state" : "STARTED",
"primary" : false,
"node" : "rlRQ2u0XQRqxWld-wSrOug",
"relocating_node" : null,
"shard" : 2,
"index" : "cs-payment-warn-2016.07.20",
"version" : 22,
"allocation_id" : {
"id" : "eQ-_vWNbRp27So0iGSitmA"
}
}, {
"state" : "STARTED",
"primary" : false,
"node" : "rlRQ2u0XQRqxWld-wSrOug",
"relocating_node" : null,
"shard" : 1,
"index" : "cs-payment-warn-2016.07.20",
"version" : 24,
"allocation_id" : {
"id" : "ZmxtOvorRVmndR15OJMkMA"
}
}
]
}
}
}
}
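(For reference, dropping "dry_run" from the same body actually applies the allocation - a rough sketch. Be aware that allow_primary can bring up an empty primary and lose data if no good copy of that shard exists:)
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "cs-payment-warn-2016.07.20",
        "shard": 4,
        "node": "switzerland",
        "allow_primary": true
      }
    }
  ]
}'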

Elastic Search Index Status

I am trying to setup a scripted reindex operation as suggested in: http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
To go with the suggestion of creating a new index, aliasing it, and then deleting the old index, I would need a way to tell when the indexing operation on the new index is complete, ideally via the REST interface.
It has 80 million rows to index and can take a few hours.
I can't find anything helpful in the docs.
You can try with _stats: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-stats.html
E.g.:
{
"_shards" : {
"total" : 10,
"successful" : 5,
"failed" : 0
},
"_all" : {
"primaries" : {
"docs" : {
"count" : 0,
"deleted" : 0
},
"store" : {
"size_in_bytes" : 575,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 0,
"index_time_in_millis" : 0,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0,
"noop_update_total" : 0,
"is_throttled" : false,
"throttle_time_in_millis" : 0
},
I think you can compare _all.total.docs.count and _all.total.indexing.index_current.
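For example, poll the stats of the new index (the index name below is a placeholder) and watch for docs.count to stop growing and indexing.index_current to reach 0:
curl -XGET 'localhost:9200/my_new_index/_stats?pretty'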

remove unallocated elasticsearch shard

My installation is working fine, but one index is showing yellow health due to 2 'extra' unassigned shards shown in it. How do I remove these extra shards? My current settings are 0 replicas and 2 shards.
curl -XGET 'http://localhost:9200/_cluster/health/tags?level=shards&pretty=true'
{
"cluster_name" : "elasticsearch_inspire",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 2,
"active_shards" : 2,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2,
"indices" : {
"v1_tags" : {
"status" : "yellow",
"number_of_shards" : 2,
"number_of_replicas" : 1,
"active_primary_shards" : 2,
"active_shards" : 2,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2,
"shards" : {
"0" : {
"status" : "yellow",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1
},
"1" : {
"status" : "yellow",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1
}
}
}
}
}
I recently had this issue where the ES process was restarted on all nodes and I had some unassigned shards (cluster health was red).
I did the following to assign the shards manually, and that did the trick for me (no restart was needed):
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
"commands": [
{
"allocate": {
"index": "INDEXNAME",
"shard": "SHARDNUMBER",
"node": "PICKANODE",
"allow_primary": 1
}
}
]
}'
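To fill in the placeholders, the _cat APIs list the node names and the unassigned shards, e.g.:
curl -s -XGET 'localhost:9200/_cat/nodes?v'
curl -s -XGET 'localhost:9200/_cat/shards?v' | grep UNASSIGNED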
Restarting the node seemed to be the only way to solve it.
If anyone has a less-aggressive solution, please post it :)
