I have an ELK stack with two Elasticsearch nodes, and the cluster state turned red because of some unassigned shards that I can't get rid of. Looking up the unassigned shard, or rather the incomplete index, with:
# curl -s elastic01.local:9200/_cat/shards | grep "logstash-2014.09.29"
Shows:
logstash-2014.09.29 4 p STARTED 745489 481.3mb 10.165.98.107 Crimson and the Raven
logstash-2014.09.29 4 r STARTED 745489 481.3mb 10.165.98.106 Glenn Talbot
logstash-2014.09.29 0 p STARTED 781110 502.3mb 10.165.98.107 Crimson and the Raven
logstash-2014.09.29 0 r STARTED 781110 502.3mb 10.165.98.106 Glenn Talbot
logstash-2014.09.29 3 p INITIALIZING 10.165.98.107 Crimson and the Raven
logstash-2014.09.29 3 r UNASSIGNED
logstash-2014.09.29 1 p STARTED 762991 490.1mb 10.165.98.107 Crimson and the Raven
logstash-2014.09.29 1 r STARTED 762991 490.1mb 10.165.98.106 Glenn Talbot
logstash-2014.09.29 2 p STARTED 761811 491.3mb 10.165.98.107 Crimson and the Raven
logstash-2014.09.29 2 r STARTED 761811 491.3mb 10.165.98.106 Glenn Talbot
My attempt to assign the shard to the other node fails:
curl -XPOST -s 'http://elastic01.local:9200/_cluster/reroute?pretty=true' -d '{
"commands" : [ {
"allocate" : {
"index" : "logstash-2014.09.29",
"shard" : 3 ,
"node" : "Glenn Talbot",
"allow_primary" : 1
}
}
]
}'
With:
NO(primary shard is not yet active)]
I can't really seem to find an API to push the shard states any further. How could I proceed here?
Just for a complete picture, this is what the cluster health looks like:
{
"cluster_name" : "logstash_es",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 114,
"active_shards" : 228,
"relocating_shards" : 0,
"initializing_shards" : 1,
"unassigned_shards" : 1
}
Thank you for your time and help
I ran into this situation with Elasticsearch 1.5 just the other day. After initially getting the same error, I simply repeated the /_cluster/reroute request the next day for lack of other ideas. This time it worked and put the cluster back into a green state immediately.
I am posting a more general question after finding that I may have more issues than low disk space:
optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures
My issue is that my ES server crashes occasionally, and I cannot figure out why.
I want it to stay up reliably for at least several days and, if an error occurs, to restart the instance automatically.
Which best practices could I follow to debug ES on a small server instance running a single node?
This is what I am looking at:
(useful resource - https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/)
Check on available disk space - optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures
Check on ES log (/var/log/elasticsearch):
...
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: org.elasticsearch.action.NoShardAvailableActionException
... 60 more
[2020-05-12T15:05:56,874][INFO ][o.e.c.r.a.AllocationService] [awesome3-master] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[en-awesome-wiki][2]] ...]).
[2020-05-12T15:10:48,998][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [awesome3-master] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[target-validation][4], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-05-12T15:05:54.260Z], delayed=false, allocation_status[no_attempt]]]
I spotted a shard allocation error somewhere, so I check:
curl -s 'localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
15 616.2mb 10.6gb 12.5gb 23.1gb 45 127.0.0.1 127.0.0.1 awesome3-master
15 UNASSIGNED
What does this mean? Are the indices duplicated across several replicas (see below)?
I check:
curl -s -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
target-validation 4 r UNASSIGNED CLUSTER_RECOVERED
target-validation 2 r UNASSIGNED CLUSTER_RECOVERED
target-validation 1 r UNASSIGNED CLUSTER_RECOVERED
target-validation 3 r UNASSIGNED CLUSTER_RECOVERED
target-validation 0 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 4 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 2 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 1 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 3 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 0 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 4 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 2 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 1 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 3 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 0 r UNASSIGNED CLUSTER_RECOVERED
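A listing like this can be condensed into per-index counts, which makes it easier to see whether any primaries are missing or only replicas. The following is a minimal sketch, not part of the original post: the function name summarise_unassigned is made up for illustration, and it assumes the five columns requested above (index, shard, prirep, state, unassigned.reason).

```shell
# Summarise `_cat/shards` output read from stdin.
# Assumed columns: index shard prirep state unassigned.reason
# Prints one line per index: "<index> <unassigned primaries> <unassigned replicas>".
# summarise_unassigned is a hypothetical helper name, not an Elasticsearch tool.
summarise_unassigned() {
  awk '$4 == "UNASSIGNED" {
         if ($3 == "p") p[$1]++; else r[$1]++
         seen[$1] = 1
       }
       END {
         # Unset counters print as 0 with %d.
         for (i in seen) printf "%s %d %d\n", i, p[i], r[i]
       }' | sort
}
```

Piping the curl output above through summarise_unassigned (instead of grep) gives one line per index. For the listing above, every index shows 0 unassigned primaries and 5 unassigned replicas, which is consistent with the YELLOW status in the log: all primaries are assigned and only replicas are missing.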
This raises a question: does ES try to create new replicas each time an error brings the system down?
So I look at an explanation:
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
{
"shard" : {
"index" : "target-validation",
"index_uuid" : "ONFPE7UQQzWjrhG0ztlSdw",
"id" : 4,
"primary" : false
},
"assigned" : false,
"shard_state_fetch_pending" : false,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2020-05-12T15:05:54.260Z",
"delayed" : false,
"allocation_status" : "no_attempt"
},
"allocation_delay_in_millis" : 60000,
"remaining_delay_in_millis" : 0,
"nodes" : {
"Ynm6YG-MQyevaDqT2n9OeA" : {
"node_name" : "awesome3-master",
"node_attributes" : { },
"store" : {
"shard_copy" : "AVAILABLE"
},
"final_decision" : "NO",
"final_explanation" : "the shard cannot be assigned because allocation deciders return a NO decision",
"weight" : 9.5,
"decisions" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated on the same node id [Ynm6YG-MQyevaDqT2n9OeA] on which it already exists"
}
]
}
}
}
Now, I would like to better understand what a shard is and what ES is attempting to do.
Should I delete unused replicas?
And finally, what should I do to test that the service is "sufficiently" reliable?
Kindly let me know if there are best practices to follow for debugging ES and tuning the server.
My constraint is a small server; I would be happy if the server simply took a little longer instead of crashing.
EDIT
Found this very useful question :
Shards and replicas in Elasticsearch
and this answer may offer a solution:
https://stackoverflow.com/a/50641899/305883
Before testing it out as an answer, could you kindly help me figure out whether and how to back up the indices, and how to estimate the correct parameters?
I run a single server and assume, given the above configuration, that number_of_shards should be 1 (a single machine) and that number_of_replicas could be at most 2 (the disk should handle it):
curl -XPUT 'localhost:9200/sampleindex?pretty' -H 'Content-Type: application/json' -d '
{
"settings":{
"number_of_shards":1,
"number_of_replicas":2
}
}'
I have an ES cluster of 2 nodes. After I restarted the nodes, the cluster status is yellow because some of the shards are unassigned. I searched around, and the common solution is to reroute the unassigned shards. Unfortunately, it doesn't work for me.
curl localhost:9200/_cluster/health?pretty=true
{
"cluster_name" : "infra",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 34,
"active_shards" : 68,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 31,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 68.68686868686868
}
curl localhost:9200/_cluster/settings?pretty
{
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"enable" : "all"
}
}
}
}
}
curl localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open logstash-log-2016.05.13 5 2 88314 0 300.5mb 150.2mb
yellow open logstash-log-2016.05.12 5 2 254450 0 833.9mb 416.9mb
yellow open .kibana 1 2 3 0 47.8kb 25.2kb
green open .marvel-es-data-1 1 1 3 0 8.7kb 4.3kb
yellow open logstash-log-2016.05.11 5 2 313095 0 709.1mb 354.6mb
yellow open logstash-log-2016.05.10 5 2 613744 0 1gb 520.2mb
green open .marvel-es-1-2016.05.18 1 1 88720 495 89.9mb 45mb
green open .marvel-es-1-2016.05.17 1 1 69430 492 59.4mb 29.7mb
yellow open logstash-log-2016.05.17 5 2 188924 0 518.2mb 259mb
yellow open logstash-log-2016.05.18 5 2 226775 0 683.7mb 366.1mb
Rerouting
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
"commands": [
{
"allocate": {
"index": "logstash-log-2016.05.13",
"shard": 3,
"node": "elasticsearch-mon-1",
"allow_primary": true
}
}
]
}'
{
"error" : {
"root_cause" : [ {
"type" : "illegal_argument_exception",
"reason" : "[allocate] allocation of [logstash-log-2016.05.13][3] on node {elasticsearch-mon-1}{K-J8WKyZRB6bE4031kHkKA}{172.45.0.56}{172.45.0.56:9300} is not allowed, reason: [YES(allocation disabling is ignored)][NO(shard cannot be allocated on same node [K-J8WKyZRB6bE4031kHkKA] it already exists on)][YES(no allocation awareness enabled)][YES(allocation disabling is ignored)][YES(target node version [2.3.2] is same or newer than source node version [2.3.2])][YES(primary is already active)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][YES(shard not primary or relocation disabled)][YES(node passes include/exclude/require filters)][YES(enough disk for shard on node, free: [25.4gb])][YES(below shard recovery limit of [2])]"
} ],
"type" : "illegal_argument_exception",
"reason" : "[allocate] allocation of [logstash-log-2016.05.13][3] on node {elasticsearch-mon-1}{K-J8WKyZRB6bE4031kHkKA}{172.45.0.56}{172.45.0.56:9300} is not allowed, reason: [YES(allocation disabling is ignored)][NO(shard cannot be allocated on same node [K-J8WKyZRB6bE4031kHkKA] it already exists on)][YES(no allocation awareness enabled)][YES(allocation disabling is ignored)][YES(target node version [2.3.2] is same or newer than source node version [2.3.2])][YES(primary is already active)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][YES(shard not primary or relocation disabled)][YES(node passes include/exclude/require filters)][YES(enough disk for shard on node, free: [25.4gb])][YES(below shard recovery limit of [2])]"
},
"status" : 400
}
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
"commands": [
{
"allocate": {
"index": "logstash-log-2016.05.13",
"shard": 3,
"node": "elasticsearch-mon-2",
"allow_primary": true
}
}
]
}'
{
"error" : {
"root_cause" : [ {
"type" : "illegal_argument_exception",
"reason" : "[allocate] allocation of [logstash-log-2016.05.13][3] on node {elasticsearch-mon-2}{Rxgq2aWPSVC0pvUW2vBgHA}{172.45.0.166}{172.45.0.166:9300} is not allowed, reason: [YES(allocation disabling is ignored)][NO(shard cannot be allocated on same node [Rxgq2aWPSVC0pvUW2vBgHA] it already exists on)][YES(no allocation awareness enabled)][YES(allocation disabling is ignored)][YES(target node version [2.3.2] is same or newer than source node version [2.3.2])][YES(primary is already active)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][YES(shard not primary or relocation disabled)][YES(node passes include/exclude/require filters)][YES(enough disk for shard on node, free: [25.4gb])][YES(below shard recovery limit of [2])]"
} ],
"type" : "illegal_argument_exception",
"reason" : "[allocate] allocation of [logstash-log-2016.05.13][3] on node {elasticsearch-mon-2}{Rxgq2aWPSVC0pvUW2vBgHA}{172.45.0.166}{172.45.0.166:9300} is not allowed, reason: [YES(allocation disabling is ignored)][NO(shard cannot be allocated on same node [Rxgq2aWPSVC0pvUW2vBgHA] it already exists on)][YES(no allocation awareness enabled)][YES(allocation disabling is ignored)][YES(target node version [2.3.2] is same or newer than source node version [2.3.2])][YES(primary is already active)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][YES(shard not primary or relocation disabled)][YES(node passes include/exclude/require filters)][YES(enough disk for shard on node, free: [25.4gb])][YES(below shard recovery limit of [2])]"
},
"status" : 400
}
So it fails and changes nothing. The shards are still unassigned.
Thank you.
Added
curl localhost:9200/_cat/shards
logstash-log-2016.05.13 2 p STARTED 17706 31.6mb 172.45.0.166 elasticsearch-mon-2
logstash-log-2016.05.13 2 r STARTED 17706 31.5mb 172.45.0.56 elasticsearch-mon-1
logstash-log-2016.05.13 2 r UNASSIGNED
logstash-log-2016.05.13 4 p STARTED 17698 31.6mb 172.45.0.166 elasticsearch-mon-2
logstash-log-2016.05.13 4 r STARTED 17698 31.4mb 172.45.0.56 elasticsearch-mon-1
logstash-log-2016.05.13 4 r UNASSIGNED
For all the indices that are yellow you have configured 2 replicas:
health status index pri rep
yellow open logstash-log-2016.05.13 5 2
yellow open logstash-log-2016.05.12 5 2
yellow open .kibana 1 2
yellow open logstash-log-2016.05.11 5 2
yellow open logstash-log-2016.05.10 5 2
yellow open logstash-log-2016.05.17 5 2
yellow open logstash-log-2016.05.18 5 2
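Two replicas on a two-node cluster cannot all be assigned: a replica is never placed on a node that already holds a copy of the same shard, so each shard can have at most two copies here (one primary and one replica). You need a third node for all the replicas to be assigned.
Or decrease the number of replicas: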
PUT /logstash-log-*,.kibana/_settings
{
"index": {
"number_of_replicas": 1
}
}
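The numbers in the question bear this out. As a rough sketch of the arithmetic (the function name unassignable_replicas is hypothetical, and it assumes every copy of a shard must sit on a different data node, as the same-node rejection in the reroute error indicates):

```shell
# Number of replica copies that can never be assigned.
# Each shard needs (1 + replicas) copies on distinct nodes, so any copies
# beyond the number of data nodes stay unassigned.
# Arguments: primaries replicas data_nodes
# unassignable_replicas is a hypothetical helper name for illustration only.
unassignable_replicas() {
  primaries=$1
  replicas=$2
  nodes=$3
  excess=$(( 1 + replicas - nodes ))
  if [ "$excess" -lt 0 ]; then
    excess=0
  fi
  echo $(( primaries * excess ))
}
```

The yellow indices hold 5+5+1+5+5+5+5 = 31 primary shards, so unassignable_replicas 31 2 2 prints 31, matching the unassigned_shards value in the health output, and unassignable_replicas 31 1 2 prints 0.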
I had the same problem with version 5.1.2.
I tried the option below and it worked:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
"transient": { "cluster.routing.allocation.enable" : "all" }
}'
After this, the shards were allocated automatically.
Problem: I've started five Elasticsearch nodes, but only 66.84 % of the data is available in Kibana. When I check the cluster health with localhost:9200/_cluster/health?pretty=true I get the following information:
{
"cluster_name" : "A2A",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 4,
"active_primary_shards" : 612,
"active_shards" : 613,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 304,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 66.8484187568157
}
All my indices are red as well, except for the .kibana index.
A small sample:
red open logstash-2015.11.08 5 0 47256 668 50.5mb 50.5mb
red open logstash-2015.11.09 5 0 46540 1205 50.4mb 50.4mb
red open logstash-2015.11.06 5 0 65645 579 69.2mb 69.2mb
red open logstash-2015.11.07 5 0 62733 674 66.4mb 66.4mb
green open .kibana 1 1 2 0 19.7kb 9.8kb
red open logstash-2015.11.11 5 0 49254 1272 53mb 53mb
red open logstash-2015.11.12 5 0 50885 466 53.6mb 53.6mb
red open logstash-2015.11.10 5 0 49174 1288 52.6mb 52.6mb
red open logstash-2016.04.12 5 0 92508 585 104.8mb 104.8mb
red open logstash-2016.04.13 5 0 95120 279 107.2mb 107.2mb
I've tried to fix the problem with curl -XPUT 'localhost:9200/_settings' -d ' {"index.routing.allocation.disable_allocation": false}' but it doesn't work!
Does anyone have an idea how to assign my shards?
If you need any other information, please ask and I will try to provide it.
Have you seen this answer? https://stackoverflow.com/a/23816954/1834331
You could also try restarting elasticsearch first: service elasticsearch restart.
Otherwise, just try reallocating the shards manually (as your indices have 5 shards, run the command once for each shard number, 0 through 4):
curl -XPOST -d '{ "commands" : [ {
"allocate" : {
"index" : "logstash-2015.11.08",
"shard" : 0,
"node" : "SOME_NODE_HERE",
"allow_primary":true
}
} ] }' 'http://localhost:9200/_cluster/reroute?pretty'
You can list the unassigned shards using: curl -s localhost:9200/_cat/shards | grep UNASS
If shards are stuck in the unassigned state, they can be allocated manually. For example:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
"commands": [{
"allocate": {
"index": "logstash-2015.11.07",
"shard": 4,
"node": "Frederick Slade",
"allow_primary": true
}
}]
}'
See the Cluster Reroute documentation, including warnings on the use of allow_primary.
There were some disk issues on a Graylog2 server I use for debug logs. There are unassigned shards now:
curl -XGET http://host:9200/_cat/shards
graylog_292 1 p STARTED 751733 648.4mb 127.0.1.1 Doctor Leery
graylog_292 1 r UNASSIGNED
graylog_292 2 p STARTED 756663 653.2mb 127.0.1.1 Doctor Leery
graylog_292 2 r UNASSIGNED
graylog_290 0 p STARTED 299059 257.2mb 127.0.1.1 Doctor Leery
graylog_290 0 r UNASSIGNED
graylog_290 3 p STARTED 298759 257.1mb 127.0.1.1 Doctor Leery
graylog_290 3 r UNASSIGNED
graylog_290 1 p STARTED 298314 257.3mb 127.0.1.1 Doctor Leery
graylog_290 1 r UNASSIGNED
graylog_290 2 p STARTED 297722 257.1mb 127.0.1.1 Doctor Leery
graylog_290 2 r UNASSIGNED
....
It's over 400 shards. I can delete them without data loss, because it's a single node setup. In order to do this I need to loop over the index (graylog_xxx) and over the shard (1,2,...).
How do I loop over this (2 variables) with Bash? There are 2 variables for the deletion API call, which I need to replace (afaik):
curl -XPOST 'host:9200/_cluster/reroute' -d '{
"commands" : [ {
"allocate" : {
"index" : "$index",
"shard" : $shard,
"node" : "Doctor Leery",
"allow_primary" : true
}
}
]
}'
What also bothers me is that the unassigned shards have no node, yet the API call requires me to specify one.
From the _cat/shards output you shared, it looks like those are simply unassigned replicas, which you can remove by updating the index settings and setting the replica count to 0, like this:
curl -XPUT 'localhost:9200/_settings' -d '{
"index" : {
"number_of_replicas" : 0
}
}'
After running the above curl command, your cluster should be green again.
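If you still want to loop over index and shard pairs as asked, here is a minimal sketch. The function names (unassigned_pairs, reroute_body, reroute_all_unassigned) are made up for illustration, and the body uses the pre-5.0 "allocate" command from the question. Note that "$index" inside single quotes is sent literally, so the body has to be built where the shell can expand the variables. Also keep in mind that on a single node a replica cannot be placed next to its primary, so for the shards listed above these calls would most likely be rejected, which is why dropping the replicas is the simpler fix.

```shell
# Print "index shard" for every UNASSIGNED line of `_cat/shards` output on stdin.
# Assumes the default column order: index shard prirep state ...
unassigned_pairs() {
  awk '$4 == "UNASSIGNED" { print $1, $2 }'
}

# Build the reroute body for one shard (pre-5.0 "allocate" syntax).
# printf substitutes the values, so no quoting tricks are needed.
# Arguments: index shard node
reroute_body() {
  printf '{"commands":[{"allocate":{"index":"%s","shard":%s,"node":"%s"}}]}\n' \
    "$1" "$2" "$3"
}

# Loop over both variables at once and send one reroute request per shard.
# Arguments: es_host node   (both are placeholders supplied by the caller)
reroute_all_unassigned() {
  es_host=$1
  node=$2
  curl -s "$es_host/_cat/shards" | unassigned_pairs |
  while read -r index shard; do
    curl -s -XPOST "$es_host/_cluster/reroute" \
      -d "$(reroute_body "$index" "$shard" "$node")"
    echo
  done
}
```

It would be called as reroute_all_unassigned 'host:9200' 'Doctor Leery'. Reading both columns with a single while read loop avoids nesting two loops, since each output line already pairs an index with a shard number.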
I'm very new to Elasticsearch and am trying to use it to analyze data from Suricata IPS. The Head plugin shows me this: yellow (131 of 262) unassigned shards
I am also getting this:
$ curl -XGET http://127.0.0.1:9200/_cluster/health?pretty
{
"cluster_name" : "elasticsearch_brew",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 131,
"active_shards" : 131,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 131,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}
How do I get rid of those unassigned shards? Kibana also shows me this from time to time:
Error: Bad Gateway
at respond (https://www.server.kibana/index.js?_b=:85279:15)
at checkRespForFailure (https://www.server.kibana/index.js?_b=:85247:7)
at https://www.server.kibana/index.js?_b=:83885:7
at wrappedErrback (https://www.server.kibana/index.js?_b=:20902:78)
at wrappedErrback (https://www.server.kibana/index.js?_b=:20902:78)
at wrappedErrback (https://www.server.kibana/index.js?_b=:20902:78)
at https://www.server.kibana/index.js?_b=:21035:76
at Scope.$eval (https://www.server.kibana/index.js?_b=:22022:28)
at Scope.$digest (https://www.server.kibana/index.js?_b=:21834:31)
at Scope.$apply (https://www.server.kibana/index.js?_b=:22126:24)
I don't know if these problems are connected to each other. Could anyone please help me get it working? Thank you very much!
A cluster with only one node and indices that have one replica will always be yellow.
Yellow is not a bad thing; the cluster works perfectly fine. The downside is that the replica copies of the shards are not active.
You can have a green cluster if you set the number of replicas to 0 or add a second node to the cluster.
But, as I said, there is no problem if you have a yellow cluster.
Setting the number of replicas to 0, cluster-wide (all indices):
curl -XPUT "http://localhost:9200/_settings" -d'
{
"number_of_replicas" : 0
}'