UnavailableShardsException on Elasticsearch

I'm using Elasticsearch on my dedicated server (not Amazon). Recently it has been giving me errors like:
UnavailableShardsException[[tribune][4] Primary shard is not active or
isn't assigned is a known node. Timeout: [1m], request: delete
{[tribune][news][90755]}]
Whenever I query /_cat/shards?v, the result is:
index shard prirep state docs store ip node
tribune 4 p UNASSIGNED
tribune 4 r UNASSIGNED
tribune 0 p STARTED 5971 34mb ***.**.***.** Benny Beckley
tribune 0 r UNASSIGNED
tribune 3 p STARTED 5875 33.9mb ***.**.***.** Benny Beckley
tribune 3 r UNASSIGNED
tribune 1 p INITIALIZING ***.**.***.** Benny Beckley
tribune 1 r UNASSIGNED
tribune 2 p STARTED 5875 33.6mb ***.**.***.** Benny Beckley
tribune 2 r UNASSIGNED
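The exception means the primary of shard 4 is not allocated to any node, so the delete request waits and then times out after 1m. As a first diagnostic step (a minimal sketch, assuming Elasticsearch 5.0 or later, where the allocation-explain API and the retry_failed flag exist, and the default port 9200; the index and shard values come from the error above), you can ask the cluster why that shard is unassigned and retry previously failed allocations:
# ask why the primary of shard 4 of "tribune" is unassigned
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{
  "index": "tribune",
  "shard": 4,
  "primary": true
}'
# retry allocations that have failed too many times (e.g. after a full disk)
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'
The "explanation" fields in the response usually point directly at the blocker (disk watermark, allocation filtering, missing shard data on disk, and so on).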

Related

Replicas created on same node before being transferred

I have an Elasticsearch cluster made up of 3 nodes.
Every day, I have a batch that feeds a new index composed of 3 shards and then scales the number of replicas to 1. So at the end of the day I expect every node to carry 1 primary and 1 replica.
The figure below shows the disk space usage on each node during this operation.
On node 0 everything seems to be going smoothly during that operation.
However, node 2 is idle most of the time at the beginning, while node 1 seems to be taking care of its own replica plus node 2's replica before transferring it to node 2 (this is my own understanding, I might be wrong). This puts a lot of pressure on node 1's disk usage, which almost reaches 100%.
Why this behaviour? Shouldn't every node take care of its own replica here to even out the load? Can I force it to do so somehow? This is worrying because when a disk reaches 100%, the entire node goes down, as has happened in the past.
UPDATE in response to Val's answer:
You will find the outputs below:
GET _cat/shards/xxxxxxxxxxxxxxxxxxxxxx_20210617?v
index shard prirep state docs store ip node
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 p STARTED 8925915 13.4gb 172.23.13.255 es-master-0
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 r STARTED 8925915 13.4gb 172.23.10.76 es-master-2
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 r STARTED 8920172 13.4gb 172.23.24.221 es-master-1
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 p STARTED 8920172 13.4gb 172.23.10.76 es-master-2
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 p STARTED 8923889 13.4gb 172.23.24.221 es-master-1
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 r STARTED 8923889 13.5gb 172.23.13.255 es-master-0
GET _cat/recovery/xxxxxxxxxxxxxxxxxxxxxx_20210617?v
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 382ms empty_store done n/a n/a 172.23.24.221 es-master-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 21.9m peer done 172.23.24.221 es-master-1 172.23.13.255 es-master-0 n/a n/a 188 188 100.0% 188 14467579393 14467579393 100.0% 14467579393 55835 55835 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 395ms empty_store done n/a n/a 172.23.13.255 es-master-0 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 9m peer done 172.23.13.255 es-master-0 172.23.10.76 es-master-2 n/a n/a 188 188 100.0% 188 14486949488 14486949488 100.0% 14486949488 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 17.8m peer done 172.23.10.76 es-master-2 172.23.24.221 es-master-1 n/a n/a 134 134 100.0% 134 14470475298 14470475298 100.0% 14470475298 1894 1894 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 409ms empty_store done n/a n/a 172.23.10.76 es-master-2 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
First, if you have 3 nodes and your index has 3 primaries with each having 1 replica, there's absolutely no guarantee whatsoever that each node will hold one primary and one replica.
The only guarantees you have are that:
the shard count will be balanced over the nodes and
a primary and its replica will never land on the same node.
That being said, it's perfectly possible for one node to get two primaries, another to get two replicas, and the third to get one primary and one replica.
Looking at the chart, what I think happens in your case is that
node 2 gets two primaries and
node 0 gets one primary
Then, when you add the replica:
node 0 (which has only one primary) gets one replica (the curve is less steep)
node 1 (which has nothing so far) gets two replicas (the curve grows steeper)
node 2 stays flat because it already has two primaries
A little later, when node 1's disk approaches saturation, one shard is relocated away from it to node 2 (at 23:16 the curve starts to increase).
The end situation seems to be:
node 0 with one primary and one replica
node 1 with only one replica
node 2 with two primaries and one replica
I think it would be nice to confirm this with the following two commands:
# you can see where each shard is located now
GET _cat/shards/indexname*?v
# you can see which shards went from which node to which node
GET _cat/recovery/indexname*?v
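If you also want to force the even one-primary-plus-one-replica-per-node layout rather than just observe it, one option (a hedged sketch, not part of the original answer) is the per-index setting index.routing.allocation.total_shards_per_node, which caps how many shards of that index a single node may hold:
# allow at most 2 shards of the daily index per node (1 primary + 1 replica);
# the anonymized index name from the question is used as a placeholder
curl -XPUT 'localhost:9200/xxxxxxxxxxxxxxxxxxxxxx_20210617/_settings' -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.total_shards_per_node": 2
}'
Be aware that this is a hard limit: if a node is lost, the shards it held cannot be reallocated to the remaining nodes and will stay unassigned until it comes back.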

How to achieve an even distribution of load on Elasticsearch with indexes that have different numbers of shards?

We are looking into addressing some performance issues with our ES cluster, and we have been looking at shard distribution on the data nodes. I know the advice is to have shards evenly distributed between nodes - and here is my question:
For a cluster with 8 data nodes, we have some indexes that have 2 primary shards and 3 replicas (so 8 shards in total). We also have some indexes that have 1 primary shard and 3 replicas (so 4 in total).
My question is: can that setup be considered "evenly distributed"? We were thinking that it cannot, and we were considering indexes with 1 primary shard and 7 replicas (so every index would be hosted on all 8 nodes) - but we don't know whether such a setup makes any sense. If not, what would you recommend instead to distribute shards more evenly?
Here is the result of the _cat/shards query:
id1 0 p STARTED 2138 16.1mb x.x.x.x node1
id1 0 r STARTED 2138 16.1mb x.x.x.x node2
id1 0 r STARTED 2138 16.1mb x.x.x.x node3
id1 0 r STARTED 2138 16.1mb x.x.x.x node4
id2 0 r STARTED 3379 26.8mb x.x.x.x node5
id2 0 r STARTED 3379 26.8mb x.x.x.x node3
id2 0 r STARTED 3379 26.8mb x.x.x.x node4
id2 0 p STARTED 3379 26.8mb x.x.x.x node6
id3 0 r STARTED 20086 76.1mb x.x.x.x node1
id3 0 r STARTED 20086 76.1mb x.x.x.x node5
id3 0 p STARTED 20086 76.1mb x.x.x.x node6
id3 0 r STARTED 20086 76.1mb x.x.x.x node7
id4 0 r STARTED 2754 7.3mb x.x.x.x node2
id4 0 r STARTED 2754 7.3mb x.x.x.x node3
id4 0 r STARTED 2754 7.3mb x.x.x.x node8
id4 0 p STARTED 2754 7.3mb x.x.x.x node7
id5 0 r STARTED 10239 42.3mb x.x.x.x node1
id5 0 p STARTED 10239 42.3mb x.x.x.x node4
id5 0 r STARTED 10239 42.3mb x.x.x.x node6
id5 0 r STARTED 10239 42.3mb x.x.x.x node8
id6 0 r STARTED 13388 42.4mb x.x.x.x node1
id6 0 p STARTED 13388 42.4mb x.x.x.x node5
id6 0 r STARTED 13388 42.4mb x.x.x.x node3
id6 0 r STARTED 13388 42.4mb x.x.x.x node8
id7 1 r STARTED 27483 136.2mb x.x.x.x node2
id7 1 r STARTED 27483 136.2mb x.x.x.x node3
id7 1 r STARTED 27483 136.3mb x.x.x.x node8
id7 1 p STARTED 27483 136.2mb x.x.x.x node7
id7 0 r STARTED 27189 146.5mb x.x.x.x node1
id7 0 p STARTED 27189 146.6mb x.x.x.x node5
id7 0 r STARTED 27189 146.6mb x.x.x.x node4
id7 0 r STARTED 27189 146.7mb x.x.x.x node6
.kibana 0 r STARTED 13 106.8kb x.x.x.x node2
.kibana 0 p STARTED 13 106.8kb x.x.x.x node3
id8 1 r STARTED 13555 80.8mb x.x.x.x node2
id8 1 r STARTED 13555 80.8mb x.x.x.x node4
id8 1 r STARTED 13555 80.8mb x.x.x.x node8
id8 1 p STARTED 13555 80.8mb x.x.x.x node7
id8 0 r STARTED 13390 63mb x.x.x.x node1
id8 0 p STARTED 13390 62.7mb x.x.x.x node5
id8 0 r STARTED 13390 62.7mb x.x.x.x node6
id8 0 r STARTED 13390 62.8mb x.x.x.x node7
Distributing all shards on all ES data nodes for every index doesn't make sense for various reasons.
The number of primary shards should be chosen based on the index size; it is what lets you scale the index horizontally.
The number of replica shards gives you high availability and increases search performance.
It's really difficult to achieve a perfect shard balance in an ES cluster (it depends on the number of shards, their size, and the traffic). However, given your shard sizes, which are really small (less than 100 MB), you can go with 1 shard and 7 replicas for all your indices. Having said that, you need to benchmark and choose the correct number of shards and replicas based on your cluster setup and use cases.
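If you do go the 1-primary / 7-replica route, raising the replica count on the existing single-shard indices is a one-line settings update (a sketch using the anonymized index names from the question; note that the primary shard count of id7 and id8 cannot be changed in place, that requires a reindex or shrink):
# every data node in the 8-node cluster ends up with a copy of each index
curl -XPUT 'localhost:9200/id1,id2,id3,id4,id5,id6/_settings' -H 'Content-Type: application/json' -d '
{
  "index": { "number_of_replicas": 7 }
}'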

Elasticsearch - Replicas are unassigned after reopening an index (INDEX_REOPENED error)

I closed my index and reopened it, and now my replica shards can't be assigned.
curl -s -XGET localhost:9201/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
2018.03.27-team-logs 2 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 5 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 3 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 4 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 1 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 0 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 2 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 5 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 3 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 4 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 1 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 0 r UNASSIGNED INDEX_REOPENED
Could anybody explain to me what this error means and how to solve it? Before I closed the index, everything worked fine. I configured 6 shards and 1 replica. I am running Elasticsearch 6.2.
EDIT:
Output of curl -XGET "localhost:9201/_cat/shards":
2018.03.29-team-logs 1 r STARTED 1739969 206.2mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.29-team-logs 1 p STARTED 1739969 173mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.29-team-logs 2 p STARTED 1739414 169.9mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.29-team-logs 2 r STARTED 1739414 176.3mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.29-team-logs 4 p STARTED 1740185 186mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.29-team-logs 4 r STARTED 1740185 169.4mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.29-team-logs 5 r STARTED 1739660 164.3mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.29-team-logs 5 p STARTED 1739660 180.1mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.29-team-logs 3 p STARTED 1740606 171.2mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.29-team-logs 3 r STARTED 1740606 173.4mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.29-team-logs 0 r STARTED 1740166 169.7mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.29-team-logs 0 p STARTED 1740166 187mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.28-team-logs 1 p STARTED 2075020 194.2mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.28-team-logs 1 r UNASSIGNED
2018.03.28-team-logs 2 p STARTED 2076268 194.9mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.28-team-logs 2 r UNASSIGNED
2018.03.28-team-logs 4 p STARTED 2073906 194.9mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.28-team-logs 4 r UNASSIGNED
2018.03.28-team-logs 5 p STARTED 2072921 195mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.28-team-logs 5 r UNASSIGNED
2018.03.28-team-logs 3 p STARTED 2074579 194.1mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.28-team-logs 3 r UNASSIGNED
2018.03.28-team-logs 0 p STARTED 2073349 193.9mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.28-team-logs 0 r UNASSIGNED
2018.03.27-team-logs 1 p STARTED 356769 33.5mb 10.207.46.246 elk-es-data-warm-1.platform.osdc2.mall.local
2018.03.27-team-logs 1 r UNASSIGNED
2018.03.27-team-logs 2 p STARTED 356798 33.6mb 10.206.46.244 elk-es-data-warm-2.platform.osdc1.mall.local
2018.03.27-team-logs 2 r UNASSIGNED
2018.03.27-team-logs 4 p STARTED 356747 33.7mb 10.207.46.246 elk-es-data-warm-1.platform.osdc2.mall.local
2018.03.27-team-logs 4 r UNASSIGNED
2018.03.27-team-logs 5 p STARTED 357399 33.8mb 10.207.46.245 elk-es-data-warm-2.platform.osdc2.mall.local
2018.03.27-team-logs 5 r UNASSIGNED
2018.03.27-team-logs 3 p STARTED 357957 33.7mb 10.206.46.245 elk-es-data-warm-1.platform.osdc1.mall.local
2018.03.27-team-logs 3 r UNASSIGNED
2018.03.27-team-logs 0 p STARTED 356357 33.4mb 10.207.46.245 elk-es-data-warm-2.platform.osdc2.mall.local
2018.03.27-team-logs 0 r UNASSIGNED
.kibana 0 p STARTED 2 12.3kb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
.kibana 0 r UNASSIGNED
Output of curl -XGET "localhost:9201/_cat/nodes":
10.207.46.248 8 82 0 0.07 0.08 0.11 d - elk-es-data-hot-2
10.206.46.245 9 64 0 0.04 0.11 0.08 d - elk-es-data-warm-1
10.207.46.249 11 90 0 0.00 0.01 0.05 m * elk-es-master-2
10.207.46.245 9 64 0 0.00 0.01 0.05 d - elk-es-data-warm-2
10.206.46.247 12 82 0 0.00 0.06 0.08 d - elk-es-data-hot-1
10.206.46.244 10 64 0 0.08 0.04 0.05 d - elk-es-data-warm-2
10.207.46.243 5 86 0 0.00 0.01 0.05 d - elk-kibana
10.206.46.248 10 92 1 0.04 0.18 0.24 m - elk-es-master-1
10.206.46.246 6 82 0 0.02 0.07 0.09 d - elk-es-data-hot-2
10.207.46.247 9 82 0 0.06 0.06 0.05 d - elk-es-data-hot-1
10.206.46.241 6 91 0 0.00 0.02 0.05 m - master-test
10.206.46.242 8 89 0 0.00 0.02 0.05 d - es-kibana
10.207.46.246 8 64 0 0.00 0.02 0.05 d - elk-es-data-warm-1
It is expected behaviour.
Elasticsearch will not put a primary shard and its replica on the same
node. You need at least 2 nodes to have 1 replica.
You can simply set the replica count to 0:
PUT */_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
UPDATE:
After running the following request
GET /_cluster/allocation/explain?pretty
we can see the response here:
https://pastebin.com/1ag1Z7jL
"explanation" : "there are too many copies of the shard allocated to
nodes with attribute [datacenter], there are [2] total configured
shard copies for this shard id and [3] total attribute values,
expected the allocated shard count per attribute [2] to be less than
or equal to the upper bound of the required number of shards per
attribute [1]"
You probably have a zone awareness setting in use. Elasticsearch will avoid putting a primary and a replica shard in the same zone:
https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html
With ordinary awareness, if one zone lost contact with the other zone,
Elasticsearch would assign all of the missing replica shards to a
single zone. But in this example, this sudden extra load would cause
the hardware in the remaining zone to be overloaded.
Forced awareness solves this problem by NEVER allowing copies of the
same shard to be allocated to the same zone.
For example, let's say we have an awareness attribute called zone, and
we know we are going to have two zones, zone1 and zone2. Here is how
we can force awareness on a node:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
cluster.routing.allocation.awareness.attributes: zone
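To check whether forced awareness is what blocks your replicas, you can list the custom attributes on each node and, if the forced awareness is not actually wanted, clear it (a sketch, assuming the attribute is called datacenter as in the allocation explanation above, and that it was set through the cluster settings API; if it lives in elasticsearch.yml, remove it there and restart the nodes instead):
# show the custom attributes (e.g. datacenter) attached to each node
curl -XGET 'localhost:9201/_cat/nodeattrs?v'
# reset the forced-awareness values to their default (i.e. remove them)
curl -XPUT 'localhost:9201/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.awareness.force.datacenter.values": null
  }
}'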

How does Elasticsearch manage its shards?

I have three nodes working together. On those nodes I have five indexes, each having 5 primary shards, and each primary shard has 2 replicas. It looks like this (I cropped the picture to show only two of the 5 indices):
[Image: shard layout per node - http://i59.tinypic.com/2ez1wjt.png]
As you can see on the picture:
- node 1 has primary shards 0 and 3 (and replicas 1, 2 and 4)
- node 2 has primary shard 2 (and replicas 0, 1, 3 and 4)
- node 3 has primary shards 1 and 4 (and replicas 0, 2 and 3)
and this is the case for each index (all 5 of them).
I understand that if I restart my nodes this "organisation" will change, but the "look" would still be the same across indices 1, 2, 3, 4 and 5. For example, after restarting, I would have:
- node 1 has primary shards 1 and 2 (and replicas 0, 3 and 4)
- node 2 has primary shard 3 (and replicas 0, 1, 2 and 4)
- node 3 has primary shards 0 and 4 (and replicas 1, 2 and 3)
and this would be the case for each index (all 5 of them).
Is there a reason why I always find the same pattern for each of my indices?
Thanks

Elasticsearch: how to change cluster health from yellow to green

I have a cluster with one node (running locally). The cluster health is yellow. Now I have added one more node, but shards cannot be allocated on the second node, so the health of my cluster is still yellow. I cannot change this state to green, unlike in this guide: health cluster example.
So how do I change the health state to green?
My cluster:
Cluster health:
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "astrung",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 22,
  "active_shards" : 22,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 20
}
Shard status:
curl -XGET 'http://localhost:9200/_cat/shards?v'
index shard prirep state docs store ip node
_river 0 p STARTED 2 8.1kb 192.168.1.3 One
_river 0 r UNASSIGNED
megacorp 4 p STARTED 1 3.4kb 192.168.1.3 One
megacorp 4 r UNASSIGNED
megacorp 0 p STARTED 2 6.1kb 192.168.1.3 One
megacorp 0 r UNASSIGNED
megacorp 3 p STARTED 1 2.2kb 192.168.1.3 One
megacorp 3 r UNASSIGNED
megacorp 1 p STARTED 0 115b 192.168.1.3 One
megacorp 1 r UNASSIGNED
megacorp 2 p STARTED 1 2.2kb 192.168.1.3 One
megacorp 2 r UNASSIGNED
mybucket 2 p STARTED 1 2.1kb 192.168.1.3 One
mybucket 2 r UNASSIGNED
mybucket 0 p STARTED 0 115b 192.168.1.3 One
mybucket 0 r UNASSIGNED
mybucket 3 p STARTED 2 5.4kb 192.168.1.3 One
mybucket 3 r UNASSIGNED
mybucket 1 p STARTED 1 2.2kb 192.168.1.3 One
mybucket 1 r UNASSIGNED
mybucket 4 p STARTED 1 2.5kb 192.168.1.3 One
mybucket 4 r UNASSIGNED
.kibana 0 r INITIALIZING 192.168.1.3 Two
.kibana 0 p STARTED 2 8.9kb 192.168.1.3 One
.marvel-kibana 2 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 2 r UNASSIGNED
.marvel-kibana 0 r INITIALIZING 192.168.1.3 Two
.marvel-kibana 0 p STARTED 1 2.9kb 192.168.1.3 One
.marvel-kibana 3 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 3 r UNASSIGNED
.marvel-kibana 1 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 1 r UNASSIGNED
.marvel-kibana 4 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 4 r UNASSIGNED
user_ids 4 p STARTED 11 5kb 192.168.1.3 One
user_ids 4 r UNASSIGNED
user_ids 0 p STARTED 7 25.1kb 192.168.1.3 One
user_ids 0 r UNASSIGNED
user_ids 3 p STARTED 11 4.9kb 192.168.1.3 One
user_ids 3 r UNASSIGNED
user_ids 1 p STARTED 8 28.7kb 192.168.1.3 One
user_ids 1 r UNASSIGNED
user_ids 2 p STARTED 11 8.5kb 192.168.1.3 One
user_ids 2 r UNASSIGNED
I suggest updating the replication factor of all the indices to 0 and then updating it back to 1.
Here's a curl command to get you started:
curl -XPUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : {
    "number_of_replicas" : 0
  }
}'
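Once the cluster goes green, the same call with the value set back to 1 restores the replicas, as the answer above suggests:
curl -XPUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : {
    "number_of_replicas" : 1
  }
}'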
Like #mohitt said above, update number_of_replicas to zero (for local dev only; be careful about using this in production).
You can run the following in the Kibana DevTools Console:
PUT _settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
Though recovery normally takes a long time, looking at the number and size of your documents, it should take a very short time to recover.
It looks like you have issues with the nodes contacting each other; check firewall rules and ensure ports 9200 and 9300 are reachable from each node.
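A quick way to verify that (a sketch; 192.168.1.3 is the address shown in the shard listing above, substitute the other node's real address as needed):
# both nodes should show up here once they have formed one cluster
curl -XGET 'http://localhost:9200/_cat/nodes?v'
# from each host, check that the other node's HTTP (9200) and
# transport (9300) ports are reachable
nc -zv 192.168.1.3 9200
nc -zv 192.168.1.3 9300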
