Elasticsearch cannot assign shard 0 - elasticsearch

I'm new to Elasticsearch and I have an index in a red state due to a "cannot assign shard 0" error.
I found a way to get an explanation, but I'm still lost on understanding and fixing it. The server's version is 7.5.2.
curl -XGET 'http://localhost:9200/_cluster/allocation/explain' returns
{
"index":"event_tracking",
"shard":0,
"primary":false,
"current_state":"unassigned",
"unassigned_info":{
"reason":"CLUSTER_RECOVERED",
"at":"2020-12-22T14:51:08.943Z",
"last_allocation_status":"no_attempt"
},
"can_allocate":"no",
"allocate_explanation":"cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions":[
{
"node_id":"cfsLU-nnRTGQG1loc4hdVA",
"node_name":"xxx-clustername",
"transport_address":"127.0.0.1:9300",
"node_attributes":{
"ml.machine_memory":"7992242176",
"xpack.installed":"true",
"ml.max_open_jobs":"20"
},
"node_decision":"no",
"deciders":[
{
"decider":"replica_after_primary_active",
"decision":"NO",
"explanation":"primary shard for this replica is not yet active"
},
{
"decider":"same_shard",
"decision":"NO",
"explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[event_tracking][0], node[cfsLU-nnRTGQG1loc4hdVA], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=TObxz0EFQbylZsyTiIH7SA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-12-22T14:51:08.943Z], delayed=false, allocation_status[fetching_shard_data]]]"
},
{
"decider":"throttling",
"decision":"NO",
"explanation":"primary shard for this replica is not yet active"
}
]
}
]
}
I more or less understand the error message, but I can't find the proper way to fix it. This server is not running in Docker; Elasticsearch is installed directly on the Linux machine.
curl -XGET 'http://localhost:9200/_cat/recovery/event_tracking?v' result
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
event_tracking 0 54.5m existing_store translog n/a n/a 127.0.0.1 xxx-cluster n/a n/a 0 0 100.0% 106 0 0 100.0% 2857898852 7061000 6489585 91.9%
What can I try to resolve this?
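One thing worth trying, assuming the stuck primary is the root cause: your explain output shows the primary for [event_tracking][0] is still INITIALIZING (recovering from the existing store), and the replica cannot be assigned until it becomes active. If earlier allocation attempts failed and the cluster gave up, you can ask it to retry:

```shell
# Retry allocations that previously failed and hit the max retry limit
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'

# Then watch the recovery progress of the initializing primary
curl -XGET 'http://localhost:9200/_cat/recovery/event_tracking?v&active_only=true'
```

Note that your recovery output already shows translog replay at 91.9%, so the primary may simply need more time to finish; retry_failed only helps if allocation attempts were actually exhausted.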

Related

ElasticSearch BulkShardRequest failed due to org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor

I am storing logs into elastic search from my reactive spring application. I am getting the following error in elastic search:
Elasticsearch exception [type=es_rejected_execution_exception, reason=rejected execution of processing of [129010665][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[logs-dev-2020.11.05][1]] containing [index {[logs-dev-2020.11.05][_doc][0d1478f0-6367-4228-9553-7d16d2993bc2], source[n/a, actual length: [4.1kb], max length: 2kb]}] and a refresh, target allocation id: WwkZtUbPSAapC3C-Jg2z2g, primary term: 1 on EsThreadPoolExecutor[name = 10-110-23-125-common-elasticsearch-apps-dev-v1/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#6599247a[Running, pool size = 2, active threads = 2, queued tasks = 221, completed tasks = 689547]]]
My index settings:
{
"logs-dev-2020.11.05": {
"settings": {
"index": {
"highlight": {
"max_analyzed_offset": "5000000"
},
"number_of_shards": "3",
"provided_name": "logs-dev-2020.11.05",
"creation_date": "1604558592095",
"number_of_replicas": "2",
"uuid": "wjIOSfZOSLyBFTt1cT-whQ",
"version": {
"created": "7020199"
}
}
}
}
}
I have gone through this site:
https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster
I thought adjusting the "write" queue size in the thread pool would resolve this, but the site mentions that this is not recommended:
Adjusting the queue sizes is therefore strongly discouraged, as it is like putting a temporary band-aid on the problem rather than actually fixing the underlying issue.
So what else can we do to improve the situation?
Other info:
Elasticsearch version 7.2.1
Cluster health is good and there are 3 nodes in the cluster
An index is created daily, with 3 shards per index
While you are right that increasing the thread pool size is not a permanent solution, you will be glad to know that Elasticsearch itself increased the queue size of the write thread pool (used by your bulk requests) from 200 to 10k in just a minor version upgrade. See the size of 200 in ES 7.8 versus 10k in ES 7.9.
If you are on an ES 7.x version, you can likewise increase the size to at least 1k, if not 10k, to avoid rejected requests.
If you want a proper fix, you need to do the following:
Find out whether the rejections are consistent or just a short-duration burst of write requests that clears up after some time.
If they are consistent, figure out whether all the write optimizations are in place; please refer to my short tips to improve indexing speed.
Check whether you have reached the full capacity of your data nodes, and if so, scale your cluster to handle the increased/legitimate load.
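To tell a short burst from sustained pressure (the first step above), you can poll the write thread pool stats and watch whether the rejected counter keeps climbing between polls. A minimal sketch, assuming a node reachable at localhost:9200:

```shell
# Show active threads, queued tasks, and cumulative rejections per node
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
```

If queue regularly sits at its limit and rejected grows steadily, the load is sustained and the indexing-optimization and scaling steps apply; if rejections only appear in spikes, client-side retries with backoff are usually enough.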

Elasticsearch restore to a new cluster with a different number of nodes

I have an ops cluster with 5 nodes (1 master, 1 client, and 3 data nodes). I want to restore a backup of it onto a new test cluster with only 3 nodes (1 master, 1 client, 1 data). I only have 1 data node in my test cluster at the moment and wasn't planning to add any more.
The issue is that when I try to restore to my test cluster, only some of the shards get assigned; most of them stay in the UNASSIGNED state. I've tried to use the reroute API but it fails. See below.
Does my test cluster have to have the same number of nodes as the ops cluster I'm restoring from? If so, is there any workaround?
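No, the node counts do not have to match; what usually blocks allocation is the replica count carried over from the larger cluster, since replicas can never be placed on the same node as their primary. One common workaround is to override the replica count to 0 at restore time so every shard needs only the single data node. A sketch, assuming a registered repository named my_repo and a snapshot named snapshot_1 (both hypothetical names):

```shell
curl -XPOST 'localhost:9200/_snapshot/my_repo/snapshot_1/_restore' \
  -H 'Content-Type: application/json' -d '{
  "indices": "*",
  "index_settings": { "index.number_of_replicas": 0 }
}'
```

Separately, the reroute error you quoted says the host you passed resolved to 3 nodes, so that particular call would also need a node name or ID that matches exactly one node.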
{
"error": {
"root_cause": [
{
"type": "reroute_transport_exception",
"reason": "[myhost_master][myhostip:9200][cluster:admin/reroute]"
}
],
"type": "illegal_argument_exception",
"reason": "resolved [myhostip] into [3] nodes, where expected to be resolved to a single node"
},
"status": 400
}

How to delete data from a particular shard

I have an index with 5 primary shards and no replicas.
One of my shards (shard 1) is in the unassigned state. When I checked the log file, I found the following error:
2obv65.nvd, _2vfjgt.fdx, _3e3109.si, _3dwgm5_Lucene45_0.dvm, _3aks2g_Lucene45_0.dvd, _3d9u9f_76.del, _3e30gm.cfs, _3cvkyl_es090_0.tim, _3e309p.nvd, _3cvkyl_es090_0.blm]]; nested: FileNotFoundException[_101a65.si]; ]]
When I checked the index, I could not find the _101a65.si file for shard 1.
I am unable to locate the missing .si file. I tried a lot but could not get shard 1 assigned again.
Is there any other way to make shard 1 assigned again, or do I need to delete the entire shard 1 data?
Please suggest.
Normally in the stack trace you should see the path to the corrupted shard, something like MMapIndexInput(path="path/to/es/db/nodes/node_number/indices/name_of_index/1/index/some_file") (here the 1 is the shard number).
Deleting path/to/es/db/nodes/node_number/indices/name_of_index/1 should normally help the shard recover. If you still see it unassigned, try sending this command to your cluster (per the documentation it should work, though I'm not sure about the ES 1.x syntax and commands):
POST _cluster/reroute
{
"commands" : [
{
"allocate" : {
"index" : "myIndexName",
"shard" : 1,
"node" : "myNodeName",
"allow_primary": true
}
}
]
}
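Note that on Elasticsearch 5.0 and later the bare allocate command shown above was removed and split into explicit variants. For a primary whose copy still exists on disk but is considered stale, the equivalent is allocate_stale_primary; if no copy survives at all, allocate_empty_primary creates an empty shard. A sketch for newer clusters, reusing the placeholder index and node names from above:

```shell
curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "myIndexName",
        "shard": 1,
        "node": "myNodeName",
        "accept_data_loss": true
      }
    }
  ]
}'
```

Both variants require accept_data_loss: true, which is an honest signal that documents written after the surviving copy may be gone.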

UnavailableShardsException when running tests with 1 shard and 1 node

We are running our tests (PHP application) in Docker. Some tests use Elasticsearch.
We have configured Elasticsearch to have only 1 node and 1 shard (for simplicity). Here is the config we added to the default:
index.number_of_shards: 1
index.number_of_replicas: 0
Sometimes when the tests run, they fail because of the following Elasticsearch response:
{
"_indices":{
"acme":{
"_shards":{
"total":1,
"successful":0,
"failed":1,
"failures":[
{
"index":"acme",
"shard":0,
"reason":"UnavailableShardsException[[acme][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: delete_by_query {[acme][product], query [{\"query\":{\"term\":{\"product_id\":\"3\"}}}]}]"
}
]
}
}
}
}
The error message extracted from the response:
UnavailableShardsException[[acme][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: delete_by_query {[acme][product], query [{\"query\":{\"term\":{\"product_id\":\"3\"}}}]}]
Why would our client randomly fail to reach Elasticsearch's node or shard? Does this have something to do with the fact that we have only 1 shard? Is that a bad thing?
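If the failures only happen near the start of a run, the likely race is that the test suite issues requests before the single primary has been allocated (for example, right after the index is created or the container starts). One guard, assuming the index is named acme as in the response above, is to block until the index is at least yellow before running the tests:

```shell
# Wait up to 60s for the acme index's primary shard to become active
curl -s 'http://localhost:9200/_cluster/health/acme?wait_for_status=yellow&timeout=60s'
```

With number_of_replicas: 0, an index with all primaries active reports green, so waiting for yellow (yellow or better) is sufficient here.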

ElasticSearch UNASSIGNED indices fix without data loss

For whatever reason, a bunch of indices became UNASSIGNED. I'm looking for a way of assigning them to a cluster node without losing any data.
I tried the following API call, but unfortunately it results in data loss (due to allow_primary):
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
"commands" : [ {
"allocate" : {
"index" : "index-name",
"shard" : "0",
"allow_primary" : true,
"node" : "node-name"
}
}
]
}'
I also keep getting the following entries in elasticsearch.log:
[2015-03-16 11:51:12,181][DEBUG][action.search.type ] [cluster node] All shards failed for phase: [query_fetch]
[2015-03-16 11:51:12,450][DEBUG][action.search.type ] [cluster node] All shards failed for phase: [query_fetch]
[2015-03-16 11:51:19,349][DEBUG][action.bulk ] [cluster node] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-03-16 11:51:20,057][DEBUG][action.bulk ] [cluster node] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
Any help would be appreciated.
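Before rerouting, it can help to list exactly which shards are unassigned and whether they are primaries or replicas, because an unassigned replica can be allocated without allow_primary and therefore without data loss. A sketch using the 1.x-era syntax matching the logs above, with the same placeholder index and node names as the question:

```shell
# List shards that are still unassigned (p = primary, r = replica)
curl -s 'localhost:9200/_cat/shards?v' | grep UNASSIGNED

# For an unassigned replica, allocate without allow_primary (no data loss)
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [ {
    "allocate": { "index": "index-name", "shard": 0, "node": "node-name" }
  } ]
}'
```

allow_primary is only needed (and only destructive) when you force a new empty primary; for replicas the cluster rebuilds the copy from the active primary.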
