I had to restart the Elasticsearch master node. The cluster status was red, and after some time it went yellow (the primary shards got assigned).
Now when I run the query curl http://x.x.x.x/_cluster/health?pretty I can see that "number_of_pending_tasks" keeps increasing (it is currently at 200k).
I had a look at the pending tasks, and it is mainly tasks like this one that get buffered:
, {
"insert_order" : 58176,
"priority" : "NORMAL",
"source" : "indices_store",
"executing" : false,
"time_in_queue_millis" : 619596,
"time_in_queue" : "10.3m"
},
In the meantime I am getting errors about executions being rejected because the queue is at capacity:
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 200) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler#34c87ed9
How can I solve this?
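For reference, the backlog and the rejections can be inspected with two standard APIs (a sketch; the exact columns vary a little per version):

# Per-task detail behind the number_of_pending_tasks counter.
curl -s 'http://x.x.x.x/_cluster/pending_tasks?pretty'

# Which thread pools are rejecting work; the "queue capacity 200" in the
# exception corresponds to one of these queues being full.
curl -s 'http://x.x.x.x/_cat/thread_pool?v'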
I am facing a Redis issue. I have implemented a centralized Redis cluster approach, but whenever a large amount of service data gets cached in Redis, a "Read time out" error occurs.
Can anyone give me suggestions on what I can do?
Please find the properties below:
spring:
  redis:
    database: ${redis.database}
    #host: ${redis.host}
    #port: ${redis.port}
    password: ${redis.password}
    pool:
      max-active: 8
      max-wait: 10000
      max-idle: 8
      min-idle: 1
    timeout: 10000

redis:
  cluster:
    nodes: 10.xxx.x.xx:1234,10.xxx.x.xx:12365
  database: 0
  password: 'xxxxxx'
Error Logs:
2022-02-11 13:59:59,996 ERROR [nio-9120-exec-1008] c.h.l.framework.aop.ControllerAspect - [AGW2202111359590622605456]:
redis.clients.jedis.exceptions.JedisDataException: ERR Error running script (call to f_e78e961698f118a64316964cac1ca9e6008c29da): #user_script:34: #user_script: 34: -OOM command not allowed when used memory > 'maxmemory'.
at redis.clients.jedis.Protocol.processError(Protocol.java:127)
at redis.clients.jedis.Protocol.process(Protocol.java:161)
at redis.clients.jedis.Protocol.read(Protocol.java:215)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:340)
at redis.clients.jedis.Connection.getOne(Connection.java:322)
at redis.clients.jedis.BinaryJedis.evalsha(BinaryJedis.java:3142)
at redis.clients.jedis.BinaryJedis.evalsha(BinaryJedis.java:3135)
at redis.clients.jedis.BinaryJedisCluster$119.execute(BinaryJedisCluster.java:1270)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:120)
at redis.clients.jedis.JedisClusterCommand.runBinary(JedisClusterCommand.java:81)
at redis.clients.jedis.BinaryJedisCluster.evalsha(BinaryJedisCluster.java:1272)
at com.hisun.lemon.framework.jedis.JedisClusterConnection.evalSha(JedisClusterConnection.java:67)
at sun.reflect.GeneratedMethodAccessor1795.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:57)
at com.sun.proxy.$Proxy537.evalSha(Unknown Source)
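The -OOM error means Redis has hit its maxmemory limit, so the Lua script's write was refused. A first check could look like this (a sketch with standard redis-cli commands; host, port and password are taken from the config above and may need adjusting):

# Show current usage, the configured limit, and the eviction policy.
redis-cli -h 10.xxx.x.xx -p 1234 -a 'xxxxxx' INFO memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# With the default noeviction policy, writes fail once the limit is reached.
# Raising maxmemory or switching to an eviction policy such as allkeys-lru
# (applied on every node, and only if losing cached entries is acceptable)
# avoids the error.
redis-cli -h 10.xxx.x.xx -p 1234 -a 'xxxxxx' CONFIG SET maxmemory-policy allkeys-lru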
After solving "Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts", there is still something off in my cluster. Whenever I run the snapshot command it gives me a 503; when I run it once or twice more, it suddenly starts and creates a snapshot just fine. The opster.com online tool suggests something about snapshots not being configured, however when I run the verify command it suggests, everything looks fine.
$ curl -s -X POST 'http://127.0.0.1:9201/_snapshot/elastic_backup/_verify?pretty'
{
"nodes" : {
"JZHgYyCKRyiMESiaGlkITA" : {
"name" : "elastic7-1"
},
"jllZ8mmTRQmsh8Sxm8eDYg" : {
"name" : "elastic7-4"
},
"TJJ_eHLIRk6qKq_qRWmd3w" : {
"name" : "elastic7-3"
},
"cI-cn4V3RP65qvE3ZR8MXQ" : {
"name" : "elastic7-2"
}
}
}
But then:
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
"error" : {
"root_cause" : [
{
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
}
],
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
},
"status" : 503
}
Could it be that one of the 4 nodes is under the impression that a snapshot is already running, and that this task randomly gets assigned to one of the nodes, so that when I run it a few times it eventually makes a snapshot? If so, how could I figure out which of the nodes is claiming the snapshot is already running?
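(For reference, in-flight snapshots are tracked in the cluster state rather than per node, so the following cluster-wide checks can confirm whether anything is actually running; a sketch using the same port as above.)

# Snapshots currently running in this repository.
curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty'

# Shard-level status of all in-progress snapshots, across all repositories.
curl -s 'http://127.0.0.1:9201/_snapshot/_status?pretty'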
Furthermore, I noticed heap usage is much higher on one of the nodes. What is a normal heap usage?
$ curl -s http://127.0.0.1:9201/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.1.215 59 99 7 0.38 0.38 0.36 dilm - elastic7-1
10.0.1.218 32 99 1 0.02 0.17 0.22 dilm * elastic7-4
10.0.1.212 11 99 1 0.04 0.17 0.21 dilm - elastic7-3
10.0.1.209 36 99 3 0.42 0.40 0.36 dilm - elastic7-2
Last night it happened again, while I'm sure nothing was already snapshotting, so I ran the following commands to confirm the odd response; I would not expect to get this error at this point.
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty
{
"snapshots" : [ ]
}
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
"error" : {
"root_cause" : [
{
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
}
],
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
},
"status" : 503
}
When I run it a second (or sometimes third) time, it suddenly starts creating a snapshot.
Note that when I don't run it that second or third time, the snapshot never appears, so I'm 100% sure no snapshot is running at the moment of this error.
There is no SLM configured as far as I know:
{ }
The repo is configured properly AFAICT:
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup?pretty
{
"elastic_backup" : {
"type" : "fs",
"settings" : {
"compress" : "true",
"location" : "elastic_backup"
}
}
}
Also, in the config it is mapped to a folder that is an NFS mount of an Amazon EFS volume. It is available and accessible, and after successful snapshots it shows new data.
As part of the cronjob I have added a query of _cat/tasks?v, so hopefully tonight we will see more, because just now when I ran the command manually it worked without problems:
$ curl localhost:9201/_cat/tasks?v ; curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty' ; curl localhost:9201/_cat/tasks?v
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15885091 - transport 1607068277045 07:51:17 209.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24278976 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277044 07:51:17 62.7micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15885092 JZHgYyCKRyiMESiaGlkITA:15885091 direct 1607068277045 07:51:17 57.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23773565 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277045 07:51:17 84.7micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3418325 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277046 07:51:17 56.9micros 10.0.1.209 elastic7-2
{
"snapshot" : {
"snapshot" : "snapshot-2020.12.04",
"uuid" : "u2yQB40sTCa8t9BqXfj_Hg",
"version_id" : 7040099,
"version" : "7.4.0",
"indices" : [
"log-db-1-2020.06.18-000003",
"log-db-2-2020.02.19-000002",
"log-db-1-2019.10.25-000001",
"log-db-3-2020.11.23-000002",
"log-db-3-2019.10.25-000001",
"log-db-2-2019.10.25-000001",
"log-db-1-2019.10.27-000002"
],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2020-12-04T07:51:17.085Z",
"start_time_in_millis" : 1607068277085,
"end_time" : "2020-12-04T07:51:48.537Z",
"end_time_in_millis" : 1607068308537,
"duration_in_millis" : 31452,
"failures" : [ ],
"shards" : {
"total" : 28,
"failed" : 0,
"successful" : 28
}
}
}
action task_id parent_task_id type start_time timestamp running_time ip node
indices:data/read/search JZHgYyCKRyiMESiaGlkITA:15888939 - transport 1607068308987 07:51:48 2.7ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15888942 - transport 1607068308990 07:51:48 223.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24282763 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308989 07:51:48 61.5micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15888944 JZHgYyCKRyiMESiaGlkITA:15888942 direct 1607068308990 07:51:48 78.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23777841 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308990 07:51:48 63.3micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3422139 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308991 07:51:48 60micros 10.0.1.209 elastic7-2
Last night (2020-12-12) the cronjob ran the following commands:
curl localhost:9201/_cat/tasks?v
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
curl localhost:9201/_cat/tasks?v
sleep 1
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
And the output is as follows:
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016838 - transport 1607736001255 01:20:01 314.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228580 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001254 01:20:01 66micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806094 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 74micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016839 JZHgYyCKRyiMESiaGlkITA:78016838 direct 1607736001255 01:20:01 94.3micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582174 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 73.6micros 10.0.1.209 elastic7-2
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
"error" : {
"root_cause" : [
{
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
}
],
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
},
"status" : 503
}
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/nodes/stats JZHgYyCKRyiMESiaGlkITA:78016874 - transport 1607736001632 01:20:01 39.6ms 10.0.1.215 elastic7-1
cluster:monitor/nodes/stats[n] TJJ_eHLIRk6qKq_qRWmd3w:82228603 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001631 01:20:01 39.2ms 10.0.1.212 elastic7-3
cluster:monitor/nodes/stats[n] jllZ8mmTRQmsh8Sxm8eDYg:55806114 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.5ms 10.0.1.218 elastic7-4
cluster:monitor/nodes/stats[n] cI-cn4V3RP65qvE3ZR8MXQ:63582204 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.4ms 10.0.1.209 elastic7-2
cluster:monitor/nodes/stats[n] JZHgYyCKRyiMESiaGlkITA:78016875 JZHgYyCKRyiMESiaGlkITA:78016874 direct 1607736001632 01:20:01 39.5ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016880 - transport 1607736001671 01:20:01 348.9micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016881 JZHgYyCKRyiMESiaGlkITA:78016880 direct 1607736001671 01:20:01 188.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228608 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001671 01:20:01 106.2micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582209 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 96.3micros 10.0.1.209 elastic7-2
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806120 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 97.8micros 10.0.1.218 elastic7-4
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
"snapshot" : {
"snapshot" : "snapshot-2020.12.12",
"uuid" : "DgwuBxC7SWirjyVlFxBnng",
"version_id" : 7040099,
"version" : "7.4.0",
"indices" : [
"log-db-sbr-2020.06.18-000003",
"log-db-other-2020.02.19-000002",
"log-db-sbr-2019.10.25-000001",
"log-db-trace-2020.11.23-000002",
"log-db-trace-2019.10.25-000001",
"log-db-sbr-2019.10.27-000002",
"log-db-other-2019.10.25-000001"
],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2020-12-12T01:20:02.544Z",
"start_time_in_millis" : 1607736002544,
"end_time" : "2020-12-12T01:20:27.776Z",
"end_time_in_millis" : 1607736027776,
"duration_in_millis" : 25232,
"failures" : [ ],
"shards" : {
"total" : 28,
"failed" : 0,
"successful" : 28
}
}
}
{
"error" : {
"root_cause" : [
{
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
}
],
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
},
"status" : 400
}
{
"error" : {
"root_cause" : [
{
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
}
],
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
},
"status" : 400
}
Also, the cluster is green at the moment, the management queues are not full, and everything seems good.
There is also only one repository:
curl http://127.0.0.1:9201/_cat/repositories?v
id type
elastic_backup fs
So it turned out that the trouble started with a recent upgrade to Docker 19.03.6 and a move from 1x Docker Swarm manager + 4x Docker Swarm worker to 5x Docker Swarm manager + 4x Docker Swarm worker. In both setups Elastic ran on the workers. Because of this upgrade/change we were presented with a change in the number of network interfaces inside the containers, and because of that we had to set 'publish_host' in Elastic to make things work again.
To fix the problem we had to stop publishing the Elastic ports over the ingress network, so that the additional network interfaces went away. Next we could drop the 'publish_host' setting. This made things work a bit better. But to really solve our issues we had to change the Docker Swarm deploy endpoint_mode to dnsrr so that traffic would not go through the Docker Swarm routing mesh.
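For reference, the change roughly amounts to the following (a sketch; the service name and port are placeholders, and the ingress-published port has to be removed before dnsrr can be enabled):

# Drop the port published over the ingress network, then switch the service
# from the routing mesh (vip) to DNS round-robin.
docker service update --publish-rm 9200 elasticsearch
docker service update --endpoint-mode dnsrr elasticsearch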
We had always had occasional 'Connection reset by peer' issues, but since the change they became worse and made Elasticsearch present strange issues. I guess running Elasticsearch inside Docker Swarm (or Kubernetes, or similar) can be a tricky thing to debug.
Using tcpdump in the containers and conntrack -S on the hosts, we were able to see perfectly fine connections being reset for no reason. Another solution was to have the kernel drop mismatching packets (instead of sending resets), but preventing the use of DNAT/SNAT as much as possible in this setup seemed to solve things too.
Elasticsearch 7.4 only supports one snapshot operation at a time.
From the error it seems a previously triggered snapshot was still running when you triggered a new one, so Elasticsearch threw concurrent_snapshot_execution_exception.
You can check the list of currently running snapshots with
GET /_snapshot/elastic_backup/_current.
I suggest you first check whether any snapshot operation is running for your Elasticsearch cluster using the above API, and only trigger a new snapshot if none is currently running.
P.S.: From Elasticsearch 7.7 onwards, concurrent snapshots are supported as well. So if you plan to perform concurrent snapshot operations in your cluster, you should upgrade to ES 7.7 or above.
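A minimal sketch of that check-then-create flow, using the repository and port from the question (the grep is a crude stand-in for real JSON parsing):

# Only create a new snapshot when nothing is currently IN_PROGRESS.
RUNNING=$(curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_current' | grep -c 'IN_PROGRESS')
if [ "$RUNNING" -eq 0 ]; then
  curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
else
  echo "A snapshot is already running; skipping."
fi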
I am trying to index into Solr, but I keep getting this error when I run the indexing command. I am not using Elasticsearch at all, and I have even uncommented it in nutch-site.xml.
Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. At least one of them should be set in nutch-site.xml
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.setConf(ElasticIndexWriter.java:255)
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Any ideas?
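One thing worth checking (a sketch, assuming a standard Nutch 1.x layout where $NUTCH_HOME points at your installation): the stack trace comes from ElasticIndexWriter, which is only instantiated when the indexer-elastic plugin is enabled, so the plugin.includes property is the place to look.

# If indexer-elastic shows up here, the Elastic index writer is still enabled
# and will demand elastic.cluster/elastic.host; for Solr-only indexing the
# list should contain indexer-solr instead.
grep -A 3 'plugin.includes' $NUTCH_HOME/conf/nutch-site.xml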
The Elasticsearch documentation says that if we restart Node 1, and Node 1 still has copies of the old shards, it will try to reuse them, copying over from the primary shard only the files that have changed in the meantime.
So I did an experiment.
There are 5 nodes in my cluster. Primary shard 1 is stored on node 1, and replica shard 1 is stored on node 2. When I restarted node 1 and node 2, primary shard 1's state became UNASSIGNED, and replica shard 1's state became UNASSIGNED too. The health of the cluster became red and never returned to green, and the cluster stopped working until I deleted the old index.
Here is part of the master log.
[ERROR][marvel.agent ] [es10] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
at java.lang.Thread.run(Thread.java:745)
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]];
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
... 3 more
Caused by: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]]
at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
... 3 more
[2016-02-19 12:53:18,769][ERROR][marvel.agent ] [es10] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
at java.lang.Thread.run(Thread.java:745)
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]];
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
... 3 more
Caused by: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]]
at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
... 3 more
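For reference, a quick way to see which shard copies stayed unassigned after such a restart (a sketch using the _cat API; adjust host and port for your cluster):

# Lists every shard with its state; unassigned primaries are what keep the
# cluster red.
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED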
Our Elasticsearch cluster is a mess. The cluster health is always red, and I've decided to look into it and salvage it if possible, but I have no idea where to begin. Here is some info regarding our cluster:
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 6,
"active_primary_shards" : 91,
"active_shards" : 91,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 201,
"number_of_pending_tasks" : 0
}
The 6 nodes:
host ip heap.percent ram.percent load node.role master name
es04e.p.comp.net 10.0.22.63 30 22 0.00 d m es04e-es
es06e.p.comp.net 10.0.21.98 20 15 0.37 d m es06e-es
es08e.p.comp.net 10.0.23.198 9 44 0.07 d * es08e-es
es09e.p.comp.net 10.0.32.233 62 45 0.00 d m es09e-es
es05e.p.comp.net 10.0.65.140 18 14 0.00 d m es05e-es
es07e.p.comp.net 10.0.11.69 52 45 0.13 d m es07e-es
Straight away you can see I have a very large number of unassigned shards (201). I came across this answer and tried it and got 'acknowledged:true', but there was no change in either of the above sets of info.
Next I logged into one of the nodes (es04) and went through the log files. The first log file has a few lines that caught my attention:
[2015-05-21 19:44:51,561][WARN ][transport.netty ] [es04e-es] exception caught on transport layer [[id: 0xbceea4eb]], closing connection
and
[2015-05-26 15:14:43,157][INFO ][cluster.service ] [es04e-es] removed {[es03e-es][R8sz5RWNSoiJ2zm7oZV_xg][es03e.p.sojern.net][inet[/10.0.2.16:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:22:28,721][INFO ][cluster.service ] [es04e-es] removed {[es02e-es][XZ5TErowQfqP40PbR-qTDg][es02e.p.sojern.net][inet[/10.0.2.229:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:32:00,448][INFO ][discovery.ec2 ] [es04e-es] master_left [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]], reason [shut_down]
[2015-05-26 15:32:00,449][WARN ][discovery.ec2 ] [es04e-es] master left (reason = shut_down), current nodes: {[es07e-es][etJN3eOySAydsIi15sqkSQ][es07e.p.sojern.net][inet[/10.0.2.69:9300]],[es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]],[es05e-es][ZoLnYvAdTcGIhbcFRI3H_A][es05e.p.sojern.net][inet[/10.0.1.140:9300]],[es08e-es][FPa4q07qRg-YA7hAztUj2w][es08e.p.sojern.net][inet[/10.0.2.198:9300]],[es09e-es][4q6eACbOQv-TgEG0-Bye6w][es09e.p.sojern.net][inet[/10.0.2.233:9300]],[es06e-es][zJ17K040Rmiyjf2F8kjIiQ][es06e.p.sojern.net][inet[/10.0.1.98:9300]],}
[2015-05-26 15:32:00,450][INFO ][cluster.service ] [es04e-es] removed {[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]],}, reason: zen-disco-master_failed ([es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]])
[2015-05-26 15:32:36,741][INFO ][cluster.service ] [es04e-es] new_master [es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]], reason: zen-disco-join (elected_as_master)
In this section I realized there were a few nodes (es01, es02, es03) that had been removed.
After this, all log files (around 30 of them) have only one line:
[2015-05-26 15:43:49,971][DEBUG][action.bulk ] [es04e-es] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
I have checked all the nodes and they have the same versions of ES and Logstash. I realize this is a big, complicated issue, but if anyone can figure out the problem and nudge me in the right direction it would be a HUGE help.
I believe this might be because at some point you had a split-brain issue and there were two versions of the same shard in two clusters. One or both might have received different sets of data, and two divergent copies of the shard might have come into existence. At some point you might have restarted the whole system and some shards might have gone into a red state.
First see if there is data loss; if there is, the aforementioned case could be the reason. Next, make sure you set minimum master nodes to N/2+1 (N being the number of master-eligible nodes), so that this issue won't surface again.
You can use the cluster reroute API on the red shards and see whether they move out of the red state. You might lose the shard data here, but that is the only way I have seen to bring the cluster state back to green.
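A sketch of that reroute call for the 1.x-era Elasticsearch this cluster appears to run (index name, shard number and node are placeholders; allow_primary forcibly allocates an empty primary, which discards whatever data that shard copy held):

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "my_index",
        "shard": 0,
        "node": "es04e-es",
        "allow_primary": true
      }
    }
  ]
}'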
Please try installing the elasticsearch-head plugin to check the shard status; you will be able to see which shards are corrupted.
Try the flush or optimize option.
Also, restarting Elasticsearch sometimes works.
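A sketch of those steps for a 1.x-era cluster (the plugin install syntax and the optimize endpoint changed in later versions):

# Install the elasticsearch-head site plugin on a node, then browse to
# http://<node>:9200/_plugin/head/ to inspect shard states visually.
bin/plugin -install mobz/elasticsearch-head

# Flush, then optimize (later renamed to force merge).
curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_optimize'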