AWS Elasticsearch frequently getting into yellow state - elasticsearch

I have AWS Elasticsearch running with 3 master nodes (c4.large) and 10 data nodes (c5.large). Recently my domain has frequently been going into a yellow state for around 30 minutes at a time, and then, without me doing anything, it changes back to green.
When I run the query GET /_cluster/allocation/explain?pretty this is what I see:
{
  "index" : "lgst-",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2021-01-06T13:15:38.721Z",
    "details" : "node_left [**************]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "yes",
  "allocate_explanation" : "can allocate the shard",
  "target_node" : {
    "id" : "****************",
    "name" : "********************"
  },
I can't understand what this means or how to overcome it. Any help would be appreciated.

It looks like you are using spot instances in your cluster, and the cause is that the nodes in your AWS cluster are not stable, as shown clearly in the unassigned_info:
"unassigned_info" : {
"reason" : "NODE_LEFT",
"at" : "2021-01-06T13:15:38.721Z",
"details" : "node_left [**************]",
"last_allocation_status" : "no_attempt"
},
I would suggest changing the instance types if you are using EC2 spot instances, and checking with AWS support why nodes are getting disconnected in your cluster.
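To confirm this, you can watch the cluster health and the node list while the domain is yellow; a drop in the node count around that time shows nodes leaving and rejoining (the endpoint below is a placeholder for your domain endpoint):
curl -s 'https://<your-domain-endpoint>/_cluster/health?pretty'
curl -s 'https://<your-domain-endpoint>/_cat/nodes?v'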

Related

elasticsearch - cannot allocate because allocation is not permitted to any of the nodes

I have Elasticsearch running as a single-node cluster.
One of the indices is yellow, with the explanation below.
I have read all the material here and elsewhere, and in general I did not find a solution for this problem.
Here is the index info:
yellow open research-pdl 8_TrwZieRM6oBes8sGBUWg 1 1 416656058 0 77.9gb 77.9gb
The command POST _cluster/reroute?retry_failed does not seem to be doing anything.
The setup is running on Docker, and I have 650 GB of free space.
{
  "index" : "research-pdl",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2020-12-16T05:21:19.977Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "5zzXP2kCQ9eDI0U6WY4j9Q",
      "node_name" : "37f65704d9bb",
      "transport_address" : "172.19.0.2:9300",
      "node_attributes" : {
        "ml.machine_memory" : "67555622912",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[research-pdl][0], node[5zzXP2kCQ9eDI0U6WY4j9Q], [P], s[STARTED], a[id=J7IX30jBSP2jXl5-IGp0BQ]]"
        }
      ]
    }
  ]
}
Thanks
The exception message is very clear: for high-availability reasons, Elasticsearch never assigns a replica of a primary shard to the same node that holds the primary.
a copy of this shard is already allocated to this node
[[research-pdl][0], node[5zzXP2kCQ9eDI0U6WY4j9Q], [P], s[STARTED],
a[id=J7IX30jBSP2jXl5-IGp0BQ]]
And as you have a single-node cluster, there is no other node where your replicas can be assigned.
Solutions
Add more nodes to your cluster, so that replicas can be assigned on other nodes (the preferred way).
Reduce the replica count to 0; this can cause data loss and performance issues (do this only if you can't add data nodes and you want a green state for your cluster).
You can update the replica count using the index settings update API.
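For example, dropping the replica count to 0 for the index from the question would look like this:
curl -X PUT 'localhost:9200/research-pdl/_settings' -H 'Content-Type: application/json' -d'
{
  "index" : {
    "number_of_replicas" : 0
  }
}'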

Duplicated ElasticSearch documents

We use a Spring Boot application to insert/update Elasticsearch documents. Our data provider sends us data via Kafka. Our app processes the events, tries to find a record, and inserts the record if it does not exist or updates it if the received record differs from the saved one. There shouldn't be any duplicated records in Elasticsearch.
The app inserts/updates documents with an IMMEDIATE refresh.
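For example, a single write with an immediate refresh goes through the REST API roughly like this (the index name matches ours, but the body here is illustrative):
curl -X POST 'localhost:9200/reference/_doc?refresh=true' -H 'Content-Type: application/json' -d'
{
  "insertedDate" : 1589221706262
}'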
Problem:
Occasionally we have to remove all the data and load it again, because there are duplicated records. I found out that these cloned records differ only in their insert date; it's usually a few hours' difference.
Generally it works as expected, and detailed integration tests on org.codelibs.elasticsearch-cluster-runner are green.
Example metadata from an Elasticsearch query:
{
  "docs" : [
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "s0z-BHIBCvxpj4TjysIf",
      "_version" : 1,
      "_seq_no" : 17315835,
      "_primary_term" : 40,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589221706262,
        ...
      }
    },
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "jdVCBHIBXucoJmjM8emL",
      "_version" : 1,
      "_seq_no" : 17346529,
      "_primary_term" : 41,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589209395577,
        ...
      }
    }
  ]
}
Tests
I loaded data into a local instance of ES many times - no duplications.
I created a few long-running integration tests with a large number of inserts, updates, and queries against a local instance of org.codelibs.elasticsearch-cluster-runner with 1 to 5 in-memory nodes - no duplications.
Details:
Elasticsearch version - 7.5
ES connection with org.elasticsearch.client.RestHighLevelClient
The reason has been found: one of the nodes had problems establishing a connection and would sometimes disconnect. (The differing _primary_term values, 40 vs. 41, in the metadata above are consistent with a primary shard being reassigned between the two writes.)
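A common complementary mitigation, not part of the original answer, is to make the writes idempotent by deriving the document _id from a natural business key instead of letting Elasticsearch generate it, so a retried or replayed write overwrites the existing document rather than creating a duplicate:
# <business-key> is a placeholder for a deterministic id derived from the record
curl -X PUT 'localhost:9200/reference/_doc/<business-key>?refresh=true' -H 'Content-Type: application/json' -d'
{
  "insertedDate" : 1589221706262
}'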

elasticsearch - snapshot creation failed due to RepositoryMissingException

I'm trying to create a snapshot in an S3 bucket. After running the request to create the new snapshot, I check the status of the new snapshot and see that its state is PARTIAL, due to a RepositoryMissingException.
Why is that happening?
More information:
snapshot configuration:
$ curl localhost:9200/_cat/repositories
s3_repository s3
creation of new snapshot:
$ curl -XPUT localhost:9200/_snapshot/s3_repository/snap10
{"accepted":true}
get details about created snapshot (here we can see the failure):
$ curl localhost:9200/_snapshot/s3_repository/snap10?pretty
{
  "snapshots" : [ {
    "snapshot" : "snap10",
    "version_id" : 2040699,
    "version" : "2.4.6",
    "indices" : [ "twitter" ],
    "state" : "PARTIAL",
    "start_time" : "2018-09-27T08:24:13.431Z",
    "start_time_in_millis" : 1538036653431,
    "end_time" : "2018-09-27T08:24:13.823Z",
    "end_time_in_millis" : 1538036653823,
    "duration_in_millis" : 392,
    "failures" : [ {
      "index" : "twitter",
      "shard_id" : 1,
      "reason" : "RepositoryMissingException[[s3_repository] missing]",
      "node_id" : "0yJw77XwSX62rUnhDAAclw",
      "status" : "INTERNAL_SERVER_ERROR"
    }, {
      "index" : "twitter",
      "shard_id" : 0,
      "reason" : "RepositoryMissingException[[s3_repository] missing]",
      "node_id" : "WEzVGyjXSLWuzfD_w-sBlA",
      "status" : "INTERNAL_SERVER_ERROR"
    } ],
    "shards" : {
      "total" : 2,
      "failed" : 2,
      "successful" : 0
    }
  } ]
}
Can you please assist with the issue? Why does the error say RepositoryMissingException?
Please let me know if more information is needed.
In the end, the issue was that the cloud-aws plugin was installed only on the master node. Once I installed the plugin on the data nodes, it worked.
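A quick way to check for this is to list the installed plugins per node, and then install the plugin on any node where it is missing and restart that node (the install command below is the Elasticsearch 2.x syntax matching the version shown above):
# every node should list cloud-aws
curl localhost:9200/_cat/plugins?v
# on each node where it is missing, run from the Elasticsearch home directory, then restart
bin/plugin install cloud-aws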

auditbeat failure in ELK : index_not_found_exception

I followed the guidelines to install Auditbeat in ELK to send my auditd logs to ELK, but unfortunately I just can't seem to make it work. I checked my config files multiple times and I just can't wrap my head around it. When I look up the index "auditbeat-*" in Kibana, it finds no results at all.
When I check the index itself, I get:
curl localhost:9200/auditbeat-6.2.1-2018.02.14/_search?pretty
{
  "error" : {
    "root_cause" : [ {
      "type" : "index_not_found_exception",
      "reason" : "no such index",
      "resource.type" : "index_or_alias",
      "resource.id" : "auditbeat-6.2.1-2018.02.14",
      "index" : "auditbeat-6.2.1-2018.02.14"
    } ],
    "type" : "index_not_found_exception",
    "reason" : "no such index",
    "resource.type" : "index_or_alias",
    "resource.id" : "auditbeat-6.2.1-2018.02.14",
    "index" : "auditbeat-6.2.1-2018.02.14"
  },
  "status" : 404
}
So I am not sure where to take it from there. I tried sending the events via both Elasticsearch and Logstash, but I keep getting the same results no matter what.
Thanks,
It turns out this was happening because the port was bound to address 127.0.0.1 instead of 0.0.0.0.
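The answer doesn't say which service's port was affected. Assuming a Linux host, you can check which address a port is bound to with, for example (9200 for Elasticsearch, 5044 for a Logstash beats input):
# 127.0.0.1:9200 means loopback only; 0.0.0.0:9200 means all interfaces
ss -tlnp | grep -E ':9200|:5044'
If the service in question is Elasticsearch, the usual fix is to set network.host: 0.0.0.0 in elasticsearch.yml and restart the node.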

How to fix elasticsearch node version number in a cluster (having 2 different version nodes)

In my Elasticsearch cluster I have 2 nodes. One is version 1.7.3 and the other is version 1.7.5 (my Elasticsearch 1.7.3 install got corrupted, so I reinstalled with 1.7.5).
How can I upgrade the node from 1.7.3 to 1.7.5?
I referred to https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-upgrade.html#rolling-upgrades, but could not work out the procedure for upgrading a node's version.
Kindly help me through this.
My cluster is green, and the nodes are as follows:
{
  "cluster_name" : "graylog2",
  "nodes" : {
    "mC4Osz5IS0OLy2E8QbqZLQ" : {
      "name" : "Decay II",
      "transport_address" : "inet[/127.0.0.1:9300]",
      "host" : "localhost",
      "ip" : "127.0.0.1",
      "version" : "1.7.5",
      "build" : "00f95f4",
      "http_address" : "inet[/127.0.0.1:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 957,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    },
    "qCDvg4XCREmj_iGmbt4v4w" : {
      "name" : "graylog2-server",
      "transport_address" : "inet[/127.0.0.1:9350]",
      "host" : "localhost",
      "ip" : "127.0.0.1",
      "version" : "1.7.3",
      "build" : "05d4530",
      "attributes" : {
        "client" : "true",
        "data" : "false",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 8937,
        "max_file_descriptors" : 64000,
        "mlockall" : false
      }
    }
I suspect the difference in version is the cause of Graylog refusing to connect to the Elasticsearch cluster.
Please help.
Hmmmm, I can think of two ways to do this:
Add a 3rd node on 1.7.5 and wait for the cluster to go green. Then shut down the 1.7.3 node, upgrade it, and re-introduce it to the cluster.
Shut down the cluster. Upgrade the node to 1.7.5. Start node mC4Osz5IS0OLy2E8QbqZLQ first and then the other one.
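Either way, you can confirm afterwards that every node reports the same version (endpoint shown for a locally reachable node):
curl 'localhost:9200/_cat/nodes?v&h=name,version'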
