I have an Elasticsearch cluster set up on Kubernetes. Recently Logstash was not able to push any data to the cluster because one of the nodes in the cluster was out of disk space.
This was the error in Logstash:
[Ruby-0-Thread-13#[main]>worker1: /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:383] elasticsearch - retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
The es-master had marked the indices on that node as read-only because disk usage had crossed the flood-stage watermark:
[WARN ][o.e.c.r.a.DiskThresholdMonitor] [es-master-65ccf55794-pm4xz] flood stage disk watermark [95%] exceeded on [SaRCGuyyTBOxTjNtvjui-g][es-data-1][/data/data/nodes/0] free: 9.1gb[2%], all indices on this node will be marked read-only
Following this I freed up resources on that node and it now has enough space available (almost 50%). But Logstash is still not able to push data to Elasticsearch and is logging the same error as above.
I have the following questions:
Will elasticsearch recover from this automatically?
If not, should I restart the cluster? Is it enough if I just restart the data nodes, or should I restart the master and the ingest nodes as well?
Is there any way to mark the indices writable again without a restart?
You have to manually reset the read-only block on your indices.
You can find this in the documentation under the cluster.routing.allocation.disk.watermark.flood_stage setting:
The index block must be released manually once there is enough disk
space available to allow indexing operations to continue.
PUT /<your index name>/_settings
{
"index.blocks.read_only_allow_delete": null
}
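If the block was applied to many indices, you can clear it on all of them at once and then confirm that disk usage per node is back under the watermarks. A rough curl sketch, assuming the default port and that you run it against any node in the cluster:
# clear the read-only/allow-delete block on every index
curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index.blocks.read_only_allow_delete": null
}'
# confirm disk usage per data node is back below the watermarks
curl 'http://localhost:9200/_cat/allocation?v'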
I am running a three-node Elasticsearch (ELK) cluster. All nodes have the same roles (data, master, etc.). The disk on node 3 that holds the data folder became corrupt, and that data is probably unrecoverable. The other nodes are running normally and one of them assumed the master role instead.
Will the cluster work normally if I replace the disk and make the empty directory available to Elasticsearch again, or am I risking crashing the whole cluster?
EDIT: As this is not explicitly mentioned in the answer, yes, if you add your node with an empty data folder, the cluster will continue normally as if you added a new node to the cluster, but you have to deal with the missing data. In my case, I lost the data as I do not have replicas.
Let me try to explain it in a simple way.
Your data got corrupted on node-3, so if you add that node again, it will not have the older data, i.e. the shards stored on node-3 will remain unavailable to the cluster.
Did you have replica shards configured for the indexes?
What is the current status (yellow/red) of the cluster with node-3 removed?
If a primary shard isn't available, the master node promotes one of the active replicas to become the new primary. If there are currently no active replicas, the status of the cluster will remain red.
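If you want to check this yourself, the cluster status and the list of unassigned shards can be pulled with something like the following (default port assumed):
# overall cluster status: green / yellow / red
curl 'http://localhost:9200/_cluster/health?pretty'
# which shards are unassigned, and for which indices
curl 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED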
I have an Elasticsearch cluster set up on k8s with one StatefulSet for Elasticsearch master nodes (3 of them) and another StatefulSet for Elasticsearch data nodes (15 of them).
During shard re-allocation, triggered by a few of the data nodes reaching their capacity, we have now encountered an error on the data nodes (on a few of them, actually) that goes like this:
uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: alias [alias-id_100536] has more than one write index [index-abc ,index-def]
We have encountered this issue before as well, and the solution that worked for us was to find the hash of the index via the _cat/indices/index-abc API, shell onto the VM of the data node, and delete the entire directory with that hash value. The index would reach the desired replication count again once the data node was back up, so we had no data loss.
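For reference, the lookup step was roughly the curl call below, run against the master (<master-host> is a placeholder), with the index name taken from the error; the UUID it returns is the directory name under the data node's indices path:
curl 'http://<master-host>:9200/_cat/indices/index-abc?v&h=index,uuid'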
However, now when we try to use the _cat/indices API on the index that has the write state, the master node says that there are no indices with that name.
Out of the 2 indexes named in the error, we are able to find only one of them on the master nodes. We could try to delete them one by one on the data node after finding the hash value from the _cat/indices API; however, I wanted to know what the ideal recovery method for this would be.
Elasticsearch version: 6.7.2
Thanks
I was able to solve this problem.
We ended up writing a script that lists the directories under each Elasticsearch data node's indices path.
That way we were able to get a list of all UUIDs present on that data node.
We next ran the _cat/indices?format=json API call on the master node.
We collected the UUIDs from this API and computed set(on-disk UUIDs) - set(_cat/indices UUIDs).
This gave us the list of indices that were considered dangling, and we manually deleted their directories with the rm -rf command.
This brought the nodes back up.
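A simplified sketch of what that script did, using curl and jq; the data path and master URL below are assumptions for a typical layout, so adjust them to your own deployment:
#!/bin/bash
# compare on-disk index UUIDs with the UUIDs the master knows about
DATA_PATH=/usr/share/elasticsearch/data/nodes/0/indices   # assumption: adjust to your data path
MASTER=http://localhost:9200                              # assumption: adjust to your master endpoint
# UUIDs present on this data node (directory names under the indices path)
ls "$DATA_PATH" | sort > /tmp/on_disk_uuids
# UUIDs the master node knows about
curl -s "$MASTER/_cat/indices?format=json" | jq -r '.[].uuid' | sort > /tmp/known_uuids
# directories that exist on disk but are unknown to the master => dangling
comm -23 /tmp/on_disk_uuids /tmp/known_uuids
# after reviewing that list, the dangling directories were removed, e.g.:
#   comm -23 /tmp/on_disk_uuids /tmp/known_uuids | xargs -I{} rm -rf "$DATA_PATH/{}"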
I have deployed a logging project on OpenShift and have created a PV on GlusterFS, which consists of 3 servers running in a cluster.
I have created a volume on GlusterFS and enabled the sharding feature on it. However, Elasticsearch doesn't run on it as expected, and I get this error:
FAIL: Configuration for 'config' failed because of UnavailableShardsException[[.searchguard.logging-es-6akrt1z2-11-66509][0] primary shard is not active Timeout: [1m], request: [index {[.searchguard.logging-es-6akrt1z2-11-66509][config][0], source[{"searchguard":{"dynamic":{"http":{"xff":{"enabled":true,"remoteIpHeader":"x-forwarded-for","trustedProxies":".*","internalProxies":".*"}},"authc":{"authentication_domain_proxy":{"enabled":true,"order":0,"http_authenticator":{"challenge":false,"type":"proxy","config":{"user_header":"x-proxy-remote-user"}},"authentication_backend":{"type":"noop"}},"authentication_domain_basic_internal":{"enabled":true,"order":1,"http_authenticator":{"type":"clientcert","challenge":false},"authentication_backend":{"type":"noop"}}}}}}]}]]
How could I fix this?
OK, I solved the issue. It looks like the default shard size on the volume was too small, so I set it to 50 GB. Now a new question arises: what happens if I set Elasticsearch to use 2 shards, and a shard then grows beyond 50 GB?
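A quick way to keep an eye on that is to compare the on-disk size of each Elasticsearch shard against the 50 GB GlusterFS shard size; a sketch, assuming the default port:
# per-shard store size for every index
curl 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,store'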
We have a 3-node ES cluster hosted on AWS. We are seeing the error messages below after the Amazon network connectivity issue (see https://status.aws.amazon.com/) that happened today. Could you please advise how I can bring the cluster back up to a good state, even without any data?
[index.store ] [ [.marvel-2015.03.19][0] Failed to open / find files while reading metadata snapshot
[2017-02-10 01:54:54,379][WARN ][index.engine.internal ] [.marvel-2015.03.16][0] failed engine [corrupted preexisting index]
org.apache.lucene.index.CorruptIndexException: [.marvel-2015.03.16][0] Preexisting corrupted index [corrupted_Jja1GRiPTFyzm4G_tuEvsg] caused by: CorruptIndexException[codec footer mismatch: actual footer=1431655765 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/es-data//nodes/0/indices/.marvel-2015.03.16/0/index/_83k_es090_0.doc"))]
I would say, compare the data folders on each node. Try to identify the node with an anomaly, which may show up as file entries with a corrupted_* marker or a larger data folder size than on the other nodes. If you are lucky, you have a balanced cluster and the other nodes hold the full index in the form of primary and replica shards. In such a scenario you can delete the data folder of the anomalous node and restart the cluster, which will then balance itself again.
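A rough way to do that comparison on each node (the data path below is taken from the error output; adjust it if yours differs):
# size of the data folder on this node, to compare across nodes
du -sh /es-data/nodes/0/indices
# look for corruption marker files left behind by Lucene
find /es-data/nodes/0/indices -name 'corrupted_*'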
Using ES version 1.3.1.
We found a split-brain and restarted the entire cluster. Now only the latest index got correctly allocated, leaving all other indices unassigned...
I've checked several nodes and there is index data saved on disk, and I've tried restarting those nodes, but still no shards get allocated...
Please see this screen shot:
http://i.stack.imgur.com/d6jT7.png
I've tried the "Cluster reroute": http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html. However, got a exception like "cannot allocate Primary Shard"...
Please help and any comment is welcome. Thanks a lot.
Don't allocate primary shards with the _cluster/reroute API; this will create an empty shard with no data.
Try setting your replica count to 0.
If that doesn't work, set index.gateway logging to TRACE and restart a node that contains saved index data for one of the unassigned shards. What do you see in the logs for that node or in the logs for the master node?
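For reference, on a cluster that old both steps can still be done over the REST API; a sketch, assuming the default port and that you want to apply them to all indices:
# drop the replica count to 0 for all indices
curl -XPUT 'http://localhost:9200/_settings' -d '
{
  "index.number_of_replicas": 0
}'
# raise index.gateway logging to TRACE via a transient cluster setting
# (it can also be set in logging.yml before restarting the node)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "transient": { "logger.index.gateway": "TRACE" }
}'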