Elasticsearch dies on primary node at random times. What should I look for in troubleshooting? - elasticsearch

I have 2 Elasticsearch VMs running (4 GB RAM each) configured as a single cluster with 2 nodes. It's been running fine for months, but in the last week the primary node has just been dying for no reason that I can find.
When I restart the node it takes a while to re-sync, but it does recover eventually.
I have the heap set to use only half the RAM on the server, and it seems to respect that, but looking at my free memory and htop I'm seeing more than that consumed.
My browser-based plugins (Bigdesk, Marvel, elasticsearch-head) also respond very slowly now. I have noticed that my Marvel index files are massive - over 1 GB a day. Can I remove these to improve performance? The Marvel data is way more than my actual index/searchable data.
What else can I do to tweak and optimize this system?
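One thing I still plan to check (just a guess on my part, nothing ES-specific) is whether the kernel OOM killer is the one ending the java process, with something like:
sudo dmesg | grep -iE 'out of memory|killed process'
free -m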
Thanks.

Found a solution, and hopefully this will help others. I did some more searching after noticing there were a ton of Marvel indices and they were quite large. I found Curator and ran a delete on indices older than 30 days:
curator --host <IP ADDRESS> delete indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'
That command purged all the indices more than 30 days old; the cluster went green immediately and all plugins are working like normal again. Granted, I know I'm losing historical data older than 30 days, but that is OK.
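For anyone else in the same spot, listing the indices with their sizes is what made the problem obvious (standard _cat API; swap in your own host and port):
curl 'http://<IP ADDRESS>:9200/_cat/indices?v'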
HTH someone else that might be in the same boat.

Related

Increasing the number of nodes for Elasticsearch doesn't speed up the queries

I am pretty new to Elasticsearch.
Right now I have a pretty expensive search query; it uses nested queries, highlighting, fuzziness and so on, and right now I can't change that.
So I decided to try upgrading my cluster: previously it had 3 nodes with 16 GB RAM and 0.5 CPU each.
Now I have upgraded it to 6 nodes with 16 GB RAM and 3 CPUs each.
I was hoping for some speedup on queries, but I did not get it. For some reason it has even gotten slower; for example, queries that took 12 seconds now take about 17 seconds.
Is there anything more I have to do after increasing the number of nodes? After reading some docs I thought it was all automatic.
I will put the result of the /_nodes query here, maybe it will give some more insight. I have to use an external service to share the JSON because of the character limit:
https://wtools.io/paste-code/bFUC
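For what it's worth, I am also checking whether the shards actually spread onto the new nodes with the standard _cat endpoint (host is a placeholder):
curl 'http://<host>:9200/_cat/shards?v'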

Full index rebuild with 5M documents on Elasticsearch 7.6 takes 1 week to complete

Hope all are doing well.
Currently Elasticsearch 7.6 is running on a cluster with 3 nodes, and it has 5M records in it.
Performing a full rebuild of the index takes a huge amount of time, even though other server metrics like CPU and GC are within limits.
Similarly, I have tried the ES reindex process, which is also time-consuming. I applied a few workarounds, like setting the refresh interval to -1 and replicas to 0, but it still takes a very long time.
I have almost 168 shards, but the segment count is a little higher, showing 800+.
Kindly help and suggest how to resolve this issue. Any leads would be much appreciated.
It definitely looks like some bad configuration of the cluster, the nodes, or an important indexing-related setting is causing performance this slow.
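For reference, the refresh-interval / replica workaround mentioned above is usually applied as a dynamic index setting roughly like this (my_index is a placeholder; remember to restore both settings once the rebuild finishes):
curl -X PUT 'http://localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }'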

Elasticsearch stats API taking a long time to respond

We have an ES cluster at AWS running with the following setup:
(I know, I need a minimum of 3 master nodes)
1 Coordinator
2 Data nodes
1 Master Node
Data nodes spec:
CPU: 8 Cores
Ram: 20GB
Disk: 1TB ssd 4000 IOPS
Problem:
ES endpoints for Search, Delete, Backup, Cluster Health, and Insert are working fine.
Since yesterday, some endpoints like /_cat/indices, /_nodes/_local/stats, etc. have started taking too long to respond (more than 4 minutes) :( and consequently our Kibana is in a red state (timeout after 30000 ms).
Useful info:
All Shards are OK (3500 in total)
The cluster is in green state
X-pack disabled
Average of 1 GB per shard
500k document count.
Requests made by localhost at AWS
CPU, DISK, RAM, IOPS all fine
Any ideas?
Thanks in advance :)
EDIT/SOLUTION 1:
After a few days I found out what the problem was, but first a little bit of context...
We use Elasticsearch to store user audit messages and mobile error messages. At first (obviously in a rush to deliver new microservices and take load off our MongoDB cluster) we designed the Elasticsearch indices by day, so every day a new index was created, and at the end of the day that index held around 6~9 GB of data.
Six months later, almost 180 indices and 720 open primary shards later, we bumped into this problem.
Then I read this again (the basics!):
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
After talking to the team responsible for this microservice, we redesigned our indices to use a monthly index, and guess what? Problem solved!
Now our cluster is much faster than before, and this simple command saved me some sweet nights of sleep:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
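A rough sketch of the reindex call we ran for each month (the index names here are made up; ours follow our own naming scheme):
curl -X POST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{ "source": { "index": "audit-2019.01.*" }, "dest": { "index": "audit-2019.01" } }'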
Thanks!

Elasticsearch - queries throttle CPU

Recently our cluster has seen extreme performance degradation. We had 3 nodes, 64 GB RAM and 4 CPUs (2 cores) each, for an index that is 250M records and 60 GB in size. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indexes, queried through an alias.
3. Disabled paging (Windows Server 2012).
4. Added synonym analysis on one field.
Our cluster can now survive for a few hours before it's basically useless. I have to restart Elasticsearch on each node to rectify the problem. We tried bumping each node to 8 CPUs (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure whether this was happening before, but it certainly seems like an issue. We obviously need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, the performance doesn't seem to get worse for those particular responses. Still, over time the performance slows to a crawl and the CPU (all cores) stays maxed out indefinitely.
I'm using Elasticsearch 1.3.4 and the elasticsearch-analysis-phonetic 2.3.0 plugin on every box, and was even back when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site at a single index that ultimately stores about 80% of the data, the CPU wasn't being throttled. There were still a few 100% spikes, but they were much shorter. When I pointed it back at the alias (which points to two indexes in total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% on every query and gets stuck there with many in a row.
Is there a known issue with Elasticsearch aliases? Am I using the alias incorrectly?
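For context, the alias is set up more or less like this (index names changed):
curl -X POST 'http://localhost:9200/_aliases' -d '{ "actions": [ { "add": { "index": "products_a", "alias": "products" } }, { "add": { "index": "products_b", "alias": "products" } } ] }'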
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in Elasticsearch? If I run an empty query and then try to view the last page (from 100,000,000, e.g.), it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results and then quits, all the while taking up 100% of the CPU for over a minute.
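Roughly what the offending request looks like (index name changed), just an empty query with a huge from:
curl 'http://localhost:9200/products/_search?from=100000000&size=10'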
UPDATE 3:
So here's something else strange. I pointed to an old index on dev (same size, no aliases) and tried to reproduce the paging issue; the cluster doesn't get hit immediately. It sits at 1% CPU usage for the first 20 seconds after the query. The query returns with an error before the CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).
Without checking certain metrics it is very difficult to identify the cause of slow responses or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the number of garbage collections on your nodes. Frequent garbage collection (mainly old-generation GC) will consume a lot of CPU. This in turn will start to affect the whole cluster.
As you mentioned, it started giving issues only after you added another node. This surprises me. Is there any increase in traffic?
Can you include the output of the _stats API taken at the time when your cluster slows down? It will have a lot of information from which I can make a better diagnosis. Also include a sample of the query.
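Something along these lines would show the cache-eviction and GC numbers I am referring to (1.x-era endpoints, host is a placeholder):
curl 'http://localhost:9200/_stats?pretty'
curl 'http://localhost:9200/_nodes/stats/jvm,indices?pretty'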
I suggest installing Bigdesk so that you can get a graphical view of your cluster health more easily.

What is the max number of nodes in RethinkDB?

I've read in RethinkDB's docs that we can have anywhere from one to sixteen nodes, but I don't know whether that is a figure of speech or a real limit.
I launched 20 VirtualBox VMs to create a cluster and had trouble getting all the nodes in the cluster online at the same time; 3 or 4 nodes lose connectivity. This would make sense with a 16-node limit, but I haven't found similar limits for other NoSQL databases.
Is 16 a real maximum number of nodes per cluster in RethinkDB?
Thanks!
Short answer is: There is no hard limit.
The docs say 16 machines because that is what we have tested so far.
Some tests have been run with 64 nodes and while it doesn't scale as much as it should, it still works.
RethinkDB is aiming for a smooth experience with 100 servers and 100,000 tables -- see https://github.com/rethinkdb/rethinkdb/issues/1861 to track progress.
Also, if you run 20 VMs on the same machine, the host may not have enough resources to run the cluster, which would explain the timeouts.
