Nodes crashing from sudden Out of Memory error - elasticsearch

We are currently running ES 6.5 on 71 nodes using a hot-warm architecture: 3 master nodes, 34 warm nodes and 34 hot nodes. The hot nodes have 64GB of RAM, 30GB of which is dedicated to the heap. The warm nodes have 128GB of RAM, also with 30GB dedicated to the heap.
We've been suffering from sudden crashes on the hot nodes, and these crashes don't only happen when the ingestion rate is at its peak. Since the cluster is fine with a higher ingestion rate, I don't believe we are hitting any limit yet. I took a heap dump from the hot nodes when they crash and I see that 80% of the heap is being used by byte arrays, which means that 80% of the heap (24GB!) is byte arrays of documents we want to index.
I've also analyzed the tasks (GET _tasks?nodes=mynode&detailed) being executed on a hot node right before it crashes, and I saw more than 1300 bulk indexing tasks active on the node at that time; 1300 bulk indexing tasks amount to about 20GB of data! Some of those tasks had been running for more than 40 seconds. A healthy node shows about 100 bulk tasks being executed.
Why does ES allow 1300 bulk indexing tasks on a node if the bulk indexing queue size is only 10? Shouldn't it be rejecting bulk requests if it's already executing 1300? Is there a way to limit the number of tasks being executed on a node at a time and reject requests once we cross a certain limit?
I want to mention that there are no queries running on the hot nodes at all. I also want to mention that the cluster has been fine with higher ingestion rates; only sometimes do some of the nodes get stuck with many bulk indexing requests, go into full GCs all the time, and then crash with Out Of Memory, followed by the rest of the nodes. When one or two of the hot nodes start to suffer from full GCs, the rest of the hot nodes are totally fine. The document id is generated by ES, so there shouldn't be any hotspotting as far as I know, and if there were, it should be happening all the time.
Honestly I'm running out of ideas and I don't know what else I could check to find the root cause, so any help would be great!
Thanks in advance!
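For reference, these are the kinds of 6.x calls that can be used to gather the numbers above and to watch a node as it degrades (mynode is a placeholder for the node name):
# Per-node write/bulk thread pool activity, queue depth and rejection counts
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed
# In-flight bulk tasks only, with how long each has been running
GET _tasks?actions=*bulk*&detailed
# Circuit breaker usage on the suspect node
GET _nodes/mynode/stats/breaker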

Related

High CPU usage on elasticsearch nodes

We have been running a 3-node Elasticsearch (v7.6) cluster in Docker containers. I have been experiencing very high CPU usage on 2 nodes (97%) and moderate CPU load on the other node (55%). The hardware used is m5.xlarge servers.
There are 5 indices with 6 shards and 1 replica each. Update operations take around 10 seconds even when updating a single field, and the same goes for deletes; however, querying is quite fast. Is this because of the high CPU load?
2 out of the 5 indices continuously undergo update and write operations as they consume from a Kafka stream. The sizes of those indices are 15GB and 2GB; the rest are around 100MB.
You need to provide more information to find the root cause:
Are all the ES nodes running in separate Docker containers on the same host, or on different hosts?
Do you have resource limits on your ES Docker containers?
How much heap does ES have, and is it 50% of the host machine's RAM?
Does the node with high CPU hold the 2 write-heavy indices you mentioned?
What is the refresh interval of the indices that receive heavy indexing?
What are the segment sizes of your 15 GB index? Use https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-segments.html to get this info (example commands after this list).
What have you debugged so far, and is there any other interesting info you can share to help find the issue?
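To make those checks concrete, these are the kinds of commands I would run (my-index is a placeholder for your write-heavy index):
# Per-node heap, RAM and CPU at a glance
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m
# Effective refresh interval, including defaults
GET my-index/_settings?include_defaults=true&filter_path=**.refresh_interval
# Segment count, size and memory per shard
GET _cat/segments/my-index?v&h=index,shard,segment,size,size.memory
# Write and search thread pool pressure per node
GET _cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected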

Elasticsearch high cpu usage and node queue high

My Elasticsearch cluster gets high CPU usage sometimes. Running some tests, I found a node with strange behavior during all the CPU spike moments.
Using /_cat/thread_pool I got this:
At that point, everything is OK with search.active on all nodes. But after some time (7 to 15 seconds), all nodes show 0 (zero) in search.active except the 6th node on the list, which keeps growing, and at that moment my cluster's CPU hits 100% usage. This happens repeatedly with the same node.
Observations:
This behavior is new; the cluster had been running fine for more than 6 months.
I'm using AWS Elasticsearch version 1.5.
I believe that node is damaged. Is there any way to confirm this?
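A rough way to narrow this down, assuming the managed service exposes these APIs (<suspect-node> is a placeholder for the node in question):
# Watch which node's search pool keeps growing (repeat every few seconds)
GET _cat/thread_pool?v&h=host,search.active,search.queue,search.rejected
# See what that node is busy doing while the spike happens
GET _nodes/<suspect-node>/hot_threads
# Check whether that node simply holds more or larger shards than the others
GET _cat/shards?v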

Elasticsearch 1.5.2 High JVM Heap in 2 nodes even without bulk indexing

We have been facing multiple downtimes recently, especially after a few hours of bulk indexing. To avoid further downtimes, we disabled bulk indexing temporarily and added another node. Now the downtimes have stopped, but two out of 6 nodes permanently remain at JVM heap > 80%.
We have a 6-node cluster currently (previously 5), each node being an EC2 c3.2xlarge with 16 GB RAM and 8 GB of JVM heap, all master+data. We're using Elasticsearch 1.5.2, which has known issues like [OOM thrown on merge thread](https://issues.apache.org/jira/browse/LUCENE-6670), and we hit that one regularly.
There are two major indices used frequently for search and autosuggest, with doc count/size as follows:
health status index pri rep docs.count docs.deleted store.size
green open aggregations 5 0 16507117 3653185 46.2gb
green open index_v10 5 0 3445495 693572 44.8gb
Ideally, we keep at least one replica for each index, but our last attempt to add a replica with 5 nodes resulted in OOM errors and a full heap, so we turned it back to 0.
We also had two bulk update jobs running between 12 and 6 AM, each updating about 3 million docs with 2-3 fields on a daily basis. They were scheduled at 1.30 AM and 4.30 AM, each sending bulk feeds of 100 docs (about 12 KB in size) to the bulk API using a bash script with a sleep of 0.25s between requests to avoid too many parallel requests. When we started the bulk updates, we had at most 2 million docs to update daily, but the doc count almost doubled in a short span (to 3.8 million) and we started seeing search response time spikes, mostly between 4 and 6 AM and sometimes even later. Our average search response time also increased from 60-70 ms to 150+ ms. A week ago, the master left due to a ping timeout, and soon after that we received a shard-failed error for one index. On investigating further, we found that this specific shard's data was inaccessible. To restore availability of the data, we restarted the node and reindexed.
However, the node downtime happened many more times, and each time shards went into UNASSIGNED or INITIALIZING state. We finally deleted the index and started fresh. But heavy indexing again brought OutOfMemory errors and node downtime, with the same shard issue and data loss. To avoid further downtimes, we stopped all bulk jobs and reindexed the data at a very slow rate.
We also added one more node to distribute load. Yet currently we have 3 nodes with JVM heap constantly above 75%, 2 of them always above 80%. We have noticed that the number of segments and their size is relatively high on these nodes (about 5 GB), but running optimize on these indices would risk increasing the heap again, with a probability of downtime.
Another important point to note is that our Tomcat apps hit only 3 of the nodes (for normal search and indexing), and mostly one of the other two nodes was used for bulk indexing. Thus, one of the three query+indexing nodes, plus the node used for bulk indexing, have relatively high heap.
There are the following known issues with our configuration and indexing approach, which we are planning to fix:
Bulk indexing hits only one node, increasing its heap and causing slightly longer GC pauses.
mlockall is set to false.
A snapshot is needed to revert the index in cases like this; we were still in the planning phase when this incident happened.
We can merge the 2 bulk jobs into one, to avoid too many indexing requests queued at the same time.
We can use the optimize API at regular intervals in the bulk indexing script to avoid accumulating too many segments (a rough sketch follows this list).
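For the last two items, a minimal sketch of what the post-bulk maintenance could look like on 1.x, using the aggregations index from above as the example (the 30s refresh interval and max_num_segments=5 are arbitrary illustration values, not recommendations):
# Relax refresh during the nightly bulk window
PUT /aggregations/_settings
{ "index": { "refresh_interval": "30s" } }
# After the bulk job finishes, reduce the segment count (1.x optimize API)
POST /aggregations/_optimize?max_num_segments=5
# Restore the normal refresh interval
PUT /aggregations/_settings
{ "index": { "refresh_interval": "1s" } }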
Elasticsearch yml: (only relevant and enabled settings mentioned)
master: true
index.number_of_shards: 5
index.number_of_replicas: 2
path.conf: /etc/elasticsearch
path.data: /data
transport.tcp.port: 9300
transport.tcp.compress: false
http.port: 9200
http.enabled: true
gateway.type: local
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 4 # Now that we have 6 nodes
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
Node stats:
Pastebin link
Hot threads:
Pastebin link
If I understand correctly, you have 6 servers, each of them running one Elasticsearch node.
What I would do is run more than one node on each server, separating the roles: nodes that act as clients, nodes that act as data nodes, and nodes that act as masters. I think you can have two nodes on each server:
3 servers: data + client
3 servers: data + master
The client nodes and the master nodes need much less RAM. The configuration will be more complex, but it will work better.
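A rough sketch of the role settings this suggestion implies for a 1.x elasticsearch.yml (one file per node process; heap sizes and the exact split are up to you):
# client / coordinating-only node
node.master: false
node.data: false
# data-only node
node.master: false
node.data: true
# dedicated master-eligible node
node.master: true
node.data: false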

Why is there unequal CPU usage as elasticsearch cluster scales?

I have a 15-node Elasticsearch cluster and am indexing a lot of documents. The documents are of the form { "message": "some sentences" }. When I had a 9-node cluster, I could get CPU utilization up to 80% on all of them; when I turned it into a 15-node cluster, I get 90% CPU usage on 4 nodes and only ~50% on the rest.
The specification of the cluster is:
15 nodes, c4.2xlarge EC2 instances
15 shards, no replicas
There is a load balancer in front of all the instances, and the instances are accessed through the load balancer.
Marvel is running and is used to monitor the cluster
Refresh interval 1s
I could index 50k docs/sec on 9 nodes but only 70k docs/sec on 15 nodes. Shouldn't I be able to do more?
I'm not yet an expert on scalability and load balancing in ES, but here are some things to consider:
Load balancing is native in ES, so having a load balancer in front can actually work against the built-in load balancing. It's a bit like having a speed limiter on your car but using the brakes manually: the limiter should already do the job, and is prevented from doing it right when you add "manual regulation". Have you tried bypassing your load balancer and relying on the native load balancing to see how it fares?
While more servers/shards give you more CPU and computation power, they also force you to go through multiple shards every time you write/read a document, so if 1 shard can do N computations, M shards won't actually give you M*N computations.
Having 15 shards is probably overkill in a lot of cases.
Having 15 shards but no replication is weird/bad, since if any of your 15 servers goes down you won't be able to access your whole index.
You can actually host multiple nodes on a single server.
What is your index size in terms of storage?
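A quick way to check whether the imbalance comes from shard placement rather than from the load balancer (my-index is a placeholder):
# Which node holds which shard, and how big each one is
GET _cat/shards/my-index?v
# Disk use and shard counts per node
GET _cat/allocation?v
# Doc counts and store size per index
GET _cat/indices?v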

Cassandra compaction taking too much time to complete

Initially we had 12 nodes in the Cassandra cluster, and with a 500GB data load on each node a major compaction used to complete in 20 hours.
Now we have upgraded the cluster to 24 nodes, and with the same data size (500 GB on each node) a major compaction is taking 5 days. (The hardware configuration of each node is exactly the same, and we are using cassandra-0.8.2.)
So what could be the possible reason for this slowdown?
Is the increased cluster size causing this issue?
Compaction is a completely local operation, so cluster size would not affect it. Request volume would, and so would data volume.
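Two purely local things worth checking under that assumption; compaction throttling was added around the 0.8 line, and the nodetool subcommand below may or may not exist in your exact version (<node-host> is a placeholder):
# cassandra.yaml - throttles compaction I/O; 0 disables throttling
compaction_throughput_mb_per_sec: 16
# Watch pending and active compactions on a node during the major compaction
nodetool -h <node-host> compactionstats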
