Elasticsearch service hangs and is killed during data insertion (JVM heap) - elasticsearch

I am using Elasticsearch 5.6.13 and need some expert advice on its configuration. I have 3 nodes on the same system (node1, node2, node3), where node1 is the master and the other two are data nodes. I have around 40 indices; I created all of them with the default 5 primary shards, and some of them have 2 replicas.
The issue I am facing right now: my (scraped) data is growing day by day, and one of my indices already holds 400 GB; 3 other indices are also very heavily loaded.
For the last few days, Elasticsearch hangs during data insertion and then the service is killed, which affects my processing. I have tried several things. I am sharing the system specs, the current ES configuration, and the logs below. Please suggest a solution.
The System Specs:
RAM: 160 GB,
CPU: AMD EPYC 7702P 64-Core Processor,
Drive: 2 TB SSD (the drive on which ES is installed still has 500 GB free)
ES Configuration JVM options:
-Xms26g,
-Xmx26g
(I just tried this, but I am not sure what the right heap size is for my scenario.)
I only edited the lines above; the rest of the file is left at its defaults. I made this change in the jvm.options files of all three nodes.
ES LOGS
[2021-09-22T12:05:17,983][WARN ][o.e.m.j.JvmGcMonitorService] [sashanode1] [gc][170] overhead, spent [7.1s] collecting in the last [7.2s]
[2021-09-22T12:05:21,868][WARN ][o.e.m.j.JvmGcMonitorService] [sashanode1] [gc][171] overhead, spent [3.7s] collecting in the last [1.9s]
[2021-09-22T12:05:51,190][WARN ][o.e.m.j.JvmGcMonitorService] [sashanode1] [gc][172] overhead, spent [27.7s] collecting in the last [23.3s]
[2021-09-22T12:06:54,629][WARN ][o.e.m.j.JvmGcMonitorService] [cluster_name] [gc][173] overhead, spent [57.5s] collecting in the last [1.1m]
[2021-09-22T12:06:56,536][WARN ][o.e.m.j.JvmGcMonitorService] [cluster_name] [gc][174] overhead, spent [1.9s] collecting in the last [1.9s]
[2021-09-22T12:07:02,176][WARN ][o.e.m.j.JvmGcMonitorService] [cluster_name] [gc][175] overhead, spent [5.4s] collecting in the last [5.6s]
[2021-09-22T12:06:56,546][ERROR][o.e.i.e.Engine ] [cluster_name] [index_name][3] merge failed
java.lang.OutOfMemoryError: Java heap space
[2021-09-22T12:06:56,548][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [cluster_name] fatal error in thread [elasticsearch[cluster_name][bulk][T#25]], exiting
java.lang.OutOfMemoryError: Java heap space
Some more logs
[2021-09-22T12:10:06,526][INFO ][o.e.n.Node ] [cluster_name] initializing ...
[2021-09-22T12:10:06,589][INFO ][o.e.e.NodeEnvironment ] [cluster_name] using [1] data paths, mounts [[(D:)]], net usable_space [563.3gb], net total_space [1.7tb], spins? [unknown], types [NTFS]
[2021-09-22T12:10:06,589][INFO ][o.e.e.NodeEnvironment ] [cluster_name] heap size [1.9gb], compressed ordinary object pointers [true]
[2021-09-22T12:10:07,239][INFO ][o.e.n.Node ] [cluster_name] node name [sashanode1], node ID [2p-ux-OXRKGuxmN0efvF9Q]
[2021-09-22T12:10:07,240][INFO ][o.e.n.Node ] [cluster_name] version[5.6.13], pid[57096], build[4d5320b/2018-10-30T19:05:08.237Z], OS[Windows Server 2019/10.0/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_261/25.261-b12]
[2021-09-22T12:10:07,240][INFO ][o.e.n.Node ] [cluster_name] JVM arguments [-Xms2g, -Xmx2g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Delasticsearch, -Des.path.home=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1, -Des.default.path.logs=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1\logs, -Des.default.path.data=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1\data, -Des.default.path.conf=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1\config, exit, -Xms2048m, -Xmx2048m, -Xss1024k]
Also, my ES folder contains many files with random names (e.g. java_pid197036.hprof).
Further details can be shared; please suggest any further configuration changes.
Thanks
The output for _cluster/stats?pretty&human is
{ "_nodes": { "total": 3, "successful": 3, "failed": 0 }, "cluster_name": "cluster_name", "timestamp": 1632375228033, "status": "red", "indices": { "count": 42, "shards": { "total": 508, "primaries": 217, "replication": 1.3410138248847927, "index": { "shards": { "min": 2, "max": 60, "avg": 12.095238095238095 }, "primaries": { "min": 1, "max": 20, "avg": 5.166666666666667 }, "replication": { "min": 1.0, "max": 2.0, "avg": 1.2857142857142858 } } }, "docs": { "count": 107283077, "deleted": 1047418 }, "store": { "size": "530.2gb", "size_in_bytes": 569385384976, "throttle_time": "0s", "throttle_time_in_millis": 0 }, "fielddata": { "memory_size": "0b", "memory_size_in_bytes": 0, "evictions": 0 }, "query_cache": { "memory_size": "0b", "memory_size_in_bytes": 0, "total_count": 0, "hit_count": 0, "miss_count": 0, "cache_size": 0, "cache_count": 0, "evictions": 0 }, "completion": { "size": "0b", "size_in_bytes": 0 }, "segments": { "count": 3781, "memory": "2gb", "memory_in_bytes": 2174286255, "terms_memory": "1.7gb", "terms_memory_in_bytes": 1863786029, "stored_fields_memory": "105.6mb", "stored_fields_memory_in_bytes": 110789048, "term_vectors_memory": "0b", "term_vectors_memory_in_bytes": 0, "norms_memory": "31.9mb", "norms_memory_in_bytes": 33527808, "points_memory": "13.1mb", "points_memory_in_bytes": 13742470, "doc_values_memory": "145.3mb", "doc_values_memory_in_bytes": 152440900, "index_writer_memory": "0b", "index_writer_memory_in_bytes": 0, "version_map_memory": "0b", "version_map_memory_in_bytes": 0, "fixed_bit_set": "0b", "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": 1632340789677, "file_sizes": { } } }, "nodes": { "count": { "total": 3, "data": 3, "coordinating_only": 0, "master": 1, "ingest": 3 }, "versions": [ "5.6.13" ], "os": { "available_processors": 192, "allocated_processors": 96, "names": [ { "name": "Windows Server 2019", "count": 3 } ], "mem": { "total": "478.4gb", "total_in_bytes": 513717497856, "free": "119.7gb", "free_in_bytes": 128535437312, "used": "358.7gb", "used_in_bytes": 385182060544, "free_percent": 25, "used_percent": 75 } }, "process": { "cpu": { "percent": 5 }, "open_file_descriptors": { "min": -1, "max": -1, "avg": 0 } }, "jvm": { "max_uptime": "1.9d", "max_uptime_in_millis": 167165106, "versions": [ { "version": "1.8.0_261", "vm_name": "Java HotSpot(TM) 64-Bit Server VM", "vm_version": "25.261-b12", "vm_vendor": "Oracle Corporation", "count": 3 } ], "mem": { "heap_used": "5gb", "heap_used_in_bytes": 5460944144, "heap_max": "5.8gb", "heap_max_in_bytes": 6227755008 }, "threads": 835 }, "fs": { "total": "1.7tb", "total_in_bytes": 1920365228032, "free": "499.1gb", "free_in_bytes": 535939969024, "available": "499.1gb", "available_in_bytes": 535939969024 }, "plugins": [ ], "network_types": { "transport_types": { "netty4": 3 }, "http_types": { "netty4": 3 } } } }
The jvm.options file.
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms26g
-Xmx26g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch
## basic
# force the server VM (remove on 32-bit client JVMs)
-server
# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m
# set to headless, just in case
-Djava.awt.headless=true
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
# use our provided JNA always versus the system one
-Djna.nosys=true
# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true
# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
#-XX:HeapDumpPath=${heap.dump.path}
## GC logging
#-XX:+PrintGCDetails
#-XX:+PrintGCTimeStamps
#-XX:+PrintGCDateStamps
#-XX:+PrintClassHistogram
#-XX:+PrintTenuringDistribution
#-XX:+PrintGCApplicationStoppedTime
# log GC status to a file with time stamps
# ensure the directory exists
#-Xloggc:${loggc}
# By default, the GC log file will not rotate.
# By uncommenting the lines below, the GC log file
# will be rotated every 128MB at most 32 times.
#-XX:+UseGCLogFileRotation
#-XX:NumberOfGCLogFiles=32
#-XX:GCLogFileSize=128M
# Elasticsearch 5.0.0 will throw an exception on unquoted field names in JSON.
# If documents were already indexed with unquoted fields in a previous version
# of Elasticsearch, some operations may throw errors.
#
# WARNING: This option will be removed in Elasticsearch 6.0.0 and is provided
# only for migration purposes.
#-Delasticsearch.json.allow_unquoted_field_names=true
and the elasticsearch.yml (master node)
cluster.name: cluster_name
node.name: node1
node.master : true
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.11.159", "192.168.11.157"]

My issue is solved. It was a heap size issue: I run ES as a Windows service, and the service was still using the default 2 GB heap, so my jvm.options change was not taking effect.
I installed the service again with the updated jvm.options file (heap size of 10 GB) and restarted my cluster. The heap size then went from 2 GB to 10 GB, and my problem was solved. Thanks for the suggestions.
To check your heap size, use this request:
http://localhost:9200/_cat/nodes?h=heap*&v
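In case it helps others, here is a minimal (hypothetical) Python sketch of that check, assuming the cluster answers on localhost:9200 without authentication. It reads heap_max_in_bytes from the node stats so you can see whether each node picked up the new jvm.options or is still running on the old default:

# Check that every node actually picked up the heap configured in jvm.options.
# Assumes the cluster is reachable on localhost:9200 with no authentication.
import json
from urllib.request import urlopen

EXPECTED_HEAP_GB = 10  # whatever you set via -Xms/-Xmx in jvm.options

stats = json.load(urlopen("http://localhost:9200/_nodes/stats/jvm"))
for node in stats["nodes"].values():
    heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
    status = "OK" if heap_max_gb >= EXPECTED_HEAP_GB * 0.9 else "still on the old heap?"
    print(f'{node["name"]}: heap_max = {heap_max_gb:.1f} GB  [{status}]')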

Related

Is it possible to change the priority of the create_snapshot task from NORMAL to HIGH or URGENT?

I have an Elasticsearch cluster with 6 data nodes and 3 master nodes.
When I run a snapshot I get the error "process_cluster_event_timeout_exception".
Looking at "/_cat/pending_tasks", my cluster has 69 tasks with priority HIGH and source put-mapping.
My cluster is used for centralized logging, and these processes put data into it:
logstash - collects from Redis and writes to Elasticsearch
apm-server
filebeat
metricbeat
I have been removing beats and some applications from apm-server.
Is it possible to change the priority of the create_snapshot task from NORMAL to HIGH or URGENT?
If that is not a solution, how do I check the correct size for my cluster?
*Normally I keep indices for 7 days in my cluster because of the backup.
But because of this error, I removed the process that deletes the old data.
GET _cat/nodes?v&s=node.role:desc
ip        heap.percent  ram.percent  cpu  load_1m  load_5m  load_15m  node.role  master  name
10.0.2.8  47            50           0    0.00     0.00     0.00      mi         -       prd-elasticsearch-i-020
10.0.0.7  14            50           0    0.00     0.00     0.00      mi         -       prd-elasticsearch-i-0ab
10.0.1.1  47            77           29   1.47     1.72     1.66      mi         *       prd-elasticsearch-i-0e2
10.0.2.7  58            95           19   8.04     8.62     8.79      d          -       prd-elasticsearch-i-0b4
10.0.2.4  59            97           20   8.22     8.71     8.76      d          -       prd-elasticsearch-i-00d
10.0.1.6  62            94           38   11.42    8.87     8.89      d          -       prd-elasticsearch-i-0ff
10.0.0.6  67            97           25   8.97     10.45    10.47     d          -       prd-elasticsearch-i-01a
10.0.0.9  57            98           32   11.63    9.64     9.17      d          -       prd-elasticsearch-i-005
10.0.1.0  62            96           19   10.45    9.53     9.31      d          -       prd-elasticsearch-i-088
My cluster definitions:
{
"_nodes": {
"total": 9,
"successful": 9,
"failed": 0
},
"cluster_name": "prd-elasticsearch",
"cluster_uuid": "xxxx",
"timestamp": 1607609607018,
"status": "green",
"indices": {
"count": 895,
"shards": {
"total": 14006,
"primaries": 4700,
"replication": 1.98,
"index": {
"shards": {
"min": 2,
"max": 18,
"avg": 15.649162011173184
},
"primaries": {
"min": 1,
"max": 6,
"avg": 5.251396648044692
},
"replication": {
"min": 1,
"max": 2,
"avg": 1.9787709497206705
}
}
},
"docs": {
"count": 14896803950,
"deleted": 843126
},
"store": {
"size_in_bytes": 16778620001453
},
"fielddata": {
"memory_size_in_bytes": 4790672272,
"evictions": 0
},
"query_cache": {
"memory_size_in_bytes": 7689832903,
"total_count": 2033762560,
"hit_count": 53751516,
"miss_count": 1980011044,
"cache_size": 4087727,
"cache_count": 11319866,
"evictions": 7232139
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 155344,
"memory_in_bytes": 39094918196,
"terms_memory_in_bytes": 31533157295,
"stored_fields_memory_in_bytes": 5574613712,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 449973760,
"points_memory_in_bytes": 886771949,
"doc_values_memory_in_bytes": 650401480,
"index_writer_memory_in_bytes": 905283962,
"version_map_memory_in_bytes": 1173400,
"fixed_bit_set_memory_in_bytes": 12580800,
"max_unsafe_auto_id_timestamp": 1607606224903,
"file_sizes": {}
}
},
"nodes": {
"count": {
"total": 9,
"data": 6,
"coordinating_only": 0,
"master": 3,
"ingest": 3
},
"versions": [
"6.8.1"
],
"os": {
"available_processors": 108,
"allocated_processors": 108,
"names": [
{
"name": "Linux",
"count": 9
}
],
"pretty_names": [
{
"pretty_name": "CentOS Linux 7 (Core)",
"count": 9
}
],
"mem": {
"total_in_bytes": 821975162880,
"free_in_bytes": 50684043264,
"used_in_bytes": 771291119616,
"free_percent": 6,
"used_percent": 94
}
},
"process": {
"cpu": {
"percent": 349
},
"open_file_descriptors": {
"min": 429,
"max": 9996,
"avg": 6607
}
},
"jvm": {
"max_uptime_in_millis": 43603531934,
"versions": [
{
"version": "1.8.0_222",
"vm_name": "OpenJDK 64-Bit Server VM",
"vm_version": "25.222-b10",
"vm_vendor": "Oracle Corporation",
"count": 9
}
],
"mem": {
"heap_used_in_bytes": 137629451248,
"heap_max_in_bytes": 205373571072
},
"threads": 1941
},
"fs": {
"total_in_bytes": 45245361229824,
"free_in_bytes": 28231010959360,
"available_in_bytes": 28231011147776
},
"plugins": [
{
"name": "repository-s3",
"version": "6.8.1",
"elasticsearch_version": "6.8.1",
"java_version": "1.8",
"description": "The S3 repository plugin adds S3 repositories",
"classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
"extended_plugins": [],
"has_native_controller": false
}
],
"network_types": {
"transport_types": {
"security4": 9
},
"http_types": {
"security4": 9
}
}
}
}
Data nodes: 6 r4.4xlarge instances
Master nodes: 3 m5.large instances
No, it is not possible to change the priority of the create_snapshot task.
Since you have 69 pending tasks, it seems you are doing too many mapping updates.
Regarding the correct size of the cluster, I would recommend you go through the following blog posts:
https://www.elastic.co/blog/found-sizing-elasticsearch
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html
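To confirm that put-mapping updates are what is crowding out the snapshot task, you can group the pending-task queue by priority and source. A rough sketch, assuming the cluster is reachable on localhost:9200 without security:

# Summarize the cluster pending-task queue by priority and task source.
import json
from collections import Counter
from urllib.request import urlopen

tasks = json.load(urlopen("http://localhost:9200/_cluster/pending_tasks"))["tasks"]
counts = Counter((t["priority"], t["source"].split()[0]) for t in tasks)
for (priority, source), count in counts.most_common():
    print(f"{count:4d}  {priority:<6}  {source}")

If almost everything in the queue is put-mapping, reducing dynamic mapping updates (for example with index templates and stricter mappings) will usually shrink the queue faster than any attempt to reprioritize the snapshot.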

Finding active frameworks' current resource usage in Mesos

Which HTTP endpoint will help me find the current resource utilization of all active frameworks?
We want this information because we want to scale the Mesos cluster dynamically, and our algorithm needs to know what resources each active framework is using.
I think focusing on the frameworks is not really what you want to do. What you're after is probably the Mesos slave utilization, which can be requested by calling
http://{mesos-master}:5050/master/state-summary
In the JSON response, you'll find a slaves property containing an array of slave objects:
{
"hostname": "192.168.0.3",
"cluster": "mesos-hw-cluster",
"slaves": [{
"id": "bd9c29d7-8530-4c5b-8c50-5d2f60dffbf6-S2",
"pid": "slave(1)#192.168.0.1:5051",
"hostname": "192.168.0.1",
"registered_time": 1456826950.99075,
"resources": {
"cpus": 12.0,
"disk": 1840852.0,
"mem": 63304.0,
"ports": "[31000-32000]"
},
"used_resources": {
"cpus": 5.75,
"disk": 0.0,
"mem": 14376.0,
"ports": "[31000-31000, 31109-31109, 31267-31267, 31699-31699, 31717-31717, 31907-31907, 31979-31980]"
},
"offered_resources": {
"cpus": 0.0,
"disk": 0.0,
"mem": 0.0
},
"reserved_resources": {},
"unreserved_resources": {
"cpus": 12.0,
"disk": 1840852.0,
"mem": 63304.0,
"ports": "[31000-32000]"
},
"attributes": {},
"active": true,
"version": "0.27.1",
"TASK_STAGING": 0,
"TASK_STARTING": 0,
"TASK_RUNNING": 7,
"TASK_FINISHED": 18,
"TASK_KILLED": 27,
"TASK_FAILED": 3,
"TASK_LOST": 0,
"TASK_ERROR": 0,
"framework_ids": ["bd9c29d7-8530-4c5b-8c50-5d2f60dffbf6-0000", "bd9c29d7-8530-4c5b-8c50-5d2f60dffbf6-0002"]
},
...
}
You could iterate over all the slave objects and calculate the overall resource usage by summing the resources and then subtracting the sum of the used_resources.
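For example, a small Python sketch of that aggregation could look like the following (the master address is a placeholder, and no authentication is assumed):

# Aggregate total vs. used resources across all slaves reported by state-summary.
import json
from urllib.request import urlopen

MESOS_MASTER = "http://192.168.0.3:5050"  # placeholder; point at your Mesos master

summary = json.load(urlopen(MESOS_MASTER + "/master/state-summary"))
totals, used = {}, {}
for slave in summary["slaves"]:
    for key in ("cpus", "mem", "disk"):
        totals[key] = totals.get(key, 0.0) + slave["resources"][key]
        used[key] = used.get(key, 0.0) + slave["used_resources"][key]

for key in totals:
    print(f"{key}: total={totals[key]:.1f} used={used[key]:.1f} free={totals[key] - used[key]:.1f}")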
See
http://mesos.apache.org/documentation/latest/endpoints/master/state-summary/
http://mesos.apache.org/documentation/latest/endpoints/

How to monitor all CircuitBreakers in ElasticSearch

Is it possible to monitor the limits and sizes of all circuit breakers?
The fielddata breaker can be monitored per node using:
GET _nodes/stats/breaker,http
But how can we monitor the other breakers, like breaker.request and breaker.total?
Elasticsearch version: 1.3.5
I think those breakers are only available from 1.4.x on. See this PR on GitHub with details that seem to indicate this.
I have also tested this briefly, and I can see the additional request breaker:
"breakers": {
"request": {
"limit_size_in_bytes": 415550668,
"limit_size": "396.2mb",
"estimated_size_in_bytes": 0,
"estimated_size": "0b",
"overhead": 1,
"tripped": 0
},
"fielddata": {
"limit_size_in_bytes": 623326003,
"limit_size": "594.4mb",
"estimated_size_in_bytes": 2847496,
"estimated_size": "2.7mb",
"overhead": 1.03,
"tripped": 0
},
"parent": {
"limit_size_in_bytes": 727213670,
"limit_size": "693.5mb",
"estimated_size_in_bytes": 2847496,
"estimated_size": "2.7mb",
"overhead": 1,
"tripped": 0
}
}
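Once you are on 1.4.x or later, a small polling script can watch every breaker on every node, including the request and parent breakers. A minimal sketch, assuming localhost:9200 and no authentication:

# Report each circuit breaker's usage relative to its limit, plus its tripped count.
import json
from urllib.request import urlopen

stats = json.load(urlopen("http://localhost:9200/_nodes/stats/breaker"))
for node in stats["nodes"].values():
    for name, breaker in node["breakers"].items():
        pct = 100.0 * breaker["estimated_size_in_bytes"] / max(breaker["limit_size_in_bytes"], 1)
        print(f'{node["name"]:<25} {name:<10} {pct:5.1f}% of limit, tripped {breaker["tripped"]} times')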

elasticsearch: Shard replica sizing is disparate

This morning, we were alerted that a few machines in a 4-node Elasticsearch cluster are running low on disk space. We're using 5 shards with single replication.
The status report shows index sizes that line up with the disk usage we're seeing. The confusing thing, however, is that the replica sizes are far out of line for shards 2 and 4. I understand that shard sizes can vary between copies, but the differences we are seeing are enormous:
"shards": {
...
"2": [
{
"routing": {
"state": "STARTED",
"primary": true,
"node": "LV__Sh-vToyTcuuxwnZaAg",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 87706293809
},
......
},
{
"routing": {
"state": "STARTED",
"primary": false,
"node": "U7eYVll7ToWS9lPkyhql6g",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 42652984946
},
Some interesting data points:
There's a lot of merging happening on the cluster at the moment. Could this be a factor? If merges make a copy of the index prior to merging, that would explain a lot.
The bigger shard is almost exactly twice the size of the smaller shard. Coincidence? (I think not.)
Why are we seeing primary shards at double the size of their replicas in our cluster? Merging? Some other reason?
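One way to check whether merge state explains the gap is to compare the size and segment count of each copy of a shard. A rough sketch, assuming localhost:9200 and the eventdata index from the question (the level=shards stats parameter is my assumption; adjust for your version):

# Compare primary vs. replica size and segment count per shard of the "eventdata" index.
import json
from urllib.request import urlopen

stats = json.load(urlopen("http://localhost:9200/eventdata/_stats?level=shards"))
shards = stats["indices"]["eventdata"]["shards"]
for shard_id, copies in sorted(shards.items(), key=lambda kv: int(kv[0])):
    for copy in copies:
        role = "primary" if copy["routing"]["primary"] else "replica"
        size_gb = copy["store"]["size_in_bytes"] / 1024 ** 3
        segs = copy["segments"]["count"]
        print(f"shard {shard_id}: {role:<7} {size_gb:7.1f} GB in {segs} segments")

If the larger copy also reports far more (not-yet-merged) segments, the size gap is likely transient merge overhead rather than real data skew.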

What is the meaning of throttle_time_in_millis Elasticsearch stats?

I created an index in a 4-node Elasticsearch cluster and added about 3.5 million documents using the Java Elasticsearch API.
When asking for the stats, I get a very high number in throttle_time_in_millis, as follows:
{
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"_all": {
"primaries": {
"docs": {
"count": 3855540,
"deleted": 0
},
"store": {
"size_in_bytes": 1203074796,
"throttle_time_in_millis": 980255
},
"indexing": {
"index_total": 3855540,
"index_time_in_millis": 426300,
"index_current": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0
},
What is the meaning of throttle_time_in_millis?
What could be the reason for this to increase?
Thanks in advance
I'm not 100% sure on this, but looking at the Java source code available here and the description of store stats available here, I think it measures the total time that writes to the store were throttled while segments were being merged. It can be an indication of poor disk I/O. An increase in throttle_time_in_millis would only suggest the disk is failing if you have earlier benchmarks showing it used to be quicker; if the figure is consistently this high, I would argue it's just a symptom of the disk type you're using or the number of documents you're storing. If you're using a traditional HDD, could you try switching to an SSD?
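As a rough way to put that number in context, you can compare the throttle time with the indexing time from the same stats: with the figures above, 980255 ms of throttling against 426300 ms of indexing means writes were throttled for more than twice the time spent actually indexing. A hypothetical sketch of that comparison (index name and localhost endpoint are placeholders):

# Compare store throttle time with indexing time for one index.
import json
from urllib.request import urlopen

stats = json.load(urlopen("http://localhost:9200/myindex/_stats/store,indexing"))
prim = stats["_all"]["primaries"]
throttle_ms = prim["store"]["throttle_time_in_millis"]
index_ms = prim["indexing"]["index_time_in_millis"]
print(f"store throttled for {throttle_ms / 1000:.0f}s vs {index_ms / 1000:.0f}s spent indexing "
      f"({throttle_ms / max(index_ms, 1):.1f}x)")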
