What is the meaning of throttle_time_in_millis Elasticsearch stats? - performance

I created an index in a 4-node Elasticsearch cluster and added about 3.5 million documents using the Java Elasticsearch API.
When I ask for the index stats, I get a very high number for throttle_time_in_millis:
{
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"_all": {
"primaries": {
"docs": {
"count": 3855540,
"deleted": 0
},
"store": {
"size_in_bytes": 1203074796,
"throttle_time_in_millis": 980255
},
"indexing": {
"index_total": 3855540,
"index_time_in_millis": 426300,
"index_current": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0
},
What is the meaning of throttle_time_in_millis?
What could cause it to increase?
Thanks in advance.

I'm not 100% sure on this, but looking at the Java source code available here and the description of store stats available here, I believe this is the total time that writes to the store were throttled, which happens when segment merging cannot keep up with indexing. It can be an indication of poor disk I/O. An increase in throttle_time_in_millis could mean the disk is degrading, but only if you have earlier benchmarks showing it used to be lower. If the figure is consistently this high, I would argue it is simply a symptom of the disk type you are using or the number of documents you are storing. If you're using a traditional HDD, could you try switching to an SSD?
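If you want to poke at this yourself, the store section of the index stats exposes the counter directly, and on older (1.x) clusters store throttling can be tuned via cluster settings. A minimal sketch, assuming an index called myindex and an illustrative throughput value:

GET /myindex/_stats/store

PUT /_cluster/settings
{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "100mb"
  }
}

Raising max_bytes_per_sec (or setting the type to none) reduces the time indexing spends throttled, at the cost of letting merges compete harder for disk I/O; note that on 2.x and later this manual setting was replaced by automatic merge I/O throttling.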

Related

Elasticsearch service hang and kills while data insertion jvm heap

I am using Elasticsearch 5.6.13 and need some expert advice on its configuration. I have 3 nodes on the same system (node1, node2, node3), where node1 is the master and the other two are data nodes. I have around 40 indices, all created with the default 5 primary shards, and some of them have 2 replicas.
The issue I am facing right now: my (scraped) data is growing day by day, one of my indices holds 400 GB, and 3 other indices are also heavily loaded.
For the last few days, while inserting data, Elasticsearch hangs and then the service is killed, which disrupts my processing. I have tried several things. I am sharing the system specs, the current ES configuration, and the logs. Please suggest a solution.
The System Specs:
RAM: 160 GB,
CPU: AMD EPYC 7702P 64-Core Processor,
Drive: 2 TB SSD (The drive in which the ES installed still have 500 GB left)
ES configuration JVM options:
-Xms26g
-Xmx26g
(I just tried this, but I am not sure what the right heap size is for my scenario.)
I only edited the two lines above; the rest of the file is left at its defaults. I made this change in the jvm.options file on all three nodes.
ES LOGS
[2021-09-22T12:05:17,983][WARN ][o.e.m.j.JvmGcMonitorService] [sashanode1] [gc][170] overhead, spent [7.1s] collecting in the last [7.2s]
[2021-09-22T12:05:21,868][WARN ][o.e.m.j.JvmGcMonitorService] [sashanode1] [gc][171] overhead, spent [3.7s] collecting in the last [1.9s]
[2021-09-22T12:05:51,190][WARN ][o.e.m.j.JvmGcMonitorService] [sashanode1] [gc][172] overhead, spent [27.7s] collecting in the last [23.3s]
[2021-09-22T12:06:54,629][WARN ][o.e.m.j.JvmGcMonitorService] [cluster_name] [gc][173] overhead, spent [57.5s] collecting in the last [1.1m]
[2021-09-22T12:06:56,536][WARN ][o.e.m.j.JvmGcMonitorService] [cluster_name] [gc][174] overhead, spent [1.9s] collecting in the last [1.9s]
[2021-09-22T12:07:02,176][WARN ][o.e.m.j.JvmGcMonitorService] [cluster_name] [gc][175] overhead, spent [5.4s] collecting in the last [5.6s]
[2021-09-22T12:06:56,546][ERROR][o.e.i.e.Engine ] [cluster_name] [index_name][3] merge failed
java.lang.OutOfMemoryError: Java heap space
[2021-09-22T12:06:56,548][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [cluster_name] fatal error in thread [elasticsearch[cluster_name][bulk][T#25]], exiting
java.lang.OutOfMemoryError: Java heap space
Some more logs
[2021-09-22T12:10:06,526][INFO ][o.e.n.Node ] [cluster_name] initializing ...
[2021-09-22T12:10:06,589][INFO ][o.e.e.NodeEnvironment ] [cluster_name] using [1] data paths, mounts [[(D:)]], net usable_space [563.3gb], net total_space [1.7tb], spins? [unknown], types [NTFS]
[2021-09-22T12:10:06,589][INFO ][o.e.e.NodeEnvironment ] [cluster_name] heap size [1.9gb], compressed ordinary object pointers [true]
[2021-09-22T12:10:07,239][INFO ][o.e.n.Node ] [cluster_name] node name [sashanode1], node ID [2p-ux-OXRKGuxmN0efvF9Q]
[2021-09-22T12:10:07,240][INFO ][o.e.n.Node ] [cluster_name] version[5.6.13], pid[57096], build[4d5320b/2018-10-30T19:05:08.237Z], OS[Windows Server 2019/10.0/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_261/25.261-b12]
[2021-09-22T12:10:07,240][INFO ][o.e.n.Node ] [cluster_name] JVM arguments [-Xms2g, -Xmx2g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Delasticsearch, -Des.path.home=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1, -Des.default.path.logs=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1\logs, -Des.default.path.data=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1\data, -Des.default.path.conf=D:\Databases\ES\elastic and kibana 5.6.13\es_node_1\config, exit, -Xms2048m, -Xmx2048m, -Xss1024k]
Also, my ES folder contains many files with random names such as java_pid197036.hprof (heap dumps written because of -XX:+HeapDumpOnOutOfMemoryError).
I can share further details; please suggest any further configuration changes.
Thanks
The output for _cluster/stats?pretty&human is
{ "_nodes": { "total": 3, "successful": 3, "failed": 0 }, "cluster_name": "cluster_name", "timestamp": 1632375228033, "status": "red", "indices": { "count": 42, "shards": { "total": 508, "primaries": 217, "replication": 1.3410138248847927, "index": { "shards": { "min": 2, "max": 60, "avg": 12.095238095238095 }, "primaries": { "min": 1, "max": 20, "avg": 5.166666666666667 }, "replication": { "min": 1.0, "max": 2.0, "avg": 1.2857142857142858 } } }, "docs": { "count": 107283077, "deleted": 1047418 }, "store": { "size": "530.2gb", "size_in_bytes": 569385384976, "throttle_time": "0s", "throttle_time_in_millis": 0 }, "fielddata": { "memory_size": "0b", "memory_size_in_bytes": 0, "evictions": 0 }, "query_cache": { "memory_size": "0b", "memory_size_in_bytes": 0, "total_count": 0, "hit_count": 0, "miss_count": 0, "cache_size": 0, "cache_count": 0, "evictions": 0 }, "completion": { "size": "0b", "size_in_bytes": 0 }, "segments": { "count": 3781, "memory": "2gb", "memory_in_bytes": 2174286255, "terms_memory": "1.7gb", "terms_memory_in_bytes": 1863786029, "stored_fields_memory": "105.6mb", "stored_fields_memory_in_bytes": 110789048, "term_vectors_memory": "0b", "term_vectors_memory_in_bytes": 0, "norms_memory": "31.9mb", "norms_memory_in_bytes": 33527808, "points_memory": "13.1mb", "points_memory_in_bytes": 13742470, "doc_values_memory": "145.3mb", "doc_values_memory_in_bytes": 152440900, "index_writer_memory": "0b", "index_writer_memory_in_bytes": 0, "version_map_memory": "0b", "version_map_memory_in_bytes": 0, "fixed_bit_set": "0b", "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": 1632340789677, "file_sizes": { } } }, "nodes": { "count": { "total": 3, "data": 3, "coordinating_only": 0, "master": 1, "ingest": 3 }, "versions": [ "5.6.13" ], "os": { "available_processors": 192, "allocated_processors": 96, "names": [ { "name": "Windows Server 2019", "count": 3 } ], "mem": { "total": "478.4gb", "total_in_bytes": 513717497856, "free": "119.7gb", "free_in_bytes": 128535437312, "used": "358.7gb", "used_in_bytes": 385182060544, "free_percent": 25, "used_percent": 75 } }, "process": { "cpu": { "percent": 5 }, "open_file_descriptors": { "min": -1, "max": -1, "avg": 0 } }, "jvm": { "max_uptime": "1.9d", "max_uptime_in_millis": 167165106, "versions": [ { "version": "1.8.0_261", "vm_name": "Java HotSpot(TM) 64-Bit Server VM", "vm_version": "25.261-b12", "vm_vendor": "Oracle Corporation", "count": 3 } ], "mem": { "heap_used": "5gb", "heap_used_in_bytes": 5460944144, "heap_max": "5.8gb", "heap_max_in_bytes": 6227755008 }, "threads": 835 }, "fs": { "total": "1.7tb", "total_in_bytes": 1920365228032, "free": "499.1gb", "free_in_bytes": 535939969024, "available": "499.1gb", "available_in_bytes": 535939969024 }, "plugins": [ ], "network_types": { "transport_types": { "netty4": 3 }, "http_types": { "netty4": 3 } } } }
The jvm.options file.
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms26g
-Xmx26g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch
## basic
# force the server VM (remove on 32-bit client JVMs)
-server
# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m
# set to headless, just in case
-Djava.awt.headless=true
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
# use our provided JNA always versus the system one
-Djna.nosys=true
# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true
# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
#-XX:HeapDumpPath=${heap.dump.path}
## GC logging
#-XX:+PrintGCDetails
#-XX:+PrintGCTimeStamps
#-XX:+PrintGCDateStamps
#-XX:+PrintClassHistogram
#-XX:+PrintTenuringDistribution
#-XX:+PrintGCApplicationStoppedTime
# log GC status to a file with time stamps
# ensure the directory exists
#-Xloggc:${loggc}
# By default, the GC log file will not rotate.
# By uncommenting the lines below, the GC log file
# will be rotated every 128MB at most 32 times.
#-XX:+UseGCLogFileRotation
#-XX:NumberOfGCLogFiles=32
#-XX:GCLogFileSize=128M
# Elasticsearch 5.0.0 will throw an exception on unquoted field names in JSON.
# If documents were already indexed with unquoted fields in a previous version
# of Elasticsearch, some operations may throw errors.
#
# WARNING: This option will be removed in Elasticsearch 6.0.0 and is provided
# only for migration purposes.
#-Delasticsearch.json.allow_unquoted_field_names=true
and the elasticsearch.yml (master node)
cluster.name: cluster_name
node.name: node1
node.master : true
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.11.159", "192.168.11.157"]
My issue is solved. It was indeed the heap size: I was running ES as a Windows service, which kept the default 2 GB heap, so my edit to jvm.options was not being picked up.
I reinstalled the service with the updated jvm.options file set to a 10 GB heap and then started my cluster again. The nodes now report a 10 GB heap instead of 2 GB, and my problem is solved. Thanks for the suggestions.
To check your heap size, use this command:
http://localhost:9200/_cat/nodes?h=heap*&v
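Since the root cause was the service ignoring jvm.options, it can also help to confirm which JVM arguments and heap each node actually started with. A quick check using the nodes info API (filter_path just trims the output; adjust the host as needed):

GET /_nodes/jvm?human&filter_path=nodes.*.name,nodes.*.jvm.mem,nodes.*.jvm.input_arguments

If the heap max reported there is still 2 GB after editing jvm.options, the service is not picking up the file and needs to be reinstalled or reconfigured.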

Possible to get the size_in_bytes for records matching a specific query?

The documentation on the stats API indicates that we can do the following:
http://es.cluster.ip.addr:9200/indexname/_stats
which results in output like:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"_all": {
"primaries": {
"docs": {
"count": 32930,
"deleted": 0
},
"store": {
"size_in_bytes": 3197332,
"throttle_time_in_millis": 0
},
// ... etc
}
}
}
My question is: is there a way to obtain the on-disk size for a specific set of records, such as those matched by a search query:
http://es.cluster.ip.addr:9200/indexname/type/_search?q=identifier:123
So essentially, the size_in_bytes for all records matching identifier 123?

Timeout in Elasticsearch query

I have the following Elasticsearch query and I want to apply a timeout, so I used the "timeout" param.
GET testdata-2016.04.14/_search
{
"size": 10000,
"timeout": "1ms"
}
I have set the timeout to 1ms, but I observed that the query takes more than 5000ms. I have also tried the query like this:
GET testdata-2016.04.14/_search?timeout=1ms
{
"size": 10000
}
In both cases, I get the response below after approximately 5000ms.
{
"took": 126,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 26536,
"max_score": 1,
"hits": [
{
...................
...................
}
}
}
I am not sure what is happening here. Is anything missing in the queries above? Please help.
I have tried to find a solution on Google but didn't find anything that works.

elasticsearch group every X values

Using Elasticsearch, I query for a particular field. Is there a way to aggregate every X values?
For instance, let's say I query 10 documents for field "myField", returning 10 values,
1, 4, 2, 4, 5, 3, 3, 2, 1, 4.
Is there a way to aggregate such that every 2 values are averaged, yielding
2.5, 3, 4, 2.5, 2.5 ?
You can do some interesting--and perhaps inadvisable--stuff with scripted metric aggregations. They let you define map-reduce scripts that run against your documents. You can get yourself in trouble with this, of course.
But just to see if I could do it, I set up a simple, single-shard index with the data you provided:
PUT /test_index
{"settings": {"number_of_shards": 1}}
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"num":1}
{"index":{"_id":2}}
{"num":4}
{"index":{"_id":3}}
{"num":2}
{"index":{"_id":4}}
{"num":4}
{"index":{"_id":5}}
{"num":5}
{"index":{"_id":6}}
{"num":3}
{"index":{"_id":7}}
{"num":3}
{"index":{"_id":8}}
{"num":2}
{"index":{"_id":9}}
{"num":1}
{"index":{"_id":10}}
{"num":4}
Then I can average every two documents like this:
POST /test_index/_search
{
"size": 0,
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "_agg['nums'] = []; _agg['avgs'] = [];",
"map_script" : "_agg.nums.add(doc['num'].value); if(_agg.nums.size() == 2){ _agg.avgs.add((_agg.nums[0] + _agg.nums[1])/2.0); _agg['nums'] = [];}",
"combine_script" : "return _agg.avgs",
"reduce_script" : "return _aggs"
}
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"profit": {
"value": [
[
2.5,
3,
4,
2.5,
2.5
]
]
}
}
}
It doesn't seem to respect sorting order in the query, though the outcome is deterministic as far as I can tell.
What I did here only works with a single shard; you could probably generalize it somehow if you wanted to tinker with it long enough.
Also, big fat disclaimer: doing this in production might be a bad idea. You'd want to test this sort of thing out on small data sets first before you potentially crash your cluster with out-of-memory errors. Also only use scripting if your cluster isn't open to the big bad Internet.
Here is some code I used to play around with it:
http://sense.qbox.io/gist/c31f089e63200127fd9ca09992004db8bb11b890
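For what it's worth, the _agg syntax above is from the old Groovy scripting era (ES 1.x/2.x). On recent versions the same idea would be written in Painless using the state and states objects, and a reduce_script is required. A rough, untested sketch with the same single-shard caveat:

POST /test_index/_search
{
  "size": 0,
  "aggs": {
    "pairwise_avg": {
      "scripted_metric": {
        "init_script": "state.nums = []; state.avgs = [];",
        "map_script": "state.nums.add(doc['num'].value); if (state.nums.size() == 2) { state.avgs.add((state.nums[0] + state.nums[1]) / 2.0); state.nums = []; }",
        "combine_script": "return state.avgs;",
        "reduce_script": "def all = []; for (def s : states) { all.addAll(s); } return all;"
      }
    }
  }
}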

elasticsearch: Shard replica sizing is disparate

This morning, we were alerted that a few machines in a 4-node Elasticsearch cluster are running low on disk space. We're using 5 shards with single replication.
The status report shows index sizes that line up with the disk usage we're seeing. However, the confusing thing is that the primary and replica sizes are far out of line for shards 2 and 4. I understand that shard sizes can vary between primary and replica; however, the differences we are seeing are enormous:
"shards": {
...
"2": [
{
"routing": {
"state": "STARTED",
"primary": true,
"node": "LV__Sh-vToyTcuuxwnZaAg",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 87706293809
},
......
},
{
"routing": {
"state": "STARTED",
"primary": false,
"node": "U7eYVll7ToWS9lPkyhql6g",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 42652984946
},
Some interesting data points:
There's a lot of merging happening on the cluster at the moment. Could this be a factor? If merges keep a copy of the old segments around while the merged segment is being written, that would explain a lot.
The bigger shard is almost exactly twice the size of the smaller shard. Coincidence? (I think not.)
Why are we seeing primary shards at double the size of their replicas in our cluster? Merging? Some other reason?
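Not a full answer, but one way to investigate (a hedged suggestion, using the cat APIs against the eventdata index from the question) is to compare per-shard store sizes and per-segment breakdowns side by side. A primary in the middle of a large merge temporarily holds both the old segments and the newly written merged segment, which can roughly double its on-disk size until the old segments are deleted:

GET /_cat/shards/eventdata?v&h=shard,prirep,docs,store,node
GET /_cat/segments/eventdata?v&h=shard,prirep,segment,size,docs.count,committed,searchable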
