Slow indexing speed of Elasticsearch - performance

We deployed ES 2.0 on three EC2 c4.4xlarge nodes (16 cores, 32 GB memory each), allocated 16 GB of heap to ES, and attached a 500 GB io1 EBS volume with 4,000 IOPS to each node.
Problem: We expected great performance from this hardware configuration, but we are seeing very slow indexing.
Our documents are about 10-50 KB in size, and we are using the Java transport client to insert them. The speed was fine for the first 50,000 documents, at roughly 1,000/second, then dropped dramatically to 100-200/second.
Meanwhile, resource consumption is low:
CPU is only about 1-20% (16-core CPU)
IO writes are only about 4-10 MB/second
Memory consumption is only about 20-30%
Requirements: I cannot understand why indexing is so slow while all the resources are so free. What can I do to improve the efficiency? Thanks.
Here is the config file we are using:
cluster.name: {{ env }}-{{ app }}
path.data: /data/es
path.logs: /data/es-logs
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["xxxx"]
bootstrap.mlockall: true
threadpool.search.queue_size: 300
threadpool.index.type: fixed
threadpool.index.size: 16
threadpool.index.queue_size: 250000
index.refresh_interval: 1s
index.translog.flush_threshold_ops: 50000
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb
script.inline: on
script.indexed: on
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/
Here is the htop and iostat output while running the job (screenshots omitted):

Upgrade your ES to the latest version; recent releases are more production friendly, and the most stable release right now is the latest one, 2.3.
You can try the following things to make indexing go faster:
Add dedicated master nodes, separate from the data nodes, as this reduces load across your whole cluster.
Disable OS swapping (ES takes care of that) and check your heap size on all your machines: Heap Sizing
Check that your documents are always of a similar size; you can make use of bulk indexing and tweak its settings, such as the chunk size in number of records or in memory size (see the sketch after this list).
If you are using scripts, try to optimize them, as they slow indexing down; where possible, precompute the scripted values and store them in the documents, since ES is not designed to handle heavy scripting.
Check the number of shards per node and try to balance that out across nodes using Routing.
Read more on how the Elasticsearch team suggests setting up a production-ready system: Elasticsearch in Production
One more blog post on increasing Elasticsearch indexing performance: Performance Considerations for Elasticsearch Indexing
Check this answer for the optimal way to set up the ELK stack on three servers: Optimal way to set up ELK stack on three servers
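For example, with the Java transport client you are already using, bulk indexing is typically driven through a BulkProcessor. A minimal sketch, assuming an ES 2.x transport Client is already built; the index/type names and the 1000-action / 5 MB / two-concurrent-request thresholds are placeholders you would tune for your 10-50 KB documents:

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkIndexer {
    public static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) { }
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                // log partial failures instead of silently dropping them
                if (response.hasFailures()) {
                    System.err.println(response.buildFailureMessage());
                }
            }
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {
                failure.printStackTrace();                           // the whole bulk request failed
            }
        })
        .setBulkActions(1000)                                        // flush after 1000 documents...
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))          // ...or 5 MB of payload, whichever comes first
        .setFlushInterval(TimeValue.timeValueSeconds(5))             // ...or at the latest every 5 seconds
        .setConcurrentRequests(2)                                    // allow two bulk requests in flight
        .build();
    }
}

Documents are then added with bulkProcessor.add(new IndexRequest("my-index", "my-type").source(json)), and the processor takes care of batching and flushing.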

Related

memory management for elasticsearch

I am trying to calculate a good balance of total memory in a three-node ES cluster.
If I have a three-node ES cluster, each node with 32 GB of memory and 8 vCPUs, which combination would be more suitable for balancing memory between all the components? I know there is no fixed answer, but I am trying to get as close as I can.
The Elastic components that will be used are Beats (Filebeat, Metricbeat, Heartbeat), Logstash, Elasticsearch, and Kibana.
The main use case for this cluster will be indexing application logs and querying them through curl calls, e.g. fetching the average response time over 7 or 30 days, or counting the different status codes over the last 24 hours or 7 days, so aggregations will be used. The other use case is monitoring and viewing logs through Kibana, but no ML jobs, dashboard creation, etc.
After going through the official docs below, the recommended heap sizes are as follows.
logstash -
https://www.elastic.co/guide/en/logstash/current/jvm-settings.html#heap-size
The recommended heap size for typical ingestion scenarios should be no less than 4GB and no more than 8GB.
elasticsearch -
https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
Set Xms and Xmx to no more than 50% of your total memory. Elasticsearch requires memory for purposes other than the JVM heap
Kibana -
I haven't found a default or recommended memory figure for Kibana, but in our single-node test cluster with 8 GB of memory it is using about 1.4 GB in total (256 MB / 1.4 GB).
beats -
I haven't found a default or recommended memory figure for Beats either, but they will also consume some memory.
Which of the combinations below would be ideal?
Option 1: 32 GB = 16 GB for the OS + 16 GB for the Elasticsearch heap.
Out of the 16 GB on the OS side: 4 GB for Logstash, say 4 GB for the three Beats, and 2 GB for Kibana.
That leaves the OS with 6 GB, and if any new component has to be installed in the future (say APM or anything else OS-related), it will have to share that 6 GB with the OS.
This follows the official recommendation for all components (i.e. 50% for the OS and 50% for Elasticsearch).
Option 2: 32 GB = 8 GB for the Elasticsearch heap (25% for Elasticsearch).
4 GB for Logstash + 4 GB for the Beats + 2 GB for Kibana.
That leaves 14 GB for the OS and any future component.
Am I missing anything that could change this memory split?
Any suggestion for changing the above combinations, or any new combination, is appreciated.
Thanks.
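For what it's worth, a minimal sketch of how the Elasticsearch and Logstash heaps from Option 1 would actually be set, assuming package installs of recent versions and the stock config paths (file names, paths, and sizes are placeholders to adjust for your environment):

# /etc/elasticsearch/jvm.options.d/heap.options (Elasticsearch heap, 50% of RAM)
-Xms16g
-Xmx16g

# /etc/logstash/jvm.options (Logstash heap, within the 4-8 GB recommendation)
-Xms4g
-Xmx4g

Kibana and the Beats are not JVM processes, so their memory is not set this way; their footprint is whatever the Node.js and Go processes end up using.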

Elasticsearch and Fluentd optimisation for log cluster

We are using Elasticsearch and Fluentd for a central logging platform. Below are our config details:
Elasticsearch cluster:
Master nodes: 64 GB RAM, 8 CPU, 9 instances
Data nodes: 64 GB RAM, 8 CPU, 40 instances
Coordinator nodes: 64 GB RAM, 8 CPU, 20 instances
Fluentd: at any given time we have around 1000+ Fluentd instances writing logs to the Elasticsearch coordinator nodes.
On a daily basis we create around 700-800 indices, which amount to about 4K shards per day, and we keep a maximum of 40K shards on the cluster.
We started facing performance issues on the Fluentd side, where Fluentd instances fail to write logs. The common issues are:
1. read time out
2. request time out
3. {"time":"2021-07-02","level":"warn","message":"failed to flush the buffer. retry_time=9 next_retry_seconds=2021-07-02 07:23:08 265795215088800420057/274877906944000000000 +0000 chunk=\"5c61e5fa4909c276a58b2efd158b832d\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"logs-es-data.internal.tech\\\", :port=>9200, :scheme=>\\\"http\\\"}): [429] {\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"}],\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"},\\\"status\\\":429}\"","worker_id":0}
Looking for guidance on how we can optimise our logs cluster.
Well, by the looks of it, you have exhausted your parent circuit breaker limit of 95% of heap memory.
The error you mention is covered in the Elasticsearch docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/fix-common-cluster-issues.html#diagnose-circuit-breaker-errors
That page also lists a few steps you can take to reduce JVM memory pressure, which can help reduce this error.
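As a first step, it helps to see how close each breaker is to its limit. The node stats API exposes that (a read-only check, nothing is changed):

GET /_nodes/stats/breaker

Each breaker (parent, fielddata, request, in_flight_requests, ...) reports its limit and estimated size, which shows where the pressure is actually coming from.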
You can also try increasing this limit to 98% using this dynamic setting:
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "98%"
  }
}
But I would suggest this be performance tested before applying in production.
Since your requests are pushing heap usage to the 30 GB mark, which is a bit too much, for a more reliable solution I would suggest increasing your log scrapers' flush frequency, so that they make more frequent posts to ES with smaller-sized data blocks.
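On the Fluentd side that usually comes down to the buffer settings of the elasticsearch output plugin. A hedged sketch, assuming fluent-plugin-elasticsearch with a file buffer; the match pattern, host, buffer path, and exact sizes are placeholders to test against your own traffic:

<match app.**>
  @type elasticsearch
  host logs-es-data.internal.tech
  port 9200
  <buffer>
    @type file
    path /var/log/td-agent/buffer/es
    chunk_limit_size 8MB        # smaller chunks -> smaller bulk bodies hitting the coordinators
    flush_interval 5s           # flush more often instead of accumulating large payloads
    flush_thread_count 2
    retry_max_interval 30
    overflow_action block       # apply backpressure instead of erroring when ES is slow
  </buffer>
</match>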

How to load entire Solr index into memory to increase performance?

My website gets 10-30 hits per second (including bot crawls). I have indexed 6 million records (from a MySQL table) in Solr. When I retrieve 30 records using q=something and sort=random_, Solr takes 200 to 300 milliseconds to respond, sometimes 100 ms.
I tried to improve retrieval using the solr.RAMDirectoryFactory setting, but I got an out-of-memory error. I also know the solr.RAMDirectoryFactory setting is not persistent. So what is the best option to increase caching and load the whole index into memory?
I am using a DigitalOcean 8 GB server for Solr.
Solr settings:
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
Solr version:
solr-spec 7.2.1
solr-impl 7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:54:21
lucene-spec 7.2.1
lucene-impl 7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01
Arguments:
-DSTOP.KEY=solrrocks -DSTOP.PORT=7983 -Djetty.home=/opt/solr/server -Djetty.port=8983
-Dlog4j.configuration=file:/var/solr/log4j.properties -Dsolr.data.home= -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
-Dsolr.install.dir=/opt/solr -Dsolr.jetty.https.port=8983 -Dsolr.log.dir=/var/solr/logs -Dsolr.log.muteconsole -Dsolr.solr.home=/var/solr/data -Duser.timezone=UTC
-XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:ConcGCThreads=4 -XX:GCLogFileSize=20M -XX:MaxTenuringThreshold=8 -XX:NewRatio=3 -XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
-XX:ParallelGCThreads=4 -XX:PretenureSizeThreshold=64m -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
-Xloggc:/var/solr/logs/solr_gc.log -Xms512m -Xmx512m -Xss256k -verbose:gc
Thanks in advance
It's important to remember that with an 8 GB server and the Solr heap set to 512 MB, Lucene (not Solr!) will use the rest of the available memory on the machine (minus what the OS needs, etc.).
So let's say, for example, that the OS needs 512 MB of RAM and your Solr heap is 512 MB; that leaves about 7 GB for Lucene. If you are new to Solr and Lucene, this is a great read on how Lucene's memory works.
How big are your indexes? You can check your /solr/data folders with du -h.
To clarify, INCREASING the Solr heap will make the situation worse (there will be less memory left for Lucene). To avoid RAM being swapped to disk you also need to turn off swap (see for example this, and the sketch below).
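A minimal sketch of that, assuming a typical Ubuntu droplet (the swappiness value is a common choice, not a Solr requirement):

sudo swapoff -a                                          # disable all swap for the current boot
echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf    # strongly discourage swapping after reboots
sudo sysctl -p                                           # reload sysctl settings now

Also comment out any swap entries in /etc/fstab if you want swap to stay off permanently.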
There are lots of knobs and buttons in Solr, Lucene, and your instance that need to be tweaked to help ensure your entire index stays in memory. Even then, remember that things like Java GC, CPU speed, memory speed, and whether the index has been pre-warmed into memory will significantly affect response time.
To learn more see
http://blog.mikemccandless.com/2011/04/catching-slowdowns-in-lucene.html
https://lucidworks.com/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
https://vienergie.wordpress.com/2014/10/19/solr-tuning-maximizing-your-solr-performance/
http://blog.mikemccandless.com/2012/11/lucene-with-zing-part-2.html

Elasticsearch config tweaking with limited memory

I have the following scenario:
A single machine with 32 GB of RAM runs Elasticsearch 2.4; there is one index with 5 shards that is 25 GB in size.
On that index we are constantly indexing new data and running full-text search queries that touch about 95% of the documents; no aggregations. The instance generates a lot of CPU load, and there is no swapping.
My question is: how should I tweak Elasticsearch's memory usage? (I don't have the option to add another machine at the moment.)
Should I assign more memory to the ES heap, say 25 GB (going over the 50% of memory that the readme advises not to exceed), or should I assign a minimal heap of 1-2 GB and assume Lucene will cache the whole index in memory, since these are full-text searches?
Right now 50% of the server's memory, so 16 GB in this case, seems to work best for us.
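For ES 2.4 that 16 GB is typically set through the ES_HEAP_SIZE environment variable rather than jvm.options (which only arrived in 5.0); a sketch, assuming a Debian/Ubuntu package install:

# /etc/default/elasticsearch
ES_HEAP_SIZE=16g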

OOM issue in Elasticsearch Cluster 1.1.1 Environment

I have an Elasticsearch 1.1.1 cluster with two nodes, each with a configured heap of 18 GB (RAM on each node is 32 GB).
In total we have 6 shards and one replica for each shard. ES runs on a 64-bit JVM on an Ubuntu box.
There is only one index in our cluster. Cluster health looks green. The document count on each node is close to 200 million.
Data used on each cluster node is around 150 GB. There are no unassigned shards.
The system is encountering OOM errors (java.lang.OutOfMemoryError: Java heap space).
Content of elasticsearch.yml:
bootstrap.mlockall: true
transport.tcp.compress: true
indices.fielddata.cache.size: 35%
indices.cache.filter.size: 30%
indices.cache.filter.terms.size: 1024mb
indices.memory.index_buffer_size: 25%
indices.fielddata.breaker.limit: 20%
threadpool:
  search:
    type: cached
    size: 100
    queue_size: 1000
It has been noticed that instances of org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector are occupying most of the heap space (around 45%).
I am new to ES. Could someone guide me (or comment) on this OOM issue? What could be the cause, given that we have a lot of heap space allocated?
To be blunt: you are flogging a dead horse. 1.x is not maintained any more, and there are good reasons for that. In the case of OOM: Elasticsearch has since replaced fielddata with doc values wherever possible and added more circuit breakers.
What further complicates the issue is that there is no documentation for 1.1 on the official docs site any more; only 0.90, 1.3, 1.4, and so on. So at the very least you should upgrade to 1.7 (the latest 1.x release).
Turning to your OOM issue, here is what you could try:
Increase your heap size, decrease the amount of data you are querying, add more nodes, or use doc values (on 2.x and up).
Also, your indices.fielddata.breaker.limit looks fishy to me. I think this config parameter was renamed to indices.breaker.fielddata.limit in 1.4, and the Elasticsearch Guide states:
In Fielddata Size, we spoke about adding a limit to the size of
fielddata, to ensure that old unused fielddata can be evicted. The
relationship between indices.fielddata.cache.size and
indices.breaker.fielddata.limit is an important one. If the
circuit-breaker limit is lower than the cache size, no data will ever
be evicted. In order for it to work properly, the circuit breaker
limit must be higher than the cache size.
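To make that concrete, a consistent pairing keeps the cache size below the breaker limit so old fielddata can actually be evicted before the breaker trips. An illustrative sketch in elasticsearch.yml, using the post-1.4 setting name (the percentages are placeholders, not a recommendation):

indices.fielddata.cache.size: 20%
indices.breaker.fielddata.limit: 40%

The config above has the opposite relationship: the cache (35%) is allowed to grow past the breaker limit (20%), which is exactly the situation the guide warns about.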
