Performance when using NFS storage for Elasticsearch

I have a server with 32 cores and 62 GB of RAM, but we are on NFS storage and I think it is starting to bottleneck our daily work: Kibana errors such as queue_size are appearing more and more frequently. We just got a new (identical) server to use as a replica and share the load; will this help? What other recommendations do you have? We have multiple dashboards with around 20 different variables each; will the load be evenly distributed between the primary node and the replica? Unfortunately, local storage is not an option.

Are you actively indexing data on these nodes? If so, you can increase refresh_interval:
PUT /myindex/_settings
{
  "index" : {
    "refresh_interval" : "30s"
  }
}
See the update index settings docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html); this makes the system less demanding on IO. You can also disable the refresh functionality completely and trigger it manually:
PUT /myindex/_settings
{
  "index" : {
    "refresh_interval" : "-1"
  }
}
POST /myindex/_refresh
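The two settings calls above can be sketched without a live cluster. This is a minimal Python sketch that only builds and prints the request bodies (myindex is a placeholder, and nothing is actually sent):

```python
import json

# "myindex" is a placeholder; substitute your own index name.
INDEX = "myindex"

def refresh_settings_body(interval):
    """Build the PUT /<index>/_settings body for a refresh_interval.

    interval: e.g. "30s" to refresh every 30 seconds, or "-1" to
    disable automatic refresh entirely (then trigger it manually
    with POST /<index>/_refresh once the load is done).
    """
    return json.dumps({"index": {"refresh_interval": interval}})

# During heavy indexing: relax (or disable) the refresh.
print("PUT /%s/_settings" % INDEX)
print(refresh_settings_body("-1"))
# Once indexing is done: restore the default 1s interval.
print(refresh_settings_body("1s"))
```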
Also take a look at the Bulk API; it significantly decreases the load of the indexing stage.
Adding new servers to the cluster helps too: Elasticsearch is designed to scale horizontally. In my experience you can run 6-8 virtual nodes on the server you described. Use more shards to distribute the load evenly.
Have you identified your bottleneck (LAN, IO, CPU)?
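For the Bulk API, the wire format is NDJSON: one action line followed by one document line per item, and the body must end with a newline. A hedged sketch of a body builder (build_bulk_body is a made-up helper for illustration, not a client API):

```python
import json

def build_bulk_body(index, docs):
    """Serialize documents into the NDJSON body expected by POST /_bulk.

    Each document is preceded by an "index" action line; the whole
    body must end with a trailing newline or Elasticsearch rejects it.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body("myindex", [{"msg": "hello"}, {"msg": "world"}])
# POST this body to /_bulk with Content-Type: application/x-ndjson
print(body)
```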

Related

Elasticsearch 7.x circuit breaker - data too large - troubleshoot

The problem:
Since upgrading from ES-5.4 to ES-7.2 I started getting "data too large" errors when sending concurrent bulk requests (and/or search requests) from my multi-threaded Java application (using the elasticsearch-rest-high-level-client-7.2.0.jar Java client) to an ES cluster of 2-4 nodes.
My ES configuration:
Elasticsearch version: 7.2
custom configuration in elasticsearch.yml:
thread_pool.search.queue_size = 20000
thread_pool.write.queue_size = 500
I use only the default 7.x circuit-breaker values, such as:
indices.breaker.total.limit = 95%
indices.breaker.total.use_real_memory = true
network.breaker.inflight_requests.limit = 100%
network.breaker.inflight_requests.overhead = 2
The error from elasticsearch.log:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
    "bytes_wanted": 3144831050,
    "bytes_limit": 3060164198,
    "durability": "PERMANENT"
  },
  "status": 429
}
Thoughts:
I'm having a hard time pinpointing the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb VM), the problem becomes very visible, so one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.
Questions:
I would like to understand what scenarios could have led to this error?
and what action can I take in order to handle it properly?
(change circuit-breaker values, change es.yml configuration, change/limit my ES requests)
The reason is that the heap of the node is pretty full, and being caught by the circuit breaker is actually a good thing, because it prevents the nodes from running into OOMs, going stale and crashing.
Elasticsearch has had circuit breakers for a long time, and 7.0.0 improved them with the real-memory parent breaker. With the version upgrade from ES-5.4 to ES-7.2, you are running straight into this improvement.
I see 3 solutions so far:
Increase heap size if possible
Reduce the size of your bulk requests if feasible
Scale out your cluster, as the shards are consuming a lot of heap, leaving nothing to process the large request. More nodes will help the cluster distribute the shards and requests among more machines, which leads to a lower average heap usage on all nodes.
As an UGLY workaround (it does not solve the issue) one could raise indices.breaker.total.limit after reading and understanding the implications.
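For completeness, that workaround would look like the following cluster-settings call. The 98% value is purely illustrative (the 7.x default is 95%), and this sketch only prints the request instead of sending it:

```python
import json

# Illustrative example only; read the linked blog post and understand
# the implications before touching breaker limits in production.
workaround = {
    "transient": {
        "indices.breaker.total.limit": "98%"  # 7.x default is 95%
    }
}
print("PUT /_cluster/settings")
print(json.dumps(workaround, indent=2))
```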
So I've spent some time researching how exactly ES implemented the new circuit-breaker mechanism, and tried to understand why we suddenly started getting those errors:
The circuit-breaker mechanism has existed since the very first versions.
We started experiencing issues around it when moving from version 5.4 to 7.2.
In version 7.2 ES introduced a new way of circuit-breaking: based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767).
In our internal upgrade to ES 7.2, we changed the JDK from 8 to 11.
Also as part of our internal upgrade we changed the jvm.options default configuration, replacing the officially recommended CMS GC with G1GC, which at the time had fairly new support in Elasticsearch.
Considering all the above, I found this bug, fixed in version 7.4, regarding the use of the circuit breaker together with G1GC: https://github.com/elastic/elasticsearch/pull/46169
How to fix:
Change the configuration back to CMS GC.
Or take the fix: it is just a configuration change that can easily be applied and tested in your deployment.

Elasticsearch posting documents: FORBIDDEN/12/index read-only / allow delete (api)

Running Elasticsearch version 7.3.0, I posted 50 million documents into my index. When trying to post more documents to Elasticsearch I keep getting this message:
response code: 403 cluster_block_exception [FORBIDDEN/12/index read-only / allow delete (api)];
Disk watermark exceeded.
I have 40 GB of free disk space and have extended the disk, but I still keep getting this error.
Any thoughts on what could be causing this?
You must have hit the flood-stage watermark at 95%. Where to go from here:
Free disk space (you seem to have done that already).
Optionally change the default setting. 5% of 400GB might be a bit too aggressive for blocking write operations. You can either use percent or absolute values for this — this is just a sample and you might want to pick different values:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "5gb",
    "cluster.routing.allocation.disk.watermark.high": "2gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "1gb",
    "cluster.info.update.interval": "1m"
  }
}
You must then reset the index block (either per affected index, or globally with _all):
PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
BTW in 7.4 this will change: Once you go below the high watermark, the index will unlock automatically.
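Clearing the block can be sketched the same way as the other settings calls: setting the value to null (None in Python) deletes the setting, which is what unlocks writes again on pre-7.4 clusters. clear_read_only_body is a made-up helper for illustration:

```python
import json

def clear_read_only_body():
    """Settings body that removes the read_only_allow_delete block.

    Serializing None produces JSON null, and a null value deletes
    the setting rather than assigning it.
    """
    return json.dumps({"index.blocks.read_only_allow_delete": None})

# PUT this to /_all/_settings (all indices) or /<index>/_settings.
print(clear_read_only_body())
```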

Best practices for ElasticSearch initial bulk import

I'm running a Docker setup with Elasticsearch, Logstash, Filebeat and Kibana, inspired by the Elastic Docker Compose. I need to do an initial load of 15 GB of logfiles into the system (Filebeat -> Logstash -> Elasticsearch), but I'm having some performance issues.
It seems that Filebeat/Logstash is producing more work than Elasticsearch can keep up with. After some time I begin to see a bunch of errors in Elasticsearch like this:
[INFO ][o.e.i.IndexingMemoryController] [f8kc50d] now throttling indexing for shard [log-2017.06.30]: segment writing can't keep up
I've found this old documentation article on how to disable merge throttling: https://www.elastic.co/guide/en/elasticsearch/guide/master/indexing-performance.html#segments-and-merging.
PUT /_cluster/settings
{
  "transient" : {
    "indices.store.throttle.type" : "none"
  }
}
But in the current version (Elasticsearch 6) it gives me this error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
  },
  "status": 400
}
How can I solve the above issue?
The VM has 4 CPU cores (Intel Xeon E5-2650), and Elasticsearch is assigned 4 GB of RAM, Logstash and Kibana 1 GB each. Swapping is disabled using "swapoff -a". X-Pack and monitoring are enabled. I only have one ES node for this log server. Do I need multiple nodes for this initial bulk import?
EDIT1:
Changing the number_of_replicas and refresh_interval seems to make it perform better. Still testing.
PUT /log-*/_settings
{
  "index.number_of_replicas" : "0",
  "index.refresh_interval" : "-1"
}
Most likely the bottleneck is IO (you can confirm this by running iostat; it would also be useful if you posted ES monitoring screenshots), so you need to reduce the pressure on it.
The default ES configuration generates many index segments during a bulk load. To fix this, increase index.refresh_interval (or set it to -1) for the duration of the bulk load (see the docs). The default value is 1s, which causes a new segment to be created every second. Also try increasing the batch size and see if it helps.
Also, if you use spinning disks, set index.merge.scheduler.max_thread_count to 1. This allows only one thread to perform segment merging and reduces the IO contention between segment merging and indexing.
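Putting the suggestions above together, here is a sketch of the before/after settings for a bulk load. The values are the ones suggested in this thread, not universal recommendations, and the sketch only prints the request it would send:

```python
import json

# Settings to apply before the initial bulk load (per this thread):
BULK_LOAD_SETTINGS = {
    "index.number_of_replicas": 0,      # no replica copies during the load
    "index.refresh_interval": "-1",     # disable automatic refresh
    # For spinning disks only:
    "index.merge.scheduler.max_thread_count": 1,
}

# Settings to restore once the load has finished:
AFTER_LOAD_SETTINGS = {
    "index.number_of_replicas": 1,
    "index.refresh_interval": "1s",     # back to the default
}

print("PUT /log-*/_settings")
print(json.dumps(BULK_LOAD_SETTINGS, indent=2))
```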

Advice on ElasticSearch Cluster Configuration

Brand new to Elasticsearch. I've been doing tons of reading, but I am hoping that the experts on SO might be able to weigh in on my cluster configuration to see if there is something that I am missing.
Currently I am using ES (1.7.3) to index some very large text files (~700 million lines each), aiming for one index per file. I am using Logstash (v2.1) as my method of choice for indexing the files. Here is the config file for my first index:
input {
  file {
    path => "L:/news/data/*.csv"
    start_position => "beginning"
    sincedb_path => "C:/logstash-2.1.0/since_db_news.txt"
  }
}
filter {
  csv {
    separator => "|"
    columns => ["NewsText", "Place", "Subject", "Time"]
  }
  mutate {
    strip => ["NewsText"]
    lowercase => ["NewsText"]
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => ["xxx.xxx.x.xxx", "xxx.xxx.x.xxx"]
    index => "news"
    workers => 2
    flush_size => 5000
  }
  stdout {}
}
My cluster contains 3 boxes running on Windows 10 with each running a single node. ES is not installed as a service and I am only standing up one master node:
Master node: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
Data Node #1: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
This node is currently running the logstash instance with LS_HEAP_SIZE= 3000m
Data Node #2: 16GB RAM, ES_HEAP_SIZE = 8000m, Single Core i7
I currently have ES configured with the default 5 shards + 1 replica per index.
At present, each node is configured to write data to an external HD and logs to another.
In my test run, I am averaging 10K events per second with Logstash. My main goal is to optimize the speed at which these files are loaded into ES. I am thinking that I should be closer to 80K based on what I have read.
I have played around with changing the number of workers and flush size, but can't seem to get beyond this threshold. I think I may be missing something fundamental.
My questions are two fold:
1) Is there anything that jumps out as fishy about my cluster configuration or some advice that may improve the process?
2) Would it help if I ran an instance of logstash on each data node indexing separate files?
Thanks so much for any and all help in advance and for taking the time to read.
-Zinga
I'd first have a look at whether Logstash or ES is the bottleneck in your setup. Try ingesting the file without the ES output: what throughput do you get from plain Logstash?
If this is considerably higher, you can continue on the ES side of things. A good starting point:
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
If plain Logstash doesn't yield a significantly higher throughput, you can try scaling out / parallelising Logstash across your machines.
Hope that helps.

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other one is data only (data: true, master: false). They are both EC2 I2.4XL (16 Cores, 122GB RAM, 320GB instance storage). 2 shards, 1 replication.
Those two nodes are being fed by our aggregation server which has 20 separate workers. Each worker makes bulk indexing request to our ES cluster with 50 items to index. Each item is between 1000-4000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now, the issue is that this error only started occurring when we introduced the second node. Before, when we had one machine, we got a consistent indexing throughput of 20k requests per second. Now with two machines, once it hits the 10k mark (~20% CPU usage) we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator which generates a random document to be indexed. Generally these documents are of the same size, but have random parameters. We use this for stress tests and stability checks. This mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. The interesting thing is, with it we are able to index around 40-45k items per second (~80% CPU usage) without getting this error. So it seems really puzzling why we get this error. Has anyone seen this error or know what could be causing it?
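A common client-side mitigation for these rejections (es_rejected_execution_exception maps to HTTP 429) is to retry the bulk request with exponential backoff rather than enlarging the queue. A sketch, where send_bulk stands in for whatever HTTP client call you use to POST /_bulk; the fake client below only simulates rejections:

```python
import random
import time

def send_with_backoff(send_bulk, payload, max_retries=5):
    """Retry a bulk request when the cluster rejects it (HTTP 429),
    sleeping exponentially longer between attempts, with jitter.

    send_bulk is any callable taking the payload and returning an
    HTTP status code; it stands in for a real HTTP client call.
    """
    for attempt in range(max_retries):
        status = send_bulk(payload)
        if status != 429:
            return status
        # Back off exponentially, with a little random jitter.
        time.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)
    return 429

# Example with a fake client that rejects the first two attempts:
attempts = {"n": 0}
def fake_send(payload):
    attempts["n"] += 1
    return 429 if attempts["n"] <= 2 else 200

print(send_with_backoff(fake_send, b"..."))  # -> 200 after two retries
```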
