Improving Response Time and Reducing CPU Utilization while using Elasticsearch Percolation - performance

I'm looking for ways to cut down the response time of Elasticsearch percolation and reduce the CPU utilization while it is being performed.
I tried a number of steps and managed to bring the response time down, but it impacted CPU utilization. I'm using Elasticsearch 5.6 and want to see whether I can get the response time under 2 seconds at least.
The steps are mentioned below:
Tried running the percolator query with 1 node and 1 shard. The response time was very poor: it varied between 37 and 40 seconds.
Tried running the percolator query with 1 node and 3 shards. The response time was better, but still not great: it varied between 14 and 16 seconds. This was a scenario where I attempted over-allocation of shards to see if it made a difference, and although the response time improved, CPU utilization went over 90% on a 4-core, 32 GB VM. There was a memory spike, but nothing alarming; I think memory would become a concern if consecutive percolator queries were run.
Tried running the percolator query with 1 node and 10 shards. The response time was better, but still not great: it varied between 13 and 15 seconds.
Checked out some links in the Elasticsearch GitHub discussions and tried reducing the number of terms, but that started affecting the scoring, so I had to abandon this step: scoring and matching must stay consistent for the use case I'm working on.
The links I referred to are listed below.
How to improve percolator performance in ElasticSearch?
https://github.com/elastic/elasticsearch/issues/26307
https://github.com/elastic/elasticsearch/issues/25445
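For context, the shape of the percolation I'm running looks roughly like the sketch below (Python client for brevity; the index, type and field names are placeholders rather than my real mapping):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # assumed local 5.6 node

    # Match one candidate document against the stored percolator queries.
    # "queries-index", "doctype", "query" and "message" are placeholder names.
    resp = es.search(
        index="queries-index",
        body={
            "query": {
                "percolate": {
                    "field": "query",            # field mapped as type "percolator"
                    "document_type": "doctype",  # still required on 5.x
                    "document": {"message": "a sample document to classify"},
                }
            }
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit["_score"])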

Related

Increasing number of nodes for elasticsearch doesn't speed up the queries

I am pretty new to Elasticsearch.
Right now I have a pretty expensive search query; it uses nested querying, highlighting, fuzziness, and so on. I can't change that at the moment.
So I decided to try upgrading my cluster: previously it had 3 nodes with 16 GB and 0.5 CPU each.
I have now upgraded it to 6 nodes with 16 GB and 3 CPUs each.
I was hoping for some speedup on queries, but I did not get it. For some reason it has even become slower; for example, queries that used to take 12 seconds now take about 17 seconds.
Is there anything more I have to do after increasing the number of nodes? After reading some docs I thought it was all automatic.
I'll include the result of the /_nodes query; maybe it will give some more insight. I have to use an external service to share the JSON because of the character limit:
https://wtools.io/paste-code/bFUC
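I suppose one thing worth verifying is whether the shards and replicas actually got spread onto the new nodes; a minimal check sketched with the Python client ("my-index" is a placeholder name):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    # Adding nodes only helps query throughput if the index's shards
    # (or replicas) actually end up spread across them.
    print(es.cat.shards(v=True))

    # Current replica count for the index ("my-index" is a placeholder name).
    print(es.indices.get_settings(index="my-index", name="index.number_of_replicas"))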

Elasticsearch application latency investigation

We have an Elasticsearch setup with [data, master, client] nodes. The client nodes receive only query traffic, pass queries to the data nodes, get the responses, and send them back to the caller (based on my general understanding).
We are seeing that the 'took' latency in the query response is around ~16 ms, but the latency our application measures when calling into the client nodes is around ~90 ms. Here are some numbers on our setup:
ES setup: 3 client nodes (60 GB / 3 CPUs / 30 GB heap each), 3 data nodes (80 GB / 16 CPUs / 30 GB heap each), 3 master nodes. It's a Kubernetes-based Helm chart setup.
Client/data nodes have enough CPU/memory (based on Kubernetes pod-level CPU/memory usage)
QPS - 20 req/sec
Shard size ~ 24GB, 0 replicas. Each shard is on a separate data-node. Indices are using mmapfs/preload "*"
Query types: bool query with 3 match clauses and 3 should clauses for boosting on a few fields. We have "_source": true.
Our documents are quite large, with (mean, p90, p99) sizes of (200 KB, 400 KB, 800 KB).
Our response sizes are on the order of (mean, p99) = (164 KB, 840 KB). We also observed that latency for bigger responses is much higher than the baseline.
Can someone comment on the following questions:
How can we pinpoint exactly where this extra latency is introduced? From reading about "took" here, I understand it covers the querying and response-forming stages, but something happens after that which pushes our application-measured latency up to ~90 ms. Where else can I look to dig into this increase? I have access to the Prometheus ES dashboard and Kubernetes pod usage, but all of them look normal, with no spikes.
Are there some ES settings we can play with to optimize this latency? We feel it's mostly due to the bigger response sizes. Can some form of compression be enabled in ES to help with this?
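For illustration, the kind of source trimming we are considering looks roughly like this (Python client sketch; the index and field names are placeholders, not our real mapping):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://client-node:9200"])  # placeholder endpoint

    # Return only the fields the caller needs instead of the full
    # (up to ~800 KB) _source; "title" and "summary" are placeholder fields.
    resp = es.search(
        index="my-index",
        body={
            "_source": ["title", "summary"],
            "query": {
                "bool": {
                    "must": [{"match": {"title": "example"}}],
                    "should": [{"match": {"summary": {"query": "example", "boost": 2.0}}}],
                }
            },
        },
    )
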
Your question is very broad, with very little information on your deployment, such as the types of search queries, index mapping, cluster/node/index specifications, and QPS. It's generally very difficult to suggest anything without looking at the system that has the performance issues.
Regarding client nodes: yes, they receive the traffic, but they also compute the global result from the local result sets received from each shard involved in the search request. So they do heavy processing and should have enough capacity, otherwise they become a bottleneck: even if your data nodes compute their local results quickly, processing at the client node will take more time and the overall took time will increase.
You can also check whether you have room to apply some of the suggestions I wrote in this blog post.
Hope this helps.

A very long delay between indexing a document in ElasticSearch and seeing it in search results

We are experiencing a problem with ES 5.4.2 in production. About 25% of our indexed documents only appear in search results 3 seconds or more after the success response from the indexing request; for some documents it's more than 7 seconds. Our cluster consists of three nodes with 59 GB of RAM each, of which 31 GB is dedicated to the ES heap. Heap usage never goes over 75%, with GC dropping it down to ~25-30%. The indexing rate is relatively low, less than 10 documents per second. Our average refresh time is ~300 ms, and the refresh interval is the default 1 s.
With these numbers I would expect at most a ~1.5 s delay between indexing and search availability. What am I missing here? Is it possible to reduce this search-availability delay? Maybe there is some metric worth monitoring that I'm overlooking?
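In case it matters for the answer: for the few callers that must read their own writes, we could block the indexing call until the next refresh, roughly like this (Python client sketch; index, type and field names are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    # refresh="wait_for" makes the call return only once the document is
    # visible to search (available since 5.0). "my-index"/"doc" are placeholders.
    es.index(
        index="my-index",
        doc_type="doc",
        id="1",
        body={"title": "example document"},
        refresh="wait_for",
    )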

Elasticsearch Indexing Speed Degrades over Time

I am doing some performance tuning in Elasticsearch for my project and need some help improving the indexing speed. I am using ES 5.1.1 and have a 2-node setup with 8 shards for the index. Each of the two servers has 16 GB of RAM and 12 CPUs at 2.2 GHz. I need to index around 25,000,000 documents within 1.5 hours, which currently takes around 4 hours. I have made the following config changes to improve the indexing time:
Setting ‘indices.store.throttle.type’ to ‘none’
Setting ‘refresh_interval’ to ‘-1’
Increasing ‘translog.flush_threshold_size’ to 1GB
Setting ‘number_of_replicas’ to ‘0’
Using 8 shards for the index
Setting VM Options -Xms8g -Xmx8g (Half of the RAM size)
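Concretely, the index-level settings above (refresh interval, replicas, translog threshold) are applied roughly like this (a Python client sketch for illustration; "my-index" is a placeholder, and my application actually does this through the Java API):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    # Bulk-load settings for the index ("my-index" is a placeholder name).
    es.indices.put_settings(
        index="my-index",
        body={
            "index": {
                "refresh_interval": "-1",                # no periodic refresh while loading
                "number_of_replicas": 0,                 # add replicas back afterwards
                "translog.flush_threshold_size": "1gb",  # larger translog flushes
            }
        },
    )
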
I generate the documents in my Java application and index them with the bulk processor, using the following configuration:
Bulk Actions Count : 10000
Bulk Size in MB : 100
Concurrent Requests : 100
Flush Interval : 30
Initially I can index around 356,167 documents in the first minute, but over time the rate decreases; after around 1 hour it is down to about 121,280 docs per minute.
How can I keep the indexing rate steady over time? Are there any other ways to improve the performance?
I strongly encourage you not to change configuration parameters like the translog flush size or the throttling unless you know what you are doing (and that does not mean reading some blog post on the internet :-)
Try a single shard per server and, especially, reduce the bulk size to something like 10 MB. 100 MB * 100 concurrent requests means you need 10 GB of heap just to hold those (before doing anything else). I suspect not all of the documents get indexed because of rejected tasks in your thread pools.
Start small and grow from there, instead of starting big without any insight into your indexing.
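As a rough sketch of what "start small" can look like (Python client for brevity, since I don't know your exact Java setup; the index/type names are placeholders and your BulkProcessor has the equivalent knobs):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    def actions():
        # Placeholder generator; yield your real documents here.
        for i in range(1000):
            yield {"_index": "my-index", "_type": "doc", "_id": i, "_source": {"n": i}}

    # Modest bulk sizes and a couple of parallel requests instead of 100 x 100 MB.
    for ok, info in helpers.parallel_bulk(
        es,
        actions(),
        thread_count=2,                    # a few concurrent requests, not 100
        chunk_size=1000,                   # documents per bulk request
        max_chunk_bytes=10 * 1024 * 1024,  # ~10 MB per bulk request
    ):
        if not ok:
            print("failed:", info)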

Elasticsearch - queries throttle cpu

Recently our cluster has seen extreme performance degradation. We had 3 nodes, each with 64 GB and 4 CPUs (2 cores), for an index of 250M records, roughly 60 GB in size. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indices, queried through an alias
3. Disabled paging (Windows Server 2012)
4. Added synonym analysis on one field
Our cluster now survives only a few hours before it's basically useless, and I have to restart Elasticsearch on each node to rectify the problem. We tried bumping each node to 8 CPUs (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure whether this was happening before, but it certainly seems like an issue. Obviously we need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, performance doesn't seem to get worse for those particular responses. Still, over time the performance slows to a crawl and the CPU (all cores) stays maxed out indefinitely.
I'm using Elasticsearch 1.3.4 and the elasticsearch-analysis-phonetic 2.3.0 plugin on every box, and was doing so even when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site to a single index that holds about 80% of the data, the CPU wasn't getting pegged. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which covers two indices in total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% on every query and gets stuck there with many requests in a row.
Is there a known issue with Elasticsearch aliases? Am I using the alias incorrectly?
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in Elasticsearch? If I run an empty query and then try to view the last page (e.g. from=100,000,000), it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results and then quits, all the while taking up 100% of the CPU for over a minute.
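(The workaround I'm testing is to avoid large from values entirely and walk the results with a scroll instead; a rough sketch with the Python client, "my-index" being a placeholder:)

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    # Instead of from=100,000,000, walk the result set with a scroll cursor.
    resp = es.search(
        index="my-index",
        scroll="2m",
        size=1000,
        body={"query": {"match_all": {}}},
    )
    scroll_id = resp["_scroll_id"]
    while resp["hits"]["hits"]:
        # ... process resp["hits"]["hits"] ...
        resp = es.scroll(scroll_id=scroll_id, scroll="2m")
        scroll_id = resp["_scroll_id"]
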
UPDATE 3:
So here's something else strange. When I point to an old index on dev (same size, no aliases) and try to reproduce the paging issue, the cluster doesn't get hit immediately: it sits at 1% CPU usage for the first 20 seconds after the query, and the query returns with an error before CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).
Without checking certain metrics it is very difficult to identify the cause of slow responses or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the number of garbage collections on your nodes. Frequent garbage collection (mainly old-generation GC) consumes a lot of CPU, and this in turn starts to affect the whole cluster.
As you mentioned, it started giving issues only after you added another node. This surprises me. Was there any increase in traffic?
Can you include the output of the _stats API taken at the time when your cluster slows down? It will have a lot of information from which I can make a better diagnosis. Also include a sample of the query.
I suggest you install bigdesk so that you can get a graphical view of your cluster health more easily.
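In the meantime, the GC and cache-eviction counters I would look at first can be pulled from the nodes stats API; a rough sketch with the Python client (the endpoint is a placeholder, and the exact field paths may differ slightly on 1.3):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    # Old-generation GC activity and fielddata evictions per node.
    stats = es.nodes.stats(metric="jvm,indices")
    for node_id, node in stats["nodes"].items():
        gc_old = node["jvm"]["gc"]["collectors"]["old"]
        print(
            node["name"],
            "old GC count:", gc_old["collection_count"],
            "old GC time (ms):", gc_old["collection_time_in_millis"],
            "fielddata evictions:", node["indices"]["fielddata"]["evictions"],
        )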

Resources