ElasticSearch - slow aggregations and impact on performance of other operations

There is an aggregation to identify duplicate records:
{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "size": 250,
        "min_doc_count": 2
      }
    }
  }
}
However, it misses many duplicates because the size is too low; the actual cardinality of the field is over 2 million. If size is changed to the actual cardinality (or some other much larger number), all of the duplicate documents are found, but the operation takes about 5x more time to complete.
If I change the size to a much larger number, should I expect slow performance or other adverse effects on other operations while this aggregation is running?

Yes, the size param is critical for Elasticsearch aggregation performance. If you set it to a very big number like 10k (the limit enforced by Elasticsearch, which you can raise by changing search.max_buckets), it will surely have an adverse impact not only on the aggregation you are running but on all the other operations running in the Elasticsearch cluster.
As you are using the terms aggregation, which is a bucket aggregation, you can read more in the terms aggregation documentation.
Note: the reason latency increases when you increase the size is that Elasticsearch has to do significant processing to create that many buckets and compute the entries for each of them.
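If the goal is just to surface the duplicates without building millions of buckets in a single response, one option (a sketch, not the original answer's code) is to split the terms aggregation into partitions and fetch them one request at a time, looping partition from 0 to num_partitions - 1. The partition count below is illustrative; pick it so that each partition holds fewer distinct terms than size:
{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "include": {
          "partition": 0,
          "num_partitions": 500
        },
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}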

Related

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will share the same primary_id). In all other cases the primary_id is not repeated in any other doc.
So on average, out of every 10 docs I will have 8 unique primary ids and 1 primary id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to go about such data, where the cardinality of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be select DISTINCT(primary_id) from .....
But in Elasticsearch, distinct values can only be obtained via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: it can combine multiple sources into composite buckets and allows pagination and sorting over them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" buckets, then pass the returned after_key to fetch the next "n" buckets (see the follow-up request after the example below).
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
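To get the next page, take the after_key object returned inside the composite aggregation response and pass it back as after. A minimal sketch of the follow-up request; the { "TradeRef": "100" } value is purely illustrative of whatever after_key the previous response returned:
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ],
        "after": { "TradeRef": "100" }
      }
    }
  }
}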
Bucket sort
"The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets."
So this isn't suitable for your case.
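For illustration, this is roughly what the terms + bucket_sort combination looks like (a sketch reusing the id.keyword field from above): the bucket_sort can only page within the 10 buckets the parent terms aggregation returns, no matter how many distinct ids exist.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": {
        "field": "id.keyword",
        "size": 10
      },
      "aggs": {
        "paging": {
          "bucket_sort": {
            "from": 0,
            "size": 5
          }
        }
      }
    }
  }
}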
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
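A minimal sketch of raising that limit on one index; the index name and the value 100000 are illustrative:
PUT index22/_settings
{
  "index.max_result_window": 100000
}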
A better option is to use the scroll API and perform the distinct operation on the client side.
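Roughly, that approach looks like this: open a scroll, keep requesting the next batch with the returned _scroll_id, and deduplicate primary_id on the client as the pages arrive. The index name, page size and keep-alive below are illustrative:
GET index22/_search?scroll=1m
{
  "size": 1000,
  "_source": ["primary_id"],
  "sort": ["_doc"],
  "query": {
    "match_all": {}
  }
}

POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id returned by the previous response>"
}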

distinct count on hive does not match cardinality count on elasticsearch

I have loaded data into my Elasticsearch cluster from Hive using the elasticsearch-hadoop plugin from Elastic.
I need to fetch a count of unique account numbers. I have the following queries written in both HQL and Query DSL, BUT they are returning different counts.
Hive Query:
select count(distinct account) from <tableName> where capacity="550";
// Returns --> 71132
Similarly, in Elasticsearch the query looks like this:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "capacity": "550" } }
      ]
    }
  },
  "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account"
      }
    }
  }
}
// Returns --> 71607
Am I doing something wrong? What can I do to match the two queries?
Note: There are exactly the same number of records in hive and elasticsearch.
"the first approximate aggregation provided by Elasticsearch is
the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an
approximate algorithm. It is based on the HyperLogLog++ (HLL)
algorithm."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
For the OP
precision_threshold
"precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000. ... Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
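For reference, this is what the OP's query looks like with the threshold raised to its maximum (a sketch; 40,000 is the largest value the setting honours, and counts above it can still be approximate):
{
  "query": {
    "bool": {
      "must": [
        { "match": { "capacity": "550" } }
      ]
    }
  },
  "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account",
        "precision_threshold": 40000
      }
    }
  }
}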
You might also want to take a look at "Support for precise cardinality aggregation #15876"
For the OP, 2
"I have tried several numbers..."
You have 71,132 distinct values while the precision threshold limit is 40,000, therefore the cardinality is over the threshold, which means accuracy is traded for memory saving.
This is how the chosen implementation (based on HyperLogLog++ algorithm) works.
The cardinality aggregation does not guarantee an accurate count even with a precision_threshold of 40000. There is another way to get an accurate distinct count of a field.
This article on "Accurate Distinct Count and Values from Elasticsearch" explains the solution in detail, as well as its accuracy compared to cardinality.

How to get random documents from Elasticsearch indexes with 50 million documents each

I'd like to sample 2000 random documents from approximately 60 ES indexes holding about 50 million documents each, for a total of about 3 billion documents overall. I've tried doing the following on the Kibana Dev Tools page:
GET some_index_abc_*/_search
{
  "size": 2000,
  "query": {
    "function_score": {
      "query": {
        "match_phrase": {
          "field_a": "some phrase"
        }
      },
      "random_score": {}
    }
  }
}
But this query never returns. Upon refreshing the Dev Tools page, I get a page that tells me that the ES cluster status is red (doesn't seem to be a coincidence - I've tried several times). Other queries (counts, simple match_all queries) without the random function work fine. I've read that function score queries tend to be slow, but using a random function score is the only method I've been able to find for getting random documents from ES. I'm wondering if there might be any other, faster way that I can sample random documents from multiple large ES indexes.
EDIT: I would like to do this random sampling entirely using built-in ES functionality, if possible - I do not want to write any code to e.g. implement reservoir sampling on my end. I also tried running my query with a much smaller size - 10 documents - and I got the same result as for 2000.

ElasticSearch circuit_breaking_exception (Data too large) with significant_terms aggregation

The query:
{
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle"
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "track_scores": false
}
Error:
{
  "bytes_limit": 6844055552,
  "bytes_wanted": 6844240272,
  "reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
  "type": "circuit_breaking_exception"
}
Index size is 5G. How much memory does the cluster need to execute this query?
You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:
indices.breaker.request.limit: 41%
Or if you prefer to not restart your cluster you can change the setting dynamically using:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%"
  }
}'
Judging by the numbers showing up (i.e. "bytes_limit": 6844055552, "bytes_wanted": 6844240272), the request is only ~180 KB over the limit, so increasing the breaker by 1% to 41% gives you roughly 170 MB of additional room for the request breaker (your total heap is ~17 GB, since the current limit is 40% of it), which should be sufficient.
Just make sure to not increase this value too high, as you run the risk of going OOM since the request circuit breaker also shares the heap with the fielddata circuit breaker and other components.
I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.
The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating frequencies of that term in the whole index and then comparing those with the frequencies from the range query set of documents).
After it does that (for all the terms), you want a second significant_terms aggregation that repeats the first step, but now considering each term individually and running another significant_terms aggregation for it. That's gonna be painful. Basically, you are computing number_of_terms * number_of_terms significant_terms calculations.
The big question is what are you trying to do?
If you want to see a relationship between all the terms in that field, that's gonna be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so, and then run a second query with another significant_terms aggregation, but limit the terms by making the parent a terms aggregation whose include lists only those 10 terms from the first query (see the sketch below).
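A sketch of that second query; "term1" and "term2" are placeholders for whichever terms the first significant_terms pass actually returned:
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "aggs": {
    "topTerms": {
      "terms": {
        "field": "translatedTitle",
        "include": ["term1", "term2"]
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  }
}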
You can also take a look at the sampler aggregation and use it as the parent of a single significant_terms aggregation (see the second sketch below).
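And a sketch of the sampler variant; shard_size is illustrative and caps how many top-scoring documents per shard feed the significant_terms computation:
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "sigTerms": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  }
}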
Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen for a reason. You can increase it and maybe it will work, but it should make you ask yourself whether this is the right query for your use case (it doesn't sound like it is). Also, the limit value shown in the exception might not be the final one: reused_arrays refers to an array class in Elasticsearch that is resizable, so if more elements are needed, the array size is increased and you may hit the circuit breaker again, at another value.
Circuit breakers are designed to deal with situations where request processing needs more memory than is available. You can set the limit using the following request:
PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.request.limit" : "45%"
  }
}
You can get more information at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

Term filter causes memory to spike drastically

The term filter that is used:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "filter": {
    "term": {
      "void": false
    }
  },
  "fields": [
    "user_id1",
    "user_name",
    "date",
    "status",
    "q1",
    "q1_unique_code",
    "q2",
    "q3"
  ],
  "size": 50000,
  "sort": [
    "date_value"
  ]
}'
The void field is a boolean field.
The index store size is 504 MB.
The Elasticsearch setup consists of only a single node, and the index consists of only a single shard and 0 replicas. The version of Elasticsearch is 0.90.7.
The fields mentioned above are only the first 8; the actual term filter request that we execute lists 350 fields.
We noticed the memory spiking by about 2-3 GB even though the store size is only 504 MB.
Running the query multiple times seems to continuously increase the memory.
Could someone explain why this memory spike occurs?
It's quite an old version of Elasticsearch.
You're returning 50,000 records in one request.
You're sorting those 50k records.
Your documents are pretty big - 350 fields are requested.
Could you instead return a smaller number of records and then page through them? Scan and scroll could help you (see the sketch below).
It's not clear whether you've indexed the individual fields - this could help, as the _source being read from disk may be incurring a memory overhead.
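A rough sketch of scan-and-scroll against a 0.90-era cluster (the exact parameters may differ in your version, so treat this as an outline rather than a drop-in command); size here is per shard, and you repeat the second call with each newly returned _scroll_id until no hits come back:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search?search_type=scan&scroll=5m' -d '{
  "filter": {
    "term": { "void": false }
  },
  "size": 500
}'

curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m' -d '<_scroll_id from the previous response>'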
