Elasticsearch takes a minute to respond to aggregation query - elasticsearch

Document count: 4 billion
disk size: 2 TB
primaries: 5
replicas: 2
master nodes: 3
data nodes: 4 × [16 CPU and 64 GB RAM]
heap size: 30 GB
mlock enabled: true
It takes up to 3 minutes to respond to aggregation queries. On subsequent requests the cache kicks in and speeds things up. Is there a way to speed up the aggregation on the first query?
Example aggregation query:
{
  "query": {
    "bool": {
      "must": [],
      "must_not": [],
      "should": []
    }
  },
  "size": 0,
  "aggs": {
    "agg_;COUNT_ROWS;5d8b0621690e727ff775d4ed": {
      "terms": {
        "field": "feild1.keyword",
        "size": 10000,
        "shard_size": 100,
        "order": {
          "_term": "asc"
        }
      },
      "aggs": {
        "agg_;COUNT_ROWS;5d8b0621690e727ff775d4ec": {
          "terms": {
            "field": "feild2.keyword",
            "size": 30,
            "shard_size": 100,
            "order": {
              "_term": "asc"
            }
          },
          "aggs": {
            "agg_HouseHold;COUNT_DISTINCT": {
              "cardinality": {
                "field": "feild3.keyword",
                "precision_threshold": 40000
              }
            }
          }
        }
      }
    }
  }
}

If I understand correctly, you are running the query on a single cluster with a total of 15 shards, 5 of which are primaries. The first terms aggregation has a size of 10,000; that is a high number that hurts performance. Consider moving to a composite aggregation, which lets you paginate through the buckets instead of squeezing everything into one huge response.
Also, the shard_size doesn't make much sense to me: you query 5 shards and ask for up to 10,000 buckets, but bringing back only 100 terms from each of 5 shards can yield at most 500 buckets, which is not enough. I would drop the shard_size param, or set a much higher value so that it makes sense.
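As an illustrative sketch (untested against your mapping; the page size of 1000 and the source names are assumptions), the two terms levels could become composite sources, with the cardinality sub-aggregation kept as-is:

```json
{
  "size": 0,
  "aggs": {
    "rows": {
      "composite": {
        "size": 1000,
        "sources": [
          { "field1": { "terms": { "field": "feild1.keyword", "order": "asc" } } },
          { "field2": { "terms": { "field": "feild2.keyword", "order": "asc" } } }
        ]
      },
      "aggs": {
        "agg_HouseHold;COUNT_DISTINCT": {
          "cardinality": {
            "field": "feild3.keyword",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}
```

Each response includes an after_key; pass it back as "after" inside the composite object to fetch the next page, so no single response has to hold 10,000 buckets.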

Related

ElasticSearch Query: Can't get all aggregated results

In ElasticSearch, I have 5 shards, and I have documents like this in each shard:
[{userId:XX, name:YYY, bookId:"123abc" },....]
I group by "bookId" and try to get all books that have been borrowed by more than one user:
{
  "size": 0,
  "aggs": {
    "group_by_bookId": {
      "terms": {
        "field": "bookId.keyword",
        "size": 10000
      },
      "aggs": {
        "having_several": {
          "bucket_selector": {
            "buckets_path": {
              "the_doc_count": "_count"
            },
            "script": "params.the_doc_count > 1"
          }
        }
      }
    }
  }
}
This query does return, say, 500 buckets, which is less than 10000, so the query seems to return everything it has. But when I search for the field "bookId" = "123abc", 5 documents come back, yet the key "123abc" doesn't show up in the returned buckets' "key" list. I know we have 5 documents with "bookId" = "123abc", one in each shard. It seems "aggs" works on each shard separately and then combines the per-shard results; since each shard has only one document with "bookId" = "123abc", that document is filtered out on every shard.
So my question is: is there any solution (an Elasticsearch query) for finding all documents that have a duplicated value, no matter which shards those documents are located in?
I think that min_doc_count could help you.
The query will look like this:
{
  "size": 0,
  "aggs": {
    "group_by_bookId": {
      "terms": {
        "field": "bookId.keyword",
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}
From the Elasticsearch documentation, min_doc_count is applied after the merge phase, so it filters on cross-shard totals:
The min_doc_count criterion is only applied after merging local terms
statistics of all shards.
If this doesn't work, you can try playing with the shard_size parameter in the query.
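As an illustrative sketch of that suggestion (the value 50000 is an arbitrary assumption), asking each shard to return more terms before the merge phase lowers the chance of a rare key being missed:

```json
{
  "size": 0,
  "aggs": {
    "group_by_bookId": {
      "terms": {
        "field": "bookId.keyword",
        "size": 10000,
        "shard_size": 50000
      }
    }
  }
}
```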

Improve ES Agg query - getting circuit_breaking_exception

I run an aggregation on 2 indices: idx-2020-07-21, idx-2020-07-22
The target:
Get all documents,
but in the case of a duplicate id (50% are duplicates), get the one from the latest index, using the index name.
This is the query I'm running:
{
  "size": 0,
  "aggregations": {
    "latest_item": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "product": {
              "terms": {
                "field": "_id",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "max_date": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "sort": [
              {
                "_index": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Each index is 8 GB with ~1M docs, on ES version 7.5,
and it takes around 8 min to aggregate. Most of the time I get:
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [32933676058/30.6gb], which is larger than the limit of [32641751449/30.3gb].
Is there a better way to write this query?
How do I deal with this exception?
I run a Java job that queries ES every 10 min; I noticed it happens a lot the second time.
Do I need to release any resources or something? I use restHighLevelClient.searchAsync() with a listener that calls again with the next key until I get null.
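For reference, each follow-up request looks roughly like this (the "after" value is a made-up placeholder, and the top_hits sub-aggregation is omitted for brevity):

```json
{
  "size": 0,
  "aggregations": {
    "latest_item": {
      "composite": {
        "size": 1000,
        "after": { "product": "SOME_PREVIOUS_ID" },
        "sources": [
          { "product": { "terms": { "field": "_id" } } }
        ]
      }
    }
  }
}
```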
The cluster has 3 nodes, 32G each.
I tried playing with the bucket size, but it didn't help much.
Thanks!

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}
This is a simplified version; the real query has a more complex main query, and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd also consider that an answer.
I expected a query like this to order results by the rescore query, group all the results that have the same identical_id, and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain, as determined by the original rescore query.
The reality is that the term buckets are ordered by the maximum query score, not the rescore query score. Strangely, the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed by then.
I have no idea how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty run to expiration in case someone comes up with a better approach.
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "distinct": {
          "terms": {
            "field": "identical_id",
            "order": {
              "top_score": "desc"
            }
          },
          "aggs": {
            "best_unique_result": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "_script": {
                      "type": "number",
                      "script": {
                        "source": "doc['topic_score'].value"
                      },
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "top_score": {
              "max": {
                "script": {
                  "source": "doc['topic_score'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}
The sampler aggregation takes the top N hits per shard from the core query and runs the aggregations over those. Then, in the max aggregator that defines the bucket order, I use exactly the same script as the one I use to pick the top hit from each bucket. Now the buckets and the top hits run over the same top-N sets of items, and the buckets are ordered by the max of the same score, generated from the same script.
Unfortunately I still need to run the script once to order the buckets and once to pick the top hit within each bucket. You could use the rescore instead for the top-hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.

Elasticsearch aggregation and filters

Hi friends, I am trying to make a search bar for my website. I have thousands of company articles. When I run this code:
GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "assistant",
            "fields": ["title"]
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": {
        "field": "company.keyword",
        "size": 10
      }
    }
  }
}
The result is:
"aggregations": {
  "by_company": {
    "doc_count_error_upper_bound": 5,
    "sum_other_doc_count": 409,
    "buckets": [
      {
        "key": "University of Miami",
        "doc_count": 6
      },
      {
        "key": "Brigham & Women's Hospital(BWH)",
        "doc_count": 4
      },
So now I want to filter the articles of the University of Miami, so I run the following query:
GET indeed_psql/job/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "assistant",
            "fields": ["title"]
          }
        }
      ],
      "filter": {
        "term": {
          "company.keyword": "University of Miami"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": {
        "field": "company.keyword",
        "size": 10
      }
    }
  }
}
But now the result is:
"aggregations": {
  "by_company": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "University of Miami",
        "doc_count": 7
      }
    ]
  }
Why are there suddenly seven of them, when in the previous aggregation there were 6? This also happens with other university filters. What am I doing wrong? I am not using the standard tokenizer, and among filters I use english_stemmer, english_stopwords, english_keywords. Thanks for your help.
It's likely that your first query's document counts are wrong. In your first response, the "doc_count_error_upper_bound" is 5, meaning that some of the terms in your returned aggregation were not present among the top results of each of the underlying queried shards. The document count will always be too low rather than too high, because a term's occurrences can be "missed" while querying a shard for its top N keys.
How many shards do you have? For instance, if there are 3 shards, and your aggregation size is 3 and your distribution of documents was something like this:
Shard 1     Shard 2     Shard 3
3 BYU       3 UMiami    3 UMiami
2 UMich     2 BWH       2 UMich
2 MGH       2 UMich     1 BWH
1 UMiami    1 MGH       1 BYU
Your resulting top 3 terms from each shard are merged into:
6 UMiami // returned
6 UMich // returned
3 BWH // returned
3 BYU
2 MGH
From these, only the top three results are returned, and almost all of these keys are undercounted.
You can see that in this scenario, the UMiami document in Shard 1 would not make it into consideration because it is beyond the depth of 3. But if you filter to ONLY look at UMiami, you necessarily pull back all associated docs from each shard and end up with an accurate count.
You can play around with the shard_size parameter so that Elasticsearch looks a little deeper into each shard and gets a more accurate count. But given that there are 7 total documents for this facet, it's likely there's only one occurrence of it on one of your shards, so it will be hard to surface it in the top aggregations without grabbing all of the documents from that shard.
You can read more about the count approximation and error derivation here; tl;dr: Elasticsearch is making a guess about the total number of documents for that facet based on the top aggregations in each individual shard.
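As an illustrative sketch of that shard_size suggestion (the value 100 is an arbitrary assumption, not a recommendation), each shard would then report its top 100 company terms before the merge while the response still shows only 10 buckets:

```json
GET articles/_search
{
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": {
        "field": "company.keyword",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}
```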

How to use Scroll on Elasticsearch aggregation?

I am using Elasticsearch 5.3. I am aggregating over some data, but the results are far too many to return in a single query. I tried using size = Integer.MAX_VALUE;, but even that has proved to be insufficient. In the ES search API, there is a method to scroll through the search results. Is there a similar feature for the org.elasticsearch.search.aggregations.AggregationBuilders.terms aggregator, and how do I use it? Can the search scroll API be used for aggregators?
In ES 5.3, you can partition the terms buckets and retrieve one partition per request.
For instance, in the query below, you ask for your buckets to be split into 10 partitions and for only the first partition to be returned. It will return ~10x less data than retrieving all buckets at once.
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 10000
      }
    }
  }
}
You can then make the second request by increasing the partition to 1, and so on:
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field",
        "include": {
          "partition": 1,        <--- increase this up until partition 9
          "num_partitions": 10
        },
        "size": 10000
      }
    }
  }
}
To add this to your Java code, you can do it like this:
TermsAggregationBuilder agg = AggregationBuilders.terms("my_terms");
// IncludeExclude(partition, numPartitions): request partition 0 of 10
agg.includeExclude(new IncludeExclude(0, 10));
