How to control Elasticsearch aggregation results with from / size?

I have been trying to add pagination to an Elasticsearch terms aggregation. In a query we can add pagination like this:
{
  "from": 0, // the offset, to control pagination
  "size": 10,
  "query": { }
}
This is pretty clear, but I couldn't find anything about adding pagination to an aggregation, despite having read a lot about it. My code looks like this:
{
  "from": 0,
  "size": 0,
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name",
        "size": 20
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
Is there any way to paginate the aggregation buckets, or do you have any other suggestions?

Seems like you probably want partitions. From the docs:
Sometimes there are too many unique terms to process in a single request/response pair so it can be useful to break the analysis up into multiple requests. This can be achieved by grouping the field’s values into a number of partitions at query-time and processing only one partition in each request.
Basically you add "include": { "partition": n, "num_partitions": x }, where n is the page and x is the number of pages.
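Applied to the aggregation from the question, a minimal sketch might look like this (the choice of 5 partitions is a placeholder; everything else is taken from the question):
{
  "size": 0,
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name",
        "size": 20,
        "include": {
          "partition": 0,
          "num_partitions": 5
        }
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
Each request with a different "partition" value (0 through 4 here) returns a distinct, non-overlapping slice of the terms, so issuing one request per partition walks through all the buckets.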
Unfortunately this feature was added fairly recently. If the tags on the GitHub issue which spawned this feature can be believed, you'll need to be on at least Elasticsearch 5.2.

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a filename field and a when date field to indicate the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads - it is sorted by the filename.
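Paginating with it means feeding the after_key from each response back into the next request's after parameter; a sketch, where the filename under after is a hypothetical value copied from a previous response:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "after": { "downloads": "some-previous-filename.pdf" },
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}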
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting based on the value field.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected.
I have no experience with Transforms (part of X-Pack, and licensed), but you can try that out. Apart from this, I don't see a way to get the expected output.
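For what it's worth, a transform that pre-computes the download counts into a separate index might look roughly like this. This is only a sketch: the _transform endpoint comes from a much later release than the question targets, and the downloads-summary destination index name is made up.
PUT _transform/downloads-by-file
{
  "source": {
    "index": "file",
    "query": {
      "range": {
        "when": {
          "gte": "now-3M"
        }
      }
    }
  },
  "dest": {
    "index": "downloads-summary"
  },
  "pivot": {
    "group_by": {
      "filename": {
        "terms": {
          "field": "filename.keyword"
        }
      }
    },
    "aggregations": {
      "downloads": {
        "value_count": {
          "field": "when"
        }
      }
    }
  }
}
After starting it with POST _transform/downloads-by-file/_start, you can search downloads-summary sorted by the downloads field like any regular index, which gives paginated, count-ordered results.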

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}
This is a simplified version; the real query has a more complex main query, and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration in case someone comes up with a better approach.
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "distinct": {
          "terms": {
            "field": "identical_id",
            "order": {
              "top_score": "desc"
            }
          },
          "aggs": {
            "best_unique_result": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "_script": {
                      "type": "number",
                      "script": {
                        "source": "doc['topic_score'].value"
                      },
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "top_score": {
              "max": {
                "script": {
                  "source": "doc['topic_score'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}
The sampler aggregation will take the top N hits per shard from the core query and run the aggregations over only those. Then, in the max aggregator that defines the bucket order, I use the exact same script as the one I use to pick the top hit from each bucket. Now the buckets and the top hits run over the same top-N set of items, and the buckets are ordered by the max of the same score, generated from the same script. Unfortunately I still need to run the script once to order the buckets and once to pick a top hit within the bucket. You could use the rescore instead for the top-hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.

Which is the most effective way to get all the results of an aggregation

I have the following query:
GET my-index-*/my-type/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "script": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-09T00:00:00.000",
                "to": "2017-12-09T16:00:00.000"
              },
              {
                "from": "2017-12-10T00:00:00.000",
                "to": "2017-12-10T16:00:00.000"
              }
            ]
          }
        },
        "total_count": {
          "sum_bucket": {
            "buckets_path": "dates._count"
          }
        },
        "bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalCount": "total_count"
            },
            "script": "params.totalCount == 0"
          }
        }
      }
    }
  }
}
The result of this query is a bunch of buckets. What I need is the list of keys of my buckets. The problem is that the aggregation result size is 10 by default; after getting those 10, my bucket_filter filters them by total count, and I get only some of those 10. I need all the results, which means I need to specify "size" = n, where n is the distinct count of code values, so that I don't lose any data. I have billions of documents, so in my case n is about 30,000. When I tried executing the query, an "Out of memory" error occurred on the cluster, so I guess it's not the best idea. Is there a good way to get all the results for my query?
Unfortunately this is not recommended for high-cardinality fields with 30K unique values. The reason is the memory cost and the large amount of data that needs to be collected from the shards, as you've discovered. It might work, but then you need more memory...
A more efficient solution is to use the Scroll API, specifying via fields in your search request the values you want to retrieve, and then store those values in your client in memory or stream them.
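A minimal sketch of that approach, assuming the value lives in a field named code (the index pattern and scroll timeout are placeholders):
GET my-index-*/_search?scroll=1m
{
  "size": 1000,
  "_source": ["code"],
  "query": {
    "match_all": {}
  }
}
Each response carries a _scroll_id; you keep fetching pages with it until no hits remain, then de-duplicate the collected values client-side:
GET _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}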
Update: since ES 6.5 this has been possible with Composite aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
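A sketch of the composite variant for this case; composite value sources do accept a script, though whether the question's "code" script maps onto it cleanly is an assumption:
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "code": {
              "terms": {
                "script": "code"
              }
            }
          }
        ]
      }
    }
  }
}
You then page through all ~30,000 buckets with the after key, 1,000 at a time, without the cluster having to build them all in a single response.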

How to use Scroll on Elasticsearch aggregation?

I am using Elasticsearch 5.3. I am aggregating over some data, but the results are far too many to return in a single query. I tried using size = Integer.MAX_VALUE;, but even that proved insufficient. The ES search API has a method to scroll through the search results. Is there a similar feature for the org.elasticsearch.search.aggregations.AggregationBuilders.terms aggregator, and how do I use it? Can the search Scroll API be used for aggregators?
In ES 5.3, you can partition the terms buckets and retrieve one partition per request.
For instance, in the query below, you can request to partition your buckets into 10 partitions and only return the first partition. It will return ~10x less data than if you wanted to retrieve all buckets at once.
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 10000
      }
    }
  }
}
You can then make the second request by increasing the partition to 1, and so on:
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field",
        "include": {
          "partition": 1, // increase this up until partition 9
          "num_partitions": 10
        },
        "size": 10000
      }
    }
  }
}
To add this in your Java code, you can do it like this:
// Imports for the 5.x Java API (package paths may differ in later versions)
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.terms.support.IncludeExclude;

// Build the terms aggregation and request partition 0 of 10
TermsAggregationBuilder agg = AggregationBuilders.terms("my_terms").field("my_field");
agg.includeExclude(new IncludeExclude(0, 10));

Pipeline aggregations in ElasticSearch 1.5

I'm wondering if it is, in any way, possible to make ES run aggregations on other aggregations all in the same query?
Basically, that's called pipelining.
I'm talking about Elasticsearch 1.5. Yes, I know that's unfortunate, but I'm stuck with AWS and that's what they're selling; I have to live with that.
I'm guessing that is not possible, so I'll write the next phase of the question right away.
Assuming I can query ES multiple times based on results from previous queries, how would you do the following:
Have a list of the top 100 tags, sorted by the number of appearances in the documents (I have a tags field on each record; I'd like to know which tags are the most common) - in the past hour.
Having that, for each of the 100 tags, get the number of appearances split into 1-hour buckets (denote by Y the number representing the last hour).
Then, calculate by how many percent Y deviates from the average value of all the other 1-hour buckets.
Thank you for helping!
Basically, that's called pipelining.
No. Pipeline Aggregations did not appear until Elasticsearch 2.0. For what it's worth, Elastic does offer its own ESaaS offering with Elastic Cloud. It also runs on AWS.
... how would you do the following
The first two are more about narrowing the scope of the request than about computing on the values.
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-1h"
          }
        }
      }
    }
  }
}
This will give you the last hour of data.
{
  "size": 0,
  "aggs": {
    "group_by_tag": {
      "terms": {
        "field": "tag",
        "size": 100
      }
    }
  }
}
This will give you the top 100 tags for all time.
If you put them together, then you get the top 100 tags in the past hour.
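Put together, that is just the filtered query plus the terms aggregation from above:
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-1h"
          }
        }
      }
    }
  },
  "aggs": {
    "group_by_tag": {
      "terms": {
        "field": "tag",
        "size": 100
      }
    }
  }
}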
For the second request, it sounds like you want a mix of that, but you also want more than just the last hour.
Whenever performing an aggregation (or a GROUP BY query, for that matter), you need to think in incremental steps. If you want to group by hour and then do something, that's the order it needs to happen in. So it's not a matter of "now that I have the last hour, let's get the other hours too". Once you've narrowed your window (scope), you generally can't go back.
So to get number 2, we need to look at it differently. Group by as many hours as you're interested in looking at (how many 1-hour buckets do you want), then get those and then get the count per bucket. I'll take a guess and say that you want 24, 1-hour buckets (note 24 * 100 is 2400, which is not insignificant!).
That's a lot of buckets, so maybe we can think about the question differently.
I want the last-hour results of the top 100.
I want the average across all of the top 100 for X time (where you define X; reducing it will make things faster, but it is naturally limited to the selection window). By limiting with the filter, we reduce the scope of the overall aggregation.
This may look like this:
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-24h"
          }
        }
      }
    }
  },
  "aggs": {
    "group_by_hour_and_day": {
      "date_range": {
        "field": "timestamp",
        "ranges": [
          { "from": "now-1h" },
          { "to": "now-1h" }
        ]
      },
      "aggs": {
        "group_by_tag": {
          "terms": {
            "field": "tag",
            "size": 100
          }
        }
      }
    }
  }
}
The problem with this request is that it gives you now-24h to now-1h, then now-1h to now. That's loosely what you requested, but it doesn't give it by term (which may or may not matter); instead, it is given by time (again, steps and order matter). You can then say that the previous 24h average is the resulting doc count of the wider window, divided by the window size (23 in this case, for 23 hours). If you want to include the last hour in the average, then you can change "to": "now-1h" to "to": "now".
We can perhaps flip this to give us the answer differently, but with a little bit more effort (where query still limits by the max time range to consider):
{
  "size": 0,
  "query": { ... },
  "aggs": {
    "group_by_tag": {
      "terms": {
        "field": "tag",
        "size": 100
      },
      "aggs": {
        "group_by_range": {
          "date_range": {
            "field": "timestamp",
            "ranges": [
              { "from": "now-1h" },
              { "to": "now-1h" }
            ]
          }
        }
      }
    }
  }
}
Notice that now we aggregate by tag first across the full scope. You could remove the second date_range aggregation as a result because you now have the total for the time window. The problem with this approach is that you could end up with a very popular tag in the last hour that is not popular enough in the past full range, and so it won't appear at all.
The solution to that is to add an extra step unfortunately, by making two top-level aggregations. One for the top 100 in the full scope and one for the top 100 in the last hour.
{
  "size": 0,
  "query": { ... },
  "aggs": {
    "group_by_tag": {
      "terms": {
        "field": "tag",
        "size": 100
      }
    },
    "group_by_last_hour": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-1h"
          }
        }
      },
      "aggs": {
        "group_by_tag_last_hour": {
          "terms": {
            "field": "tag",
            "size": 100
          }
        }
      }
    }
  }
}
This gives the top 100 for the full window -- whatever that might be -- and then it also separately gives the top 100 for the last hour.
Then, calculate by how many percent Y deviates from the average value of all the other 1-hour buckets.
Do this on the client side based on whichever form you care to use, and calculate the average by cross-comparing.
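A back-of-the-envelope version of that calculation, assuming the 24-hour window from the earlier request (so there are 23 other 1-hour buckets; the names are just placeholders):
average_other_hours = (total_count_24h - Y) / 23
deviation_percent   = 100 * (Y - average_other_hours) / average_other_hours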
And considering the type of query, you should then cache the result, which allows you to play with larger window sizes than might be otherwise desirable.
