Limit The Number Of Results Processed By An Aggregation - elasticsearch

I have a query with an aggregation. I want the aggregation to only operate on the top 500 hits returned by the query.
For example, let's say I have an index of comments. I want to query the top 500 matching comments and aggregate them based on the poster, so that I may answer the question: "Who are the top kitten and puppy posters?".
The query might look something like this:
POST comments/_search
{
"query": {
"query_string": {
"query": "\"kittens\" OR \"puppies\"",
"default_field": "body"
}
},
"aggs": {
"posters": {
"terms": {
"field": "poster"
}
}
}
}
The problem is that, as far as I know, the aggregation will operate on ALL documents matching the query, not just the top 500.
Things I've Already Tried/Considered:
size at the query root only changes the number of hits returned by the query; it has no effect on the aggregation.
size inside the terms aggregation only affects the number of buckets returned.
There used to be a limit filter in older versions that limited the number of hits returned by a query (and therefore the number processed by the aggregation), but it was deprecated in favor of terminate_after.
terminate_after doesn't work either, because the results aren't sorted by score before collection stops, so I couldn't get the top 500, just an arbitrary set of 500.
Does anyone know how to limit the documents processed by an aggregation to only the top results?
EDIT: I'm using ES version 6.3

I think you are looking for the sampler aggregation. You will have to wrap your posters aggregation in a sampler aggregation.
The shard_size parameter is the maximum number of top-scoring documents, per shard, that the sub-aggregation will consider. In your case, 500.
{
"query": {
"query_string": {
"query": "\"kittens\" OR \"puppies\"",
"default_field": "body"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 500
},
"aggs": {
"posters": {
"terms": {
"field": "poster"
}
}
}
}
}
}
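A minimal sketch of building the request body above from client code, in Python. The helper name build_sampled_terms_query and its parameters are hypothetical, and the top-level "size": 0 is an assumption (it only suppresses the hit list); the sampler, shard_size, and terms keys are the ones from the answer.

```python
import json


def build_sampled_terms_query(query_text, field, sample_size=500, default_field="body"):
    """Return a search body whose terms aggregation only sees the
    top-scoring `sample_size` documents collected on each shard."""
    return {
        "size": 0,  # assumption: only the aggregation result is needed
        "query": {
            "query_string": {"query": query_text, "default_field": default_field}
        },
        "aggs": {
            "sample": {
                # shard_size is applied on every shard, not to the index as a whole
                "sampler": {"shard_size": sample_size},
                "aggs": {"posters": {"terms": {"field": field}}},
            }
        },
    }


body = build_sampled_terms_query('"kittens" OR "puppies"', "poster")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST comments/_search.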

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time a file is downloaded by a client. Each document is quite basic: it contains a filename field and a date field, when, that records the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword",
"size": 1000
}
}
},
"size": 0
}
Now, I want a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if a better-suited aggregation exists, it can be used here.
So for the moment, I have something like this:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads_agg": {
"composite": {
"size": 100,
"sources": [
{
"downloads": {
"terms": {
"field": "filename.keyword"
}
}
}
]
}
}
},
"size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads; it is sorted by filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting by a computed value such as the document count.
Excerpt from the discussion on elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack, and licensed), but you could try them. Apart from that, I don't see a way to get the expected output.
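One workaround that is not part of the answer above: when the number of distinct filenames is small enough to hold in memory, the client can page through the whole composite aggregation and sort the buckets itself. The following Python sketch assumes a hypothetical search callable that takes a request body and returns the parsed response; fake_search is a stand-in with made-up data so the example runs without a cluster.

```python
import copy


def all_buckets_sorted_by_count(search, body, agg_name="downloads_agg"):
    """Page through a composite aggregation with after_key and return
    every bucket, sorted by doc_count in descending order."""
    buckets = []
    after = None
    while True:
        page_body = copy.deepcopy(body)
        if after is not None:
            page_body["aggs"][agg_name]["composite"]["after"] = after
        result = search(page_body)["aggregations"][agg_name]
        buckets.extend(result["buckets"])
        after = result.get("after_key")
        if not result["buckets"] or after is None:
            break
    return sorted(buckets, key=lambda b: b["doc_count"], reverse=True)


def fake_search(body):
    """Hypothetical stand-in for a real client call, with made-up data."""
    data = [("a.txt", 2), ("b.txt", 7), ("c.txt", 5)]
    comp = body["aggs"]["downloads_agg"]["composite"]
    after = comp.get("after", {}).get("downloads")
    remaining = [d for d in data if after is None or d[0] > after]
    page = remaining[: comp["size"]]
    resp = {"buckets": [{"key": {"downloads": k}, "doc_count": c} for k, c in page]}
    if page:
        resp["after_key"] = {"downloads": page[-1][0]}
    return {"aggregations": {"downloads_agg": resp}}


request = {
    "size": 0,
    "aggs": {
        "downloads_agg": {
            "composite": {
                "size": 2,
                "sources": [{"downloads": {"terms": {"field": "filename.keyword"}}}],
            }
        }
    },
}
ranked = all_buckets_sorted_by_count(fake_search, request)
print([(b["key"]["downloads"], b["doc_count"]) for b in ranked])
```

Pagination over the sorted list then happens in the client (for example, by slicing ranked), at the cost of reading every bucket first.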

How to limit search results from each index in a multi index search query?

I am using Elasticsearch version 6.3 and I want to query across multiple indices. Elasticsearch supports this: I can give multiple indices as comma-separated values in the URL, with one query in the request body, and use the size parameter to limit the number of search results returned. However, this limits the size of the overall result set and might return no results from some indices, so instead I want to fetch the first n results from each index.
I tried the multi search API (_msearch), but with that it seems I have to give the same query and size for every index. That works, but I cannot get a single aggregation over the entire result. Is there any way to address both issues?
Solution 1:
You're on the right path with the _msearch query. What I would do is to issue one query per index (no aggregations!) with the size you want for that index, as well as another query just for the aggregations, like this:
{ "index": "index1" }
{ "size": 5, "query": { ... }}
{ "index": "index2" }
{ "size": 5, "query": { ... }}
{ "index": "index3" }
{ "size": 5, "query": { ... }}
{ "index": "index1,index2,index3" }
{ "size": 0, "query": { ... }, "aggs": { ... } }
So the first three queries will return document hits from each of the three indexes and the last query will return the aggregation computed on all indexes, but no documents.
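As a rough illustration of Solution 1, the newline-delimited _msearch body can be generated from a list of index names. This Python sketch uses a hypothetical helper, build_msearch_body; the match_all query and the by_index aggregation are placeholders standing in for the parts the answer elides with "...".

```python
import json


def build_msearch_body(indices, query, per_index_size, aggs):
    """Build an _msearch payload: one sized query per index, followed by
    one aggregation-only query across all indices."""
    lines = []
    for index in indices:
        lines.append({"index": index})
        lines.append({"size": per_index_size, "query": query})
    lines.append({"index": ",".join(indices)})
    lines.append({"size": 0, "query": query, "aggs": aggs})
    # _msearch expects one JSON object per line and a trailing newline
    return "\n".join(json.dumps(line) for line in lines) + "\n"


query = {"match_all": {}}  # placeholder: the real query is elided in the answer
aggs = {"by_index": {"terms": {"field": "_index"}}}  # placeholder aggregation
payload = build_msearch_body(["index1", "index2", "index3"], query, 5, aggs)
print(payload)
```

The responses array returned by _msearch follows the order of the requests, so the last entry holds the aggregation.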
Solution 2:
Another way to tackle this, if the size is small, is to send a single query, aggregate on the index name, and retrieve hits from each index using top_hits, like this:
POST index1,index2,index3/_search
{
"size": 0,
"query": { ... },
"aggs": {
"indexes": {
"terms": {
"field": "_index",
"size": 50
},
"aggs": {
"hits": {
"top_hits": {
"size": 5
}
}
}
}
}
}
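With Solution 2, the documents come back inside the aggregation buckets rather than in the top-level hits list. A small Python sketch of unpacking them, assuming the aggregation names indexes and hits from the request above; the helper name and the sample response data are made up for illustration.

```python
def hits_by_index(response, outer="indexes", inner="hits"):
    """Map each index name to the _source of the documents that its
    top_hits sub-aggregation returned."""
    grouped = {}
    for bucket in response["aggregations"][outer]["buckets"]:
        # top_hits nests its documents under <agg name> -> hits -> hits
        grouped[bucket["key"]] = [h["_source"] for h in bucket[inner]["hits"]["hits"]]
    return grouped


# Made-up response fragment in the shape a terms + top_hits request returns
sample_response = {
    "aggregations": {
        "indexes": {
            "buckets": [
                {
                    "key": "index1",
                    "doc_count": 2,
                    "hits": {"hits": {"hits": [{"_source": {"id": 1}}, {"_source": {"id": 2}}]}},
                },
                {
                    "key": "index2",
                    "doc_count": 1,
                    "hits": {"hits": {"hits": [{"_source": {"id": 3}}]}},
                },
            ]
        }
    }
}
grouped = hits_by_index(sample_response)
print(grouped)
```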

instruct elasticsearch to return random results from different types

I have an index in ES with, say, 3 types: A, B, and C. Each type holds 1000 products. When the user makes a query with no scoring, ES returns first all results from A, then all from B, and then all from C.
What I need is to present mixed results from the 3 types.
I looked into random scoring, but it's not quite what I need.
Any ideas?
Do you really need randomness, or simply 3 results from each type? Three results from each type can be obtained with the top hits aggregation. First you aggregate by the _type field, then the top hits aggregation is applied:
{
"query": {
"function_score": {
"query": {
"match_all": {
}
},
"random_score": {
"seed": 137677928418000
}
}
},
"aggs": {
"all_type": {
"terms": {
"field": "_type"
},
"aggs": {
"by_top_hit": {
"top_hits": {
"size": 3
}
}
}
}
}
}
Edit: I added random scoring to get random results. Getting a specific number of documents for each _type is difficult, I think; a workaround is probably to fetch enough documents from every _type.
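Once the per-type buckets are back, the mixing itself can be done in the client. This is a sketch of one possible approach (round-robin interleaving), not something the answer prescribes; the function name and the sample lists are hypothetical.

```python
from itertools import chain, zip_longest


def interleave(groups):
    """Take one item from each group in turn until all groups are exhausted."""
    missing = object()  # sentinel so that falsy items are not dropped by mistake
    mixed = chain.from_iterable(zip_longest(*groups, fillvalue=missing))
    return [item for item in mixed if item is not missing]


# Made-up top_hits results for the three types
type_a = ["a1", "a2", "a3"]
type_b = ["b1"]
type_c = ["c1", "c2"]
mixed_results = interleave([type_a, type_b, type_c])
print(mixed_results)
```

Groups of unequal length are handled by skipping the exhausted ones, so the output keeps alternating for as long as possible.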

Aggregation on top 100 documents sorted by a field

I would like to do a terms aggregation on top 100 documents sorted on a field (not relevance score!).
I know how to do the aggregation:
{
"query": {
"match_all" : {}
},
"aggs" : {
"mydata_agg" : {
"terms": {
"field" : "title"
}
}
}
}
and I know how to get top 100 documents sorted on a field:
{
"query": {
"match_all": {}
},
"sort": {
"units_sold": {
"order": "desc"
}
},
"size": 100
}
But how do I run the terms aggregation on those 100 sorted documents? I could use a range filter, but then I would need to specify the cutoff value of units_sold that yields the top 100 documents myself. I prefer to do everything in one query. Is that possible?
I have searched for a couple of hours but was unable to find a solution.
The terms aggregation creates buckets, and we need to sort the outcome of that first aggregation. This can be done using bucket_sort. Read this article for more information.
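If a single request is not a hard requirement, a different technique from the one suggested above is to fetch the top 100 documents with the sort query from the question and count the terms in the client. The Python sketch below assumes a hypothetical search callable that takes a request body and returns the parsed response; fake_search and its data are made up. Note that this counts whole field values, whereas a terms aggregation on an analyzed field would count indexed tokens.

```python
from collections import Counter


def terms_over_top_hits(search, sort_field="units_sold", term_field="title", top_n=100):
    """Fetch the top_n documents by sort_field and count term_field values
    client-side, most frequent first."""
    body = {
        "size": top_n,
        "query": {"match_all": {}},
        "sort": {sort_field: {"order": "desc"}},
        "_source": [term_field],  # only the counted field is needed
    }
    hits = search(body)["hits"]["hits"]
    counts = Counter(
        hit["_source"][term_field] for hit in hits if term_field in hit["_source"]
    )
    return counts.most_common()


def fake_search(body):
    """Hypothetical stand-in for a real client call, with made-up data."""
    docs = [
        {"title": "A", "units_sold": 9},
        {"title": "B", "units_sold": 8},
        {"title": "A", "units_sold": 7},
        {"title": "C", "units_sold": 1},
    ]
    top = sorted(docs, key=lambda d: d["units_sold"], reverse=True)[: body["size"]]
    return {"hits": {"hits": [{"_source": {"title": d["title"]}} for d in top]}}


top_terms = terms_over_top_hits(fake_search, top_n=3)
print(top_terms)
```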

Is there a way to have elasticsearch return a hit per generated bucket during an aggregation?

Right now I have a query like this:
{
"query": {
"bool": {
"must": [
{
"match": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
},
"aggs": {
"query": {
"terms": [
{
"field": "query",
"size": 3
}
]
}
}
}
The aggregation works perfectly well, but I can't find a way to control the hit data that is returned. I can use the size parameter at the top of the DSL, but the hits are not returned in the same order as the buckets, so the bucket results do not line up with the hit results. Is there any way to correct this, or do I have to issue 2 separate queries?
To expand on Filipe's answer, it seems like the top_hits aggregation is what you are looking for, e.g.
{
"query": {
... snip ...
},
"aggs": {
"query": {
"terms": {
"field": "query",
"size": 3
},
"aggs": {
"top": {
"top_hits": {
"size": 42
}
}
}
}
}
}
Your query uses exact matches (match and range) and binary logic (must, bool) and thus should probably be converted to use filters instead:
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
}
}
As for the aggregations, you wrote:
"The hits that are returned do not represent all the buckets that were returned. So if I have buckets for terms 'a', 'b', and 'c', I want to have hits that represent those buckets as well."
Perhaps you are looking to control the scope of the buckets? You can make an aggregation bucket global so that it will not be influenced by the query or filter.
Keep in mind that Elasticsearch will not "group" hits in any way -- it is always a flat list ordered according to score and additional sorting options.
Aggregations can be organized in a nested structure and return computed or extracted values, in a specific order. In the case of terms aggregation, it is in descending count (highest number of hits first). The hits section of the response is never influenced by your choice of aggregations. Similarly, you cannot find hits in the aggregation sections.
If your goal is to group documents by a certain field, yes, you will need to run multiple queries in the current Elasticsearch release.
I'm not 100% sure, but I think there's no way to do that in the current version of Elasticsearch (1.2.x). The good news is that there will be when version 1.3.x gets released:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
