Filter documents prior to bucketing using GeoTile Aggregation in Elasticsearch - elasticsearch

I am looking for an example where documents are filtered prior to bucketing via the GeoTile aggregation. For example, I would like to have buckets that hold the number of documents where some value is greater than x. Any pointers would be appreciated. Right now I have:
{
"aggs": {
"avg_my_field": {
"avg": {
"field": "properties.my_field"
}
},
"aggs": {
"large-grid": {
"geotile_grid": {
"field": "coordinates",
"precision": 8
}
}
}
}
}
I don't know where to go from here. Any pointers would be appreciated.

Simply add a top-level filter aggregation.
In pseudo code:
POST /your-index/_search
{
aggs:
filter_agg_name:
filter:
...actual filters
aggs:
...the rest of your aggs
}
Applied to your particular use case:
POST _search
{
"aggs": {
"my_applicable_filters": {
"filter": {
"bool": {
"must": [
{
"range": {
"some_numeric_or_date_field": {
"gte": 42
}
}
}
]
}
},
"aggs": {
"avg_my_field": {
"avg": {
"field": "properties.my_field"
}
},
"large-grid": {
"geotile_grid": {
"field": "coordinates",
"precision": 8
}
}
}
}
}
}
Note that your original aggregation query wasn't syntactically correct. You were close but keep in mind that:
1. Some aggregations can have direct children (sub-aggregations) of the form:
POST /your-index/_search
{
aggs:
top_level_agg_name:
agg_type:
...agg_def
aggs:
1st_child_name:
...1st_child_defs
2nd_child_name:
...2nd_child_defs
...
}
I said some because the avg aggregation does not support sub-aggregations (since it's not a bucket aggregation). That's the reason I've applied the following instead:
2. Aggregations can run irrespective of each other while specified in a single request:
POST /your-index/_search
{
aggs:
some_agg_name:
agg_type:
...agg_def
other_agg_name:
agg_type:
...agg_def
...
}
That way, you can get the average of properties.my_field AND geo-cluster your coordinates at the same time.
Conversely, when you realize that geotile_grid is indeed a bucket aggregation capable of accepting sub-aggregations, you can first group your docs by the corresponding geo hash and then calculate the average. Now that I think about it, that may've been your original intent 😉.
Speaking of moments of clarity, you can learn a lot about how aggregations relate to each other in my recently released Elasticsearch Handbook.

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic, it contains a field filename and a date when to indicate the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword",
"size": 1000
}
}
},
"size": 0
}
Now, I want to have a paginated result. The term aggreation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads_agg": {
"composite": {
"size": 100,
"sources": [
{
"downloads": {
"terms": {
"field": "filename.keyword"
}
}
}
]
}
}
},
"size": 0
}
This aggregation allows me to paginate (thanks to after_key value in response), but it is not sorted by the number of downloads - it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregation don't allow sorting based on the value field.
Excerpt from the discussion on elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-pack & Licensed) but you can try that out. Apart from this, I don't see a way to get the expected output.

Get max bucket of terms aggregation (with pipeline aggregation)

I was wondering how to get the bucket with the highest doc_count when using a terms aggregation with Elasticsearch. I'm using the Kibana sample data kibana_sample_data_flights:
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"destinations": {
"terms": {
"field": "DestCityName"
}
}
}
}
If there was a single bucket with the max doc_count value I could set the size of the terms aggregation to 1, however this doesn't work if there are two buckets with the same max doc_count value.
Since I came accross pipeline aggregations, I feel there should be an easy way to achieve this. The max bucket aggregation seems to be able to deal with multiple max buckets, since the guide says this:
[...] which identifies the bucket(s) with the maximum value of [...]
However the only way to make this work was using a work-around with a sub-aggregation using value_count:
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"destinations": {
"terms": {
"field": "DestCityName"
},
"aggs": {
"counter": {
"value_count": {
"field": "_id"
}
}
}
},
"max_destination": {
"max_bucket": {
"buckets_path": "destinations>counter"
}
}
}
}
a) Is there a better way in general, to find the terms bucket with the max value?
b) Is there a better way using pipeline aggrations?
Thanks in advance!
Well you can simplify as below and you don't need to make use of value_count aggregation.
However, unfortunately using max_bucket is the only way to get what you are looking for.
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"destinations": {
"terms": {
"field": "DestCityName"
}
},
"max_destination": {
"max_bucket": {
"buckets_path": "destinations>_count" <---- Note the usage of _count
}
}
}
}
Hope this helps!

"Filter then Aggregation" or just "Filter Aggregation"?

I am working on ES recently and I found that I could achieve the almost same result but I have no clear idea as to the DIFFERENCE between these two.
"Filter then Aggregation"
POST kibana_sample_data_flights/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"term": {
"DestCountry": "CA"
}
}
}
},
"aggs": {
"ca_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
"Filter Aggregation"
POST kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"ca": {
"filter": {
"term": {
"DestCountry": "CA"
}
},
"aggs": {
"_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
}
}
My Questions
Why there are two similar functions? I believe I am wrong about it but what's the difference then?
(please do ignore the result format, it's not the question I am asking ;p)
Which is better if I want to filter out the unrelated/unmatched and start the aggregation on lots of documents?
When you use it in "query", you're creating a context on ALL the docs in your index. In this case, it acts like a normal filter like: SELECT * FROM index WHERE (my_filter_condition1 AND my_filter_condition2 OR my_filter_condition3...).
When you use it in "aggs", you're creating a context on ALL the docs that might have (or haven't) been previously filtered. Let's say that if you have an structure like:
#OPTION A
{
"aggs":{
t_shirts" : {
"filter" : { "term": { "type": "t-shirt" } }
}
}
}
Without a "query", is exactly the same as having
#OPTION B
{
"query":{
"filter" : { "term": { "type": "t-shirt" } }
}
}
BUT the results will be returned in different fields.
In the Option A, the results will be returned in the aggregations field.
In the Option B, the results will be returned in the hits field.
I would recommend to apply your filters always on the query part, so you can work with subsecuent aggregations of the already filtered docs. Also because Aggrgegations cost more performance than queries.
Hope this is helpful! :D
Both filters, used in isolation, are equivalent. If you load no results (hits), then there is no difference. But you can combine listing and aggregations. You can query or filter your docs for listing, and calculate aggregations on bucket further limited by the aggs filter. Like this:
POST kibana_sample_data_flights/_search
{
"size": 100,
"query": {
"bool": {
"filter": {
"term": {
... some other filter
}
}
}
},
"aggs": {
"ca_filter": {
"term": {
"TestCountry": "CA"
}
},
"aggs": {
"ca_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
}
But more likely you will need the other way, ie. make aggregations on all docs, to display summary informations, while you display docs from specific query. In this case you need to combine aggragations with post_filter.
Answer from #Val's comment, I may just quote here for reference:
In option A, the aggregation will be run on ALL documents. In option B, the documents are first filtered and the aggregation will be run only on the selected documents. Say you have 10M documents and the filter select only a 100, it's pretty evident that option B will always be faster.

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain it's purpose first incase I'm about to spend a 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it too?
From documentation :
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration incase someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation will take the top N hits per shard from the core query and run aggregations over those. Then in the max aggregator that defines the bucket order I use the exact same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits are running over the same top N sets of items and the buckets will order by the max of the same score, generated from the same script. Unfortunately I still need run the script once to order the buckets and once to pick a top hit within the bucket, and you could use the rescore instead for the top hits ordering, but either way it has to run twice and I found it was faster as a sort script then as a rescore

Is there a way to have elasticsearch return a hit per generated bucket during an aggregation?

right now I have a query like this:
{
"query": {
"bool": {
"must": [
{
"match": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
},
"aggs": {
"query": {
"terms": [
{
"field": "query",
"size": 3
}
]
}
}
}
The aggregation works perfectly well, but I can't seem to find a way to control the hit data that is returned, I can use the size parameter at the top of the dsl, but the hits that are returned are not returned in the same order as the bucket so the bucket results do not line up with the hit results. Is there any way to correct this or do I have to issue 2 separate queries?
To expand on Filipe's answer, it seems like the top_hits aggregation is what you are looking for, e.g.
{
"query": {
... snip ...
},
"aggs": {
"query": {
"terms": {
"field": "query",
"size": 3
},
"aggs": {
"top": {
"top_hits": {
"size": 42
}
}
}
}
}
}
Your query uses exact matches (match and range) and binary logic (must, bool) and thus should probably be converted to use filters instead:
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
}
As for the aggregations,
The hits that are returned do not represent all the buckets that were returned. so if have buckets for terms 'a', 'b', and 'c' I want to have hits that represent those buckets as well
Perhaps you are looking to control the scope of the buckets? You can make an aggregation bucket global so that it will not be influenced by the query or filter.
Keep in mind that Elasticsearch will not "group" hits in any way -- it is always a flat list ordered according to score and additional sorting options.
Aggregations can be organized in a nested structure and return computed or extracted values, in a specific order. In the case of terms aggregation, it is in descending count (highest number of hits first). The hits section of the response is never influenced by your choice of aggregations. Similarly, you cannot find hits in the aggregation sections.
If your goal is to group documents by a certain field, yes, you will need to run multiple queries in the current Elasticsearch release.
I'm not 100% sure, but I think there's no way to do that in the current version of Elasticsearch (1.2.x). The good news is that there will be when version 1.3.x gets released:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

Resources