Elastic: refer to a calculated metric value inside the filter of another aggregation

I'm wondering if it is possible to refer to a computed metric value (I'm calculating the median of price in my documents) inside the filter of another aggregation.
Specifically, I know that I can calculate the median like this:
"aggs": {
  "median": {
    "percentiles": {
      "field": "price",
      "percents": [50]
    }
  },
  ...
}
But can I now refer to this value inside another aggregation, like this?
"aggs": {
  "exact": {
    "filter": {
      "bool": {
        "must": [
          {
            "range": {
              "price": {
                "gte": 1000,
                "lte": median
              }
            }
          }
        ]
      }
    }
  },
  ...
}
Please let me know if I can provide any more details. I've been reading Elastic docs all day and it seems like I could do it with some combination of scripting and pipeline aggregations, but I haven't figured it out yet.
Thanks in advance.
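One possible workaround, offered as a sketch rather than a confirmed answer: as far as I know, a filter aggregation cannot read the result of another metric aggregation within the same request, because pipeline aggregations work on the output of other aggregations and not on the documents being filtered. A client-side alternative is to issue two requests: the first computes the median, and the second plugs it into the range filter. The helper names below are hypothetical, and the response shape assumes the default keyed output of the percentiles aggregation.

```python
# Sketch only: two-request workaround built with plain dicts.
# All function names here are hypothetical helpers, not part of any client library.

def build_median_request(field="price"):
    """First request: compute the median with a percentiles aggregation."""
    return {
        "size": 0,
        "aggs": {
            "median": {"percentiles": {"field": field, "percents": [50]}}
        },
    }

def extract_median(response):
    """Read the 50th percentile from the first response (default keyed format)."""
    return response["aggregations"]["median"]["values"]["50.0"]

def build_filter_request(median, field="price", lower=1000):
    """Second request: use the median as the upper bound of the range filter."""
    return {
        "size": 0,
        "aggs": {
            "exact": {
                "filter": {
                    "bool": {
                        "must": [
                            {"range": {field: {"gte": lower, "lte": median}}}
                        ]
                    }
                }
            }
        },
    }

# Hand-written response in the shape the percentiles aggregation returns by default.
fake_response = {"aggregations": {"median": {"values": {"50.0": 1500.0}}}}
second_request = build_filter_request(extract_median(fake_response))
```

Both dicts can be sent as the body of a _search request with whichever client you use. Keep in mind that the percentiles aggregation is approximate, so the value is an estimate of the median.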

Related

How to sum the size of documents within a time interval?

I'm trying to estimate the total size of the documents in an index that fall within a time range, using the query below:
GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        {"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
      ]
    }
  },
  "aggs": {
    "total_size": {
      "sum": {
        "field": "doc['_source'].bytes"
      }
    }
  }
}
This returns documents, but the value of the aggregation is 0:
"aggregations": {
  "total_size": {
    "value": 0.0
  }
}
How can I sum the size of documents within a time interval?
The best way to achieve what you want is to actually add another field that contains the real source size at indexing time.
However, if you want to run it once to see what it looks like, you can leverage runtime fields to compute this at search time; just be aware that it can put a heavy burden on your cluster. Since the Painless scripting language doesn't yet provide a way to transform the source document back into the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source HashMap, yielding this:
GET /events/_search
{
  "runtime_mappings": {
    "source.size": {
      "type": "double",
      "script": """
        // Approximation: character count of the stringified _source, times 8
        def size = params._source.toString().length() * 8;
        emit(size);
      """
    }
  },
  "query": {
    "bool": {
      "must": [
        {"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
      ]
    }
  },
  "aggs": {
    "size": {
      "sum": {
        "field": "source.size"
      }
    }
  }
}
Another way is to install the Mapper size plugin so that you can make use of the _size field computed at indexing time.

ElasticSearch function score query (range filter)

I want to use document scoring instead of filtering.
As a user I can enter something like buyingPrice (from-to) 50-150€.
This works well with origin,offset,scale - e.g.:
gauss: {
  buyingPrice: {
    origin: 100€
    offset: 100€
    scale: 200€
  }
}
The problem arises when a user only enters one side, e.g. from 50€.
The expected behavior would be that all buyingPrices above 50€ get the full score, while the ones below 50€ get a lower score.
How can I achieve that with ElasticSearch?
You can add a filter inside function_score, so that the function only affects the documents matching that filter:
{
  "query": {
    "function_score": {
      "functions": [
        {...}, --> other functions
        {
          "filter": {
            "range": {
              "price": {
                "lte": 50
              }
            }
          },
          "gauss": {
            "price": {
              "origin": 50,
              "offset": 0,
              "scale": 200
            }
          }
        }
      ]
    }
  }
}
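To get a feel for what the filtered gauss function does to the score, here is a small sketch of the Gaussian decay formula as described in the function_score documentation (default decay of 0.5). The price_factor helper is hypothetical, and it assumes that a document not matching the function's filter receives a neutral factor of 1 from this function; the other elided functions are ignored here.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 within `offset` of `origin`, `decay` at distance offset + scale."""
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))

def price_factor(price, lower_bound=50.0, scale=200.0):
    """Hypothetical helper mirroring the filtered function above.

    Assumption: prices above the bound do not match the filter, so this
    function leaves their score untouched (factor 1.0).
    """
    if price <= lower_bound:
        return gauss_decay(price, origin=lower_bound, scale=scale)
    return 1.0

# Factors for a few sample prices around the 50 bound.
for sample in (0, 25, 50, 100, 500):
    print(sample, round(price_factor(sample), 4))
```

With scale set to 200 the drop-off below 50 is quite gentle; a smaller scale would penalize low prices more sharply.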

Filter documents prior to bucketing using GeoTile Aggregation in Elasticsearch

I am looking for an example where documents are filtered prior to bucketing via the GeoTile aggregation. For example, I would like to have buckets that hold the number of documents where some value is greater than x. Any pointers would be appreciated. Right now I have:
{
  "aggs": {
    "avg_my_field": {
      "avg": {
        "field": "properties.my_field"
      }
    },
    "aggs": {
      "large-grid": {
        "geotile_grid": {
          "field": "coordinates",
          "precision": 8
        }
      }
    }
  }
}
I don't know where to go from here.
Simply add a top-level filter aggregation.
In pseudo code:
POST /your-index/_search
{
  aggs:
    filter_agg_name:
      filter:
        ...actual filters
      aggs:
        ...the rest of your aggs
}
Applied to your particular use case:
POST _search
{
  "aggs": {
    "my_applicable_filters": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "some_numeric_or_date_field": {
                  "gte": 42
                }
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_my_field": {
          "avg": {
            "field": "properties.my_field"
          }
        },
        "large-grid": {
          "geotile_grid": {
            "field": "coordinates",
            "precision": 8
          }
        }
      }
    }
  }
}
Note that your original aggregation query wasn't syntactically correct. You were close but keep in mind that:
1. Some aggregations can have direct children (sub-aggregations) of the form:
POST /your-index/_search
{
  aggs:
    top_level_agg_name:
      agg_type:
        ...agg_def
      aggs:
        1st_child_name:
          ...1st_child_defs
        2nd_child_name:
          ...2nd_child_defs
        ...
}
I said some because the avg aggregation does not support sub-aggregations (since it's not a bucket aggregation). That's the reason I've applied the following instead:
2. Aggregations can run irrespective of each other while specified in a single request:
POST /your-index/_search
{
  aggs:
    some_agg_name:
      agg_type:
        ...agg_def
    other_agg_name:
      agg_type:
        ...agg_def
    ...
}
That way, you can get the average of properties.my_field AND geo-cluster your coordinates at the same time.
Conversely, since geotile_grid is indeed a bucket aggregation capable of accepting sub-aggregations, you can first group your docs by the corresponding geotile and then calculate the average within each tile. Now that I think about it, that may have been your original intent 😉.
Speaking of moments of clarity, you can learn a lot about how aggregations relate to each other in my recently released Elasticsearch Handbook.
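A minimal sketch of that last variant (filter first, then bucket by tile, then average per tile), written as a Python dict so the nesting is easy to see. The function name is hypothetical; the field names are the ones used in the answer above.

```python
def build_tile_average_request(threshold=42, precision=8):
    """Filter docs, bucket them by geotile, and compute an average inside each tile."""
    return {
        "size": 0,
        "aggs": {
            "my_applicable_filters": {
                "filter": {
                    "range": {"some_numeric_or_date_field": {"gte": threshold}}
                },
                "aggs": {
                    "large-grid": {
                        "geotile_grid": {
                            "field": "coordinates",
                            "precision": precision,
                        },
                        # The avg is a child of the bucket aggregation,
                        # so it is computed once per tile.
                        "aggs": {
                            "avg_my_field": {
                                "avg": {"field": "properties.my_field"}
                            }
                        },
                    }
                },
            }
        },
    }

request_body = build_tile_average_request()
```

Each bucket in the response should then carry its own doc_count and an avg_my_field value.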

Can Elasticsearch search by geo distance and other attributes at the same time?

In Elasticsearch, when I search by geo-distance to a point, can I at the same time filter by another attribute, such as a number being within a range, so that both filters need to be true for the result to come back?
Sure, use a bool query, where you can specify multiple clauses in the must and/or filter blocks. Be aware that clauses in the must block contribute to the relevance score, while clauses in the filter block do not (read more about query and filter context).
For example, this query searches by geo-distance, contributing to the score, and at the same time filters on an age range without contributing to the score:
{
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "100km",
            "pin.location": {
              "lat": 38.889248,
              "lon": -77.050636
            }
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 18,
              "lte": 65
            }
          }
        }
      ]
    }
  }
}
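If you don't need the geo clause to influence scoring at all, a variant is to put both clauses in the filter block, so that both must match and neither contributes to the score. A small sketch as a Python dict (the function name and its defaults are hypothetical; the field names come from the example above):

```python
def build_geo_and_range_filter(lat, lon, distance="100km", min_age=18, max_age=65):
    """Both conditions must hold; neither affects the relevance score."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_distance": {
                            "distance": distance,
                            "pin.location": {"lat": lat, "lon": lon},
                        }
                    },
                    {"range": {"age": {"gte": min_age, "lte": max_age}}},
                ]
            }
        }
    }

request_body = build_geo_and_range_filter(38.889248, -77.050636)
```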

Is there a way to have elasticsearch return a hit per generated bucket during an aggregation?

Right now I have a query like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
          }
        },
        {
          "range": {
            "date": {
              "from": "now-12h",
              "to": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "query": {
      "terms": [
        {
          "field": "query",
          "size": 3
        }
      ]
    }
  }
}
The aggregation works perfectly well, but I can't seem to find a way to control the hit data that is returned. I can use the size parameter at the top of the DSL, but the hits are not returned in the same order as the buckets, so the bucket results do not line up with the hit results. Is there any way to correct this, or do I have to issue two separate queries?
To expand on Filipe's answer, it seems like the top_hits aggregation is what you are looking for, e.g.
{
  "query": {
    ... snip ...
  },
  "aggs": {
    "query": {
      "terms": {
        "field": "query",
        "size": 3
      },
      "aggs": {
        "top": {
          "top_hits": {
            "size": 42
          }
        }
      }
    }
  }
}
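On the client side, the hits then arrive nested inside each bucket, so they line up with the buckets by construction. A minimal sketch of reading them out, assuming the usual response layout for a terms aggregation with a top_hits child (the helper name and the sample response are made up for illustration):

```python
def hits_by_bucket(response, agg_name="query", top_name="top"):
    """Map each terms bucket key to the _source of its top hits, in bucket order."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {
        bucket["key"]: [hit["_source"] for hit in bucket[top_name]["hits"]["hits"]]
        for bucket in buckets
    }

# Hand-written response fragment, for illustration only.
fake_response = {
    "aggregations": {
        "query": {
            "buckets": [
                {
                    "key": "a",
                    "doc_count": 2,
                    "top": {"hits": {"hits": [{"_source": {"query": "a", "n": 1}},
                                              {"_source": {"query": "a", "n": 2}}]}},
                },
                {
                    "key": "b",
                    "doc_count": 1,
                    "top": {"hits": {"hits": [{"_source": {"query": "b", "n": 3}}]}},
                },
            ]
        }
    }
}

grouped = hits_by_bucket(fake_response)
```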
Your query uses exact matches (match and range) and binary logic (must, bool), and thus should probably be converted to use filters instead (this is the Elasticsearch 1.x syntax; the filtered query was deprecated in 2.0 and removed in 5.0 in favor of a bool query with a filter clause):
"filtered": {
  "filter": {
    "bool": {
      "must": [
        {
          "term": {
            "uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
          }
        },
        {
          "range": {
            "date": {
              "from": "now-12h",
              "to": "now"
            }
          }
        }
      ]
    }
  }
}
As for the aggregations,
"The hits that are returned do not represent all the buckets that were returned. So if I have buckets for terms 'a', 'b', and 'c', I want to have hits that represent those buckets as well."
Perhaps you are looking to control the scope of the buckets? You can make an aggregation bucket global so that it will not be influenced by the query or filter.
Keep in mind that Elasticsearch will not "group" hits in any way -- it is always a flat list ordered according to score and additional sorting options.
Aggregations can be organized in a nested structure and return computed or extracted values, in a specific order. In the case of terms aggregation, it is in descending count (highest number of hits first). The hits section of the response is never influenced by your choice of aggregations. Similarly, you cannot find hits in the aggregation sections.
If your goal is to group documents by a certain field, yes, you will need to run multiple queries in the current Elasticsearch release.
I'm not 100% sure, but I think there's no way to do that in the current version of Elasticsearch (1.2.x). The good news is that there will be when version 1.3.x gets released:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
