Elasticsearch - order with min in aggregation

I have objects in the index that are related by an id, which groups them.
The group creation time is the time between the min createdAt object in the group and the max createdAt object in the group.
I'd like to order these groups by the min or max time, how can I do this?
{
"size":0,
"aggs":{
"intervals":{
"composite":{
"size":10000,
"sources":[
{
"totalId":{
"terms":{
"field":"totalId"
}
},
"name": {
"terms":{
"field":"name"
}
}
}
]
},
"aggs": {
"createdAtStart": {
"min": {"field": "createdAt", "format": "YYYY-MM-DD'T'HH:mm:ssZ"}, "order": { "createdAtStart": "desc" }
},
"createdAtEnd": {
"max": {"field": "createdAt", "format": "YYYY-MM-DD'T'HH:mm:ssZ"}
}
}
}
}
I'm using order wrong:
Found two aggregation type definitions

You cannot achieve that with a composite aggregation, because its terms source cannot be ordered by the values of a sub-aggregation, as is possible with a "normal" terms aggregation. (Also, the date formats are wrong: the pattern should use yyyy and dd rather than YYYY and DD.)
So the correct query that will give you what you want is this one:
{
"size": 0,
"aggs": {
"totalId": {
"terms": {
"field": "totalId",
"order": {
"createdAtStart": "asc"
}
},
"aggs": {
"createdAtStart": {
"min": {
"field": "createdAt",
"format": "yyyy-MM-dd'T'HH:mm:ssZ"
}
},
"createdAtEnd": {
"max": {
"field": "createdAt",
"format": "yyyy-MM-dd'T'HH:mm:ssZ"
}
}
}
}
}
}
Because of the way the composite aggregation works, it's not possible to achieve what you want with it. The composite aggregation was created to "paginate" over a large number of buckets, and that pagination is defined by the way the buckets are ordered. If it were possible to sort buckets according to sub-aggregations, all buckets would need to be pre-computed and pre-sorted before returning the first page of results, which would completely defeat the purpose of this aggregation.
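If the composite aggregation is still needed (for example, to page through every group), one possible workaround is to collect the buckets and order them on the client. The following is a minimal Python sketch, assuming the buckets have already been fetched and have the shape Elasticsearch returns for the query in the question. The helper name sort_groups_by_start and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def sort_groups_by_start(
    buckets: List[Dict[str, Any]], descending: bool = False
) -> List[Dict[str, Any]]:
    """Order buckets by the value of their createdAtStart (min) sub-aggregation."""
    # A min aggregation reports its result under "value"
    # (epoch milliseconds for date fields).
    return sorted(
        buckets,
        key=lambda b: b["createdAtStart"]["value"],
        reverse=descending,
    )


# Hypothetical buckets, shaped like response["aggregations"]["intervals"]["buckets"].
sample_buckets = [
    {"key": {"totalId": "b"}, "createdAtStart": {"value": 1600000300000.0}},
    {"key": {"totalId": "c"}, "createdAtStart": {"value": 1600000900000.0}},
    {"key": {"totalId": "a"}, "createdAtStart": {"value": 1600000100000.0}},
]

ordered_ids = [b["key"]["totalId"] for b in sort_groups_by_start(sample_buckets)]
print(ordered_ids)
```

This only gives a correct global order once every page of the composite aggregation has been collected, which is the pre-computation cost described above, moved to the client.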

You are adding an extra {
{
"size": 0,
"aggs": {
"intervals": {
"composite": {
"size": 10000,
"sources": [
{
"totalId": {
"terms": {
"field": "totalId"
}
}
}
] <-- note this
},
"aggs": {
"createdAtStart": {
"min": {
"field": "createdAt",
"format": "YYYY-MM-DD'T'HH:mm:ssZ"
},
"order": {
"createdAtStart": "desc"
}
},
"createdAtEnd": {
"max": {
"field": "createdAt",
"format": "YYYY-MM-DD'T'HH:mm:ssZ"
}
}
}
}
}
}

Related

Bucket sort on dynamic aggregation name

I would like to sort my aggregations by their quantity value.
My problem is that each aggregation has a name that cannot be known in advance.
Given this query:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"datetime": {
"gte": "2021-01-01",
"lte": "2021-12-09"
}
}
}
]
}
},
"aggs": {
"sorting": {
"bucket_sort": {
"sort": [
{
"year>quantity": {
"order": "desc"
}
}
]
}
},
"UNKNOWN_1": {
"aggs": {
"year": {
"filter": {
"bool": {
"must": [
{
"range": {
"datetime": {
"gte": "2021-01-01",
"lte": "2021-12-09"
}
}
}
]
}
},
"aggs": {
"quantity": {
"sum": {
"field": "item.quantity"
}
}
}
}
}
},
"UNKNOWN_2": {
"aggs": {
"year": {
"aggs": {
"quantity": {
"sum": {
"field": "item.quantity"
}
}
}
}
}
},
....
}
}
My bucket_sort aggregation is missing one level in its path to reach that quantity value.
Here is one Elasticsearch record:
{
datetime: '2021-12-01',
item.quantity: 5
}
Note that I have removed most of the request for readability (filter aggregations, etc.).
I tried something with a wildcard:
"sorting": {
"bucket_sort": {
"sort": [
{
"*>year>quantity": {
"order": "desc"
}
}
]
}
},
But I got the same error.
Is it possible to achieve this behaviour?
I think you misunderstood the bucket_sort aggregation: it does not sort your aggregations, it sorts the buckets produced by one multi-bucket aggregation. The bucket_sort aggregation also has to be a sub-aggregation of that multi-bucket aggregation.
From the docs:
[The bucket sort aggregation is] "a parent pipeline aggregation which sorts the buckets of its parent multi-bucket aggregation"
If I understand correctly, you are trying to create "buckets" with specific filter aggregations, and you can't know in advance how many of those filter aggregations you will create.
For that you can use the filters aggregation, where you can specify as many named filters as you want and each of them creates a bucket.
Under that filters aggregation you can create one single sum aggregation on item.quantity.
Also under the filters aggregation, you then add your bucket_sort aggregation, where you just have to name the sibling sum aggregation.
All in all, it might look like this:
{
"aggs": {
"your_filters": {
"filters": {
"filters": {
"unknown_1": {
"range": {
"datetime": {
"gte": "2021-01-01",
"lte": "2021-12-09"
}
}
},
"unknown_2": {
/** more filters here... **/
}
}
},
"aggs": {
"quantity": {
"sum": {
"field": "item.quantity"
}
},
"sorting": {
"bucket_sort": {
"sort": [
{ "quantity": { "order": "desc" } }
]
}
}
}
}
}
}
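Since the filter names are only known at runtime, the request body can be generated. Below is a minimal Python sketch of that idea, assuming the filters arrive as a dictionary of name to query clause. The function name build_sorted_filters_agg and the second filter's date range are hypothetical; the aggregation names match the query above.

```python
import json
from typing import Any, Dict


def build_sorted_filters_agg(
    named_filters: Dict[str, Dict[str, Any]], sum_field: str = "item.quantity"
) -> Dict[str, Any]:
    """Build a search body with one bucket per named filter, sorted by summed quantity."""
    return {
        "size": 0,
        "aggs": {
            "your_filters": {
                # Each entry of named_filters becomes one bucket.
                "filters": {"filters": named_filters},
                "aggs": {
                    "quantity": {"sum": {"field": sum_field}},
                    # bucket_sort refers to its sibling "quantity" by name.
                    "sorting": {
                        "bucket_sort": {"sort": [{"quantity": {"order": "desc"}}]}
                    },
                },
            }
        },
    }


# Hypothetical filter names; in practice these would be produced at runtime.
filters = {
    "unknown_1": {"range": {"datetime": {"gte": "2021-01-01", "lte": "2021-12-09"}}},
    "unknown_2": {"range": {"datetime": {"gte": "2020-01-01", "lte": "2020-12-31"}}},
}
body = build_sorted_filters_agg(filters)
print(json.dumps(body, indent=2))
```

The sort path stays the fixed name quantity no matter how many filters are generated, which is what removes the need for a wildcard.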

Elasticsearch sort data upon all buckets

I am trying to sort results in Elasticsearch, but I am struggling.
The background of my data is that a product definition can consist of various products. (We call them abstract and concrete.)
Let's say I have product A that is abstract; it can consist of products B, C and D (called concretes).
I also have, for example, product E, which can have F as a concrete, and so on.
I want to aggregate the products by their abstract (to show only one concrete per abstract) and then sort all concretes based on some criteria.
I have written the following, which doesn't work as expected.
"aggs": {
"category:58": {
"aggs": {
"products": {
"aggs": {
"abstract": {
"top_hits": {
"size": 1,
"sort": [
{
"criteria1": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
},
{
"criteria3": {
"missing": "_last",
"order": "asc",
"unmapped_type": "integer"
}
}
]
}
}
},
"terms": {
"field": "abstract_id",
"size": 10
}
}
},
"filter": {
"term": {
"categories.id": {
"value": "58"
}
}
}
}
},
If I got it correctly this will create 10 buckets and each bucket will have one product, and then my sort sorts a single product, where I should be sorting the entire result. The question is where do I place my sort that is currently in aggs->abstract.
If I remove the grouping by abstract_id and change it to something that is unique then the sorting does work, but then for one abstract product I can get all concretes displayed which I don't want to be the case.
I saw that I can't sort on terms so I'm kinda clueless now.
I ended up using multiple aggregations and then doing a bucket sort.
The query I ended up with looks like this
"aggs": {
"abstract": {
"top_hits": {
"size": 1
}
},
"criteria3": {
"sum": {
"field": "custom_filed_foo_bar"
}
},
"criteria1": {
"sum": {
"field": "boosted_value"
}
},
"criteria2": {
"max": {
"script":{
"source": "_score"
}
}
},
"sorting": {
"bucket_sort": {
"sort": [
{
"criteria1": {
"order": "desc"
}
},
{
"criteria2": {
"order": "desc"
}
},
{
"criteria3": {
"order": "desc"
}
}
]
}
}
I don't know if it's the correct approach, but it seems to be working.
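To make the effect of listing several criteria concrete, here is a small Python sketch that orders bucket-shaped dictionaries the same way on the client: the first criterion decides, and later ones only break ties. It is an illustration under assumptions, not Elasticsearch code; the function name sort_buckets_desc and the sample values are made up.

```python
from typing import Any, Dict, List


def sort_buckets_desc(
    buckets: List[Dict[str, Any]], criteria: List[str]
) -> List[Dict[str, Any]]:
    """Sort buckets by several metric sub-aggregations, all in descending order."""
    # Tuples compare element by element, so earlier criteria take precedence.
    return sorted(
        buckets,
        key=lambda b: tuple(b[name]["value"] for name in criteria),
        reverse=True,
    )


# Hypothetical buckets keyed by abstract_id, one metric value per criterion.
sample_buckets = [
    {"key": 1, "criteria1": {"value": 5.0}, "criteria2": {"value": 0.3}, "criteria3": {"value": 2.0}},
    {"key": 2, "criteria1": {"value": 5.0}, "criteria2": {"value": 0.8}, "criteria3": {"value": 1.0}},
    {"key": 3, "criteria1": {"value": 9.0}, "criteria2": {"value": 0.1}, "criteria3": {"value": 7.0}},
]

ordered_keys = [
    b["key"]
    for b in sort_buckets_desc(sample_buckets, ["criteria1", "criteria2", "criteria3"])
]
print(ordered_keys)
```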

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{
"terms": {
"type": ["manager", "lead"]
}
}
]
}
}
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.
I would suggest using the bucket_script aggregation. As far as I know, it is a parent pipeline aggregation, so it has to run as a sub-aggregation of a multi-bucket aggregation such as a date histogram. Since you have such a date field in your mapping, I think this query should work for you:
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "my_datetime",
"interval": "month"
},
"aggs": {
"role_type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"role_1_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_1 / (params.role_1+params.role_2)*100"
}
},
"role_2_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_2 / (params.role_1+params.role_2)*100"
}
}
}
}
}
}
Please let me know if it didn't work well for you.
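The arithmetic done by the two bucket_script scripts can be checked in isolation. This is a minimal Python sketch of the same formula, assuming the two counts have already been read from the role_type buckets. The function name role_ratios and the sample counts are hypothetical, and the guard against an empty bucket is an addition that the scripts above do not contain.

```python
from typing import Dict, Optional


def role_ratios(manager_count: int, lead_count: int) -> Optional[Dict[str, float]]:
    """Return the percentage share of each role, or None when both counts are zero."""
    total = manager_count + lead_count
    if total == 0:
        # Avoid dividing by zero for a histogram bucket with no matching users.
        return None
    return {
        "manager": manager_count / total * 100,
        "lead": lead_count / total * 100,
    }


ratios = role_ratios(3, 1)
print(ratios)
```

Note that this only reports the mix of types within each bucket; it does not change which documents the search returns.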

Elasticsearch Aggregations: Only return results of one of them?

I'm trying to find a way to only return the results of one aggregation in an Elasticsearch query. I have a max bucket aggregation (the one that I want to see) that is calculated from a sum bucket aggregation based on a date histogram aggregation. Right now, I have to go through 1,440 results to get to the one I want to see. I've already removed the results of the base query with the size: 0 modifier, but is there a way to do something similar with the aggregations as well? I've tried slipping the same thing into a few places with no luck.
Here's the query:
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2018-11-28",
"lte": "2018-11-28"
}
}
},
"aggs": {
"hits_per_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "minute"
},
"aggs": {
"total_hits": {
"sum": {
"field": "hits_count"
}
}
}
},
"max_transactions_per_minute": {
"max_bucket": {
"buckets_path": "hits_per_minute>total_hits"
}
}
}
}
Fortunately enough, you can do that with bucket_sort aggregation, which was added in Elasticsearch 6.4.
Do it with bucket_sort
POST my_index/doc/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2018-11-28",
"lte": "2018-11-28"
}
}
},
"aggs": {
"hits_per_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "minute"
},
"aggs": {
"total_hits": {
"sum": {
"field": "hits_count"
}
},
"max_transactions_per_minute": {
"bucket_sort": {
"sort": [
{"total_hits": {"order": "desc"}}
],
"size": 1
}
}
}
}
}
}
This will give you a response like this:
{
...
"aggregations": {
"hits_per_minute": {
"buckets": [
{
"key_as_string": "2018-11-28T21:10:00.000Z",
"key": 1543957800000,
"doc_count": 3,
"total_hits": {
"value": 11
}
}
]
}
}
}
Note that there is no extra aggregation in the output and the output of hits_per_minute is truncated (because we asked to give exactly one, topmost bucket).
Do it with filter_path
There is also a generic way to filter the output of Elasticsearch: response filtering with the filter_path query parameter.
In this case it will be enough to just do the following query:
POST my_index/doc/_search?filter_path=aggregations.max_transactions_per_minute
{ ... (original query) ... }
That would give the response:
{
"aggregations": {
"max_transactions_per_minute": {
"value": 11,
"keys": [
"2018-12-04T21:10:00.000Z"
]
}
}
}
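For comparison, the value that max_bucket reports can also be derived on the client from the full list of histogram buckets. The Python sketch below assumes bucket dictionaries shaped like the date_histogram output shown earlier; the function name max_bucket and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def max_bucket(buckets: List[Dict[str, Any]], metric: str) -> Dict[str, Any]:
    """Return the highest metric value and the keys of every bucket that reaches it."""
    values = [b[metric]["value"] for b in buckets if b[metric]["value"] is not None]
    if not values:
        return {"value": None, "keys": []}
    top = max(values)
    keys = [b["key_as_string"] for b in buckets if b[metric]["value"] == top]
    return {"value": top, "keys": keys}


# Hypothetical per-minute buckets.
sample_buckets = [
    {"key_as_string": "2018-11-28T21:09:00.000Z", "total_hits": {"value": 4.0}},
    {"key_as_string": "2018-11-28T21:10:00.000Z", "total_hits": {"value": 11.0}},
    {"key_as_string": "2018-11-28T21:11:00.000Z", "total_hits": {"value": 7.0}},
]

result = max_bucket(sample_buckets, "total_hits")
print(result)
```

Doing this on the client still means transferring all 1,440 buckets, so the bucket_sort and filter_path approaches are preferable when only the maximum is needed.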

Elasticsearch aggregation doesn't work with nested-type fields

I can't make elasticsearch aggregation+filter to work with nested fields. The data schema (relevant part) is like this:
"mappings": {
"rb": {
"properties": {
"project": {
"type": "nested",
"properties": {
"age": {
"type": "long"
},
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Essentially "rb" object contains a nested field called "project" which contains two more fields - "name" and "age". Query I'm running:
"aggs": {
"root": {
"aggs": {
"group": {
"aggs": {
"filtered": {
"aggs": {
"order": {
"percentiles": {
"field": "project.age",
"percents": ["50"]
}
}
},
"filter": {
"range": {
"last_updated": {
"gte": "2015-01-01",
"lt": "2015-07-01"
}
}
}
}
},
"terms": {
"field": "project.name",
"min_doc_count": 5,
"order": {
"filtered>order.50": "asc"
},
"shard_size": 10,
"size": 10
}
}
},
"nested": {
"path": "project"
}
}
}
This query is supposed to produce top 10 projects (project.name field) which match the date filter, sorted by their median age, ignoring projects with less than 5 mentions in the database. Median should be calculated only for projects matching the filter (date range).
Despite there being more than a hundred thousand objects in the database, this query produces an empty list. No errors, just an empty response. I've tried it on both ES 1.6 and ES 2.0-beta.
I've reorganized your aggregation query a bit and could get some results showing up. The main point is that, since you are aggregating on a nested type, I took the filter aggregation on the last_updated field out of the nested part and moved it up the hierarchy to be the first aggregation (presumably last_updated lives on the root document, which a filter running inside the nested context cannot see). Then comes the nested aggregation on the project field, and finally the terms and the percentiles aggregations.
That seems to work out pretty well. Please try.
{
"size": 0,
"aggs": {
"filtered": {
"filter": {
"range": {
"last_updated": {
"gte": "2015-01-01",
"lt": "2015-07-01"
}
}
},
"aggs": {
"root": {
"nested": {
"path": "project"
},
"aggs": {
"group": {
"terms": {
"field": "project.name",
"min_doc_count": 5,
"shard_size": 10,
"order": {
"order.50": "asc"
},
"size": 10
},
"aggs": {
"order": {
"percentiles": {
"field": "project.age",
"percents": [
"50"
]
}
}
}
}
}
}
}
}
}
}
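A likely reason the order matters: nested objects are indexed as separate hidden documents, so a filter placed under the nested aggregation is evaluated against the project objects only. The toy Python model below illustrates that effect. It assumes last_updated is a field of the root document (the mapping excerpt in the question does not show it), and the documents and helper names are made up.

```python
from typing import Any, Dict, List

# Hypothetical root documents, each carrying a list of nested project objects.
docs = [
    {"last_updated": "2015-03-01", "project": [{"name": "alpha", "age": 10}, {"name": "beta", "age": 4}]},
    {"last_updated": "2014-11-20", "project": [{"name": "alpha", "age": 30}]},
]


def in_range(doc: Dict[str, Any]) -> bool:
    """Mimic the range filter; an object without last_updated never matches."""
    value = doc.get("last_updated")
    # ISO-8601 dates compare correctly as plain strings.
    return value is not None and "2015-01-01" <= value < "2015-07-01"


def nested(root_docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Mimic the nested aggregation: switch from root documents to project objects."""
    return [project for doc in root_docs for project in doc["project"]]


# Original order: nested first, then filter. The project objects have no last_updated.
nested_then_filter = [p for p in nested(docs) if in_range(p)]

# Reorganized order: filter the root documents first, then go nested.
filter_then_nested = nested([d for d in docs if in_range(d)])

print(nested_then_filter)
print(filter_then_nested)
```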
