Elasticsearch: approximating GROUP BY, SUM, SORT with PAGINATION

I'm using Elasticsearch, and now and then I see questions about aggregations with sorting or aggregations with paging, but I never see anything that combines GROUP BY, SUM, SORT, and PAGINATION. Written as SQL, what I'm looking for is this (without the PAGINATION):
select invoice_date, address_2, company_name, sum(amount)
from my_table
group by invoice_date, address_2, company_name
order by sum(amount) desc
I tried several techniques, such as the composite aggregation, but it appears I can't ORDER BY the summation with it.
# composite aggregation
POST /746ee3a6-2b87-4288-9f20-3bf3a9e47e93/_search
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{ "Address2": { "terms": { "field": "Address2" } } },
{ "Company_Description": { "terms": { "field": "Company_Description" } } },
{ "InvoiceDate": { "terms": { "field": "InvoiceDate" } } }
]
},
"aggregations": {
"summation": {
"sum": { "field": "GrossValue" }
}
}
}
}
}
I tried repeated nested aggregations, but I saw a comment somewhere saying that with many nested levels you can't ORDER BY either.
POST /746ee3a6-2b87-4288-9f20-3bf3a9e47e93/_search
{
"size":0,
"from":0,
"sort":[{"Address2":"asc"}],
"query":{"bool":{"must":[{"match":{"taxonomy_full_code":-1}}]}},
"track_total_hits":true,
"aggs":{
"agg0":{
"terms":{"field":"Address2"},
"aggs":{
"agg1":{
"terms":{"field":"Company_Description"},
"aggs":{
"agg2":{
"terms":{"field":"InvoiceDate"},
"aggs":{"sum(GrossValue)":{"sum":{"field":"GrossValue"}}
}
}
}
}
}
}
}
}
The same goes for the multi-terms aggregation:
# multi-term aggregation
POST /746ee3a6-2b87-4288-9f20-3bf3a9e47e93/_search
{
"size": 0,
"aggs": {
"rule_builder": {
"multi_terms": {
"terms": [
{"field": "Address2"},
{"field": "Company_Description"},
{"field": "InvoiceDate"}
]
},
"aggs":{
"sum(GrossValue)":{"sum":{"field":"GrossValue"}}
}
}
}
}
I'm using Elasticsearch 7.14. Is what I'm looking for possible in this version?
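One approach that may fit on 7.12 and later is a multi_terms aggregation ordered by a sum sub-aggregation, with a bucket_sort pipeline aggregation slicing out the requested page. The Python sketch below only builds the request body and has not been run against the index above; the function name, the aggregation names groups, total and page, and the page arithmetic are illustrative assumptions. The multi_terms size has to reach the end of the requested page, so deep pages get expensive, and ordering by a sub-aggregation may be approximate across shards.

```python
import json


def grouped_sum_page(group_fields, sum_field, page, page_size):
    """Build a body for GROUP BY group_fields ORDER BY SUM(sum_field) DESC, one page."""
    offset = page * page_size
    return {
        "size": 0,  # no hits, only aggregation buckets
        "aggs": {
            "groups": {
                "multi_terms": {
                    "terms": [{"field": f} for f in group_fields],
                    # must reach the end of the requested page
                    "size": offset + page_size,
                    "order": {"total": "desc"},
                },
                "aggs": {
                    "total": {"sum": {"field": sum_field}},
                    # keeps only buckets [offset, offset + page_size)
                    "page": {"bucket_sort": {"from": offset, "size": page_size}},
                },
            }
        },
    }


body = grouped_sum_page(
    ["InvoiceDate", "Address2", "Company_Description"], "GrossValue", page=2, page_size=10
)
print(json.dumps(body, indent=2))
```

The printed body can be sent with POST /<index>/_search; page is zero-based in this sketch.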

Related

Filter out terms aggregation buckets in elasticsearch after applying aggregation

Below is snapshot of the dataset:
recordNo  employeeId  employeeStatus  employeeAddr
1         employeeA   Permanent
2         employeeA                   ABC
3         employeeB   Contract
4         employeeB                   CDE
I want to get the list of employees along with employeeStatus and employeeAddr.
So I am using terms aggregation on employeeId and then using sub-aggregations of employeeStatus and employeeAddr to get these details.
Below query returns the results correctly.
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
}
}
}
}
}
Now I want only the employees which are in Permanent status. So I am applying filter aggregation.
{
"aggregations": {
"filter_Employee_employeeID": {
"filter": {
"bool": {
"must": [
{
"match": {
"employeeStatus": {"query": "Permanent"}
}
}
]
}
},
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
}
}
}
}
}
}
}
Now the problem is that the employeeAddr aggregation returns no buckets for employeeA because record 2 gets filtered out before the aggregation is done.
Assuming that I cannot modify the data set and I want to achieve the result with a single elastic query, how can I do it?
I checked the Bucket Selector pipeline aggregation but it only works for metric aggregations.
Is there a way to filter out term buckets after the aggregation is applied?
If I understood correctly you want to preserve the aggregations even if you use some kind of filter. To achieve that, try using the post_filter clause.
You can check the docs here
The clause is applied "outside" the aggregation: it filters the search hits after the aggregations have been computed. Using your example, it should look like this:
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {
"field": "employeeStatus"
}
},
"employeeAddr": {
"terms": {
"field": "employeeAddr"
}
}
}
}
},
"post_filter": {
"bool": {
"must": [
{
"match": {
"employeeStatus": {
"query": "Permanent"
}
}
}
]
}
}
}
Combining the include parameter of the terms aggregation with a bucket_selector on the bucket count gives the desired result.
Filtering term values is here.
Bucket selector using document count is here.
The subtlety here is that, yes, bucket_selector needs numeric values, but buckets_path can also reference special paths that Elasticsearch provides, such as _bucket_count.
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeId.keyword"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus", "include": "Permanent"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "employeeStatus._bucket_count"
},
"script": {
"source": "params.count != 0"
}
}
}
}
}
}
}
I tested this on 7.10 and it worked, returning only employeeA, with the address included.
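The effect of the bucket_selector can be cross-checked on the client. The Python sketch below applies the same rule to a response from the unfiltered Employee query: an employee bucket is kept only if its employeeStatus sub-aggregation contains the wanted status. The sample response is made up for illustration and follows the usual terms aggregation response layout.

```python
def employees_with_status(response, status):
    """Return Employee buckets whose employeeStatus sub-buckets include `status`."""
    kept = []
    for bucket in response["aggregations"]["Employee"]["buckets"]:
        statuses = {b["key"] for b in bucket["employeeStatus"]["buckets"]}
        if status in statuses:
            kept.append(bucket)
    return kept


# Hypothetical response mirroring the four sample records.
sample = {
    "aggregations": {
        "Employee": {
            "buckets": [
                {
                    "key": "employeeA",
                    "doc_count": 2,
                    "employeeStatus": {"buckets": [{"key": "Permanent", "doc_count": 1}]},
                    "employeeAddr": {"buckets": [{"key": "ABC", "doc_count": 1}]},
                },
                {
                    "key": "employeeB",
                    "doc_count": 2,
                    "employeeStatus": {"buckets": [{"key": "Contract", "doc_count": 1}]},
                    "employeeAddr": {"buckets": [{"key": "CDE", "doc_count": 1}]},
                },
            ]
        }
    }
}

permanent = employees_with_status(sample, "Permanent")
print([b["key"] for b in permanent])
```

Because the whole bucket is kept, the employeeAddr buckets of a matching employee stay available.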

Elasticsearch Pagination with timestamp range

The official Elasticsearch documentation says that pagination can be implemented with composite aggregations.
A composite aggregation has to be requested several times to fetch all results.
So my question is: can I use a range from now-1h to now when I run a composite aggregation?
If I can, how does the composite aggregation keep the source data unchanged when every request resolves now to a different time?
If I can't: my query below runs without error and the result seems to be right.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "now-1h"
}
}
}
]
}
},
"aggs": {
"user_device": {
"composite": {
"after": {
"user_name": "alen.lv"
},
"size": 100,
"sources": [
{
"user_name": {
"terms": {
"field": "user_name"
}
}
}
]
},
"aggs": {
"user_mac": {
"terms": {
"field": "user_mac",
"size": 1000
}
}
}
}
}
}
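One way to keep the window stable, offered here as an assumption and not as a confirmed answer, is to resolve now once on the client and send the same absolute bounds with every page, so that each request aggregates over the same time range. In the Python sketch below the helper name, the lt upper bound, and the ISO 8601 timestamp format are illustrative choices; documents indexed into that window between requests can still change the buckets.

```python
from datetime import datetime, timedelta, timezone


def composite_page_body(window_end, after=None, page_size=100):
    """Composite request over the fixed hour that ends at `window_end`."""
    window_start = window_end - timedelta(hours=1)
    composite = {
        "size": page_size,
        "sources": [{"user_name": {"terms": {"field": "user_name"}}}],
    }
    if after is not None:
        composite["after"] = after  # after_key returned by the previous page
    return {
        "size": 0,
        "query": {"bool": {"filter": [{"range": {"timestamp": {
            "gte": window_start.isoformat(),
            "lt": window_end.isoformat(),
        }}}]}},
        "aggs": {"user_device": {
            "composite": composite,
            "aggs": {"user_mac": {"terms": {"field": "user_mac", "size": 1000}}},
        }},
    }


# Fixed value for the example; in practice take datetime.now(timezone.utc) once
# and reuse it for every page.
window_end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = composite_page_body(window_end)
second = composite_page_body(window_end, after={"user_name": "alen.lv"})
print(first["query"] == second["query"])
```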

ElasticSearch - order with min in aggregation

I have objects in the index that are related by an id, which groups them.
The group creation time is the time between the min createdAt object in the group and the max createdAt object in the group.
I'd like to order these groups by the min or max time, how can I do this?
{
"size":0,
"aggs":{
"intervals":{
"composite":{
"size":10000,
"sources":[
{
"totalId":{
"terms":{
"field":"totalId"
}
},
"name": {
"terms":{
"field":"name"
}
}
}
]
},
"aggs": {
"createdAtStart": {
"min": {"field": "createdAt", "format": "YYYY-MM-DD'T'HH:mm:ssZ"}, "order": { "createdAtStart": "desc" }
},
"createdAtEnd": {
"max": {"field": "createdAt", "format": "YYYY-MM-DD'T'HH:mm:ssZ"}
}
}
}
}
}
I'm using order wrong:
Found two aggregation type definitions
You cannot achieve that with a composite aggregation because its terms source cannot be ordered by the values of a sub-aggregation, as is the case with a "normal" terms aggregation. (Also, the date formats are wrong.)
So the correct query that will give you what you want is this one:
{
"size": 0,
"aggs": {
"totalId": {
"terms": {
"field": "totalId",
"order": {
"createdAtStart": "asc"
}
},
"aggs": {
"createdAtStart": {
"min": {
"field": "createdAt",
"format": "yyyy-MM-dd'T'HH:mm:ssZ"
}
},
"createdAtEnd": {
"max": {
"field": "createdAt",
"format": "yyyy-MM-dd'T'HH:mm:ssZ"
}
}
}
}
}
}
Because of the way the composite aggregation works, it's not possible to achieve what you want. The composite aggregation was created in order to "paginate" over a large number of buckets, and that pagination is defined by the way the buckets are ordered. If it were possible to sort buckets according to sub-aggregations, all buckets would need to be pre-computed and pre-sorted before returning the first page of results, which would completely defeat the purpose of this aggregation.
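If the composite aggregation is still needed to walk every group, the ordering can be applied on the client once all pages have been collected. A minimal Python sketch, assuming buckets shaped like the composite response for the question's query; the sample values are invented:

```python
def sort_by_start(buckets, descending=False):
    """Order composite buckets by the value of their createdAtStart min sub-aggregation."""
    return sorted(buckets, key=lambda b: b["createdAtStart"]["value"], reverse=descending)


# Invented buckets; "value" is epoch milliseconds, as returned for date fields.
buckets = [
    {"key": {"totalId": "b"}, "createdAtStart": {"value": 1700000300000.0}},
    {"key": {"totalId": "a"}, "createdAtStart": {"value": 1700000100000.0}},
    {"key": {"totalId": "c"}, "createdAtStart": {"value": 1700000200000.0}},
]

ordered = sort_by_start(buckets)
print([b["key"]["totalId"] for b in ordered])
```

This only works when the full set of buckets fits in client memory, which is the cost the composite aggregation avoids on the server side.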
You are adding an extra {
{
"size": 0,
"aggs": {
"intervals": {
"composite": {
"size": 10000,
"sources": [
{
"totalId": {
"terms": {
"field": "totalId"
}
}
}
] <-- note this
},
"aggs": {
"createdAtStart": {
"min": {
"field": "createdAt",
"format": "YYYY-MM-DD'T'HH:mm:ssZ"
},
"order": {
"createdAtStart": "desc"
}
},
"createdAtEnd": {
"max": {
"field": "createdAt",
"format": "YYYY-MM-DD'T'HH:mm:ssZ"
}
}
}
}
}
}

sql to es : get limit page and order result on agg

SELECT
max( timestamp ) AS first_time,
min( timestamp ) AS last_time,
src_ip,
threat_target ,
count(*) as count
FROM
traffic
GROUP BY
src_ip,
threat_target
ORDER BY
first_time desc
LIMIT 0 ,10
I want to get this result in Elasticsearch, but I don't know how to limit the number of results or where to put the sort. This is what I have so far:
{
"size": 0,
"aggregations": {
"src_ip": {
"aggregations": {
"threat_target": {
"aggregations": {
"last_time": {
"max": {
"field": "timestamp"
}
},
"first_time": {
"min": {
"field": "timestamp"
}
}
},
"terms": {
"field": "threat_target.keyword"
}
}
},
"terms": {
"field": "src_ip.keyword"
}
}
}
}
Paginating aggregations is generally not supported in Elasticsearch; however, the composite aggregation provides a way to paginate your aggregation.
"Unlike the other multi-bucket aggregation the composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently."
Excerpt from the composite aggregation page of the Elasticsearch docs.
CHECK: THIS
Except for "ORDER BY first_time desc", the query below should run fine for you. I don't think ordering on any fields other than the grouping fields (src_ip, threat_target) is possible.
GET traffic/_search
{
"size": 0,
"aggs": {
"my_bucket": {
"composite": {
"size": 2, //<=========== PAGE SIZE
/*"after":{ // <========== INCLUDE THIS FROM Second request onwards, passing after_key of the last output here for next page
"src_ip" : "1.2.3.5",
"threat_target" : "T3"
},*/
"sources": [
{
"src_ip": {
"terms": {
"field": "src_ip.keyword",
"order": "desc"
}
}
},
{
"threat_target": {
"terms": {
"field": "threat_target.keyword"
}
}
}
]
},
"aggs": {
"first_time": {
"max": {
"field": "timestamp"
}
}
}
}
}
}
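The after_key handling described in the comments can be wrapped in a small loop. The Python sketch below assumes a caller-supplied search(body) function that sends the body to the traffic index and returns the parsed response. That function, the fake used to exercise it, and the src_ip.keyword field name are hypothetical stand-ins and not part of any client library.

```python
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation, following after_key page by page."""
    body = copy.deepcopy(body)  # do not mutate the caller's request
    while True:
        agg = search(body)["aggregations"][agg_name]
        yield from agg["buckets"]
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            break  # an empty page or a missing after_key means we are done
        body["aggs"][agg_name]["composite"]["after"] = after_key


def fake_search(body):
    """Hypothetical stand-in for a real search call: two pages, then an empty one."""
    pages = {
        None: (["1.2.3.6", "1.2.3.5"], {"src_ip": "1.2.3.5"}),
        "1.2.3.5": (["1.2.3.4"], {"src_ip": "1.2.3.4"}),
    }
    after = body["aggs"]["my_bucket"]["composite"].get("after", {}).get("src_ip")
    keys, after_key = pages.get(after, ([], None))
    agg = {"buckets": [{"key": {"src_ip": k}} for k in keys]}
    if after_key is not None:
        agg["after_key"] = after_key
    return {"aggregations": {"my_bucket": agg}}


request = {"size": 0, "aggs": {"my_bucket": {"composite": {
    "size": 2,
    "sources": [{"src_ip": {"terms": {"field": "src_ip.keyword", "order": "desc"}}}],
}}}}

all_keys = [b["key"]["src_ip"] for b in iter_composite_buckets(fake_search, request, "my_bucket")]
print(all_keys)
```

With a real client, search would be a thin wrapper around the search call for the index; the generator itself does not depend on how the request is sent.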

How to take best n Docs of multiple buckets and do sub aggregation on it

I'd like to take the best n documents per user, where the user is stored as user_id in my index.
On its own this isn't a problem.
It can be done like this:
{
"query":{
"match":{
"field":{
"query":"query_string"
}
}
},
"aggs":{
"group_by_user":{
"terms":{
"field":"user_id"
},
"aggs":{
"top_n":{
"top_hits":{
"size":10
}
}
}
}
}
}
But now I'd like to run a sub-aggregation on those documents to calculate some expensive scoring, and this isn't possible, because top_hits is a metric aggregation.
"aggs":{
"max_score_per_user":{
"max":{
"script":"advanced_scoring"
}
}
}
}
I also can't use the rescore feature with the window parameter, because I first have to bucket the documents per user and then take the best n per user.
The range query would work, but the results of the tf-idf scoring aren't comparable. So I can't define a proper range.
So is this just not possible or what am I doing wrong?
A metric aggregation such as max cannot have sub-aggregations, so make both max_score_per_user and top_n sub-aggregations of group_by_user; the user buckets can then be ordered by max_score_per_user:
{
"query": {
"match": {
"field": {
"query": "query_string"
}
}
},
"aggs": {
"group_by_user": {
"terms": {
"field": "user_id",
"order": {
"max_score_per_user": "desc"
}
},
"aggs": {
"max_score_per_user": {
"max": {
"script": "advanced_scoring"
}
},
"top_n": {
"top_hits": {
"size": 10
}
}
}
}
}
}
