Elasticsearch: sort terms aggregation buckets by non-key column

Data
I have objects persisted in an ES index. Each of them has myKey and myName string fields (persisted as keyword fields). There is no guarantee that myName will always be the same for a given myKey. E.g. the first two of the following entries share the same myKey but have different myName values:
{
  "myKey": "123asd",
  "myName": "United States",
  ...
},
{
  "myKey": "123asd",
  "myName": "United States of America",
  ...
},
{
  "myKey": "456fgh",
  "myName": "United Kingdom",
  ...
}
Challenge
I need to select and return all distinct myKey values, find and display the most likely myName (the one with the most occurrences within the context of its myKey), AND sort the resulting buckets by myName.
So far I have managed the following:
Select the distinct myKey values by using a terms aggregation.
Select the first corresponding myName value for each myKey by using a top_hits aggregation.
Sort by myKey using the order clause of the terms aggregation.
This is the code of the aggregation:
"aggs": {
"distinct": {
"terms": {
"field": "myKey",
"order": {
"_key": "desc" <----- this sorts the buckets by myKey
}
},
"aggs": {
"tops": {
"top_hits": {
"size": 1,
"_source": {
"includes": ["myName"]
}
}
}
}
}
I read up on the ES documentation explaining how one can order the buckets by a second aggregation returning a single metric. This appears to address numeric fields only, though; myName is not numeric.
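For reference, the pattern from the docs looks roughly like this (max_price and price are hypothetical names here; it only works because max yields a single numeric value per bucket):
"aggs": {
  "distinct": {
    "terms": {
      "field": "myKey",
      "order": { "max_price": "desc" }
    },
    "aggs": {
      "max_price": { "max": { "field": "price" } }
    }
  }
}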
Is there a way to sort the buckets in ES by myName?
Any help greatly appreciated.
Edit on 2 Sept 2020
At the request of user #joe, the current and the expected results are as follows.
Current result
As is apparent, the sorting of the buckets is based on the key: 123asd comes before 456fgh:
"aggregations" : {
"distinct" : {
"buckets" : [
{
"key" : "123asd",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United States"
}
}
]
}
}
},
{
"key" : "456fgh",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United Kingdom"
}
}
]
}
}
}
]
}
}
Expected result
The task is to sort the buckets based on the additionally selected field myName: United Kingdom comes before United States:
"aggregations" : {
"distinct" : {
"buckets" : [
{
"key" : "456fgh",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United Kingdom"
}
}
]
}
}
},
{
"key" : "123asd",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United States"
}
}
]
}
}
}
]
}
}

By ordering on _key: desc, you've only sorted the buckets alphabetically by their own keys...
Have you tried the following, which looks for the most frequent myNames under a given myKey?
{
  "size": 0,
  "aggs": {
    "by_key": {
      "terms": {
        "field": "myKey",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "by_name": {
          "terms": {
            "field": "myName",
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}
Or are you looking to sort the parent myKey agg by the result of the child myName agg?
EDIT
Sorting a parent agg by the result of a multi-bucket child aggregation results in the following error:
Buckets can only be sorted on a sub-aggregator path that is built out
of zero or more single-bucket aggregations within the path and a final
single-bucket or a metrics aggregation at the path end.
In other words, what you're trying to achieve is not possible, and the error above explains nicely why.
Had your child aggregation been numeric (or single-bucket), it would've been possible.
For now your only option appears to be post-processing (or rather post-sorting) the current response in the frontend (or wherever you're using these aggs).
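A minimal client-side sketch of that post-sorting in Python, assuming the response shape shown above (resp being the parsed JSON response):
def sort_buckets_by_name(resp):
    # Pull the terms buckets out of the "distinct" aggregation response.
    buckets = resp["aggregations"]["distinct"]["buckets"]
    # Each bucket carries exactly one top hit, whose _source holds myName;
    # sort the buckets by that value.
    return sorted(
        buckets,
        key=lambda b: b["tops"]["hits"]["hits"][0]["_source"]["myName"],
    )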

Related

How to filter by sub-aggregated results in Elasticsearch

I've got the following Elasticsearch query to get the number of product sales per hour, grouped by product id and hour of sale.
POST /my_sales/_search?size=0
{
  "aggs": {
    "sales_per_hour": {
      "date_histogram": {
        "field": "event_time",
        "fixed_interval": "1h",
        "format": "yyyy-MM-dd:HH:mm"
      },
      "aggs": {
        "sales_per_hour_per_product": {
          "terms": {
            "field": "name.keyword"
          }
        }
      }
    }
  }
}
One example of the data:
{
  "@timestamp" : "2020-10-29T18:09:56.921Z",
  "name" : "my-beautifull_product",
  "event_time" : "2020-10-17T08:01:33.397Z"
}
This query returns several buckets (one per hour and per product), but I would like to retrieve only those that have a doc_count higher than 10, for example. Is that possible?
For those results I would like to know the id of the product and the event_time bucket.
Thanks for your help.
Perhaps using the Bucket Selector feature will help filter the results.
Try the search query below:
{
  "aggs": {
    "sales_per_hour": {
      "date_histogram": {
        "field": "event_time",
        "fixed_interval": "1h",
        "format": "yyyy-MM-dd:HH:mm"
      },
      "aggs": {
        "sales_per_hour_per_product": {
          "terms": {
            "field": "name.keyword"
          },
          "aggs": {
            "the_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "the_doc_count": "_count"
                },
                "script": "params.the_doc_count > 10"
              }
            }
          }
        }
      }
    }
  }
}
It will keep only those product buckets whose document count is greater than 10, based on "params.the_doc_count > 10".
Thank you for your help. This is not far from what I would like, but not exactly; with the bucket selector I have something like this:
"aggregations" : {
"sales_per_hour" : {
"buckets" : [
{
"key_as_string" : "2020-08-31:23:00",
"key" : 1598914800000,
"doc_count" : 16,
"sales_per_hour_per_product" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "my_product_1",
"doc_count" : 2
},
{
"key" : "my_product_2",
"doc_count" : 2
},
{
"key" : "myproduct_3",
"doc_count" : 12
}
]
}
}
]
}
And sometimes none of the buckets has a count greater than 10. Is it possible to have the same thing, but with the filter on _count applied to the second-level aggregation (sales_per_hour_per_product) rather than to the first level (sales_per_hour)?
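If it helps, a sketch combining both levels might look as follows: min_doc_count on the product terms aggregation keeps only products with more than 10 sales in a given hour, and a bucket_selector on the parent drops hour buckets whose product aggregation ends up empty (the drop_empty_hours name is made up; _bucket_count is the documented way to reference a child aggregation's bucket count):
{
  "size": 0,
  "aggs": {
    "sales_per_hour": {
      "date_histogram": {
        "field": "event_time",
        "fixed_interval": "1h",
        "format": "yyyy-MM-dd:HH:mm"
      },
      "aggs": {
        "sales_per_hour_per_product": {
          "terms": {
            "field": "name.keyword",
            "min_doc_count": 11
          }
        },
        "drop_empty_hours": {
          "bucket_selector": {
            "buckets_path": {
              "product_buckets": "sales_per_hour_per_product._bucket_count"
            },
            "script": "params.product_buckets > 0"
          }
        }
      }
    }
  }
}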

combine output of a first filter as input of a second filter

We have an Elasticsearch instance with entries with two tagged fields.
sessionid
message
In a first filter, I find all entries where the message contains a certain substring. Each of those entries contains a sessionid.
In a second filter, I want to find all messages, where the sessionid matches one of the sessionids returned by the first filter. This filter should go through all entries a second time.
Example, in the log below (sessionid;message)
1234;miss 1
2456;miss 2
1234;match
When filtering for the string "match" in the message part, I would get as output of the combined query:
1234;miss 1
1234;match
We are using KQL.
Background: We want an easy way to follow complete flows with an error-string in a message, in a multithreaded environment.
I understand why you'd want to do that in one go, but it's not possible in Elasticsearch. You cannot "revisit" documents which you've already ruled out by a different query -- searching for match would disqualify all misses.
It's unfortunate you have the log message combined with the ID but you can try this:
Find all that match match (pun intended) -- I'm assuming you do have a keyword field available
GET your_index/_search
{
  "query": {
    "regexp": {
      "separated_msg.keyword": ".*\\;match.*"
    }
  }
}
Post-process the hits and extract the session IDs
Run session ID matching:
GET your_index/_search
{
  "query": {
    "regexp": {
      "separated_msg.keyword": "1234;.*"
    }
  }
}
or on multiple IDs using a bool should:
GET your_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "regexp": {
            "separated_msg.keyword": "1234;.*"
          }
        },
        {
          "regexp": {
            "separated_msg.keyword": "4567;.*"
          }
        }
      ]
    }
  }
}
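To tie the two passes together, here is a rough sketch in Python (hypothetical index name and a local node assumed; uses the requests library):
import requests

SEARCH_URL = "http://localhost:9200/your_index/_search"  # hypothetical index

# Pass 1: find all log lines whose message part contains "match"
# and collect the session IDs in front of the semicolon.
resp = requests.post(SEARCH_URL, json={
    "size": 1000,
    "query": {"regexp": {"separated_msg.keyword": ".*\\;match.*"}}
}).json()
session_ids = {hit["_source"]["separated_msg"].split(";", 1)[0]
               for hit in resp["hits"]["hits"]}

# Pass 2: fetch every log line that starts with one of those session IDs.
resp = requests.post(SEARCH_URL, json={
    "size": 1000,
    "query": {"bool": {"should": [
        {"regexp": {"separated_msg.keyword": f"{sid};.*"}} for sid in session_ids
    ]}}
}).json()
full_flows = [hit["_source"]["separated_msg"] for hit in resp["hits"]["hits"]]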
If a unique numeric value can be assigned to each message (e.g. 1 for "match", 2 for "miss 1"), then bucket_selector and top_hits can be used.
{
  "size": 0,
  "aggs": {
    "sessionid": {
      "terms": {
        "field": "sessionid", --> first get all unique sessionids
        "size": 10
      },
      "aggs": {
        "documents": {
          "top_hits": {
            "size": 10
          }
        },
        "messageid": {
          "terms": {
            "field": "messageid", ---> get the unique messageids per session
            "size": 10
          },
          "aggs": {
            "matching_messageid": { ---> select buckets with key (message id) equal to 2
              "bucket_selector": {
                "buckets_path": {
                  "key": "_key"
                },
                "script": "params.key==2"
              }
            }
          }
        },
        "my_bucket": {
          "bucket_selector": {
            "buckets_path": {
              "hits": "messageid._bucket_count"
            },
            "script": "params.hits>0" --> if the bucket is not empty, keep that sessionid
          }
        }
      }
    }
  }
}
Result
"aggregations" : {
"sessionid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1234,
"doc_count" : 2,
"documents" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index31",
"_type" : "_doc",
"_id" : "MTAYpnABheSAx2q_eNEF",
"_score" : 1.0,
"_source" : {
"sessionid" : 1234,
"message" : "miss 1",
"messageid" : 1
}
},
{
"_index" : "index31",
"_type" : "_doc",
"_id" : "MjAYpnABheSAx2q_n9FW",
"_score" : 1.0,
"_source" : {
"sessionid" : 1234,
"message" : "match",
"messageid" : 2
}
}
]
}
},
"messageid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 1
}
]
}
}
]
}
}
If a given message has a distinguishing timestamp, a max (or min) metric can be used in the buckets_path instead to select buckets containing given messages.
The best approach to the above problem would be to use nested documents:
{
  "sessionid": 1234,
  "messages": [
    {
      "message": "match"
    },
    {
      "message": "miss 1"
    }
  ]
}
Then the problem can be resolved with a nested query. If Logstash is used, the above structure can be generated while indexing.
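A sketch of such a nested query (hypothetical index name; field names as in the structure above) -- any session with at least one matching nested message comes back whole, non-matching messages included:
GET your_index/_search
{
  "query": {
    "nested": {
      "path": "messages",
      "query": {
        "match": { "messages.message": "match" }
      }
    }
  }
}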

Count nested objects no more than once in each document in Elasticsearch

I have an index with documents of the following structure:
{
  "_id" : "1234567890abcdef",
  ...
  "entities" : [
    {
      "name" : "beer",
      "evidence_start" : 12,
      "evidence_end" : 16
    },
    {
      "name" : "water",
      "evidence_start" : 55,
      "evidence_end" : 60
    },
    {
      "name" : "beer",
      "evidence_start" : 123,
      "evidence_end" : 127
    },
    ...
  ]
}
entities is an object of type nested here. I need to count how many documents contain mentions of beer.
The issue is that the obvious bucket aggregation returns the number of mentions, not documents, so that if beer is mentioned twice in the same document, it adds 2 to the total result.
A query I use to do that is:
{
  ...
  "aggs": {
    "entities": {
      "nested": {
        "path": "entities"
      },
      "aggs": {
        "entity_count": {
          "terms": {
            "field": "entities.name",
            "size" : 20
          }
        }
      }
    }
  },
  ...
}
Is there a way of counting only distinct mentions without scripting?
Many thanks in advance.
You simply need to add a reverse_nested aggregation as a sub-aggregation, to count the number of "main documents" instead of nested documents. The doc_count of main_document_count then gives, for each entity name, the number of parent documents mentioning it at least once.
You should try
{
  ...
  "aggs": {
    "entities": {
      "nested": {
        "path": "entities"
      },
      "aggs": {
        "entity_count": {
          "terms": {
            "field": "entities.name",
            "size" : 20
          },
          "aggs": {
            "main_document_count": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  },
  ...
}

Sort aggregation buckets by shared field values

I would like to group documents based on a group field G. I use the "field aggregation" strategy described in the Elastic documentation (the "field collapse example" in the Elastic doc) to sort the buckets by the maximal score of the contained documents, like this:
{
  "query": {
    "match": {
      "body": "elections"
    }
  },
  "aggs": {
    "top_sites": {
      "terms": {
        "field": "domain",
        "order": {
          "top_hit": "desc"
        }
      },
      "aggs": {
        "top_tags_hits": {
          "top_hits": {}
        },
        "top_hit": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        }
      }
    }
  }
}
This query also includes the top hits in each bucket.
If the maximal score is not unique across buckets, I would like to specify a second order column. From the application context I know that inside a bucket all documents share the same value for a field F. Therefore, this field should be employed as the second order column.
How can I realize this in Elastic? Is there a way to make a field from the top hits subaggregation useable in the enclosing aggregation?
Any ideas? Many thanks!
It seems you can. On this page all the sorting strategies for the terms aggregation are listed.
And there is an example of multi-criteria bucket sorting:
Multiple criteria can be used to order the buckets by providing an
array of order criteria such as the following:
GET /_search
{
  "aggs" : {
    "countries" : {
      "terms" : {
        "field" : "artist.country",
        "order" : [ { "rock>playback_stats.avg" : "desc" }, { "_count" : "desc" } ]
      },
      "aggs" : {
        "rock" : {
          "filter" : { "term" : { "genre" : "rock" }},
          "aggs" : {
            "playback_stats" : { "stats" : { "field" : "play_count" }}
          }
        }
      }
    }
  }
}
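Applied to the question above, the shared field F can enter as a second order criterion through a max metric sub-aggregation: since all documents in a bucket share the same value of F, the maximum simply is that value. A sketch, assuming F is numeric (terms ordering on metric sub-aggregations does not work for keyword fields):
{
  "query": { "match": { "body": "elections" } },
  "aggs": {
    "top_sites": {
      "terms": {
        "field": "domain",
        "order": [
          { "top_hit": "desc" },
          { "f_value": "asc" }
        ]
      },
      "aggs": {
        "top_tags_hits": { "top_hits": {} },
        "top_hit": { "max": { "script": { "source": "_score" } } },
        "f_value": { "max": { "field": "F" } }
      }
    }
  }
}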

Elasticsearch, how to return unique values of two fields

I have an index with 20 different fields. I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
In SQL it would look this way: select unique cat, sub from table A;
I can do it for one field this way:
{
  "size": 0,
  "aggs" : {
    "unique_set" : {
      "terms" : { "field" : "cat" }
    }
  }
}
but how do I add another field to check uniqueness across two fields?
Thanks,
SQL's SELECT DISTINCT [cat], [sub] can be imitated with a Composite Aggregation.
{
  "size": 0,
  "aggs": {
    "cat_sub": {
      "composite": {
        "sources": [
          { "cat": { "terms": { "field": "cat" } } },
          { "sub": { "terms": { "field": "sub" } } }
        ]
      }
    }
  }
}
Returns...
"buckets" : [
{
"key" : {
"cat" : "a",
"sub" : "x"
},
"doc_count" : 1
},
{
"key" : {
"cat" : "a",
"sub" : "y"
},
"doc_count" : 2
},
{
"key" : {
"cat" : "b",
"sub" : "y"
},
"doc_count" : 3
}
]
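Note that composite aggregations paginate: when there are more combinations than the requested size, the response carries an after_key, which can be fed back through after to fetch the next page. A sketch continuing after the { "cat": "a", "sub": "y" } bucket shown above:
{
  "size": 0,
  "aggs": {
    "cat_sub": {
      "composite": {
        "size": 100,
        "sources": [
          { "cat": { "terms": { "field": "cat" } } },
          { "sub": { "terms": { "field": "sub" } } }
        ],
        "after": { "cat": "a", "sub": "y" }
      }
    }
  }
}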
The only way to solve this is probably nested aggregations:
{
  "size": 0,
  "aggs" : {
    "unique_set_1" : {
      "terms" : {
        "field" : "cats"
      },
      "aggregations" : {
        "unique_set_2": {
          "terms": { "field": "sub" }
        }
      }
    }
  }
}
Quote:
I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
This is nonsense; your question is unclear. You can have tens of unique pairs {cat, sub}, hundreds of unique triplets {cat, sub, field_3}, and thousands of unique documents Doc{cat, sub, field3, field4, ...}.
If you are interested in document counts per unique pair {"Category X", "Subcategory Y"}, then you can use cardinality aggregations. For two or more fields you will need to use scripting, which comes with a performance hit.
Example:
{
  "aggs" : {
    "multi_field_cardinality" : {
      "cardinality" : {
        "script": "doc['cats'].value + ' _my_custom_separator_ ' + doc['sub'].value"
      }
    }
  }
}
Alternate solution: use nested terms aggregations.
