Elasticsearch - Sort results of Terms aggregation by key string length - sorting

I am querying ES with a Terms aggregation to find the first N unique values of a string field foo where the field contains a substring bar, and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": {"_key": "asc"},
        "size": N
      }
    }
  }
}
This gives results like
{
...
"aggregations": {
"my_values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 145,
"buckets": [
{
"key": "aa_bar_aa",
"doc_count": 1
},
{
"key": "iii_bar_iii",
"doc_count": 1
},
{
"key": "z_bar_z",
"doc_count": 1
}
]
}
}
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, so that the results are like
{
...
"aggregations": {
"my_values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 145,
"buckets": [
{
"key": "z_bar_z",
"doc_count": 1
},
{
"key": "aa_bar_aa",
"doc_count": 1
},
{
"key": "iii_bar_iii",
"doc_count": 1
}
]
}
}
}
This ordering is desired because a shorter string is closer to the search substring, so it is considered a "better" match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.

I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, then by the actual key so keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          {"key_length": "asc"},
          {"_key": "asc"}
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": {"script": "doc['foo.raw'].value.length()"}
        }
      }
    }
  }
}
This gave me results like
{
...
"aggregations": {
"my_values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 145,
"buckets": [
{
"key": "z_bar_z",
"doc_count": 1
},
{
"key": "aa_bar_aa",
"doc_count": 1
},
{
"key": "dd_bar_dd",
"doc_count": 1
},
{
"key": "bbb_bar_bbb",
"doc_count": 1
}
]
}
}
}
which is what I wanted.
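For reference, the same ordering is easy to reproduce client-side. Here is a minimal Python sketch (the helper name is my own) that sorts keys by length first and alphabetically second, matching what the key_length sub-aggregation produces:

```python
# Client-side equivalent of the ES ordering above: sort bucket keys
# by string length first, then alphabetically for equal lengths.
def sort_buckets_by_key_length(keys):
    return sorted(keys, key=lambda k: (len(k), k))

keys = ["iii_bar_iii", "aa_bar_aa", "z_bar_z", "dd_bar_dd", "bbb_bar_bbb"]
print(sort_buckets_by_key_length(keys))
# → ['z_bar_z', 'aa_bar_aa', 'dd_bar_dd', 'bbb_bar_bbb', 'iii_bar_iii']
```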

Related

Elasticsearch return document ids while doing aggregate query

Is it possible to get an array of Elasticsearch document IDs while grouping, i.e.
Current output
"aggregations": {
  "types": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Text Document",
        "doc_count": 3310
      },
      {
        "key": "Unknown",
        "doc_count": 15
      },
      {
        "key": "Document",
        "doc_count": 13
      }
    ]
  }
}
Desired output
"aggregations": {
  "types": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Text Document",
        "doc_count": 3310,
        "ids": ["doc1", "doc2", "doc3"....]
      },
      {
        "key": "Unknown",
        "doc_count": 15,
        "ids": ["doc11", "doc12", "doc13"....]
      },
      {
        "key": "Document",
        "doc_count": 13,
        "ids": ["doc21", "doc22", "doc23"....]
      }
    ]
  }
}
I'm not sure whether this is possible in Elasticsearch or not;
below is my aggregation query:
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "docType",
        "size": 10
      }
    }
  }
}
Elasticsearch version:
6.3.2
You can use a top_hits sub-aggregation, which returns the documents belonging to each bucket. Using source filtering, you can select which fields are returned under hits.
Query:
"aggs": {
  "district": {
    "terms": {
      "field": "docType",
      "size": 10
    },
    "aggs": {
      "docs": {
        "top_hits": {
          "size": 10,
          "_source": ["ids"]
        }
      }
    }
  }
}
For anyone interested, another solution is to create a custom key using a script that builds a string of delimited values from the doc, including the id. It may not be pretty, but you can parse it back out later, and if you just need something minimal like the doc id, it may be worth it.
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "script": "doc['docType'].value+'::'+doc['_id'].value",
        "size": 10
      }
    }
  }
}
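If you go the composite-key route, parsing the keys back out on the client is straightforward. Here is a minimal Python sketch (the helper name is my own), assuming the "::" delimiter used above:

```python
# Split the composite "docType::docId" bucket keys produced by the
# script back into their two parts on the client side.
def parse_composite_key(key, sep="::"):
    doc_type, doc_id = key.split(sep, 1)
    return doc_type, doc_id

buckets = [
    {"key": "Text Document::doc1", "doc_count": 1},
    {"key": "Unknown::doc11", "doc_count": 1},
]
print([parse_composite_key(b["key"]) for b in buckets])
# → [('Text Document', 'doc1'), ('Unknown', 'doc11')]
```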

Sorting percentiles aggregation with NaN values

I'm using ElasticSearch 2.3.3 and I have the following aggregation:
"aggregations": {
  "mainBreakdown": {
    "terms": {
      "field": "location_i",
      "size": 10,
      "order": [
        {"comments>medianTime.50": "asc"}
      ]
    },
    "aggregations": {
      "comments": {
        "filter": {
          "term": {
            "type_i": 120
          }
        },
        "aggregations": {
          "medianTime": {
            "percentiles": {
              "field": "time_l",
              "percents": [50.0]
            }
          }
        }
      }
    }
  }
}
For better understanding, I've added a postfix to the field names which indicates the field mapping:
_i = integer
_l = long (timestamp)
And aggregation response is:
"aggregations": {
"mainBreakdown": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 100,
"doc_count": 2,
"comments": {
"doc_count": 1,
"medianTime": {
"values": {
"50.0": 20113
}
}
}
},
{
"key": 121,
"doc_count": 14,
"comments": {
"doc_count": 0,
"medianTime": {
"values": {
"50.0": "NaN"
}
}
}
}
]
}
}
My problem is that the medianTime aggregation sometimes has a value of NaN, because the parent comments aggregation matched 0 documents, and a bucket with NaN is always sorted last in both "asc" and "desc" order.
I've tried adding "missing": 0 inside the percentiles aggregation, but it still returns NaN.
Can you please help me sort my buckets by medianTime so that with "asc" ordering the NaN values come first, and with "desc" they come last?
NaNs are not numbers, so they will always sort last.
After a short discussion on the Elasticsearch GitHub, we decided this is the appropriate way to handle NaNs.
https://github.com/elastic/elasticsearch/issues/36402
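If you need NaN-first ascending order anyway, one workaround is to re-sort the buckets client-side. A minimal Python sketch (the helper and its name are my own, assuming the response shape shown above):

```python
import math

# Re-sort terms buckets by comments>medianTime.50, placing NaN buckets
# first for ascending order and last for descending order
# (Elasticsearch itself always sorts NaN last in both directions).
def sort_by_median(buckets, ascending=True):
    def median(bucket):
        # float("NaN") parses the literal "NaN" from the JSON response.
        return float(bucket["comments"]["medianTime"]["values"]["50.0"])

    def sort_key(bucket):
        m = median(bucket)
        if math.isnan(m):
            return (0, 0.0) if ascending else (1, 0.0)
        return (1, m) if ascending else (0, -m)

    return sorted(buckets, key=sort_key)
```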

Is it possible to returns other fields when you aggregate results on Elasticsearch?

Here is the mapping of my index PublicationsLikes:
id : String
account : String
api : String
date : Date
I'm currently making an aggregation on ES where I group the results counts by the id (of the publication).
{
"key": "<publicationId-1>",
"doc_count": 25
},
{
"key": "<publicationId-2>",
"doc_count": 387
},
{
"key": "<publicationId-3>",
"doc_count": 7831
}
The returned "key" (the id) is useful information, but I also need to select other fields of the publication, like account and api. A bit like this:
{
"key": "<publicationId-1>",
"api": "Facebook",
"accountId": "65465z4fe6ezf456ezdf",
"doc_count": 25
},
{
"key": "<publicationId-2>",
"api": "Twitter",
"accountId": "afaez5f4eaz",
"doc_count": 387
}
How can I manage this?
Thanks.
This requirement is best achieved with a top_hits sub-aggregation, which lets you sort the documents in each bucket, keep only the first one, and control which fields are returned:
{
  "size": 0,
  "aggs": {
    "publications": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": ["api", "accountId"]
          }
        }
      }
    }
  }
}
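To consume this response, you can flatten each bucket together with the fields from its top hit. A minimal Python sketch (the helper name is my own; the response shape is what top_hits returns):

```python
# Flatten the terms + top_hits response into rows that carry the
# extra _source fields alongside each bucket key.
def flatten_buckets(response):
    rows = []
    for bucket in response["aggregations"]["publications"]["buckets"]:
        source = bucket["sample"]["hits"]["hits"][0]["_source"]
        rows.append({"key": bucket["key"],
                     "doc_count": bucket["doc_count"],
                     **source})
    return rows
```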
You can use a sub-aggregation for this.
GET /PublicationsLikes/_search
{
  "aggs": {
    "ids": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "accounts": {
          "terms": {
            "field": "account",
            "size": 1
          }
        }
      }
    }
  }
}
Your result will not be exactly what you want, but it will be similar:
{
"key": "<publicationId-1>",
"doc_count": 25,
"accounts": {
"buckets": [
{
"key": "<account-1>",
"doc_count": 25
}
]
}
},
{
"key": "<publicationId-2>",
"doc_count": 387,
"accounts": {
"buckets": [
{
"key": "<account-2>",
"doc_count": 387
}
]
}
},
{
"key": "<publicationId-3>",
"doc_count": 7831,
"accounts": {
"buckets": [
{
"key": "<account-3>",
"doc_count": 7831
}
]
}
}
You can also check the link to find more information
Thanks both for your quick replies. I think the first solution is the most "beautiful" (in terms of the request, but also of retrieving the results), but both seem to be sub-aggregation queries.
{
  "size": 0,
  "aggs": {
    "publications": {
      "terms": {
        "size": 0,
        "field": "publicationId"
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": ["accountId", "api"]
          }
        }
      }
    }
  }
}
I think I must be careful with the size=0 parameter; because I work with the Java API, I decided to use Integer.MAX_VALUE instead of 0.
Thanks a lot guys.

How to aggregate and roll up values from child to parent in Elastic Search

I am a newbie to Elasticsearch and I am trying to find out how to handle the scenario described here. I have a schema where a document may contain data such as
{
"country":"US",
"zone": "East",
"cluster": "Cluster1",
"time_taken": 4500,
"status": 0
},
{
"country":"US",
"zone": "East",
"cluster": "Cluster1",
"time_taken": 5000,
"status": 0
},
{
"country":"US",
"zone": "East",
"cluster": "Cluster1",
"time_taken": 5000,
"status": 1
},
{
"country":"US",
"zone": "East",
"cluster": "Cluster2",
"time_taken": 5000,
"status": 0
}
Where status = 0 for success, 1 for failure
I would want to show a result in a way that it can reflect a hierarchy with values from "success" like
US/East/Cluster1 = 66% (which is basically 2 success and 1 failure)
US/East/Cluster2 = 100% (which is basically 1 success)
US/East = 75%
US = 75%
Alternatively, a way to get the average time_taken for the success and failure scenarios across this hierarchy, as denoted above, would also be great.
I think a terms aggregation should get the job done for you.
In order to satisfy your first query examples (% success per cluster), try something like this:
{
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          }
        }
      }
    }
  }
}
This returns a result that looks something like this:
"aggregations": {
"byCluster": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cluster1",
"doc_count": 3,
"success_or_fail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 1,
"doc_count": 1
}
]
}
},
{
"key": "cluster2",
"doc_count": 1,
"success_or_fail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0,
"doc_count": 1
}
]
}
}
]
}
}
You can take the doc_count for the 0 bucket of the "success_or_fail" (arbitrary name) aggregation and divide it by the doc_count for the corresponding cluster. This will give you the % success for each cluster. (2/3 for "cluster1" and 1/1 for "cluster2").
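That division can be done client-side once the response arrives. A minimal Python sketch (the helper name is my own) against the sample response above:

```python
# Compute % success per cluster from the terms/terms aggregation
# response, where a status bucket with key 0 means success.
def success_rates(response):
    rates = {}
    for cluster in response["aggregations"]["byCluster"]["buckets"]:
        successes = sum(b["doc_count"]
                        for b in cluster["success_or_fail"]["buckets"]
                        if b["key"] == 0)
        rates[cluster["key"]] = successes / cluster["doc_count"]
    return rates
```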
The same type of aggregation could be used to group by "country" and "zone".
UPDATE
You can also nest an avg aggregation next to the "success_or_fail" terms aggregation, in order to get the average time_taken you were looking for.
As in:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          },
          "aggs": {
            "avg_time_taken": {
              "avg": {
                "field": "time_taken"
              }
            }
          }
        }
      }
    }
  }
}

How to get an Elasticsearch aggregation with multiple fields

I'm attempting to find related tags to the one currently being viewed. Every document in our index is tagged. Each tag is formed of two parts - an ID and text name:
{
...
meta: {
...
tags: [
{
id: 123,
name: 'Biscuits'
},
{
id: 456,
name: 'Cakes'
},
{
id: 789,
name: 'Breads'
}
]
}
}
To fetch the related tags I am simply querying the documents and getting an aggregate of their tags:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "item.meta.tags.id": "123"
          }
        },
        {
          ...
        }
      ]
    }
  },
  "aggs": {
    "baked_goods": {
      "terms": {
        "field": "item.meta.tags.id",
        "min_doc_count": 2
      }
    }
  }
}
This works perfectly; I am getting the results I want. However, I require both the tag ID and the name to do anything useful. I have explored how to accomplish this, and the solutions seem to be:
Combine the fields when indexing
A script to munge together the fields
A nested aggregation
Options one and two are not available to me, so I have been going with option three, but it's not responding in the expected manner. Given the following query (still searching for documents also tagged with 'Biscuits'):
{
  ...
  "aggs": {
    "baked_goods": {
      "terms": {
        "field": "item.meta.tags.id",
        "min_doc_count": 2
      },
      "aggs": {
        "name": {
          "terms": {
            "field": "item.meta.tags.name"
          }
        }
      }
    }
  }
}
I will get this result:
{
...
"aggregations": {
"baked_goods": {
"buckets": [
{
"key": "456",
"doc_count": 11,
"name": {
"buckets": [
{
"key": "Biscuits",
"doc_count": 11
},
{
"key": "Cakes",
"doc_count": 11
}
]
}
}
]
}
}
}
The nested aggregation includes both the search term and the tag I'm after (returned in alphabetical order).
I have tried to mitigate this by adding an exclude to the nested aggregation, but this slowed the query down far too much (around 100× slower for 500,000 docs). So far the fastest solution is to de-dupe the result manually.
What is the best way to get an aggregation of tags with both the tag ID and tag name in the response?
Thanks for making it this far!
By the looks of it, your tags field is not nested.
For this aggregation to work, you need it to be nested so that there is an association between an id and a name. Without nested, the list of ids is just one array and the list of names is another array:
"item": {
  "properties": {
    "meta": {
      "properties": {
        "tags": {
          "type": "nested",           <-- nested field
          "include_in_parent": true,  <-- to, also, keep the flat array-like structure
          "properties": {
            "id": {
              "type": "integer"
            },
            "name": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
Also, note that I've added the line "include_in_parent": true to the mapping, which means that your nested tags will also behave like a "flat" array-like structure.
So, everything you had in your queries so far will still work without any changes.
But, for this particular query of yours, the aggregation needs to change to something like this:
{
  "aggs": {
    "baked_goods": {
      "nested": {
        "path": "item.meta.tags"
      },
      "aggs": {
        "name": {
          "terms": {
            "field": "item.meta.tags.id"
          },
          "aggs": {
            "name": {
              "terms": {
                "field": "item.meta.tags.name"
              }
            }
          }
        }
      }
    }
  }
}
And the result is like this:
"aggregations": {
"baked_goods": {
"doc_count": 9,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 123,
"doc_count": 3,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "biscuits",
"doc_count": 3
}
]
}
},
{
"key": 456,
"doc_count": 2,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cakes",
"doc_count": 2
}
]
}
},
.....
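Consuming this response then just means pairing each id bucket with the name from its sub-aggregation. A minimal Python sketch (the helper name is my own, following the response shape above, where both the outer and inner aggregations are named "name"):

```python
# Pair each tag id with its name from the nested id -> name
# terms aggregation response.
def tag_pairs(response):
    pairs = []
    for id_bucket in response["aggregations"]["baked_goods"]["name"]["buckets"]:
        name_buckets = id_bucket["name"]["buckets"]
        name = name_buckets[0]["key"] if name_buckets else None
        pairs.append((id_bucket["key"], name, id_bucket["doc_count"]))
    return pairs
```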
