Sorting percentiles aggregation with NaN values - elasticsearch

I'm using ElasticSearch 2.3.3 and I have the following aggregation:
"aggregations": {
"mainBreakdown": {
"terms": {
"field": "location_i",
"size": 10,
"order": [
{
"comments>medianTime.50": "asc"
}
]
},
"aggregations": {
"comments": {
"filter": {
"term": {
"type_i": 120
}
},
"aggregations": {
"medianTime": {
"percentiles": {
"field": "time_l",
"percents": [
50.0
]
}
}
}
}
}
}
}
For clarity, I've added a suffix to the field names that indicates the field mapping:
_i = integer
_l = long (timestamp)
The aggregation response is:
"aggregations": {
"mainBreakdown": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 100,
"doc_count": 2,
"comments": {
"doc_count": 1,
"medianTime": {
"values": {
"50.0": 20113
}
}
}
},
{
"key": 121,
"doc_count": 14,
"comments": {
"doc_count": 0,
"medianTime": {
"values": {
"50.0": "NaN"
}
}
}
}
]
}
}
My problem is that the medianTime aggregation sometimes has a value of NaN, because the parent comments aggregation matched 0 documents, and a bucket with NaN is always sorted last, in both "asc" and "desc" order.
I've tried adding "missing": 0 inside the percentiles aggregation, but it still returns NaN.
Can you please help me sort my buckets by medianTime so that with "asc" ordering the NaN values come first, and with "desc" they come last?

NaNs are not numbers, so those buckets will always sort last. (Note that "missing": 0 cannot help here: the missing parameter substitutes a value only for documents that lack the field, and in these buckets the comments filter matches no documents at all, so there is nothing to substitute.)
After a short discussion on the Elasticsearch GitHub, we decided this is the appropriate way to handle NaNs:
https://github.com/elastic/elasticsearch/issues/36402
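If the empty buckets are just noise, one workaround (a sketch, not part of the linked discussion: onlyWithComments is an arbitrary name, and the Painless script assumes Elasticsearch 5.x or later, while the 2.3.3 in the question would need a Groovy script instead) is to prune buckets with no matching comments via a bucket_selector pipeline aggregation:
"aggregations": {
  "mainBreakdown": {
    "terms": {
      "field": "location_i",
      "size": 10,
      "order": [
        { "comments>medianTime.50": "asc" }
      ]
    },
    "aggregations": {
      "comments": {
        "filter": { "term": { "type_i": 120 } },
        "aggregations": {
          "medianTime": {
            "percentiles": { "field": "time_l", "percents": [50.0] }
          }
        }
      },
      "onlyWithComments": {
        "bucket_selector": {
          "buckets_path": { "count": "comments._count" },
          "script": "params.count > 0"
        }
      }
    }
  }
}
With the NaN buckets dropped, the remaining buckets sort normally in both directions. Be aware that the pruning happens after the terms aggregation has already picked its top buckets, so you may get back fewer than size buckets.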

Related

Elasticsearch - Sort results of Terms aggregation by key string length

I am querying ES with a Terms aggregation to find the first N unique values of a string field foo where the field contains a substring bar, and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": {"_key": "asc"},
        "size": N
      }
    }
  }
}
This gives results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        },
        {
          "key": "z_bar_z",
          "doc_count": 1
        }
      ]
    }
  }
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, giving results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        }
      ]
    }
  }
}
This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, and then by the actual key, so that keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          {"key_length": "asc"},
          {"_key": "asc"}
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": {"script": "doc['foo.raw'].value.length()"}
        }
      }
    }
  }
}
This gave me results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "dd_bar_dd",
          "doc_count": 1
        },
        {
          "key": "bbb_bar_bbb",
          "doc_count": 1
        }
      ]
    }
  }
}
which is what I wanted.
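As a side note: on Elasticsearch 5.6 and later, an inline script is passed as an object with a source key. A sketch of the same sub-aggregation under that assumption (behavior unchanged):
"aggs": {
  "key_length": {
    "max": {
      "script": {
        "source": "doc['foo.raw'].value.length()"
      }
    }
  }
}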

Stats Aggregation with Min Mode in ElasticSearch

I have the below mapping in ElasticSearch
{
  "properties": {
    "Costs": {
      "type": "nested",
      "properties": {
        "price": {
          "type": "integer"
        }
      }
    }
  }
}
So every document has an array field Costs, which contains many elements, each with a price. I want to find the min and max price, with the condition that from each array only the element with the minimum price is considered. So it is basically the min/max over the minimum value of each array.
Let's say I have 2 documents with the Costs field as
Costs: [
  {
    "price": 100
  },
  {
    "price": 200
  }
]
and
Costs: [
  {
    "price": 300
  },
  {
    "price": 400
  }
]
So I need to find the stats.
This is the query I am currently using:
{
  "costs_stats": {
    "nested": {
      "path": "Costs"
    },
    "aggs": {
      "price_stats_new": {
        "stats": {
          "field": "Costs.price"
        }
      }
    }
  }
}
And it gives me this:
"min" : 100,
"max" : 400
But I need the stats computed after taking only the minimum element of each array into consideration.
So this is what I need:
"min" : 100,
"max" : 300
Just as sort has a "mode" option, is there something similar in the stats aggregation, or any other way of achieving this, maybe using a script? Please suggest; I am really stuck here.
Let me know if anything else is required.
Update 1:
Query for finding min/max among minimums
{
  "_source": false,
  "timeout": "5s",
  "from": 0,
  "size": 0,
  "aggs": {
    "price_1": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "price_2": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "filtered": {
              "aggs": {
                "price_3": {
                  "min": {
                    "field": "Costs.price"
                  }
                }
              },
              "filter": {
                "bool": {
                  "filter": {
                    "range": {
                      "Costs.price": {
                        "gte": 100
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "minValue": {
      "min_bucket": {
        "buckets_path": "price_1>price_2>filtered>price_3"
      }
    }
  }
}
Only a few buckets are returned, and hence the min/max is computed among those, which is not correct. Is there any size limit?
One way to achieve your use case is to add one more field, id, to each document. A terms aggregation can then be performed on the id field, so buckets will be built dynamically, one per unique value.
Then we can apply a min aggregation, which returns the minimum among the numeric values extracted from the aggregated documents.
Adding a working example with index data, mapping, search query, and search result.
Index Mapping:
{
  "mappings": {
    "properties": {
      "Costs": {
        "type": "nested"
      }
    }
  }
}
Index Data:
{
  "id": 1,
  "Costs": [
    {
      "price": 100
    },
    {
      "price": 200
    }
  ]
}
{
  "id": 2,
  "Costs": [
    {
      "price": 300
    },
    {
      "price": 400
    }
  ]
}
Search Query:
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15    <-- note this
      },
      "aggs": {
        "nested_entries": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "min_position": {
              "min": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    }
  }
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
}
It can also be achieved with the stats aggregation (again assuming the extra id field that uniquely identifies each document):
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15    <-- note this
      },
      "aggs": {
        "costs_stats": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "price_stats_new": {
              "stats": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    }
  }
}
Update 1:
To find the maximum value among those minimums (as produced by the above query), you can use a max_bucket pipeline aggregation:
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15    <-- note this
      },
      "aggs": {
        "nested_entries": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "min_position": {
              "min": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    },
    "maxValue": {
      "max_bucket": {
        "buckets_path": "id_terms>nested_entries>min_position"
      }
    }
  }
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
},
"maxValue": {
"value": 300.0,
"keys": [
"2"
]
}
}
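The question asked for both ends ("min": 100, "max": 300). A min_bucket pipeline aggregation can sit alongside max_bucket on the same buckets_path; a sketch (minValue and maxValue are arbitrary names):
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15
      },
      "aggs": {
        "nested_entries": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "min_position": {
              "min": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    },
    "minValue": {
      "min_bucket": {
        "buckets_path": "id_terms>nested_entries>min_position"
      }
    },
    "maxValue": {
      "max_bucket": {
        "buckets_path": "id_terms>nested_entries>min_position"
      }
    }
  }
}
On the sample data this should return "minValue": 100.0 and "maxValue": 300.0, i.e. the min/max over the per-document minimums.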

Filtering aggregation results

This is a subquestion of another question, posted separately for attention.
Sample Docs:
{
  "id": 1,
  "product": "p1",
  "cat_ids": [1, 2, 3]
}
{
  "id": 2,
  "product": "p2",
  "cat_ids": [3, 4, 5]
}
{
  "id": 3,
  "product": "p3",
  "cat_ids": [4, 5, 6]
}
Ask: to get products belonging to a particular category, e.g. cat_id = 3.
Query:
GET product/_search
{
  "size": 0,
  "aggs": {
    "cats": {
      "terms": {
        "field": "cats",
        "size": 10
      },
      "aggs": {
        "products": {
          "terms": {
            "field": "name.keyword",
            "size": 10
          }
        }
      }
    }
  }
}
Question:
How to filter the aggregated result for cat_id = 3 here? I tried bucket_selector as well, but it is not working.
Note: because cat_ids is multi-valued, filtering first and then aggregating doesn't work.
You can filter the values for which buckets will be created, using the terms aggregation's include parameter. From the documentation:
It is possible to filter the values for which buckets will be created. This can be done using the include and exclude parameters, which are based on regular expression strings or arrays of exact values. Additionally, include clauses can filter using partition expressions.
Adding a working example with index data, search query, and search result
Index Data:
{
  "id": 1,
  "product": "p1",
  "cat_ids": [1, 2, 3]
}
{
  "id": 2,
  "product": "p2",
  "cat_ids": [3, 4, 5]
}
{
  "id": 3,
  "product": "p3",
  "cat_ids": [4, 5, 6]
}
Search Query:
{
  "size": 0,
  "aggs": {
    "cats": {
      "terms": {
        "field": "cat_ids",
        "include": [    <-- note this
          3
        ]
      },
      "aggs": {
        "products": {
          "terms": {
            "field": "product.keyword",
            "size": 10
          }
        }
      }
    }
  }
}
Search Result:
"aggregations": {
"cats": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"products": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "p1",
"doc_count": 1
},
{
"key": "p2",
"doc_count": 1
}
]
}
}
]
}
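Since the quoted documentation also mentions partition expressions: when there are too many distinct values to list or page through at once, include can instead name one partition of the terms at a time. A sketch (the num_partitions value is arbitrary, and partition support assumes Elasticsearch 5.2 or later):
"terms": {
  "field": "cat_ids",
  "include": {
    "partition": 0,
    "num_partitions": 20
  },
  "size": 10
}
Each request then returns only the terms that hash into the requested partition; stepping partition from 0 to num_partitions - 1 covers all values.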

Elasticsearch return document ids while doing aggregate query

Is it possible to get an array of Elasticsearch document ids while grouping, i.e.
Current output:
"aggregations": {,
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text Document",
"doc_count": 3310
},
{
"key": "Unknown",
"doc_count": 15
},
{
"key": "Document",
"doc_count": 13
}
]
}
}
Desired output
"aggregations": {,
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text Document",
"doc_count": 3310,
"ids":["doc1","doc2", "doc3"....]
},
{
"key": "Unknown",
"doc_count": 15,
"ids":["doc11","doc12", "doc13"....]
},
{
"key": "Document",
"doc_count": 13
"ids":["doc21","doc22", "doc23"....]
}
]
}
}
Not sure if this is possible in Elasticsearch or not; below is my aggregation query:
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "docType",
        "size": 10
      }
    }
  }
}
Elasticsearch version:
6.3.2
You can use the top_hits aggregation, which returns the documents that fall under each bucket. Using source filtering, you can select which fields appear under hits.
Query:
"aggs": {
"district": {
"terms": {
"field": "docType",
"size": 10
},
"aggs": {
"docs": {
"top_hits": {
"size": 10,
"_source": ["ids"]
}
}
}
}
}
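Every hit returned by top_hits carries its _id in the hit metadata regardless of source filtering, so if the document ids are all you need, you can turn _source off entirely; a sketch against the docType field from the question:
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "docType",
        "size": 10
      },
      "aggs": {
        "docs": {
          "top_hits": {
            "size": 10,
            "_source": false
          }
        }
      }
    }
  }
}
The ids can then be read from aggregations.types.buckets[*].docs.hits.hits[*]._id in the response. Note that top_hits returns at most size hits per bucket, so very large buckets (like the 3310-document one above) will not list every id.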
For anyone interested, another solution is to build a custom bucket key using a script that concatenates delimited values from the doc, including the id. It may not be pretty, but you can parse it back out later, and if you just need something minimal like the doc id, it may be worth it.
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "script": "doc['docType'].value + '::' + doc['_id'].value",
        "size": 10
      }
    }
  }
}

How to aggregate and roll up values from child to parent in Elastic Search

I am a newbie to Elasticsearch and am trying to work out how to handle the scenario described here. I have a schema where a document may contain data such as:
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster1",
  "time_taken": 4500,
  "status": 0
},
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster1",
  "time_taken": 5000,
  "status": 0
},
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster1",
  "time_taken": 5000,
  "status": 1
},
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster2",
  "time_taken": 5000,
  "status": 0
}
where status = 0 means success and 1 means failure.
I want to show results that reflect the hierarchy, with success rates like:
US/East/Cluster1 = 66% (2 successes, 1 failure)
US/East/Cluster2 = 100% (1 success)
US/East = 75%
US = 75%
Alternatively, a way to get the average time taken for the success and failure cases across this same hierarchy would also be great.
I think a terms aggregation should get the job done for you.
To satisfy your first example (% success per cluster), try something like this:
{
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          }
        }
      }
    }
  }
}
This returns a result that looks something like this:
"aggregations": {
"byCluster": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cluster1",
"doc_count": 3,
"success_or_fail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 1,
"doc_count": 1
}
]
}
},
{
"key": "cluster2",
"doc_count": 1,
"success_or_fail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0,
"doc_count": 1
}
]
}
}
]
}
}
You can take the doc_count for the 0 bucket of the "success_or_fail" (arbitrary name) aggregation and divide it by the doc_count for the corresponding cluster. This will give you the % success for each cluster. (2/3 for "cluster1" and 1/1 for "cluster2").
The same type of aggregation could be used to group by "country" and "zone".
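If you would rather have Elasticsearch compute the percentage itself, a bucket_script pipeline aggregation can divide a filtered success count by each bucket's total (a sketch, not part of the original answer; successes and success_rate are arbitrary names, and the Painless syntax assumes Elasticsearch 5.x or later):
{
  "size": 0,
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "successes": {
          "filter": {
            "term": { "status": 0 }
          }
        },
        "success_rate": {
          "bucket_script": {
            "buckets_path": {
              "ok": "successes._count",
              "total": "_count"
            },
            "script": "100.0 * params.ok / params.total"
          }
        }
      }
    }
  }
}
Each cluster bucket then carries a success_rate value (about 66.7 for Cluster1 and 100.0 for Cluster2 on the sample data), with no client-side division needed.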
UPDATE
You can also nest an avg aggregation under the "success_or_fail" terms aggregation to get the average time taken you were looking for.
As in:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          },
          "aggs": {
            "avg_time_taken": {
              "avg": {
                "field": "time_taken"
              }
            }
          }
        }
      }
    }
  }
}
