I need to calculate a weighted average value using Elasticsearch, and I can't change the structure of the documents. Assume there are 2 indexed documents. The first document:
const doc1 = {
"id": "1",
"userId: "2",
"scores" : [
{
"name": "score1",
"value": 93.0
},
{
"name": "score2",
"value": 90.0
},
{
"name": "score3",
"value": 76.0
}
],
"metadata": {
"weight": 130
}
}
The second document:
const doc2 = {
"id": "2",
"userId: "2",
"scores" : [
{
"name": "score1",
"value": 80.0
},
{
"name": "score2",
"value": 70.0
},
{
"name": "score3",
"value": 88.0
}
],
"metadata": {
"weight": 50
}
}
Calculations should be done using the following formulas:
score1Avg = (doc1.scores['score1'].value * doc1.metadata.weight +
doc2.scores['score1'].value * doc2.metadata.weight) / (doc1.metadata.weight + doc2.metadata.weight)
score2Avg = (doc1.scores['score2'].value * doc1.metadata.weight +
doc2.scores['score2'].value * doc2.metadata.weight) / (doc1.metadata.weight + doc2.metadata.weight)
score3Avg = (doc1.scores['score3'].value * doc1.metadata.weight +
doc2.scores['score3'].value * doc2.metadata.weight) / (doc1.metadata.weight + doc2.metadata.weight)
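For example, plugging the two documents above into the first formula: score1Avg = (93.0 * 130 + 80.0 * 50) / (130 + 50) = 16090 / 180 ≈ 89.39.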
I tried something with a nested type for mapping scores, but I can't access the parent document field metadata.weight from the nested context. How should this be approached: should I use a nested type mapping, or can this be done some other way without it?
Edit: I ended up storing each scores element as a separate document. Instead of doc1, I now have the following documents:
{
"id": "1",
"userId: "2",
"score": {
"name": "score1",
"value": 93.0
},
"metadata": {
"weight": 130
}
}
{
"id": "1",
"userId: "2",
"score": {
"name": "score2",
"value": 90.0
},
"metadata": {
"weight": 130
}
}
{
"id": "1",
"userId: "2",
"score": {
"name": "score3",
"value": 76.0
},
"metadata": {
"weight": 130
}
}
And the query is:
GET /scores/_search
{
"size": 0,
"aggs": {
"group_by_score_and_user": {
"composite": {
"sources": [
{
"scoreName": {
"terms": {
"field": "score.name.keyword"
}
}
},
{
"userId": {
"terms": {
"field": "userId.keyword"
}
}
}
]
},
"aggs": {
"avg": {
"weighted_avg": {
"value":{ "field": "score.value" },
"weight":{ "field": "metadata.weight" }
}
}
}
}
}
}
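Worth noting: a composite aggregation returns pages of buckets (10 by default), so if there are more scoreName/userId combinations than that, the next page is fetched by passing the previous response's after_key back via after. A minimal sketch of a follow-up request (the after values shown here are hypothetical):
GET /scores/_search
{
  "size": 0,
  "aggs": {
    "group_by_score_and_user": {
      "composite": {
        "after": { "scoreName": "score3", "userId": "2" },
        "sources": [
          { "scoreName": { "terms": { "field": "score.name.keyword" } } },
          { "userId": { "terms": { "field": "userId.keyword" } } }
        ]
      },
      "aggs": {
        "avg": {
          "weighted_avg": {
            "value": { "field": "score.value" },
            "weight": { "field": "metadata.weight" }
          }
        }
      }
    }
  }
}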
By the way, the query with the script approach takes 120 ms on average against 5k documents, compared to this one, which takes about 35-40 ms over 100k documents.
Edited to fit the requirement in the comment. Like I said before, this is not an optimal solution at all; the use of scripts + params._source + my subpar Java will make this very slow or unusable with a lot of docs.
Still, I learned a lot.
Mapping:
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"userId": {
"type": "keyword"
},
"scores": {
"properties": {
"name": {
"type": "keyword"
},
"value": {
"type": "float"
}
}
},
"metadata": {
"properties": {
"weight": {
"type": "float"
}
}
}
}
}
}
Docs:
POST ron_test/_doc/1
{
"id": "1",
"userId": "2",
"scores" : [
{
"name": "score1",
"value": 93.0
},
{
"name": "score2",
"value": 90.0
},
{
"name": "score3",
"value": 76.0
}
],
"metadata": {
"weight": 130
}
}
POST ron_test/_doc/2
{
"id": "2",
"userId": "2",
"scores" : [
{
"name": "score1",
"value": 80.0
},
{
"name": "score2",
"value": 70.0
},
{
"name": "score3",
"value": 88.0
}
],
"metadata": {
"weight": 50
}
}
POST ron_test/_doc/3
{
"id": "2",
"userId": "2",
"scores" : [
{
"name": "score1",
"value": 80.0
},
{
"name": "score2",
"value": 70.0
},
{
"name": "score9",
"value": 88.0
}
],
"metadata": {
"weight": 12
}
}
POST ron_test/_doc/4
{
"id": "2",
"userId": "2",
"scores" : [
{
"name": "score9",
"value": 50.0
}
],
"metadata": {
"weight": 17
}
}
Query
GET ron_test/_search
{
"size": 0,
"aggs": {
"weigthed_avg": {
"scripted_metric": {
"init_script": """
state.name_to_sum = new HashMap();
state.name_to_weight = new HashMap();
""",
"map_script": """
for (score in params._source['scores']){
def name = score['name'];
def value = score['value'];
def weight = doc['metadata.weight'].value;
if (state.name_to_sum.containsKey(name)){
state.name_to_sum[name] += value * weight;
}
else {
state.name_to_sum[name] = value * weight;
}
if (state.name_to_weight.containsKey(name)){
state.name_to_weight[name] += weight;
}
else {
state.name_to_weight[name] = weight;
}
}
""",
"combine_script": "return [state.name_to_sum, state.name_to_weight]",
"reduce_script": """
def total_score_per_name = new HashMap();
def total_weight_per_name = new HashMap();
for (state in states){
total_score_per_name = Stream.concat(total_score_per_name.entrySet().stream(), state[0].entrySet().stream())
.collect(Collectors.groupingBy(Map.Entry::getKey,
Collectors.summingDouble(Map.Entry::getValue)));
total_weight_per_name = Stream.concat(total_weight_per_name.entrySet().stream(), state[1].entrySet().stream())
.collect(Collectors.groupingBy(Map.Entry::getKey,
Collectors.summingDouble(Map.Entry::getValue)));
}
def results = new HashMap();
total_score_per_name.forEach((name, score) -> results[name] = score / total_weight_per_name[name]);
return results;
"""
}
}
}
}
Results
{
"took" : 258,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"weigthed_avg" : {
"value" : {
"score9" : 65.72413793103448,
"score2" : 83.54166666666667,
"score3" : 79.33333333333333,
"score1" : 88.80208333333333
}
}
}
}
More info on scripted metrics:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html
By the way, the way I would choose to simplify this is to copy the metadata.weight value into every nested score element.
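A minimal sketch of that simplification, assuming scores is mapped as nested with name as keyword and a weight field copied into each element (these mapping details are assumptions, not part of the original setup):
POST ron_test/_doc/1
{
  "id": "1",
  "userId": "2",
  "scores": [
    { "name": "score1", "value": 93.0, "weight": 130 },
    { "name": "score2", "value": 90.0, "weight": 130 },
    { "name": "score3", "value": 76.0, "weight": 130 }
  ]
}
GET ron_test/_search
{
  "size": 0,
  "aggs": {
    "scores": {
      "nested": { "path": "scores" },
      "aggs": {
        "by_name": {
          "terms": { "field": "scores.name" },
          "aggs": {
            "avg": {
              "weighted_avg": {
                "value": { "field": "scores.value" },
                "weight": { "field": "scores.weight" }
              }
            }
          }
        }
      }
    }
  }
}
This keeps the original one-document-per-id layout while still letting the weighted_avg aggregation see the weight next to each value.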
Related
I have the following data in the Elasticsearch index some_index:
[ {
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 1,
"cart_status": "new",
"grandTotal": 12,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 1,
"cart_status": "paid",
"grandTotal": 12,
"event": "some_event",
"timestamp": "2022-12-02T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 2,
"cart_status": "new",
"grandTotal": 23,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 2,
"cart_status": "paid",
"grandTotal": 23,
"event": "some_event",
"timestamp": "2022-12-04T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 3,
"cart_status": "new",
"grandTotal": 17,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 3,
"cart_status": "new",
"grandTotal": 17,
"event": "some_event",
"timestamp": "2022-12-04T00:00:00.000Z"
}
}
}
]
What I want to get is the sum of the grandTotals by the latest cart_status of each cart within a given time range.
Given the example above, the result for timestamp >= 2022-12-01 00:00:00 and timestamp <= 2022-12-03 00:00:00 should be something like:
cart_status: new, sum grandTotal: 40, because within that time range the latest status new belongs to cart_id 3 and 2,
and cart_status: paid, sum grandTotal: 12, because paid is the latest status of only cart_id 1.
What I tried is to use a sub-aggregation on top_result, a top_hits aggregation, but Elasticsearch complains that "Aggregator [top_result] of type [top_hits] cannot accept sub-aggregations".
Besides, I tried collapse as well to get the latest by status, but according to the docs there is also no way to aggregate over the results of a collapse.
Can someone please help me solve this? It seems like a common calculation, but it is not very trivial in Elasticsearch.
In SQL this is quite easy with window functions.
I want to avoid persisting intermediate data into another index, because I need the query to be dynamic: users may want to get their calculations for any time range.
You can try the following way. Note that for cart_status new, the sum will be 52, because for the given time range it includes cart_id 1 (which has a new status within the range) along with cart_id 2 and 3.
Mappings:
PUT some_index
{
"mappings" : {
"properties": {
"timestamp" : {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||strict_date_optional_time ||epoch_millis"
},
"cart_id" : {
"type": "keyword"
},
"cart_status" : {
"type": "keyword"
},
"grand_total" : {
"type": "long"
},
"event":{
"type": "keyword"
}
}
}
}
Bulk Insert:
POST _bulk
{ "index" : { "_index" : "some_index", "_id" : "1" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "2" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "paid","timestamp":"2022-12-02T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "3" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "4" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "paid","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "5" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "6" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
Query:
GET some_index/_search
{
"size":0,
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "2022-12-01 00:00:00",
"lte": "2022-12-03 00:00:00"
}
}
}
]
}
},
"aggs": {
"card_status": {
"terms": {
"field": "cart_status"
},
"aggs": {
"grandTotal": {
"sum": {
"field": "grand_total"
}
}
}
}
}
}
Output:
{
"took": 86,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"card_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new",
"doc_count": 3,
"grandTotal": {
"value": 52
}
},
{
"key": "paid",
"doc_count": 1,
"grandTotal": {
"value": 12
}
}
]
}
}
}
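Note that this sums every document in the range, not only the latest status per cart. To get the latest document per cart, a top_hits sub-aggregation under a terms aggregation on cart_id can be used; since top_hits cannot accept sub-aggregations, the grandTotals then have to be summed per status on the client side. A sketch against the same mapping (the terms size of 1000 is an arbitrary cap):
GET some_index/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "2022-12-01 00:00:00",
        "lte": "2022-12-03 00:00:00"
      }
    }
  },
  "aggs": {
    "per_cart": {
      "terms": { "field": "cart_id", "size": 1000 },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "_source": [ "cart_status", "grand_total" ]
          }
        }
      }
    }
  }
}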
Using Elasticsearch 7.9.0
My document looks like this
{
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
}
I need one more field, total_marks, in the response of the GET API.
Something like this:
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"total_marks": 270
}
]
}
I tried using script_fields
My query is
GET sample/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"total_marks": {
"script": {
"source": """double sum = 0.0;
for( item in params._source.student.marks)
{ sum = sum + item.sub }
return sum;"""
}
}
}
}
I got the following response:
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"fields": {
"total_marks": [
270
]
}
}
]
}
Is there any way to get the expected response?
Any better or more optimal solution would help a lot.
Thank you.
A terms aggregation and a sum aggregation can be used to find total marks per group:
{
"aggs": {
"students": {
"terms": {
"field": "student.id.keyword",
"size": 10
},
"aggs": {
"total_marks": {
"sum": {
"field": "student.marks.sub"
}
}
}
}
}
}
Result
"aggregations" : {
"students" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"total_marks" : {
"value" : 270.0
}
}
]
}
}
This will be faster than a script, but pagination is easier with a query than with an aggregation, so choose accordingly.
The best option may be to calculate it at index time, if those fields do not change frequently.
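A minimal sketch of the index-time approach, using an ingest pipeline with a script processor (the pipeline name is hypothetical):
PUT _ingest/pipeline/total_marks
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // sum all mark values and store the result on the document
          double sum = 0;
          for (def m : ctx.student.marks) {
            sum += m.sub;
          }
          ctx.total_marks = sum;
        """
      }
    }
  ]
}
POST sample/_doc?pipeline=total_marks
{
  "student": {
    "marks": [
      { "sub": 80 },
      { "sub": 90 },
      { "sub": 100 }
    ]
  }
}
With that in place, total_marks lives in _source and comes back with every GET or search, with no script needed at query time.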
index_name: my_data-2020-12-01
ticket_number: T123
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:22:12
index_name: my_data-2020-12-01
ticket_number: T124
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:32:11
index_name: my_data-2020-12-02
ticket_number: T123
ticket_status: INPROGRESS
ticket_updated_time: 2020-12-02 12:33:12
index_name: my_data-2020-12-02
ticket_number: T125
ticket_status: OPEN
ticket_updated_time: 2020-12-02 14:11:45
I want to create a saved search that groups by the ticket_number field and returns a unique document per ticket with the latest ticket status (ticket_status). Is that possible?
You can simply query again; I am assuming you are using Kibana for visualization purposes. In your query, you need to filter based on the ticket_number and sort based on ticket_updated_time.
Working example
Index mapping
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date"
},
"ticket_number" :{
"type" : "text"
},
"ticket_status" : {
"type" : "text"
}
}
}
}
Index sample docs
{
"ticket_number": "T123",
"ticket_status": "OPEN",
"ticket_updated_time": "2020-12-01T12:22:12"
}
{
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
}
Now, as you can see, both sample documents belong to the same ticket_number with different statuses and updated times.
Search query
{
"size" : 1, // fetch only the latest status document, if you remove this, will get other ticket with different status.
"query": {
"bool": {
"filter": [
{
"match": {
"ticket_number": "T123"
}
}
]
}
},
"sort": [
{
"ticket_updated_time": {
"order": "desc"
}
}
]
}
And search result
"hits": [
{
"_index": "65180491",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
},
"sort": [
1606912392000
]
}
]
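Alternatively, field collapsing can return the latest document per ticket_number in a single query. A sketch, assuming ticket_number gains a keyword sub-field (the text mapping above would need that change, since collapsing requires doc values):
GET 65180491/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "ticket_number.keyword"
  },
  "sort": [
    {
      "ticket_updated_time": {
        "order": "desc"
      }
    }
  ]
}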
If you need to group by the ticket_number field, you can use an aggregation as well.
Index Mapping:
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"unique_id": {
"terms": {
"field": "ticket_number.keyword",
"order": {
"latestOrder": "desc"
}
},
"aggs": {
"latestOrder": {
"max": {
"field": "ticket_updated_time"
}
}
}
}
}
}
Search Result:
"buckets": [
{
"key": "T125",
"doc_count": 1,
"latestOrder": {
"value": 1.606918305E12,
"value_as_string": "2020-12-02 14:11:45"
}
},
{
"key": "T123",
"doc_count": 2,
"latestOrder": {
"value": 1.606912392E12,
"value_as_string": "2020-12-02 12:33:12"
}
},
{
"key": "T124",
"doc_count": 1,
"latestOrder": {
"value": 1.606825931E12,
"value_as_string": "2020-12-01 12:32:11"
}
}
]
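Note that this only returns the latest timestamp per ticket, not the latest status itself. To also pull the status back for each bucket, a top_hits sub-aggregation can be added; a sketch:
{
  "size": 0,
  "aggs": {
    "unique_id": {
      "terms": {
        "field": "ticket_number.keyword",
        "order": { "latestOrder": "desc" }
      },
      "aggs": {
        "latestOrder": {
          "max": { "field": "ticket_updated_time" }
        },
        "latest_doc": {
          "top_hits": {
            "size": 1,
            "sort": [ { "ticket_updated_time": { "order": "desc" } } ],
            "_source": [ "ticket_status" ]
          }
        }
      }
    }
  }
}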
I'm looking to do a max aggregation on a value of a property under my document; the property is a list of complex objects (key and value). Here's my data:
[{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
},
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}]
When I do the nested max aggregation on "listItems.value", I expect the max value returned to be 200 (and not 5000). The reason is that I want the logic to first find the MIN value under listItems for each document, and then do the max aggregation on those. Is it possible to do something like this?
Thanks.
The search query performs the following aggregations:
Terms aggregation on the id field
Min aggregation on listItems.value
Max bucket aggregation, a sibling pipeline aggregation that identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s).
Please refer to the nested aggregation documentation for a detailed explanation of it.
Adding a working example with index data, index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"listItems": {
"type": "nested"
},
"id":{
"type":"text",
"fielddata":"true"
}
}
}
}
Index Data:
{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
}
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id"
},
"aggs": {
"nested_entries": {
"nested": {
"path": "listItems"
},
"aggs": {
"min_position": {
"min": {
"field": "listItems.value"
}
}
}
}
}
},
"maxValue": {
"max_bucket": {
"buckets_path": "id_terms>nested_entries>min_position"
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": "2",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 200.0
}
}
}
]
},
"maxValue": {
"value": 200.0,
"keys": [
"2"
]
}
}
The initial post mentioned nested aggregation, so I was sure the question was about nested documents. Since I came to this solution before seeing the other answer, I'm keeping the whole thing for history, but it actually differs only in adding the nested aggregation.
The whole process can be explained like this:
Bucket each document into its own bucket.
Use a nested aggregation to be able to aggregate on nested documents.
Use a min aggregation to find the minimum value among all of a document's nested documents, and thereby for the document itself.
Finally, use another aggregation to calculate the maximum value among the results of the previous aggregation.
Given this setup:
// PUT /index
{
"mappings": {
"properties": {
"children": {
"type": "nested",
"properties": {
"value": {
"type": "integer"
}
}
}
}
}
}
// POST /index/_doc
{
"children": [
{ "value": 12 },
{ "value": 45 }
]
}
// POST /index/_doc
{
"children": [
{ "value": 7 },
{ "value": 35 }
]
}
I can use these aggregations in a request to get the required value:
{
"size": 0,
"aggs": {
"document": {
"terms": {"field": "_id"},
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"minimum": {
"min": {
"field": "children.value"
}
}
}
}
}
},
"result": {
"max_bucket": {
"buckets_path": "document>children>minimum"
}
}
}
}
The response:
{
"aggregations": {
"document": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "O4QxyHQBK5VO9CW5xJGl",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 7.0
}
}
},
{
"key": "OoQxyHQBK5VO9CW5kpEc",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 12.0
}
}
}
]
},
"result": {
"value": 12.0,
"keys": [
"OoQxyHQBK5VO9CW5kpEc"
]
}
}
}
There should also be a workaround using a script to calculate the max: all you need to do is find and return the smallest value per document in such a script.
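A sketch of that scripted workaround using a scripted_metric aggregation (the index name is a placeholder; it reads nested values via params._source, with all the performance caveats that implies):
GET index/_search
{
  "size": 0,
  "aggs": {
    "max_of_mins": {
      "scripted_metric": {
        "init_script": "state.mins = []",
        "map_script": """
          // find the smallest listItems.value in this document
          double min = Double.POSITIVE_INFINITY;
          for (item in params._source.listItems) {
            if (item.value < min) { min = item.value; }
          }
          state.mins.add(min);
        """,
        "combine_script": """
          // per-shard maximum of the per-document minimums
          double max = Double.NEGATIVE_INFINITY;
          for (m in state.mins) { if (m > max) { max = m; } }
          return max;
        """,
        "reduce_script": """
          // overall maximum across shards
          double max = Double.NEGATIVE_INFINITY;
          for (s in states) { if (s != null && s > max) { max = s; } }
          return max;
        """
      }
    }
  }
}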
I want to subtract 15 minutes from all dates in the history that are less than 15 minutes old.
So I have to compare the date now minus 15 minutes to the record's date.
However, when I retrieve the date, I cannot compare it, because it comes back as a String, and adding ".value" gives an error saying the attribute does not exist.
Error response:
"if(ctx._source.histories[i].creation_date.value"
dynamic getter [java.lang.String, value] not found
Other solutions I tried, with their errors:
"if(ctx._source.histories[i].creation_date.date"
"if(ctx._source.histories[i].creation_date.getMillis()"
"if(ctx._source.histories[i].creation_date.value.getMillis()"
Update request (elasticsearch.js):
{
"query": { "term": { "user_id": "USER_ID" } },
"script":
{
"lang": "painless",
"source": "for(int i = ctx._source.histories.length-1; i > 0; --i){ if(ctx._source.histories[i].creation_date.value > params.date) { ctx._source.histories[i].creation_date -= 1000 * 60 * 15; } }",
"params": { "date": new Date() - 1000 * 60 * 15 }
}
}
Mapping:
{
"mappings":
{
"_doc":
{
"properties":
{
"histories":
{
"type": "nested",
"properties":
{
"type": { "type": "text" },
"key": { "type": "text" },
"value": { "type": "text" },
"ip": { "type": "ip" },
"useragent": { "type": "text" },
"creation_date": { "type": "date" }
}
}
}
}
}
}
Elasticsearch info:
{
"name" : "ZZZ",
"cluster_name" : "YYY",
"cluster_uuid" : "XXX",
"version" : {
"number" : "6.5.2",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "WWW",
"build_date" : "2018-11-29T23:58:20.891072Z",
"build_snapshot" : false,
"lucene_version" : "7.5.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
Sample data:
{
"hits":
{
"total": 1,
"max_score": 4.13468,
"hits":
[
{
"_index": "myindex",
"_type": "_doc",
"_id": "H1dQ4WgBypYasGfnnXXI",
"_score": 4.13468,
"_source":
{
"infos":
{
"firsname": "John",
"lastname": "Doe",
"mail": "john.doe#stackoverflow.com"
},
"histories":
[
{
"type": "auth",
"key": "try",
"value": "fail",
"ip": "127.0.0.1",
"useragent": "iPhoneX",
"creation_date": "2019-02-19T16:49:00.396Z"
},
{
"type": "auth",
"key": "try",
"value": "fail",
"ip": "127.0.0.1",
"useragent": "iPhoneX",
"creation_date": "2019-02-19T16:50:00.396Z"
}
]
}
}
]
}
}
I think I have something that might help you (tested on ES 6.6.0).
{
"query": {
"match_all": {}
},
"script": {
"lang": "painless",
"source": """
// parse params.date to an Instant
def paramDate = Instant.parse(params.date);
for (int i = ctx._source.histories.length - 1; i >= 0; --i) { // i >= 0 so the first entry is also checked
// parse the creation date to Instant
def creationDate = Instant.parse(ctx._source.histories[i].creation_date);
// check time difference between both
if (ChronoUnit.MINUTES.between(creationDate, paramDate) <= 15) {
// remove 15 minutes if condition satisfied
ctx._source.histories[i].creation_date = creationDate.minusSeconds(900).toString();
}
}
""",
"params": {
"date": "2019-02-19T16:45:00.000Z"
}
}
}
Note: I'm using triple quotes to make the query more readable, but feel free to inline it again as you see fit and remove the comments.
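For completeness, a sketch of how this would be run with the Update By Query API against the index from the sample data (the script body is the one above, elided here):
POST myindex/_update_by_query
{
  "query": {
    "term": { "user_id": "USER_ID" }
  },
  "script": {
    "lang": "painless",
    "source": "...",
    "params": {
      "date": "2019-02-19T16:45:00.000Z"
    }
  }
}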