Compute difference between field and aggregated field - elasticsearch

I have to run a complex aggregation. One of its steps computes the sum of the sold_qty field, and I then need to subtract that sum from the non-aggregated field all_qty. My data looks like:
{item_id: XXX, sold_qty: 1, all_qty: 20, price: 100 }
{item_id: XXX, sold_qty: 3, all_qty: 20, price: 100 }
{item_id: YYY, sold_qty: 1, all_qty: 20, price: 80 }
These are transactions for an offer. The all_qty and price fields are redundant: they hold single values that belong to another structure (offers) and are simply duplicated in every transaction of a single offer (identified by item_id).
In the terms of SQL what I need is:
SELECT (all_qty - sum(sold_qty)) * price GROUP BY item_id
What I've done so far is this aggregation:
'{
"query": {"term": {"seller": 9059247}},
"size": 0,
"aggs": {
"group_by_offer": {
"terms": { "field": "item_id", "size": 0},
"aggs": { "sold_sum": {"sum": {"field": "sold_qty"}}}
}
}
}'
But I don't know what to do next to achieve my goal.

Since you are already storing redundant fields, if I were you I would also store the results of all_price = all_qty * price and sold_price = sold_qty * price. It is not mandatory, but it will be faster at execution time than running scripts to do the same computation.
{item_id: XXX, sold_qty: 1, sold_price: 100, all_qty: 20, price: 100, all_price: 2000 }
{item_id: XXX, sold_qty: 3, sold_price: 300, all_qty: 20, price: 100, all_price: 2000 }
{item_id: YYY, sold_qty: 1, sold_price: 80, all_qty: 20, price: 80, all_price: 1600 }
All you'd have to do next is sum sold_price, average all_price (every transaction of an offer carries the same all_price, so the average is just that value), and get the difference between the two using a bucket_script pipeline aggregation:
{
"query": {
"term": {
"seller": 9059247
}
},
"size": 0,
"aggs": {
"group_by_offer": {
"terms": {
"field": "item_id",
"size": 0
},
"aggs": {
"sold_sum": {
"sum": {
"field": "sold_price"
}
},
"all_sum": {
"avg": {
"field": "all_price"
}
},
"diff": {
"bucket_script": {
"buckets_path": {
"sold": "sold_sum",
"all": "all_sum"
},
"script": "params.all - params.sold"
}
}
}
}
}
}
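To sanity-check the numbers, here is a minimal Python sketch that mimics what this aggregation would compute for the three sample documents. It runs locally without Elasticsearch. The function name `remaining_value_per_offer` is made up for illustration, and the derived fields are computed in code rather than read from an index.

```python
from collections import defaultdict

# Raw sample transactions from the question.
raw_docs = [
    {"item_id": "XXX", "sold_qty": 1, "all_qty": 20, "price": 100},
    {"item_id": "XXX", "sold_qty": 3, "all_qty": 20, "price": 100},
    {"item_id": "YYY", "sold_qty": 1, "all_qty": 20, "price": 80},
]

# Add the precomputed fields suggested in the answer.
docs = [
    dict(d, sold_price=d["sold_qty"] * d["price"], all_price=d["all_qty"] * d["price"])
    for d in raw_docs
]


def remaining_value_per_offer(transactions):
    """Emulate terms(item_id) -> sum(sold_price), avg(all_price), all - sold."""
    buckets = defaultdict(list)
    for doc in transactions:
        buckets[doc["item_id"]].append(doc)

    result = {}
    for item_id, group in buckets.items():
        sold_sum = sum(d["sold_price"] for d in group)
        all_avg = sum(d["all_price"] for d in group) / len(group)
        result[item_id] = all_avg - sold_sum
    return result


diff = remaining_value_per_offer(docs)
print(diff)
```

The values match the SQL form (all_qty - sum(sold_qty)) * price: (20 - 4) * 100 for XXX and (20 - 1) * 80 for YYY.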

Related

Aggregation value inside array of array elasticsearch

I have a JSON structure like this:
[{
"id": 1,
"result": [{
"score": 0.0,
"result_rules": [{
"rule_id": "sr-1"
},
{
"rule_id": "sr-2"
}
]
}]
},
{
"id": 2,
"result": [{
"score": 0.0,
"result_rules": [{
"rule_id": "sr-1"
},
{
"rule_id": "sr-4"
}
]
}]
}]
I want to count the occurrences of each rule_id, so the result would be:
[
{
'rule_id': 'sr-1',
'doc_count': 2
},
{
'rule_id': 'sr-2',
'doc_count': 1
},
{
'rule_id': 'sr-4',
'doc_count': 1
}
]
I've tried something like this, but it returns an empty aggregation:
{
"aggs":{
"group_by_rule_id":{
"terms":{
"field": "result.result_rules.rule_id.keyword"
}
}
}
}
To aggregate on a nested structure, you have to use the nested aggregation.
See the example in the Elasticsearch documentation.
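As a rough sketch of what such a request could look like, the Python snippet below builds the request body as a dict. It assumes that both result and result.result_rules are mapped as nested, and that rule_id is a keyword field (if it is a text field with a keyword sub-field, the field would be result.result_rules.rule_id.keyword instead). The aggregation names `results` and `rules` are arbitrary. The small helper `count_rule_ids` only reproduces the expected counts locally for the two sample documents.

```python
import json
from collections import Counter

# Assumption: "result" and "result.result_rules" are both nested fields.
nested_terms_query = {
    "size": 0,
    "aggs": {
        "results": {
            "nested": {"path": "result"},
            "aggs": {
                "rules": {
                    "nested": {"path": "result.result_rules"},
                    "aggs": {
                        "group_by_rule_id": {
                            "terms": {"field": "result.result_rules.rule_id"}
                        }
                    },
                }
            },
        }
    },
}

# The two sample documents from the question.
docs = [
    {"id": 1, "result": [{"score": 0.0,
                          "result_rules": [{"rule_id": "sr-1"}, {"rule_id": "sr-2"}]}]},
    {"id": 2, "result": [{"score": 0.0,
                          "result_rules": [{"rule_id": "sr-1"}, {"rule_id": "sr-4"}]}]},
]


def count_rule_ids(documents):
    """Count rule_id occurrences across all nested result_rules objects."""
    counts = Counter()
    for doc in documents:
        for res in doc["result"]:
            for rule in res["result_rules"]:
                counts[rule["rule_id"]] += 1
    return counts


expected = count_rule_ids(docs)
print(json.dumps(nested_terms_query, indent=2))
print(dict(expected))
```

Note that inside a nested aggregation the doc_count of each bucket counts nested objects, not top-level documents. For this sample the two are the same.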

_update_by_query fails to update all documents in ElasticSearch

I have over 30 million documents in Elasticsearch (version 6.3.3). I am trying to add a new field to all existing documents and set its value to 0.
For example, I want to add a start field, which did not previously exist in the Twitter document, and set its initial value to 0 in all 30 million documents.
In my case only about 4 million documents were updated. When I check the submitted task with the Task API (http://localhost:9200/_tasks/{taskId}), the result looks like this:
{
"completed": false,
"task": {
"node": "Jsecb8kBSdKLC47Q28O6Pg",
"id": 5968304,
"type": "transport",
"action": "indices:data/write/update/byquery",
"status": {
"total": 34002005,
"updated": 3618000,
"created": 0,
"deleted": 0,
"batches": 3619,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0
},
"description": "update-by-query [Twitter][tweet] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.Twitter.start = 0;', options={}, params={}}",
"start_time_in_millis": 1574677050104,
"running_time_in_nanos": 466805438290,
"cancellable": true,
"headers": {}
}
}
The query I am executing against ES is something like:
curl -XPOST "http://localhost:9200/_update_by_query?wait_for_completion=false&conflicts=proceed" -H 'Content-Type: application/json' -d'
{
"script": {
"source": "ctx._source.Twitter.start = 0;"
},
"query": {
"exists": {
"field": "Twitter"
}
}
}'
Any suggestions would be great, thanks
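One way to read the task status above is to turn it into a progress estimate. The following is only a sketch using the numbers from the question: the helper name `summarize_progress` is made up, and the estimate assumes the throughput stays constant. Since the status reports "completed": false, it suggests the task was still running when it was polled rather than finished.

```python
# Values copied from the task status in the question.
status = {"total": 34002005, "updated": 3618000, "batches": 3619}
running_time_in_nanos = 466805438290


def summarize_progress(task_status, running_nanos):
    """Derive progress, throughput and a naive ETA from a task status."""
    elapsed_s = running_nanos / 1e9
    processed = task_status["updated"]
    rate = processed / elapsed_s  # documents per second so far
    remaining = task_status["total"] - processed
    return {
        "fraction_done": processed / task_status["total"],
        "docs_per_second": rate,
        "estimated_seconds_left": remaining / rate,  # assumes constant rate
    }


summary = summarize_progress(status, running_time_in_nanos)
print(summary)
```

With these numbers the task is roughly 10% done after about 467 seconds, so at the same rate the remaining documents would need a bit over an hour.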

How do I sort buckets by Term Aggregation's nested doc_count?

I have an index, invoices, that I need to aggregate into yearly buckets then sort.
I have succeeded in using Bucket Sort to sort my buckets by simple sum values (revenue and tax). However, I am struggling to sort by more deeply nested doc_count values (status).
I want to order my buckets not only by revenue, but also by the number of docs with a status field equal to 1, 2, 3 etc...
The documents in my index look like this:
"_source": {
"created_at": "2018-07-07T03:11:34.327Z",
"status": 3,
"revenue": 68.474,
"tax": 6.85
}
I request my aggregations like this:
const params = {
index: 'invoices',
size: 0,
body: {
aggs: {
sales: {
date_histogram: {
field: 'created_at',
interval: 'year',
},
aggs: {
total_revenue: { sum: { field: 'revenue' } },
total_tax: { sum: { field: 'tax' } },
statuses: {
terms: {
field: 'status',
},
},
sales_bucket_sort: {
bucket_sort: {
sort: [{ total_revenue: { order: 'desc' } }],
},
},
},
},
},
},
}
The response (truncated) looks like this:
"aggregations": {
"sales": {
"buckets": [
{
"key_as_string": "2016-01-01T00:00:00.000Z",
"key": 1451606400000,
"doc_count": 254,
"total_tax": {
"value": 735.53
},
"statuses": {
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2,
"doc_count": 59
},
{
"key": 1,
"doc_count": 58
},
{
"key": 5,
"doc_count": 57
},
{
"key": 3,
"doc_count": 40
},
{
"key": 4,
"doc_count": 40
}
]
},
"total_revenue": {
"value": 7355.376005351543
}
},
]
}
}
I want to sort by key: 1, for example. Order the buckets according to which one has the greatest number of docs with a status value of 1. I tried to order my terms aggregation, then specify the desired key like this:
statuses: {
terms: {
field: 'status',
order: { _key: 'asc' },
},
},
sales_bucket_sort: {
bucket_sort: {
sort: [{ 'statuses.buckets[0]._doc_count': { order: 'desc' } }],
},
},
However this did not work. It didn't error, it just doesn't seem to have any effect.
I noticed someone else on SO had a similar question many years ago, but I was hoping a better answer had emerged since then: Elasticsearch aggregation. Order by nested bucket doc_count
Thanks!
Nevermind I figured it out. I added a separate filter aggregation like this:
aggs: {
total_revamnt: { sum: { field: 'revamnt' } },
total_purchamnt: { sum: { field: 'purchamnt' } },
approved_invoices: {
filter: {
term: {
status: 1,
},
},
},
Then I was able to bucket sort that value like this:
sales_bucket_sort: {
bucket_sort: {
sort: [{ 'approved_invoices>_count': { order: 'asc' } }],
},
},
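Putting the two pieces together, the sketch below builds the full request as a Python dict (mirroring the params object from the question) and emulates the resulting order locally. It uses the field names from the question (revenue, tax). The names `build_request`, `sort_buckets` and `status_match` are made up for illustration, and the 2017 bucket is hypothetical data that does not come from the question.

```python
def build_request(status_value, order="desc"):
    """Build search params that sort yearly buckets by a filter's doc count."""
    return {
        "index": "invoices",
        "size": 0,
        "body": {
            "aggs": {
                "sales": {
                    "date_histogram": {"field": "created_at", "interval": "year"},
                    "aggs": {
                        "total_revenue": {"sum": {"field": "revenue"}},
                        "total_tax": {"sum": {"field": "tax"}},
                        # Single-bucket filter: counts docs with the given status.
                        "status_match": {"filter": {"term": {"status": status_value}}},
                        "sales_bucket_sort": {
                            "bucket_sort": {
                                "sort": [{"status_match>_count": {"order": order}}]
                            }
                        },
                    },
                }
            }
        },
    }


def sort_buckets(buckets, order="desc"):
    """Local emulation of the bucket_sort ordering, for illustration only."""
    return sorted(
        buckets,
        key=lambda b: b["status_match"]["doc_count"],
        reverse=(order == "desc"),
    )


params = build_request(status_value=1)

sample_buckets = [
    # 58 is the status-1 count shown for 2016 in the question's response.
    {"key_as_string": "2016-01-01T00:00:00.000Z", "status_match": {"doc_count": 58}},
    # Hypothetical bucket, invented to show the ordering.
    {"key_as_string": "2017-01-01T00:00:00.000Z", "status_match": {"doc_count": 73}},
]
ordered = [b["key_as_string"][:4] for b in sort_buckets(sample_buckets)]
print(ordered)
```

Passing order="desc" puts the year with the most status-1 documents first, which is what the question asks for; the snippet above uses 'asc'.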
In case anyone runs into this issue again: in my latest attempt with Elasticsearch 7.10, it can work this way:
sales_bucket_sort: {
bucket_sort: {
sort: [{ '_count': { order: 'asc' } }],
},
}
With only _count specified, it will automatically take the doc_count and sort accordingly.
I believe this answer will just sort by the doc_count of the date_histogram aggregation, not the nested sort.
JP's answer works: create a filter with the target field: value then sort by it.

elasticsearch spend all time in build_scorer

When we upgraded from ES 1.4 to ES 5.2, we ran into a performance problem with queries of this type:
{
"_source": false,
"from": 0,
"size": 50,
"profile": true,
"query": {
"bool": {
"filter": [
{
"ids": {
"values": [<list of 400 ids>],
"boost": 1
}
}
],
"should": [
{
"terms": {
"views": [ <list of 20 ints> ]
}
}
],
"minimum_should_match": "0",
"boost": 1
}
}
}
When profiling, we found that the problem is in build_scorer, which is called once for each segment:
1 shard;
20 segments;
took: 55
{
"type": "BooleanQuery",
"description": "views:[9875227 TO 9875227] views:[6991599 TO 6991599] views:[6682953 TO 6682953] views:[6568587 TO 6568587] views:[10080097 TO 10080097] views:[9200174 TO 9200174] views:[9200174 TO 9200174] views:[10080097 TO 10080097] views:[9966870 TO 9966870] views:[6568587 TO 6568587] views:[6568587 TO 6568587] views:[8538669 TO 8538669] views:[8835038 TO 8835038] views:[9200174 TO 9200174] views:[7539089 TO 7539089] views:[6991599 TO 6991599] views:[8222303 TO 8222303] views:[9342166 TO 9342166] views:[7828288 TO 7828288] views:[9699294 TO 9699294] views:[9108691 TO 9108691] views:[9431297 TO 9431297] views:[7539089 TO 7539089] views:[6032694 TO 6032694] views:[9491741 TO 9491741] views:[9498225 TO 9498225] views:[8051047 TO 8051047] views:[9866955 TO 9866955] views:[8222303 TO 8222303] views:[9622214 TO 9622214]",
"time": "39.70427700ms",
"breakdown": {
"score": 99757,
"build_scorer_count": 20,
"match_count": 0,
"create_weight": 37150,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 110,
"build_scorer": 38648674,
"advance": 918274,
"advance_count": 291
},
So about 38 ms of the total 55 ms was spent in build_scorer, which seems weird.
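To make the proportions explicit, here is a small Python sketch that computes each phase's share of the query time from the breakdown above. The profile API reports these timings in nanoseconds. The helper name `timing_shares` is made up for illustration.

```python
# Breakdown copied from the profile output in the question (nanoseconds).
breakdown = {
    "score": 99757,
    "build_scorer_count": 20,
    "match_count": 0,
    "create_weight": 37150,
    "next_doc": 0,
    "match": 0,
    "create_weight_count": 1,
    "next_doc_count": 0,
    "score_count": 110,
    "build_scorer": 38648674,
    "advance": 918274,
    "advance_count": 291,
}


def timing_shares(profile_breakdown):
    """Return total time and each phase's fraction, ignoring *_count entries."""
    timings = {k: v for k, v in profile_breakdown.items() if not k.endswith("_count")}
    total = sum(timings.values())
    return total, {k: v / total for k, v in timings.items()}


total_ns, shares = timing_shares(breakdown)
per_segment_ms = breakdown["build_scorer"] / breakdown["build_scorer_count"] / 1e6

print(f"total: {total_ns / 1e6:.2f} ms")
print(f"build_scorer share: {shares['build_scorer']:.1%}")
print(f"build_scorer per call: {per_segment_ms:.2f} ms")
```

The timings add up to the 39.70 ms reported for the BooleanQuery, and build_scorer accounts for about 97% of it, or a little under 2 ms for each of the 20 calls.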
On ES 1.5 we have about the same number of segments, but the query runs 10x faster.
Unfortunately, ES 1.x doesn't have a profiler, so we can't check how many times build_scorer executes there.
So the question is: why is build_scorer_count equal to the number of segments, and how can we tackle this performance issue?

Elasticsearch Query - how to?

I have the data in the following format in Elastic Search (from sense)
POST slots/slot/1
{
"locationid": "1",
"roomid": "10",
"starttime": "08:45"
}
POST slots/slot/2
{
"locationid": "1",
"roomid": "10",
"starttime": "09:00"
}
POST slots/slot/3
{
"locationid": "2",
"roomid": "100",
"starttime": "08:45"
}
POST slots/slot/4
{
"locationid": "2",
"roomid": "101",
"starttime": "09:00"
}
POST slots/slot/5
{
"locationid": "3",
"roomid": "200",
"starttime": "09:30"
}
In short, the data is in the following format.
A location has multiple rooms and each room has multiple slots of 15 minutes. So slot 1 for room 10 starts at 08:45 and ends at 09:00, and slot 2 for the same room starts at 09:00 and ends at 09:15.
Locationid  RoomId  Starttime
-----------------------------
1           10      08:45
1           10      09:00
2           100     08:45
2           101     09:00
3           200     09:30
I'm trying to write a query/filter that will give me all locations where a room is available with two or three slots.
E.g. find a location that has both the 08:45 slot and the 09:00 slot (configurable).
The answer should be location 1 only.
It should not be location 2, as room 100 has the 08:45 slot but not the 09:00 slot, and room 101 has the 09:00 slot but doesn't have the 08:45 slot.
I believe this is not the best approach, but here is my attempt:
POST slots/slot/_search?pretty=true&search_type=count
{
"facets": {
"locationswithslots": {
"terms": {
"field": "locationid",
"script" : "term + \"_\" + _source.roomid",
"size": 10
},
"facet_filter":
{
"terms":
{
"starttime":
[
"08:45",
"09:00"
]
}
}
}
}
}
This gives the answer as below
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"facets": {
"locationswithslots": {
"_type": "terms",
"missing": 0,
"total": 4,
"other": 0,
"terms": [
{
"term": "1_10",
"count": 2
},
{
"term": "2_101",
"count": 1
},
{
"term": "2_100",
"count": 1
}
]
}
}
}
Now I need to figure out a way to keep only the facet entries that return count 2, since I passed 2 slots in the filter.
Any other option possible?
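One possible option, sketched here in Python, is to do that last filtering step on the client side: keep only the facet entries whose count equals the number of requested slots, then take the location part of the term. This is only an illustration, not a server-side solution. The function name `locations_with_all_slots` is made up, and it assumes each room has at most one document per start time, so a count equal to the number of requested slots means every slot is present.

```python
# Facet terms copied from the response in the question.
facet_terms = [
    {"term": "1_10", "count": 2},
    {"term": "2_101", "count": 1},
    {"term": "2_100", "count": 1},
]
requested_slots = ["08:45", "09:00"]


def locations_with_all_slots(terms, slots):
    """Return location ids having a room that matched every requested slot."""
    wanted = len(slots)
    locations = []
    for entry in terms:
        if entry["count"] == wanted:
            # Terms look like "<locationid>_<roomid>", as built by the facet script.
            location_id, _room_id = entry["term"].split("_", 1)
            if location_id not in locations:
                locations.append(location_id)
    return locations


matches = locations_with_all_slots(facet_terms, requested_slots)
print(matches)
```

For the sample response only room 10 in location 1 matched both slots, so location 2 is dropped even though its two rooms together cover both start times.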
