Elastic Search return object with sum aggregation - elasticsearch

I am trying to get a list of the top 100 guests by revenue generated with Elasticsearch. To do this I am using a terms and a sum aggregation. Although it does return the correct values, I want to return the entire guest object with the aggregation.
This is my query:
GET reservations/_search
{
"size": 0,
"aggs": {
"top_revenue": {
"terms": {
"field": "total",
"size": 100,
"order": {
"top_revenue_hits": "desc"
}
},
"aggs": {
"top_revenue_hits": {
"sum": {
"field": "total"
}
}
}
}
}
}
This returns a list of the top 100 guests but only the amount they spent:
{
"aggregations" : {
"top_revenue" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 498,
"buckets" : [
{
"key" : 934.9500122070312,
"doc_count" : 8,
"top_revenue_hits" : {
"value" : 7479.60009765625
}
},
{
"key" : 922.0,
"doc_count" : 6,
"top_revenue_hits" : {
"value" : 5532.0
}
},
...
]
}
}
}
How can I get the query to return the entire guest object, not only the summed amount?
When I run GET reservations/_search it returns:
{
"hits": [
{
"_index": "reservations",
"_id": "1334620",
"_score": 1.0,
"_source": {
"id": "1334620",
"total": 110.8,
"payment": "unpaid",
"contact": {
"name": "John Doe",
"email": "john#mail.com"
}
}
},
... other reservations
]
}
I want to get this to return with the sum aggregation.
I have tried a top_hits aggregation with _source. It does return the entire guest object, but it does not show the total amount spent. Adding _source to the sum aggregation gives an error.
Can I return the entire guest object with a sum aggregation, or is this not the correct way?

I assumed that contact.name is mapped as a keyword field. The following query should work for you.
{
"size": 0,
"aggs": {
"guests": {
"terms": {
"field": "contact.name",
"size": 100
},
"aggs": {
"sum_total": {
"sum": {
"field": "total"
}
},
"sortBy": {
"bucket_sort": {
"sort": [
{ "sum_total": { "order": "desc" } }
]
}
},
"guest": {
"top_hits": {
"size": 1
}
}
}
}
}
}
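With this query, each bucket carries both the summed total and one sample reservation, so the two can be paired on the client. Below is a minimal Python sketch, assuming the response has already been parsed into a dict and that the aggregation names match the query above (guests, sum_total, guest). The helper name top_guests and the sample response are hypothetical.

```python
def top_guests(response):
    """Pair each guest's contact object with the summed revenue of its bucket."""
    rows = []
    for bucket in response["aggregations"]["guests"]["buckets"]:
        # top_hits with size 1 returns one representative reservation per guest
        hit = bucket["guest"]["hits"]["hits"][0]
        rows.append({
            "contact": hit["_source"]["contact"],
            "revenue": bucket["sum_total"]["value"],
            "reservations": bucket["doc_count"],
        })
    return rows


# Hypothetical response fragment shaped like the output of the query above
sample = {
    "aggregations": {"guests": {"buckets": [
        {"key": "John Doe", "doc_count": 2,
         "sum_total": {"value": 221.6},
         "guest": {"hits": {"hits": [{"_source": {
             "id": "1334620", "total": 110.8,
             "contact": {"name": "John Doe", "email": "john#mail.com"}}}]}}},
    ]}}
}

rows = top_guests(sample)
```

Keep in mind that bucket_sort only reorders the buckets that the terms aggregation has already selected, so the 100 guests are chosen by the terms ordering first and sorted by revenue afterwards.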

Related

Get top values from Elasticsearch bucket

I have some items with brand
I want to return N records, but no more than x from each bucket
So far I have my buckets grouped by brand
"aggs": {
"brand": {
"terms": {
"field": "brand"
}
}
}
"aggregations" : {
"brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "brandA",
"doc_count" : 130
},
{
"key" : "brandB",
"doc_count" : 127
}
]
}
But how do I access specific bucket and get top x values from there?
You can use a top_hits sub-aggregation to get the documents under each brand. You can sort those documents and define a size too. The size of the terms aggregation sets the number of brands, and the size of top_hits sets the number of documents returned under each brand.
{
"aggs": {
"brand": {
"terms": {
"field": "brand",
"size": 10
},
"aggs": {
"top_docs": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
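To end up with N records overall and at most x per brand, the per-brand hits can be flattened on the client. Below is a minimal Python sketch, assuming the parsed response uses the aggregation names from the query above (brand, top_docs). The helper name collect_records and the sample response are hypothetical.

```python
def collect_records(response, total_limit):
    """Flatten top_hits results into one list capped at total_limit records."""
    records = []
    for bucket in response["aggregations"]["brand"]["buckets"]:
        # Each bucket already holds at most `size` hits from top_hits
        for hit in bucket["top_docs"]["hits"]["hits"]:
            if len(records) == total_limit:
                return records
            records.append({"brand": bucket["key"], "source": hit["_source"]})
    return records


# Hypothetical response: two brands, two hits each (top_hits size = 2)
sample = {"aggregations": {"brand": {"buckets": [
    {"key": "brandA", "top_docs": {"hits": {"hits": [
        {"_source": {"name": "a1"}}, {"_source": {"name": "a2"}}]}}},
    {"key": "brandB", "top_docs": {"hits": {"hits": [
        {"_source": {"name": "b1"}}, {"_source": {"name": "b2"}}]}}},
]}}}

records = collect_records(sample, total_limit=3)
```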

Sub-aggregate a multi-level nested composite aggregation

I'm trying to set up a search query that should composite aggregate a collection by a multi-level nested field and give me some sub-aggregation metrics from this collection. I was able to fetch the composite aggregation with its buckets as expected but the sub-aggregation metrics come with 0 for all buckets. I'm not sure if I am failing to correctly point out what fields the sub-aggregation should consider or if it should be placed inside a different part of the query.
My collection looks similar to the following:
{
id: '32ead132eq13w21',
statistics: {
clicks: 123,
views: 456
},
categories: [{ //nested type
name: 'color',
tags: [{ //nested type
slug: 'blue'
},{
slug: 'red'
}]
}]
}
Below you can find what I have tried so far. All buckets come back with a clicks sum of 0 even though all documents have a clicks value set.
GET /acounts-123321/_search
{
"size": 0,
"aggs": {
"nested_categories": {
"nested": {
"path": "categories"
},
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"group": {
"composite": {
"size": 100,
"sources": [
{ "slug": { "terms" : { "field": "categories.tags.slug"} }}
]
},
"aggregations": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
The response body I have so far:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1304,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"nested_categories" : {
"doc_count" : 1486,
"nested_tags" : {
"doc_count" : 1486,
"group" : {
"buckets" : [
{
"key" : {
"slug" : "red"
},
"doc_count" : 268,
"clicks" : {
"value" : 0.0
}
}, {
"key" : {
"slug" : "blue"
},
"doc_count" : 122,
"clicks" : {
"value" : 0.0
},
.....
]
}
}
}
}
}
In order for this to work, all sources in the composite aggregation would need to be under the same nested context.
I've answered something similar a while ago. The asker needed to put the nested values onto the top level. You have the opposite challenge: given that the statistics.clicks field is on the top level, you'd need to duplicate it across each entry of categories.tags which, I suspect, won't be feasible because you're likely updating these stats every now and then.
If you're OK with skipping the composite approach and using the terms agg instead, you could make the summation work by jumping back to the top level through reverse_nested:
{
"size": 0,
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"by_slug": {
"terms": {
"field": "categories.tags.slug",
"size": 100
},
"aggs": {
"back_to_parent": {
"reverse_nested": {},
"aggs": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
This works just as well but does not offer pagination.
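Reading the result takes one extra step, because the sum sits inside the reverse_nested bucket. Below is a minimal Python sketch, assuming a parsed response with the aggregation names from the query above (nested_tags, by_slug, back_to_parent, clicks). The helper name clicks_by_slug and the sample values are hypothetical.

```python
def clicks_by_slug(response):
    """Map each tag slug to the clicks summed over its parent documents."""
    buckets = response["aggregations"]["nested_tags"]["by_slug"]["buckets"]
    return {b["key"]: b["back_to_parent"]["clicks"]["value"] for b in buckets}


# Hypothetical response fragment; the numbers are made up for illustration
sample = {"aggregations": {"nested_tags": {"doc_count": 390, "by_slug": {"buckets": [
    {"key": "red", "doc_count": 268,
     "back_to_parent": {"doc_count": 250, "clicks": {"value": 1230.0}}},
    {"key": "blue", "doc_count": 122,
     "back_to_parent": {"doc_count": 120, "clicks": {"value": 456.0}}},
]}}}}

totals = clicks_by_slug(sample)
```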
Clarification
If you needed a color filter, you could do:
{
"size": 0,
"aggs": {
"categories_parent": {
"nested": {
"path": "categories"
},
"aggs": {
"filtered_by_color": {
"filter": {
"term": {
"categories.name": "color"
}
},
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"by_slug": {
"terms": {
"field": "categories.tags.slug",
"size": 100
},
"aggs": {
"back_to_parent": {
"reverse_nested": {},
"aggs": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
}
}
}
}

How to aggregate until a certain value is reached in ElasticSearch?

I would like to aggregate a list of documents (each of them has two fields, timestamp and amount) by the "amount" field until a certain value is reached. For example, I would like to get a list of documents sorted by timestamp whose total amount is equal to 100. Is it possible to do this in one query?
Here is my query, which returns the total amount. I would like to add a condition that stops the aggregation when a certain value is reached.
{
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": 1525168583
}
}
}
]
}
},
"aggs": {
"total_amount": {
"sum": {
"field": "amount"
}
}
},
"sort": [
"timestamp"
],
"size": 10000
}
Thank You
It's perfectly possible using a combination of function_score scripting for mimicking sorting, filter aggs for the range gte query and a healthy amount of scripted_metric aggs to limit the summation up to a certain amount.
Let's first set up a mapping and ingest some docs:
PUT summation
{
"mappings": {
"properties": {
"timestamp": {
"type": "date",
"format": "epoch_second"
}
}
}
}
POST summation/_doc
{
"context": "newest",
"timestamp": 1587049128,
"amount": 20
}
POST summation/_doc
{
"context": "2nd newest",
"timestamp": 1586049128,
"amount": 30
}
POST summation/_doc
{
"context": "3rd newest",
"timestamp": 1585049128,
"amount": 40
}
POST summation/_doc
{
"context": "4th newest",
"timestamp": 1585049128,
"amount": 30
}
Then perform the query:
GET summation/_search
{
"size": 0,
"aggs": {
"filtered_agg": {
"filter": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1585049128
}
}
},
{
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": "return (params['now'] - doc['timestamp'].date.toMillis())",
"params": {
"now": 1587049676
}
}
}
}
}
]
}
},
"aggs": {
"limited_sum": {
"scripted_metric": {
"init_script": """
state['my_hash'] = new HashMap();
state['my_hash'].put('sum', 0);
state['my_hash'].put('docs', new ArrayList());
""",
"map_script": """
if (state['my_hash']['sum'] <= 100) {
state['my_hash']['sum'] += doc['amount'].value;
state['my_hash']['docs'].add(doc['context.keyword'].value);
}
""",
"combine_script": "return state['my_hash']",
"reduce_script": "return states[0]"
}
}
}
}
}
}
yielding
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"filtered_agg" : {
"meta" : { },
"doc_count" : 4,
"limited_sum" : {
"value" : {
"docs" : [
"newest",
"2nd newest",
"3rd newest",
"4th newest"
],
"sum" : 120
}
}
}
}
}
I've chosen here to only return the doc.contexts but you can adjust it to retrieve whatever you like -- be it IDs, amounts etc.
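Note that the map_script above checks the running sum before adding the next document, so the result can overshoot the limit (120 in the output above), and the reduce_script only returns the state of the first shard. If fetching the sorted documents with a regular search is acceptable, the cut-off can also be applied on the client. Below is a minimal Python sketch under that assumption. The helper name take_until is hypothetical, and unlike the script it stops before the limit would be exceeded.

```python
def take_until(docs, limit):
    """Return the earliest docs whose running total stays within limit."""
    selected, total = [], 0
    # sorted() is stable, so docs with equal timestamps keep their input order
    for doc in sorted(docs, key=lambda d: d["timestamp"]):
        if total + doc["amount"] > limit:
            break
        selected.append(doc)
        total += doc["amount"]
    return selected, total


# The four documents ingested above
docs = [
    {"context": "newest", "timestamp": 1587049128, "amount": 20},
    {"context": "2nd newest", "timestamp": 1586049128, "amount": 30},
    {"context": "3rd newest", "timestamp": 1585049128, "amount": 40},
    {"context": "4th newest", "timestamp": 1585049128, "amount": 30},
]

selected, total = take_until(docs, limit=100)
```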

ElasticSearch multiple terms aggregation order

I have a document structure which describes a container, some of its fields are:
containerId -> Unique Id,String
containerManufacturer -> String
containerValue -> Double
estContainerWeight ->Double
actualContainerWeight -> Double
I want to run a search aggregation which has two levels of terms aggregations on the two weight fields, but in descending order of the weight fields, like below:
{
"size": 0,
"aggs": {
"by_manufacturer": {
"terms": {
"field": "containerManufacturer",
"size": 10,
"order": {"estContainerWeight": "desc"} //Cannot do this
},
"aggs": {
"by_est_weight": {
"terms": {
"field": "estContainerWeight",
"size": 10,
"order": { "actualContainerWeight": "desc"} //Cannot do this
},
"aggs": {
"by_actual_weight": {
"terms": {
"field": "actualContainerWeight",
"size": 10
},
"aggs" : {
"container_value_sum" : {"sum" : {"field" : "containerValue"}}
}
}
}
}
}
}
}
}
Sample documents:
{"containerId":1,"containerManufacturer":"A","containerValue":12,"estContainerWeight":5.0,"actualContainerWeight":5.1}
{"containerId":2,"containerManufacturer":"A","containerValue":24,"estContainerWeight":5.0,"actualContainerWeight":5.2}
{"containerId":3,"containerManufacturer":"A","containerValue":23,"estContainerWeight":5.0,"actualContainerWeight":5.2}
{"containerId":4,"containerManufacturer":"A","containerValue":32,"estContainerWeight":6.0,"actualContainerWeight":6.2}
{"containerId":5,"containerManufacturer":"A","containerValue":26,"estContainerWeight":6.0,"actualContainerWeight":6.3}
{"containerId":6,"containerManufacturer":"A","containerValue":23,"estContainerWeight":6.0,"actualContainerWeight":6.2}
Expected Output(not complete):
{
"by_manufacturer": {
"buckets": [
{
"key": "A",
"by_est_weight": {
"buckets": [
{
"key" : 5.0,
"by_actual_weight" : {
"buckets" : [
{
"key" : 5.2,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
},
{
"key" : 5.1,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
}
]
}
},
{
"key" : 6.0,
"by_actual_weight" : {
"buckets" : [
{
"key" : 6.2,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
},
{
"key" : 6.3,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
}
]
}
}
]
}
}
]
}
}
However, I cannot order by the nested aggregations. (Error: Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation...)
For example, for the above sample output, I have no control over the buckets generated if I introduce a size on the terms aggregations (which I will have to do if my data is large), so I would like to get only the top N weights for each terms aggregation.
Is there a way to do this?
If I understand your problem correctly, you would like to sort the manufacturer terms in decreasing order of the estimated weights of their containers, and then each "estimated weight" bucket in decreasing order of actual weight. Order each weight terms aggregation by its own key using "_term" (on Elasticsearch 6.0 and later, use "_key" instead), and nest each level inside an "aggs" block:
{
"size": 0,
"aggs": {
"by_manufacturer": {
"terms": {
"field": "containerManufacturer",
"size": 10
},
"aggs": {
"by_est_weight": {
"terms": {
"field": "estContainerWeight",
"size": 10,
"order": {
"_term": "desc"
}
},
"aggs": {
"by_actual_weight": {
"terms": {
"field": "actualContainerWeight",
"size": 10,
"order": {"_term": "desc"}
},
"aggs": {
"container_value_sum": {
"sum": {
"field": "containerValue"
}
}
}
}
}
}
}
}
}
}
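To check the ordering, the three bucket levels can be flattened into rows on the client. Below is a minimal Python sketch, assuming a parsed response with the aggregation names used in the question (by_manufacturer, by_est_weight, by_actual_weight, container_value_sum). The helper name flatten_buckets is hypothetical, and the sample response was worked out by hand from the six sample documents, not taken from a real cluster.

```python
def flatten_buckets(response):
    """Turn three nested terms levels into (manufacturer, est, actual, sum) rows."""
    rows = []
    for m in response["aggregations"]["by_manufacturer"]["buckets"]:
        for e in m["by_est_weight"]["buckets"]:
            for a in e["by_actual_weight"]["buckets"]:
                rows.append((m["key"], e["key"], a["key"],
                             a["container_value_sum"]["value"]))
    return rows


# Hand-built response for the sample documents, keys in descending order
sample = {"aggregations": {"by_manufacturer": {"buckets": [
    {"key": "A", "by_est_weight": {"buckets": [
        {"key": 6.0, "by_actual_weight": {"buckets": [
            {"key": 6.3, "container_value_sum": {"value": 26.0}},
            {"key": 6.2, "container_value_sum": {"value": 55.0}}]}},
        {"key": 5.0, "by_actual_weight": {"buckets": [
            {"key": 5.2, "container_value_sum": {"value": 47.0}},
            {"key": 5.1, "container_value_sum": {"value": 12.0}}]}},
    ]}},
]}}}

rows = flatten_buckets(sample)
```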

Elasticsearch Histogram of visits

I'm quite new to Elasticsearch and I am failing to build a histogram based on ranges of visits. I am not even sure that it's possible to create this kind of chart with a single query in Elasticsearch, but I have the feeling that it could be possible with a pipeline aggregation or maybe a scripted aggregation.
Here is a test dataset with which I'm working:
PUT /test_histo
{ "settings": { "number_of_shards": 1 }}
PUT /test_histo/_mapping/visit
{
"properties": {
"user": {"type": "string" },
"datevisit": {"type": "date"},
"page": {"type": "string"}
}
}
POST test_histo/visit/_bulk
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Jean","page":"productXX.hmtl","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Robert","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"media_center.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"media_center.html","datevisit":"2015-11-26"}
If we consider the ranges [1,2[, [2,3[, [3, inf.[
The expected result should be :
[1,2[ = 2
[2,3[ = 1
[3, inf.[ = 1
All my efforts to build a histogram of customer visit frequency have so far been unsuccessful. I would appreciate any tips, tricks or ideas.
There are two ways you can do it.
The first is to do it in Elasticsearch, which requires a Scripted Metric Aggregation (see the Scripted Metric Aggregation documentation).
Your query would look like this:
{
"size": 0,
"aggs": {
"visitors_over_time": {
"date_histogram": {
"field": "datevisit",
"interval": "week"
},
"aggs": {
"no_of_visits": {
"scripted_metric": {
"init_script": "_agg['values'] = new java.util.HashMap();",
"map_script": "if (_agg.values[doc['user'].value]==null) {_agg.values[doc['user'].value]=1} else {_agg.values[doc['user'].value]+=1;}",
"combine_script": "someHashMap = new java.util.HashMap();for(x in _agg.values.keySet()) {value=_agg.values[x];if(value<3){key='[' + value +',' + (value + 1) + '[';}else{key='[3,inf[';}; if(someHashMap[key]==null){someHashMap[key] = 1}else{someHashMap[key] += 1}}; return someHashMap;"
}
}
}
}
}
}
You can change the period by setting the interval field of the date_histogram object to a value such as day, week or month.
Your response would look like this:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"visitors_over_time": {
"buckets": [
{
"key_as_string": "2015-11-23T00:00:00.000Z",
"key": 1448236800000,
"doc_count": 7,
"no_of_visits": {
"value": [
{
"[2,3[": 1,
"[3,inf[": 1,
"[1,2[": 2
}
]
}
}
]
}
}
}
The second method is to do the work of the scripted_metric on the client side, using the result of a Terms Aggregation (see the Terms Aggregation documentation).
Your query will look like this:
GET test_histo/visit/_search
{
"size": 0,
"aggs": {
"visitors_over_time": {
"date_histogram": {
"field": "datevisit",
"interval": "week"
},
"aggs": {
"no_of_visits": {
"terms": {
"field": "user",
"size": 10
}
}
}
}
}
}
and the response will be
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"visitors_over_time": {
"buckets": [
{
"key_as_string": "2015-11-23T00:00:00.000Z",
"key": 1448236800000,
"doc_count": 7,
"no_of_visits": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john",
"doc_count": 3
},
{
"key": "mary",
"doc_count": 2
},
{
"key": "jean",
"doc_count": 1
},
{
"key": "robert",
"doc_count": 1
}
]
}
}
]
}
}
}
From this response you can count, for each period, how many users fall into each doc_count range.
Have a look at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
If you want to show it in a ready-made UI, use Kibana.
A query like this:
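The client-side counting can be sketched in a few lines of Python. This assumes a parsed response with the aggregation names from the terms query above (visitors_over_time, no_of_visits) and the ranges [1,2[, [2,3[ and [3,inf[ from the question. The helper name visit_histograms is hypothetical.

```python
from collections import Counter


def visit_histograms(response):
    """For each period, count how many users fall into each visit range."""
    result = {}
    for period in response["aggregations"]["visitors_over_time"]["buckets"]:
        hist = Counter()
        for user in period["no_of_visits"]["buckets"]:
            visits = user["doc_count"]
            # Everything from 3 visits upwards shares one open-ended range
            label = "[3,inf[" if visits >= 3 else f"[{visits},{visits + 1}["
            hist[label] += 1
        result[period["key_as_string"]] = dict(hist)
    return result


# The response shown above, reduced to the fields the function reads
sample = {"aggregations": {"visitors_over_time": {"buckets": [
    {"key_as_string": "2015-11-23T00:00:00.000Z", "doc_count": 7,
     "no_of_visits": {"buckets": [
         {"key": "john", "doc_count": 3},
         {"key": "mary", "doc_count": 2},
         {"key": "jean", "doc_count": 1},
         {"key": "robert", "doc_count": 1}]}},
]}}}

histograms = visit_histograms(sample)
```

For this to be complete, the size of the terms aggregation has to be large enough to return every user in a period, since users beyond that size are not included in the buckets.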
GET _search
{
"query": {
"match_all": {}
},
"aggs" : {
"visits" : {
"date_histogram" : {
"field" : "datevisit",
"interval" : "month"
}
}
}
}
should give you a histogram. I don't have Elasticsearch here at the moment, so there might be some fat-fingered typos.
Then you could add query terms to show the histogram only for a specific page, or you could have an outer aggregation bucket which aggregates per page or user.
Something like this:
GET _search
{
"query": {
"match_all": {}
},
"aggs" : {
"users" : {
"terms" : {
"field" : "user"
},
"aggs" : {
"visits" : {
"date_histogram" : {
"field" : "datevisit",
"interval" : "month"
}
}
}
}
}
}
Have a look at this solution:
{
"query": {
"match_all": {}
},
"aggs": {
"periods": {
"filters": {
"filters": {
"1-2": {
"range": {
"datevisit": {
"gte": "2015-11-25",
"lt": "2015-11-26"
}
}
},
"2-3": {
"range": {
"datevisit": {
"gte": "2015-11-26",
"lt": "2015-11-27"
}
}
},
"3-": {
"range": {
"datevisit": {
"gte": "2015-11-27"
}
}
}
}
},
"aggs": {
"users": {
"terms": {"field": "user"}
}
}
}
}
}
Step by step:
Filters aggregation: defines named buckets for the sub-aggregation. In this case we define 3 periods, each based on a date range filter.
Nested users aggregation: runs once per filter bucket, so in this case you get 3 sets of user buckets, one per date range.
You'll get a result like this:
{
...
"aggregations" : {
"periods" : {
"buckets" : {
"1-2" : {
"users" : {
"buckets" : [
{"key" : XXX,"doc_count" : NNN},
{"key" : YYY,"doc_count" : NNN},
]
}
},
"2-3" : {
"users" : {
"buckets" : [
{"key" : XXX1,"doc_count" : NNN1},
{"key" : YYY1,"doc_count" : NNN1},
]
}
},
"3-" : {
"users" : {
"buckets" : [
{"key" : XXX2,"doc_count" : NNN2},
{"key" : YYY2,"doc_count" : NNN2},
]
}
},
}
}
}
}
Try it and let me know if it works.
