How to add paging in Elasticsearch's aggregation? - elasticsearch

I have an elasticsearch request as below:
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "poi_id"
      },
      "aggs": {
        "sum(price)": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}
I want to add paging to this request, just like:
select poi_id, sum(price) from table group by poi_id limit 0,2
I've searched a lot and found a related issue: https://github.com/elastic/elasticsearch/issues/4915.
But I still couldn't work out how to implement it.
Is there any way to implement this in Elasticsearch itself rather than in my application?

I am currently working through a solution for paging aggregation results. What you want to use is partitioning. This section of the official docs is very helpful:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
To adapt your example, the terms setting would be updated as follows.
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "poi_id",
        "include": {
          "partition": 0,
          "num_partitions": 100
        },
        "size": 10000
      },
      "aggs": {
        "sum(price)": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}
This will group your results into 100 partitions (num_partitions), with a maximum of 10k terms in each (size), and retrieve the first such partition (partition: 0). To fetch the next page, repeat the same request with "partition": 1, and so on.
If you have more than 10k unique values for the field you are aggregating on (and want to return all of them), you will want to increase the size value, or potentially compute size and num_partitions dynamically based on the cardinality of your field.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#search-aggregations-metrics-cardinality-aggregation
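If you go the dynamic route, here is a minimal sketch of that idea (assuming Python with the requests library, a cluster at http://localhost:9200 and an index called my_index; adapt the names to your setup):
import math
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "my_index"             # assumed index name
PER_PARTITION = 1000           # how many poi_id terms we want back per page

# 1) Ask Elasticsearch for the (approximate) number of distinct poi_id values.
card_body = {
    "size": 0,
    "aggs": {"poi_count": {"cardinality": {"field": "poi_id"}}}
}
card = requests.post(f"{ES}/{INDEX}/_search", json=card_body).json()
unique_pois = card["aggregations"]["poi_count"]["value"]

# 2) Derive num_partitions so that each partition holds roughly PER_PARTITION terms.
num_partitions = max(1, math.ceil(unique_pois / PER_PARTITION))

# 3) Page through the partitions, one request per "page".
for partition in range(num_partitions):
    body = {
        "size": 0,
        "aggs": {
            "group_by_state": {
                "terms": {
                    "field": "poi_id",
                    "include": {"partition": partition, "num_partitions": num_partitions},
                    "size": PER_PARTITION
                },
                "aggs": {"sum(price)": {"sum": {"field": "price"}}}
            }
        }
    }
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
    for bucket in resp["aggregations"]["group_by_state"]["buckets"]:
        print(bucket["key"], bucket["sum(price)"]["value"])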
You might also want to use the show_term_doc_count_error setting to make sure your aggregation is returning accurate counts. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_per_bucket_document_count_error
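For instance (a sketch under the same assumptions as above), the flag is just one extra key on the terms aggregation, and each bucket then reports its own worst-case error:
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "my_index"             # assumed index name

body = {
    "size": 0,
    "aggs": {
        "group_by_state": {
            "terms": {
                "field": "poi_id",
                "size": 10000,
                "show_term_doc_count_error": True   # adds a per-bucket error bound
            },
            "aggs": {"sum(price)": {"sum": {"field": "price"}}}
        }
    }
}
resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
for bucket in resp["aggregations"]["group_by_state"]["buckets"]:
    # doc_count_error_upper_bound of 0 means the doc_count is exact
    print(bucket["key"], bucket["doc_count"], bucket["doc_count_error_upper_bound"])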
Hope that's helpful.

Late to the party, but just discovered 'composite' aggregations in v6.3+. These allow:
1. A more 'Sql like' grouping
2. Pagination by use of the 'after_key'.
Saved our day, hope it will help others too.
Example: getting the number of hits per hour between 2 dates, grouped on 5 fields:
GET myindex-idx/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"docType": "myDOcType"}},
        {"range": {"#date": {"gte": "2019-06-19T21:00:00", "lt": "2019-06-19T22:00:00"}}}
      ]
    }
  },
  "size": 0,
  "aggs": {
    "mybuckets": {
      "composite": {
        "size": 100,
        "sources": [
          {"#date": {"date_histogram": {"field": "#date", "interval": "hour", "format": "date_hour"}}},
          {"field_1": {"terms": {"field": "field_1"}}},
          {"field_2": {"terms": {"field": "field_2"}}},
          {"field_3": {"terms": {"field": "field_3"}}},
          {"field_4": {"terms": {"field": "field_4"}}},
          {"field_5": {"terms": {"field": "field_5"}}}
        ]
      }
    }
  }
}
Produces:
{
  "took": 255,
  "timed_out": false,
  "_shards": {
    "total": 80,
    "successful": 80,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 46989,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "mybuckets": {
      "after_key": {
        "#date": "2019-06-19T21",
        "field_1": 262,
        "field_2": 347,
        "field_3": 945,
        "field_4": 2258,
        "field_5": 0
      },
      "buckets": [
        {
          "key": {
            "#date": "2019-06-19T21",
            "field_1": 56,
            "field_2": 106,
            "field_3": 13224,
            "field_4": 46239,
            "field_5": 0
          },
          "doc_count": 3
        },
        {
          "key": {
            "#date": "2019-06-19T21",
            "field_1": 56,
            "field_2": 106,
            "field_3": 32338,
            "field_4": 76919,
            "field_5": 0
          },
          "doc_count": 2
        },
        ....
The following paging query is then issued, using the returned 'after_key' object as the query's 'after' object:
GET myindex-idx/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"docType": "myDOcType"}},
        {"range": {"#date": {"gte": "2019-06-19T21:00:00", "lt": "2019-06-19T22:00:00"}}}
      ]
    }
  },
  "size": 0,
  "aggs": {
    "mybuckets": {
      "composite": {
        "size": 100,
        "sources": [
          {"#date": {"date_histogram": {"field": "#date", "interval": "hour", "format": "date_hour"}}},
          {"field_1": {"terms": {"field": "field_1"}}},
          {"field_2": {"terms": {"field": "field_2"}}},
          {"field_3": {"terms": {"field": "field_3"}}},
          {"field_4": {"terms": {"field": "field_4"}}},
          {"field_5": {"terms": {"field": "field_5"}}}
        ],
        "after": {
          "#date": "2019-06-19T21",
          "field_1": 262,
          "field_2": 347,
          "field_3": 945,
          "field_4": 2258,
          "field_5": 0
        }
      }
    }
  }
}
This pages through the results until mybuckets comes back empty.
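For anyone wanting to automate that, here is a minimal client-side loop (a sketch assuming Python with the requests library and a cluster at http://localhost:9200; only two of the five sources are shown to keep it short):
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "myindex-idx"

# Same composite aggregation as above; "after" gets filled in between pages.
body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"docType": "myDOcType"}},
                {"range": {"#date": {"gte": "2019-06-19T21:00:00", "lt": "2019-06-19T22:00:00"}}}
            ]
        }
    },
    "size": 0,
    "aggs": {
        "mybuckets": {
            "composite": {
                "size": 100,
                "sources": [
                    {"#date": {"date_histogram": {"field": "#date", "interval": "hour", "format": "date_hour"}}},
                    {"field_1": {"terms": {"field": "field_1"}}}
                ]
            }
        }
    }
}

all_buckets = []
while True:
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
    agg = resp["aggregations"]["mybuckets"]
    all_buckets.extend(agg["buckets"])
    # Stop when a page comes back empty or no after_key is returned.
    if not agg["buckets"] or "after_key" not in agg:
        break
    # Feed the last page's after_key back in as "after" for the next page.
    body["aggs"]["mybuckets"]["composite"]["after"] = agg["after_key"]

print(len(all_buckets), "buckets in total")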

You can use the parameters from and size in your request. See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html for more information. Your request would be something like this:
{
  "from": 0,
  "size": 10,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "poi_id"
      },
      "aggs": {
        "sum(price)": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

Related

Get buckets containing documents in ElasticSearch

I have a query like that:
https://pastebin.com/9YK6WxEJ
this gives me:
https://pastebin.com/ranpCnzG
Now, the buckets are fine but I want to get the documents' data grouped by bucket name, not just their count in doc_count. Is there any way to do that?
Maybe this works for you?
"aggs": {
"rating_ranges": {
"range": {
"field": "AggregateRating",
"keyed": true,
"ranges": [
{
"key": "bad",
"to": 3
},
{
"key": "average",
"from": 3,
"to": 4
},
{
"key": "good",
"from": 4
}
]
},
"aggs": {
"hits": {
"top_hits": {
"size": 100,
"sort": [
{
"AggregateRating": {
"order": "desc"
}
}
]
}
}
}
}
}

Multiple key aggregation in ElasticSearch

I am new to Elasticsearch and was exploring aggregation queries. The documents I have are in this format:
{"name":"A",
"class":"10th",
"subjects":{
"S1":92,
"S2":92,
"S3":92,
}
}
We have about 40k such documents in our ES, with the subjects varying from student to student. A query to the system might be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here; however, this generates a single bucket per document and, in our understanding, requires an explicit mention of every subject.
We want the system to generate a subject-wise aggregate of the data by executing a single aggregation query. The problem I face is that in our data the subjects can vary from student to student and we don't have a global list of subject keys.
We wrote the following script, but it only works if we know all possible subjects.
GET student_data_v1_1/_search
{ "query" :
{"match" :
{ "class" : "' + query + '" }},
"aggs" : { "my_buckets" : { "terms" :
{ "field" : "subjects", "size":10000 },
"aggregations": {"the_avg":
{"avg": { "field": "subjects.value" }}} }},
"size" : 0 }'
But this query only works for the following document structure, and does not work when multiple subjects are defined and we may not know the key pairs:
{"name":"A",
"class":"10th",
"subjects":{
"value":93
}
}
An alternate form in which the document may be present has the subjects as a list of dictionaries:
{"name":"A",
"class":"10th",
"subjects":[
{"S1":92},
{"S2":92},
{"S3":92},
]
}
Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======
After updating the document to hold weights for each subject -
{
  "class": "10th",
  "subject": [
    {
      "name": "s1",
      "marks": 90,
      "weight": 30
    },
    {
      "name": "s2",
      "marks": 80,
      "weight": 70
    }
  ]
}
I have updated the query to be -
{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "scores"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "weighted_grade": {
              "weighted_avg": {
                "value": { "field": "subjects.score" },
                "weight": { "field": "subjects.weight" }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
but it throws the error-
{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
To achieve the required result, I would suggest you keep your index mapping as follows:
{
  "properties": {
    "class": {
      "type": "keyword"
    },
    "subject": {
      "type": "nested",
      "properties": {
        "marks": {
          "type": "integer"
        },
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}
In the mapping above I have created subject as a nested type with two properties: name to hold the subject name and marks to hold the marks in that subject.
Sample doc:
{
  "class": "10th",
  "subject": [
    {
      "name": "s1",
      "marks": 90
    },
    {
      "name": "s2",
      "marks": 80
    }
  ]
}
Now you can use a nested aggregation and multilevel aggregations (i.e. an aggregation inside an aggregation). I used a nested aggregation with a terms aggregation on subject.name to get a bucket for each available subject. Then, to get the average for each subject, we add a child avg aggregation to the subjects aggregation as below:
{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subject"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "avg_score": {
              "avg": {
                "field": "subject.marks"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
NOTE: I have added "size": 0 so that Elasticsearch doesn't return the matching docs in the result. Whether to include or exclude it depends entirely on your use case.
Sample result:
{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "subjects": {
      "doc_count": 6,
      "subjects": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "s1",
            "doc_count": 3,
            "avg_score": {
              "value": 80
            }
          },
          {
            "key": "s2",
            "doc_count": 2,
            "avg_score": {
              "value": 75
            }
          },
          {
            "key": "s3",
            "doc_count": 1,
            "avg_score": {
              "value": 80
            }
          }
        ]
      }
    }
  }
}
As you can see, the result contains buckets whose key is the subject name and whose avg_score.value is the average of the marks.
UPDATE to include weighted_avg (note that the weighted_avg aggregation is only available from Elasticsearch 6.4 onwards, which is why older versions reject it with Unknown BaseAggregationBuilder [weighted_avg]):
{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subject"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "avg_score": {
              "avg": {
                "field": "subject.marks"
              }
            },
            "weighted_grade": {
              "weighted_avg": {
                "value": {
                  "field": "subject.marks"
                },
                "weight": {
                  "field": "subject.weight"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

How to use multiple Composite Aggregations in ElasticSearch?

I am trying to obtain two composite aggregations in ElasticSearch but the second one is always giving me an empty bucket.
GET /resolutions/_search
{
  "query": {
    "query_string": {
      "query": "*"
    }
  },
  "aggs": {
    "total": {
      "composite": {
        "sources": [
          {"doi": {"terms": {"field": "doi"}}},
          {"access_method": {"terms": {"field": "access_method"}}}
        ],
        "size": 10000
      }
    },
    "unique": {
      "composite": {
        "sources": [
          {"doi": {"terms": {"field": "doi"}}},
          {"access_method": {"terms": {"field": "access_method"}}},
          {"session": {"terms": {"field": "session"}}}
        ],
        "size": 10000
      }
    }
  },
  "size": 0,
  "track_total_hits": false
}
In the response, you can see the first aggregation (total) with thousands of objects in its buckets, but the second aggregation (unique) is always empty. I have tried swapping the order of the aggregations, and it's always the second one in order that is empty.
[Screenshot: response with the second aggregation empty]
The index mappings are at: https://github.com/datacite/shiba-inu/blob/2d632d341a22a8dca2afec3b01c3b34030144c9c/templates/aggregating_es.json
Why is it returning an empty bucket?
The "after_key" indicates that there are still results left. The search returned the first page. For further pagination you need to repeat the same request with "after" set to the value from the "after_key". Repeat this with every new after_key until the after_key is missing.
Example from the Elastic docs:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 2,
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
          { "product": { "terms": { "field": "product", "order": "asc" } } }
        ],
        "after": { "date": 1494288000000, "product": "mad max" }
      }
    }
  }
}

Aggregations and filters in Elastic - find the last hits and filter them afterwards

I'm working with Elastic (5.6) and trying to find a way to retrieve the top documents per category.
I have an index with the following kind of documents:
{
  "#timestamp": "2018-03-22T00:31:00.004+01:00",
  "statusInfo": {
    "status": "OFFLINE",
    "timestamp": 1521675034892
  },
  "name": "myServiceName",
  "id": "xxxx",
  "type": "Http",
  "key": "key1",
  "httpStatusCode": 200
}
What I'm trying to do with these is retrieve the last document (#timestamp-based) per name (my categories), see whether its statusInfo.status is OFFLINE or UP, and fetch these results into the hits part of a response so I can put them in a Kibana count dashboard or somewhere else (a REST-based tool I do not control and can't modify myself).
Basically, I want to know how many of my services (name) are OFFLINE (statusInfo.status) in their last update (#timestamp) for monitoring purposes.
I'm stuck at the "Get how many of my services" part.
My query so far:
GET actuator/_search
{
  "size": 0,
  "aggs": {
    "name_agg": {
      "terms": {
        "field": "name.raw",
        "size": 1000
      },
      "aggs": {
        "last_document": {
          "top_hits": {
            "_source": ["#timestamp", "name", "statusInfo.status"],
            "size": 1,
            "sort": [
              {
                "#timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  },
  "post_filter": {
    "bool": {
      "must_not": {
        "term": {
          "statusInfo.status.raw": "UP"
        }
      }
    }
  }
}
This provides the following response:
{
  "all_the_meta": {...},
  "hits": {
    "total": 1234,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "name_agg": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "myCategory1",
          "doc_count": 225,
          "last_document": {
            "hits": {
              "total": 225,
              "max_score": null,
              "hits": [
                {
                  "_index": "myIndex",
                  "_type": "Http",
                  "_id": "dummy id",
                  "_score": null,
                  "_source": {
                    "#timestamp": "2018-04-06T00:06:00.005+02:00",
                    "statusInfo": {
                      "status": "UP"
                    },
                    "name": "myCategory1"
                  },
                  "sort": [
                    1522965960005
                  ]
                }
              ]
            }
          }
        },
        {other_buckets...}
      ]
    }
  }
}
Removing the size makes the result contain ALL of the documents, which is not what I need; I only need each bucket's content (each one holds a single document).
Removing the post filter does not appear to do much.
I think this would be feasible in Oracle SQL with a PARTITION BY ... OVER clause, followed by a condition.
Does somebody know how this could be achieved?
If I understand you correctly, you are looking for the latest doc that has a status of OFFLINE in each group (grouped by name). In that case you can try the query below, and the number of items in the buckets should give you the "how many are down" count (for UP you would change the term in the filter).
NOTE: this is done on the latest version, so it uses the keyword field instead of raw.
POST /index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {"statusInfo.status.keyword": "OFFLINE"}
      }
    }
  },
  "aggs": {
    "services_agg": {
      "terms": {
        "field": "name.keyword"
      },
      "aggs": {
        "latest_doc": {
          "top_hits": {
            "sort": [
              {
                "#timestamp": {
                  "order": "desc"
                }
              }
            ],
            "size": 1,
            "_source": ["#timestamp", "name", "statusInfo.status"]
          }
        }
      }
    }
  }
}

Filtering aggregation issue on string type array values

We are indexing the receivers of each email; there may be a single receiver or multiple receivers.
Below are the properties:
FieldName: Subject, Type: String, Analyzer: Keyword
FieldName: Receivers, Type: String, Analyzer: Keyword
Data to index:
Subject:hello,Receivers:["A#abc.com","B#abc","C#abc.com"]
The problem arises when a filter aggregation is applied on top of a terms aggregation. If "A#abc.com" and "B#abc" are filtered, then logically it should only return "A#abc.com" and "B#abc" in the terms aggregation, but it returns all of "A#abc.com", "B#abc" and "C#abc.com".
Below is my query and output.
Input query
{
  "size": 0,
  "aggs": {
    "filter": {
      "filter": {
        "terms": {
          "receivers": [
            "A#abc.com",
            "B#abc"
          ]
        }
      },
      "aggs": {
        "result": {
          "terms": {
            "field": "receivers"
          }
        }
      }
    }
  }
}
Output
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 26464,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "filter": {
      "doc_count": 1,
      "result": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "A#abc.com",
            "doc_count": 1
          },
          {
            "key": "B#abc",
            "doc_count": 1
          },
          {
            "key": "C#abc.net",
            "doc_count": 1
          }
        ]
      }
    }
  }
}
We even tried to use include, but in some situations we may need to use regular expressions in include itself, like below, where we need "A#abc.com" and "B#abc" but also want to keep only ".*abc.com.*" from those two values, so the output should be "A#abc.com" only; however, it returns both "A#abc.com" and "B#abc".
{
  "size": 0,
  "aggs": {
    "filter": {
      "filter": {
        "terms": {
          "receiver": [
            "A#abc.com",
            "B#abc.com"
          ]
        }
      },
      "aggs": {
        "result": {
          "terms": {
            "field": "receiver",
            "include": [
              ".*abc.com.*",
              "A#abc.com",
              "B#abc.com"
            ]
          }
        }
      }
    }
  }
}
Please suggest how the above can be achieved.
Thanks in advance.
Your query should be a bit different: when using a regular expression, it shouldn't be inside an array, but standalone. And the dot (.) should be escaped, as it's a reserved character:
{
  "size": 0,
  "aggs": {
    "filter": {
      "filter": {
        "terms": {
          "receiver": [
            "A#abc.com",
            "B#abc.com"
          ]
        }
      },
      "aggs": {
        "result": {
          "terms": {
            "field": "receiver",
            "include": ".*abc\\.com.*"
          }
        }
      }
    }
  }
}
