Elasticsearch: aggregation buckets by number of records

I am new to Elasticsearch and I'm trying to build a query, without much success so far. Here is the use case:
Let's imagine I have 4 documents, which have an amount field:
[
  {
    "id": 541436748332,
    "amount": 5,
    "date": "2017-01-01"
  },
  {
    "id": 6348643512,
    "amount": 2,
    "date": "2017-03-13"
  },
  {
    "id": 343687432,
    "amount": 2,
    "date": "2017-03-14"
  },
  {
    "id": 6457866181,
    "amount": 7,
    "date": "2017-05-21"
  }
]
And here is the kind of result I'd like to get:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 9
        }
      ]
    }
  }
}
As you can see, I want some kind of histogram, but instead of a date interval, I'd like to set a "document" interval. So here, that would be 2 documents per bucket, and the sum of the amount field of those two documents.
Does someone know if that is even possible? It would also imply sorting the records, by date for example, to get the desired result.
EDIT: Some more explanation of the use case:
The real use case is a line graph I'd like to plot, with the number of sales on the X axis and the total amount of those sales on the Y axis. I don't want to plot thousands of dots on the graph; I want fewer dots, which is why I was hoping to work with buckets and sums.
The example response above is just the first step I want to achieve; the second step would be to add to each bucket the sum of the bucket before it:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 16
        }
      ]
    }
  }
}
(7 = 5 + 2); (16 = 7 (from the previous bucket) + 2 + 7)

You can use histogram and sum aggregations, like this:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "field": "amount"
          }
        }
      }
    }
  }
}
(The offset of 1 is required if you want the first bucket to start at 1 instead of at 0. Note that this assumes the id field holds the document's sequential position, 1 through 4, in date order; with arbitrary ids like those in your sample you would first need to index such a sequence number.) Then you'll get a response like this:
{
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "key": 1,
          "doc_count": 2,
          "total_amount": {
            "value": 7
          }
        },
        {
          "key": 3,
          "doc_count": 2,
          "total_amount": {
            "value": 9
          }
        }
      ]
    }
  }
}
Sorting is not required, because the default order is the order you want. However, there's also an order parameter in case you want a different ordering of the buckets.
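For the second step in your edit (the running total), you can nest a cumulative_sum pipeline aggregation in the same histogram. A minimal sketch (the running_total name is illustrative):
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "field": "amount"
          }
        },
        "running_total": {
          "cumulative_sum": {
            "buckets_path": "total_amount"
          }
        }
      }
    }
  }
}
Each bucket then also carries a running_total value (7, then 16), matching your second expected response.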

Related

Calculate exact count of distinct values for combination of 2 fields in Elasticsearch

I have around 40 million records in my Elasticsearch index. I want to calculate the count of distinct values for a combination of 2 fields.
Example for given set of documents:
[
  {
    "JobId": 2,
    "DesigId": 12
  },
  {
    "JobId": 2,
    "DesigId": 4
  },
  {
    "JobId": 3,
    "DesigId": 5
  },
  {
    "JobId": 2,
    "DesigId": 4
  },
  {
    "JobId": 3,
    "DesigId": 5
  }
]
For the above example, I should get count = 3, as only 3 distinct combinations exist:
[(2,12),(2,4),(3,5)]
I tried using a cardinality aggregation for this, but that provides an approximate count. I want to calculate the exact count accurately.
Below is the query I used with the cardinality aggregation:
"aggs": {
"counts": {
"cardinality": {
"script": "doc['JobId'].value + ',' + doc['DesigId'].value",
"precision_threshold": 40000
}
}
}
I also tried a composite aggregation on the combination of the 2 fields, paging with the after key and counting the overall number of buckets, but that process takes a long time and my query times out.
Is there any optimal way to achieve it?
Scripting should be avoided, as it affects performance. For your use case, there are 3 ways you can achieve the required results:
1. Using a composite aggregation (which you have already tried); a paging sketch is shown right after this list.
2. Using a multi_terms aggregation (available since Elasticsearch 7.12), though this is not a memory-efficient solution.
3. Storing a combined field at index time using an ingest pipeline (described further below), which is the most efficient option.
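For the composite approach, the exact count comes from paging through all buckets and summing their number on the client side. A minimal sketch (the aggregation name and page size are illustrative):
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "composite": {
        "size": 10000,
        "sources": [
          { "JobId": { "terms": { "field": "JobId" } } },
          { "DesigId": { "terms": { "field": "DesigId" } } }
        ]
      }
    }
  }
}
Each response returns an after_key; pass it back as "after" in the next request and repeat until a page comes back with no buckets. A larger "size" means fewer round trips.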
Multi-terms search query:
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "multi_terms": {
        "terms": [
          {
            "field": "JobId"
          },
          {
            "field": "DesigId"
          }
        ]
      }
    }
  }
}
Search Result:
"aggregations": {
"jobId_and_DesigId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": [
2,
4
],
"key_as_string": "2|4",
"doc_count": 2
},
{
"key": [
3,
5
],
"key_as_string": "3|5",
"doc_count": 2
},
{
"key": [
2,
12
],
"key_as_string": "2|12",
"doc_count": 1
}
]
}
}
The best method is to store the combined field value (i.e., the combination of JobId and DesigId) at index time itself. This is possible by using a set processor in an ingest pipeline:
PUT /_ingest/pipeline/concat
{
  "processors": [
    {
      "set": {
        "field": "combined_field",
        "value": "{{JobId}} {{DesigId}}"
      }
    }
  ]
}
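You can sanity-check the pipeline before indexing through it with the simulate API (the sample document here is illustrative):
POST /_ingest/pipeline/concat/_simulate
{
  "docs": [
    {
      "_source": {
        "JobId": 2,
        "DesigId": 12
      }
    }
  ]
}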
Index API
When indexing the documents, you need to add the pipeline=concat query parameter each time you index a document. An index request will look like this (my_index is a placeholder for your index name):
POST my_index/_doc/1?pipeline=concat
{
  "JobId": 2,
  "DesigId": 12
}
Search Query:
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "terms": {
        "field": "combined_field.keyword"
      }
    }
  }
}
Search Result:
"aggregations": {
"jobId_and_DesigId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "2 4",
"doc_count": 2
},
{
"key": "3 5",
"doc_count": 2
},
{
"key": "2 12",
"doc_count": 1
}
]
}
}
A cardinality aggregation only gives an approximate count. Since the number of distinct combinations can exceed 40,000, which is the maximum precision_threshold, raising the threshold will not guarantee an exact result either.
You can use a scripted metric aggregation instead. It will give an accurate count, but it will be considerably slower than the cardinality aggregation.
{
  "aggs": {
    "Distinct_Count": {
      "scripted_metric": {
        "init_script": "state.list = []",
        "map_script": """
          state.list.add(doc['JobId'].value + '-' + doc['DesigId'].value);
        """,
        "combine_script": "return state.list;",
        "reduce_script": """
          Map uniqueValueMap = new HashMap();
          int count = 0;
          for (shardList in states) {
            if (shardList != null) {
              for (key in shardList) {
                if (!uniqueValueMap.containsKey(key)) {
                  count += 1;
                  uniqueValueMap.put(key, key);
                }
              }
            }
          }
          return count;
        """
      }
    }
  }
}

How to do a filter aggregation in Elasticsearch

I need to do an average aggregation, but I want to filter out some values. In the examples below, I want to filter out length=100, i.e. compute the average of length over docs #1 and #2 only, but the average of width over all documents. So I expect a length average of 9 and a width average of 5. What should I do?
document example:
["id": 1, "length": 10, "width":8]
["id": 2, "length": 8, "width":2]
["id": 3, "length": 100, "width":5]
And in some other cases, length may not exist. What about this case?
["id": 1, "length": 10, "width":8]
["id": 2, "length": 8, "width":2]
["id": 3, "width":5]
termAggregation.subAggregation(AggregationBuilders.avg("length").field("length"))
        .subAggregation(AggregationBuilders.avg("width").field("width"));
To exclude 100 from the aggregation, your query will look like the one below. You need to use a filter aggregation, and inside it avg as a sub-aggregation.
{
  "size": 0,
  "aggs": {
    "cal": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "match": {
                "length": "100"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_length": {
          "avg": {
            "field": "length"
          }
        }
      }
    },
    "avg_width": {
      "avg": {
        "field": "width"
      }
    }
  }
}
Java code
AvgAggregationBuilder widthAgg = new AvgAggregationBuilder("avg_width").field("width");
AvgAggregationBuilder lengthAgg = new AvgAggregationBuilder("avg_length").field("length");
FilterAggregationBuilder filter = new FilterAggregationBuilder("cal",
        QueryBuilders.boolQuery().mustNot(QueryBuilders.matchQuery("length", "100")));
filter.subAggregation(lengthAgg);
SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.aggregation(filter);
ssb.aggregation(widthAgg);
System.out.println(ssb.toString());
Response
"aggregations": {
"avg_width": {
"value": 5
},
"cal": {
"meta": {},
"doc_count": 3,
"avg_length": {
"value": 9
}
}
}
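As for the second case, where length may not exist: by default the avg aggregation simply ignores documents that lack the field, so doc #3 in your second example would not affect avg_length at all, and the averages come out the same. If you would rather treat absent values as some default, avg supports a missing parameter; a sketch (the default of 0 is illustrative):
"avg_length": {
  "avg": {
    "field": "length",
    "missing": 0
  }
}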

Interval search for messages in Elasticsearch

I need to split the found messages into intervals. Can this be done with Elasticsearch?
For example: there are 10 messages, and you need to divide them into 3 intervals. It should look like this:
[0,1,2,3,4,5,6,7,8,9] => {[0,1,2], [3,4,5,6], [7,8,9]}.
I'm only interested in the start of each interval. For example: {[count: 3, min: 0], [count: 4, min: 3], [count: 3, min: 7]}
Example.
PUT /test_index
{
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      }
    }
  }
}
POST /test_index/_doc/0
{
  "id": 0
}
POST /test_index/_doc/1
{
  "id": 1
}
POST /test_index/_doc/2
{
  "id": 2
}
POST /test_index/_doc/3
{
  "id": 3
}
POST /test_index/_doc/4
{
  "id": 4
}
POST /test_index/_doc/5
{
  "id": 5
}
POST /test_index/_doc/6
{
  "id": 6
}
POST /test_index/_doc/7
{
  "id": 7
}
POST /test_index/_doc/8
{
  "id": 8
}
POST /test_index/_doc/9
{
  "id": 9
}
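Equivalently, the test data can be indexed with a single _bulk request instead of ten separate calls:
POST /test_index/_bulk
{ "index": { "_id": "0" } }
{ "id": 0 }
{ "index": { "_id": "1" } }
{ "id": 1 }
{ "index": { "_id": "2" } }
{ "id": 2 }
{ "index": { "_id": "3" } }
{ "id": 3 }
{ "index": { "_id": "4" } }
{ "id": 4 }
{ "index": { "_id": "5" } }
{ "id": 5 }
{ "index": { "_id": "6" } }
{ "id": 6 }
{ "index": { "_id": "7" } }
{ "id": 7 }
{ "index": { "_id": "8" } }
{ "id": 8 }
{ "index": { "_id": "9" } }
{ "id": 9 }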
It is necessary to divide the values into 3 intervals with the same number of elements in each:
{
  ...
  "aggregations": {
    "result": {
      "buckets": [
        {
          "min": 0.0,
          "doc_count": 3
        },
        {
          "min": 3.0,
          "doc_count": 4
        },
        {
          "min": 7.0,
          "doc_count": 3
        }
      ]
    }
  }
}
There is a similar function: "variable width histogram":
GET /test_index/_search?size=0
{
  "aggs": {
    "result": {
      "variable_width_histogram": {
        "field": "id",
        "buckets": 3
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
But "variable width histogram" separates documents by id value, not by the number of elements in the bucket
Assuming your mapping is like:
{
  "some_numeric_field": { "type": "integer" }
}
Then you can build histograms out of it with fixed interval sizes:
POST /my_index/_search?size=0
{
  "aggs": {
    "some_numeric_field": {
      "histogram": {
        "field": "some_numeric_field",
        "interval": 7
      }
    }
  }
}
Results:
{
  ...
  "aggregations": {
    "some_numeric_field": {
      "buckets": [
        {
          "key": 0.0,
          "doc_count": 7
        },
        {
          "key": 7.0,
          "doc_count": 7
        },
        {
          "key": 14.0,
          "doc_count": 7
        }
      ]
    }
  }
}
To get the individual values inside each bucket, just add a sub-aggregation, maybe top_hits or anything else like a terms aggregation.
Without knowing more about your data, I really cannot help further.
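One workaround for equal-count buckets, if two round trips are acceptable: first fetch the boundary values with a percentiles aggregation, then feed them into a range aggregation. A sketch against the test_index above (the percents and the hard-coded boundaries are illustrative):
GET /test_index/_search?size=0
{
  "aggs": {
    "boundaries": {
      "percentiles": {
        "field": "id",
        "percents": [33.3, 66.7]
      }
    }
  }
}
Then plug the two returned values into a second request:
GET /test_index/_search?size=0
{
  "aggs": {
    "result": {
      "range": {
        "field": "id",
        "ranges": [
          { "to": 3 },
          { "from": 3, "to": 7 },
          { "from": 7 }
        ]
      }
    }
  }
}
Here 3 and 7 stand in for the returned percentile values; since "from" is inclusive and "to" is exclusive, the 0-9 data yields buckets of 3, 4 and 3 documents, matching the desired result.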

Limit the max records to be searched in an Elasticsearch group-by query

We have a strange issue where the data for one of our customers has a lot of records for a certain field x. When the user triggers a group-by query on that x field, the Elasticsearch cluster goes for a toss and restarts with an OOM.
Is there a way to limit the max records that Elasticsearch looks at while aggregating on a certain field, so that the cluster can be saved from going OOM?
PS: The group-by can be on multiple fields, such as x, y, z, and w, and the user is searching the last 30 days of data only.
Use a sampler aggregation wrapped around your terms aggregation if you wish to restrict the number of documents taken into account for the aggregation. Note that the sampler's shard_size limit applies per shard, so the total number of sampled documents can be up to shard_size times the number of shards.
Index Data:
{
  "role": "example",
  "number": 1
}
{
  "role": "example1",
  "number": 2
}
{
  "role": "example2",
  "number": 3
}
Search Query:
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 2 // max documents (per shard) to feed into the aggregation
      },
      "aggs": {
        "unique_roles": {
          "terms": {
            "field": "role.keyword"
          }
        }
      }
    }
  }
}
Search Result:
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"sample": {
"doc_count": 2, // Note this
"unique_roles": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "example",
"doc_count": 1
},
{
"key": "example1",
"doc_count": 1
}
]
}
}
}

Elasticsearch aggregation function

Is it possible to define an aggregation function in Elasticsearch?
E.g. for data:
author  weekday  status
me      monday   ok
me      tuesday  ok
me      monday   bad
I want to get an aggregation based on author and weekday, and as the value I want the concatenation of the status field:
agg1  agg2     value
me    monday   ok,bad
me    tuesday  ok
I know you can do count, but is it possible to define another function to use for aggregation?
EDIT/ANSWER: It looks like there is no multi-row aggregation support in ES, so we had to use sub-aggregations on the last field (see Akshay's example). If you need a more complex aggregation function, aggregate by id (note: you won't be able to use _id, so you'll have to duplicate it in another field); that way you'll be able to do advanced aggregation on the individual items in each bucket.
You can get roughly what you want by using sub-aggregations, available since 1.0. Assuming the documents are structured as author, weekday and status, you could use the aggregation below:
{
  "size": 0,
  "aggs": {
    "author": {
      "terms": {
        "field": "author"
      },
      "aggs": {
        "days": {
          "terms": {
            "field": "weekday"
          },
          "aggs": {
            "status": {
              "terms": {
                "field": "status"
              }
            }
          }
        }
      }
    }
  }
}
Which gives you the following result:
{
  ...
  "aggregations": {
    "author": {
      "buckets": [
        {
          "key": "me",
          "doc_count": 3,
          "days": {
            "buckets": [
              {
                "key": "monday",
                "doc_count": 2,
                "status": {
                  "buckets": [
                    {
                      "key": "bad",
                      "doc_count": 1
                    },
                    {
                      "key": "ok",
                      "doc_count": 1
                    }
                  ]
                }
              },
              {
                "key": "tuesday",
                "doc_count": 1,
                "status": {
                  "buckets": [
                    {
                      "key": "ok",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
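If you need the statuses joined into one string on the server side rather than as nested buckets, a scripted_metric sub-aggregation can play the role of the custom aggregation function on later Elasticsearch versions. A rough sketch, assuming author, weekday and status are keyword fields and statuses_concat is an illustrative name:
{
  "size": 0,
  "aggs": {
    "author": {
      "terms": { "field": "author" },
      "aggs": {
        "days": {
          "terms": { "field": "weekday" },
          "aggs": {
            "statuses_concat": {
              "scripted_metric": {
                "init_script": "state.statuses = []",
                "map_script": "state.statuses.add(doc['status'].value)",
                "combine_script": "return state.statuses",
                "reduce_script": "List all = new ArrayList(); for (s in states) { all.addAll(s) } return String.join(',', all)"
              }
            }
          }
        }
      }
    }
  }
}
Each days bucket then carries a statuses_concat value such as "ok,bad". Note that scripted metrics run a script per matching document, so this is noticeably slower than plain terms buckets.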
