How to do a filter aggregation in Elasticsearch

I need to do an average aggregation, but I want to filter out some values. In the example below, I want to filter out length=100, so the average of length should be computed over doc #1 and doc #2 only, while the average of width should use all documents. So I expect a length average of 9 and a width average of 5. What should I do?
Document example:
{ "id": 1, "length": 10, "width": 8 }
{ "id": 2, "length": 8, "width": 2 }
{ "id": 3, "length": 100, "width": 5 }
And in some other cases, length may not exist. What about this case?
{ "id": 1, "length": 10, "width": 8 }
{ "id": 2, "length": 8, "width": 2 }
{ "id": 3, "width": 5 }
termAggregation.subAggregation(AggregationBuilders.avg("length").field("length"))
        .subAggregation(AggregationBuilders.avg("width").field("width"));

To exclude 100 from the aggregation, you need a filter aggregation with an avg sub-aggregation inside it. Your aggregation query will look like the one below. Note that avg ignores documents that don't have the field, so in your second case (doc #3 has no length) a plain avg on length already returns 9 without any filter.
{
  "size": 0,
  "aggs": {
    "cal": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "match": {
                "length": "100"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_length": {
          "avg": {
            "field": "length"
          }
        }
      }
    },
    "avg_width": {
      "avg": {
        "field": "width"
      }
    }
  }
}
Java code
AvgAggregationBuilder widthAgg = new AvgAggregationBuilder("avg_width").field("width");
AvgAggregationBuilder lengthAgg = new AvgAggregationBuilder("avg_length").field("length");
FilterAggregationBuilder filter = new FilterAggregationBuilder("cal",
        QueryBuilders.boolQuery().mustNot(QueryBuilders.matchQuery("length", "100")));
filter.subAggregation(lengthAgg);
SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.aggregation(filter);
ssb.aggregation(widthAgg);
System.out.println(ssb.toString());
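To actually execute this and read the two averages back, something like the following should work with the high-level REST client (a sketch; the index name products and the client instance are assumptions, not from the original question):
SearchRequest request = new SearchRequest("products"); // hypothetical index name
request.source(ssb);
SearchResponse response = client.search(request, RequestOptions.DEFAULT);

// The sibling avg sits at the top level; the filtered avg sits inside the "cal" bucket.
Avg avgWidth = response.getAggregations().get("avg_width");
Filter cal = response.getAggregations().get("cal");
Avg avgLength = cal.getAggregations().get("avg_length");
System.out.println(avgWidth.getValue());  // 5.0
System.out.println(avgLength.getValue()); // 9.0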
Response
"aggregations": {
"avg_width": {
"value": 5
},
"cal": {
"meta": {},
"doc_count": 3,
"avg_length": {
"value": 9
}
}
}

Related

Interval search for messages in Elasticsearch

I need to split the found messages into intervals. Can this be done with Elasticsearch?
For example, there are 10 messages and you need to divide them into 3 intervals. It should look like this:
[0,1,2,3,4,5,6,7,8,9] => {[0,1,2], [3,4,5,6], [7,8,9]}.
I'm only interested in the beginning of each interval. For example: {[count: 3, min: 0], [count: 4, min: 3], [count: 3, min: 7]}
Example.
PUT /test_index
{
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      }
    }
  }
}
POST /test_index/_doc/0
{ "id": 0 }
POST /test_index/_doc/1
{ "id": 1 }
POST /test_index/_doc/2
{ "id": 2 }
POST /test_index/_doc/3
{ "id": 3 }
POST /test_index/_doc/4
{ "id": 4 }
POST /test_index/_doc/5
{ "id": 5 }
POST /test_index/_doc/6
{ "id": 6 }
POST /test_index/_doc/7
{ "id": 7 }
POST /test_index/_doc/8
{ "id": 8 }
POST /test_index/_doc/9
{ "id": 9 }
It is necessary to divide the values into 3 intervals with the same number of elements in each interval:
{
  ...
  "aggregations": {
    "result": {
      "buckets": [
        {
          "min": 0.0,
          "doc_count": 3
        },
        {
          "min": 3.0,
          "doc_count": 4
        },
        {
          "min": 7.0,
          "doc_count": 3
        }
      ]
    }
  }
}
There is a similar feature, the variable_width_histogram aggregation:
GET /test_index/_search?size=0
{
  "aggs": {
    "result": {
      "variable_width_histogram": {
        "field": "id",
        "buckets": 3
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
But "variable width histogram" separates documents by id value, not by the number of elements in the bucket
Assuming your mapping is like:
{
  "some_numeric_field": { "type": "integer" }
}
Then you can build histograms out of it with fixed interval sizes:
POST /my_index/_search?size=0
{
  "aggs": {
    "some_numeric_field": {
      "histogram": {
        "field": "some_numeric_field",
        "interval": 7
      }
    }
  }
}
Results:
{
  ...
  "aggregations": {
    "some_numeric_field": {
      "buckets": [
        {
          "key": 0.0,
          "doc_count": 7
        },
        {
          "key": 7.0,
          "doc_count": 7
        },
        {
          "key": 14.0,
          "doc_count": 7
        }
      ]
    }
  }
}
To get the individual values inside each bucket, just add a sub-aggregation, for example top_hits, or anything else like a terms aggregation; see the sketch below.
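As a concrete illustration of that last point, here is a hedged Java-client sketch of the same histogram with a top_hits sub-aggregation, so each bucket also returns (up to 7 of) its documents; all names here are illustrative:
// histogram with fixed interval 7, plus top_hits to expose each bucket's documents
HistogramAggregationBuilder histogram = AggregationBuilders
        .histogram("some_numeric_field")
        .field("some_numeric_field")
        .interval(7)
        .subAggregation(AggregationBuilders.topHits("docs").size(7)); // "docs" is an illustrative name

SearchSourceBuilder ssb = new SearchSourceBuilder().size(0).aggregation(histogram);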
Without knowing more about your data, I really cannot help further.

Nested array of objects aggregation in Elasticsearch

Documents are indexed in Elasticsearch as follows.
Document 1
{
  "task_completed": 10,
  "tagged_object": [
    {
      "category": "cat",
      "count": 10
    },
    {
      "category": "cars",
      "count": 20
    }
  ]
}
Document 2
{
  "task_completed": 50,
  "tagged_object": [
    {
      "category": "cars",
      "count": 100
    },
    {
      "category": "dog",
      "count": 5
    }
  ]
}
As you can see, the value of the category key is dynamic. I want to perform an aggregation similar to SQL's GROUP BY on category and return the sum of the count for each category.
In the above example, the aggregation should return
cat: 10,
cars: 120 and
dog: 5
I wanted to know how to write this aggregation query in Elasticsearch, if it is possible. Thanks in advance.
You can achieve the required result using nested, terms, and sum aggregations.
Adding a working example with index mapping, search query and search result
Index Mapping:
{
  "mappings": {
    "properties": {
      "tagged_object": {
        "type": "nested"
      }
    }
  }
}
Search Query:
{
  "size": 0,
  "aggs": {
    "resellers": {
      "nested": {
        "path": "tagged_object"
      },
      "aggs": {
        "books": {
          "terms": {
            "field": "tagged_object.category.keyword"
          },
          "aggs": {
            "sum_of_count": {
              "sum": {
                "field": "tagged_object.count"
              }
            }
          }
        }
      }
    }
  }
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 4,
"books": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cars",
"doc_count": 2,
"sum_of_count": {
"value": 120.0
}
},
{
"key": "cat",
"doc_count": 1,
"sum_of_count": {
"value": 10.0
}
},
{
"key": "dog",
"doc_count": 1,
"sum_of_count": {
"value": 5.0
}
}
]
}
}
}
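If you are building this from Java, the same nested -> terms -> sum tree can be sketched with the Java client (names match the JSON above; this is an equivalent sketch, not taken from the original answer):
// nested on tagged_object, terms on category, sum of count per category
NestedAggregationBuilder resellers = AggregationBuilders
        .nested("resellers", "tagged_object")
        .subAggregation(AggregationBuilders
                .terms("books").field("tagged_object.category.keyword")
                .subAggregation(AggregationBuilders
                        .sum("sum_of_count").field("tagged_object.count")));

SearchSourceBuilder ssb = new SearchSourceBuilder().size(0).aggregation(resellers);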

Stats Aggregation with Min Mode in ElasticSearch

I have the below mapping in Elasticsearch
{
  "properties": {
    "Costs": {
      "type": "nested",
      "properties": {
        "price": {
          "type": "integer"
        }
      }
    }
  }
}
So every document has an array field Costs, which contains many elements, and each element has a price. I want to find the min and max price, with the condition that from each array only the element with the minimum price is considered. So it is basically the min/max among the minimum values of each array.
Let's say I have 2 documents with the Costs field as
Costs: [
  {
    "price": 100
  },
  {
    "price": 200
  }
]
and
Costs: [
  {
    "price": 300
  },
  {
    "price": 400
  }
]
So I need to find the stats.
This is the query I am currently using:
{
  "costs_stats": {
    "nested": {
      "path": "Costs"
    },
    "aggs": {
      "price_stats_new": {
        "stats": {
          "field": "Costs.price"
        }
      }
    }
  }
}
And it gives me this:
"min" : 100,
"max" : 400
But I need to find the stats after taking only the minimum element of each array into consideration.
So this is what I need:
"min" : 100,
"max" : 300
Like the "mode" option in sort, is there something similar in the stats aggregation, or any other way of achieving this, maybe using a script? Please suggest; I am really stuck here.
Let me know if anything else is required.
Update 1:
Query for finding min/max among minimums
{
  "_source": false,
  "timeout": "5s",
  "from": 0,
  "size": 0,
  "aggs": {
    "price_1": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "price_2": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "filtered": {
              "aggs": {
                "price_3": {
                  "min": {
                    "field": "Costs.price"
                  }
                }
              },
              "filter": {
                "bool": {
                  "filter": {
                    "range": {
                      "Costs.price": {
                        "gte": 100
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "minValue": {
      "min_bucket": {
        "buckets_path": "price_1>price_2>filtered>price_3"
      }
    }
  }
}
Only a few buckets are returned, and hence the min/max is computed among those, which is not correct. Is there any size limit?
One way to achieve your use case is to add one more field, id, to each document. A terms aggregation can then be performed on the id field, so buckets will be built dynamically, one per unique value. Note that the terms aggregation returns only the top 10 buckets by default, which is why the queries below set "size": 15 explicitly.
Then we can apply a min aggregation, which returns the minimum value among the numeric values extracted from the aggregated documents.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "mappings": {
    "properties": {
      "Costs": {
        "type": "nested"
      }
    }
  }
}
Index Data:
{
  "id": 1,
  "Costs": [
    {
      "price": 100
    },
    {
      "price": 200
    }
  ]
}
{
  "id": 2,
  "Costs": [
    {
      "price": 300
    },
    {
      "price": 400
    }
  ]
}
Search Query:
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15    <-- note this
      },
      "aggs": {
        "nested_entries": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "min_position": {
              "min": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    }
  }
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
}
It can also be achieved using a stats aggregation (again assuming you add the id field that uniquely identifies each document):
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15    <-- note this
      },
      "aggs": {
        "costs_stats": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "price_stats_new": {
              "stats": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    }
  }
}
Update 1:
To find the maximum value among those minimums (as computed by the query above), you can use a max_bucket pipeline aggregation:
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": {
        "field": "id",
        "size": 15    <-- note this
      },
      "aggs": {
        "nested_entries": {
          "nested": {
            "path": "Costs"
          },
          "aggs": {
            "min_position": {
              "min": {
                "field": "Costs.price"
              }
            }
          }
        }
      }
    },
    "maxValue": {
      "max_bucket": {
        "buckets_path": "id_terms>nested_entries>min_position"
      }
    }
  }
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
},
"maxValue": {
"value": 300.0,
"keys": [
"2"
]
}
}
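For completeness, the terms -> nested -> min tree with the max_bucket pipeline can also be sketched with the Java client (an equivalent sketch under the same field names, not from the original answer):
// terms on id (size raised past the default of 10), per-document nested min,
// and a sibling max_bucket pipeline over the per-document minimums
TermsAggregationBuilder idTerms = AggregationBuilders
        .terms("id_terms").field("id").size(15)
        .subAggregation(AggregationBuilders.nested("nested_entries", "Costs")
                .subAggregation(AggregationBuilders.min("min_position").field("Costs.price")));

SearchSourceBuilder ssb = new SearchSourceBuilder()
        .size(0)
        .aggregation(idTerms)
        .aggregation(PipelineAggregatorBuilders.maxBucket("maxValue",
                "id_terms>nested_entries>min_position"));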

Elasticsearch aggregation buckets by number of records

I am new to Elasticsearch and I'm trying to create a request, without a lot of success so far. Here is the use case:
Let's imagine I have 4 documents, which have an amount field:
[
  {
    "id": 541436748332,
    "amount": 5,
    "date": "2017-01-01"
  },
  {
    "id": 6348643512,
    "amount": 2,
    "date": "2017-03-13"
  },
  {
    "id": 343687432,
    "amount": 2,
    "date": "2017-03-14"
  },
  {
    "id": 6457866181,
    "amount": 7,
    "date": "2017-05-21"
  }
]
And here is the kind of result I'd like to get:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 9
        }
      ]
    }
  }
}
As you can see, I want some kind of histogram, but instead of setting a date interval, I'd like to set a "document" interval. So here, that would be 2 documents per bucket, and the sum of the amount field of those two documents.
Does someone know if that is even possible? That would also imply sorting the records, by date for example, to get the desired results.
EDIT: Some more explanation of the use case:
The real use case is a line graph I'd like to plot, with the number of sales on the X axis and the total amount of those sales on the Y axis. And I don't want to plot thousands of dots on my graph; I want fewer dots, which is why I was hoping to work with buckets and sums.
The example response I gave is just the first step I want to achieve; the second step would be to add to each bucket the sum of the buckets before it (a running total):
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 16
        }
      ]
    }
  }
}
(7 = 5 + 2); (16 = 7 (from the previous bucket) + 2 + 7)
You can use histogram and sum aggregations, like this:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "field": "amount"
          }
        }
      }
    }
  }
}
(offset 1 is required if you want the first bucket to start at 1 instead of at 0.) Then you'll get a response like this:
{
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "key": 1,
          "doc_count": 2,
          "total_amount": {
            "value": 7
          }
        },
        {
          "key": 3,
          "doc_count": 2,
          "total_amount": {
            "value": 9
          }
        }
      ]
    }
  }
}
Sorting is not required, because the default order is the order you want. However, there's also an order parameter in case you want a different ordering of the buckets.
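For the running-total second step from the edit, a cumulative_sum pipeline aggregation inside the histogram should work; a Java-client sketch (not part of the original answer, and like the query above it assumes the ids are sequential):
// two docs per bucket via histogram on id, per-bucket sum of amount,
// and a cumulative_sum pipeline turning those sums into running totals
HistogramAggregationBuilder prices = AggregationBuilders
        .histogram("prices").field("id").interval(2).offset(1)
        .subAggregation(AggregationBuilders.sum("total_amount").field("amount"))
        .subAggregation(PipelineAggregatorBuilders
                .cumulativeSum("running_total", "total_amount"));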

Elasticsearch: Using the results of a Metric Aggregation to filter the elements of a bucket and run additional aggregations

Given a dataset like
[{
  "type": "A",
  "value": 32
}, {
  "type": "A",
  "value": 34
}, {
  "type": "B",
  "value": 35
}]
I would like to perform the following aggregation:
Firstly, I would like to group by "type" into buckets using the terms aggregation.
After that, I would like to calculate some metrics of the "value" field using extended_stats.
Knowing the std_deviation_bounds (upper and lower), I would like to calculate the average value of the elements of each bucket, excluding those outside the range [std_deviation_bounds.lower, std_deviation_bounds.upper].
The first and second points of my list are trivial. I would like to know if the third point, using the results of a sibling metric aggregation to filter out elements of the bucket and recalculate an average, is possible. And, if it is, I would like a hint about the aggregation structure I would need to use.
The version of the Elasticsearch instance is 5.0.0
Well, OP here.
I still don't know whether Elasticsearch allows formulating an aggregation as I described in the original question.
What I did to solve this problem was take a different approach. I will post it here in case it is helpful to anyone else.
So,
POST hostname:9200/index/type/_search
with
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "type"
      },
      "aggs": {
        "histogramAgg": {
          "histogram": {
            "field": "value",
            "interval": 10,
            "offset": 0,
            "order": {
              "_key": "asc"
            },
            "keyed": true,
            "min_doc_count": 0
          },
          "aggs": {
            "statsAgg": {
              "stats": {
                "field": "value"
              }
            }
          }
        },
        "extStatsAgg": {
          "extended_stats": {
            "field": "value",
            "sigma": 2
          }
        }
      }
    }
  }
}
will generate a result like this
{
  "took": 100,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 100000,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "group": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [{
        "key": "A",
        "doc_count": 10000,
        "histogramAgg": {
          "buckets": {
            "0.0": {
              "key": 0.0,
              "doc_count": 1234,
              "statsAgg": {
                "count": 1234,
                "min": 0.0,
                "max": 9.0,
                "avg": 0.004974220783280196,
                "sum": 7559.0
              }
            },
            "10.0": {
              "key": 10.0,
              "doc_count": 4567,
              "statsAgg": {
                "count": 4567,
                "min": 10.0,
                "max": 19.0,
                "avg": 15.544345993923,
                "sum": 331846.0
              }
            },
            [...]
          }
        },
        "extStatsAgg": {
          "count": 10000,
          "min": 0.0,
          "max": 104.0,
          "avg": 16.855123857,
          "sum": 399079395E10,
          "sum_of_squares": 3.734838645273888E15,
          "variance": 1.2690056384124432E9,
          "std_deviation": 35.10540102369,
          "std_deviation_bounds": {
            "upper": 87.06592590438,
            "lower": -54.35567819038
          }
        }
      },
      [...]
      ]
    }
  }
}
If you look at the results of the group aggregation for type "A", you will notice that we now have the average and the count of every sub-group of the histogram.
You will also notice that the results of the extStatsAgg aggregation (sibling of the histogram aggregation) show the std_deviation_bounds for every bucket group (for type "A", type "B", ...).
As you may have noticed, this doesn't directly give the solution I was looking for.
I needed to do a few calculations in my code. An example, sketched in Java against the 5.x Java API (the aggregation types live under org.elasticsearch.search.aggregations; searchResponse is the response to the query above):
Terms groups = searchResponse.getAggregations().get("group");
for (Terms.Bucket bucket : groups.getBuckets()) {
    long totalCount = 0;
    double accumWeightedAverage = 0.0;
    ExtendedStats extendedStats = bucket.getAggregations().get("extStatsAgg");
    double upperLimit = extendedStats.getStdDeviationBound(ExtendedStats.Bounds.UPPER);
    double lowerLimit = extendedStats.getStdDeviationBound(ExtendedStats.Bounds.LOWER);
    Histogram histogram = bucket.getAggregations().get("histogramAgg");
    for (Histogram.Bucket subBucket : histogram.getBuckets()) {
        Stats stats = subBucket.getAggregations().get("statsAgg");
        double key = ((Number) subBucket.getKey()).doubleValue();
        // keep only histogram buckets whose key lies inside the std-deviation bounds
        if (key > lowerLimit && key < upperLimit) {
            totalCount += subBucket.getDocCount();
            accumWeightedAverage += subBucket.getDocCount() * stats.getAvg();
        }
    }
    // weighted average of the surviving buckets = outlier-free average for this type
    double average = accumWeightedAverage / totalCount;
}
Notes:
The size of the histogram interval determines the "accuracy" of the final average: a finer interval gives more accurate results while increasing the aggregation time.
I hope it helps someone else.