How can I know if two different aggregations aggregated the same docs? - elasticsearch

Suppose I have two aggs:
GET .../_search
{
"size": 0,
"aggs": {
"foo": {
"terms": {
"field": "foo"
}
},
"bar": {
"terms": {
"field": "bar"
}
}
}
}
Which returns the following:
...
"aggregations": {
"foo": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Africa",
"doc_count": 23
}
]
},
"bar": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Oil",
"doc_count": 23
}
]
}
}
My question is, how can I know if both "foo" and "bar" aggs are aggregating the same 23 docs?
I tried adding a sub agg to both "foo" and "bar" aggs to sum an arbitrary numeric field, but that's not remotely foolproof.

You can add a sub-aggregation on the documents' identity field to each bucket, using either a terms or a composite aggregation. With terms you need to provide a size large enough to return every ID. See this example:
GET .../_search
{
"size": 0,
"aggs": {
"foo": {
"terms": {
"field": "foo"
},
"aggs" : {
"ids" : {
"terms" : {
"field" : "your_id_here",
"size" : 1000
}
}
}
},
"bar": {
"terms": {
"field": "bar"
},
"aggs" : {
"ids" : {
"terms" : {
"field" : "your_id_here",
"size" : 1000
}
}
}
}
}
}
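You then need to compare the nested buckets of the two aggregations: if both contain the same set of IDs, they aggregated the same documents.
Another approach is to restrict the search query itself so that only the desired documents are aggregated.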
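The comparison step can be done client-side once the response is back. Below is a minimal Python sketch, assuming the ID sub-aggregation is named "ids" (a hypothetical name, not something from the question) and that its size was large enough to return every ID. The helper names are illustrative as well.

```python
def ids_per_bucket(agg, sub_agg_name="ids"):
    """Map each outer bucket key to the set of document ids in its sub-aggregation."""
    return {
        bucket["key"]: {inner["key"] for inner in bucket[sub_agg_name]["buckets"]}
        for bucket in agg["buckets"]
    }


def same_docs(aggregations, left, left_key, right, right_key, sub_agg_name="ids"):
    """Return True when the two named buckets contain exactly the same ids."""
    left_ids = ids_per_bucket(aggregations[left], sub_agg_name)[left_key]
    right_ids = ids_per_bucket(aggregations[right], sub_agg_name)[right_key]
    return left_ids == right_ids


# Hand-written sample shaped like a terms response with an "ids" sub-aggregation.
sample = {
    "foo": {"buckets": [{"key": "Africa", "doc_count": 2,
                         "ids": {"buckets": [{"key": "1", "doc_count": 1},
                                             {"key": "2", "doc_count": 1}]}}]},
    "bar": {"buckets": [{"key": "Oil", "doc_count": 2,
                         "ids": {"buckets": [{"key": "2", "doc_count": 1},
                                             {"key": "1", "doc_count": 1}]}}]},
}
print(same_docs(sample, "foo", "Africa", "bar", "Oil"))  # True
```

The result is only trustworthy when the sub-aggregation returned all IDs, so it is worth checking that its sum_other_doc_count is 0 before comparing.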

Related

Filtering aggregation results

This question is a subquestion of this question. Posting as a separate question for attention.
Sample Docs:
{
"id":1,
"product":"p1",
"cat_ids":[1,2,3]
}
{
"id":2,
"product":"p2",
"cat_ids":[3,4,5]
}
{
"id":3,
"product":"p3",
"cat_ids":[4,5,6]
}
Ask: get the products belonging to a particular category, e.g. cat_id = 3
Query:
GET product/_search
{
"size": 0,
"aggs": {
"cats": {
"terms": {
"field": "cats",
"size": 10
},"aggs": {
"products": {
"terms": {
"field": "name.keyword",
"size": 10
}
}
}
}
}
}
Question:
How can I filter the aggregated result for cat_id = 3 here? I tried bucket_selector as well, but it is not working.
Note: Because cat_ids is multi-valued, filtering first and then aggregating doesn't work.
You can filter the values for which buckets are created. As the terms aggregation documentation puts it:
It is possible to filter the values for which buckets will be created. This can be done using the include and exclude parameters which are based on regular expression strings or arrays of exact values. Additionally, include clauses can filter using partition expressions.
Here is a working example with index data, search query, and search result.
Index Data:
{
"id":1,
"product":"p1",
"cat_ids":[1,2,3]
}
{
"id":2,
"product":"p2",
"cat_ids":[3,4,5]
}
{
"id":3,
"product":"p3",
"cat_ids":[4,5,6]
}
Search Query (note the include parameter on the cats terms aggregation):
{
"size": 0,
"aggs": {
"cats": {
"terms": {
"field": "cat_ids",
"include": [
3
]
},
"aggs": {
"products": {
"terms": {
"field": "product.keyword",
"size": 10
}
}
}
}
}
}
Search Result:
"aggregations": {
"cats": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"products": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "p1",
"doc_count": 1
},
{
"key": "p2",
"doc_count": 1
}
]
}
}
]
}
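For use from application code, the same request and the extraction of products from the response could look roughly like the following Python sketch. The function names are hypothetical, and the field names cat_ids and product.keyword are taken from the example above. It only builds and reads plain dictionaries, so sending the request is left to whatever client you use.

```python
def category_products_request(cat_id, size=10):
    """Build a terms request restricted to one category via `include`."""
    return {
        "size": 0,
        "aggs": {
            "cats": {
                "terms": {"field": "cat_ids", "include": [cat_id]},
                "aggs": {
                    "products": {
                        "terms": {"field": "product.keyword", "size": size}
                    }
                },
            }
        },
    }


def products_by_category(aggregations):
    """Map each returned category key to the product keys found in it."""
    return {
        bucket["key"]: [p["key"] for p in bucket["products"]["buckets"]]
        for bucket in aggregations["cats"]["buckets"]
    }


# Response fragment shaped like the search result above (trimmed).
aggregations = {
    "cats": {"buckets": [{"key": 3, "doc_count": 2,
                          "products": {"buckets": [{"key": "p1", "doc_count": 1},
                                                   {"key": "p2", "doc_count": 1}]}}]}
}
print(products_by_category(aggregations))  # {3: ['p1', 'p2']}
```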

Return just buckets size of aggregation query - Elasticsearch

I'm using an aggregation query on Elasticsearch 2.1. Here is my query:
"aggs": {
"atendimentos": {
"terms": {
"field": "_parent",
"size" : 0
}
}
}
The response looks like this:
"aggregations": {
"atendimentos": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1a92d5c0-d542-4f69-aeb0-42a467f6a703",
"doc_count": 12
},
{
"key": "4e30bf6d-730d-4217-a6ef-e7b2450a012f",
"doc_count": 12
}.......
It returns 40000 buckets. I don't need the buckets themselves; I just want the number of buckets, something like this:
buckets_size: 40000
How can I return just the bucket count?
Thank you all.
Try this query:
POST index/_search
{
"size": 0,
"aggs": {
"atendimentos": {
"terms": {
"field": "_parent"
}
},
"count":{
"cardinality": {
"field": "_parent"
}
}
}
}
It may return something like this:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "aa",
"doc_count": 1
},
{
"key": "bb",
"doc_count": 1
}
]
},
"count": {
"value": 2
}
}
EDIT: More info here - https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-cardinality-aggregation.html
{
"aggs" : {
"type_count" : {
"cardinality" : {
"field" : "type"
}
}
}
}
Read more about Cardinality Aggregation
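If only the count is needed, the terms aggregation can be left out of the request entirely. A small Python sketch of that idea follows. The helper names and the default aggregation name "count" are assumptions for illustration. Keep in mind that cardinality returns an approximate count, so for very large numbers of distinct values it may differ slightly from the true bucket count.

```python
def bucket_count_request(field, agg_name="count"):
    """Build a request that returns only the distinct-value count of `field`."""
    return {"size": 0, "aggs": {agg_name: {"cardinality": {"field": field}}}}


def read_bucket_count(response, agg_name="count"):
    """Extract the (approximate) number of distinct values from a response."""
    return response["aggregations"][agg_name]["value"]


# Hand-written response fragment shaped like the result above.
response = {"aggregations": {"count": {"value": 2}}}
print(bucket_count_request("_parent"))
print(read_bucket_count(response))  # 2
```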

Elasticsearch: Can I return only the cardinality of a buckets agg, without returning all the buckets?

Take the following query and result,
POST index/_search
{
"size": 0,
"aggs": {
"perDeviceAggregation": {
"terms": {
"field": "deviceID"
},
"aggs": {
"score_avg": {
"avg": {
"field": "device_score"
}
}
}
},
"count":{
"cardinality": {
"field": "deviceID"
}
}
}
}
result:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "aa",
"doc_count": 3,
"score_avg": {
"value": 3.8
}
},
{
"key": "bb",
"doc_count": 1,
"score_avg": {
"value": 3.8
}
}
]
},
"count": {
"value": 2
}
}
That's great. But in my situation, I don't really care about information about each bucket. I only want to know the # of buckets. Something like the following:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"bucket_count": 2
}
}
Is this possible in Elasticsearch?
Edit:
You might wonder why I calculate an average (which forces me to use terms instead of cardinality) if I don't care about what's in the buckets. I use the average in a range aggregation; the question above was simplified. My actual problem looks like the following:
POST index/_search
{
"size": 0,
"aggs" : {
"mos_over_time" : {
"range" : {
"field" : "device_score",
"ranges" : [
{ "from" : 0.0, "to" : 2.6 },
{ "from" : 2.6, "to" : 4.0 },
{ "from" : 4.0 }
]
},
"aggs": {
"perDeviceAggregation": {
"terms": {
"field": "deviceID"
},
"aggs": {
"score_avg": {
"avg": {
"field": "device_score"
}
}
}
},
"count":{
"cardinality": {
"field": "deviceID"
}
}
}
}
}
}

Elasticsearch filter on aggregated results (SQL HAVING)

I have an ES query that aggregates data from a monitoring tool.
So far I have found the number of documents in each relevant group (grouped by "externalId").
Now I want to filter the results by the number of records in each group
(similar to a "HAVING" clause in SQL, e.g. doc_count > 0).
For instance, to find each "externalId" that was stored more than once.
This is my ES query:
{
"query":
{
"match" :
{
"method" : "METHOD_NAME"
}
},
"size":0,
"aggs":
{
"group_by_external_id":
{
"terms":
{
"field": "externalId"
}
}
}
}
The results look like this:
"aggregations": {
"group_by_external_id": {
"doc_count_error_upper_bound": 5,
"sum_other_doc_count": 53056,
"buckets": [
{
"key": "6088417651626873",
"doc_count": 1
},
{
"key": "6088417688232882",
"doc_count": 1
}
Terms aggregations have a min_doc_count option you can use. For example,
"aggs":
{
"group_by_external_id":
{
"terms":
{
"field": "externalId",
"min_doc_count": 2
}
}
}
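As a rough illustration of how this maps to a HAVING clause, here is a Python sketch that builds the request with min_doc_count and, for comparison, applies the same threshold client-side to buckets that were already fetched. The function names are hypothetical. The field and aggregation names come from the question.

```python
def having_request(method, field="externalId", min_doc_count=2):
    """Build the request from the question with a min_doc_count threshold."""
    return {
        "query": {"match": {"method": method}},
        "size": 0,
        "aggs": {
            "group_by_external_id": {
                "terms": {"field": field, "min_doc_count": min_doc_count}
            }
        },
    }


def keys_with_at_least(buckets, min_doc_count):
    """Client-side equivalent: keep keys whose doc_count meets the threshold."""
    return [b["key"] for b in buckets if b["doc_count"] >= min_doc_count]


# Hand-written buckets for illustration only.
buckets = [
    {"key": "a", "doc_count": 1},
    {"key": "b", "doc_count": 3},
]
print(keys_with_at_least(buckets, 2))  # ['b']
```

Filtering on the server with min_doc_count is usually preferable, since the client-side version only sees the buckets that the terms aggregation happened to return.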

ElasticSearch: min_doc_count on lower/lowest level nested aggregation

I have this query with some nested aggregations
{
"aggs": {
"by_date": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"new_users": {
"filter": {
"query": {
"match": {
"action": "USER_ADD"
}
}
},
"aggs": {
"unique_users": {
"cardinality": {
"field": "user"
}
}
}
}
}
}
},
"size": 0
}
It yields results that look like this
"aggregations": {
"by_date": {
"buckets": [
{
"key_as_string": "1970-01-07T00:00:00.000Z",
"key": 518400000,
"doc_count": 210,
"new_users": {
"doc_count": 0,
"unique_users": {
"value": 0
}
}
},
{
"key_as_string": "1970-01-09T00:00:00.000Z",
"key": 691200000,
"doc_count": 6,
"new_users": {
"doc_count": 0,
"unique_users": {
"value": 0
}
}
},
......
What I want is to apply min_doc_count to the most deeply nested sub-aggregation, so that zero values for "unique_users" (in this case) are not returned.
The issue is that min_doc_count can't be applied anywhere in my query other than the date_histogram at the top level.
Does the ES query language support something like this? Any known workarounds?
Thanks,
George
According to the Elasticsearch documentation, min_doc_count is not limited to date_histogram; it is also supported by other bucket aggregations such as terms and histogram.
For example:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tag"
}
}
}
}
The query above is not a date_histogram, yet you can still apply min_doc_count:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tag",
"min_doc_count" : 1
}
}
}
}
Keep in mind that min_doc_count is a parameter of bucket aggregations like these; metric aggregations such as cardinality do not accept it.
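One possible workaround, which is not part of the answer above, is a bucket_selector pipeline aggregation placed under the date_histogram, so that days whose unique_users value is zero are dropped. The Python sketch below only patches a request dictionary. The helper name and the selector name "only_nonzero" are hypothetical, and the script uses the params. prefix on the assumption of a version with Painless scripting (older versions reference the variable directly).

```python
import copy


def drop_empty_days(request, histogram="by_date", path="new_users>unique_users"):
    """Return a copy of `request` with a bucket_selector added under the histogram."""
    patched = copy.deepcopy(request)
    sub_aggs = patched["aggs"][histogram].setdefault("aggs", {})
    sub_aggs["only_nonzero"] = {
        "bucket_selector": {
            # Expose the nested cardinality value to the script as `uniqueUsers`.
            "buckets_path": {"uniqueUsers": path},
            # Assumption: Painless syntax; keeps buckets where the value is > 0.
            "script": "params.uniqueUsers > 0",
        }
    }
    return patched


# Request shaped like the one in the question.
request = {
    "size": 0,
    "aggs": {
        "by_date": {
            "date_histogram": {"field": "timestamp", "interval": "day"},
            "aggs": {
                "new_users": {
                    "filter": {"query": {"match": {"action": "USER_ADD"}}},
                    "aggs": {"unique_users": {"cardinality": {"field": "user"}}},
                }
            },
        }
    },
}
patched = drop_empty_days(request)
print(patched["aggs"]["by_date"]["aggs"]["only_nonzero"])
```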

Resources