ElasticSearch: min_doc_count on lower/lowest level nested aggregation

I have this query with some nested aggregations
{
"aggs": {
"by_date": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"new_users": {
"filter": {
"query": {
"match": {
"action": "USER_ADD"
}
}
},
"aggs": {
"unique_users": {
"cardinality": {
"field": "user"
}
}
}
}
}
}
},
"size": 0
}
It yields results that look like this
"aggregations": {
"by_date": {
"buckets": [
{
"key_as_string": "1970-01-07T00:00:00.000Z",
"key": 518400000,
"doc_count": 210,
"new_users": {
"doc_count": 0,
"unique_users": {
"value": 0
}
}
},
{
"key_as_string": "1970-01-09T00:00:00.000Z",
"key": 691200000,
"doc_count": 6,
"new_users": {
"doc_count": 0,
"unique_users": {
"value": 0
}
}
},
......
What I want is to apply min_doc_count to the innermost sub-aggregation, so that buckets with a zero value for "unique_users" (in this case) are not returned.
The issue is that min_doc_count can't be applied anywhere in my query other than the date_histogram at the top level.
Does the ES query language support something like this? Are there any known workarounds?
Thanks,
George

According to the Elasticsearch documentation, min_doc_count is not limited to date_histogram; it can also be used with other bucket aggregations such as terms and histogram.
For example:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tag"
}
}
}
}
The query above is not a date_histogram, yet min_doc_count can still be applied to it:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tag",
"min_doc_count" : 1
}
}
}
}
In short, min_doc_count is available on bucket aggregations such as terms and histogram, not only on date_histogram.
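For the query in the question, where the zero values sit under a filter and a cardinality aggregation, one possible workaround is a bucket_selector pipeline aggregation (available from Elasticsearch 2.0), which drops a parent bucket based on a sub-aggregation's value. Below is a minimal sketch in Python that only builds the request body and shows an equivalent client-side filter. The name has_new_users, the helper drop_empty_buckets, and the Painless-style script (Elasticsearch 5.x and later; 2.x used a different script syntax) are assumptions, not something from the posts above.

```python
import json


def build_query():
    """Request body: daily histogram that keeps only days with new users."""
    return {
        "size": 0,
        "aggs": {
            "by_date": {
                # "interval" is kept as in the question; newer versions
                # use "calendar_interval" instead.
                "date_histogram": {"field": "timestamp", "interval": "day"},
                "aggs": {
                    "new_users": {
                        "filter": {"match": {"action": "USER_ADD"}},
                        "aggs": {
                            "unique_users": {"cardinality": {"field": "user"}}
                        },
                    },
                    # Hypothetical name. Removes a by_date bucket when the
                    # nested cardinality is zero.
                    "has_new_users": {
                        "bucket_selector": {
                            "buckets_path": {
                                "uniqueUsers": "new_users>unique_users"
                            },
                            "script": "params.uniqueUsers > 0",
                        }
                    },
                },
            }
        },
    }


def drop_empty_buckets(response):
    """Client-side alternative: filter buckets after the response arrives."""
    buckets = response["aggregations"]["by_date"]["buckets"]
    return [b for b in buckets if b["new_users"]["unique_users"]["value"] > 0]


if __name__ == "__main__":
    print(json.dumps(build_query(), indent=2))
```

The client-side filter works on any version, at the cost of transferring the empty buckets first.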

Related

How can I know if two different aggregations aggregated the same docs?

Suppose I have two aggs:
GET .../_search
{
"size": 0,
"aggs": {
"foo": {
"terms": {
"field": "foo"
}
},
"bar": {
"terms": {
"field": "bar"
}
}
}
}
Which returns the following:
...
"aggregations": {
"foo": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Africa",
"doc_count": 23
}
]
},
"bar": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Oil",
"doc_count": 23
}
]
}
}
My question is, how can I know if both "foo" and "bar" aggs are aggregating the same 23 docs?
I tried adding a sub agg to both "foo" and "bar" aggs to sum an arbitrary numeric field, but that's not remotely foolproof.
You can add a sub-aggregation on the documents' identity field, using either a terms or a composite aggregation. With terms you need to provide a size. See this example:
GET .../_search
{
"size": 0,
"aggs": {
"foo": {
"terms": {
"field": "foo"
},
"aggs" : {
"ids" : {
"terms" : {
"field" : "your_id_here",
"size" : 1000
}
}
}
},
"bar": {
"terms": {
"field": "bar"
},
"aggs" : {
"ids" : {
"terms" : {
"field" : "your_id_here",
"size" : 1000
}
}
}
}
}
}
You will need to compare the nested aggregations then.
Another approach would be to just filter out the desired documents using the search query.
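The comparison step can be sketched in Python as below. This assumes the ID sub-aggregation is named ids (a hypothetical name) and that its size is large enough to return every document ID in the bucket; otherwise the comparison is incomplete. The helper names are illustrative only.

```python
def ids_in_bucket(response, agg_name, bucket_key, ids_agg="ids"):
    """Collect the document IDs listed under one terms bucket."""
    for bucket in response["aggregations"][agg_name]["buckets"]:
        if bucket["key"] == bucket_key:
            return {b["key"] for b in bucket[ids_agg]["buckets"]}
    return set()


def same_docs(response, left, right):
    """True if two (agg_name, bucket_key) pairs cover the same document IDs."""
    left_ids = ids_in_bucket(response, *left)
    right_ids = ids_in_bucket(response, *right)
    return bool(left_ids) and left_ids == right_ids
```

For the example in the question, the call would be same_docs(response, ("foo", "Africa"), ("bar", "Oil")).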

Elasticsearch: Can I return only the cardinality of a buckets agg, without returning all the buckets?

Take the following query and result,
POST index/_search
{
"size": 0,
"aggs": {
"perDeviceAggregation": {
"terms": {
"field": "deviceID"
},
"aggs": {
"score_avg": {
"avg": {
"field": "device_score"
}
}
}
},
"count":{
"cardinality": {
"field": "deviceID"
}
}
}
}
result:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "aa",
"doc_count": 3,
"score_avg": {
"value": 3.8
}
},
{
"key": "bb",
"doc_count": 1,
"score_avg": {
"value": 3.8
}
}
]
},
"count": {
"value": 2
}
}
That's great. But in my situation, I don't really care about information about each bucket. I only want to know the # of buckets. Something like the following:
"aggregations": {
"aads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"bucket_count": 2
}
}
Is this possible in Elasticsearch?
Edit:
You might wonder why I calculate an average (which forces me to use terms instead of cardinality) if I don't care about what's in the buckets. I use the average for a range aggregation. The question above was simplified; my actual problem looks like the following:
POST index/_search
{
"size": 0,
"aggs" : {
"mos_over_time" : {
"range" : {
"field" : "device_score",
"ranges" : [
{ "from" : 0.0, "to" : 2.6 },
{ "from" : 2.6, "to" : 4.0 },
{ "from" : 4.0 }
]
},
"aggs": {
"perDeviceAggregation": {
"terms": {
"field": "deviceID"
},
"aggs": {
"score_avg": {
"avg": {
"field": "device_score"
}
}
}
},
"count":{
"cardinality": {
"field": "deviceID"
}
}
}
}
}
}

Elasticsearch query array field

I use Elasticsearch in my project and have stored some values in it. I want to query an array field and count how many times each combination of array values occurs. For example, in the documents below, the combination of images and price occurs twice.
{
"missing_fields_arr": ["images", "price"]
},
{
"missing_fields_arr": ["price"]
},
{
"missing_fields_arr": ["images"]
},
{
"missing_fields_arr": ["images", "price"]
}
and the expected output is:
"aggregations": {
"missing_fields": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "images, price",
"doc_count": 2
},
{
"key": "price",
"doc_count": 1
},
{
"key": "images",
"doc_count": 1
}
]
}
}
Here is my query:
{
"query":{
"bool":{
"must":[
{
"range": {
"@timestamp":{
"gte": "2017-07-20T00:00:00.000Z",
"lte": "2017-07-28T23:59:59.999Z"
}
}
},
{
"term": {
"tracker_name": true
}
}
]
}
},
"from": 0,
"size": 0,
"aggregations" : {
"missing_fields": {"terms": {"field": "missing_fields_arr.raw", "size": 0} }
}
}
You need to use the count API, which is much more efficient than a search, combined with a little bit of regex.
For example:
curl -XGET 'localhost:9200/product/item/_count?pretty' -H 'Content-Type: application/json' -d'
{ "query" : { "regexp" : { "missing_fields_arr.raw" : "images|price" } } }'
or, in console form:
GET /product/item/_count
{
"query" : {
"regexp" : { "missing_fields_arr.raw" : "images|price" }
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-valuecount-aggregation.html
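Note that a terms aggregation on an array field creates one bucket per individual value, so on its own it cannot return the combined key "images, price" shown in the expected output. One possible workaround is to count the combinations client-side. This is a minimal sketch, assuming the matching documents' _source has already been fetched into a Python list; the function name and the ", " joining convention are illustrative only.

```python
from collections import Counter


def count_combinations(docs, field="missing_fields_arr"):
    """Count how often each full combination of array values occurs."""
    counts = Counter()
    for doc in docs:
        # Sort so that ["price", "images"] and ["images", "price"]
        # are treated as the same combination.
        key = ", ".join(sorted(doc.get(field, [])))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    docs = [
        {"missing_fields_arr": ["images", "price"]},
        {"missing_fields_arr": ["price"]},
        {"missing_fields_arr": ["images"]},
        {"missing_fields_arr": ["images", "price"]},
    ]
    for key, count in count_combinations(docs).most_common():
        print(key, count)
```

For large result sets this means paging through all hits, so it is only practical when the filtered set is reasonably small.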

Elasticsearch nested cardinality aggregation

I have a mapping with a nested schema. I am trying to aggregate on a nested field and order by the distinct docid count.
select name, count(distinct docid) as uniqueid from table
group by name
order by uniqueid desc
The SQL above is what I am trying to do.
{
"size": 0,
"aggs": {
"samples": {
"nested": {
"path": "sample"
},
"aggs": {
"sample": {
"terms": {
"field": "sample.name",
"order": {
"DocCounts": "desc"
}
},
"aggs": {
"DocCounts": {
"cardinality": {
"field": "docid"
}
}
}
}
}
}
}
}
But I am not getting the expected output in the result.
result:
"buckets": [
{
"key": "xxxxx",
"doc_count": 173256,
"DocCounts": {
"value": 0
}
},
{
"key": "yyyyy",
"doc_count": 63,
"DocCounts": {
"value": 0
}
}
]
I am getting DocCounts = 0, which is not expected. What is wrong with my query?
I think your last nested aggregation is too much. Try to get rid of it:
{
"size": 0,
"aggs": {
"samples": {
"nested": {
"path": "sample"
},
"aggs": {
"sample": {
"terms": {
"field": "sample.name",
"order": {
"DocCounts": "desc"
}
},
"DocCounts": {
"cardinality": {
"field": "docid"
}
}
}
}
}
}
}
In general, when aggregating inside a nested type on a value from the parent scope, we found that the value has to be copied from the parent onto the nested object when the document is indexed.
In your case the aggregation would then look like:
"aggs": {
"DocCounts": {
"cardinality": {
"field": "sample.docid"
}
}
}
This works at least on version 1.7 of Elasticsearch.
You can wrap the cardinality aggregation on DocCounts in a reverse_nested aggregation. When a nested aggregation is applied, the query runs against the nested documents, so a field of the parent document is not visible there. A reverse_nested aggregation steps back out to the parent document and makes its fields accessible again. Check the ES Reference for more info on this.
Your cardinality query will look like:
"aggs": {
"internal_DocCounts": {
"reverse_nested": { },
"aggs": {
"DocCounts": {
"cardinality": {
"field": "docid"
}
}
}
}
}
The response will look like:
"buckets": [
{
"key": "xxxxx",
"doc_count": 173256,
"internal_DocCounts": {
"doc_count": 173256,
"DocCounts": {
"value": <some_value>
}
}
},
{
"key": "yyyyy",
"doc_count": 63,
"internal_DocCounts": {
"doc_count": 63,
"DocCounts": {
"value": <some_value>
}
}
},
.....
Check this similar thread
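Putting the reverse_nested suggestion together with the original query gives roughly the request below. It is a sketch in Python that only builds the request body. One detail it assumes: because DocCounts now sits one level deeper, the terms order has to reference it through the path internal_DocCounts>DocCounts. The function names and the response-flattening helper are illustrative, not taken from the answers above.

```python
import json


def build_reverse_nested_query():
    """Terms on the nested field, ordered by distinct parent docid."""
    return {
        "size": 0,
        "aggs": {
            "samples": {
                "nested": {"path": "sample"},
                "aggs": {
                    "sample": {
                        "terms": {
                            "field": "sample.name",
                            # Path through the single-bucket
                            # reverse_nested aggregation to the metric.
                            "order": {"internal_DocCounts>DocCounts": "desc"},
                        },
                        "aggs": {
                            "internal_DocCounts": {
                                # Step back to the parent documents.
                                "reverse_nested": {},
                                "aggs": {
                                    "DocCounts": {
                                        "cardinality": {"field": "docid"}
                                    }
                                },
                            }
                        },
                    }
                },
            }
        },
    }


def unique_doc_counts(response):
    """Flatten the response into {sample name: distinct docid count}."""
    buckets = response["aggregations"]["samples"]["sample"]["buckets"]
    return {
        b["key"]: b["internal_DocCounts"]["DocCounts"]["value"]
        for b in buckets
    }


if __name__ == "__main__":
    print(json.dumps(build_reverse_nested_query(), indent=2))
```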

ElasticSearch Filtering aggregations from array field

I am trying to do an aggregation on values in an array and also filter the buckets that are returned by a prefix. Not sure if this is possible or I am misusing the filter bucket.
3 documents:
{ "colors":["red","black","blue"] }
{ "colors":["red","black"] }
{ "colors":["red"] }
The goal is to get a count of documents that have a color starting with the letter B:
{
"size":0,
"aggs" : {
"colors" : {
"filter" : { "prefix" : { "colors" : "b" } },
"aggs" : {
"top-colors" : { "terms" : { "field":"colors" } }
}
}
}
}
Unfortunately, the results that come back include red. This is because the documents containing red still match the filter, since they also contain blue and/or black.
"aggregations": {
"colors": {
"doc_count": 2,
"top-colors": {
"buckets": [
{
"key": "black",
"doc_count": 2
},
{
"key": "red",
"doc_count": 2
},
{
"key": "blue",
"doc_count": 1
}
]
}
}
}
Is there a way to filter just the bucket results?
Try this; it filters the values for which the buckets themselves are created:
{
"size": 0,
"aggs": {
"colors": {
"filter": {
"prefix": {
"colors": "b"
}
},
"aggs": {
"top-colors": {
"terms": {
"field": "colors",
"include": {
"pattern": "b.*"
}
}
}
}
}
}
}
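As a rough illustration of the same idea, the Python sketch below builds the request for an arbitrary prefix and also shows a client-side fallback that filters the returned buckets. It assumes the prefix consists of plain lowercase letters (no regex metacharacters) and passes the include pattern as a plain regex string, which the terms aggregation also accepts. The function names are hypothetical.

```python
def build_color_query(prefix):
    """Filter documents by prefix and only create buckets for matching terms."""
    return {
        "size": 0,
        "aggs": {
            "colors": {
                "filter": {"prefix": {"colors": prefix}},
                "aggs": {
                    "top-colors": {
                        "terms": {
                            "field": "colors",
                            # Assumes prefix has no regex metacharacters.
                            "include": prefix + ".*",
                        }
                    }
                },
            }
        },
    }


def buckets_with_prefix(response, prefix):
    """Client-side fallback: keep only buckets whose key has the prefix."""
    buckets = response["aggregations"]["colors"]["top-colors"]["buckets"]
    return [b for b in buckets if b["key"].startswith(prefix)]
```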
