How can I filter doc_count value which is a result of a nested aggregation - elasticsearch

How can I filter the doc_count value which is a result of a nested aggregation?
Here is my query:
"aggs": {
"CDIDs": {
"terms": {
"field": "CDID.keyword",
"size": 1000
},
"aggs": {
"my_filter": {
"filter": {
"range": {
"transactionDate": {
"gte": "now-1M/M"
}
}
}
},
"in_active": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count > 4"
}
}
}
}
}
The result of the query looks like:
{
"aggregations" : {
"CDIDs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 2386,
"buckets" : [
{
"key" : "1234567",
"doc_count" : 5,
"my_filter" : {
"doc_count" : 4
}
},
{
"key" : "12345",
"doc_count" : 5,
"my_filter" : {
"doc_count" : 5
}
}
]
}
}
}
I'm trying to filter on the second doc_count value here (the one inside my_filter). Say I want only the buckets where that count is > 4, so the result should contain a single bucket, the one whose my_filter doc_count = 5. Can anyone help me write this filter? Please let me know if any additional information is required.

Take a close look at the bucket_selector aggregation. You simply need to specify the aggregation name in the buckets_path section, i.e. "doc_count":"my_filter>_count"
Pipeline aggregations have their own buckets_path syntax, where > acts as the separator between an aggregation and its sub-aggregation or metric. Refer to the buckets_path syntax section of the pipeline aggregations documentation for more information.
Aggregation Query
POST <your_index_name>/_search
{
"size":0,
"aggs":{
"CDIDs":{
"terms":{
"field":"CDID.keyword",
"size":1000
},
"aggs":{
"my_filter":{
"filter":{
"range":{
"transactionDate":{
"gte":"now-1M/M"
}
}
}
},
"in_active":{
"bucket_selector":{
"buckets_path":{
"doc_count":"my_filter>_count"
},
"script":"params.doc_count > 4"
}
}
}
}
}
}
Hope it helps!
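To make the difference between "_count" and "my_filter>_count" concrete, here is a minimal client-side sketch in Python. It is not how Elasticsearch implements pipeline aggregations; resolve_buckets_path and select_buckets are hypothetical helper names, and the buckets are copied from the response in the question.

```python
def resolve_buckets_path(bucket, path):
    """Resolve a simplified buckets_path such as "my_filter>_count" against one bucket."""
    node = bucket
    for part in path.split(">"):
        if part == "_count":
            # _count refers to the doc_count of the aggregation reached so far
            return node["doc_count"]
        node = node[part]
    # Assumption: a path that does not end in _count points at a single-value metric
    return node["value"]


def select_buckets(buckets, path, predicate):
    """Keep only the buckets whose resolved value satisfies the predicate."""
    return [b for b in buckets if predicate(resolve_buckets_path(b, path))]


# Buckets taken from the response shown in the question
buckets = [
    {"key": "1234567", "doc_count": 5, "my_filter": {"doc_count": 4}},
    {"key": "12345", "doc_count": 5, "my_filter": {"doc_count": 5}},
]

# "_count" looks at the terms bucket itself, so both buckets pass
kept_outer = [b["key"] for b in select_buckets(buckets, "_count", lambda n: n > 4)]
# "my_filter>_count" looks at the nested filter aggregation, so only one passes
kept_inner = [b["key"] for b in select_buckets(buckets, "my_filter>_count", lambda n: n > 4)]

print(kept_outer)  # ['1234567', '12345']
print(kept_inner)  # ['12345']
```

The second call mirrors the corrected query: only the bucket whose my_filter doc_count exceeds 4 survives.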

Related

Finding intersection of two buckets using Elastic

I have data structured as the following in an elastic index:
[ { customer_id: 1, date_of_purchase: 01-01-2022 },
{ customer_id: 2, date_of_purchase: 01-02-2022 },
{ customer_id: 1, date_of_purchase: 01-02-2022 },
....
]
I want to find the numbers of users who have bought something in both September and October, but having issues figuring out how to make a query for this. Any suggestions would rock, thanks!
I have used the following aggregations:
1. Terms aggregation
2. Bucket selector
3. Date Range
In the query I keep only the documents whose purchase date falls in January or February, which reduces the number of documents the aggregation has to work on. In the aggregation I group by customer_id (terms aggregation) and then group each customer's documents by date range (one bucket per month). A bucket selector then removes the months with zero documents, i.e. with no purchase in that month, and a second bucket selector removes the customers that are left with one or zero buckets.
Query
{
"query": {
"bool": {
"should": [
{
"range": {
"date_of_purchase": {
"gte": "2022-01-01",
"lte": "2022-01-31"
}
}
},
{
"range": {
"date_of_purchase": {
"gte": "2022-02-01",
"lte": "2022-02-28"
}
}
}
]
}
},
"aggs": {
"cutomers": {
"terms": {
"field": "customer_id",
"size": 10
},
"aggs": {
"range": {
"date_range": {
"field": "date_of_purchase",
"ranges": [
{
"to": "2022-01-31",
"from": "2022-01-01"
},
{
"to": "2022-02-28",
"from": "2022-02-01"
}
]
},
"aggs": {
"filter_months": {
"bucket_selector": {
"buckets_path": {
"doc_count":"_count"
},
"script": "params.doc_count>=1"
}
}
}
},
"bucket_count":{
"bucket_selector": {
"buckets_path": {
"bucket_count":"range._bucket_count"
},
"script": "params.bucket_count>1"
}
}
}
}
}
}
Results
"aggregations" : {
"cutomers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 2,
"range" : {
"buckets" : [
{
"key" : "2022-01-01T00:00:00.000Z-2022-01-31T00:00:00.000Z",
"from" : 1.6409952E12,
"from_as_string" : "2022-01-01T00:00:00.000Z",
"to" : 1.6435872E12,
"to_as_string" : "2022-01-31T00:00:00.000Z",
"doc_count" : 1
},
{
"key" : "2022-02-01T00:00:00.000Z-2022-02-28T00:00:00.000Z",
"from" : 1.6436736E12,
"from_as_string" : "2022-02-01T00:00:00.000Z",
"to" : 1.6460064E12,
"to_as_string" : "2022-02-28T00:00:00.000Z",
"doc_count" : 1
}
]
}
}
]
}
}
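The same intersection logic can be sketched client-side in Python to check what the aggregation should return. This is only an illustration under assumptions: customers_in_all_months is a hypothetical helper, and the sample dates are read as day-month-year, so 01-02-2022 is taken to be 1 February 2022.

```python
from collections import defaultdict
from datetime import date

# Sample documents from the question (dates assumed to be day-month-year)
purchases = [
    {"customer_id": 1, "date_of_purchase": date(2022, 1, 1)},
    {"customer_id": 2, "date_of_purchase": date(2022, 2, 1)},
    {"customer_id": 1, "date_of_purchase": date(2022, 2, 1)},
]


def customers_in_all_months(purchases, months):
    """Return the customers that have at least one purchase in every (year, month)."""
    months = set(months)
    months_seen = defaultdict(set)
    for purchase in purchases:
        d = purchase["date_of_purchase"]
        year_month = (d.year, d.month)
        # Equivalent of the range query: ignore purchases outside the wanted months
        if year_month in months:
            months_seen[purchase["customer_id"]].add(year_month)
    # Equivalent of the bucket_count selector: keep customers seen in every month
    return sorted(c for c, seen in months_seen.items() if len(seen) == len(months))


both = customers_in_all_months(purchases, [(2022, 1), (2022, 2)])
print(both)       # [1]
print(len(both))  # 1
```

Customer 2 bought only in February, so only customer 1 remains, which matches the single bucket in the results above.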

How to filter by sub-aggregated results in Elasticsearch

I've got the following elastic search query in order to get the number of product sales per hour grouped by product id and hour of sale.
POST /my_sales/_search?size=0
{
"aggs": {
"sales_per_hour": {
"date_histogram": {
"field": "event_time",
"fixed_interval": "1h",
"format": "yyyy-MM-dd:HH:mm"
},
"aggs": {
"sales_per_hour_per_product": {
"terms": {
"field": "name.keyword"
}
}
}
}
}
}
One example of data:
{
"@timestamp" : "2020-10-29T18:09:56.921Z",
"name" : "my-beautifull_product",
"event_time" : "2020-10-17T08:01:33.397Z"
}
This query returns several buckets (one per hour and per product), but I would like to retrieve only those with a doc_count higher than 10, for example. Is that possible?
For those results I would like to know the id of the product and the event_time bucket.
Thanks for your help.
The bucket_selector pipeline aggregation may help to filter the results.
Try the search query below:
{
"aggs": {
"sales_per_hour": {
"date_histogram": {
"field": "event_time",
"fixed_interval": "1h",
"format": "yyyy-MM-dd:HH:mm"
},
"aggs": {
"sales_per_hour_per_product": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"the_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 10"
}
}
}
}
}
}
}
}
It keeps only the buckets whose document count is greater than 10, based on "params.the_doc_count > 10"; all other buckets are dropped.
Thank you for your help. This is not far from what I would like, but not exactly; with the bucket selector I get something like this:
"aggregations" : {
"sales_per_hour" : {
"buckets" : [
{
"key_as_string" : "2020-08-31:23:00",
"key" : 1598914800000,
"doc_count" : 16,
"sales_per_hour_per_product" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "my_product_1",
"doc_count" : 2
},
{
"key" : "my_product_2",
"doc_count" : 2
},
{
"key" : "myproduct_3",
"doc_count" : 12
}
]
}
}
]
}
And sometimes none of the buckets is greater than 10. Is it possible to have the same thing, but with the filter on _count applied to the second-level aggregation (sales_per_hour_per_product) and not to the first level (sales_per_hour)?
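One way to get the shape asked for here, whatever the server returns, is to prune the response on the client. The Python sketch below is only an assumption about what is wanted: prune_sales is a hypothetical helper, the first hourly bucket is copied from the response above, and the second one is invented to show an hour with no product over the threshold. On the Elasticsearch side, a second bucket_selector placed under sales_per_hour with a buckets_path of "sales_per_hour_per_product._bucket_count" may achieve the same thing, but that is not verified here.

```python
def prune_sales(aggregations, min_count):
    """Keep products with doc_count > min_count and drop hours left with no product."""
    pruned = []
    for hour in aggregations["sales_per_hour"]["buckets"]:
        products = [
            {"product": b["key"], "doc_count": b["doc_count"]}
            for b in hour["sales_per_hour_per_product"]["buckets"]
            if b["doc_count"] > min_count
        ]
        if products:  # skip hourly buckets where nothing passed the threshold
            pruned.append({"hour": hour["key_as_string"], "products": products})
    return pruned


aggregations = {
    "sales_per_hour": {
        "buckets": [
            {
                "key_as_string": "2020-08-31:23:00",
                "doc_count": 16,
                "sales_per_hour_per_product": {
                    "buckets": [
                        {"key": "my_product_1", "doc_count": 2},
                        {"key": "my_product_2", "doc_count": 2},
                        {"key": "myproduct_3", "doc_count": 12},
                    ]
                },
            },
            {
                # Hypothetical hour in which no product exceeds the threshold
                "key_as_string": "2020-09-01:00:00",
                "doc_count": 3,
                "sales_per_hour_per_product": {
                    "buckets": [{"key": "my_product_1", "doc_count": 3}]
                },
            },
        ]
    }
}

result = prune_sales(aggregations, 10)
print(result)
# [{'hour': '2020-08-31:23:00', 'products': [{'product': 'myproduct_3', 'doc_count': 12}]}]
```

Each surviving entry carries both the hour bucket and the product key, which is the pair of values the question asks for.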

aggregating properties in elastic search

I have an indexed entry that has optional properties. So, for example, I have entries like this
{
"id":1,
"field1":"XYZ"
},
{
"id":2,
"field2":"XYZ"
},
{
"id":3,
"field1":"XYZ"
}
I would like to make an aggregation that will tell me how many entries I have with field1 and field2 populated.
The expected result should be:
{
"field1":2,
"field2":1
}
Is this even possible with Elasticsearch?
Yes, you can do it like this:
POST myindex/_search
{
"size": 0,
"aggs": {
"field_exists": {
"filters": {
"filters": {
"field1": {
"exists": {
"field": "field1"
}
},
"field2": {
"exists": {
"field": "field2"
}
}
}
}
}
}
}
You'll get an answer like this one:
"aggregations" : {
"field_exists" : {
"buckets" : {
"field1" : {
"doc_count" : 2
},
"field2" : {
"doc_count" : 1
}
}
}
}
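As a sanity check of what the filters aggregation with exists queries counts, here is a small Python sketch over the three sample entries. count_field_exists is a hypothetical helper, and treating a missing key or a None value as "does not exist" is a simplification of the exists query.

```python
def count_field_exists(docs, fields):
    """Count, per field, the documents in which that field is present and not None."""
    return {
        field: sum(1 for doc in docs if doc.get(field) is not None)
        for field in fields
    }


# Sample entries from the question
docs = [
    {"id": 1, "field1": "XYZ"},
    {"id": 2, "field2": "XYZ"},
    {"id": 3, "field1": "XYZ"},
]

counts = count_field_exists(docs, ["field1", "field2"])
print(counts)  # {'field1': 2, 'field2': 1}
```

The counts line up with the doc_count values of the field1 and field2 buckets in the response above.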

Elastic GeoHash Query - Aggregation Filter

I am trying to query an elastic index where the result of the query is a list of the geohashes with only one matching document.
I can get a simple list of all geo hashes and their document counts using the following:
{
"size" : 0,
"aggregations" : {
"boundingbox" : {
"filter" : {
"geo_bounding_box" : {
"location" : {
"top_left" : "34.5, -118.9",
"bottom_right" : "33.3, -116."
}
}
},
"aggregations":{
"grid" : {
"geohash_grid" : {
"field": "location",
"precision": 4
}
}
}
}
}
}
However I can't work out the correct syntax to filter the query, the closest I can get are below:
This fails with 503 org.elasticsearch.search.aggregations.bucket.filter.InternalFilter cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
"aggregations":{
"grid" : {
"geohash_grid" : {
"field": "location",
"precision": 4
}
},
"grid_bucket_filter" : {
"bucket_selector" : {
"buckets_path" :{
"docCount" : "grid" //Also tried `"docCount" : "doc_count"`
},
"script" : "params.docCount == 1"
}
}
}
This fails with 400 No aggregation found for path [doc_count]
"aggregations":{
"grid" : {
"geohash_grid" : {
"field": "location",
"precision": 4
}
},
"grid_bucket_filter" : {
"bucket_selector" : {
"buckets_path" :{
"docCount" : "doc_count"
},
"script" : "params.docCount > 1"
}
}
}
How can I filter based on the doc_count in a geohash grid?
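You need to do it like this, i.e. the bucket selector pipeline must be specified as a sub-aggregation of the geohash_grid one. You also need to use _count instead of doc_count in buckets_path. The script below keeps the cells with more than one document; use "params.docCount == 1" to keep only the cells with exactly one: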
{
"aggregations": {
"grid": {
"geohash_grid": {
"field": "location",
"precision": 4
},
"aggs": {
"grid_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"docCount": "_count"
},
"script": "params.docCount > 1"
}
}
}
}
}
}
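Putting the answer together with the bounding box filter from the question, the full request body could be assembled as in the Python sketch below. geohash_count_request is a hypothetical helper that only builds the dictionary; it does not call Elasticsearch. The script is passed in as a parameter, here "params.docCount == 1", since the question asks for geohashes with exactly one document.

```python
import json


def geohash_count_request(field, precision, script, top_left, bottom_right):
    """Build a search body whose geohash_grid buckets are filtered by a bucket_selector."""
    return {
        "size": 0,
        "aggregations": {
            "boundingbox": {
                "filter": {
                    "geo_bounding_box": {
                        field: {"top_left": top_left, "bottom_right": bottom_right}
                    }
                },
                "aggregations": {
                    "grid": {
                        "geohash_grid": {"field": field, "precision": precision},
                        # The selector is nested under the grid, not placed next to it
                        "aggregations": {
                            "grid_bucket_filter": {
                                "bucket_selector": {
                                    "buckets_path": {"docCount": "_count"},
                                    "script": script,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = geohash_count_request(
    "location", 4, "params.docCount == 1", "34.5, -118.9", "33.3, -116."
)
print(json.dumps(body, indent=2))
```

Keeping the selector inside the grid's own sub-aggregations is the point of the answer: as a sibling of grid it is evaluated against the single-bucket filter aggregation, which appears to be what triggered the cast error quoted in the question.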

Post filter on subaggregation in elasticsearch

I am trying to run a post filter on the aggregated data, but it is not working as I expected. Can someone review my query and suggest whether I am doing anything wrong here?
{
"query" : {
"bool" : {
"must" : {
"range" : {
"versionDate" : {
"from" : null,
"to" : "2016-04-22T23:13:50.000Z",
"include_lower" : false,
"include_upper" : true
}
}
}
}
},
"aggregations" : {
"associations" : {
"terms" : {
"field" : "association.id",
"size" : 0,
"order" : {
"_term" : "asc"
}
},
"aggregations" : {
"top" : {
"top_hits" : {
"from" : 0,
"size" : 1,
"_source" : {
"includes" : [ ],
"excludes" : [ ]
},
"sort" : [ {
"versionDate" : {
"order" : "desc"
}
} ]
}
},
"disabledDate" : {
"filter" : {
"missing" : {
"field" : "disabledDate"
}
}
}
}
}
}
}
Steps in the query:
1. Filter by indexDate less than or equal to a given date.
2. Aggregate based on formId, forming one bucket per formId.
3. Sort in descending order and return the top hit per bucket.
4. Run a sub-aggregation filter after the sort sub-aggregation and remove from the buckets all documents where the disabled date is not null. (This is the part that is not working.)
The whole purpose of post_filter is to run after aggregations have been computed. As such, post_filter has no effect whatsoever on aggregation results.
What you can do in your case is to apply a top-level filter aggregation so that documents with no disabledDate are not taken into account in aggregations, i.e. consider only documents with disabledDate.
{
"query": {
"bool": {
"must": {
"range": {
"versionDate": {
"from": null,
"to": "2016-04-22T23:13:50.000Z",
"include_lower": true,
"include_upper": true
}
}
}
}
},
"aggregations": {
"with_disabled": {
"filter": {
"exists": {
"field": "disabledDate"
}
},
"aggs": {
"form.id": {
"terms": {
"field": "form.id",
"size": 0
},
"aggregations": {
"top": {
"top_hits": {
"size": 1,
"_source": {
"includes": [],
"excludes": []
},
"sort": [
{
"versionDate": {
"order": "desc"
}
}
]
}
}
}
}
}
}
}
}
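To see what the filter-then-aggregate structure computes, the query above can be emulated on a handful of documents in Python. This is a sketch under assumptions: latest_per_form is a hypothetical helper, the sample documents are invented for illustration, and the timestamps are compared as strings, which only works because they share one ISO 8601 format.

```python
def latest_per_form(docs, cutoff):
    """Emulate: range query, exists filter, terms on form.id, top hit by versionDate."""
    latest = {}
    for doc in docs:
        if doc["versionDate"] > cutoff:
            continue  # range query: versionDate must be <= cutoff
        if doc.get("disabledDate") is None:
            continue  # filter aggregation: only documents that have disabledDate
        form_id = doc["form"]["id"]
        current = latest.get(form_id)
        # top_hits with size 1, sorted by versionDate descending
        if current is None or doc["versionDate"] > current["versionDate"]:
            latest[form_id] = doc
    return latest


# Hypothetical documents, invented for this example
docs = [
    {"form": {"id": "a"}, "versionDate": "2016-01-01T00:00:00.000Z",
     "disabledDate": "2016-02-01T00:00:00.000Z"},
    {"form": {"id": "a"}, "versionDate": "2016-03-01T00:00:00.000Z",
     "disabledDate": "2016-03-05T00:00:00.000Z"},
    {"form": {"id": "a"}, "versionDate": "2016-04-01T00:00:00.000Z"},
    {"form": {"id": "b"}, "versionDate": "2016-02-01T00:00:00.000Z"},
    {"form": {"id": "c"}, "versionDate": "2016-05-01T00:00:00.000Z",
     "disabledDate": "2016-05-02T00:00:00.000Z"},
]

latest = latest_per_form(docs, "2016-04-22T23:13:50.000Z")
print({form_id: doc["versionDate"] for form_id, doc in latest.items()})
# {'a': '2016-03-01T00:00:00.000Z'}
```

Because the filter runs before the grouping, documents without disabledDate never reach the buckets, which is the behaviour a post_filter cannot provide.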
