Filter Elasticsearch Aggregation by Bucket Key Value - elasticsearch

I have an Elasticsearch index of documents in which there is a field that contains a list of URLs. Aggregating on this field gives me the count of unique URLs, as expected.
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
}
}
}
I then want to filter out the buckets whose keys do not contain a certain string. I've tried doing so with the Bucket Selector Aggregation.
This attempt:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
},
"links_key_filter": {
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
}
Fails with:
Invalid pipeline aggregation named [links_key_filter] of type
[bucket_selector]. Only sibling pipeline aggregations are allowed at
the top level
Putting the bucket selector inside the links aggregation, like so:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
},
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
}
fails with:
Found two aggregation type definitions in [links]: [terms] and [bucket_selector]
I'm going to keep tinkering but am a bit stuck at the moment :(

You won't be able to use bucket_selector here, because its buckets_path
must reference either a number value or a single value numeric metric aggregation [source],
whereas what a terms aggregation produces here is denoted as StringTerms. That simply won't work, regardless of whether you force a placeholder multi-bucket aggregation or not.
Having said that, each terms aggregation supports the exclude filter.
Assuming that your links are arrays of keywords:
POST models/_doc/1
{
"links": [
"google.com",
"wikipedia.org"
]
}
POST models/_doc/2
{
"links": [
"reddit.com",
"google.com"
]
}
and you'd like to group everything except reddit, you can use the following regex:
POST models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"exclude": ".*reddit.*",
"size": 10
}
}
}
}
BTW, there are some non-trivial implications of using such regexes, especially in a case-sensitive scenario where you'd need a regex generated at query time, as discussed in "How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?"
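To see which buckets such an exclude pattern keeps, here is a minimal Python sketch that mimics the terms aggregation client-side on the two sample documents above. It is only an approximation: Elasticsearch evaluates include/exclude as Lucene regular expressions matched against the whole term, which the sketch imitates with re.fullmatch, and Python's regex syntax is not identical to Lucene's. The helper name terms_with_exclude is hypothetical and not part of any Elasticsearch client.

```python
import re
from collections import Counter

# The two sample documents indexed above.
docs = [
    {"links": ["google.com", "wikipedia.org"]},
    {"links": ["reddit.com", "google.com"]},
]

def terms_with_exclude(docs, field, exclude, size=10):
    """Approximate a terms aggregation with an `exclude` regex (hypothetical helper)."""
    pattern = re.compile(exclude)
    counts = Counter()
    for doc in docs:
        # A terms aggregation counts each distinct value once per document.
        for value in set(doc.get(field, [])):
            # fullmatch imitates the whole-term anchoring of Lucene regexes.
            if not pattern.fullmatch(value):
                counts[value] += 1
    # Highest doc_count first, ties broken by key for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]

buckets = terms_with_exclude(docs, "links", ".*reddit.*")
print(buckets)
```

For the original question (keep only the keys that contain 'foo'), the same reasoning applies to the include parameter of the terms aggregation, for example "include": ".*foo.*".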

GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
},
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
Your selector should come one level up: it should sit directly in the aggs, parallel to your links aggregation.
I am not sure about the key filtering, though.

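You can use "_key" in buckets_path to get keys. Note that the bucket_selector has to be a named sub-aggregation under the terms aggregation's own "aggs"; placing it directly next to "terms" produces the "Found two aggregation type definitions" error from the question:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
},
"aggs": {
"links_key_filter": {
"bucket_selector": {
"buckets_path": {
"key": "_key"
},
"script": "!params.key.contains('foo')"
}
}
}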
}
}
}

Related

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{
"terms": {
"type": ["manager", "lead"]
}
}
]
}
}
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.
I'd suggest using a bucket_script pipeline aggregation. As far as I know, this aggregation needs to run in the sub-aggs of a histogram aggregation. Since you have such a field in your mapping, I think this query should work for you:
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "my_datetime",
"interval": "month"
},
"aggs": {
"role_type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"role_1_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_1 / (params.role_1+params.role_2)*100"
}
},
"role_2_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_2 / (params.role_1+params.role_2)*100"
}
}
}
}
}
}
Please let me know if it didn't work well for you.
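The arithmetic behind the two bucket_script formulas can be checked outside Elasticsearch. The following Python sketch mirrors them for a single histogram bucket; the function name role_ratios, the zero-total guard, and the example counts are made up for illustration and are not part of the query above.

```python
def role_ratios(manager_count, lead_count):
    """Mirror the role_1_ratio and role_2_ratio scripts for one bucket."""
    total = manager_count + lead_count
    if total == 0:
        # Guard added for this sketch only, so it never divides by zero.
        return None, None
    role_1_ratio = manager_count / total * 100  # share of "manager" documents
    role_2_ratio = lead_count / total * 100     # share of "lead" documents
    return role_1_ratio, role_2_ratio

# Made-up counts for one month: 8 managers and 2 leads.
manager_pct, lead_pct = role_ratios(8, 2)
print(manager_pct, lead_pct)
```

Note that these ratios describe the documents that fall into each bucket; they report the mix, they do not control which hits a search returns.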

Filter out terms aggregation buckets in elasticsearch after applying aggregation

Below is a snapshot of the dataset:
recordNo  employeeId  employeeStatus  employeeAddr
1         employeeA   Permanent
2         employeeA                   ABC
3         employeeB   Contract
4         employeeB                   CDE
I want to get the list of employees along with their employeeStatus and employeeAddr.
So I am using a terms aggregation on employeeId, with sub-aggregations on employeeStatus and employeeAddr to get these details.
The query below returns the results correctly.
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
}
}
}
}
}
Now I want only the employees that are in Permanent status, so I am applying a filter aggregation.
{
"aggregations": {
"filter_Employee_employeeID": {
"filter": {
"bool": {
"must": [
{
"match": {
"employeeStatus": {"query": "Permanent"}
}
}
]
}
},
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
}
}
}
}
}
}
}
Now the problem is that the employeeAddr aggregation returns no buckets for employeeA because record 2 gets filtered out before the aggregation is done.
Assuming that I cannot modify the data set and I want to achieve the result with a single elastic query, how can I do it?
I checked the Bucket Selector pipeline aggregation but it only works for metric aggregations.
Is there a way to filter out term buckets after the aggregation is applied?
If I understood correctly, you want to preserve the aggregations even when you apply some kind of filter. To achieve that, try using the post_filter clause.
You can check the docs here.
The clause is applied "outside" the aggregations, after they have been computed. Using your example, it should look like this:
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {
"field": "employeeStatus"
}
},
"employeeAddr": {
"terms": {
"field": "employeeAddr"
}
}
}
}
},
"post_filter": {
"bool": {
"must": [
{
"match": {
"employeeStatus": {
"query": "Permanent"
}
}
}
]
}
}
}
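To make the effect of post_filter concrete, here is a small Python toy model of the request flow, using the four records from the question. It assumes the usual ordering (aggregations are computed over everything the query matches, then post_filter narrows only the returned hits); the function search_with_post_filter is a hypothetical stand-in, not an Elasticsearch API.

```python
from collections import defaultdict

# The four records from the question's dataset snapshot.
records = [
    {"recordNo": 1, "employeeId": "employeeA", "employeeStatus": "Permanent"},
    {"recordNo": 2, "employeeId": "employeeA", "employeeAddr": "ABC"},
    {"recordNo": 3, "employeeId": "employeeB", "employeeStatus": "Contract"},
    {"recordNo": 4, "employeeId": "employeeB", "employeeAddr": "CDE"},
]

def search_with_post_filter(records, post_filter):
    """Toy model: aggregate over all matches, then filter only the hits."""
    aggs = defaultdict(lambda: {"employeeStatus": set(), "employeeAddr": set()})
    for rec in records:  # aggregations see every record matched by the query
        bucket = aggs[rec["employeeId"]]
        for field in ("employeeStatus", "employeeAddr"):
            if field in rec:
                bucket[field].add(rec[field])
    hits = [rec for rec in records if post_filter(rec)]  # applied afterwards
    return hits, dict(aggs)

hits, aggs = search_with_post_filter(
    records, lambda rec: rec.get("employeeStatus") == "Permanent"
)
print([hit["recordNo"] for hit in hits])
print(sorted(aggs))
```

In this toy model the employeeA bucket keeps its address, but the employeeB bucket is still present as well, because post_filter narrows the hits and leaves the aggregation buckets untouched.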
I tested a combination of the include parameter of the terms aggregation and a bucket_selector on the bucket count, and it gives the desired result.
Filtering term values is documented here.
The bucket selector using a bucket count is documented here.
The subtlety is that, yes, buckets_path needs numeric values, but it can also reference special paths that Elasticsearch provides, such as _bucket_count.
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeId.keyword"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus", "include": "Permanent"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "employeeStatus._bucket_count"
},
"script": {
"source": "params.count != 0"
}
}
}
}
}
}
}
I tested this on 7.10 and it worked, returning only employeeA, with the address included.
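The logic of that request can be traced by hand with a short Python sketch over the four records from the question. It is a client-side imitation under simplifying assumptions (exact string comparison instead of the include pattern, sets instead of real buckets), and the function permanent_employees is a hypothetical name.

```python
from collections import defaultdict

# The four records from the question's dataset snapshot.
records = [
    {"recordNo": 1, "employeeId": "employeeA", "employeeStatus": "Permanent"},
    {"recordNo": 2, "employeeId": "employeeA", "employeeAddr": "ABC"},
    {"recordNo": 3, "employeeId": "employeeB", "employeeStatus": "Contract"},
    {"recordNo": 4, "employeeId": "employeeB", "employeeAddr": "CDE"},
]

def permanent_employees(records, status="Permanent"):
    """Imitate terms + include + bucket_selector on _bucket_count."""
    buckets = defaultdict(lambda: {"employeeStatus": set(), "employeeAddr": set()})
    for rec in records:
        bucket = buckets[rec["employeeId"]]
        # "include": only the requested status value forms a sub-bucket.
        if rec.get("employeeStatus") == status:
            bucket["employeeStatus"].add(status)
        if "employeeAddr" in rec:
            bucket["employeeAddr"].add(rec["employeeAddr"])
    # bucket_selector: keep parents whose employeeStatus bucket count is non-zero.
    return {
        employee: {name: sorted(values) for name, values in bucket.items()}
        for employee, bucket in buckets.items()
        if len(bucket["employeeStatus"]) != 0
    }

result = permanent_employees(records)
print(result)
```

Because the address sub-aggregation is never filtered, employeeA keeps its address from record 2, while employeeB is dropped for having no matching status bucket.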

How to convert ElasticSearch query to ES7

We are having a tremendous amount of trouble converting an old ElasticSearch query to a newer version of ElasticSearch. The original query for ES 1.8 is:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*",
"default_operator": "AND"
}
},
"filter": {
"and": [
{
"terms": {
"organization_id": [
"fred"
]
}
}
]
}
}
},
"size": 50,
"sort": {
"updated": "desc"
},
"aggs": {
"status": {
"terms": {
"size": 0,
"field": "status"
}
},
"tags": {
"terms": {
"size": 0,
"field": "tags"
}
}
}
}
and we are trying to convert it to ES version 7. Does anyone know how to do that?
The Elasticsearch docs for the Filtered query in 6.8 (the latest version of the docs I can find that has the page) state that you should move the query and filter to the must and filter parameters of the bool query.
Also, the terms aggregation no longer supports setting size to 0 to get Integer.MAX_VALUE. If you really want all the terms, you need to set it to the max value (2147483647) explicitly. However, the documentation for Size recommends using the Composite aggregation and paginating instead.
Below is the closest query I could make to the original that will work with Elasticsearch 7.
{
"query": {
"bool": {
"must": {
"query_string": {
"query": "*",
"default_operator": "AND"
}
},
"filter": {
"terms": {
"organization_id": [
"fred"
]
}
}
}
},
"size": 50,
"sort": {
"updated": "desc"
},
"aggs": {
"status": {
"terms": {
"size": 2147483647,
"field": "status"
}
},
"tags": {
"terms": {
"size": 2147483647,
"field": "tags"
}
}
}
}
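The query part of this rewrite is mechanical enough to sketch in Python. The helper below, filtered_to_bool, is a hypothetical name; it assumes the legacy "and" filter is given in its list form, as in the question, and leaves every other filter shape unchanged. It only transforms dictionaries and does not talk to a cluster.

```python
def filtered_to_bool(query):
    """Rewrite a legacy {"filtered": ...} query as a bool query (hypothetical helper)."""
    if "filtered" not in query:
        return query  # nothing to convert
    legacy = query["filtered"]
    bool_query = {}
    if "query" in legacy:
        # The scoring part of the filtered query moves to bool.must.
        bool_query["must"] = legacy["query"]
    if "filter" in legacy:
        flt = legacy["filter"]
        # The old "and" filter wrapped a list of clauses; bool.filter accepts
        # such a list directly and requires every clause to match.
        if isinstance(flt, dict) and set(flt) == {"and"} and isinstance(flt["and"], list):
            flt = flt["and"]
        bool_query["filter"] = flt
    return {"bool": bool_query}

# The query section of the original request from the question.
old_query = {
    "filtered": {
        "query": {"query_string": {"query": "*", "default_operator": "AND"}},
        "filter": {"and": [{"terms": {"organization_id": ["fred"]}}]},
    }
}
new_query = filtered_to_bool(old_query)
print(new_query)
```

The result keeps the filter as a one-element list, which is equivalent to the single filter object used in the converted query above.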

How to mention from and size for the first level of elastic search aggregation in nested aggregation?

I have written a query to get the buckets based on id and then sort them. This works fine. But how can I make it return only the buckets from position 100 to 200 for the aggregation_by_id aggregation?
{
"query": {
"match_all": {}
},
"size": 0,
"aggregations": {
"aggregation_by_id": {
"terms": {
"field": "id.keyword",
"size": 200
},
"aggs": {
"sort_timestamp": {
"top_hits": {
"sort": [{
"timestamp": {
"order": "desc",
"unmapped_type": "long"
}
}],
"size": 1
}
}
}
}
}
}

bucket script not working - elasticsearch 2.4.2

I have tried to subtract one aggregation from another:
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"total_query_id": {
"sum": {
"field": "query_id"
}
},
"total_num_results": {
"sum": {
"field": "num_results"
}
},
"minus_value": {
"bucket_script": {
"buckets_path": {
"qid": "total_query_id",
"nrs": "total_num_results"
},
"script": "qid - nrs"
}
}
}
}
It throws the error below:
"reason": "Invalid pipeline aggregation named [minus_value] of type [bucket_script]. Only sibling pipeline aggregations are allowed at the top level"
I have moved the minus_value node back and forth within the aggs node, but it does not solve my problem.
Can anyone help me with this?
The idea is that pipeline aggregations must work on a parent bucket aggregation.
It is not the case in your example, so you must have one parent aggregation. Since you have a match_all query, you could try using a global bucket aggregation and then embed your 3 aggregations inside it, like this:
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"all": {
"global": {},
"aggs": {
"total_query_id": {
"sum": {
"field": "query_id"
}
},
"total_num_results": {
"sum": {
"field": "num_results"
}
},
"minus_value": {
"bucket_script": {
"buckets_path": {
"qid": "total_query_id",
"nrs": "total_num_results"
},
"script": "qid - nrs"
}
}
}
}
}
}
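The restructuring described above amounts to nesting the existing top-level aggregations under one parent. A minimal Python sketch of that reshaping follows; wrap_in_global is a hypothetical helper, and the code only builds the request body as a dictionary without contacting a cluster.

```python
import json

def wrap_in_global(aggs, name="all"):
    """Nest top-level aggregations under a single global bucket aggregation."""
    return {name: {"global": {}, "aggs": aggs}}

# The three aggregations from the question, unchanged.
top_level_aggs = {
    "total_query_id": {"sum": {"field": "query_id"}},
    "total_num_results": {"sum": {"field": "num_results"}},
    "minus_value": {
        "bucket_script": {
            "buckets_path": {"qid": "total_query_id", "nrs": "total_num_results"},
            "script": "qid - nrs",
        }
    },
}

body = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": wrap_in_global(top_level_aggs),
}
print(json.dumps(body, indent=2))
```

The bucket_script and the two sums it references end up as siblings inside the same parent, which is what its relative buckets_path entries rely on.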
