Elasticsearch - 2nd level nested aggregation incorrect doc_count

Elasticsearch 7.10.1, nested aggregations.
Can anyone point me to why the doc_count on my second nested aggregation is not correct?
The count on the first aggregation is accurate, but the second isn't (both are keyword fields).
{
"size": 0,
"_source": false,
"query": {
"match_all": {}
},
"aggs": {
"products": {
"nested": {
"path": "productsImpacted"
},
"aggs": {
"field1": {
"terms": {
"field": "productsImpacted.product.keyword",
"size": 1000
},
"aggs": {
"resellers": {
"nested": {
"path": "requestType"
},
"aggs": {
"field2": {
"terms": {
"field": "requestType.type.keyword",
"size": 1000
}
}
}
}
}
}
}
}
}
}
Thanks,

Elasticsearch terms aggregation counts are approximate.
You can raise size and shard_size to improve accuracy, at the cost of performance. See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-shard-size
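
As a rough illustration of the two settings mentioned above, here is a small Python sketch that builds a terms aggregation body with an explicit shard_size. The helper names (default_shard_size, terms_agg) are made up for this example. size, shard_size and show_term_doc_count_error are terms aggregation parameters, and the default formula is the one given in the linked documentation (size * 1.5 + 10). Whether a larger shard_size fixes the counts in the question is an assumption that would need checking against the real index.

```python
from typing import Optional


def default_shard_size(size: int) -> int:
    """shard_size the terms aggregation uses when none is given: size * 1.5 + 10."""
    return int(size * 1.5 + 10)


def terms_agg(field: str, size: int, shard_size: Optional[int] = None) -> dict:
    """Build a terms aggregation body (hypothetical helper, not a client API)."""
    if shard_size is None:
        shard_size = default_shard_size(size)
    return {
        "terms": {
            "field": field,
            "size": size,
            # More candidates per shard: more accurate counts, more work per shard.
            "shard_size": shard_size,
            # Ask Elasticsearch to report the worst-case error for each bucket.
            "show_term_doc_count_error": True,
        }
    }


agg = terms_agg("requestType.type.keyword", size=1000, shard_size=5000)
```

If doc_count_error_upper_bound comes back as 0 for every bucket, the terms counts themselves are exact and the discrepancy has to come from somewhere else.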

Related

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{
"terms": {
"type": ["manager", "lead"]
}
}
]
}
}
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.
I suggest using the bucket_script pipeline aggregation with buckets_path. As far as I know, this aggregation has to run as a sub-aggregation of a multi-bucket aggregation such as a histogram. Since your mapping has such a field (my_datetime in the query below), I think this query should work for you:
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "my_datetime",
"interval": "month"
},
"aggs": {
"role_type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"role_1_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_1 / (params.role_1+params.role_2)*100"
}
},
"role_2_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_2 / (params.role_1+params.role_2)*100"
}
}
}
}
}
}
Please let me know if it didn't work well for you.
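
For comparison, the same arithmetic can be done on the client from the role_type buckets of a single histogram bucket. This is a minimal sketch, assuming the buckets have the usual key/doc_count shape. The function name role_ratios and the sample counts are made up for illustration. Like the bucket_script above, it only reports the ratio and does not change which hits are returned.

```python
from typing import Dict, List, Optional


def role_ratios(buckets: List[dict], first: str = "manager",
                second: str = "lead") -> Optional[Dict[str, float]]:
    """Percentage share of two terms buckets, mirroring the bucket_script."""
    counts = {b["key"]: b["doc_count"] for b in buckets}
    a = counts.get(first, 0)
    b = counts.get(second, 0)
    total = a + b
    if total == 0:
        # Neither role is present, so the ratio is undefined.
        return None
    return {first: 100.0 * a / total, second: 100.0 * b / total}


# Hypothetical bucket list for one month.
sample_buckets = [
    {"key": "manager", "doc_count": 8},
    {"key": "lead", "doc_count": 2},
]
ratios = role_ratios(sample_buckets)
```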

How to convert ElasticSearch query to ES7

We are having a tremendous amount of trouble converting an old ElasticSearch query to a newer version of ElasticSearch. The original query for ES 1.8 is:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*",
"default_operator": "AND"
}
},
"filter": {
"and": [
{
"terms": {
"organization_id": [
"fred"
]
}
}
]
}
}
},
"size": 50,
"sort": {
"updated": "desc"
},
"aggs": {
"status": {
"terms": {
"size": 0,
"field": "status"
}
},
"tags": {
"terms": {
"size": 0,
"field": "tags"
}
}
}
}
and we are trying to convert it to ES version 7. Does anyone know how to do that?
The Elasticsearch docs for the Filtered query in 6.8 (the latest version of the docs I can find that has the page) state that you should move the query and filter to the must and filter parameters of the bool query.
Also, the terms aggregation no longer supports setting size to 0 to get Integer.MAX_VALUE. If you really want all the terms, you need to set it to the max value (2147483647) explicitly. However, the documentation for Size recommends using the Composite aggregation and paginating instead.
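
To give an idea of what paginating with the Composite aggregation looks like, here is a minimal Python sketch. The search argument is any callable that takes a request body and returns the parsed response. The names iter_terms and make_fake_search, and the in-memory stand-in used to exercise the loop, are assumptions for illustration and not part of any client library.

```python
from typing import Callable, Dict, Iterator, Tuple


def iter_terms(search: Callable[[dict], dict], field: str,
               page_size: int = 1000,
               agg_name: str = "all_terms") -> Iterator[Tuple[str, int]]:
    """Yield (term, doc_count) for every term, one composite page at a time."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"term": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last key already seen
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        buckets = result["buckets"]
        for bucket in buckets:
            yield bucket["key"]["term"], bucket["doc_count"]
        after = result.get("after_key")
        if not buckets or after is None:
            return


def make_fake_search(counts: Dict[str, int]) -> Callable[[dict], dict]:
    """In-memory stand-in for a search call, used only to exercise the loop."""
    keys = sorted(counts)

    def search(body: dict) -> dict:
        (name, agg), = body["aggs"].items()
        comp = agg["composite"]
        after = comp.get("after", {}).get("term")
        remaining = [k for k in keys if after is None or k > after]
        page = remaining[:comp["size"]]
        result = {"buckets": [{"key": {"term": k}, "doc_count": counts[k]}
                              for k in page]}
        if page:
            result["after_key"] = {"term": page[-1]}
        return {"aggregations": {name: result}}

    return search


statuses = list(iter_terms(make_fake_search({"open": 3, "closed": 5, "new": 1}),
                           "status", page_size=2))
```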
Below is the closest query I could make to the original that will work with Elasticsearch 7.
{
"query": {
"bool": {
"must": {
"query_string": {
"query": "*",
"default_operator": "AND"
}
},
"filter": {
"terms": {
"organization_id": [
"fred"
]
}
}
}
},
"size": 50,
"sort": {
"updated": "desc"
},
"aggs": {
"status": {
"terms": {
"size": 2147483647,
"field": "status"
}
},
"tags": {
"terms": {
"size": 2147483647,
"field": "tags"
}
}
}
}
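
If there are many such queries to migrate, the rewrite described above can be scripted. The following Python sketch only handles the shape used in this question: a filtered query with an optional query and filter, where the filter may be wrapped in the old and filter given as a list. The function name filtered_to_bool is made up for this example.

```python
def filtered_to_bool(query: dict) -> dict:
    """Rewrite a legacy {"filtered": ...} query as a bool query."""
    if "filtered" not in query:
        return query  # nothing to convert
    inner = query["filtered"]
    clauses = {}
    if "query" in inner:
        clauses["must"] = inner["query"]
    if "filter" in inner:
        flt = inner["filter"]
        # The old "and" filter wrapped a list of filters; bool.filter takes
        # a single clause or a list directly, so unwrap it.
        if isinstance(flt, dict) and set(flt) == {"and"} and isinstance(flt["and"], list):
            flt = flt["and"]
            if len(flt) == 1:
                flt = flt[0]
        clauses["filter"] = flt
    return {"bool": clauses}


old_query = {
    "filtered": {
        "query": {"query_string": {"query": "*", "default_operator": "AND"}},
        "filter": {"and": [{"terms": {"organization_id": ["fred"]}}]},
    }
}
new_query = filtered_to_bool(old_query)
```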

Filter Elasticsearch Aggregation by Bucket Key Value

I have an Elasticsearch index of documents in which there is a field that contains a list of URLs. Aggregating on this field gives me the count of unique URLs, as expected.
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
}
}
}
I then want to filter out the buckets whose keys do not contain a certain string. I've tried doing so with the Bucket Selector Aggregation.
This attempt:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
},
"links_key_filter": {
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
}
Fails with:
Invalid pipeline aggregation named [links_key_filter] of type
[bucket_selector]. Only sibling pipeline aggregations are allowed at
the top level
Putting the bucket selector inside the links aggregation, like so:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
},
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
}
fails with:
Found two aggregation type definitions in [links]: [terms] and [bucket_selector]
I'm going to keep tinkering but am a bit stuck at the moment :(
You won't be able to use the bucket_selector because its buckets_path
must reference either a number value or a single value numeric metric aggregation [source],
and what a terms aggregation produces is denoted as StringTerms. That simply won't work, regardless of whether you force a placeholder multibucket aggregation or not.
Having said that, each terms aggregation supports the exclude filter.
Assuming that your links are arrays of keywords:
POST models/_doc/1
{
"links": [
"google.com",
"wikipedia.org"
]
}
POST models/_doc/2
{
"links": [
"reddit.com",
"google.com"
]
}
and you'd like to group everything except reddit, you can use the following regex:
POST models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"exclude": ".*reddit.*",
"size": 10
}
}
}
}
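
To see what the exclude pattern does to the two sample documents, here is a small in-memory imitation in Python. It is only an illustration: Python's re module stands in for the Lucene regex engine that Elasticsearch uses (the two differ in syntax for more complex patterns), and the function name terms_with_exclude is made up. As in the terms aggregation, the pattern has to match the whole term.

```python
import re
from collections import Counter
from typing import List, Tuple


def terms_with_exclude(docs: List[dict], field: str,
                       exclude: str) -> List[Tuple[str, int]]:
    """Count documents per term, skipping terms that fully match `exclude`."""
    pattern = re.compile(exclude)
    counts: Counter = Counter()
    for doc in docs:
        # A document counts once per distinct term, like doc_count.
        for term in set(doc.get(field, [])):
            if not pattern.fullmatch(term):
                counts[term] += 1
    return counts.most_common()


docs = [
    {"links": ["google.com", "wikipedia.org"]},
    {"links": ["reddit.com", "google.com"]},
]
buckets = terms_with_exclude(docs, "links", ".*reddit.*")
```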
By the way, there are some non-trivial implications arising from the use of such regexes, especially in a case-sensitive scenario in which you'd need a query-time-generated regex, as discussed in "How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?"
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
},
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
Your selector should come one level up: it should sit directly in the aggs, parallel to your selector group.
I am not sure about the key filtering.
You can use "_key" to get keys:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
},
"bucket_selector": {
"buckets_path": {
"key": "_key"
},
"script": "!params.key.contains('foo')"
}
}
}
}

Compute the "fill rate" of a field in Elasticsearch

I would like to compute the ratio of fields that have a value in my index.
I managed to count how many documents miss the field:
GET profiles/_search
{
"aggs": {
"profiles_wo_country": {
"missing": {
"field": "country"
}
}
},
"size": 0
}
I also managed to count how many documents have the field:
GET profiles/_search
{
"query": {
"filtered": {
"query": {"match_all": {}},
"filter": {
"exists": {
"field": "country"
}
}
}
},
"size": 0
}
Naturally I can also get the total number of documents in the index. How can I compute the ratio?
An easy way to get the numbers you need is the following query:
POST profiles/_search?filter_path=hits.total,aggregations.existing.doc_count
{
"size": 0,
"aggs": {
"existing": {
"filter": {
"exists": {
"field": "tag"
}
}
}
}
}
You'll get a response like this one:
{
"hits": {
"total": 37258601
},
"aggregations": {
"existing": {
"doc_count": 9287160
}
}
}
And then in your client code, you can simply do
fill_rate = (aggregations.existing.doc_count / hits.total) * 100
And you're good to go.
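
A slightly more defensive version of that client-side step, as a Python sketch. It assumes the response shape shown above. On Elasticsearch 7 and later, hits.total is an object with value and relation fields rather than a plain number, and value is capped at 10000 unless track_total_hits is set to true in the request, so the sketch accepts both forms. The function name fill_rate is made up for this example.

```python
def fill_rate(response: dict) -> float:
    """Percentage of documents that have the field, from the response above."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # ES 7+ shape: {"value": ..., "relation": "eq" or "gte"}
        total = total["value"]
    existing = response["aggregations"]["existing"]["doc_count"]
    if total == 0:
        return 0.0  # empty index: avoid dividing by zero
    return existing / total * 100


response = {
    "hits": {"total": 37258601},
    "aggregations": {"existing": {"doc_count": 9287160}},
}
rate = fill_rate(response)
```

With the numbers from the sample response this comes out at roughly 24.9 percent.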

Filter/Query support in Elasticsearch Top hits Aggregation

The Elasticsearch documentation states that "the top_hits aggregation returns regular search hits, because of this many per hit features can be supported". Crucially, the list includes "Named filters and queries".
But trying to add any filter or query throws SearchParseException: Unknown key for a START_OBJECT.
Use case: I have items which have list of nested comments
items{id} -> comments {date, rating}
I want to get top rated comment for each item in the last week.
{
"query": {
"match_all": {}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"comment": {
"nested": {
"path": "comments"
},
"aggs": {
"top_comment": {
"top_hits": {
"size": 1,
//need filter here to select only comments of last week
"sort": {
"comments.rating": {
"order": "desc"
}
}
}
}
}
}
}
}
}
}
So is the documentation wrong, or is there any way to add a filter?
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-top-hits-aggregation.html
Are you sure you have mapped them as Nested? I've just tried to execute such a query on my data and it worked fine.
If so, you could simply add a filter aggregation, right after nested aggregation (hopefully I haven't messed up curly brackets):
POST data/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "comments",
"query": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
}
}
}
}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"nested": {
"nested": {
"path": "comments"
},
"aggs": {
"filterComments": {
"filter": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
},
"aggs": {
"topComments": {
"top_hits": {
"size": 1,
"sort": {
"comments.rating": "desc"
}
}
}
}
}
}
}
}
}
}
}
P.S. Always include FULL path for nested objects.
So this query will:
1. Filter documents that have comments younger than one week, to narrow down the documents for aggregation to those that actually have such comments (filtered query)
2. Do a terms aggregation based on the id field
3. Open the nested sub-documents (comments)
4. Filter them by date
5. Return the most badass one (highest rated)
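
The steps above can be imitated in memory to make the intent concrete. This Python sketch is not how Elasticsearch executes the query. It just applies the same logic (keep comments from the last week, then take the highest rated one per item) to a hypothetical list of items. The function name, the sample data and the fixed now value are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def top_recent_comment(items: List[dict], now: datetime,
                       window: timedelta = timedelta(weeks=1)) -> Dict[str, dict]:
    """Map item id -> highest rated comment dated within `window` of `now`."""
    result = {}
    for item in items:
        # Filter the nested comments by date (the filter aggregation).
        recent = [c for c in item["comments"]
                  if now - window <= c["date"] <= now]
        # Items without recent comments are dropped (the outer nested filter).
        if recent:
            # top_hits with size 1, sorted by rating descending.
            result[item["id"]] = max(recent, key=lambda c: c["rating"])
    return result


now = datetime(2024, 1, 15)
items = [
    {"id": "a", "comments": [
        {"date": datetime(2024, 1, 14), "rating": 3},
        {"date": datetime(2024, 1, 10), "rating": 5},
        {"date": datetime(2023, 12, 1), "rating": 9},  # too old, ignored
    ]},
    {"id": "b", "comments": [
        {"date": datetime(2023, 11, 1), "rating": 4},  # too old, item dropped
    ]},
]
top = top_recent_comment(items, now)
```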
