Retrieve document frequency for terms in query result with aggregations - elasticsearch

For some of my queries to ElasticSearch I want three pieces of information back:
1. Which terms T occurred in the result document set?
2. How often does each element of T occur in the result document set?
3. How often does each element of T occur in the entire index (i.e. its document frequency)?
The first two points are easily determined using the default terms facet or, nowadays, the terms aggregation.
So my question is really about the third point.
Before ElasticSearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a term facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of the exact terms occurring in the document set specified by the QueryFilter.
At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is - if I understand correctly - that the original facet mechanism was centered around terms, whereas aggregation buckets are defined by the set of documents belonging to each bucket.
That is, specifying the global option on a terms facet together with a QueryFilter first determined the terms hit by the filter and then computed the facet values. Since the facet was global, I received the index-wide document counts.
With aggregations, it's different. The global aggregation can only be used as a top-level aggregation, and it causes the aggregation to ignore the current query results and compute its sub-aggregations - e.g. a terms aggregation - on all documents in the index. For me that is too much, since I WANT to restrict the returned terms ('buckets') to the terms in the document result set. But if I use a filter sub-aggregation with a terms sub-aggregation below it, I restrict the term buckets to the filter again, thus retrieving normal facet counts instead of document frequencies: the buckets are computed after the filter, so their counts are "too small". I don't want to restrict the counts inside the buckets; I want to restrict the set of buckets to the terms in the query result set.
How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?
Thanks for your time!
EDIT: Here is an example of how I tried to achieve the desired behaviour.
I will define two aggregations:
global_agg_with_filter_and_terms
global_agg_with_terms_and_filter
Both have a global aggregation at the top, because that is the only valid position for it. In the first aggregation, I first filter the documents down to the original query and then apply a terms sub-aggregation.
In the second aggregation I do much the same, except that the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names; only the nesting order differs.
{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "global_agg_with_filter_and_terms": {
      "global": {},
      "aggs": {
        "filter_agg": {
          "filter": {
            "query": {
              "query_string": {
                "query": "text: my query string"
              }
            }
          },
          "aggs": {
            "terms_agg": {
              "terms": {
                "field": "facets"
              }
            }
          }
        }
      }
    },
    "global_agg_with_terms_and_filter": {
      "global": {},
      "aggs": {
        "document_frequency": {
          "terms": {
            "field": "facets"
          },
          "aggs": {
            "term_count": {
              "filter": {
                "query": {
                  "query_string": {
                    "query": "text: my query string"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 221,
    "max_score": 0.9839197,
    "hits": <omitted>
  },
  "aggregations": {
    "global_agg_with_filter_and_terms": {
      "doc_count": 1978,
      "filter_agg": {
        "doc_count": 221,
        "terms_agg": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": "fid8", "doc_count": 155 },
            { "key": "fid6", "doc_count": 40 },
            { "key": "fid9", "doc_count": 10 },
            { "key": "fid5", "doc_count": 9 },
            { "key": "fid13", "doc_count": 5 },
            { "key": "fid7", "doc_count": 2 }
          ]
        }
      }
    },
    "global_agg_with_terms_and_filter": {
      "doc_count": 1978,
      "document_frequency": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          { "key": "fid8", "doc_count": 1050, "term_count": { "doc_count": 155 } },
          { "key": "fid6", "doc_count": 668, "term_count": { "doc_count": 40 } },
          { "key": "fid9", "doc_count": 67, "term_count": { "doc_count": 10 } },
          { "key": "fid5", "doc_count": 65, "term_count": { "doc_count": 9 } },
          { "key": "fid7", "doc_count": 63, "term_count": { "doc_count": 2 } },
          { "key": "fid13", "doc_count": 55, "term_count": { "doc_count": 5 } },
          { "key": "fid10", "doc_count": 11, "term_count": { "doc_count": 0 } },
          { "key": "fid11", "doc_count": 9, "term_count": { "doc_count": 0 } },
          { "key": "fid12", "doc_count": 5, "term_count": { "doc_count": 0 } }
        ]
      }
    }
  }
}
First, please have a look at the first two term buckets returned by both aggregations, with keys fid8 and fid6. We can easily see that those terms appear in the result set 155 and 40 times, respectively. Now look at the second aggregation, global_agg_with_terms_and_filter. The terms aggregation is within the scope of the global aggregation, so here we actually see the document frequencies, 1050 and 668, respectively. So this part looks good. The issue arises when you scan the list of term buckets further down, to the buckets with keys fid10 to fid12. While we receive their document frequency, we can also see that their term_count is 0. That is because those terms do not occur in the documents matched by our query, which we also used for the filter sub-aggregation. So the problem is that the document frequency and the filtered facet count are returned for ALL terms (global scope!), whereas I need them for exactly the terms that occur in the query result set, i.e. for the terms returned by the first aggregation, global_agg_with_filter_and_terms.
Perhaps there is a possibility to define some kind of filter that removes all buckets whose sub-filter aggregation term_count has a doc_count of zero?
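(A note for later readers: on newer Elasticsearch versions - pipeline aggregations do not exist in the 1.x this question targets - a bucket_selector pipeline aggregation should be able to express exactly this pruning. A minimal sketch in ES 5+ syntax, reusing the names from the request above; note that in 2.x+ the filter aggregation takes the query directly, hence no query wrapper:)
{
  "global_agg_with_terms_and_filter": {
    "global": {},
    "aggs": {
      "document_frequency": {
        "terms": { "field": "facets" },
        "aggs": {
          "term_count": {
            "filter": {
              "query_string": { "query": "text: my query string" }
            }
          },
          "terms_in_result_set_only": {
            "bucket_selector": {
              "buckets_path": { "termCount": "term_count>_count" },
              "script": "params.termCount > 0"
            }
          }
        }
      }
    }
  }
}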

Hello and sorry if the answer is late.
You should have a look at the Significant Terms aggregation: like the terms aggregation, it returns one bucket for each term occurring in the results set, with the number of occurrences available through doc_count, but you also get the number of occurrences in a background set through bg_count. This means it only creates buckets for terms appearing in documents of your query results set.
The default background set comprises all documents in the query scope, but it can be narrowed down to any subset you want using background_filter.
You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:
_subset_freq: the number of documents in the results set in which the term appears,
_superset_freq: the number of documents in the background set in which the term appears,
_subset_size: the number of documents in the results set,
_superset_size: the number of documents in the background set.
Request:
{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "terms": {
      "significant_terms": {
        "field": "facets",
        "size": 100,
        "script_heuristic": {
          "script": "_subset_freq"
        }
      }
    }
  }
}
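(And if the document frequencies should be computed against a subset of the index rather than all of it, a background_filter sketch could look like the following; the filter itself is only illustrative:)
{
  "aggs": {
    "terms": {
      "significant_terms": {
        "field": "facets",
        "size": 100,
        "background_filter": {
          "term": { "text": "book" }
        }
      }
    }
  }
}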

Related

Elasticsearch Terms Aggregation on array -- filter to buckets that match your query?

I'm using an Elasticsearch terms aggregation to bucket documents based on an array property on each document. I'm running into an issue where I get back buckets that are not in my query, and I'd like to filter those out.
Let's say each document is a Post, and has an array property media which specifies which social media website the post is on (and may be empty):
{
  "id": 1,
  "media": ["facebook", "twitter", "instagram"]
}
{
  "id": 2,
  "media": ["twitter", "instagram", "tiktok"]
}
{
  "id": 3,
  "media": ["instagram"]
}
{
  "id": 4,
  "media": []
}
And, let's say there's another index of Users, which stores a favorite_media property of the same type.
{
  "id": 42,
  "favorite_media": ["twitter", "instagram"]
}
I have a query that uses a terms lookup to filter, then does a terms aggregation.
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "media": {
            "index": "user_index",
            "id": "42",
            "path": "favorite_media"
          }
        }
      }
    }
  },
  "aggs": {
    "Posts_by_media": {
      "terms": {
        "field": "media",
        "size": 1000
      }
    }
  }
}
This will result in:
{
  ...
  "aggregations": {
    "Posts_by_media": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "instagram", "doc_count": 3 },
        { "key": "twitter", "doc_count": 2 },
        { "key": "facebook", "doc_count": 1 },
        { "key": "tiktok", "doc_count": 1 }
      ]
    }
  }
}
Because media is an array property, every value from any document that matches the filter is used to create buckets, so I get buckets that don't match my filter. Here I want to get back only the buckets twitter and instagram, since those are the two values I'm filtering on (via the terms lookup).
I know terms aggregations offer an include option, but that doesn't work for me here, since I'm using a terms lookup and don't know the contents of favorite_media at query time.
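(For reference, the include option would look like the sketch below if the values were known up front; here they are hard-coded, which is exactly what the terms lookup is meant to avoid:)
{
  "aggs": {
    "Posts_by_media": {
      "terms": {
        "field": "media",
        "size": 1000,
        "include": ["twitter", "instagram"]
      }
    }
  }
}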
How can I limit my buckets to be only those that match the filters in my query?
Thank you for your help!

Issue with date histogram aggregation with filter

I am trying to get a date histogram for a timestamp field over a specific period. I am using the following query:
{
  "size": 0,
  "aggs": {
    "dataRange": {
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "2020-02-28T17:20:10Z",
            "lte": "2020-03-01T18:00:00Z"
          }
        }
      },
      "aggs": {
        "severity_over_time": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "28m"
          }
        }
      }
    }
  }
}
This is the result I got:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 32,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dataRange": {
      "doc_count": 20,
      "severity_over_time": {
        "buckets": [
          {
            "key_as_string": "2020-02-28T17:04:00.000Z",
            "key": 1582909440000,
            "doc_count": 20
          }
        ]
      }
    }
  }
}
The start of the histogram range ("key_as_string") falls outside my filter criteria! My filter starts at "2020-02-28T17:20:10Z", but the key_as_string in the result is "2020-02-28T17:04:00.000Z", which is outside the range filter!
I tried looking at the docs, but to no avail. Am I missing something here?
I guess this has to do with the way bucket boundaries are calculated. My understanding is that the 28m interval has to be maintained throughout, i.e. the bucket size must be consistent: date_histogram buckets are aligned to fixed multiples of the interval counted from the epoch, not to the start of your filter range. Indeed, the returned key 1582909440000 ms equals 26,381,824 minutes, an exact multiple of 28 minutes, which is why the first bucket starts at 17:04:00 rather than at 17:20:10.
Notice that the 28m spacing is maintained perfectly; the first and last buckets are in effect stretched to accommodate this alignment.
Notice also that, logically, your result documents all land in the right buckets, and documents outside the filter range are not counted by the aggregation, irrespective of whether a bucket's key_as_string appears to include them.
Basically, ES does not guarantee that the bucket boundaries, i.e. the key_as_string or start and end values of the buckets it creates, fall exactly within the scope of the filter you provided, but it does guarantee that only the documents matching that range filter are considered for evaluation.
You can think of the bucket boundaries as the nearest possible aligned values, i.e. approximations.
If you want to be sure about the filtered documents, just remove the filter from the aggregation, use it in the query instead (as below), and remove "size": 0 so you can also see the matching hits.
Notice I've made use of offset, which shifts the start value of each bucket. Perhaps that is what you are looking for.
One more thing: I've made use of min_doc_count so you can filter out empty buckets.
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "2020-02-28T17:20:10Z",
              "lte": "2020-03-01T18:00:01Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "severity_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "28m",
        "offset": "+11h",
        "min_doc_count": 1
      }
    }
  }
}

Subaggregation leads to missing data

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?
Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:
{
  "size": 0,
  "aggs": {
    "outer_docs": {
      "terms": { "size": 20, "field": "field_1_to_aggregate_on" },
      "aggs": {
        "inner_docs": {
          "terms": { "size": 10000, "field": "field_2_to_aggregate_on" },
          "aggs": "things to display here"
        }
      }
    }
  }
}
If I execute this query, I do not receive all the inner_docs associated with some of the outer_docs. In the output below, there are three inner docs for outer doc key_1.
{
  "hits": {
    "total": 9853,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "outer_docs": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 9801,
      "buckets": [
        {
          "key": "key_1",
          "doc_count": 3,
          "inner_docs": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              { "key": "1", "doc_count": 1, "some": "data here" },
              ...
              { "key": "3", "doc_count": 1, "some": "data here" }
            ]
          }
        },
        ...
      ]
    }
  }
}
Now I add a query to select a single outer_doc that would have been among the first 20 anyway.
"query": {"bool": {"must": [{'term': {'field_1_to_aggregate_on': 'key_1'}}]}}
In this case, I do get all inner_docs: in the output below, there are seven inner docs for outer doc key_1.
{
  "hits": {
    "total": 8,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "outer_docs": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 9801,
      "buckets": [
        {
          "key": "key_1",
          "doc_count": 8,
          "inner_docs": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              { "key": "1", "doc_count": 1, "some": "data here" },
              ...
              { "key": "7", "doc_count": 2, "some": "data here" }
            ]
          }
        },
        ...
      ]
    }
  }
}
I have specified explicitly that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all data?
This is my version information:
{
  'build_date': '2018-09-26T13:34:09.098244Z',
  'build_flavor': 'default',
  'build_hash': '04711c2',
  'build_snapshot': False,
  'build_type': 'deb',
  'lucene_version': '7.4.0',
  'minimum_index_compatibility_version': '5.0.0',
  'minimum_wire_compatibility_version': '5.6.0',
  'number': '6.4.2'
}
EDIT: After digging a bit more, I found out that the issue is not related to subaggregation but to aggregation itself and the use of shards. I have opened a bug report for Elastic about it:
https://discuss.elastic.co/t/bug-in-aggregation-result-when-using-shards/164161
https://github.com/elastic/elasticsearch/issues/37425
Check your Elasticsearch deprecation logfile. You will probably find some warnings like this:
This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
search.max_buckets is a dynamic cluster setting that defaults to 10,000 buckets in 7.0.
Now, this is not documented anywhere, but in my experience allocating over 10,000 buckets results in the termination of your query, while you still get back the results computed up to that moment. This explains the missing data in your result.
Using the composite aggregation will help. Your other option is to increase search.max_buckets, but be careful with that: you can crash your entire cluster that way, because there is a memory cost for every bucket. It does not matter whether you actually use all the allocated buckets; you can crash with empty buckets alone.
See:
https://www.elastic.co/guide/en/elasticsearch/reference/master/breaking-changes-7.0.html#_literal_search_max_buckets_literal_in_the_cluster_setting
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html
https://github.com/elastic/elasticsearch/issues/35896
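If you do decide to raise the limit, a sketch of the dynamic settings update (the value is illustrative; keep the RAM warning above in mind):
PUT /_cluster/settings
{
  "transient": {
    "search.max_buckets": 20000
  }
}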
How about using the composite aggregation for this? Pretty sure that solves your problem.
GET /_search
{
  "aggs": {
    "all_docs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
          { "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
        ]
      }
    }
  }
}
If you have many buckets, the composite aggregation will help you scroll through each of them using size/after.
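Each response then carries an after_key that you pass back as after to fetch the next page. A sketch (the key values are illustrative):
GET /_search
{
  "aggs": {
    "all_docs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
          { "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
        ],
        "after": { "outer_docs": "key_1", "inner_docs": "3" }
      }
    }
  }
}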
It turned out that the problem was not due to subaggregation; it is actually expected ElasticSearch behaviour. We are using 5 shards, and when an index has multiple shards, terms aggregations return approximate results.
We have made this problem reproducible, and posted it in the Elastic discuss forum. There, we learned that aggregations do not always return all data, with a link to the documentation where this is explained in more detail.
We also learned that using only 1 shard solves the issue, and when that is not possible, the parameter shard_size can alleviate the problem.
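For reference, a sketch of what that parameter looks like on the outer terms aggregation (the value is illustrative; it raises how many candidate buckets each shard reports to the coordinating node):
{
  "size": 0,
  "aggs": {
    "outer_docs": {
      "terms": {
        "size": 20,
        "shard_size": 10000,
        "field": "field_1_to_aggregate_on"
      }
    }
  }
}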

Elasticsearch filter based on field similarity

For reference, I'm using Elasticsearch 6.4.0
I have an Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits with text field values that are too similar. My query is:
{
  "size": 10,
  "collapse": {
    "field": "author_id"
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "score_mode": "avg",
      "functions": [
        {
          //my custom query function
        }
      ],
      "query": {
        "bool": {
          "must_not": [
            {
              "term": {
                "author_id": MY_ID
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "book_name_sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "frequent_words": {
          "significant_text": {
            "field": "book_name",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). The thing is, for some people it returns books with very similar names (e.g. The Life of George Washington, Good Times with George Washington, Who Was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using the sampler and significant_text aggregations above to group the hits based on text similarity, and the query gives me something like:
...,
"aggregations": {
  "book_name_sample": {
    "doc_count": 10,
    "frequent_words": {
      "doc_count": 10,
      "bg_count": 482626,
      "buckets": [
        {
          "key": "George",
          "doc_count": 3,
          "score": 17.278715785140975,
          "bg_count": 9718
        },
        {
          "key": "Washington",
          "doc_count": 3,
          "score": 15.312204414323656,
          "bg_count": 10919
        }
      ]
    }
  }
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch, i.e. remove hits whose book_name_sample doc_count is less than X? I know I can do this in PHP or whatever language uses the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregator like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents that the hit count drops below the requested size, is it possible to tell ES to fetch the next top-scoring hits so that the hit count is filled out?
Why not use a top_hits aggregation inside the bucketing aggregation to get the relevant documents that match each bucket? You can specify how many relevant top hits you want inside the top_hits aggregation, so this basically gives you a certain number of documents for each bucket. A sketch follows.
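(A minimal sketch of that idea. Note that significant_text does not support sub-aggregations, so this assumes a keyword subfield book_name.keyword exists and switches to significant_terms to host the top_hits:)
{
  "size": 0,
  "aggs": {
    "book_name_sample": {
      "sampler": { "shard_size": 10 },
      "aggs": {
        "frequent_words": {
          "significant_terms": { "field": "book_name.keyword" },
          "aggs": {
            "representative_hit": {
              "top_hits": { "size": 1, "_source": ["book_name", "author_id"] }
            }
          }
        }
      }
    }
  }
}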

Elasticsearch: limiting the records that are aggregated

I am running an Elasticsearch query with an aggregation, which I intend to limit to, say, 100 records. The problem is that even when I apply the "size" parameter, it has no effect on the aggregation.
GET /index_name/index_type/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "city": {
      "terms": {
        "field": "city"
      }
    }
  }
}
The result set is
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 10867,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "city": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "Mumbai", "doc_count": 2706 },
        { "key": "London", "doc_count": 2700 },
        { "key": "Patna", "doc_count": 1800 },
        { "key": "New York", "doc_count": 1800 },
        { "key": "Melbourne", "doc_count": 900 }
      ]
    }
  }
}
As you can see, the size setting has no effect on the records over which the aggregation is performed. Is there a way to aggregate over only, say, the top 100 records in Elasticsearch?
Search operations in Elasticsearch are performed in two phases: query and fetch. During the first phase, Elasticsearch obtains the results from all shards, sorts them, and determines which records should be returned; those records are then retrieved during the second phase. The size parameter controls the number of records that are returned to you in the response. Aggregations are executed during the first phase, before Elasticsearch actually knows which records will need to be retrieved, and they are always executed on all records matching the search. So it's not possible to limit aggregations by the number of returned results. If you want to limit the scope of aggregation execution, you need to narrow the search query itself instead of changing retrieval parameters. For example, if you add a filter to your search query that only includes records from the last year, aggregations will be executed on that filtered set only.
It's also possible to limit the number of records that are analyzed on each shard using the terminate_after parameter; however, you will have no control over which records are included and which are not, so this option is most likely not what you want. Both options are sketched below.
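(A sketch of both options together; the created_at field name is illustrative. The range filter narrows the document set the aggregation sees, and terminate_after caps how many documents each shard collects:)
GET /index_name/_search
{
  "size": 0,
  "terminate_after": 100,
  "query": {
    "range": {
      "created_at": { "gte": "now-1y" }
    }
  },
  "aggregations": {
    "city": {
      "terms": { "field": "city" }
    }
  }
}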
