Subaggregation leads to missing data - elasticsearch

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?
Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:
{
  "size": 0,
  "aggs": {
    "outer_docs": {
      "terms": {"size": 20, "field": "field_1_to_aggregate_on"},
      "aggs": {
        "inner_docs": {
          "terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
          "aggs": "things to display here"
        }
      }
    }
  }
}
If I execute this query, I do not receive all of the inner_docs associated with some outer_docs. In the output below, there are three inner docs for outer doc key_1.
{
  "hits": {
    "total": 9853,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "outer_docs": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 9801,
      "buckets": [
        {
          "key": "key_1",
          "doc_count": 3,
          "inner_docs": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {"key": "1", "doc_count": 1, "some": "data here"},
              ...
              {"key": "3", "doc_count": 1, "some": "data here"}
            ]
          }
        },
        ...
      ]
    }
  }
}
Now, I add a query to select a single outer_doc that would have been in the first 20 anyway.
"query": {"bool": {"must": [{'term': {'field_1_to_aggregate_on': 'key_1'}}]}}
In this case, I do get all inner_docs: in the output below, there are seven inner docs for outer doc key_1.
{
  "hits": {
    "total": 8,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "outer_docs": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 9801,
      "buckets": [
        {
          "key": "key_1",
          "doc_count": 8,
          "inner_docs": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {"key": "1", "doc_count": 1, "some": "data here"},
              ...
              {"key": "7", "doc_count": 2, "some": "data here"}
            ]
          }
        },
        ...
      ]
    }
  }
}
I have specified explicitly that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all data?
This is my version information:
{
  'build_date': '2018-09-26T13:34:09.098244Z',
  'build_flavor': 'default',
  'build_hash': '04711c2',
  'build_snapshot': False,
  'build_type': 'deb',
  'lucene_version': '7.4.0',
  'minimum_index_compatibility_version': '5.0.0',
  'minimum_wire_compatibility_version': '5.6.0',
  'number': '6.4.2'
}
EDIT: After digging a bit more, I found out that the issue was not related to subaggregation, but to aggregation itself and the usage of shards. I have opened this bug report for Elastic about it:
https://discuss.elastic.co/t/bug-in-aggregation-result-when-using-shards/164161
https://github.com/elastic/elasticsearch/issues/37425

Check your Elasticsearch deprecation log file. You will probably have some warnings like this:
This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
search.max_buckets is a dynamic cluster setting that defaults to 10,000 buckets in 7.0.
Now, this is not documented anywhere, but in my experience: allocating over 10,000 buckets results in the termination of your query, but you still get back the results that were computed up to that moment. This explains the missing data in your result.
Using the composite aggregation will help; your other option is to increase search.max_buckets. Be careful with that: you can crash your entire cluster this way, because every bucket has a cost (RAM). It does not matter whether you actually use all the allocated buckets; you can crash with empty buckets alone.
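If you do decide to raise the limit, it can be changed at runtime via the cluster settings API. A minimal sketch (the value 20000 is purely illustrative; pick something appropriate for your heap):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000
  }
}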
See:
https://www.elastic.co/guide/en/elasticsearch/reference/master/breaking-changes-7.0.html#_literal_search_max_buckets_literal_in_the_cluster_setting
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html
https://github.com/elastic/elasticsearch/issues/35896

How about using the composite aggregation for this? Pretty sure that solves your problem.
GET /_search
{
  "aggs": {
    "all_docs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
          { "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
        ]
      }
    }
  }
}
If you have many buckets, the composite aggregation will help you scroll through each of them using size/after.
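Each response contains the composite key of the last returned bucket (exposed as after_key in recent versions), which you feed back as after in the next request to get the following page. A minimal sketch of such a follow-up request (the after values shown are illustrative):

GET /_search
{
  "aggs": {
    "all_docs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
          { "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
        ],
        "after": { "outer_docs": "key_1", "inner_docs": "3" }
      }
    }
  }
}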

It turned out that the problem was not due to subaggregation: it is actual behaviour of Elasticsearch. We are using 5 shards, and when an index has multiple shards, terms aggregations only return approximate results.
We have made this problem reproducible and posted it in the Elastic discuss forum. There, we learned that aggregations do not always return all data, with a link to the documentation where this is explained in more detail.
We also learned that using only 1 shard solves the issue, and that when that is not possible, the shard_size parameter can alleviate the problem.
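For reference, shard_size goes on the terms aggregation itself and controls how many terms each shard returns to the coordinating node before the final reduction. A hedged sketch based on the query from the question (the value 100 is illustrative; larger values trade memory and latency for accuracy):

{
  "size": 0,
  "aggs": {
    "outer_docs": {
      "terms": {
        "size": 20,
        "shard_size": 100,
        "field": "field_1_to_aggregate_on"
      },
      "aggs": {
        "inner_docs": {
          "terms": {"size": 10000, "field": "field_2_to_aggregate_on"}
        }
      }
    }
  }
}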

Related

Elasticsearch aggregation on values in nested list (array)

I have stored some values in an Elasticsearch nested data type (an array), but without using key/value pairs. An example record would be:
{
  "categories": [
    "Category1",
    "Category2"
  ],
  "product_name": "productx"
}
Now I want to run an aggregation query to find the unique list of categories available. But all the examples I've seen point to a mapping that has key/value pairs. Is there any way I can use the above schema as is, or do I need to change my schema to something like the following to run the aggregation query?
{
  "categories": [
    {"name": "Category1"},
    {"name": "Category2"}
  ],
  "product_name": "productx"
}
Well, regarding the JSON structure, you need to take a step back and figure out whether you want a list or key-value pairs.
Looking at your example, I don't think you need key-value pairs, but again, it's something you may want to clarify by checking whether your domain would add more properties to categories.
Regarding aggregation, as far as I know, aggregations would work on any valid JSON structure.
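For instance, if all you need is the unique list of categories, a plain terms aggregation on the array field should already do the job, since Elasticsearch indexes each array element as a separate value. A minimal sketch, assuming categories is mapped as keyword:

POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "terms": {
        "field": "categories",
        "size": 100
      }
    }
  }
}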
If you additionally want each category paired with the product name, you can make use of the below aggregation query for the data you've mentioned. Also, I'm assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "myaggs": {
      "terms": {
        "size": 100,
        "script": {
          "inline": """
            // build one "<category>, <product_name>" entry per array element
            def list = new ArrayList();
            for (int i = 0; i < doc['categories'].size(); i++) {
              list.add(doc['categories'][i] + ", " + doc['product_name'].value);
            }
            return list;
          """
        }
      }
    }
  }
}
Aggregation Response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "myaggs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Category1, productx",
          "doc_count": 1
        },
        {
          "key": "Category2, productx",
          "doc_count": 1
        }
      ]
    }
  }
}
Hope it helps!

How to apply exact match on single field and distinct on multiple fields together in ElasticSearch?

I recently started working on Elasticsearch, and I am trying to search with the following criteria:
I want to apply an exact match on ENAME and a distinct on both EID and ENAME for the above data.
Let's say that, for matching, I have the string ABC.
So the result should be as below:
[
  {"EID": 111, "ENAME": "ABC"},
  {"EID": 444, "ENAME": "ABC"}
]
You can achieve this via a combination of term query and terms aggregation.
Assuming that you have the following mapping:
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "EID": {
          "type": "keyword"
        },
        "ENAME": {
          "type": "keyword"
        }
      }
    }
  }
}
And inserted the documents like this:
POST my_index/doc/3
{
  "EID": "111",
  "ENAME": "ABC"
}

POST my_index/doc/4
{
  "EID": "222",
  "ENAME": "XYZ"
}

POST my_index/doc/12
{
  "EID": "444",
  "ENAME": "ABC"
}
The query that will do the job might look like this:
POST my_index/doc/_search
{
  "query": {
    "term": {          1️⃣
      "ENAME": "ABC"
    }
  },
  "size": 0,           3️⃣
  "aggregations": {
    "by EID": {
      "terms": {       2️⃣
        "field": "EID"
      }
    }
  }
}
Let me explain how it works:
1️⃣ - term query asks Elasticsearch to filter on exact value of a keyword field "ENAME";
2️⃣ - terms aggregation collects the list of all possible values of another keyword field "EID" and gives back the first N most frequent ones;
3️⃣ - "size": 0 tells Elasticsearch not to return any search hits (we are only interested in the aggregations).
The output of the query will look like this:
{
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "by EID": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "111",    <== Here is the first "distinct" value that we wanted
          "doc_count": 3
        },
        {
          "key": "444",    <== Here is another "distinct" value
          "doc_count": 2
        }
      ]
    }
  }
}
The output does not look exactly like what you posted in the question, but I believe it is the closest you can get with Elasticsearch.
However, this output is equivalent:
"ENAME" is implicitly present (since its value was used for filtering)
"EID" is present under the "buckets" of the aggregations section.
Note that under "doc_count" you will find the number of documents having such "EID".
What if I want to do a DISTINCT on several fields?
For a more complex scenario (e.g. when you need to do a distinct on many fields) see this answer.
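As a quick illustration of that multi-field case: if your cluster is on 6.1 or later (where the composite aggregation is available), a composite aggregation with several sources can emulate a DISTINCT over multiple fields. A minimal sketch, assuming both fields are keyword:

POST my_index/doc/_search
{
  "size": 0,
  "aggs": {
    "distinct_pairs": {
      "composite": {
        "sources": [
          { "EID": { "terms": { "field": "EID" } } },
          { "ENAME": { "terms": { "field": "ENAME" } } }
        ]
      }
    }
  }
}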
More information about aggregations is available here.
Hope that helps!

Very slow elasticsearch term aggregation. How to improve?

We have ~20M documents (hotel offers) stored in Elasticsearch (1.6.2), and the point is to group documents by multiple fields (duration, start_date, adults, kids) and select the single cheapest offer out of each group. We have to sort those results by the cost field.
To avoid sub-aggregations, we have combined the target field values into one field called default_group_field by joining them with a dot (.).
Mapping for the field looks like this:
"default_group_field": {
"index": "not_analyzed",
"fielddata": {
"loading": "eager_global_ordinals"
},
"type": "string"
}
Query we perform looks like this:
{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field",
        "size": 5,
        "order": {
          "min_sort_value": "asc"
        }
      },
      "aggs": {
        "min_sort_value": {
          "min": {
            "field": "cost"
          }
        },
        "cheapest": {
          "top_hits": {
            "_source": {},
            "sort": {
              "cost": "asc"
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "and": [
          ...
        ]
      }
    }
  }
}
The problem is that such a query takes seconds (2-5 s) to complete.
However, once we run the query without aggregations, we get a moderate number of results (say "total": 490) in under 100 ms.
{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 490,
    "max_score": 1,
    "hits": [...
But with the aggregation it takes over 2 seconds:
{
  "took": 2158,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 490,
    "max_score": 0,
    "hits": []
  },...
It seems like it should not take so long to process that moderate number of filtered documents and select the cheapest one out of every group. The selection could be done inside the application, but that feels like an ugly hack to me.
The log is full of lines stating:
[DEBUG][index.fielddata.plain ] [Karen Page] [offers] Global-ordinals[default_group_field][2564761] took 2453 ms
That is why we updated our mapping to perform an eager global-ordinals rebuild on index update; however, this did not have a notable impact on query timings.
Is there any way to speed up such an aggregation, or maybe a way to tell Elasticsearch to aggregate over the filtered documents only?
Or maybe there is another source of such long query execution? Any ideas are highly appreciated!
Thanks again for the effort.
Finally, we have solved the main problem and our performance is back to normal.
In short, we have done the following:
- updated the mapping for default_group_field to be of type long
- encoded the default_group_field values so that they fit into a long (for example, assuming bounded value ranges, duration, start_date offset, adults and kids can each be packed into disjoint digit or bit ranges of a single 64-bit number)
Some explanations:
Aggregations on string fields require extra work to be done on them. As we can see from the logs, building global ordinals for that field, which has very high cardinality, was very expensive. In fact, we only run aggregations on the field mentioned. With that said, it is not very efficient to use the string type.
So we have changed the mapping to:
"default_group_field": {
  "type": "long",
  "index": "not_analyzed"
}
This way we do not touch those expensive operations.
After this change, the timing of the same query dropped to ~100 ms. It also reduced CPU usage.
PS 1: I've got a lot of info from the docs on global ordinals.
PS 2: Still, I have no idea how to bypass this issue with a field of type string. Please comment if you have some ideas.
This is likely due to the default behaviour of terms aggregations, which requires global ordinals to be built. This computation can be expensive for high-cardinality fields.
The following blog addresses the likely cause of this poor performance and several approaches to resolve it.
https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch
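One of the approaches discussed there is to skip building global ordinals entirely with an execution hint, which makes the terms aggregation collect buckets directly from the field values of the matching documents. A hedged sketch (this tends to pay off only when the filtered document set is small relative to the index):

{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field",
        "execution_hint": "map"
      }
    }
  }
}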
OK, I will try to answer this.
There are a few parts of the question that I was not able to understand, like:
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.)
I am not sure what you really mean by this, because you said that you added this field to avoid aggregation (but how? And how are you avoiding the aggregation if you are joining the values with a dot?).
I am also new to Elasticsearch, so if there is anything I missed, you can comment on this answer. Thanks.
I will continue to answer the question.
Before that, I am assuming that you have the default_group_field to differentiate between records by duration, start_date, adults and kids.
I will try to provide an example below, after my solution.
My solution:
{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field"
      },
      "aggs": {
        "sort_cost_asc": {
          "top_hits": {
            "sort": [
              {
                "cost": {
                  "order": "asc"
                }
              }
            ],
            "_source": {
              "include": [ ... fields you want from the document ... ]
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "... your query part ..."
  }
}
I will try to explain what I am doing here.
I am assuming that your documents look like this (maybe there is some nesting too, but for the example I am keeping the documents as simple as I can):
document1:
{
  "default_group_field": "kids",
  "cost": 100,
  "documentId": 1
}
document2:
{
  "default_group_field": "kids",
  "cost": 120,
  "documentId": 2
}
document3:
{
  "default_group_field": "adults",
  "cost": 50,
  "documentId": 3
}
document4:
{
  "default_group_field": "adults",
  "cost": 150,
  "documentId": 4
}
So now you have these documents, and you want to get the minimum-cost document for both adults and kids. Your query should look like this:
{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field"
      },
      "aggs": {
        "sort_cost_asc": {
          "top_hits": {
            "sort": [
              {
                "cost": {
                  "order": "asc"
                }
              }
            ],
            "_source": {
              "include": ["documentId", "cost", "default_group_field"]
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "filtered": { "query": { "match_all": {} } }
  }
}
To explain the above query: I am grouping the documents by default_group_field, then sorting each group by cost; "size": 1 gives me just one document per group.
Therefore the result of this query will be the minimum-cost document in each category (adults and kids).
Usually, when I write a query for Elasticsearch or a database, I try to minimize the number of documents or rows.
I assume that I understood your question correctly. If I misunderstood it or made a mistake somewhere, please reply and let me know where I went wrong.
Thanks.

Elasticsearch counts of multiple indices

I am creating a report to compare the actual count from the database with the indexed records.
I have three indices: index1, index2 and index3.
To get the count for a single index, I am using the following URL:
http://localhost:9200/index1/_count?q=_type:invoice
=> {"count":50,"_shards":{"total":5,"successful":5,"failed":0}}
For multiple indices:
http://localhost:9200/index1,index2/_count?q=_type:invoice
=> {"count":80,"_shards":{"total":5,"successful":5,"failed":0}}
Now the counts are added up; I want them grouped by index instead. Also, how can I pass filters to group by a specific field, to get output like this:
{"index1_count":50,"index2_count":50,"approved":10,"rejected":40 ,"_shards":{"total":5,"successful":5,"failed":0}}
You can use _search?search_type=count and do an aggregation on the _index field to distinguish between the indices:
GET /index1,index2/_search?search_type=count
{
  "aggs": {
    "by_index": {
      "terms": {
        "field": "_index"
      }
    }
  }
}
and the result would be something like this:
"aggregations": {
"by_index": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "index1",
"doc_count": 50
},
{
"key": "index2",
"doc_count": 80
}
]
}
}
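For the approved/rejected part of your desired output, you could nest a second terms aggregation under by_index. A hedged sketch, assuming the approval state lives in a not_analyzed field named status (the field name is an assumption; adjust it to your mapping):

GET /index1,index2/_search?search_type=count
{
  "aggs": {
    "by_index": {
      "terms": {
        "field": "_index"
      },
      "aggs": {
        "by_status": {
          "terms": {
            "field": "status"
          }
        }
      }
    }
  }
}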

Retrieve document frequency for terms in query result with aggregations

For some of my queries to Elasticsearch I want three pieces of information back:
Which terms T occurred in the result document set?
How often does each element of T occur in the result document set?
How often does each element of T occur in the entire index (--> document frequency)?
The first two points are easily determined using the default terms facet or, nowadays, the terms aggregation.
So my question is really about the third point.
Before Elasticsearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a terms facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of the exact terms occurring in the document set specified by the QueryFilter.
At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is - if I understand correctly - that the original facet mechanism was centered around terms, whereas aggregation buckets are defined by the set of documents belonging to each bucket.
That is, specifying the global option of a terms facet together with a QueryFilter first determined the terms hit by the filter and then computed the facet values. Since the facet was global, I would receive the document counts.
With aggregations, it's different. The global aggregation can only be used as a top-level aggregation; it causes the aggregation to ignore the current query results and to compute the aggregation - e.g. a terms aggregation - on all documents in the index. For me, that's too much, since I want to restrict the returned terms ('buckets') to the terms in the document result set.
But if I use a filter sub-aggregation with a terms sub-aggregation, I restrict the term buckets to the filter again, thus retrieving normal facet counts rather than document frequencies. The reason is that the buckets are determined after the filter, so they are "too small". I don't want to restrict the bucket size; I want to restrict the buckets to the terms in the query result set.
How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?
Thanks for your time!
EDIT: Here comes an example of how I tried to achieve the desired behaviour.
I will define two aggregations:
global_agg_with_filter_and_terms
global_agg_with_terms_and_filter
Both have a global aggregation at their top because it's the only valid position for it. Then, in the first aggregation, I first filter the results down to the original query and then apply a terms sub-aggregation.
In the second aggregation, I do mostly the same, only that here the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names, only the order of aggregation differs.
{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "global_agg_with_filter_and_terms": {
      "global": {},
      "aggs": {
        "filter_agg": {
          "filter": {
            "query": {
              "query_string": {
                "query": "text: my query string"
              }
            }
          },
          "aggs": {
            "terms_agg": {
              "terms": {
                "field": "facets"
              }
            }
          }
        }
      }
    },
    "global_agg_with_terms_and_filter": {
      "global": {},
      "aggs": {
        "document_frequency": {
          "terms": {
            "field": "facets"
          },
          "aggs": {
            "term_count": {
              "filter": {
                "query": {
                  "query_string": {
                    "query": "text: my query string"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 221,
    "max_score": 0.9839197,
    "hits": <omitted>
  },
  "aggregations": {
    "global_agg_with_filter_and_terms": {
      "doc_count": 1978,
      "filter_agg": {
        "doc_count": 221,
        "terms_agg": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": "fid8", "doc_count": 155 },
            { "key": "fid6", "doc_count": 40 },
            { "key": "fid9", "doc_count": 10 },
            { "key": "fid5", "doc_count": 9 },
            { "key": "fid13", "doc_count": 5 },
            { "key": "fid7", "doc_count": 2 }
          ]
        }
      }
    },
    "global_agg_with_terms_and_filter": {
      "doc_count": 1978,
      "document_frequency": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          { "key": "fid8", "doc_count": 1050, "term_count": { "doc_count": 155 } },
          { "key": "fid6", "doc_count": 668, "term_count": { "doc_count": 40 } },
          { "key": "fid9", "doc_count": 67, "term_count": { "doc_count": 10 } },
          { "key": "fid5", "doc_count": 65, "term_count": { "doc_count": 9 } },
          { "key": "fid7", "doc_count": 63, "term_count": { "doc_count": 2 } },
          { "key": "fid13", "doc_count": 55, "term_count": { "doc_count": 5 } },
          { "key": "fid10", "doc_count": 11, "term_count": { "doc_count": 0 } },
          { "key": "fid11", "doc_count": 9, "term_count": { "doc_count": 0 } },
          { "key": "fid12", "doc_count": 5, "term_count": { "doc_count": 0 } }
        ]
      }
    }
  }
}
First, please have a look at the first two term buckets returned by both aggregations, with the keys fid8 and fid6. We can easily see that those terms appeared in the result set 155 and 40 times, respectively. Now look at the second aggregation, global_agg_with_terms_and_filter. The terms aggregation is within the scope of the global aggregation, so here we can actually see the document frequencies, 1050 and 668, respectively. So this part looks good.
The issue arises when you scan the list of term buckets further down, to the buckets with keys fid10 to fid12. While we receive their document frequency, we can also see that their term_count is 0. This is because those terms did not occur in our query, which we also used for the filter sub-aggregation.
So the problem is that for ALL terms (global scope!), both their document frequency and their facet count with regard to the actual query result are returned. But I need this to happen exactly for the terms that occurred in the query result, i.e. for the exact terms returned by the first aggregation, global_agg_with_filter_and_terms.
Perhaps there is a possibility to define some kind of filter that removes all buckets whose sub-filter-aggregation term_count has a doc_count of zero?
Hello and sorry if the answer is late.
You should have a look at the significant terms aggregation: like the terms aggregation, it returns one bucket for each term occurring in the result set, with the number of occurrences available through doc_count, but you also get the number of occurrences in a background set through bg_count. This means it only creates buckets for terms appearing in documents of your query result set.
The default background set comprises all documents in the query scope, but can be filtered down to any subset you want using background_filter.
You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:
_subset_freq: number of documents the term appears in the results set,
_superset_freq: number of documents the term appears in the background set,
_subset_size: number of documents in the results set,
_superset_size: number of documents in the background set.
Request:
{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "terms": {
      "significant_terms": {
        "field": "facets",
        "size": 100,
        "script_heuristic": {
          "script": "_subset_freq"
        }
      }
    }
  }
}
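If you need the background counts computed against a narrower document set than the whole index, the background_filter mentioned above can be added. A hedged sketch on top of the previous request (the filter field and value shown are purely illustrative; substitute whatever subset you care about):

{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "terms": {
      "significant_terms": {
        "field": "facets",
        "size": 100,
        "script_heuristic": {
          "script": "_subset_freq"
        },
        "background_filter": {
          "term": { "collection": "my_background_collection" }
        }
      }
    }
  }
}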
