Elasticsearch aggregation on values in nested list (array) - elasticsearch

I have stored some values in Elasticsearch nested data type (an array) but without using key/value pair. An example record would be:
{
"categories": [
"Category1",
"Category2"
],
"product_name": "productx"
}
Now I want to run aggregation query to find out unique list of categories available. But all the examples I've seen pointed to mapping that has key/value. Is there any way I can use above schema as is or do I need to change my schema to something like this to run aggregation query
{
"categories": [
{"name": "Category1"},
{"name": "Category2"}
],
"product_name": "productx"
}

Well regarding JSON structure, you need to take a step back and figure out if you'd want list or key-value pairs.
Looking at your example, I don't think you need key-value pairs but again its something you may want to clarify by understanding your domain if there'd be some more properties for categories.
Regarding aggregation, as far as I know, aggregations would work on any valid JSON structure.
For the data you've mentioned, you can make use of the below aggregation query. Also I'm assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"myaggs": {
"terms": {
"size": 100,
"script": {
"inline": """
def myString = "";
def list = new ArrayList();
for(int i=0; i<doc['categories'].length; i++){
myString = doc['categories'][i] + ", " + doc['product'].value;
list.add(myString);
}
return list;
"""
}
}
}
}
}
Aggregation Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1, productx",
"doc_count": 1
},
{
"key": "category2, productx",
"doc_count": 1
}
]
}
}
}
Hope it helps!

Related

Issue with date Histrogram aggrgation with filter

I am trying to get date histrogram for a timestamp field for a specific period. I am using the following query,
{
"aggs" : {
"dataRange" : {
"filter": {"range" : { "#timestamp" :{ "gte":"2020-02-28T17:20:10Z","lte":"2020-03-01T18:00:00Z" } } },
"aggs" : {
"severity_over_time" :{
"date_histogram" : { "field" : "#timestamp", "interval" : "28m" }
}}}
},"size" :0
}
The following result I got,
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 32,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"dataRange": {
"doc_count": 20,
"severity_over_time": {
"buckets": [
{
"key_as_string": "2020-02-28T17:04:00.000Z",
"key": 1582909440000,
"doc_count": 20
}
]
}
}
}
}
The the start of the histogram range ("key_as_string" ) goes outside of my filter criteria! My input filter is from "2020-02-28T17:20:10Z" but the key_as_string in the result is "2020-02-28T17:04:00.000Z" which is outside the range filter!
I tried looking at the docs but no avail. Am I missing something here?
I guess that has to do with the way a Range or a bucket is calculated. My understanding is that 28m of range would have to be maintained throughout i.e. the bucket size must be consistent.
Notice that 28m of range difference is maintained perfectly and in a way first and the last bucket seem to be stretched just to accommodate this 28m range.
Notice that logically, your result documents are all in the right buckets and that documents which are outside the filter range would not be in the aggregation query irrespective of the key_as_string appears within their limits.
Basically ES doesn't guarantee that the range values i.e. key_as_string or start and end values of buckets created may fall accurately within the scope of the filter you've provided but it does guarantee that only the documents filtered as per that range filtered query would be considered for evaluation.
You can say that bucket values are nearest possible values or approximations.
If you want to be sure of the filtered documents, just remove the filter from aggregation and use that in the query as below and remove size: 0
Notice I've made use of offset which would change the start value of the specified bucket. Perhaps that is something you are looking for.
Also one more thing, I've made use of min_doc_count just so you can filter out empty buckets.
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2020-02-28T17:20:10Z",
"lte": "2020-03-01T18:00:01Z"
}
}
}
]
}
},
"aggs": {
"severity_over_time": {
"date_histogram": {
"field": "#timestamp",
"interval": "28m",
"offset": "+11h",
"min_doc_count": 1
}
}
}
}

Subaggregation leads to missing data

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?
Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:
{
"size": 0,
"aggs": {
"outer_docs": {
"terms": {"size": 20, "field": "field_1_to_aggregate_on"},
"aggs": {
"inner_docs": {
"terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
"aggs": "things to display here"
}
}
}
}
}
If I execute this query, for some outer_docs, I receive not all inner_docs that are associated with it. In the output below, there are three inner docs for outer doc key_1.
{
"hits": {
"total": 9853,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"outer_docs": {
"doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
"buckets": [
{
"key": "key_1", "doc_count": 3,
"inner_docs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{"key": "1", "doc_count": 1, "some": "data here"},
...
{"key": "3", "doc_count": 1, "some": "data here"},
]
}
},
...
]
}
}
}
Now, I add a query to singly select one outer_doc that would have been in the first 20 anyway.
"query": {"bool": {"must": [{'term': {'field_1_to_aggregate_on': 'key_1'}}]}}
In this case, I do get all inner_docs, which are in the output below seven inner docs for outer doc key_1.
{
"hits": {
"total": 8,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"outer_docs": {
"doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
"buckets": [
{
"key": "key_1", "doc_count": 8,
"inner_docs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{"key": "1", "doc_count": 1, "some": "data here"},
...
{"key": "7", "doc_count": 2, "some": "data here"},
]
}
},
...
]
}
}
}
I have specified explicitly that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all data?
This is my version information:
{
'build_date': '2018-09-26T13:34:09.098244Z',
'build_flavor': 'default',
'build_hash': '04711c2',
'build_snapshot': False,
'build_type': 'deb',
'lucene_version': '7.4.0',
'minimum_index_compatibility_version': '5.0.0',
'minimum_wire_compatibility_version': '5.6.0',
'number': '6.4.2'
}
EDIT: After digging a bit more, I found out that the issue was unrelated to subaggregation, but to aggregation itself and the usage of shards. I have opened this bug report for Elastic about it:
https://discuss.elastic.co/t/bug-in-aggregation-result-when-using-shards/164161
https://github.com/elastic/elasticsearch/issues/37425
Check your elastic deprecation logfile. You will probably have some warnings like this:
This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
search.max_buckets is a dynamic cluster setting that defaults to 10.000 buckets in 7.0.
Now, this is not documented anywhere, but in my experience: Allocating over 10.000 buckets result in the termination of your query, but you will get back the results that have been achieved until that moment. This explains missing data in your result
Using the composite Aggregation will help, your other option is to increase the max_buckets. Be careful with that, you can crash your entire cluster that way, because there is a cost for every bucket (RAM). It does not matter if you actually use all the allocated buckets, you can crash with empty buckets only.
See:
https://www.elastic.co/guide/en/elasticsearch/reference/master/breaking-changes-7.0.html#_literal_search_max_buckets_literal_in_the_cluster_setting
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html
https://github.com/elastic/elasticsearch/issues/35896
How about using the composite aggregation for this? Pretty sure that solves your problem.
GET /_search
{
"aggs" : {
"all_docs": {
"composite" : {
"size": 1000,
"sources" : [
{ "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
{ "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
]
}
}
}
}
If you have many buckets, the composite aggregation will help you scroll through each of them using size/after.
It turned out that the problem was not due to subaggregation, and that it is an actual feature of ElasticSearch. We are using 5 shards, and when using shards, aggregations only return approximate results.
We have made this problem reproducible, and posted it in the Elastic discuss forum. There, we learned that aggregations do not always return all data, with a link to the documentation where this is explained in more detail.
We also learned that using only 1 shard solves the issue, and when that is not possible, the parameter shard_size can alleviate the problem.

Range aggregation on multiple field

I have two fields in the search document such as salary_from and salary_to and want the aggregation of salary ranges such as 0 - 10 , 10 - 20, etc.
Is there any ways to set multiple fields to the Elastic Range Aggregation. (I can set one field by using setField function)
I just need to get the aggregated count of salary ranges or slabs by considering the two fields salary_from and salary_to.
Please help me.
If I understand your question correctly, below is what you need.
{
"size": 0,
"aggs": {
"salary_ranges": {
"terms": {
"script": "doc['salary_from'].value + ' to ' + doc['salary_to'].value",
"size": 0
}
}
}
}
It basically uses a script for Terms Aggregation. Read more about it here.
If say, you have 3 documents with salary_from set to 3 and salary_to set to 5 and then you have 2 documents with salary_from set to 10 and salary_to set to 25, the result of the query above will look something like below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"salary_ranges": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "3 to 5",
"doc_count": 3
},
{
"key": "10 to 25",
"doc_count": 2
}
]
}
}
}

elasticsearch how to find number of occurrences

I wonder if it's possible to convert this sql query into ES query?
select top 10 app, cat, count(*) from err group by app, cat
Or in English it would be answering: "Show top app, cat and their counts", so this will be grouping by multiple fields and returning name and count.
For aggregating on a combination of multiple fields, you have to use scripting in Terms Aggregation like below:
POST <index name>/<type name>/_search?search_type=count
{
"aggs": {
"app_cat": {
"terms": {
"script" : "doc['app'].value + '#' + doc['cat'].value",
"size": 10
}
}
}
}
I am using # as a delimiter assuming that it is not present in any value of app and/or cat fields. You can use any other delimiter of your choice. You'll get a response something like below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"app_cat": {
"buckets": [
{
"key": "app2#cat2",
"doc_count": 4
},
{
"key": "app1#cat1",
"doc_count": 3
},
{
"key": "app2#cat1",
"doc_count": 2
},
{
"key": "app1#cat2",
"doc_count": 1
}
]
}
}
}
On the client side, you can get the individual values of app and cat fields from the aggregation response by string manipulations.
In newer versions of Elasticsearch, scripting is disabled by default due to security reasons. If you want to enable scripting, read this.
Terms aggregation is what you are looking for.

Elasticsearch highlighting - not working

I have the following problem:
I'm doing some tests with facetings
My script is as follows:
https://gist.github.com/nayelisantacruz/6610862
the result I get is as follows:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": []
},
"facets": {
"title": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "JavaScript",
"count": 1
},
{
"term": "Java Platform, Standard Edition",
"count": 1
}
]
}
}
}
which is fine, but the problem is that I can not display the "highlighting"
I was expecting a result like the following:
..........
..........
..........
"facets": {
"title": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "<b>Java</b>Script",
"count": 1
},
{
"term": "<b>Java</b> Platform, Standard Edition",
"count": 1
}
]
}
}
..........
..........
..........
Anyone can help me and tell me what I'm doing wrong or what I'm missing, please
Thank you very much for your attention
Faceting and highlighting are two completely different things. Highlighting works together with search, in order to return highlighted snippets for each of the search results.
Faceting is a completely different story, as a facet effectively looks at all the terms that have been indexed for a specific field, throughout all the documents that match the main query. In that respect, the query only controls the documents that are going to be taken into account to perform faceting. Only the top terms (by default with higher count) are going to be returned. Those terms are not only related to the search results (by default 10) but to all the documents that match the query.
That said, the terms returned with the facets are never highlighted.
If you use highlighting you should see in your response, as mentioned in the reference, a new section that contains the highlighted snippets for each of your search results. The reason why you don't see it is that you are querying the title.autocomplete field, but you make highlighting on the title field with require_field_match enabled. You either have to set require_field_match to true or highlight the same field that you are querying on. But again this is not related to faceting whatsoever.
Note the use of * instead of _all. This works like a charm at all level of nesting:
POST 123821/Encounters/_search
{
"query": {
"query_string": {
"query": "Aller*"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}

Resources