We have ~20M hotel-offer documents stored in Elasticsearch (1.6.2), and the goal is to group documents by multiple fields (duration, start_date, adults, kids) and select the single cheapest offer out of each group. We then have to sort those results by the cost field.
To avoid sub-aggregations we have combined the target fields' values into a single field called default_group_field by joining them with a dot (.).
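For illustration, a document indexed this way might look as follows (the concrete values here are made up; only the shape of the joined default_group_field matters):
{
  "duration": 7,
  "start_date": "2015-08-01",
  "adults": 2,
  "kids": 1,
  "cost": 499.0,
  "default_group_field": "7.2015-08-01.2.1"
}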
The mapping for the field looks like this:
"default_group_field": {
"index": "not_analyzed",
"fielddata": {
"loading": "eager_global_ordinals"
},
"type": "string"
}
The query we perform looks like this:
{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field",
        "size": 5,
        "order": {
          "min_sort_value": "asc"
        }
      },
      "aggs": {
        "min_sort_value": {
          "min": {
            "field": "cost"
          }
        },
        "cheapest": {
          "top_hits": {
            "_source": {},
            "sort": {
              "cost": "asc"
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "and": [
          ...
        ]
      }
    }
  }
}
The problem is that such a query takes seconds (2-5s) to complete.
However, once we run the query without aggregations, we get a moderate number of results (say "total": 490) in under 100ms.
{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 490,
    "max_score": 1,
    "hits": [...
But with the aggregation it takes over 2s:
{
  "took": 2158,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 490,
    "max_score": 0,
    "hits": []
  },...
It seems like it should not take so long to process such a moderate number of filtered documents and select the cheapest one out of each group. It could be done inside the application, but that seems like an ugly hack to me.
The log is full of lines stating:
[DEBUG][index.fielddata.plain ] [Karen Page] [offers] Global-ordinals[default_group_field][2564761] took 2453 ms
That is why we updated our mapping to rebuild global ordinals eagerly on index refresh; however, this did not have a notable impact on query timings.
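For reference, the mapping update was along these lines (a sketch; the index/type names offers/offer are placeholders for ours, and fielddata.loading is the ES 1.x setting for eager global ordinals):
PUT /offers/_mapping/offer
{
  "properties": {
    "default_group_field": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": {
        "loading": "eager_global_ordinals"
      }
    }
  }
}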
Is there any way to speed up such an aggregation, or maybe a way to tell Elasticsearch to run the aggregation on the filtered documents only?
Or maybe there is another source of such long query execution? Any ideas are highly appreciated!
Thanks again for the effort.
Finally we have solved the main problem and our performance is back to normal.
In short, we did the following:
- updated the mapping for default_group_field to be of type long
- compressed the default_group_field values so that they fit into a long (a sketch of the encoding follows below)
Some explanations:
Aggregations on string fields require some work to be done on them. As we can see from the logs, building global ordinals for that field, which has very high cardinality, was very expensive. In fact, we only run aggregations on the field mentioned, so using the string type is not very efficient.
So we have changed the mapping to:
"default_group_field": {
  "type": "long",
  "index": "not_analyzed"
}
This way we avoid those expensive operations entirely.
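To illustrate the "compression": encode each component with a fixed number of decimal digits and concatenate them into one number. This is a sketch under made-up assumptions (duration as 2 digits, start date as 5 digits of days since epoch, adults and kids as 2 digits each; the real layout depends on your value ranges). Here duration 7, start date 2015-08-01 (day 16648), 2 adults and 1 kid become 07|16648|02|01:
{
  "duration": 7,
  "start_date": "2015-08-01",
  "adults": 2,
  "kids": 1,
  "default_group_field": 7166480201
}
Any reversible scheme works, as long as each distinct (duration, start_date, adults, kids) tuple maps to a distinct long.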
After this change (the new mapping plus the encoded values) the timing of the same query dropped to ~100ms. It also reduced CPU usage.
PS 1
I've got a lot of info from the docs on global ordinals.
PS 2
Still, I have no idea how to work around this issue with a field of type string. Please comment if you have any ideas.
This is likely due to the default behaviour of the terms aggregation, which requires global ordinals to be built. This computation can be expensive for high-cardinality fields.
The following blog post addresses the likely cause of this poor performance and several approaches to resolve it:
https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch
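One approach discussed there is the map execution hint, which skips global ordinals and instead hashes the terms of only the documents matching the filter on each request (a sketch applied to the terms part of the query above; worth benchmarking, since it trades memory for per-request CPU):
"offers": {
  "terms": {
    "field": "default_group_field",
    "size": 5,
    "execution_hint": "map",
    "order": {
      "min_sort_value": "asc"
    }
  },
  ...
}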
OK, I will try to answer this.
There are a few parts of the question that I was not able to understand, like:
"To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.)"
I am not sure what you really mean by this: you said you added this field to avoid aggregation, but how? And how are you avoiding aggregation by joining the values with a dot (.)?
I am also new to Elasticsearch, so if there is anything I missed, you can comment on this answer. Thanks.
I will continue to answer this question. But before that, I am assuming that you have the default_group_field field to differentiate between records by duration, start_date, adults and kids.
I will try to provide an example below, after my solution.
My solution:
{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field"
      },
      "aggs": {
        "sort_cost_asc": {
          "top_hits": {
            "sort": [
              {
                "cost": {
                  "order": "asc"
                }
              }
            ],
            "_source": {
              "include": [ ... fields you want from the document ... ]
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "... your query part ..."
  }
}
I will try to explain what I am doing here.
I am assuming that your documents look like this (maybe there is some nesting too, but for the example I am keeping the documents as simple as I can):
document1:
{
  "default_group_field": "kids",
  "cost": 100,
  "documentId": 1
}
document2:
{
  "default_group_field": "kids",
  "cost": 120,
  "documentId": 2
}
document3:
{
  "default_group_field": "adults",
  "cost": 50,
  "documentId": 3
}
document4:
{
  "default_group_field": "adults",
  "cost": 150,
  "documentId": 4
}
So now you have these documents and you want the minimum-cost document for both adults and kids, so your query should look like this:
{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field"
      },
      "aggs": {
        "sort_cost_asc": {
          "top_hits": {
            "sort": [
              {
                "cost": {
                  "order": "asc"
                }
              }
            ],
            "_source": {
              "include": ["documentId", "cost", "default_group_field"]
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "filtered": { "query": { "match_all": {} } }
  }
}
To explain the above query: I am grouping the documents by default_group_field, then sorting each group by cost, and "size": 1 gets me just one document per group.
Therefore the result of this query will be the minimum-cost document in each category (adults and kids).
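With the four example documents above, the relevant part of the response would look roughly like this (an abridged sketch):
{
  "aggregations": {
    "offers": {
      "buckets": [
        {
          "key": "adults",
          "doc_count": 2,
          "sort_cost_asc": {
            "hits": {
              "hits": [
                { "_source": { "documentId": 3, "cost": 50, "default_group_field": "adults" } }
              ]
            }
          }
        },
        {
          "key": "kids",
          "doc_count": 2,
          "sort_cost_asc": {
            "hits": {
              "hits": [
                { "_source": { "documentId": 1, "cost": 100, "default_group_field": "kids" } }
              ]
            }
          }
        }
      ]
    }
  }
}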
Usually, when I write a query for Elasticsearch or a database, I try to minimize the number of documents or rows.
I assume I understood your question correctly. If not, or if I made some mistake, please reply and let me know where I went wrong. Thanks.
Related
I run an aggregation on 2 indices: idx-2020-07-21 and idx-2020-07-22.
The target: get all documents, but in the case of a duplicate id (50% are duplicates), get the one from the latest index, using the index name.
This is the query I'm running:
{
  "size": 0,
  "aggregations": {
    "latest_item": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "product": {
              "terms": {
                "field": "_id",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "max_date": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "sort": [
              {
                "_index": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Each index is 8GB with ~1M docs; the ES version is 7.5.
It takes around 8 minutes to aggregate, and most of the time I get:
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [32933676058/30.6gb], which is larger than the limit of [32641751449/30.3gb].
Is there a better way to write this query?
How do I deal with this exception?
I run a Java job that queries ES every 10 minutes, and I noticed it happens a lot the second time.
Do I need to release any resources or something? I use restHighLevelClient.searchAsync() with a listener that calls again with the next key until I get null.
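For context, that pagination works by feeding the previous response's after_key back in via the after parameter (a sketch of one follow-up page; the key value is made up):
{
  "size": 0,
  "aggregations": {
    "latest_item": {
      "composite": {
        "size": 1000,
        "after": { "product": "some-previous-id" },
        "sources": [
          {
            "product": {
              "terms": { "field": "_id", "order": "asc" }
            }
          }
        ]
      },
      "aggregations": { ... same top_hits as above ... }
    }
  }
}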
The cluster has 3 nodes, 32GB each.
I tried playing with the bucket size, but it didn't help much.
Thanks!
I am trying to get a date histogram for a timestamp field over a specific period. I am using the following query:
{
  "size": 0,
  "aggs": {
    "dataRange": {
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "2020-02-28T17:20:10Z",
            "lte": "2020-03-01T18:00:00Z"
          }
        }
      },
      "aggs": {
        "severity_over_time": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "28m"
          }
        }
      }
    }
  }
}
This is the result I got:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 32,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dataRange": {
      "doc_count": 20,
      "severity_over_time": {
        "buckets": [
          {
            "key_as_string": "2020-02-28T17:04:00.000Z",
            "key": 1582909440000,
            "doc_count": 20
          }
        ]
      }
    }
  }
}
The start of the histogram range ("key_as_string") goes outside my filter criteria! My input filter starts at "2020-02-28T17:20:10Z", but the key_as_string in the result is "2020-02-28T17:04:00.000Z", which is outside the range filter!
I tried looking at the docs, but to no avail. Am I missing something here?
I guess that has to do with the way a range or a bucket is calculated. My understanding is that the 28m interval must be maintained consistently throughout, i.e. the bucket size must be uniform: date_histogram aligns bucket boundaries to fixed multiples of the interval (computed from a fixed origin, not from the start of your filter), so the first and last buckets appear stretched just to accommodate this 28m range.
Notice that, logically, your result documents all land in the right buckets, and documents outside the filter range are never considered by the aggregation, regardless of whether key_as_string appears to include them.
Basically, ES doesn't guarantee that the range values, i.e. key_as_string or the start and end values of the buckets created, fall exactly within the scope of the filter you've provided, but it does guarantee that only the documents matched by that range filter are considered for evaluation.
You can think of the bucket values as the nearest possible values, or approximations.
If you want to be sure about the filtered documents, just remove the filter from the aggregation, use it in the query as below, and remove "size": 0.
Notice I've made use of offset, which changes the start value of the buckets. Perhaps that is what you are looking for.
Also, one more thing: I've made use of min_doc_count just so you can filter out empty buckets.
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "2020-02-28T17:20:10Z",
              "lte": "2020-03-01T18:00:01Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "severity_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "28m",
        "offset": "+11h",
        "min_doc_count": 1
      }
    }
  }
}
I was working with products data, here: link
The search query that sorts by the keyword field tags using max mode is as follows:
GET product/_doc/_search
{
  "size": 100,
  "from": 20,
  "_source": ["tags", "name"],
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "tags": {
        "order": "desc",
        "mode": "max"
      }
    }
  ]
}
Some documents have the same sort value. I had read somewhere that when the sort value is the same, results are ordered by the internal doc id (_id). However, that does not seem to be the case. See the screenshot below:
The first _id is 961, followed by _id 972 (fine). However, then came _id 114. I don't understand how it became random.
Help will be appreciated.
As you have already seen, it's random. To overcome this you can add another field to sort by when the sort value of the first field is the same. As you want to use _id, the query will then be as follows:
{
  "size": 100,
  "from": 20,
  "_source": [
    "tags",
    "name"
  ],
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "tags": {
        "order": "desc",
        "mode": "max"
      }
    },
    {
      "_id": "asc"
    }
  ]
}
I have the following query:
GET my-index-*/my-type/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "script": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-09T00:00:00.000",
                "to": "2017-12-09T16:00:00.000"
              },
              {
                "from": "2017-12-10T00:00:00.000",
                "to": "2017-12-10T16:00:00.000"
              }
            ]
          }
        },
        "total_count": {
          "sum_bucket": {
            "buckets_path": "dates._count"
          }
        },
        "bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalCount": "total_count"
            },
            "script": "params.totalCount == 0"
          }
        }
      }
    }
  }
}
The result of this query is a bunch of buckets. What I need is the list of keys of my buckets. The problem is that the aggregation result size is 10 by default; after getting those 10, my bucket_filter filters them by total count, and I end up with only some of those 10. I need all the results, which means I would have to specify "size": n, where n is the distinct count of code values, so that I don't lose any data. I have billions of documents, so in my case n is about 30,000. When I tried executing the query, "Out of memory" occurred on the cluster, so I guess it's not the best idea. Is there a good way to get all the results for my query?
Unfortunately this is not recommended for high-cardinality fields with 30K unique values. The reason is the memory cost and the large amount of data that needs to be collected from the shards, as you've discovered. It might work, but then you need more memory...
A more efficient solution is to use the Scroll API and specify, via the fields of your search request, the values you want to retrieve, and then store these values either in your client in memory or stream them.
Update: since ES 6.5 this has been possible with composite aggregations; see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
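A minimal sketch of that approach, assuming code can be read from an indexed keyword field (an assumption; the original uses a script, which composite terms sources also support). You request pages of distinct keys and keep passing after_key back via after until no buckets are returned:
GET my-index-*/my-type/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "composite": {
        "size": 1000,
        "sources": [
          { "code": { "terms": { "field": "code" } } }
        ]
      }
    }
  }
}
This keeps memory bounded per page instead of materializing all ~30,000 buckets at once.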
I have the following problem:
I'm doing some tests with faceting.
My script is as follows:
https://gist.github.com/nayelisantacruz/6610862
The result I get is as follows:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": []
  },
  "facets": {
    "title": {
      "_type": "terms",
      "missing": 0,
      "total": 2,
      "other": 0,
      "terms": [
        {
          "term": "JavaScript",
          "count": 1
        },
        {
          "term": "Java Platform, Standard Edition",
          "count": 1
        }
      ]
    }
  }
}
which is fine, but the problem is that I cannot display the highlighting.
I was expecting a result like the following:
..........
..........
..........
"facets": {
  "title": {
    "_type": "terms",
    "missing": 0,
    "total": 2,
    "other": 0,
    "terms": [
      {
        "term": "<b>Java</b>Script",
        "count": 1
      },
      {
        "term": "<b>Java</b> Platform, Standard Edition",
        "count": 1
      }
    ]
  }
}
..........
..........
..........
Can anyone help me and tell me what I'm doing wrong or what I'm missing, please?
Thank you very much for your attention.
Faceting and highlighting are two completely different things. Highlighting works together with search, in order to return highlighted snippets for each of the search results.
Faceting is a completely different story, as a facet effectively looks at all the terms that have been indexed for a specific field, throughout all the documents that match the main query. In that respect, the query only controls the documents that are going to be taken into account to perform faceting. Only the top terms (by default with higher count) are going to be returned. Those terms are not only related to the search results (by default 10) but to all the documents that match the query.
That said, the terms returned with the facets are never highlighted.
If you use highlighting, you should see in your response, as mentioned in the reference, a new section that contains the highlighted snippets for each of your search results. The reason why you don't see it is that you are querying the title.autocomplete field, but you highlight the title field with require_field_match enabled. You either have to set require_field_match to false or highlight the same field that you are querying on. But again, this is not related to faceting whatsoever.
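For example, a minimal sketch of the second option (the match query here is a stand-in; use whatever query from the gist targets title.autocomplete):
{
  "query": {
    "match": {
      "title.autocomplete": "Java"
    }
  },
  "highlight": {
    "fields": {
      "title.autocomplete": {}
    }
  }
}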
Note the use of * instead of _all. This works like a charm at all levels of nesting:
POST 123821/Encounters/_search
{
  "query": {
    "query_string": {
      "query": "Aller*"
    }
  },
  "highlight": {
    "fields": {
      "*": {}
    }
  }
}