I have the following problem:
I'm doing some tests with faceting.
My script is as follows:
https://gist.github.com/nayelisantacruz/6610862
the result I get is as follows:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": []
},
"facets": {
"title": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "JavaScript",
"count": 1
},
{
"term": "Java Platform, Standard Edition",
"count": 1
}
]
}
}
}
which is fine, but the problem is that I can not display the "highlighting"
I was expecting a result like the following:
..........
..........
..........
"facets": {
"title": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "<b>Java</b>Script",
"count": 1
},
{
"term": "<b>Java</b> Platform, Standard Edition",
"count": 1
}
]
}
}
..........
..........
..........
Can anyone help me and tell me what I'm doing wrong or what I'm missing, please?
Thank you very much for your attention
Faceting and highlighting are two completely different things. Highlighting works together with search, in order to return highlighted snippets for each of the search results.
Faceting is a completely different story, as a facet effectively looks at all the terms that have been indexed for a specific field, throughout all the documents that match the main query. In that respect, the query only controls which documents are taken into account when faceting. Only the top terms (by default, those with the highest count) are returned. Those terms are not only related to the returned search results (10 by default) but to all the documents that match the query.
That said, the terms returned with the facets are never highlighted.
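If the markup is still wanted on the facet terms, one option is to add it on the client after the response comes back. The following is only a minimal sketch in Python: the helper name highlight_term is made up for illustration, and the user's input ("java") is an assumption, since the original query is only available in the linked gist.

```python
import re

def highlight_term(term, query, pre="<b>", post="</b>"):
    """Wrap every case-insensitive occurrence of `query` in `term` with tags."""
    if not query:
        return term
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    # Re-use the matched text so the original casing of the term is kept.
    return pattern.sub(lambda m: pre + m.group(0) + post, term)

# Facet terms taken from the response in the question; "java" is assumed input.
terms = ["JavaScript", "Java Platform, Standard Edition"]
highlighted = [highlight_term(t, "java") for t in terms]
print(highlighted)
```

This keeps the facet counts untouched and only changes how the terms are displayed.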
If you use highlighting you should see in your response, as mentioned in the reference, a new section that contains the highlighted snippets for each of your search results. The reason why you don't see it is that you are querying the title.autocomplete field, but you are highlighting the title field with require_field_match enabled. You either have to set require_field_match to false or highlight the same field that you are querying on. But again, this is not related to faceting whatsoever.
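As a rough illustration of both options, here is a request body built in Python. It is a sketch under assumptions: the match query and the text "java" stand in for whatever the gist actually sends, and build_search_body is a made-up helper name. Only the field name title.autocomplete and the require_field_match option come from the discussion above.

```python
import json

def build_search_body(query_text, field="title.autocomplete"):
    """Build a search body that queries and highlights the same field."""
    return {
        "query": {"match": {field: query_text}},
        "highlight": {
            # Option 1: highlight the field that is actually queried.
            "fields": {field: {}},
            # Option 2: do not require the highlighted field to match the query.
            "require_field_match": False,
        },
    }

body = build_search_body("java")
print(json.dumps(body, indent=2))
```

Either change on its own should be enough; the sketch applies both only to show where each setting lives.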
Note the use of * instead of _all. This works like a charm at all levels of nesting:
POST 123821/Encounters/_search
{
"query": {
"query_string": {
"query": "Aller*"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}
I am trying to get a date histogram for a timestamp field for a specific period. I am using the following query:
{
"aggs" : {
"dataRange" : {
"filter": {"range" : { "#timestamp" :{ "gte":"2020-02-28T17:20:10Z","lte":"2020-03-01T18:00:00Z" } } },
"aggs" : {
"severity_over_time" :{
"date_histogram" : { "field" : "#timestamp", "interval" : "28m" }
}}}
},"size" :0
}
I got the following result:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 32,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"dataRange": {
"doc_count": 20,
"severity_over_time": {
"buckets": [
{
"key_as_string": "2020-02-28T17:04:00.000Z",
"key": 1582909440000,
"doc_count": 20
}
]
}
}
}
}
The start of the histogram range ("key_as_string") falls outside of my filter criteria! My input filter starts at "2020-02-28T17:20:10Z", but the key_as_string in the result is "2020-02-28T17:04:00.000Z", which is outside the range filter!
I tried looking at the docs, but to no avail. Am I missing something here?
This has to do with the way bucket boundaries are calculated. The bucket size must stay consistent, so every bucket spans exactly 28m.
Bucket keys are aligned to multiples of that 28m interval rather than to the start of your filter, which is why the first and last buckets appear stretched beyond the filter range.
Note that your result documents are all in the right buckets, and that documents outside the filter range are not included in the aggregation, even if they would fall within the limits of a bucket's key_as_string.
Basically, ES doesn't guarantee that the bucket boundaries (key_as_string, or the start and end values of the buckets created) fall within the range of the filter you've provided, but it does guarantee that only the documents matching that range filter are considered for evaluation.
In other words, bucket keys are the nearest interval-aligned values, not the exact filter bounds.
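The key from the question can be reproduced by hand. The sketch below is a simplified model, not Elasticsearch's actual implementation: it assumes a fixed interval whose buckets are aligned to multiples of the interval counted from the Unix epoch, plus an optional offset. The function name bucket_key is made up for illustration; the interval and the timestamp are the ones from the question.

```python
from datetime import datetime, timezone

def bucket_key(ts, interval_s, offset_s=0):
    """Round a timestamp down to the start of its fixed-interval bucket."""
    epoch_s = int(ts.timestamp())
    # Floor to a multiple of the interval, shifted by the offset.
    return ((epoch_s - offset_s) // interval_s) * interval_s + offset_s

filter_start = datetime(2020, 2, 28, 17, 20, 10, tzinfo=timezone.utc)
key = bucket_key(filter_start, interval_s=28 * 60)

print(key * 1000)  # 1582909440000, the "key" in the response above
print(datetime.fromtimestamp(key, tz=timezone.utc).isoformat())
# 2020-02-28T17:04:00+00:00
```

Under this model the first document at or after 17:20:10 lands in the bucket that starts at 17:04:00, which matches the response in the question.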
If you want to be sure of the filtered documents, move the filter out of the aggregation and into the query, as below, and remove size: 0 so that the matching documents are returned.
Notice I've made use of offset, which changes the start value of each bucket. Perhaps that is what you are looking for.
Also, I've made use of min_doc_count so that empty buckets are filtered out.
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2020-02-28T17:20:10Z",
"lte": "2020-03-01T18:00:01Z"
}
}
}
]
}
},
"aggs": {
"severity_over_time": {
"date_histogram": {
"field": "#timestamp",
"interval": "28m",
"offset": "+11h",
"min_doc_count": 1
}
}
}
}
I have stored some values in Elasticsearch nested data type (an array) but without using key/value pair. An example record would be:
{
"categories": [
"Category1",
"Category2"
],
"product_name": "productx"
}
Now I want to run an aggregation query to find out the unique list of categories available. But all the examples I've seen point to a mapping that has key/value pairs. Is there any way I can use the above schema as is, or do I need to change my schema to something like this to run the aggregation query?
{
"categories": [
{"name": "Category1"},
{"name": "Category2"}
],
"product_name": "productx"
}
Regarding the JSON structure, you need to take a step back and figure out whether you want a list or key-value pairs.
Looking at your example, I don't think you need key-value pairs, but it's something you may want to clarify by checking whether your domain needs more properties per category.
Regarding aggregation, as far as I know, aggregations work on any valid JSON structure.
For the data you've mentioned, you can make use of the aggregation query below. I'm assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"myaggs": {
"terms": {
"size": 100,
"script": {
"inline": """
def myString = "";
def list = new ArrayList();
for(int i=0; i<doc['categories'].length; i++){
myString = doc['categories'][i] + ", " + doc['product_name'].value;
list.add(myString);
}
return list;
"""
}
}
}
}
}
Aggregation Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1, productx",
"doc_count": 1
},
{
"key": "category2, productx",
"doc_count": 1
}
]
}
}
}
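If only the distinct category values are needed, without the product name attached, a plain terms aggregation on the array field may be enough, because each element of the array is treated as a separate value. A minimal sketch in Python, assuming categories is mapped as keyword; the aggregation name unique_categories, the helper names, and the hand-written sample response are all made up for illustration.

```python
def unique_categories_body(field="categories", size=100):
    """Request body for a terms aggregation directly on the array field."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {"unique_categories": {"terms": {"field": field, "size": size}}},
    }

def bucket_keys(response, agg_name="unique_categories"):
    """Pull the distinct values out of a terms aggregation response."""
    return [b["key"] for b in response["aggregations"][agg_name]["buckets"]]

# Hand-written response shaped like a terms aggregation result,
# using the category values from the question.
sample_response = {
    "aggregations": {
        "unique_categories": {
            "buckets": [
                {"key": "Category1", "doc_count": 1},
                {"key": "Category2", "doc_count": 1},
            ]
        }
    }
}
print(unique_categories_body())
print(bucket_keys(sample_response))  # ['Category1', 'Category2']
```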
Hope it helps!
It seems I followed every similar answer I found, but I just can't figure out what is wrong...
This is a "match all" query:
{
"query": {
"match_all": {}
}
}
..and the results:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "unittest_index_unittestdocument",
"_type": "unittestdocument",
"_id": "a.b",
"_score": 1,
"_source": {
"id": "a.b",
"docdate": "2018-01-24T09:45:44.4168345+02:00",
"primarykeywords": [
"keyword"
],
"primarytitles": [
"the title of a document"
]
}
}
]
}
}
but when I try to filter that with a date like this:
{
"query":{
"bool":{
"must":{
"multi_match":{
"type":"most_fields",
"query":"document",
"fields":[ "primarytitles","primarykeywords" ]
}
},
"filter": [
{"range":{ "docdate": { "gte":"1900-01-23T15:17:12.7313261+02:00" } } }
]
}
}
}
I have zero hits...
I tried to follow this https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html and this "filtering by date in elasticsearch", with no success at all.
Is there any difference that I cannot see?
Please note that when I remove the date filter and add a term filter on "primarykeywords", I get the results I want. The only problem is the range filter.
Apparently there was no error in my query; the problem was that the docdate field wasn't indexed... :/
I don't know why I initially skipped indexing that field (my mistake), but I do believe Elasticsearch should warn me that I am trying to query a field that has "index: false".
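Since the engine stays silent, a quick check of the mapping can save the day of debugging. The sketch below is a hypothetical helper, not part of any client library: it walks the "properties" section of a mapping that has already been fetched and lists the fields that are not indexed. The sample mapping is hand-written; only the docdate field and its "index: false" setting come from the description above.

```python
def unindexed_fields(properties, prefix=""):
    """Return dotted names of fields whose mapping disables indexing."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        # Newer versions use false; 1.x/2.x mappings used "no".
        if spec.get("index") in (False, "false", "no"):
            found.append(path)
        # Recurse into object fields.
        found.extend(unindexed_fields(spec.get("properties", {}), path + "."))
    return found

# Hand-written example of the "properties" part of a mapping.
sample_properties = {
    "docdate": {"type": "date", "index": False},
    "primarytitles": {"type": "text"},
    "primarykeywords": {"type": "keyword"},
}
print(unindexed_fields(sample_properties))  # ['docdate']
```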
The fact that Elasticsearch simply returns no results without saying what is going on is, in my opinion, a major issue. I lost a day reading everything I could find on the web, just because I didn't have proper feedback from the engine.
Fail safe died for this reason...
We have ~20M documents (hotel offers) stored in Elasticsearch (1.6.2), and the point is to group documents by multiple fields (duration, start_date, adults, kids) and select the cheapest offer out of each group. We have to sort those results by the cost field.
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.).
Mapping for the field looks like this:
"default_group_field": {
"index": "not_analyzed",
"fielddata": {
"loading": "eager_global_ordinals"
},
"type": "string"
}
Query we perform looks like this:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field",
"size": 5,
"order": {
"min_sort_value": "asc"
}
},
"aggs": {
"min_sort_value": {
"min": {
"field": "cost"
}
},
"cheapest": {
"top_hits": {
"_source": {}
},
"sort": {
"cost": "asc"
},
"size": 1
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"and": [
...
]
}
}
}
}
The problem is that such a query takes seconds (2-5 sec) to load.
However, when we run the query without aggregations, we get a moderate number of results (say "total": 490) in under 100 ms.
{
"took": 53,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 1,
"hits": [...
But with the aggregation it takes 2 seconds:
{
"took": 2158,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 0,
"hits": [
]
},...
It seems like it should not take so long to process that moderate number of filtered documents and select the cheapest one out of every group. It could be done inside the application, but that seems like an ugly hack to me.
The log is full of lines stating:
[DEBUG][index.fielddata.plain ] [Karen Page] [offers] Global-ordinals[default_group_field][2564761] took 2453 ms
That is why we updated our mapping to eagerly rebuild global ordinals on index update; however, this did not have a notable impact on query timings.
Is there any way to speed up such an aggregation, or maybe a way to tell Elasticsearch to aggregate on the filtered documents only?
Or maybe there is another cause of such a long query execution? Any ideas are highly appreciated!
thanks again for the effort.
Finally we have solved the main problem and our performance is back to normal.
In short, we have done the following:
- updated the mapping for default_group_field to be of type long
- compressed the default_group_field values so that they fit into type long
Some explanations:
Aggregations on string fields require some extra work to be done on them. As we saw in the logs, building global ordinals for a field with such high cardinality was very expensive. In fact, we only run aggregations on the field mentioned. With that said, it is not very efficient to use the string type.
So we have changed the mapping to:
default_group_field: {
type: 'long',
index: 'not_analyzed'
}
This way we do not touch those expensive operations.
After this change, the timing of the same query dropped to ~100 ms. CPU usage also went down.
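The exact compression scheme is not shown here, so the following is only one possible way to fit the four grouping values into a single long. It is a sketch in Python under stated assumptions: duration is below 1000, adults and kids are below 100, and start_date is a calendar date. The function names and the digit layout are made up for illustration.

```python
from datetime import date

def pack_group_key(duration, start_date, adults, kids):
    """Pack the four grouping values into one integer that fits a long."""
    # Assumed ranges; adjust the multipliers if real data exceeds them.
    if not (0 <= duration < 1000 and 0 <= adults < 100 and 0 <= kids < 100):
        raise ValueError("value outside the assumed range")
    day = start_date.toordinal()  # days since 0001-01-01
    return ((day * 1000 + duration) * 100 + adults) * 100 + kids

def unpack_group_key(key):
    """Reverse pack_group_key."""
    key, kids = divmod(key, 100)
    key, adults = divmod(key, 100)
    day, duration = divmod(key, 1000)
    return duration, date.fromordinal(day), adults, kids

key = pack_group_key(7, date(2015, 8, 1), 2, 1)
print(key, unpack_group_key(key))
```

Because the packing is reversible, the application can still recover the individual values from a bucket key if it needs them.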
PS 1
I've got a lot of info from the docs on global ordinals.
PS 2
I still have no idea how to work around this issue with a field of type string. Please comment if you have any ideas.
This is likely due to the default behaviour of terms aggregations, which requires global ordinals to be built. This computation can be expensive for high-cardinality fields.
The following blog addresses the likely cause of this poor performance and several approaches to resolve it.
https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch
OK, I will try to answer this.
There are a few parts of the question which I was not able to understand, like:
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.)
I am not sure what you really mean by this, because you said that
you added this field to avoid aggregation (but how? And how are you avoiding the aggregation if you are joining the values with a dot?)
I am also new to Elasticsearch, so if there is anything I missed, you can comment on this answer.
I will continue to answer this question.
But before that, I am assuming that you have that field (default_group_field) to differentiate between records by duration, start_date, adults, kids.
I will provide an example below, after my solution.
My solution:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field"
},
"aggs": {
"sort_cost_asc": {
"top_hits": {
"sort": [
{
"cost": {
"order": "asc"
}
}
],
"_source": {
"include": [ ... fields you want from the document ... ]
},
"size": 1
}
}
}
}
},
"query": {
"... your query part ..."
}
}
I will try to explain what I am doing here:
I am assuming that your document looks like this (maybe there is some nesting as well, but for the example I am keeping the document as simple as I can):
document1:
{
"default_group_field": "kids",
"cost": 100,
"documentId":1
}
document2:
{
"default_group_field": "kids",
"cost": 120,
"documentId":2
}
document3:
{
"default_group_field": "adults",
"cost": 50,
"documentId":3
}
document4:
{
"default_group_field": "adults",
"cost": 150,
"documentId":4
}
So now you have these documents and you want to get the min. cost document for both adults and kids,
so your query should look like this:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field"
},
"aggs": {
"sort_cost_asc": {
"top_hits": {
"sort": [
{
"cost": {
"order": "asc"
}
}
],
"_source": {
"include": ["documentId", "cost", "default_group_field"]
},
"size": 1
}
}
}
}
},
"query": {
"filtered":{ "query": { "match_all": {} } }
}
}
To explain the above query: I am grouping the documents by "default_group_field", then sorting each group by cost, and size: 1 gives me just one document per group.
Therefore the result of this query will be the min. cost document in each category (adults and kids).
Usually, when I write a query for Elasticsearch or a database, I try to minimize the number of documents or rows involved.
I assume that I have understood your question correctly.
If I have misunderstood your question or made a mistake, please reply and let me know where I went wrong.
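To make the grouping logic concrete, here is the same selection done by hand in Python on the four example documents above. It only illustrates what the terms plus top_hits combination is expected to return; it is not how Elasticsearch computes it, and the function name cheapest_per_group is made up.

```python
docs = [
    {"default_group_field": "kids", "cost": 100, "documentId": 1},
    {"default_group_field": "kids", "cost": 120, "documentId": 2},
    {"default_group_field": "adults", "cost": 50, "documentId": 3},
    {"default_group_field": "adults", "cost": 150, "documentId": 4},
]

def cheapest_per_group(docs, group_field="default_group_field", cost_field="cost"):
    """Keep the lowest-cost document for every distinct group value."""
    cheapest = {}
    for doc in docs:
        key = doc[group_field]
        if key not in cheapest or doc[cost_field] < cheapest[key][cost_field]:
            cheapest[key] = doc
    return cheapest

result = cheapest_per_group(docs)
print({group: doc["documentId"] for group, doc in result.items()})
# {'kids': 1, 'adults': 3}
```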
Thanks,
I wonder if it's possible to convert this SQL query into an ES query?
select top 10 app, cat, count(*) from err group by app, cat
Or in English, it would be answering: "Show the top app and cat combinations and their counts", so this means grouping by multiple fields and returning the names and counts.
To aggregate on a combination of multiple fields, you have to use scripting in a terms aggregation, like below:
POST <index name>/<type name>/_search?search_type=count
{
"aggs": {
"app_cat": {
"terms": {
"script" : "doc['app'].value + '#' + doc['cat'].value",
"size": 10
}
}
}
}
I am using # as a delimiter, assuming that it is not present in any value of the app and/or cat fields. You can use any other delimiter of your choice. You'll get a response something like the one below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"app_cat": {
"buckets": [
{
"key": "app2#cat2",
"doc_count": 4
},
{
"key": "app1#cat1",
"doc_count": 3
},
{
"key": "app2#cat1",
"doc_count": 2
},
{
"key": "app1#cat2",
"doc_count": 1
}
]
}
}
}
On the client side, you can get the individual values of the app and cat fields from the aggregation response with simple string manipulation.
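For instance, in Python the splitting could look like the sketch below. It uses the buckets from the response above and assumes, as stated, that the # delimiter never appears inside an app value; the function name split_buckets is made up.

```python
buckets = [
    {"key": "app2#cat2", "doc_count": 4},
    {"key": "app1#cat1", "doc_count": 3},
    {"key": "app2#cat1", "doc_count": 2},
    {"key": "app1#cat2", "doc_count": 1},
]

def split_buckets(buckets, delimiter="#"):
    """Turn 'app#cat' bucket keys into (app, cat, count) rows."""
    rows = []
    for bucket in buckets:
        # Split once, so a delimiter inside the cat value is left alone.
        app, cat = bucket["key"].split(delimiter, 1)
        rows.append((app, cat, bucket["doc_count"]))
    return rows

for app, cat, count in split_buckets(buckets):
    print(app, cat, count)
```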
In newer versions of Elasticsearch, scripting is disabled by default for security reasons. If you want to enable scripting, read this.
Terms aggregation is what you are looking for.