How are the documents ordered in Elasticsearch if the sort value for two documents is same? - sorting

I was working with products data, here: link
The search query that sort by keyword field tags using max mode is as follows.
GET product/_doc/_search
{
"size":100,"from":20,"_source":["tags", "name"],
"query": {
"match_all": {}
},
"sort": [
{"tags":{
"order":"desc",
"mode":"max"
}}
]
}
Some documents have same sort value. I had read somewhere that if the sort value is same, it arranges by internal doc id (_id). However, the case does not seem so. See screenshot below:
First _id: 961 followed by _id:972 (fine). However, then came _id: 114. I am not understanding how it got random.
Help will be appreciated.

As you have already seen, its random. To overcome this you can add another field to be used to sort when the sorting value for first field is same. As you want to use _id the query will be then as follows:
{
"size": 100,
"from": 20,
"_source": [
"tags",
"name"
],
"query": {
"match_all": {}
},
"sort": [
{
"tags": {
"order": "desc",
"mode": "max"
}
},
{
"_id": "asc"
}
]
}

Related

Elasticsearch ordering by field value which is not in the filter

can somebody help me please to make a query which will order result items according some field value if this field is not part of query in request. I have a query:
{
"_source": [
"ico",
"name",
"city",
"status"
],
"sort": {
"_score": "desc",
"status": "asc"
},
"size": 20,
"query": {
"bool": {
"should": [
{
"match": {
"normalized": {
"query": "idona",
"analyzer": "standard",
"boost": 3
}
}
},
{
"term": {
"normalized2": {
"value": "idona",
"boost": 2
}
}
},
{
"match": {
"normalized": "idona"
}
}
]
}
}
}
The result is sorted according field status alphabetically ascending. Status contains few values like [active, canceled, old....] and I need something like boosting for every possible values in query. E.g. active boost 5, canceled boost 4, old boost 3 ........... Is it possible to do it? Thanks.
You would need a custom sort using script to achieve what you want.
I've just made use of generic match_all query for my query, you can probably go ahead and add your query logic there, but the solution that you are looking for is in the sort section of the below query.
Make sure that status is a keyword type
Custom Sorting Based on Values
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{ "_score": "desc" },
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"if(params.scores.containsKey(doc['status'].value)) { return params.scores[doc['status'].value];} return 100000;",
"params":{
"scores":{
"active":5,
"old":4,
"cancelled":3
}
}
},
"order":"desc"
}
}
]
}
In the above query, go ahead and add the values in the scores section of the query. For e.g. if your value is new and you want it to be at say value 2, then your scores would be in the below:
{
"scores":{
"active":5,
"old":4,
"cancelled":3,
"new":6
}
}
So basically the documents would first get sorted by _score and then on that sorted documents, the script sort would be executed.
Note that the script sort is desc by nature as I understand that you would want to show active documents at the top, followed by other values. Feel free to play around with it.
Hope this helps!

Very slow elasticsearch term aggregation. How to improve?

We have ~20M (hotel offers) documents stored in elastic(1.6.2) and the point is to group documents by multiple fields (duration, start_date, adults, kids) and select one cheapest offer out of each group. We have to sort those results by cost field.
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.).
Mapping for the field looks like this:
"default_group_field": {
"index": "not_analyzed",
"fielddata": {
"loading": "eager_global_ordinals"
},
"type": "string"
}
Query we perform looks like this:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field",
"size": 5,
"order": {
"min_sort_value": "asc"
}
},
"aggs": {
"min_sort_value": {
"min": {
"field": "cost"
}
},
"cheapest": {
"top_hits": {
"_source": {}
},
"sort": {
"cost": "asc"
},
"size": 1
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"and": [
...
]
}
}
}
}
The problem is that such query takes seconds (2-5sec) to load.
However once we perform query without aggregations we get a moderate amount of results (say "total": 490) in under 100ms.
{
"took": 53,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 1,
"hits": [...
But with aggregation it take 2sec :
{
"took": 2158,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 0,
"hits": [
]
},...
It seems like it should not take so long to process that moderate amount filtered documents and select the cheapest one out of every group. It could be done inside application, which seems an ugly hack for me.
The log is full of lines stating:
[DEBUG][index.fielddata.plain ] [Karen Page] [offers] Global-ordinals[default_group_field][2564761] took 2453 ms
That is why we updated our mapping to perform eager global_ordinals rebuild on index update, however this did not make notable impact on query timings.
Is there any way to speedup such aggregation, or maybe a way to tell elastic to do aggregation on filtered documents only.
Or maybe there is another source of such a long query execution? Any ideas highly appreciated!
thanks again for the effort.
Finally we have solved the main problem and our performance is back to normal.
To be short we have done the following:
- updated the mapping for the default_group_field to be of type Long
- compressed the default_group_field values so that it would match type Long
Some explanations:
Aggregations on string fields require some work work be done on them. As we see from logs building Global Ordinals for that field that has very wide variance was very expensive. In fact we do only aggregations on the field mentioned. With that said it is not very efficient to use String type.
So we have changed the mapping to:
default_group_field: {
type: 'long',
index: 'not_analyzed'
}
This way we do not touch those expensive operations.
After this and the same query timing reduced to ~100ms. It also dropped down CPU usage.
PS 1
I`ve got a lot of info from docs on global ordinals
PS 2
Still I have no idea on how to bypass this issue with the field of type String. Please comment if you have some ideas.
This is likely due to the the default behaviour of terms aggregations, which requires global ordinals to be built. This computation can be expensive for high-cardinality fields.
The following blog addresses the likely cause of this poor performance and several approaches to resolve it.
https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch
Ok. I will try to answer this,
There are few parts in the question which I was not able to understand like -
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.)
I am not sure what you really mean by this because you said that,
You added this field to avoid aggregation(But how? and also how are you avoiding the aggregation if you are joining them with dot(.)?)
Ok. Even I am also new to elastic search. So If there is anything I missed, you can comment on this answer. Thanks,
I will continue to answer this question.
But before that I am assuming that you have
that(default_group_field) field to differentiate between records
duration, start_date, adults, kids.
I will try to provide one example below after my solution.
My solution:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field"
},
"aggs": {
"sort_cost_asc": {
"top_hits": {
"sort": [
{
"cost": {
"order": "asc"
}
}
],
"_source": {
"include": [ ... fields you want from the document ... ]
},
"size": 1
}
}
}
}
},
"query": {
"... your query part ..."
}
}
I will try to explain what I am trying to do here:
I am assuming that your document looks like this (may be there is some nesting also, But for example I am trying to keep the document as simple as I can):
document1:
{
"default_group_field": "kids",
"cost": 100,
"documentId":1
}
document2:
{
"default_group_field": "kids",
"cost": 120,
"documentId":2
}
document3:
{
"default_group_field": "adults",
"cost": 50,
"documentId":3
}
document4:
{
"default_group_field": "adults",
"cost": 150,
"documentId":4
}
So now you have this documents and you want to get the min. cost document for both adults and kids:
so your query should look like this:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field"
},
"aggs": {
"sort_cost_asc": {
"top_hits": {
"sort": [
{
"cost": {
"order": "asc"
}
}
],
"_source": {
"include": ["documentId", "cost", "default_group_field"]
},
"size": 1
}
}
}
}
},
"query": {
"filtered":{ "query": { "match_all": {} } }
}
}
To explain the above query, what I am doing is grouping the document by "default_group_field" and then I am sorting each group by cost and size:1 helps me to get the just one document.
Therefore the result for this query will be min. cost document in each category (adults and kids)
Usually when I try to write the query for elastic search or db. I try to minimize the number of document or rows.
I assume that I am right in understanding your question.
If I am wrong in understanding your question or I did some mistake, Please reply and let me know where I went wrong.
Thanks,

How to sort items by array size in ElasticSearch?

I have 3 millions items with this structure:
{
"id": "some_id",
"title": "some_title",
"photos": [
{...},
{...},
...
]
}
Some items may have empty photos field:
{
"id": "some_id",
"title": "some_title",
"photos": []
}
I want to sort by the number of photos to result in elements without photos were at the end of the list.
I have the one working solution but it's very slow on 3 million items:
GET myitems/_search
{
"filter": {
...some filters...
},
"sort": [
{
"_script": {
"script": "_source.photos.size()",
"type": "number",
"order": "desc"
}
}
]
}
This query executes 55 seconds. How to optimize this query?
As suggested in the comments, adding a new field with the number of photos would be the way to go. There's a way to achieve this without reindexing all your data by using the update by query plugin.
Basically, after installing the plugin, you can run the following query and all your documents will get that new field. However, make sure that your indexing process also populates that new field in the new documents:
curl -XPOST 'localhost:9200/myitems/_update_by_query' -d '{
"query" : {
"match_all" : {}
},
"script" : "ctx._source.nb_photos = ctx._source.photos.size();"
}'
After this has run, you'll be able to sort your results simply with:
"sort": {"nb_photos": "desc"}
Note: for this plugin to work, one needs to have scripting enabled, it is already the case for you since you were able to use a sort script, but I'm just mentioning this for completeness' sake.
Problem was solved with Transform directive. Now I have a mapping:
PUT /myitems/_mapping/lol
{
"lol" : {
"transform": {
"lang": "groovy",
"script": "ctx._source['has_photos'] = ctx._source['photos'].size() > 0"
},
"properties" : {
... fields ...
"photos" : {"type": "object"},
"has_photos": {"type": "boolean"}
... fields ...
}
}
}
Now I can sort items by photos existence:
GET /test/_search
{
"sort": [
{
"has_photos": {
"order": "desc"
}
}
]
}
Unfortunately, this will cause full reindexation.

elasticsearch sort by document id

I have a simple index in elasticsearch and all my ids are manually added, i.e. I do not add documents with automatic string ids.
Now the requirement is to get list of all documents page by page and sorted by the document id (i.e. _id)
When I tried this with _id, it did not work. Then I looked for it on forums and found out this much that I have to use _uid for that. This actually works, although I have no clue how. But another problem is that the sorting is done as if the the _id is string. And it actually is a string. But I want the results as if the _id was a number.
So there are two issues here:
Why sorting does not work with _id and it does work with _uid
Is there a way to get document ids sorted as numbers and not integers
For e.g. if my doc ids are 1, 2, 3, ..... , 55
I am getting results in this order:
1, 10, 11, 12, ... , 19, 2, 20, ... so on
While I would like to get the results in this order:
1, 2, 3, ... so on
Any help is highly appreciated!
Have the _id indexed:
{
"mappings": {
"some_type": {
"_id": {
"index": "not_analyzed"
}
}
}
}
And use a script:
{
"sort": {
"_script": {
"type": "number",
"script": "doc['_id'].value?.isInteger()?doc['_id'].value.toFloat():null",
"order": "asc"
}
}
}
Even though I strongly recommend, if possible, changing the id to integer rather having it as string and contain numbers, instead.
And I kind of doubt that it worked with _uid because _uid is a combination between type and id.
For some reasons the code above didn't work for me. ("dynamic method [java.lang.String, isInteger/0] not found")
However the script below works (only if your _id can be converted into integers)
GET ENDPOINT/INDEX/_search
{
"sort": {
"_script": {
"type": "number",
"script": "return Integer.parseInt(doc['_id'].value)",
"order": "desc" // I personally needed descending
}
}
}
Instead of id, I used id.keyword and it worked.. sample code below:
GET index_name/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"id.keyword": {
"order": "desc"
}
}
]
}

ElasticSearch - sort search results by relevance and custom field (Date)

For example, I have entities with two fields - Text and Date. I want search by entities with results sorted by Date. But if I do it simply, then the result is unexpected.
For search query "Iphone 6" there are the newest texts only with "6" in top of еру results, not with "iphone 6". Without sorting the results seem nice, but not ordered by Date as I want.
How write custom sort function which will consider both relevance and Date? Or may be exist way to give weight to field Date which will be consider in scoring?
In addition, may be I shall want to suppress search results only with "6". How to customize search to find results only by bigrams for example?
Did you tried with bool query like this
{
"query": {
"bool": {
"must": {
"match": {
"field": "iphone 6"
}
}
}
},
"sort": {
"date": {
"order": "desc"
}
}
}
or with your query you can also do this with is more appropriate way of doing i guess ..
just add this as sort
"sort": [
{ "date": { "order": "desc" }},
{ "_score": { "order": "desc" }}
]
all matching results sorted first by date, then by relevance.
The solution is to use _score and the date field both in sort. _score as the first sort order and date field as secondary sort order.
You can use simple match query to perform relevance match.
Try it out.
Data setup:
POST ecom/prod
{
"name":"iphone 6",
"date":"2019-02-10"
}
POST ecom/prod
{
"name":"iphone 5",
"date":"2019-01-10"
}
POST ecom/prod
{
"name":"iphone 6",
"date":"2019-02-28"
}
POST ecom/prod
{
"name":"6",
"date":"2019-03-01"
}
Query for relevance and date based sorting:
POST ecommerce/prododuct/_search
{
"query": {
"match": {
"name": "iphone 6"
}
},
"sort": [
{
"_score": {
"order": "desc"
}
},
{
"date": {
"order": "desc"
}
}
]
}
You could definitely use a phrase matching query for this.
It does position-aware matching so the documents will be considered a match for your query only if both "iphone" and "6" occur in the searched fields AND that their occurrences respects this order, "iphone" shows up before "6".
looks like you want to sort first by relevance and then by date. this query will do it.
{ "query" : {
"match" : {
"my_field" : "my query"
}
},
"sort": {
"pubDate": {
"order": "desc",
"mode": "min"
}
}
}
When sorting on fields with more than one value, remember that the
values do not have any intrinsic order; a multivalue field is just a
bag of values. Which one do you choose to sort on? For numbers and
dates, you can reduce a multivalue field to a single value by using
the min, max, avg, or sum sort modes. For instance, you could sort on
the earliest date in each dates field by using the above query.
elasticsearch guide sorting
I think your relevance is broken. You should use two different analyzers, 1 for setting up your index and another for searching. like this:
PUT /my_index/my_type/_mapping
{
"my_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
also you can read more about this here: https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html
Once you fix the relevance then sorting should work correctly.

Resources