Search score of identical documents changes when nested integer attribute is modified - elasticsearch

We stumbled upon this issue today and cannot really understand what is happening.
Suppose we have a really simple index with just two documents inside that have the same contents.
// document 1
{
  "question": "text of the question",
  // nested part
  "answers": [
    {
      "text": "text of first answer",
      "clickscore": 0
    }
  ]
  //
}
// document 2
{
  "question": "text of the question",
  // nested part
  "answers": [
    {
      "text": "text of first answer",
      "clickscore": 0
    }
  ]
  //
}
question and answers.text are Text fields with the same analyzer defined on them. answers is a list with either 1 or many answers inside. clickscore is an Integer field that we will use in the future to boost the relevance of some documents. When we do a search we always look for matches in question and answers.text.
Now the weird part.
document 1 and document 2 have EXACTLY the same content, so a search for text contained in both question and answers.text (for example "text") returns both hits with exactly the same score, which makes sense.
However, if we update the clickscore of one of the two documents, e.g. by setting the clickscore of document 2 to 1, and then repeat EXACTLY the same search, the scores of the two documents are NOT the same.
How is this possible? clickscore is just an integer attribute and it should not affect the score of the search, especially since we're only looking for matches in the Text fields...

Apparently the problem is that the shard statistics are not updated in time, which causes the discrepancy.
If anyone lands on this question: what fixed it for us was manually performing a flush, i.e. Index('...').flush(); after that the scores are the same again.
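To make the statistics explanation more concrete, here is a minimal sketch of the BM25 inverse document frequency term. The shard layout and the counts are hypothetical: the sketch assumes each document lives on its own shard and that, after the update, the second shard still counts the superseded copy of document 2 in its statistics. It is one plausible reading of the behaviour, not a confirmed diagnosis.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25 idf: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical shard A: holds only document 1, which was never updated.
idf_shard_a = bm25_idf(doc_count=1, doc_freq=1)

# Hypothetical shard B: holds document 2 plus its stale, superseded copy
# (assumption), so both counts are one higher than the live data suggests.
idf_shard_b = bm25_idf(doc_count=2, doc_freq=2)

print(f"shard A idf: {idf_shard_a:.4f}")
print(f"shard B idf: {idf_shard_b:.4f}")
```

Under these assumptions the same term gets a different idf on each shard, so two documents with identical text can score differently until the statistics converge again.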

Related

How to take (length of the aliases field) out of score calculation

Suppose we have documents describing people, each with a name and an array of aliases, like this:
{
name: "Christian",
aliases: ["נוצרי", "کریستیان" ]
}
Suppose I have one document with 10 aliases and another with 2 aliases, and both contain an alias with the value کریستیان.
The field length (dl) of the first document is greater than that of the second, so the length-normalized term frequency (tf) of the first document is lower. As a result, the document with fewer aliases gets the higher score.
Sometimes I want to add more aliases for a person, in different languages and forms, because he/she is more famous, but doing so lowers that person's score in the results. I want to somehow take the length of the aliases field out of my query's score calculation.
Norms store the relative length of the field:
How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field.
Norms can be disabled using the PUT mapping API:
PUT my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "norms": false
    }
  }
}
Links for further study
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm
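As a rough illustration of why field length matters, the following sketch evaluates the textbook BM25 term-frequency factor for a document with 10 aliases and one with 2. The average length of 6 is a made-up number, and the `use_norms` switch is a simplification that treats every field as average-length when norms are off; Lucene's real implementation stores lengths in a compressed form and may differ in constant factors.

```python
def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75, use_norms: bool = True) -> float:
    """Length-normalized term-frequency factor of textbook BM25.

    With use_norms=False the field length is ignored, which this sketch
    models by treating every field as having average length (assumption).
    """
    length_ratio = dl / avgdl if use_norms else 1.0
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * length_ratio))


AVGDL = 6.0  # hypothetical average number of alias terms per document

for use_norms in (True, False):
    many = bm25_tf(freq=1, dl=10, avgdl=AVGDL, use_norms=use_norms)
    few = bm25_tf(freq=1, dl=2, avgdl=AVGDL, use_norms=use_norms)
    print(f"norms={use_norms}: 10 aliases -> {many:.3f}, 2 aliases -> {few:.3f}")
```

With norms on, the shorter field gets the larger factor; with norms off, both documents get the same factor, which is the effect the question asks for.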

Search After (pagination) in Elasticsearch when sorting by score

Search after in Elasticsearch must match the sort parameters in count and order. So I was wondering how to get the score from the previous result (for example, page 1) to use as the search after value for the next page.
I ran into an issue when using the score of the last document from the previous search. The score was 1.0, and since all documents have a score of 1.0, the result for the next page turned out to be empty.
That actually makes sense, since I am asking Elasticsearch for results that rank lower than 1.0, of which there are none. So which score do I use to get the next page?
Note:
I am sorting by score and then by TieBreakerID, so one possible solution is to use a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
  "query": {
    ...
  },
  "search_after": [
    12.276552,
    14173
  ],
  "sort": [
    { "_score": "desc" },
    { "id": "asc" }
  ]
}
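Rather than guessing a score, the values for search_after can be copied from the sort array that each hit carries when the request contains a sort clause. The helper below is a minimal sketch of that idea; `next_page_body` and the sample hit are hypothetical names and data, and no Elasticsearch client is involved.

```python
from typing import Any, Dict, List


def next_page_body(query: Dict[str, Any],
                   sort: List[Dict[str, str]],
                   last_hit: Dict[str, Any]) -> Dict[str, Any]:
    """Build the request body for the page that follows `last_hit`."""
    sort_values = last_hit["sort"]
    if len(sort_values) != len(sort):
        raise ValueError("search_after must have one value per sort clause")
    return {"query": query, "sort": sort, "search_after": sort_values}


# Hypothetical last hit of page 1, shaped like an entry of hits.hits.
last_hit = {"_id": "abc", "_score": 12.276552, "sort": [12.276552, 14173]}
sort = [{"_score": "desc"}, {"id": "asc"}]
body = next_page_body({"match_all": {}}, sort, last_hit)
print(body["search_after"])
```

Because the tiebreaker value travels along with the score, tied scores no longer produce an empty next page.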

Elasticsearch: auto increment integer field across two index

I need an auto-increment integer field across two indices.
Can Elasticsearch do this automatically, like a MySQL "auto increment" column in a table?
E.g. when putting some documents into two different indices:
POST /my_index_1/blogpost/
{
  "title": "Foo Bar"
}
POST /my_index_2/blogpost/
{
  "title": "Baz quux"
}
When retrieving them, I want:
GET /my_index_*/blogpost/
{
  "uid" : 1,
  "title": "Foo Bar"
},
{
  "uid" : 2,
  "title": "Baz quux"
}
No, ES does not have any auto-increment feature. Since it is a distributed system, figuring out the correct value for the counter is non-trivial, especially since (bulk) indexing tends to be heavily concurrent. You can typically max out CPUs on all nodes if you throw enough documents at it.
So, your best option is to do this outside of ES, before you send the documents to ES. Or even better, don't do this. If you need some kind of insertion order, a better option is to simply use a timestamp. Timestamps are actually stored as a number internally. You might still get duplicates, of course, if two documents get indexed in the same millisecond. A trick we've used to work around that is to offset documents indexed at the same time by 1 ms to ensure we keep the insertion order.
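The 1 ms offset trick could be sketched on the client side roughly as follows. `make_insertion_stamper` is a hypothetical helper, and the sketch only guarantees increasing values within a single process; several concurrent writers would need additional coordination.

```python
import time
from typing import Callable


def make_insertion_stamper(
    clock_ms: Callable[[], int] = lambda: time.time_ns() // 1_000_000,
) -> Callable[[], int]:
    """Return a function yielding strictly increasing millisecond timestamps."""
    last = -1

    def next_stamp() -> int:
        nonlocal last
        now = clock_ms()
        # If the clock did not advance, move 1 ms past the previous stamp.
        last = now if now > last else last + 1
        return last

    return next_stamp


# Usage with a frozen, hypothetical clock to show the collision handling.
stamp = make_insertion_stamper(clock_ms=lambda: 1_700_000_000_000)
stamps = [stamp() for _ in range(3)]
print(stamps)  # [1700000000000, 1700000000001, 1700000000002]
```

Each document would then be indexed with its stamp as the timestamp field, so sorting by that field reproduces the insertion order.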

How do I make a field have varying type in Elastic Search

I need a field, here score, to sometimes be a number and other times a string. Like:
{
  "name": "Joe",
  "score": 32.5
}
{
  "name": "Sue",
  "score": "NOT_AVAILABLE"
}
How can I express this in the index settings in Elasticsearch?
I basically want "dynamic typing" on the field. The code may not make sense to you (like: why not split it into 2 different fields), but it's necessary to be this way on my end (for consistency reasons).
I don't need/want the property to be indexed/"searchable" btw. I just need the data to be in the json response. I need something like "any object will fit here".
Finally figured it out. All I had to do was set enabled to false; Elasticsearch will then not attempt to do anything with the data, but it's still present in the JSON response.
Like so:
"score": {
  "enabled": false
}
Just define the "score" field to be of type "string" in your mapping (text or keyword in Elasticsearch 5.0 and later, where the string type was replaced). That's it. Make sure you define the mapping before indexing any document, though. Otherwise, if you let the mapping be created dynamically and the value of "score" in the first document you index is anything but a string, you won't be able to index any later document in which "score" holds a string.
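For the enabled: false approach, a fuller mapping body might look like the sketch below, written as a Python dict that can be serialized and sent with any client. The explicit "type": "object" is an assumption about recent Elasticsearch versions, where the enabled setting applies to object fields; the snippet in the answer above omits it.

```python
import json

# Hypothetical mapping body; "score" is the field name from the question.
# Assumption: a recent Elasticsearch version, where "enabled" is set on
# fields of type "object".
mapping = {
    "properties": {
        "score": {"type": "object", "enabled": False},
    }
}

# The two documents from the question. With the mapping above, the value
# of "score" is kept in _source but is not parsed or indexed.
docs = [
    {"name": "Joe", "score": 32.5},
    {"name": "Sue", "score": "NOT_AVAILABLE"},
]

print(json.dumps(mapping, indent=2))
```

The trade-off is that score can no longer be searched, sorted, or aggregated on, which matches the stated requirement that the value only needs to appear in the response.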

Possible to have a document always return above certain position

I've got a bunch of documents from a query which are sorted by a modified date. However I'd like certain documents (identified by a field value) to always return in the top ten results regardless of whether there are ten or more documents with a more recent modified date.
From what I've read about the various ways of sorting in Elasticsearch (score, boost, scripts) I don't think I have any way of determining the actual position of a document in the search results, let alone some way of manipulating the score to push a document into the top ten.
Assuming you have a field called "important_field" that contains the value 1 for documents you want on top and, say, 0 for all other documents, you can use multi-field sorting as below:
{
  "sort": [
    { "important_field": { "order": "desc" }},
    { "modified_date": { "order": "desc" }}
  ]
}
This sorts by important_field first and, where those values are equal, by modified_date. So all documents with important_field value 1 come first, and the rest are still sorted by modified_date.
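The ordering this produces can be imitated in plain Python to check the expected result before running it against an index. The documents below are made up; only the field names come from the sort clause above.

```python
from datetime import date

# Hypothetical documents; field names follow the sort clause above.
docs = [
    {"id": "a", "important_field": 0, "modified_date": date(2020, 5, 3)},
    {"id": "b", "important_field": 1, "modified_date": date(2019, 1, 1)},
    {"id": "c", "important_field": 0, "modified_date": date(2020, 6, 1)},
    {"id": "d", "important_field": 1, "modified_date": date(2020, 2, 2)},
]

# Two stable sorts, least significant key first, give the same order as
# sorting by important_field desc, then modified_date desc.
ordered = sorted(docs, key=lambda d: d["modified_date"], reverse=True)
ordered = sorted(ordered, key=lambda d: d["important_field"], reverse=True)

print([d["id"] for d in ordered])  # ['d', 'b', 'c', 'a']
```

Note that this places every flagged document ahead of all unflagged ones, which is stronger than "somewhere in the top ten": an old flagged document such as "b" outranks the most recent unflagged one.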
