Elasticsearch date based function scoring boosting the wrong way

I would like to boost scores of documents based on how "recent" a document is. I am trying to do this using a function_score. Here is an example of me doing this on a field called updated_at:
{
  "function_score": {
    "boost_mode": "sum",
    "functions": [
      {
        "exp": {
          "updated_at": {
            "origin": "now",
            "scale": "1h",
            "decay": 0.01
          }
        },
        "weight": 1
      }
    ],
    "query": query
  }
}
I would expect documents close to the datetime now will have a score closer to 1, and documents closer to scale will have a score closer to decay (as described in the docs). Therefore, I'm using the boost_mode sum, to keep the original document scores, and increase depending on how close to now the updated_at value is. (Also, the query score is useful so I would rather add than multiply, which is the default).
To test this scenario, I create a document (A) that returns a query score of about 2. I then duplicate it (B) and modify the new document's updated_at timestamp to be an hour in the past.
In this scenario, I would expect (A) to have a higher score and (B) to have a lower score. However, when I run this scenario, I get the exact opposite. (B) ends up with a score of 3 and (A) ends up with a score of 2.
What am I misunderstanding here to cause this to happen? And how would I modify my function score to do what I would like?

This turned out to be a timezone issue.
I ended up using the explain API to look at what was contributing to the score. When doing that, I noticed that the origin set to now was actually in a different timezone to the one I was setting in the documents.
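As a rough illustration of why a timezone mismatch inverts the ranking, here is a minimal sketch of the exp decay as the function_score documentation describes it (1.0 at the origin, decay at scale away from it, in either direction). The one-hour skew, the fixed timestamps, and the helper name exp_decay are assumptions made for the example, not details from the original setup.

```python
import math
from datetime import datetime, timedelta, timezone


def exp_decay(value: datetime, origin: datetime, scale: timedelta,
              decay: float, offset: timedelta = timedelta(0)) -> float:
    """Exponential decay: 1.0 at the origin, `decay` at `scale` (plus `offset`)
    away from it. The distance is absolute, so future dates decay as well."""
    distance = abs((value - origin).total_seconds())
    effective = max(0.0, distance - offset.total_seconds())
    lam = math.log(decay) / scale.total_seconds()
    return math.exp(lam * effective)


# Assumed setup: "now" as the cluster resolves it (UTC), and document
# timestamps that were written one hour ahead of UTC by mistake.
now_utc = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
skew = timedelta(hours=1)

doc_a = now_utc + skew                        # meant to be "now"
doc_b = now_utc + skew - timedelta(hours=1)   # meant to be "an hour ago"

query_score = 2.0  # boost_mode "sum" adds the decay value to the query score
score_a = query_score + exp_decay(doc_a, now_utc, timedelta(hours=1), 0.01)
score_b = query_score + exp_decay(doc_b, now_utc, timedelta(hours=1), 0.01)
print(round(score_a, 2), round(score_b, 2))
```

Under these assumptions (A) lands a full scale away from the origin and scores about 2.01, while (B) lands exactly on the origin and scores 3.0, which is the inverted result seen in the test.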
I fixed this by manually providing a UTC timestamp in the elasticsearch query rather than using now as the value.
(If there is a better way to do this, please let me know)
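A minimal sketch of that workaround in Python, assuming the query body is built as a dict and that updated_at uses the default date format. The helper name recency_query and the match_all placeholder are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def recency_query(query: dict, now: Optional[datetime] = None) -> dict:
    """Wrap `query` in a function_score that adds an exp decay on updated_at,
    using an explicit UTC timestamp as the origin instead of "now"."""
    now = now or datetime.now(timezone.utc)
    # Normalise to UTC and render as an ISO 8601 string with a "Z" suffix.
    origin = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "function_score": {
            "query": query,
            "boost_mode": "sum",
            "functions": [
                {
                    "exp": {
                        "updated_at": {
                            "origin": origin,
                            "scale": "1h",
                            "decay": 0.01,
                        }
                    },
                    "weight": 1,
                }
            ],
        }
    }


# Placeholder inner query; substitute the real one.
body = {"query": recency_query({"match_all": {}})}
```

This only helps if the updated_at values themselves are indexed as UTC or with an explicit offset, so both sides of the comparison agree.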

Related

Search After (pagination) in Elasticsearch when sorting by score

search_after in Elasticsearch must match the sort parameters in count and order. So I was wondering how to get the score from the previous result (for example, page 1) to use as the search_after value for the next page.
I ran into an issue when using the score of the last document from the previous search. The score was 1.0, and since all documents have a score of 1.0, the result for the next page turned out to be empty.
That actually makes sense, since I am asking Elasticsearch for results with a score lower than 1.0, and there are none. So which score do I use to get the next page?
Note:
I am sorting by score and then by TieBreakerID, so one possible solution is to use a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
  "query": {
    ...
  },
  "search_after": [
    12.276552,
    14173
  ],
  "sort": [
    { "_score": "desc" },
    { "id": "asc" }
  ]
}
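One way to avoid picking the score by hand is to copy the sort values that each hit carries when the request has a sort clause. The sketch below assumes the response has already been parsed into a dict; the helper name next_page_body and the sample response are made up for illustration.

```python
import copy


def next_page_body(body: dict, response: dict):
    """Return the request body for the next page, or None when the page is empty.
    Relies on every hit carrying a "sort" array when the request sorts explicitly."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    nxt = copy.deepcopy(body)                # leave the original request untouched
    nxt["search_after"] = hits[-1]["sort"]   # score plus tiebreaker of the last hit
    return nxt


# Hypothetical first-page request and a trimmed response with tied scores.
first = {"query": {"match_all": {}}, "size": 2,
         "sort": [{"_score": "desc"}, {"id": "asc"}]}
response = {"hits": {"hits": [
    {"_id": "a", "_score": 1.0, "sort": [1.0, 14172]},
    {"_id": "b", "_score": 1.0, "sort": [1.0, 14173]},
]}}
print(next_page_body(first, response)["search_after"])  # [1.0, 14173]
```

Because the tiebreaker value is passed along with the score, documents that share a score of 1.0 are still separated by id on the next page.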

Scoring documents by both textual match and distance to a point

I have an ElasticSearch index with a list of "shops".
I'd like to allow customers to search these shops by both geo_distance (so, search for a point and get shops near that location), and textual match, like matches on shop name / address.
I'd like to get results that match either of these two criteria, and I'd like the order of these results to be a combination of both. The stronger the textual match, and the closer to the point searched, the higher the result. (Obviously, there's going to be a formula to combine these two, that'll need tweaking, not too worried about that part yet).
My issue / what I've tried:
geo_distance is a filter, not a query, so I can't combine both on the query part of the request.
I can use a bool => should filter (rather than query) that matches on either name or location. This gives me the results I want, but not in order.
I can also have _geo_distance as part of a sort clause so that documents closer to the point rank higher.
What I haven't figured out is how I would take the "regular" _score that ElasticSearch gives to documents when doing textual matches, and combine that with the geo_distance score.
By having the textual match in the filter, it doesn't seem to affect the score of documents (which makes sense). And I don't see how I could combine the textual match in the query part and a geo_distance filter so it's an OR rather than an AND.
I guess my best bet would be the equivalent of this:
{
  function_score: {
    query: { ... },
    functions: [
      { geo_distance function },
      { multi_match_result score },
    ],
    score_mode: 'multiply'
  }
}
but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.
Any pointers will be greatly appreciated.
I'm working with ElasticSearch v1.4, but I can upgrade if necessary.
"but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible."
You can't really do it in the way that you're asking, but you can do what you want just as easily. For the simpler case, you get scoring just by using a normal query.
The problem with filters is that they're yes/no questions, so if you use them in a function_score, then it either boosts the score or it doesn't. What you probably want is degradation of the score as the distance from the origin grows. It's the yes/no nature that stops them from impacting the score at all. There's no improvement to relevancy implied by matching a filter -- it just means that it's part of the answer, but it doesn't make sense to say that it should be closer to the top/bottom as a result.
This is where the decay function score helps. It works with numbers, dates, and, most helpfully here, geo_points. In addition to the types of data it accepts, it can decay using gaussian, exponential, or linear decay functions. Which one you choose is honestly somewhat arbitrary; pick the one that gives the best "experience". I would suggest starting with gauss.
"function_score": {
  "functions": [
    {
      "gauss": {
        "my_geo_point_field": {
          "origin": "0, 1",
          "scale": "5km",
          "offset": "500m",
          "decay": 0.5
        }
      }
    }
  ]
}
Note that when origin is written as a string, as it is here, the order is "lat, lon". Only the array form, [lon, lat], follows the x, y ordering of standard GeoJSON.
Each of the values affects how the score decays, as the decay-curve graph in the documentation shows. If you use an offset of 0, the score begins to drop as soon as a document is not exactly at the origin. A non-zero offset gives some buffer within which a document is considered just as good.
The scale is directly tied to the decay: the score is reduced to the decay value once a document is scale-distance beyond the offset from the origin. In my example above, anything 5.5km from the origin (the 500m offset plus the 5km scale) gets half the score of anything at the origin.
Again, just note that the different types of decay functions change the shape of scoring.
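To get a feel for the numbers, here is a small sketch of the gauss curve as the function_score documentation defines it, using the scale, offset, and decay from the example above. Distances are passed in metres directly; the function name gauss_decay and the sample distances are illustrative only.

```python
import math


def gauss_decay(distance_m: float, scale_m: float, decay: float,
                offset_m: float = 0.0) -> float:
    """Gaussian decay: 1.0 within the offset, `decay` at offset + scale."""
    effective = max(0.0, distance_m - offset_m)
    # sigma^2 is chosen so that the curve equals `decay` at `scale`.
    sigma_sq = -(scale_m ** 2) / (2.0 * math.log(decay))
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


# scale 5km, offset 500m, decay 0.5
for d in (0, 500, 3000, 5500, 10500):
    print(d, round(gauss_decay(d, scale_m=5000, decay=0.5, offset_m=500), 3))
```

With these settings the value stays at 1.0 out to 500m and reaches 0.5 at 5.5km, then keeps falling. Swapping in exp or linear changes the shape of the curve but not those two anchor points.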
"I'd like the order of these results to be a combination of both."
This is the purpose of the bool / should compound query. You get OR behavior with score improvement based on each match. Combining this with the above, you'd want something like:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": { ... }
        },
        {
          "function_score": {
            "functions": [
              {
                "gauss": {
                  "my_geo_point_field": {
                    "origin": "0, 1",
                    "scale": "5km",
                    "offset": "500m",
                    "decay": 0.5
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
NOTE: If you add a must, then the should behavior changes from literal OR-like behavior (at least 1 must match) to completely optional behavior (none must match).
"I'm working with ElasticSearch v1.4, but I can upgrade if necessary."
Starting with Elasticsearch 2.0, every filter is a query and every query is also a filter. The only difference is the context that it's used in. This doesn't change my answer here, but it's something that may help you in the future in addition to what I say next.
Geo-related performance increased dramatically in ES 2.2+. You should upgrade (and recreate your geo-related indices) to take advantage of those changes. ES 5.0 will have similar benefits!

elasticsearch more_like_this query is taking long time to run

I have the more_like_this query below, which I send to Elasticsearch.
I run it in a loop 15 times with a different art_title and art_tags each time. For some articles it takes very little time, but for others in the loop it takes too long to execute. Is there anything I can do to optimize this query? Any help is appreciated.
bodyquery = {
    "query": {
        "bool": {
            "should": [
                {
                    "more_like_this": {
                        "like_text": art_title,
                        "fields": ["title"],
                        "max_query_terms": 30,
                        "boost": 5,
                        "min_term_freq": 1
                    }
                },
                {
                    "more_like_this": {
                        "like_text": art_tags,
                        "fields": ["tags"],
                        "max_query_terms": 30,
                        "boost": 5,
                        "min_term_freq": 1
                    }
                }
            ]
        }
    }
}
You may have solved this already, but just in case: depending on the content of your indexed documents and the analyzers applied to the fields you are querying, this can take a wide range of time to complete. Think about how similarity works and how it will be calculated for your documents, and you will probably find the answer. You can also use the explain parameter to get a detailed step-by-step breakdown from Lucene.
Beyond that, it is virtually impossible to determine anything without more details:
What your mappings look like
How those fields are analyzed
What version of ES you are using
Your ES setup
Also, describe in plain English what you are trying to retrieve, for example: "I want documents in the catalog index that have a title similar to art_title and/or a tag similar to art_tag".
There is a reference to the syntax HERE if you are using the latest version of ES.
Cheers

elasticsearch: boost query based on values of a variable

I understand how to boost a query in Elasticsearch depending on the absolute value of a variable. For example:
{
  "query": {
    "bool": {
      "should": [
        { "match": {"field1": {"query": 10, "boost": 2}} }
      ]
    }
  }
}
What I need to do is make sure field1 influences the score, but I don't know any absolute value. For example, a document with field1 = 20 should get a higher score than a document with field1 = 10. However, this is different from sorting, because sorting is absolute. I just want this variable to contribute to the overall score; it is not the only field controlling the overall score.
The best solution here would be the function_score query.
It can be seen as the Swiss Army knife for customizing scores.
You can use the field_value_factor function in it to achieve what you are looking for.
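A minimal sketch of what such a query could look like, built as a Python dict. The log1p modifier, the "sum" boost_mode, the missing value of 0, and the title match are choices made for this example rather than anything the question specifies, so treat them as starting points to tune.

```python
import math


def boost_by_field(query: dict, field: str = "field1") -> dict:
    """Wrap `query` so the numeric `field` contributes to the relevance score."""
    return {
        "function_score": {
            "query": query,
            "field_value_factor": {
                "field": field,
                "factor": 1,
                "modifier": "log1p",  # dampens large values: log10(1 + value)
                "missing": 0,         # documents without the field get no boost
            },
            "boost_mode": "sum",      # add to the query score instead of multiplying
        }
    }


def log1p_contribution(value: float, factor: float = 1.0) -> float:
    """Local approximation of what the log1p modifier yields for one document."""
    return math.log10(1.0 + factor * value)


# Hypothetical text query on a "title" field.
body = {"query": boost_by_field({"match": {"title": "example"}})}
print(log1p_contribution(10), log1p_contribution(20))
```

A document with field1 = 20 gets a larger contribution than one with field1 = 10, but the text match still matters, which is the difference from a plain sort.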

change _score in elasticsearch to make equal to doc's score field

I have a score (integer) field in my data. I'm getting the data from an API and posting it directly to localhost:9200//listings/
I want each item's _score to be equal to the score field in the data.
For now, my solution is to add ?sort=score:desc to the URL.
One solution is to use a function_score query, where you replace the default _score using a field_value_factor score function. It goes like this:
# "field": "score"         -> we use the score field instead
# "factor": 1              -> take the exact same score
# "missing": 1             -> use 1 as score if the score field is missing
# "boost_mode": "replace"  -> we're replacing the default _score
# The Content-Type header is required on newer Elasticsearch versions.
curl -XPOST localhost:9200/listings/_search -H 'Content-Type: application/json' -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "field_value_factor": {
            "field": "score",
            "factor": 1,
            "missing": 1
          }
        }
      ],
      "query": {
        "match_all": {}
      },
      "boost_mode": "replace"
    }
  }
}'
So we're basically computing the score using the score field multiplied by 1 and if any document doesn't have the score field we just assume the score to be 1 (you can change that to whatever makes more sense in your case).
UPDATE
According to your comment, you need the _score to be multiplied by the document's score field. You can achieve that simply by removing the boost_mode parameter: the default boost_mode multiplies the _score by whatever value comes out of the field_value_factor function.
If you need to completely replace the default scoring mechanism to be based on your score field instead, there's a more complex way using the similarity module, where you can define another similarity algorithm solely for your score field. There is a great blog post explaining the nitty gritty details of the similarity module.
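To make the difference between "replace" and the default "multiply" concrete, here is a tiny sketch of the arithmetic. It only mirrors how the two values are combined and does not call Elasticsearch; the function name final_score and the sample numbers are made up.

```python
def final_score(query_score: float, function_value: float,
                boost_mode: str = "multiply") -> float:
    """Combine the query _score with the function value as boost_mode describes."""
    if boost_mode == "multiply":   # the default when boost_mode is omitted
        return query_score * function_value
    if boost_mode == "replace":    # ignore the query _score entirely
        return function_value
    if boost_mode == "sum":
        return query_score + function_value
    raise ValueError(f"unsupported boost_mode in this sketch: {boost_mode}")


# Hypothetical document: query _score 0.5, score field 40, factor 1, no modifier.
print(final_score(0.5, 40, "replace"))   # 40
print(final_score(0.5, 40))              # 20.0
```

With match_all as the inner query every document has the same query _score, so "replace" and "multiply" produce the same ordering; the two only diverge once the inner query scores documents differently.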
