Solr document Scoring/Boosting not working as expected - solrnet

We have integrated Solr search into a .NET project, but we are facing some issues with Solr's document boosting and scoring.
Problem: Solr is not returning scores in line with the term frequency in the document.
Example: we created four documents whose Title contains the term "Link", and Solr returned the following scores:
1)Link ==> 6.037953
2)Link Link Link Link Link ==> 5.9249415
3)Link Link ==> 5.374235
4)Link Link Link ==> 5.2746024
Can anyone please help with this Solr scoring or boosting issue?

Score calculation in Solr is fairly complex. Start from the basic equation:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)^2 · t.getBoost() · norm(t,d) )
The tf factor represents term frequency; its value is the square root of the number of times the term occurs in the field.
There is also norm (aka fieldNorm), which is used in the fieldWeight calculation. Let's take your example:
Link Link Link Link Link
Your score is calculated like this (you can see the breakdown by adding the debugQuery parameter):
5.9249415 = fieldWeight, product of:
2.236068 = tf(freq=5.0), with freq of:
5.0 = termFreq=5.0
idf (which will be the same for all your scores)
0.4375 = fieldNorm(doc=177)
Link
6.037953 = fieldWeight, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
idf (which will be the same for all your scores)
1.0 = fieldNorm
Here, the Link document has a better score than the other because fieldWeight is the product of tf, idf and fieldNorm. The fieldNorm is higher for the Link document because its field contains only one term.
As the documentation says:
lengthNorm - computed when the document is added to the index in
accordance with the number of tokens of this field in the document, so
that shorter fields contribute more to the score.
The more terms a field contains, the lower its fieldNorm will be.
Keep an eye on this value when comparing scores.
To conclude: the score does not depend only on the term frequency but also on the number of terms in the field.
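To make the interaction between tf and fieldNorm concrete, here is a minimal Python sketch of the fieldWeight product from the explain output above. It assumes the classic TF-IDF similarity, where the length norm is 1/sqrt(number of terms) before it is stored; the stored value is a lossy single-byte encoding, which as far as I know is why the explain output shows 0.4375 rather than roughly 0.447 for a five-term field. The function names and the idf value of 1.0 are placeholders for illustration, not Solr API.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def exact_length_norm(num_terms: int) -> float:
    """Length norm before the lossy single-byte encoding is applied."""
    return 1.0 / math.sqrt(num_terms)


def field_weight(freq: float, idf: float, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    return tf(freq) * idf * field_norm


# idf is identical for every document in the example, so a placeholder
# value of 1.0 is enough to compare the documents against each other.
idf = 1.0

one_link = field_weight(1, idf, 1.0)       # fieldNorm taken from the explain output
five_links = field_weight(5, idf, 0.4375)  # fieldNorm taken from the explain output

print(one_link, five_links)   # the single-term document comes out ahead
print(exact_length_norm(5))   # unrounded norm, slightly above the stored 0.4375
```

With the unrounded norm, sqrt(5) * 1/sqrt(5) would be exactly 1, so in this example the gap between the documents comes from the rounding of the stored norm.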

Related

Elastic search minimize the boost factor as time pass

I have an Elasticsearch document that looks like this:
...
{
title : "post 1",
total_likes : 100,
total_comments : 129,
updated_at : "2020-10-19"
},
...
I use a query that boosts the likes and comments and also takes the post creation date into account,
so it looks like this:
total_likes^6,
total_comments^4,
updated_at
The issue with this approach is that a post with a huge number of likes will be stuck at the top of the results forever, no matter when it was created.
How can I reduce the boost as time passes? For example, a very fresh post would get the full boost factors (6, 4), whereas a post created a year ago would get the factors (2, 1).
I think what you are looking for is the function_score query combined with a decay function [doc]
Or, if your logic is more complex, you could write it in Painless as a script_score function, or use field_value_factor for simple field-based factors [doc]
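As a rough illustration of the decay approach, the Python sketch below builds a request body that nests two function_score queries: the inner one sums weighted popularity factors, and the outer one multiplies that result by a gauss decay on updated_at. The helper name, the match_all base query, the log1p modifier, the 365d scale and the 0.5 decay are all assumptions chosen for the example, so treat it as a starting point rather than a tuned query.

```python
import json


def build_decayed_popularity_query(likes_weight=6.0, comments_weight=4.0,
                                   scale="365d", decay=0.5):
    """Build a request body: popularity boost multiplied by a recency decay."""
    # Inner query: replace the base score with a weighted sum of the counters.
    popularity = {
        "function_score": {
            "query": {"match_all": {}},  # placeholder for the real text query
            "functions": [
                {
                    "field_value_factor": {
                        "field": "total_likes",
                        "modifier": "log1p",
                        "missing": 0,
                    },
                    "weight": likes_weight,
                },
                {
                    "field_value_factor": {
                        "field": "total_comments",
                        "modifier": "log1p",
                        "missing": 0,
                    },
                    "weight": comments_weight,
                },
            ],
            "score_mode": "sum",
            "boost_mode": "replace",
        }
    }
    # Outer query: scale the popularity score down as updated_at gets older.
    return {
        "query": {
            "function_score": {
                "query": popularity,
                "functions": [
                    {
                        "gauss": {
                            "updated_at": {
                                "origin": "now",
                                "scale": scale,
                                "decay": decay,
                            }
                        }
                    }
                ],
                "boost_mode": "multiply",
            }
        }
    }


body = build_decayed_popularity_query()
print(json.dumps(body, indent=2))
```

With these settings a post whose updated_at is one scale (here a year) away from now keeps about half of its popularity score; the exact curve depends on the decay function and parameters you pick.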

Unexpected Solr scores for documents boosted by the same boost values

I have 2 documents:
{
title: "Popular",
registrations_count: 700,
is_featured: false
}
and
{
title: "Unpopular",
registrations_count: 100,
is_featured: true
}
I'm running this Solr query (via the Ruby Sunspot gem):
fq: ["type:Event"],
sort: "score desc",
q: "*:*",
defType: "edismax",
fl: "* score",
bq: ["registrations_count_i:[700 TO *]^10", "is_featured_bs:true^10"],
start: 0, rows: 30
Or, for those who are more used to Ruby:
Challenge.search do
boost(10) do
with(:registrations_count).greater_than_or_equal_to(700)
end
boost(10) do
with(:is_featured, true)
end
order_by :score, :desc
end
One document matches the first boost query, and the other matches the other boost query. They have the same boost value.
I would expect both documents to get the same score. But they don't; they get something like this:
1.2011336 # score for 'unpopular' (featured)
0.6366436 # score for 'popular' (not featured)
I also checked whether they get the exact same score when I boost an attribute that they both have in common, and they do. I also tried changing the 700 value to something like 7000, but it makes no difference (which makes total sense).
Can anyone explain why they get such a different score, while they both match one of the boost queries?
I'm guessing the confusion stems from the assumption that "the queries are boosted by the same value". That's not quite true: what each bq contributes is the score of that query itself, which is then amplified 10x by your ^10.
The bq is additive: the score from the boost query is added to the score of the document (whereas the boost parameter is multiplicative: the document score is multiplied by the boost value).
If you instead want to add the same score value to the original query based on either one matching, you can use ^=10 which makes the query constant scoring (the score will be 10 for that term, regardless of the regular score of the document).
Also, if you want to apply these factors independent of each other (instead of as a single, merged score with contributions from both factors), use multiple bq entries instead.
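For illustration, this is roughly how the request parameters from the question would look with constant-score boost queries. The Python sketch only builds and encodes the parameter set; the field names and values come from the question, while sending the request (host, core name, HTTP client) is left out because those details are not given.

```python
from urllib.parse import urlencode, parse_qs

# Same parameters as in the question, but with ^=10 so that each boost
# query contributes a constant 10 instead of its own score times 10.
params = {
    "q": "*:*",
    "defType": "edismax",
    "fq": "type:Event",
    "fl": "* score",
    "sort": "score desc",
    "bq": [
        "registrations_count_i:[700 TO *]^=10",
        "is_featured_bs:true^=10",
    ],
    "start": 0,
    "rows": 30,
}

# doseq=True repeats the bq key once per list entry, which is how
# multi-valued parameters are passed to Solr.
query_string = urlencode(params, doseq=True)
print(query_string)
```

With both boost queries constant-scoring, a document matching exactly one of them should receive the same additive contribution whichever one it matches.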

Get final score by sum of multiple fields boost

I want to build a search that prioritizes the number of field matches instead of favoring one field over another. All the fields would have the same boost value, and the final score should be the sum of the boosts of the matched fields. If the full text matches two fields and each field has a boost of 1, the final score would be 1 + 1 = 2.
Let's use an example:
class Event < ApplicationRecord
searchable do
text :title
text :category
text :artist_name
end
end
Suppose I have two events:
Event 1: Name: "Christmas festival" Artist name: "AC/DC"
Event 2: Name: "New year festival" Artist name: "Queen"
So, if the user searches for just "festival", both events are returned with the same score because it matches both events' names.
But if the user searches for "festival AC/DC", I want Event 1 to be returned first (or to be the only result), because it matches both the event name (festival) and the artist name (AC/DC), while Event 2 only matches the event name (festival). Event 1's score should be 2 while Event 2's score should be 1.
Any suggestions on how I can do that? Is this even possible?
It seems you are mixing up scoring and boosting. I think your question should be titled "Compute total score by summing each field score (regardless of the boosts)".
Field scores are computed based on field matches, and an arbitrary set of additive or multiplicative boosts (functions and/or matching subqueries) can be applied to them. But in the end what you want is to compute the global score by summing each field score, not the boosts themselves.
The DisMax query parser, for example, lets you control precisely how the final score is computed using the tie (Tie Breaker) parameter:
The tie parameter specifies a float value (which should be something
much less than 1) to use as tiebreaker in DisMax queries.
When a term from the user’s input is tested against multiple fields,
more than one field may match. If so, each field will generate a
different score based on how common that word is in that field (for
each document relative to all other documents). The tie parameter lets
you control how much the final score of the query will be influenced
by the scores of the lower scoring fields compared to the highest
scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction
max query": that is, only the maximum scoring subquery contributes to
the final score. A value of "1.0" makes the query a pure "disjunction
sum query" where it doesn’t matter what the maximum scoring sub query
is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
In your situation you need a disjunction sum query so you might want to set the tie to 1.0.
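The effect of tie can be sketched in a few lines of Python. This mirrors how a disjunction max query combines the per-field scores for one piece of user input: the best field score plus tie times the sum of the remaining ones. The function name and the sample scores are made up for the example.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way a disjunction max query does."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    # Every field other than the best one contributes tie * its score.
    return best + tie * (sum(field_scores) - best)


# Hypothetical per-field scores of 1.0 for two matching fields.
scores = [1.0, 1.0]
print(dismax_score(scores, tie=0.0))  # only the best field counts
print(dismax_score(scores, tie=1.0))  # all fields are summed
```

With tie=0.0 a second matching field adds nothing, while with tie=1.0 each matching field contributes its full score, which is the summing behaviour asked for in the question.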

How can I multiply the score of two queries together in Elasticsearch?

In Solr I can use the query function to return a numerical score for a query, and I can use that in the context of a bf parameter, something like bf=product(query('cat'),query('dog')), to multiply two relevance scores together.
Elasticsearch has a search API that is generally more flexible to work with, but I can't figure out how I would accomplish the same feat. I can use _score in a script_function of a function_query, but I can only use the _score of the main query. How can I incorporate the score of another query? How can I multiply the scores together?
You could script a TF*IDF scoring function using a function_score query. Something like this (ignoring Lucene's query and length normalization):
"script": "tf = _index[field][term].tf(); idf = (1 + log ( _index.numDocs() / (_index[field][term].df() + 1))); return sqrt(tf) * pow(idf,2)"
You'd take the product of those function results for 'cat' and 'dog' and add them to your original query score.
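To show the arithmetic outside of Elasticsearch, here is a small Python sketch of the same simplified TF*IDF formula and of the product over several terms. It is not the scripting API: the function names and the term statistics are hypothetical, and query and length normalization are ignored just as in the script above.

```python
import math


def tf_idf(tf, df, num_docs):
    """Simplified TF*IDF: sqrt(tf) * idf^2, with idf = 1 + ln(numDocs / (df + 1))."""
    idf = 1.0 + math.log(num_docs / (df + 1))
    return math.sqrt(tf) * idf ** 2


def product_of_term_scores(term_stats, num_docs):
    """Multiply the TF*IDF values of several terms, e.g. 'cat' and 'dog'."""
    return math.prod(tf_idf(tf, df, num_docs) for tf, df in term_stats)


# Hypothetical statistics for one document: (tf, df) for 'cat' and 'dog'.
stats = [(3, 120), (1, 45)]
boost = product_of_term_scores(stats, num_docs=10_000)
print(boost)
```

The resulting value plays the role of the product(query('cat'),query('dog')) boost from the Solr example.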
Here's the full query gist.
Alternately, if you've got something in that bf that's heavyweight enough you'd rather not run it across the entire set of matches, you could use rescore requests to modify the score of the top N ranked ORIGINAL QUERY results using subsequent scoring passes with your (cat, dog, etc...) scoring-queries.

Why does ElasticSearch give a lower score to a term when it's with more terms?

I have an index (on a local testing cluster) with some fields using the simple analyzer.
When I search for a term, results where the term is in a field with more terms get a lower score. Why is that? I couldn't find any reference.
For example, 'koala' in a boolean search returns:
(title 'a koala'): score 0.04500804
(title 'how the Koala 1234'): score 0.02250402
In the query explanation, the fieldNorm is 1.0 in the first case, and 0.5 in the second.
Is it possible to return a score independent of the number of terms in the field?
To run a bool must term query for koala where all documents score equally on "koala", you could use the constant score query, which effectively removes the relevance score from your query.
Here is a runnable example
http://sense.qbox.io/gist/21ae7b7e743dc30d66309f2a6b93043ded4ee401
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html
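In case the linked example is unavailable, a minimal request body for this could look like the Python sketch below. The helper name is made up; the title field and the koala term come from the question. Note that a term query is not analyzed, so the term is given in lowercase to match what the simple analyzer would have indexed.

```python
import json


def constant_score_term(field, term, boost=1.0):
    """Build a request body that gives every matching document the same score."""
    return {
        "query": {
            "constant_score": {
                "filter": {"term": {field: term}},
                "boost": boost,
            }
        }
    }


body = constant_score_term("title", "koala")
print(json.dumps(body, indent=2))
```

Every document whose title contains the term then gets the boost value as its score, regardless of the field length.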

Resources