Unexpected Solr scores for documents boosted by the same boost values - sorting

I have 2 documents:
{
  title: "Popular",
  registrations_count: 700,
  is_featured: false
}
and
{
  title: "Unpopular",
  registrations_count: 100,
  is_featured: true
}
I'm running this Solr query (via the Ruby Sunspot gem):
fq: ["type:Event"],
sort: "score desc",
q: "*:*",
defType: "edismax",
fl: "* score",
bq: ["registrations_count_i:[700 TO *]^10", "is_featured_bs:true^10"],
start: 0, rows: 30
or, for those who are more used to Ruby:
Challenge.search do
  boost(10) do
    with(:registrations_count).greater_than_or_equal_to(700)
  end
  boost(10) do
    with(:is_featured, true)
  end
  order_by :score, :desc
end
One document matches the first boost query, and the other matches the other boost query. They have the same boost value.
I would expect both documents to get the same score, but they don't. They get something like this:
1.2011336 # score for 'unpopular' (featured)
0.6366436 # score for 'popular' (not featured)
I also checked that if I boost an attribute they both have in common, they get the exact same score, and they do. I also tried changing the 700 value to something like 7000, but it makes no difference (which makes total sense).
Can anyone explain why they get such a different score, while they both match one of the boost queries?

I'm guessing the confusion stems from the assumption that "the queries are boosted by the same value". That's not true: the boost is the score of the boost query itself, which is then amplified 10x by your ^10. The two boost queries score differently on their own, so the amplified contributions differ as well.
The bq is additive: the score from the boost query is added to the score of the document. (The boost parameter, by contrast, is multiplicative: the score is multiplied by the boost query.)
If you instead want to add the same score value to the original query whenever either one matches, you can use ^=10, which makes the query constant scoring (the score will be 10 for that term, regardless of the regular score of the document).
Also, if you want to apply these factors independent of each other (instead of as a single, merged score with contributions from both factors), use multiple bq entries instead.
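To make the difference concrete, here is a minimal sketch of the arithmetic in plain Python. It is a simplified model, not Solr's actual scoring code: the function names and the raw clause scores are hypothetical, and normalization factors are ignored.

```python
def additive_bq(base_score, clause_score, boost):
    """bq with ^N: the clause's own score, scaled by the boost, is added."""
    return base_score + clause_score * boost


def constant_bq(base_score, boost):
    """bq with ^=N: a matching clause contributes exactly N."""
    return base_score + boost


# Hypothetical raw scores; the real values depend on the index contents.
base = 1.0
range_clause = 0.05      # registrations_count_i:[700 TO *]
featured_clause = 0.11   # is_featured_bs:true

# Same ^10 boost, different clause scores, so different totals.
popular = additive_bq(base, range_clause, 10)
unpopular = additive_bq(base, featured_clause, 10)

# With ^=10 both documents receive the same contribution.
popular_constant = constant_bq(base, 10)
unpopular_constant = constant_bq(base, 10)

# Boost queries rewritten with the constant-score syntax.
constant_bq_params = [
    "registrations_count_i:[700 TO *]^=10",
    "is_featured_bs:true^=10",
]
```

With the constant-score form, both documents end up with equal scores as long as each matches exactly one of the two clauses.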

Related

Get final score by sum of multiple fields boost

I want to build a search that prioritizes the number of field matches instead of one field over another. All the fields would have the same boost value, and the final score should be calculated by summing the boosts of the matched fields. If the full text matches two fields and each field has boost 1, the final score would be 1 + 1 = 2.
Let's use an example:
class Event < ApplicationRecord
  searchable do
    text :title
    text :category
    text :artist_name
  end
end
Suppose I have two events:
Event 1: Name: "Christmas festival" Artist name: "AC/DC"
Event 2: Name: "New year festival" Artist name: "Queen"
So, if the user searches just "festival", both events are returned with the same score because it matches both events' names.
But if the user searches "festival AC/DC", I want to return Event 1 in first place (or just Event 1), because it matches both the event name (festival) and the artist name (AC/DC), while Event 2 matches only the event name (festival). Event 1's score should be 2 while Event 2's score should be 1.
Any suggestions on how I can do that? Is this even possible?
It seems you are mixing up scoring and boosting. I think your question should be titled "Compute total score by summing each field score (regardless of the boosts)".
Field scores are computed based on field matches, and an arbitrary set of additive or multiplicative boosts (functions and/or matching subqueries) can be applied to them. But in the end, what you want is to compute the global score by summing each field's score, not the boosts themselves.
The DisMax query parser, for example, lets you control precisely how the final score is computed using the tie (Tie Breaker) parameter:
The tie parameter specifies a float value (which should be something
much less than 1) to use as tiebreaker in DisMax queries.
When a term from the user’s input is tested against multiple fields,
more than one field may match. If so, each field will generate a
different score based on how common that word is in that field (for
each document relative to all other documents). The tie parameter lets
you control how much the final score of the query will be influenced
by the scores of the lower scoring fields compared to the highest
scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction
max query": that is, only the maximum scoring subquery contributes to
the final score. A value of "1.0" makes the query a pure "disjunction
sum query" where it doesn’t matter what the maximum scoring sub query
is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
In your situation you need a disjunction sum query so you might want to set the tie to 1.0.
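As a rough illustration of the quoted description, the tie arithmetic for one term's per-field scores can be sketched in plain Python. The function name and the sample scores are made up for the example; real Solr scores would not be round numbers like these.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way the tie parameter is described:
    the best score counts fully, the others are scaled by `tie`."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical per-field scores for a document matching in two fields.
scores = [1.0, 1.0]

pure_max = dismax_score(scores, tie=0.0)   # only the best field counts
pure_sum = dismax_score(scores, tie=1.0)   # every field counts fully
```

With tie set to 1.0 the combined score is simply the sum of the field scores, which is the behaviour the question asks for.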

Elasticsearch similarity match score for set of terms

Is there a way to query for similarity (match score) for a set of terms in Elasticsearch?
Simple example:
Data:
doc1:{
"tags":["tag1", "tag2", "tag3", "tag4"]
}
doc2:{
"tags":["tag1", "tag2", "tag4"]
}
Query:
criteria:{
"tags":["tag1","tag2","tag3"]
}
Result
Result:{
doc1 - match 100%
doc2 - match 66.6%
}
Explanation:
doc1 has all tags that are present in search
doc2 has 2 of 3 tags that are present in search
So basically, I want a query that returns a list of documents ordered by match, where match = how similar the tags in the document are to the tags in the query. No fuzziness needed. Returning a percentage is just an example; points or some other unit is fine. The number of tags can differ.
I am designing the system, so I can store the data in any format, whatever works for Elasticsearch. I looked at their docs, but probably missed this type of search.
Many thanks for help.
Improvements
Is it possible to specify custom weight of match for each tag?
I.e. tag1 - 100points (or 20%), tag2 - 200 points (or 40%).
Yes, you need the similarity module
Not sure about weighted match, maybe the boost attribute?
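For reference, the match value described in the question can be written down in plain Python. This only pins down the target metric (including the optional per-tag weights); it is not an Elasticsearch query, and the function name and the weight for tag3 are assumptions made for the example.

```python
def match_fraction(query_tags, doc_tags, weights=None):
    """Share of the query's (optionally weighted) tags found in the document."""
    weights = weights or {}
    wanted = set(query_tags)
    present = set(doc_tags)
    total = sum(weights.get(tag, 1.0) for tag in wanted)
    if total == 0:
        return 0.0
    matched = sum(weights.get(tag, 1.0) for tag in wanted if tag in present)
    return matched / total


criteria = ["tag1", "tag2", "tag3"]
doc1 = ["tag1", "tag2", "tag3", "tag4"]
doc2 = ["tag1", "tag2", "tag4"]

unweighted_doc1 = match_fraction(criteria, doc1)
unweighted_doc2 = match_fraction(criteria, doc2)

# Hypothetical weights: tag1 and tag2 from the question, tag3 assumed.
tag_weights = {"tag1": 100.0, "tag2": 200.0, "tag3": 200.0}
weighted_doc2 = match_fraction(criteria, doc2, tag_weights)
```

Any scoring setup on the Elasticsearch side would need to reproduce this ordering: doc1 ahead of doc2, with weights shifting the gap.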

tf/idf boosting within field

My use case is like this:
For the query iphone charger, results named "iphone charger coupons" get a higher relevance than results named "iphone charger", possibly because of a better match in the description and other fields. Boosting the name field isn't helping much unless I skew the importance drastically. What I really need is a tf/idf boost within the name field.
To quote the Elasticsearch blog:
the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).
I need to boost this more important value for a particular field. Can we do this with function score or any other way?
A one term difference in length is not much of a difference to the scoring algorithm (and, in fact, can vanish entirely due to imprecision on the length norm). If there are hits on other fields, you have a lot of scoring elements to fight against.
A dis_max would probably be a reasonable approach to this. Instead of all the additive scores and coords and such you are trying to overcome, it will simply select the score of the best matching subquery. If you boost the query against title, you can ensure matches there are strongly preferred.
You can then assign a "tie_breaker", so that the score against the description subquery contributes only a fraction of its value and mainly matters when "title" scores are tied.
{
  "dis_max" : {
    "tie_breaker" : 0.2,
    "queries" : [
      {
        "terms" : {
          "title" : ["iphone", "charger"],
          "boost" : 10
        }
      },
      {
        "terms" : {
          "description" : ["iphone", "charger"]
        }
      }
    ]
  }
}
Another approach to this sort of thing, if you absolutely need to know when you have an exact match against the entire field, is to separately index an untokenized version of that field and query that field as well. Any match against the untokenized version of the field will be an exact match against the entire field contents. This saves you from relying on the length norm to make that determination.
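A minimal sketch of the untokenized-copy idea, written as Python dicts that could be serialized to JSON. The sub-field name `raw`, the boost value, and the use of the `keyword` type are assumptions for illustration (older Elasticsearch versions used a `not_analyzed` string instead), so adapt it to your version.

```python
import json

# Mapping with an analyzed field plus an untokenized sub-field (assumed name: raw).
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# A term query on the untokenized copy only matches the whole field value.
exact_title_query = {
    "query": {
        "term": {
            "title.raw": {"value": "iphone charger", "boost": 20}
        }
    }
}

mapping_json = json.dumps(mapping)
query_json = json.dumps(exact_title_query)
```

Note that an untokenized field is typically not lowercased either, so the query value has to match the stored value character for character.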

Why not use min_score with Elasticsearch?

New to Elasticsearch. I am interested in only returning the most relevant docs and came across min_score. They say "Note, most times, this does not make much sense" but don't provide a reason. So, why does it not make sense to use min_score?
EDIT: What I really want to do is only return documents that have a higher than x "score". I have this:
data = {
    'min_score': 0.9,
    'query': {
        'match': {'field': 'michael brown'},
    }
}
Is there a better alternative to the above so that it only returns the most relevant docs?
thx!
EDIT #2:
I'm using minimum_should_match and it returns a 400 error:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;"
data = {
    'query': {
        'match': {'keywords': 'michael brown'},
        'minimum_should_match': '90%',
    }
}
I've used min_score quite a lot for trying to find documents that are a definitive match to a given set of input data - which is used to generate the query.
The score you get for a document depends on the query, of course. So I'd say try your query in many permutations (different keywords, for example), decide for each one which document is the first you would rather it didn't return, and make a note of each of their scores. If the scores are similar, this gives you a good guess at the value to use for your min score.
However, you need to bear in mind that the score isn't just dependent on the query and the returned document: it considers all the other documents that have data for the fields you are querying. This means that if you test your min_score value with an index of 20 documents, this score will probably change greatly when you try it on a production index with, for example, a few thousand documents or more. This change could go either way, and is not easily predictable.
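One way to see why the index size matters is the inverse document frequency term. The sketch below uses the classic Lucene TF-IDF form, roughly 1 + ln(numDocs / (docFreq + 1)); the document counts are hypothetical, and newer Elasticsearch versions default to BM25, whose idf differs in detail but also depends on the index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style idf: rarer terms in bigger indexes score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


# Hypothetical: the same term appears in 5 documents in both indexes.
idf_small_index = classic_idf(doc_freq=5, num_docs=20)
idf_large_index = classic_idf(doc_freq=5, num_docs=5000)
```

Here the idf rises because the term became relatively rarer; if the term's document frequency grows faster than the index, the idf falls instead, which is why a fixed min_score is hard to carry over.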
I've found for my matching uses of min_score, you need to create quite a complicated query, and set of analysers to tune the scores for various components of your query. But what is and isn't included is vital to my application, so you may well be happy with what it gives you when keeping things simple.
I don't know if it's the best solution, but it works for me (Java):
// "Tiny" search to discover maxScore.
// It is fast because it returns only 1 item.
SearchResponse firstResponse = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setSize(1)
        .execute()
        .actionGet();
// Get the maxScore and set minScore to 70% of it.
float maxScore = firstResponse.getHits().maxScore();
float minScore = maxScore * 0.7f;
// Second round with the minimum score applied.
SearchResponse response = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setMinScore(minScore)
        .execute()
        .actionGet();
I search twice, but the first search is fast because it returns only one item, which is enough to read the max_score.
NOTE: minimum_should_match works differently. If you have 4 queries and set minimum_should_match = 70%, it doesn't mean that item.score should be > 70%. It means that the item must match 70% of the queries. The number computed from the percentage is rounded down, so 70% of 4 (2.8) means at least 2 of the 4 queries must match.
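The 400 error in the question's second edit is likely caused by minimum_should_match sitting next to the match clause instead of inside it. A sketch of the nested form, reusing the question's field name and search text:

```python
# minimum_should_match belongs inside the field's options,
# alongside the text to search for under "query".
data = {
    "query": {
        "match": {
            "keywords": {
                "query": "michael brown",
                "minimum_should_match": "90%",
            }
        }
    }
}
```

This constrains how many of the analyzed terms must match; it does not put a threshold on the score.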

Why does ElasticSearch give a lower score to a term when it's with more terms?

I have an index (on a local test cluster) with some fields using the simple analyzer.
When I search for a term, the results where the term is in a field with more terms, get a lower score - why is that? I couldn't find any reference.
For example, 'koala' in a boolean search returns:
(title 'a koala'): score 0.04500804
(title 'how the Koala 1234'): score 0.02250402
In the query explanation, the fieldNorm is 1.0 in the first case, and 0.5 in the second.
Is it possible to return a score independent of the number of terms in the field?
To run a bool must term query for koala with all documents scoring equally on "koala", you could use the constant score query, which basically removes the score from your query.
Here is a runnable example
http://sense.qbox.io/gist/21ae7b7e743dc30d66309f2a6b93043ded4ee401
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html
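In case the gist link is unavailable, a minimal sketch of such a request body as a Python dict. The field name title comes from the question; the boost value is an arbitrary choice for the example.

```python
# Every document matching the filter gets the same score (the boost),
# regardless of how many other terms the field contains.
constant_score_body = {
    "query": {
        "constant_score": {
            "filter": {"term": {"title": "koala"}},
            "boost": 1.0,
        }
    }
}
```

Because the clause runs as a filter, the fieldNorm no longer influences the result.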

Resources