Get final score by sum of multiple fields boost - ruby

I want to build a search that prioritizes the number of field matches rather than favoring one field over another. All fields would have the same boost value, and the final score should be the sum of the boosts of the matched fields. If the full-text query matches two fields and each field has a boost of 1, the final score would be 1 + 1 = 2.
Let's use an example:
class Event < ApplicationRecord
  searchable do
    text :title
    text :category
    text :artist_name
  end
end
Suppose I have two events:
Event 1: Name: "Christmas festival" Artist name: "AC/DC"
Event 2: Name: "New year festival" Artist name: "Queen"
So, if the user searches for just "festival", both events are returned with the same score because the term matches both events' names.
But if the user searches for "festival AC/DC", I want Event 1 returned first (or only Event 1), because it matches both the event name (festival) and the artist name (AC/DC), while Event 2 matches only the event name (festival). Event 1's score should be 2 and Event 2's score should be 1.
Any suggestions on how I can do that? Is this even possible?

It seems you are mixing up scoring and boosting. I think your question should be titled "Compute total score by summing each field score (regardless of the boosts)".
Field scores are computed from field matches, and an arbitrary set of additive or multiplicative boosts (functions and/or matching subqueries) can be applied to them. But in the end, what you want is to compute the global score by summing each field score, not the boosts themselves.
The DisMax query parser, for example, lets you control precisely how the final score is computed using the tie (Tie Breaker) parameter:
The tie parameter specifies a float value (which should be something
much less than 1) to use as tiebreaker in DisMax queries.
When a term from the user’s input is tested against multiple fields,
more than one field may match. If so, each field will generate a
different score based on how common that word is in that field (for
each document relative to all other documents). The tie parameter lets
you control how much the final score of the query will be influenced
by the scores of the lower scoring fields compared to the highest
scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction
max query": that is, only the maximum scoring subquery contributes to
the final score. A value of "1.0" makes the query a pure "disjunction
sum query" where it doesn’t matter what the maximum scoring sub query
is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
In your situation you need a disjunction sum query so you might want to set the tie to 1.0.
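The effect of tie described in the quoted documentation can be sketched in plain Ruby. This is only an illustration of the arithmetic, not Solr's implementation: dismax_score is a hypothetical helper, and the per-field scores are made-up values standing in for whatever Solr would compute for one query term matched in several fields.

```ruby
# Hypothetical helper (not part of Solr or Sunspot): combines the scores of
# the fields matched by one query term as the tie description states,
# i.e. the best field score plus tie times the scores of the other fields.
def dismax_score(field_scores, tie)
  return 0.0 if field_scores.empty?

  best = field_scores.max
  best + tie * (field_scores.sum - best)
end

# Made-up per-field scores for a term that matches two fields.
two_fields = [1.0, 1.0]

puts dismax_score(two_fields, 0.0) # pure disjunction max: only the best field counts
puts dismax_score(two_fields, 1.0) # pure disjunction sum: every field counts
```

With tie at 0.0 a second matching field adds nothing, while with tie at 1.0 each matching field contributes its full score, which is the behavior the question asks for.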

Related

Unexpected Solr scores for documents boosted by the same boost values

I have 2 documents:
{
  title: "Popular",
  registrations_count: 700,
  is_featured: false
}
and
{
  title: "Unpopular",
  registrations_count: 100,
  is_featured: true
}
I'm running this Solr query (via the Ruby Sunspot gem):
fq: ["type:Event"],
sort: "score desc",
q: "*:*",
defType: "edismax",
fl: "* score",
bq: ["registrations_count_i:[700 TO *]^10", "is_featured_bs:true^10"],
start: 0, rows: 30
or, for those who are more used to ruby:
Challenge.search do
  boost(10) do
    with(:registrations_count).greater_than_or_equal_to(700)
  end
  boost(10) do
    with(:is_featured, true)
  end
  order_by :score, :desc
end
One document matches the first boost query, and the other matches the other boost query. They have the same boost value.
What I would expect is that both documents get the same score. But they don't; they get something like this:
1.2011336 # score for 'unpopular' (featured)
0.6366436 # score for 'popular' (not featured)
I also checked whether they get the exact same score if I boost an attribute that they both have in common, and they do. I also tried changing the 700 value to something like 7000, but it makes no difference (which makes total sense).
Can anyone explain why they get such a different score, while they both match one of the boost queries?
I'm guessing the confusion stems from the assumption that both queries are boosted by the same value. That's not the case: the boost is the score of the boost query itself, which is then amplified 10x by your ^10.
The bq is additive: the score from the boost query is added to the score of the document (whereas boost is multiplicative: the score is multiplied by the boost query).
If you instead want to add the same score value to the original query whenever either one matches, you can use ^=10, which makes the query constant scoring (the score will be 10 for that term, regardless of the regular score of the document).
Also, if you want to apply these factors independent of each other (instead of as a single, merged score with contributions from both factors), use multiple bq entries instead.
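As a rough sketch of the difference in syntax, the request parameters from the question could be rebuilt with constant-scoring boost queries. The two helper methods are hypothetical and only build strings; the field names and the other parameters are taken from the question, and how the hash is sent to Solr is left out.

```ruby
# Hypothetical helpers: they only format boost-query clauses as strings.
def additive_boost(clause, factor)
  # The clause's own score is multiplied by factor before being added.
  "#{clause}^#{factor}"
end

def constant_boost(clause, value)
  # The clause contributes exactly value whenever it matches.
  "#{clause}^=#{value}"
end

params = {
  fq: ["type:Event"],
  q: "*:*",
  defType: "edismax",
  fl: "* score",
  sort: "score desc",
  bq: [
    constant_boost("registrations_count_i:[700 TO *]", 10),
    constant_boost("is_featured_bs:true", 10)
  ]
}

puts params[:bq]
```

With the ^= form, a document matching exactly one of the two clauses should receive the same added value whichever clause it matched.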

Kibana. Data tables. Exclude terms depending on the length

I'm storing sentences in Elasticsearch.
Example:
this is a sentence
this is a second sentence
And I want to show a data table with the most used terms in my Kibana 4.3.1, selecting:
Metric = count
Split rows
Aggregation = terms
Field = input
Order by = metric count
Order descending. Size 5
This is what I'm getting in the table:
this 2
is 2
a 2
sentence 2
second 1
And I want to remove the short words, those with fewer than 3 characters. In this example, "is" and "a".
How can I achieve this?
Thanks!
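It works after adding this Exclude Pattern (note that {0,3} also excludes three-character terms; use {0,2} to exclude only terms shorter than three characters):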
[a-zA-Z0-9]{0,3}
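To see which terms such a pattern would drop, it can be tried out in plain Ruby. This sketch assumes the exclude pattern has to match a whole term; Ruby regular expressions are unanchored, so \A and \z are added to mimic that. The kept_terms helper is hypothetical and the term list is the one from the question.

```ruby
# The pattern from the answer, anchored so it must match the entire term
# (an assumption about how the exclude pattern is applied).
EXCLUDE = /\A[a-zA-Z0-9]{0,3}\z/

# Hypothetical helper: returns the terms that the pattern does not exclude.
def kept_terms(terms, exclude = EXCLUDE)
  terms.reject { |term| term.match?(exclude) }
end

terms = %w[this is a sentence second]
puts kept_terms(terms).inspect
```

Passing /\A[a-zA-Z0-9]{0,2}\z/ as the second argument would keep three-character terms such as "the".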

Elasticsearch: How to search, sort, limit the results then sort again?

This isn't about multi-level sorting.
I need my results first selected by distance, limited to 50, then those 50 sorted by price.
select *
from
(
  select top 50 * from mytable order by distance asc
) as nearest
order by price asc
Essentially, the second sort throws away the ordering of the inner sort, but the inner sort is used to home in on the top 50 results.
The other answers I've seen for this sort of question look at second-level sorting, which is not what I'm after.
BTW: I've looked at aggregations (Top N results), but I'm not sure I can apply a sort to the aggregation result. I also looked at rescore, but I don't know where to put my 'sorts'.
A top hits aggregation will allow you to sort on a separate field (in your case, price) from the main query sort (on distance). See the documentation here for how to specify sorting in the top hits agg.
It'll look a little like this (which assumes distance is a double type; if it's a geo-location type, use the documentation provided by Volodymyr Bilyachat.)
{
  "sort": [
    {
      "distance": "asc"
    }
  ],
  "query": {
    "match_all": {}
  },
  "size": 50,
  "aggs": {
    "top_price_hits": {
      "top_hits": {
        "sort": [
          {
            "price": {
              "order": "asc"
            }
          }
        ],
        "size": 50
      }
    }
  }
}
However, if there are only 50 results that you want from your primary query, why don't you just sort in the application client side? This would be a better approach as using a top hits aggregation for a secondary sort is a slight abuse of its purpose.
The in-application approach would be more robust.
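A minimal sketch of the in-application approach, assuming the hits have already been fetched into plain Ruby objects. The Hit struct and the nearest_then_cheapest helper are hypothetical names; in practice Elasticsearch would already return the nearest 50 when the query sorts by distance with size 50, so only the final sort by price would be needed.

```ruby
# Hypothetical stand-in for a search hit.
Hit = Struct.new(:id, :distance, :price)

# Keep the `limit` nearest hits, then order those by price.
def nearest_then_cheapest(hits, limit = 50)
  hits.sort_by(&:distance).first(limit).sort_by(&:price)
end

hits = [
  Hit.new(1, 5.0, 30),
  Hit.new(2, 1.0, 20),
  Hit.new(3, 2.0, 10),
  Hit.new(4, 9.0, 1)
]

# With a limit of 3, the farthest hit is dropped even though it is the cheapest.
puts nearest_then_cheapest(hits, 3).map(&:id).inspect
```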
+1'ed the accepted answer, but I wanted to make sure you were aware of how search scoring can often deliver a better user experience than traditional sorting.
Based on your current strategy, one could say:
Distance is important, relatively speaking (e.g. top 50 closest) but not in absolute terms (e.g. must be within 50mi).
You only want to show 50 results.
You want those results to be sorted by price (or perhaps alphabetically).
However, if you find yourself trying to generalize about which result a searcher is most likely to choose, you may discover a function of price and distance (or other features) which better models the real-world likelihood of a searcher choosing a particular result.
E.g. Say you discover that
Users will pay more for the convenience of a nearby result
Users will travel greater distances for greater discounts
Then you could model a sample scoring function that generates a result ordering based on this relationship.
E.g. 1/price + 1/distance, which would generate a higher score as either price or distance decreased.
This could be generalized to P * 1/price + 1/distance, where P is a tuning coefficient expressing the relative importance of price vs distance.
Armed with this model, you could then write a function score query which would output ordered results with the optimal combinations of price and distance for your users.
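The sample scoring function can be tried out in plain Ruby before it is turned into a function score query. This is only a sketch of the formula given above; price_distance_score and the numbers are illustrative, and both price and distance are assumed to be greater than zero.

```ruby
# P * 1/price + 1/distance, with p_weight playing the role of P.
# Assumes price and distance are both positive.
def price_distance_score(price, distance, p_weight = 1.0)
  p_weight * (1.0 / price) + (1.0 / distance)
end

near_and_pricey = price_distance_score(20.0, 1.0)
far_and_cheap   = price_distance_score(5.0, 10.0)

puts near_and_pricey
puts far_and_cheap
```

Raising p_weight shifts the ranking toward cheaper results, and lowering it shifts the ranking toward nearer ones.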
As I see it, it would be better to select the top 50 using the size: 50 property in the query, ordered by distance, and then sort the results by price in your application.

How can I multiply the score of two queries together in Elasticsearch?

In Solr I can use the query function query to return a numerical score for a query, and I can use that in the context of a bf parameter, something like bf=product(query('cat'),query('dog')), to multiply two relevance scores together.
Elasticsearch has a search API that is generally more flexible to work with, but I can't figure out how I would accomplish the same feat. I can use _score in a script_score function of a function_score query, but I can only use the _score of the main query. How can I incorporate the score of another query? How can I multiply the scores together?
You could script a TF*IDF scoring function using a function_score query. Something like this (ignoring Lucene's query and length normalization):
"script": "tf = _index[field][term].tf(); idf = (1 + log ( _index.numDocs() / (_index[field][term].df() + 1))); return sqrt(tf) * pow(idf,2)"
You'd take the product of those function results for 'cat' and 'dog' and add them to your original query score.
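The arithmetic of that script can be checked outside Elasticsearch. The sketch below mirrors the formula in plain Ruby, assuming log means the natural logarithm; tf_idf and combined_score are hypothetical names, and the statistics passed in are made up rather than read from an index.

```ruby
# sqrt(tf) * idf^2, with idf = 1 + ln(num_docs / (df + 1)),
# mirroring the script above (query and length normalization ignored).
def tf_idf(tf, df, num_docs)
  idf = 1 + Math.log(num_docs.to_f / (df + 1))
  Math.sqrt(tf) * idf**2
end

# Product of the two term scores, added to the original query score.
def combined_score(original_score, cat_stats, dog_stats, num_docs)
  cat = tf_idf(cat_stats[:tf], cat_stats[:df], num_docs)
  dog = tf_idf(dog_stats[:tf], dog_stats[:df], num_docs)
  original_score + cat * dog
end

# Made-up statistics for one document in a 1000-document index.
puts combined_score(0.5, { tf: 2, df: 40 }, { tf: 1, df: 25 }, 1000)
```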
Here's the full query gist.
Alternatively, if you've got something in that bf that's heavyweight enough that you'd rather not run it across the entire set of matches, you could use rescore requests to modify the score of the top N ranked ORIGINAL QUERY results using subsequent scoring passes with your (cat, dog, etc.) scoring queries.

Why does ElasticSearch give a lower score to a term when it's with more terms?

I have an index (on a local testing cluster) with some fields using the simple analyzer.
When I search for a term, the results where the term is in a field with more terms get a lower score. Why is that? I couldn't find any reference.
For example, 'koala' in a boolean search returns:
(title 'a koala'): score 0.04500804
(title 'how the Koala 1234'): score 0.02250402
In the query explanation, the fieldNorm is 1.0 in the first case, and 0.5 in the second.
Is it possible to return a score independent of the number of terms in the field?
To have a bool must term query for koala score all matching documents equally on "koala", you could use the constant score query, which basically removes the score from your query.
Here is a runnable example
http://sense.qbox.io/gist/21ae7b7e743dc30d66309f2a6b93043ded4ee401
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html
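For readers who cannot open the runnable example, a constant score request body might look roughly like the hash built below. This is a sketch rather than the content of the linked gist: constant_score_query is a hypothetical helper, the title field and the lowercase term koala come from the question, and the boost value is arbitrary.

```ruby
require "json"

# Hypothetical helper: wraps a term filter in a constant_score query so every
# matching document gets the same score (the boost), whatever the field length.
def constant_score_query(field, term, boost = 1.0)
  {
    query: {
      constant_score: {
        filter: { term: { field => term } },
        boost: boost
      }
    }
  }
end

puts JSON.pretty_generate(constant_score_query("title", "koala"))
```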

Resources