unable to understand sorting data on query solr - sorting

I am trying to understand a Solr sort clause I found in legacy code:
q=*:*
sort=product(if(salesAmount,salesAmount,0.05), query($sortbq)) desc,
sortbq=*:*^10.000 brand:"nike"^1.600
fl=salesAmount,queryVal:query($sortbq)
Sample documents in the result look like this:
<!-- FOR brand=nike -->
<doc>
<double name="salesAmount">91743.75</double>
<str name="brand">Nike</str>
<float name="queryVal">2.3159266</float>
</doc>
<!-- FOR brand!=nike -->
<doc>
<str name="prdId">1070694</str>
<double name="sls_amt">92660.75</double>
<str name="brand">Lee</str>
<float name="queryVal">0.19959758</float>
</doc>
Can anybody please explain how query($sortbq) calculates a single value on which sorting is done? I ran the Solr query with debug=true and got the values below in the debug section:
<str name="1139424">
1.0 = *:*, product of: 1.0 = boost 1.0 = queryNorm
</str>
<str name="1011619">
1.0 = *:*, product of: 1.0 = boost 1.0 = queryNorm
</str>
PS : If any one chooses to down-vote this question, please do mention reason in comments.

Try putting your sort clause in the bq parameter of the Solr query and adding debug.explain.structured=true.
The debug output will then show how the sort score is calculated.
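For intuition, the sort expression can be read as plain arithmetic. The sketch below is an illustration in Python, not Solr's implementation. It assumes that query($sortbq) evaluates to the score each document gets from the $sortbq query (the same number returned as queryVal through fl), and that if(salesAmount,salesAmount,0.05) falls back to 0.05 when salesAmount is missing or zero. The function name sort_value is made up for this example, and the second document's sls_amt is treated as its salesAmount.

```python
def sort_value(sales_amount, query_val):
    """Mimic product(if(salesAmount,salesAmount,0.05), query($sortbq))."""
    # Assumed fallback: a missing or zero salesAmount is replaced by 0.05.
    amount = sales_amount if sales_amount else 0.05
    # query_val stands for the score the document gets from $sortbq.
    return amount * query_val


# Values taken from the two sample documents in the question.
nike = sort_value(91743.75, 2.3159266)
lee = sort_value(92660.75, 0.19959758)

# A document without any sales amount still gets a small, non-zero value.
no_sales = sort_value(None, 2.3159266)

# Sorting "desc" on this single number puts the Nike document first.
ranked = sorted([("Lee", lee), ("Nike", nike)], key=lambda p: p[1], reverse=True)
```

The brand:"nike" clause raises queryVal for Nike documents, so the product favors them even when the sales amounts are similar.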

Related

Unexpected Solr scores for documents boosted by the same boost values

I have 2 documents:
{
title: "Popular",
registrations_count: 700,
is_featured: false
}
and
{
title: "Unpopular",
registrations_count: 100,
is_featured: true
}
I'm running this Solr query (via the Ruby Sunspot gem):
fq: ["type:Event"],
sort: "score desc",
q: "*:*",
defType: "edismax",
fl: "* score",
bq: ["registrations_count_i:[700 TO *]^10", "is_featured_bs:true^10"],
start: 0, rows: 30
or, for those who are more used to ruby:
Challenge.search do
  boost(10) do
    with(:registrations_count).greater_than_or_equal_to(700)
  end
  boost(10) do
    with(:is_featured, true)
  end
  order_by :score, :desc
end
One document matches the first boost query, and the other matches the other boost query. They have the same boost value.
I would expect both documents to get the same score. But they don't; they get something like this:
1.2011336 # score for 'unpopular' (featured)
0.6366436 # score for 'popular' (not featured)
I also checked that if I boost an attribute they both have in common, they get the exact same score, and they do. I also tried changing the 700 value to something like 7000, but it makes no difference (which makes total sense).
Can anyone explain why they get such a different score, while they both match one of the boost queries?
I'm guessing the confusion stems from "the queries being boosted by the same value" - that's not true - the boost is the score of the query itself, which is then amplified 10x by your ^10.
The bq is additive - the score from the query is added to the score of the document (while boost is multiplicative, the score is multiplied by the boost query).
If you instead want to add the same score value to the original query based on either one matching, you can use ^=10 which makes the query constant scoring (the score will be 10 for that term, regardless of the regular score of the document).
Also, if you want to apply these factors independent of each other (instead of as a single, merged score with contributions from both factors), use multiple bq entries instead.
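As a rough illustration of the ^= suggestion, the request from the question could be rebuilt with constant-score boost clauses, each sent as its own bq entry. This is a minimal Python sketch that only assembles the query string; the helper name constant_boost is made up here, and the field names are copied from the question.

```python
from urllib.parse import urlencode


def constant_boost(clause, score):
    """Append ^=score so the clause contributes a fixed score when it matches."""
    return f"{clause}^={score}"


# One ("bq", ...) pair per boost clause, so each is sent as a separate bq entry.
params = [
    ("q", "*:*"),
    ("defType", "edismax"),
    ("fq", "type:Event"),
    ("fl", "* score"),
    ("sort", "score desc"),
    ("bq", constant_boost("registrations_count_i:[700 TO *]", 10)),
    ("bq", constant_boost("is_featured_bs:true", 10)),
]

query_string = urlencode(params)
```

With constant-score clauses, a document matching exactly one of the two boosts should receive the same added score whichever one it matches.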

solr boost query with separate sort

I want to demote all documents that have inv=0 (possible values range from 0 to 1000) to the end of the result set. I also have other sorting options, such as name desc, as part of the query.
For example, below are my Solr documents:
Doc1 : name=apple , Inv=2
Doc2 : name=ball , Inv=1
Doc3 : name=cat , Inv=0
Doc4 : name=dog , Inv=0
Doc5 : name=fish , Inv=4
Doc6 : name=Goat , Inv=5
I want to achieve the sorting below: push all documents with inv=0 down to the bottom and then apply "name asc" sorting.
Doc1
Doc2
Doc5
Doc6
Doc3
Doc4
My Solr request looks like this:
bq: "(*:* AND -inv:"0")^999.0" & defType: "edismax"
Here 999 is the boost I gave to demote the results.
This boost query works fine: it moves all documents with inv=0 down to the bottom.
But when I add &sort=name asc to the Solr query, it prioritizes sort over bq. I am seeing the results below with "name asc":
Doc1 : name=apple , Inv=2
Doc2 : name=ball , Inv=1
Doc3 : name=cat , Inv=0
Doc4 : name=dog , Inv=0
Doc5 : name=fish , Inv=4
Doc6 : name=Goat , Inv=5
Can anyone please help me out?
Sort will override the boost.
So you either move your sort into the boost, by mapping that condition into boost values,
or you move your boost condition into the sort, using the query() syntax. This was one of the gems from the Lucene/Solr Revolution 2016 presentation by hoss (click start presentation):
qq = Harry
q = +{!edismax v=$qq}
qf = title actor writer director keywords
sort = query($title_sort,0) desc, title asc
title_sort = {!field f=title v=$qq}
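Applied to the inv field from the question, the same query() pattern might look like the sketch below. It is an assumption-laden illustration: the parameter name inv_sort is made up, inv is assumed to be a numeric field with non-negative values so that inv:[1 TO *] matches every in-stock document, and the Python sort only mimics the intended ordering with a case-insensitive name comparison.

```python
from urllib.parse import urlencode

docs = [
    {"id": "Doc1", "name": "apple", "inv": 2},
    {"id": "Doc2", "name": "ball", "inv": 1},
    {"id": "Doc3", "name": "cat", "inv": 0},
    {"id": "Doc4", "name": "dog", "inv": 0},
    {"id": "Doc5", "name": "fish", "inv": 4},
    {"id": "Doc6", "name": "Goat", "inv": 5},
]


def demote_out_of_stock(documents):
    """Mimic sort=query($inv_sort,0) desc, name asc in plain Python."""
    # In-stock documents (inv > 0) sort first, then ties break on name.
    return sorted(documents, key=lambda d: (0 if d["inv"] > 0 else 1, d["name"].lower()))


order = [d["id"] for d in demote_out_of_stock(docs)]

# Hypothetical request parameters following the pattern shown above.
params = {
    "q": "*:*",
    "sort": "query($inv_sort,0) desc, name asc",
    "inv_sort": "inv:[1 TO *]",  # assumed: matches every document with inv >= 1
}
query_string = urlencode(params)
```

Because the demotion lives in the first sort clause, it no longer depends on score, so adding name asc as the second clause does not cancel it.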
The default sorting criterion in Solr is score desc, where score is a virtual field that represents the document's score.
Once you pass &sort=name asc, it overrides the default sorting.
A possible solution might be something like this: &sort=score desc, name asc. This literally means: sort by score first, and for documents with equal scores break the tie by name ascending.
It should work as long as doc1, doc2, doc5 and doc6 have equal scores.
If it is not the case - then check out this Solr Wiki link for more details how to penalize docs with inv:0.

Solr - unexpected sorting order and sortMissingLast

I've got text type defined as below:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
...
And a couple of fields use this type. One of them is a title field, which is always defined and is neither missing nor empty for any document. When sorting by this field, either asc or desc, Solr would not return documents in the requested order but in a seemingly random one. Only after adding sortMissingLast="true" to the type declaration was the sort order correct.
Can anybody explain why this is so? In my understanding, sortMissingLast shouldn't have any effect here, because a) it's connected with insertion of documents and b) all documents in my collection have this field defined.
Reading further:
If sortMissingLast="true", then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).
I do indeed have other fields that use the same text type, however all of them are present. They might be empty, but they're present within the document.
I tested a sample index with your fieldType (text). With title asc, the response was:
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"sort":"title asc",
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":10,"start":0,"docs":[
{
"id":["123"],
"title":"awesome designs_takeaway",
"lastmodified":"f75a2e26-cb41-4028-abb2-bcd7f61e4f9e",
"_version_":1538551521252212736},
{
"id":["124"],
"title":"breathtaking_designs takeaway",
"lastmodified":"170b3857-d906-44df-950c-547c25b4e594",
"_version_":1538551543494606848},
{
"id":["125"],
"title":"curtain raiser",
"lastmodified":"ea7149d5-449f-4d69-919b-617b90420381",
"_version_":1538573292313509888},
{
"id":["126"],
"title":"defying gravity_008",
"lastmodified":"82844b75-24ba-4b2f-be20-9bb3fe83e6b1",
"_version_":1538551590630195200},
{
"id":["127"],
"title":"emancipation_of the poor",
"lastmodified":"d19482a5-1666-4d4e-a40e-eb93c00eca7e",
"_version_":1538551627310432256},
{
"id":["128"],
"title":"functioning of the-metadata",
"lastmodified":"7b07f281-1268-48cc-aee6-7a6636702ba5",
"_version_":1538551653171462144},
{
"id":["130"],
"title":"graphics enhancer 101",
"lastmodified":"67fd79d6-2ae5-4597-b2e1-128bfd815b67",
"_version_":1538551680471138304},
{
"id":["131"],
"title":"half-hearted attempt",
"lastmodified":"abb4707c-8392-4595-aaeb-fbf6d4f098b1",
"_version_":1538551699761790976},
{
"id":["132"],
"title":"INK jet corporation",
"lastmodified":"b29ba3af-f3da-49d1-bd45-f7d277c53cff",
"_version_":1538551727666495488},
{
"id":["136"],
"title":"xamarin",
"filecontent":"bolshevik",
"lastmodified":"af8a1445-e693-4bac-9ac8-84fa2c9b838d",
"_version_":1538571040880328704}]
}}
With title desc, the response was:
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"sort":"title desc",
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":10,"start":0,"docs":[
{
"id":["136"],
"title":"xamarin",
"filecontent":"bolshevik",
"lastmodified":"af8a1445-e693-4bac-9ac8-84fa2c9b838d",
"_version_":1538571040880328704},
{
"id":["132"],
"title":"INK jet corporation",
"lastmodified":"b29ba3af-f3da-49d1-bd45-f7d277c53cff",
"_version_":1538551727666495488},
{
"id":["131"],
"title":"half-hearted attempt",
"lastmodified":"abb4707c-8392-4595-aaeb-fbf6d4f098b1",
"_version_":1538551699761790976},
{
"id":["130"],
"title":"graphics enhancer 101",
"lastmodified":"67fd79d6-2ae5-4597-b2e1-128bfd815b67",
"_version_":1538551680471138304},
{
"id":["128"],
"title":"functioning of the-metadata",
"lastmodified":"7b07f281-1268-48cc-aee6-7a6636702ba5",
"_version_":1538551653171462144},
{
"id":["127"],
"title":"emancipation_of the poor",
"lastmodified":"d19482a5-1666-4d4e-a40e-eb93c00eca7e",
"_version_":1538551627310432256},
{
"id":["126"],
"title":"defying gravity_008",
"lastmodified":"82844b75-24ba-4b2f-be20-9bb3fe83e6b1",
"_version_":1538551590630195200},
{
"id":["125"],
"title":"curtain raiser",
"lastmodified":"ea7149d5-449f-4d69-919b-617b90420381",
"_version_":1538573292313509888},
{
"id":["124"],
"title":"breathtaking_designs takeaway",
"lastmodified":"170b3857-d906-44df-950c-547c25b4e594",
"_version_":1538551543494606848},
{
"id":["123"],
"title":"awesome designs_takeaway",
"lastmodified":"f75a2e26-cb41-4028-abb2-bcd7f61e4f9e",
"_version_":1538551521252212736}]
}}
As you can see, I am getting the expected results. I used Solr v5.3.2. Your text type does not split the text into multiple tokens, so it is a good candidate for sorting, and the field type itself is unlikely to be the problem.
The sortMissingLast and sortMissingFirst parameters serve a different purpose. I did try them while attempting to replicate your observations, but I saw only the expected results. Since you say every document has a title field, I also kept a title field in all my documents, so sortMissingLast and sortMissingFirst had no effect: they only influence documents that lack the field. Your results should not deviate from what I got.
That leaves the possibility that your Solr version has a bug. If you are not using the same version as I am, try your documents on 5.3.2 or on some other version than yours. If you don't suspect a Solr bug, can you provide a subset of the titles that are sorted incorrectly, so we can see how they are being analyzed? Let me know if that helps :)
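As a sanity check on what the field type should produce: KeywordTokenizerFactory keeps the whole value as one token and LowerCaseFilterFactory lowercases it, so the expected order is that of the lowercased strings. A minimal Python sketch (not Solr code; the helper name sort_key is made up here) reproduces the order seen in the responses above:

```python
# Titles from the sample index, deliberately listed out of order.
titles = [
    "xamarin",
    "INK jet corporation",
    "curtain raiser",
    "awesome designs_takeaway",
    "half-hearted attempt",
    "graphics enhancer 101",
    "breathtaking_designs takeaway",
    "functioning of the-metadata",
    "defying gravity_008",
    "emancipation_of the poor",
]


def sort_key(title):
    """One lowercased token per value, as the keyword + lowercase analysis would yield."""
    return title.lower()


ascending = sorted(titles, key=sort_key)
descending = sorted(titles, key=sort_key, reverse=True)

# Without lowercasing, the uppercase title would sort ahead of all the others.
case_sensitive = sorted(titles)
```

If a real index returns a different order for the same titles, comparing it against this list can help narrow down which values are being analyzed differently.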

Elasticsearch similarity match score for set of terms

Is there a way to query for similarity (match score) for set of terms in elasticsearch?
Simple example:
Data:
doc1:{
"tags":["tag1", "tag2", "tag3", "tag4"]
}
doc2:{
"tags":["tag1", "tag2", "tag4"]
}
Query:
criteria:{
"tags":["tag1","tag2","tag3"]
}
Result
Result:{
doc1 - match 100%
doc2 - match 66.6%
}
Explanation:
doc1 has all tags that are present in search
doc2 has 2 of 3 tags that are present in search
So basically I need a query that returns a list of documents ordered by match, where match = how similar the tags in the document are to the tags in the query. No fuzziness is needed. Returning a percentage is just an example; points or some other unit is fine. The number of tags can vary.
I am designing the system, so I can store the data in any format that works for Elasticsearch. I looked at the docs but probably missed this type of search.
Many thanks for help.
Improvements
Is it possible to specify custom weight of match for each tag?
I.e. tag1 - 100points (or 20%), tag2 - 200 points (or 40%).
Yes, you need the similarity module
Not sure about weighted match, maybe the boost attribute?
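To pin down the score the question is asking for, here is a small sketch in plain Python. It is not an Elasticsearch query and makes no claim about how Elasticsearch would compute it; the function name match_score and the weights argument are made up for illustration, and the weight of 200 for tag3 is an assumption, since the question only gives weights for tag1 and tag2.

```python
def match_score(doc_tags, query_tags, weights=None):
    """Fraction of the (optionally weighted) query tags that the document contains."""
    weights = weights or {}
    doc_tags = set(doc_tags)
    # Tags without an explicit weight count as 1.0.
    total = sum(weights.get(tag, 1.0) for tag in query_tags)
    if total == 0:
        return 0.0
    matched = sum(weights.get(tag, 1.0) for tag in query_tags if tag in doc_tags)
    return matched / total


doc1 = ["tag1", "tag2", "tag3", "tag4"]
doc2 = ["tag1", "tag2", "tag4"]
query = ["tag1", "tag2", "tag3"]

unweighted_doc1 = match_score(doc1, query)
unweighted_doc2 = match_score(doc2, query)

# Weighted variant; the tag3 weight is assumed for the sake of the example.
tag_weights = {"tag1": 100, "tag2": 200, "tag3": 200}
weighted_doc2 = match_score(doc2, query, tag_weights)
```

Here doc1 matches all three query tags and doc2 matches two of them, which corresponds to the 100% and 66.6% in the question.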

Solr document Scoring/Boosting not working as expected

We have integrated Solr search with a .NET project, but we are facing some issues related to the document boosting or scoring feature of Solr.
Problem: Solr is not returning scores according to term frequency in the document.
E.g. we have created four documents whose Title contains the term "Link", and Solr has returned the scores below:
1)Link ==> 6.037953
2)Link Link Link Link Link ==> 5.9249415
3)Link Link ==> 5.374235
4)Link Link Link ==> 5.2746024
Can anyone please help me with this Solr scoring or boosting issue?
Score calculation in Solr is fairly complex. Start with the basic equation:
score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)^2 · t.getBoost() · norm(t,d) )
The tf parameter represents term frequency; its value is the square root of the frequency of the term.
You also have norm (aka fieldNorm), which is used in the fieldWeight calculation. Let's take your example:
Link Link Link Link Link
Your score will be calculated like this (you can see this by adding the debugQuery parameter):
5.9249415 = fieldWeight, product of:
  2.236068 = tf(freq=5.0), with freq of:
    5.0 = termFreq=5.0
  idf (which will be the same for all your scores)
  0.4375 = fieldNorm(doc=177)
Link
6.037953 = fieldWeight, product of:
  1.0 = tf(freq=1.0), with freq of:
    1.0 = termFreq=1.0
  idf (which will be the same for all your scores)
  1.0 = fieldNorm
Here, the Link document has a better score than the other because fieldWeight is the product of tf, idf and fieldNorm. The last of these is higher for the Link document because it contains only one term.
As the documentation says:
lengthNorm - computed when the document is added to the index in
accordance with the number of tokens of this field in the document, so
that shorter fields contribute more to the score.
The more terms you have in a field, the lower fieldNorm will be.
Be careful with the value of this field.
So, to conclude, the score is not calculated from the term frequency alone but also from the number of terms you have in the field.
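The trade-off between tf and fieldNorm can be checked with a few lines of arithmetic. This is a minimal sketch using only the tf and fieldNorm values shown in the debug output above; idf is left out because it is the same factor for every document, and the function name relative_field_weight is made up for this example.

```python
import math


def tf(freq):
    """Term frequency factor: the square root of the raw frequency."""
    return math.sqrt(freq)


def relative_field_weight(freq, field_norm):
    """tf * fieldNorm, i.e. fieldWeight with the shared idf factor left out."""
    return tf(freq) * field_norm


# "Link": one occurrence, fieldNorm 1.0 (from the debug output).
one_link = relative_field_weight(1, 1.0)

# "Link Link Link Link Link": five occurrences, fieldNorm 0.4375.
five_links = relative_field_weight(5, 0.4375)
```

The higher tf of the five-term title is more than offset by its lower fieldNorm, so its product ends up slightly below that of the single-term title, which matches the ranking in the question.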
