Explain scoring in Lucene when sorting is involved

I am getting a null value in topDocs.scoreDocs for some documents while searching a Lucene index.
Please explain the value in [ ] in the topDocs.scoreDocs output.
SortField sortFieldObj = new SortField(sortField, SortField.STRING, sortOrder);
Sort sort = new Sort(sortFieldObj);
TopDocs topDocs = searcher.search(query, null, sizeNeeded, sort);
Document docNew = searcher.doc(topDocs.scoreDocs[i].doc);
System.out.println(topDocs.scoreDocs[i]);
output:
doc=2 score=NaN[null]
doc=44 score=NaN[testString]

The reason is that, by supplying your own sort, you indirectly told Lucene to ignore its document scores. Scoring is normally what ranks the top docs, but you chose to retrieve docs in the sort order you specified, hence NaN. The value in [ ] is the hit's value for the sort field, so [null] means that document has no value for that field.
If you want Lucene to compute scores even though you specified your own sort order, use another overload of search:
search(Query query, Filter filter, int n, Sort sort, boolean doDocScores, boolean doMaxScore)
If doDocScores is true, the score of each hit will be computed and returned.
If doMaxScore is true, the maximum score over all collected hits will be computed.
So you would do something like: searcher.search(query, null, sizeNeeded, sort, true, true);
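The printed lines from the question can be reproduced with a toy model. This is only a sketch, not Lucene's code: `SortedHit` is a hypothetical stand-in for the hit object returned by a sorted search, assuming the brackets hold one value per sort field and the score stays NaN when scores are not tracked.

```java
import java.util.Arrays;

// Hypothetical stand-in for a hit returned by a sorted search (not a Lucene class).
public class SortedHit {
    final int doc;
    final float score;         // NaN when scores are not computed
    final Object[] sortValues; // one entry per sort field; null if the document has no value

    public SortedHit(int doc, float score, Object[] sortValues) {
        this.doc = doc;
        this.score = score;
        this.sortValues = sortValues;
    }

    @Override
    public String toString() {
        // Arrays.toString renders the sort values as "[v1, v2, ...]"
        return "doc=" + doc + " score=" + score + Arrays.toString(sortValues);
    }

    public static void main(String[] args) {
        System.out.println(new SortedHit(2, Float.NaN, new Object[] { null }));
        System.out.println(new SortedHit(44, Float.NaN, new Object[] { "testString" }));
    }
}
```

Read this way, `doc=2 score=NaN[null]` is a hit with no score and no value in the sort field, which is separate from the question of whether scores are computed.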


Represent enum in Elastic Search for sorting

I have a use case where an enum for difficulty level (EASY, MEDIUM, DIFFICULT) must be represented in Elasticsearch with support for sorting on this field. If the field is indexed as a string, sorting will not work as expected, because strings sort alphabetically rather than in enum order.
One way to support this is to index an integer value for each enum constant in ES and map it back to the string value when ES returns the sorted results.
Are there alternatives in which ES itself sorts in enum order while the field is indexed as a string? Can I specify a custom sort function for a field? function_score is an option, but given that I have to sort by enum ordering, is there a better way than defining a custom function_score?
In my use case there are multiple such enums defining scales across dimensions such as difficulty, height (low, medium, high), grades (good, average, poor), etc. Both of the above solutions require custom work whenever a new dimension is introduced. Can either approach be generalized?
You can check the answer to the same question here. You will need to use script_score, as below:
GET /my-index-2/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "if (doc['field name'].value == 'EASY') { return 2; } else if (doc['field name'].value == 'MEDIUM') { return 1; } else if (doc['field name'].value == 'DIFFICULT') { return 0; } return 0;"
      }
    }
  }
}
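On the question of generalizing across dimensions: the integer-per-value approach becomes generic once the rank table is built from an ordered list rather than hand-written per enum. The following is a minimal sketch under that assumption; `ScaleRanks`, `ranksOf`, and `sortByRank` are hypothetical names, and the sort here runs in memory only to show the ordering that the indexed integer would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScaleRanks {
    // Builds a value -> rank table from values listed in their intended order.
    public static Map<String, Integer> ranksOf(List<String> orderedValues) {
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < orderedValues.size(); i++) {
            ranks.put(orderedValues.get(i), i);
        }
        return ranks;
    }

    // Returns a copy sorted by rank; values missing from the table sort last.
    public static List<String> sortByRank(List<String> values, Map<String, Integer> ranks) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingInt(v -> ranks.getOrDefault(v, Integer.MAX_VALUE)));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Integer> difficulty = ranksOf(Arrays.asList("EASY", "MEDIUM", "DIFFICULT"));
        // prints [EASY, EASY, MEDIUM, DIFFICULT]
        System.out.println(sortByRank(Arrays.asList("DIFFICULT", "EASY", "MEDIUM", "EASY"), difficulty));
    }
}
```

A new dimension such as height then needs only its ordered list, e.g. `ranksOf(Arrays.asList("low", "medium", "high"))`, and the rank is what would be indexed alongside the string.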

Unexpected Solr scores for documents boosted by the same boost values

I have 2 documents:
{
title: "Popular",
registrations_count: 700,
is_featured: false
}
and
{
title: "Unpopular",
registrations_count: 100,
is_featured: true
}
I'm running this Solr query (via the Ruby Sunspot gem):
fq: ["type:Event"],
sort: "score desc",
q: "*:*",
defType: "edismax",
fl: "* score",
bq: ["registrations_count_i:[700 TO *]^10", "is_featured_bs:true^10"],
start: 0, rows: 30
or, for those who are more used to Ruby:
Challenge.search do
  boost(10) do
    with(:registrations_count).greater_than_or_equal_to(700)
  end
  boost(10) do
    with(:is_featured, true)
  end
  order_by :score, :desc
end
One document matches the first boost query, and the other matches the other boost query. They have the same boost value.
What I would expect is that both documents get the same score. But they don't; they get something like this:
1.2011336 # score for 'unpopular' (featured)
0.6366436 # score for 'popular' (not featured)
I also checked whether they get the exact same score if I boost an attribute that they both have in common, and they do. I also tried changing the 700 value to something like 7000, but it makes no difference (which makes total sense).
Can anyone explain why they get such a different score, while they both match one of the boost queries?
I'm guessing the confusion stems from the assumption that the queries are boosted to the same value. That's not the case: each bq clause contributes its own score, which is then amplified 10x by your ^10. Since the two clauses score differently on their own, their boosted contributions differ too.
The bq is additive: the score from the boost query is added to the score of the document (while boost is multiplicative: the score is multiplied by the value of the boost query).
If you instead want to add the same score value to the original query whenever either clause matches, you can use ^=10, which makes the clause constant scoring (its score will be 10, regardless of the regular score of the document).
Also, if you want to apply these factors independently of each other (instead of as a single, merged score with contributions from both factors), use multiple bq entries instead.
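The arithmetic behind the difference between ^10 and ^=10 can be sketched as follows. This is a simplified model, not Solr's scoring code: the clause scores are made-up numbers, the class and method names are hypothetical, and real scores also include normalization factors that are ignored here.

```java
public class BoostQuerySketch {
    // bq clause with ^boost: the clause's own score is multiplied by the boost, then added.
    public static double withBoost(double mainScore, double clauseScore, double boost) {
        return mainScore + clauseScore * boost;
    }

    // bq clause with ^=value: the clause contributes the same constant whenever it matches.
    public static double withConstantScore(double mainScore, double constant) {
        return mainScore + constant;
    }

    public static void main(String[] args) {
        double mainScore = 1.0;    // hypothetical score of the main query for both documents
        double rangeClause = 0.03; // hypothetical raw score of the range clause
        double flagClause = 0.09;  // hypothetical raw score of the is_featured clause

        // Same ^10, different totals, because the raw clause scores differ.
        System.out.println(withBoost(mainScore, rangeClause, 10));
        System.out.println(withBoost(mainScore, flagClause, 10));

        // With ^=10 both documents receive the same addition.
        System.out.println(withConstantScore(mainScore, 10));
    }
}
```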

Make a count for distinct values in Elastic search

I have duplicate ids in my db and wish to get the count of distinct values only, similar to SELECT COUNT(DISTINCT column) FROM table in SQL.
public SearchSourceBuilder createQueryForCount(QueryBuilder queryBuilder, int start, boolean fetchSource, String field) {
    logger.info("Creating aggregation count ");
    QueryBuilder finalQuery = QueryBuilders.boolQuery().must(queryBuilder);
    AggregationBuilder aggregationCount = AggregationBuilders.terms("agg").field(USER_ID)
            .subAggregation(AggregationBuilders.topHits("top").explain(false).from(start))
            .subAggregation(AggregationBuilders.count("count").field(field));
    return new SearchSourceBuilder()
            .query(finalQuery)
            .fetchSource(fetchSource)
            .from(start)
            .aggregation(aggregationCount);
}
Is there a way to do a distinct count in Elasticsearch?
You should look at the cardinality aggregation. Javadocs are available here. Also, remember that Elasticsearch uses an approximation, trading accuracy for performance. You can control this to some extent using precision_threshold. A good explanation is available here.
To get the count of distinct values in Elasticsearch, use the cardinality aggregation.
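For reference, the quantity that the cardinality aggregation approximates is the exact distinct count below. This is an in-memory sketch with hypothetical names and sample ids, not an Elasticsearch call; it only pins down the semantics of COUNT(DISTINCT column).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class DistinctCount {
    // Exact number of distinct values; a set keeps each value once.
    public static long countDistinct(List<String> ids) {
        return new HashSet<>(ids).size();
    }

    public static void main(String[] args) {
        List<String> userIds = Arrays.asList("u1", "u2", "u1", "u3", "u2"); // sample data
        System.out.println(countDistinct(userIds)); // prints 3
    }
}
```

Keeping an exact set like this grows with the number of distinct values, which is the cost the approximation avoids.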

How to get the total documents count, containing a specific field, using aggregations?

I am moving from Elasticsearch 1.7 to 2.0. Previously, when calculating term facets, I also got the total count, which tells me in how many documents the field exists. This is how I was doing it:
TermsFacet termsFacet = (TermsFacet) facet;
termsFacet.getTotalCount();
It worked with multi-valued fields as well.
In the current version, the terms aggregation has nothing like a total count. I am getting the doc count inside each aggregation bucket, but summing those will not work for multi-valued fields.
Terms termsAggr = (Terms) aggr;
for (Terms.Bucket bucket : termsAggr.getBuckets()) {
    String bucketKey = bucket.getKey();
    totalCount += bucket.getDocCount();
}
Is there any way I can get the total count of the field from the terms aggregation?
I don't want to fire a separate exists filter query. I want the result in a single query.
I would use the exists query:
https://www.elastic.co/guide/en/elasticsearch/reference/2.x/query-dsl-exists-query.html
For instance to find the documents that contain the field user you can use:
{
"exists" : { "field" : "user" }
}
There is of course also a Java API:
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-term-level-queries.html#java-query-dsl-exists-query
QueryBuilder qb = existsQuery("name");
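To see why summing bucket doc counts overcounts for multi-valued fields, consider the sketch below. It is an in-memory model with hypothetical names, where each document is represented only by the list of values in the field: a document with two values lands in two buckets, whereas an exists-style count sees it once.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class FieldCountSketch {
    // Sum of per-term doc counts, i.e. what adding up every bucket's doc count yields.
    public static long sumOfBucketDocCounts(List<List<String>> docs) {
        Map<String, Long> buckets = new HashMap<>();
        for (List<String> values : docs) {
            for (String term : new HashSet<>(values)) { // a doc counts once per distinct term
                buckets.merge(term, 1L, Long::sum);
            }
        }
        long total = 0;
        for (long count : buckets.values()) {
            total += count;
        }
        return total;
    }

    // Number of documents that have at least one value in the field.
    public static long docsWithField(List<List<String>> docs) {
        long count = 0;
        for (List<String> values : docs) {
            if (!values.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("a", "b"),           // multi-valued document
                Arrays.asList("a"),
                Collections.<String>emptyList());  // field missing
        System.out.println(sumOfBucketDocCounts(docs)); // prints 3
        System.out.println(docsWithField(docs));        // prints 2
    }
}
```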

Why not use min_score with Elasticsearch?

New to Elasticsearch. I am interested in returning only the most relevant docs and came across min_score. The docs say "Note, most times, this does not make much sense" but don't provide a reason. So, why does it not make sense to use min_score?
EDIT: What I really want to do is return only documents that have a "score" higher than x. I have this:
data = {
'min_score': 0.9,
'query': {
'match': {'field': 'michael brown'},
}
}
Is there a better alternative to the above so that it only returns the most relevant docs?
thx!
EDIT #2:
I'm using minimum_should_match and it returns a 400 error:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;"
data = {
'query': {
'match': {'keywords': 'michael brown'},
'minimum_should_match': '90%',
}
}
I've used min_score quite a lot when trying to find documents that are a definitive match for a given set of input data, which is used to generate the query.
The score you get for a document depends on the query, of course. So I'd say try your query in many permutations (different keywords, for example), decide for each one which document is the first you would rather it didn't return, and make a note of each of those scores. If the scores are similar, this gives you a good guess at the value to use for your min_score.
However, you need to bear in mind that the score isn't just dependent on the query and the returned document; it also takes into account all the other documents that have data for the fields you are querying. This means that if you test your min_score value with an index of 20 documents, the score will probably change greatly when you try it on a production index with, for example, a few thousand documents or more. This change could go either way and is not easily predictable.
I've found that, for my matching uses of min_score, you need to create quite a complicated query and set of analysers to tune the scores for the various components of your query. But what is and isn't included is vital to my application, so you may well be happy with what it gives you when keeping things simple.
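The dependence of scores on the rest of the index can be illustrated with the inverse document frequency term alone. This is a sketch assuming the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the index in use may be configured with a different similarity (BM25, for example, uses another formula that also depends on index-wide statistics), and the document counts are made up.

```java
public class IdfSketch {
    // Classic TF-IDF inverse document frequency: 1 + ln(numDocs / (docFreq + 1)).
    public static double classicIdf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // The same term, found in 2 documents, weighs differently as the index grows.
        System.out.println(classicIdf(2, 20));   // small test index
        System.out.println(classicIdf(2, 5000)); // larger production-like index
    }
}
```

Because this factor feeds into every matching term's contribution, a min_score tuned on the small index would not carry over to the large one.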
I don't know if it's the best solution, but it works for me (Java):
// "Tiny" search to discover the max score;
// it is fast because it returns only 1 item.
SearchResponse firstResponse = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setSize(1)
        .execute()
        .actionGet();
// Get the max score and set minScore to 70% of it.
float maxScore = firstResponse.getHits().maxScore();
float minScore = maxScore * 0.7f; // float literal; a plain 0.7 is a double and would not compile
// Second round with the minimum score applied.
SearchResponse response = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setMinScore(minScore)
        .execute()
        .actionGet();
I search twice, but the first search is fast because it returns only 1 item; its max_score then sets the threshold for the second search.
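NOTE: minimum_should_match works differently. If you have 4 queries and you set minimum_should_match = 70%, it doesn't mean that item.score should be > 70%. It means that the item should match 70% of the queries; the number computed from the percentage is rounded down, so that is at least 2 of the 4 queries (75% would require 3). In a match query it is also set inside the field object, next to the query text, rather than beside the match clause, which is likely what causes the 400 error above.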
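How a percentage turns into a clause count can be sketched as below. This assumes positive percentages only and the documented behaviour that the computed number is rounded down; the class and method names are hypothetical and this is not Elasticsearch's implementation.

```java
public class MinimumShouldMatchSketch {
    // Number of optional clauses that must match for a positive percentage, rounded down.
    public static int requiredMatches(int optionalClauses, int percent) {
        return optionalClauses * percent / 100; // integer division truncates
    }

    public static void main(String[] args) {
        System.out.println(requiredMatches(4, 70)); // prints 2
        System.out.println(requiredMatches(4, 75)); // prints 3
        System.out.println(requiredMatches(2, 90)); // prints 1
    }
}
```

Note the last case: for the two-term query 'michael brown', 90% rounds down to 1 required term under this model, so it is not a relevance threshold either.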
