How can I multiply the score of two queries together in Elasticsearch?

In Solr I can use the query() function query to return a numerical score for a query, and I can use that in the context of a bf parameter, something like bf=product(query('cat'),query('dog')), to multiply two relevance scores together.
Elasticsearch has a search API that is generally more flexible to work with, but I can't figure out how I would accomplish the same feat. I can use _score in a script_score function of a function_score query, but that only gives me the _score of the main query. How can I incorporate the score of another query, and how can I multiply the scores together?

You could script a TF*IDF scoring function using a function_score query. Something like this (ignoring Lucene's query and length normalization):
"script": "tf = _index[field][term].tf(); idf = (1 + log ( _index.numDocs() / (_index[field][term].df() + 1))); return sqrt(tf) * pow(idf,2)"
You'd take the product of those function results for 'cat' and 'dog' and add them to your original query score.
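As a rough sketch of that shape (the field name body is a placeholder, and this assumes dynamic scripting is enabled), a function_score query can multiply the per-term script results via score_mode and add them to the main query score via boost_mode:

```json
{
  "query": {
    "function_score": {
      "query": { "match": { "body": "cat dog" } },
      "functions": [
        { "script_score": { "script": "tf = _index['body']['cat'].tf(); idf = (1 + log(_index.numDocs() / (_index['body']['cat'].df() + 1))); sqrt(tf) * pow(idf, 2)" } },
        { "script_score": { "script": "tf = _index['body']['dog'].tf(); idf = (1 + log(_index.numDocs() / (_index['body']['dog'].df() + 1))); sqrt(tf) * pow(idf, 2)" } }
      ],
      "score_mode": "multiply",
      "boost_mode": "sum"
    }
  }
}
```

score_mode controls how the function results are combined with each other (here, multiplied), and boost_mode controls how that product is combined with the original query score (here, added).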
Here's the full query gist.

Alternately, if you've got something in that bf that's heavyweight enough you'd rather not run it across the entire set of matches, you could use rescore requests to modify the score of the top N ranked ORIGINAL QUERY results using subsequent scoring passes with your (cat, dog, etc...) scoring-queries.
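A sketch of that rescore shape (field name, window size, and the phrase rescore query are all placeholders for your own scoring-queries):

```json
{
  "query": { "match": { "body": "cat dog" } },
  "rescore": {
    "window_size": 200,
    "query": {
      "rescore_query": {
        "match_phrase": { "body": "cat dog" }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 1.0,
      "score_mode": "multiply"
    }
  }
}
```

Only the top window_size hits per shard from the original query are re-scored, so an expensive scoring-query stays cheap.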

Related

Elasticsearch: How to search, sort, limit the results then sort again?

This isn't about multi-level sorting.
I need my results first selected by distance, limited to 50, then those 50 sorted by price.
select *
from
(
  select top 50 * from mytable order by distance asc
) t
order by price asc
Essentially, the second sort throws away the ordering of the inner sort, but the inner sort is used to home in on the top 50 results.
The other answers I've seen for this sort of question look at second-level sorting, which is not what I'm after.
BTW: I've looked at aggregations (top N results), but I'm not sure I can apply a sort to the aggregation results. I've also looked at rescore, but I don't know where to put my 'sorts'.
A top hits aggregation will allow you to sort on a separate field (price, in your case) independently of the main query sort on distance. See the documentation for how to specify sorting in the top hits aggregation.
It'll look a little like this (which assumes distance is a double type; if it's a geo-location type, use the documentation provided by Volodymyr Bilyachat.)
{
  "sort": [
    { "distance": "asc" }
  ],
  "query": {
    "match_all": {}
  },
  "size": 50,
  "aggs": {
    "top_price_hits": {
      "top_hits": {
        "sort": [
          { "price": { "order": "asc" } }
        ],
        "size": 50
      }
    }
  }
}
However, if there are only 50 results that you want from your primary query, why not just sort them client side in your application? That would be the more robust approach, as using a top hits aggregation for a secondary sort is a slight abuse of its purpose.
+1'ed the accepted answer, but I wanted to make sure you were aware of how search scoring can often deliver a better user experience than traditional sorting.
Based on your current strategy, one could say:
Distance is important, relatively speaking (e.g. top 50 closest) but not in absolute terms (e.g. must be within 50mi).
You only want to show 50 results.
You want those results to be sorted by price (or perhaps alphabetically).
However, if you find yourself trying to generalize about which result a searcher is most likely to choose, you may discover a function of price and distance (or other features) which better models the real-world likelihood of a searcher choosing a particular result.
E.g. Say you discover that
Users will pay more for the convenience of a nearby result
Users will travel greater distances for greater discounts
Then you could model a sample scoring function that generates a result ordering based on this relationship.
E.g. 1/price + 1/distance, which would generate a higher score as either price or distance decreases.
This could be generalized to P * 1/price + 1/distance, where P is a tuning coefficient expressing the relative importance of price vs distance.
Armed with this model, you could then write a function score query which would output ordered results with the optimal combinations of price and distance for your users.
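A minimal sketch of such a function_score query, assuming numeric price and distance fields, a tuning coefficient P of 2.0, and ignoring division-by-zero handling:

```json
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": "2.0 * (1 / doc['price'].value) + (1 / doc['distance'].value)"
      },
      "boost_mode": "replace"
    }
  }
}
```

With boost_mode set to replace, the ordering comes entirely from the price/distance model rather than from text relevance.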
As I see it, it would be better to select the top 50 using the size: 50 property in the query, ordered by distance, and then sort the results by price in your application.

How can I find the true score from Elasticsearch query string with a wildcard?

My ElasticSearch 2.x NEST query string search contains a wildcard:
Using NEST in C#:
var results = _client.Search<IEntity>(s => s
    .Index(Indices.AllIndices)
    .AllTypes()
    .Query(qs => qs
        .QueryString(qsq => qsq.Query("Micro*")))
    .From(pageNumber)
    .Size(pageSize));
Comes up with something like this:
$ curl -XGET 'http://localhost:9200/_all/_search?q=Micro*'
This code was derived from the Elasticsearch NEST documentation page on covariant search results. The results are covariant: they are of mixed types coming from multiple indices. The problem I am having is that all of the hits come back with a score of 1.
This is regardless of type or boosting. Can I boost by type or, alternatively, is there a way to reveal or "explain" the search result so I can order by score?
Multi-term queries like the wildcard query are given a constant score equal to the boost by default. You can change this behaviour using .Rewrite().
var results = client.Search<IEntity>(s => s
    .Index(Indices.AllIndices)
    .AllTypes()
    .Query(qs => qs
        .QueryString(qsq => qsq
            .Query("Micro*")
            .Rewrite(RewriteMultiTerm.ScoringBoolean)
        )
    )
    .From(pageNumber)
    .Size(pageSize)
);
With RewriteMultiTerm.ScoringBoolean, the rewrite method first translates each term into a should clause in a bool query and keeps the scores as computed by the query.
Note that this can be CPU intensive and there is a default limit of 1024 bool query clauses that can be easily hit for a large document corpus; running your query on the complete StackOverflow data set (questions, answers and users) for example, hits the clause limit for questions. You may want to analyze some text with an analyzer that uses an edgengram token filter.
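If you go the analysis route instead, a sketch of index settings with an edge ngram token filter (the analyzer/filter names and gram sizes here are illustrative, not prescriptive):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "prefix_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "prefix_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "prefix_filter"]
        }
      }
    }
  }
}
```

Indexing with an analyzer like this lets a plain match query on "micro" find "Microsoft" without any wildcard rewriting at search time.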
Wildcard searches will always return a score of 1.
You can boost by a particular type. See this:
How to boost index type in elasticsearch?

How to use the elasticsearch java api for dynamic searches?

So I'm trying to use elasticsearch for dynamic query building. Imagine that I can have a query like:
a = "something" AND b >= "other something" AND (c LIKE "stuff" OR c LIKE "stuff2" OR d BETWEEN "x" AND "y");
or like this:
(c>= 23 OR d<=43) AND (a LIKE "text" OR a LIKE "text2") AND f="text"
Should I use the QueryBuilder or the FilterBuilder, and how do you combine both? The official documentation says that for exact values we should use the filter approach, so I assume I should use filters for equality comparisons? What about dates and numbers: should I use a filter or a query?
For the Like/Equals for the number/number problem I tried this:
@Field(type = FieldType.String, index = FieldIndex.analyzed, pattern = "(\\d+\\/\\d+)|(\\d+\\/)|(\\d+)|(\\/\\d+)")
public String processNumber;
The pattern is meant to match the structure number + slash + number, but also a bare number and number + slash.
But whether I use a term filter or a match query, I can't restrict hits to the exact structure like 20/2014; if I type 20 I still get hits with the term filter.
Query is the main component when you search for something; it takes ranking into consideration along with features such as stemming, synonyms and other things. A filter, on the other hand, just filters the result set you get from your query.
I suggest that if you don't care about ranking you use filters, because they are faster. Otherwise, use queries.
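As an illustration of that split, using the field names from the question (the bool must/filter syntax below is the Elasticsearch 2.x form; older versions expressed the same thing with a filtered query): scored full-text clauses go under should/must, while exact values, numbers and dates go under filter:

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "a": "text" } },
        { "match": { "a": "text2" } }
      ],
      "minimum_should_match": 1,
      "filter": [
        { "term": { "f": "text" } },
        { "range": { "c": { "gte": 23 } } }
      ]
    }
  }
}
```

The filter clauses do not contribute to the score and can be cached, which is why the exact-value comparisons belong there.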

Why not use min_score with Elasticsearch?

New to Elasticsearch. I am interested in only returning the most relevant docs and came across min_score. The docs say "Note, most times, this does not make much sense" but don't provide a reason. So, why does it not make sense to use min_score?
EDIT: What I really want to do is only return documents that have a higher than x "score". I have this:
data = {
    'min_score': 0.9,
    'query': {
        'match': {'field': 'michael brown'},
    },
}
Is there a better alternative to the above so that it only returns the most relevant docs?
thx!
EDIT #2:
I'm using minimum_should_match and it returns a 400 error:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;"
data = {
    'query': {
        'match': {'keywords': 'michael brown'},
        'minimum_should_match': '90%',
    },
}
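For what it's worth, the 400 in EDIT #2 is likely a parse error caused by the placement: minimum_should_match is an option of the match query itself, so it belongs inside the field's object rather than as a sibling of it:

```json
{
  "query": {
    "match": {
      "keywords": {
        "query": "michael brown",
        "minimum_should_match": "90%"
      }
    }
  }
}
```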
I've used min_score quite a lot for trying to find documents that are a definitive match to a given set of input data - which is used to generate the query.
The score you get for a document depends on the query, of course. So I'd say try your query in many permutations (different keywords, for example), decide for each which document is the first you would rather it didn't return, and make a note of their scores. If the scores are similar, this gives you a good guess at the value to use for your min_score.
However, you need to bear in mind that the score isn't just dependent on the query and the returned document; it considers all the other documents that have data for the fields you are querying. This means that if you tune your min_score value against an index of 20 documents, the scores will probably change greatly when you try it on a production index with, say, a few thousand documents or more. This change could go either way, and is not easily predictable.
I've found for my matching uses of min_score, you need to create quite a complicated query, and set of analysers to tune the scores for various components of your query. But what is and isn't included is vital to my application, so you may well be happy with what it gives you when keeping things simple.
I don't know if it's the best solution, but it works for me (java):
// "tiny" search to discover the maxScore;
// it is fast because it returns only one hit
SearchResponse probe = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setSize(1)
        .execute()
        .actionGet();
// take 70% of the max score as the cut-off
float maxScore = probe.getHits().maxScore();
float minScore = maxScore * 0.7f;
// second round with the minimum score
SearchResponse response = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setMinScore(minScore)
        .execute()
        .actionGet();
I search twice, but the first search is fast because it returns only one item, which is enough to get the max_score.
NOTE: minimum_should_match works differently. If you have 4 should clauses and you say minimum_should_match = 70%, it doesn't mean that item.score should be > 70%. It means the item should match 70% of the clauses, i.e. at least 3 of the 4.

Why does ElasticSearch give a lower score to a term when it's with more terms?

I have an index (on a local testing cluster) with some fields using the simple analyzer.
When I search for a term, results where the term appears in a field containing more terms get a lower score. Why is that? I couldn't find any reference.
For example, 'koala' in a boolean search returns:
(title 'a koala'): score 0.04500804
(title 'how the Koala 1234'): score 0.02250402
In the query explanation, the fieldNorm is 1.0 in the first case, and 0.5 in the second.
Is it possible to return a score independent of the number of terms in the field?
If you want a bool must term query on koala where all matching documents score equally, you can use the constant score query to effectively remove relevance scoring from your query.
Here is a runnable example
http://sense.qbox.io/gist/21ae7b7e743dc30d66309f2a6b93043ded4ee401
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html
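For reference, the shape of such a query (assuming the field is called title) is simply a constant_score wrapper around a term filter:

```json
{
  "query": {
    "constant_score": {
      "filter": { "term": { "title": "koala" } },
      "boost": 1.0
    }
  }
}
```

Every hit then receives the fixed boost as its score, so field length (fieldNorm) no longer influences the ordering.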