Top 10% of results with sort - elasticsearch

I'm looking for a setup that returns the top 10% of the results of a certain query. We then also want to sort that subset.
Is there an easy way to do this?
Can anyone provide a simple example for this?
I was thinking of scaling the result scores between 0 and 1.0 and basically specifying min_score as 0.9.
I tried creating function_score queries, but those seem a bit complex for a requirement as simple as this one. I was also not sure how sorting would affect the results, since I want the sort to always work on the 10% most relevant articles.
Thanks,
Peter

Since you want to slice the response as a percentage of the overall document count, you need to know that count anyway. The from / size parameters will then cut off the required amount at query time.
Given this, the easiest way to achieve your goal seems to be to make 2 queries:
1. Run a filtered query with all filters, no queries, and search_type=count to get the overall document count.
2. Perform your regular matching query, applying {"from": 0, "size": count/10} with the count taken from the first response.
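The two steps above can be sketched roughly as follows. This is only a sketch that builds the second request body from a count you already have; it does not talk to a cluster. The `title` field and the helper names are hypothetical. It also assumes that a server-side sort would be applied before from / size cut the result set, so the subset is sorted locally after it has been fetched by relevance.

```python
def top_percent_size(total_hits, percent=10):
    # Integer arithmetic mirrors the count/10 from the answer (rounds down).
    return (total_hits * percent) // 100


def build_search_body(query, total_hits, percent=10):
    # Second request: the regular matching query, limited to the top slice.
    return {"query": query, "from": 0, "size": top_percent_size(total_hits, percent)}


def sort_subset(hits, field, reverse=False):
    # Sort the already-fetched top slice locally, so relevance decides which
    # documents are in the subset and `field` only decides their order.
    return sorted(hits, key=lambda hit: hit["_source"][field], reverse=reverse)


query = {"match": {"title": "elasticsearch"}}  # hypothetical query
body = build_search_body(query, total_hits=250)  # 250 would come from the count request
print(body["size"])  # 25
```

Rounding down means that fewer than 10 matching documents yield an empty slice; whether to round up instead is a design choice the answer leaves open.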
As for tweaking the scoring: to me it seems like a bad idea, because getting multiple documents with the same score is a pretty common situation. Cutting the dataset by min_score will therefore probably give you skewed data.
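To illustrate how a score cut-off can skew the subset, here is a small sketch with made-up scores. It rescales them to the 0 to 1.0 range as proposed in the question and applies a 0.9 threshold; the score list and the function name are purely illustrative.

```python
def scale_and_cut(scores, min_score=0.9):
    """Min-max scale scores to [0, 1] and keep those at or above min_score."""
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are tied, so every document survives the cut.
        return list(scores)
    return [s for s in scores if (s - low) / (high - low) >= min_score]


# Ten made-up scores: six documents tie for the best score.
scores = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 1.5, 1.0, 0.5]
kept = scale_and_cut(scores)
print(len(kept), "of", len(scores))  # 6 of 10, not the 1 that 10% would give
```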

Related

Suggestion for limiting fuzzy search suggestion results

I've implemented a fuzzy search algorithm based on an N closest neighbors query for given search terms. Each query returns a pre-set number of raw results, in my case a maximum of 200 hits per query, sorted by score in descending order.
The raw search already produces good results, but in some rather rare cases they are not good enough. I've therefore added a post-processing layer, or rather another metric, based on the Levenshtein-Damerau algorithm, which measures the word / phrase distance between the query term(s) and the raw results. The lower the resulting score the better; 0.0 would be an exact match.
Using the Levenshtein-Damerau post-processing algorithm I sort the results ascending, from the lowest to the highest.
The quality of matches is amazingly good and all relevant hits are ranked to the top. Still I have the bulk of 200 hits from the core search and I am looking for a smart way to limit the final result set down to a maximum of 10-20 hits. I could just add a static limit as it is basically done. But I wonder if there is a better way to do this based on the individual metrics I get with each search result set.
I have the following result metrics:
The result score of the fuzzy core search, a value of type float/double. The higher the better.
The Levenshtein-Damerau post-processing weight, another value of type float/double. The lower the better.
And finally each result set knows its minimum and maximum score limits. Using the Levenshtein-Damerau post processing algorithm on the raw results I take the min/max values from there.
The only idea I have is to take a sub-range out of the result set, something like the top 20% of results, which is simple to achieve. More interesting would be to analyse the top result scores/metrics and find some indication of where it gets too fuzzy. I could use the metrics I gather inside my Levenshtein-Damerau layer, namely the word- and phrase-distance parameters; these values, along with 2 other parameters, make up the final distance score. For example, if the word and/or phrase distance exceeds a certain threshold, skip the result. This way is a bit more complicated but possible.
Well, I wonder if there are more options I could use that I just don't see. Once again, I would like to avoid a static limit and make the cut-off flexible for each individual result set.
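The two ideas in the previous paragraph (a threshold on the word / phrase distance, plus a top 20% sub-range) could be combined roughly like this. It is only a sketch: the dictionary keys `distance`, `word_distance` and `phrase_distance`, as well as the threshold values, are hypothetical stand-ins for the metrics described above.

```python
def limit_results(results, max_word_distance, max_phrase_distance, top_percent=20):
    # Skip results whose word or phrase distance exceeds its threshold.
    kept = [
        r for r in results
        if r["word_distance"] <= max_word_distance
        and r["phrase_distance"] <= max_phrase_distance
    ]
    # A lower final distance is better, so sort ascending.
    kept.sort(key=lambda r: r["distance"])
    # Never return more than the top `top_percent` of the raw result set.
    cap = (len(results) * top_percent) // 100
    return kept[:cap]


# Ten made-up results whose distances grow with the id.
results = [
    {"id": i, "distance": float(i), "word_distance": i, "phrase_distance": i}
    for i in range(10)
]
limited = limit_results(results, max_word_distance=3, max_phrase_distance=3)
print([r["id"] for r in limited])  # [0, 1]
```

With this shape the thresholds decide how many hits survive on a good result set, and the percentage only acts as an upper bound.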
Any hints or further ideas are greatly appreciated.

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both of them. What is the difference between these two, and which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don’t have to calculate the relevance _score for each document: the answer is just a boolean “Yes, the document matches the filter” or “No, the document does not match the filter”.
the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is whether you use the relevance score in any way. If not, filters are the way to go. If you do, filters may still be of use but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for the language. The filter would cache the document ids for all documents in English, and the query could then be applied to that subset.
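As a rough sketch of that combination, the request body below puts the text search in a scoring clause and the language and date range in non-scoring filter clauses. It assumes an Elasticsearch version where the bool query accepts a `filter` clause (2.0 or later); on the older versions covered by the links below, the filtered query plays the same role. The `title` and `published` field names are hypothetical, and `language` is assumed to be indexed without analysis so that a term filter on "EN" matches.

```python
import json


def build_language_search(text, language, date_from, date_to):
    return {
        "query": {
            "bool": {
                # Scored part: contributes to the relevance _score.
                "must": [{"match": {"title": text}}],
                # Non-scored part: plain yes/no matching.
                "filter": [
                    {"term": {"language": language}},
                    {"range": {"published": {"gte": date_from, "lte": date_to}}},
                ],
            }
        }
    }


body = build_language_search("range filter", "EN", "2014-01-01", "2014-12-31")
print(json.dumps(body, indent=2))
```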
I'm oversimplifying a bit, but those are the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html

Is there a way to find out the max theoretical score from an elasticsearch query?

I have a search that's purely based on attributes rather than any text searching. I'd like to know if there's a way to interpret the scores returned from elasticsearch so as to determine whether a match is good or not (or how good it is on a scale of 0-100).
The scores obviously change based on the query: if I ask for things that have 5 attributes using an OR search, those that have all 5 get a high score, whilst those with 1 get a lower score (which is fine). I'd like to know if there's an easy way to ask ES: given this query, what's the max score anything could give me?
I could then say that this result is a 90% match to your query and this one is a 50% match, rather than that this one scored 1.746373.
I'd rather not double-check each result against the search to work this out.

Setting priority in lucene.net results

I am using a Lucene.Net query like this
(PropertyID:1 OR PropertyID:25 OR PropertyID:5 OR PropertyID:10 OR PropertyID:15)
I want the results from Lucene.Net in the order of the PropertyID values I passed: for example, the first record should be for PropertyID 1, the second for 25, and the third for 5. Currently, however, Lucene.Net arranges the result set differently.
The order of fields in the query has no effect on sorting.
There are 2 ways to achieve the sorting you're looking for:
1. Use boosts in your query. You can boost PropertyID:1 higher than the rest so that these matches are scored higher and thus appear first in the results, then boost PropertyID:25 second highest, etc. For example:
(PropertyID:1^5 OR PropertyID:25^4 OR PropertyID:5^3 OR PropertyID:10^2 OR PropertyID:15)
This is simple to implement but may not work correctly if you include other criteria in your query, because those criteria will also affect the scoring.
2. Implement custom sorting via your own Comparator class. This may take quite a bit of work, especially given the lack of resources on the web for doing this, but it will give you the greatest control over your sorting. Here is an example of a custom Comparator used to sort by a string value alphabetically that may be a good place for you to start.
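For the boost approach, the query string can be generated from the desired order instead of being written by hand. The helper below is a hypothetical sketch, not part of Lucene.Net; it only builds the query text, giving the first value the highest boost and leaving the last one at the default boost of 1.

```python
def build_boosted_query(field, values):
    """Build an OR query whose boosts decrease in the order of `values`."""
    count = len(values)
    clauses = []
    for position, value in enumerate(values):
        boost = count - position
        # A boost of 1 is the default, so it is left out of the clause.
        clause = f"{field}:{value}" if boost == 1 else f"{field}:{value}^{boost}"
        clauses.append(clause)
    return "(" + " OR ".join(clauses) + ")"


print(build_boosted_query("PropertyID", [1, 25, 5, 10, 15]))
# (PropertyID:1^5 OR PropertyID:25^4 OR PropertyID:5^3 OR PropertyID:10^2 OR PropertyID:15)
```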

Random noise in Solr score

I am looking for a way of introducing random noise into my scoring function, and I'm at a loss on how to best proceed.
Some background:
We use Solr for a web application that manages large-ish sets of photos for agencies.
One customer has an interesting requirement for scoring:
'quality' field, maintained by editors, from 1 (highest) to 3 (lowest);
'date' field, boosting more recent photos; I would probably use a logarithmic function;
However, due to how the stock photo market works, this will likely result in many similar photos appearing together.
Their request is to give 'quality' a large boost, but introduce some randomness so that photos will not appear in a strict date order.
Any ideas?
EDITED: a key requirement is to have "stable" query results: if I search twice for "tropical island" I can get a slightly different result set, but if I ask for the first page, then the second, then the first, I'd better get the same results :)
You could do this with FunctionQueries. For each photo add a field with a random number close to 1 (e.g. 0.99, 1.02) and use it in a product function query to alter the "natural" score.
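The effect of such a per-photo factor can be simulated outside Solr. The sketch below is an illustration only: it derives a fixed factor close to 1 from a hypothetical numeric photo id (the value that would be stored in the extra field at indexing time) and multiplies the "natural" score by it. The function names and the 0.02 spread are assumptions.

```python
import random


def noise_factor(photo_id, spread=0.02):
    # Seeding with the photo id keeps the factor fixed for a given photo,
    # so repeated queries see the same noise.
    rng = random.Random(photo_id)
    return rng.uniform(1.0 - spread, 1.0 + spread)


def noisy_score(score, photo_id):
    # Product of the natural score and the stored random factor.
    return score * noise_factor(photo_id)


for photo_id in (101, 102, 103):
    print(photo_id, round(noisy_score(1.5, photo_id), 4))
```

Because the factor is fixed per photo, the ordering is shuffled slightly but stays the same from one request to the next.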
Turns out my first approach to solving the problem was the correct one, and I had a trivial implementation bug. In case it helps others:
RandomSortField does have the characteristics I need (that is, returning repeatable results for the same query).
Leaving aside the FunctionQuery for a moment, even something trivial like:
sort=quality_i asc, date_d desc, random_12345 desc
will approximate my requirements.
However, when using the Sunspot ruby gem, there's no way of passing the seed, and that's what was tricking me earlier: I ended up using a different seed each time, thus getting "true" random results.
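One way to keep the seed fixed across pages of the same search is to derive it from something that stays constant while the user pages through the results. The sketch below only builds the sort parameter; it assumes a `random_*` dynamic field backed by RandomSortField, as in the sort shown above, and the session id and helper names are hypothetical.

```python
import zlib


def stable_seed(session_id, query):
    # crc32 is deterministic across processes, unlike the built-in hash().
    return zlib.crc32(f"{session_id}:{query}".encode("utf-8"))


def build_sort(seed):
    # The seed is part of the field name, so the same seed repeats the same order.
    return f"quality_i asc, date_d desc, random_{seed} desc"


seed = stable_seed("session-42", "tropical island")
print(build_sort(seed))
```

Page 1, page 2 and page 1 again would all send the same sort string, while a new session gets a different seed and therefore a slightly different order.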

Resources