Some of my search queries return a total of over 10k documents, with scores varying from high (~75 in my most recent search) to very low (less than 5). Other queries return a high score of ~20 and a low score of ~1.
Does anyone have a good solution for trimming off the less relevant documents? A Java or query implementation would work. I've thought about using min_score, but I'm wary of that since it has to be a constant number, and the scores in some of my responses are much closer together than the ones above. I suppose I could come up with some formula based on the returned scores to create a cutoff for every response, but I was curious whether anyone has come up with a solution for a similar use case?
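For reference, this is roughly what the fixed min_score approach I'm wary of looks like (the index and field names are just placeholders):

```json
POST /my-index/_search
{
  "min_score": 10,
  "query": {
    "match": { "title": "example search terms" }
  }
}
```

The problem is that a cutoff like 10 is reasonable for the first kind of response above but would throw away almost everything from the second, where scores only range from ~20 down to ~1.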
Based on this article - link - there are serious performance implications to having the track_total_hits property set to true.
We currently use it to get the number of matching documents after a user searches. The user can then use pagination to scroll through the results. The number of documents for such a search usually ranges from 10k to 5M.
Example of a user workflow:
The user performs a search which matches 150,000 documents.
We show them the first 200 results, which they can scroll through, but we also show them the total number of documents found by the search.
Since we always show the number of documents found, and those numbers can often be quite high, we need some way to get that count. I'm not sure, but since we almost always perform paginated searches, I would assume a lot of this would already be in memory? Maybe then this actually affects us less than what's shown in the article?
Some kind of approximation rather than an exact count would be OK for us if it improved performance.
Is there an option in Elasticsearch to get an approximate count on search requests?
There is no option to get an approximate count, but you may want to consider assigning track_total_hits a lower bound instead of true, which is a good compromise from a performance standpoint (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-your-data.html#track-total-hits).
That way, you can show users that there are at least k results - but there could be more.
Also, try using search_after (if you are not using it already) for pagination.
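For illustration, a request along these lines caps the total-hits counting at 10,000 and pages with search_after (the title, created_at and id fields are placeholder names; any indexed sort fields plus a unique tiebreaker will do):

```json
POST /my-index/_search
{
  "track_total_hits": 10000,
  "size": 200,
  "query": { "match": { "title": "user query" } },
  "sort": [
    { "created_at": "desc" },
    { "id": "asc" }
  ]
}
```

Once there are at least 10,000 matches, hits.total comes back as {"value": 10000, "relation": "gte"}, which you can display as "10,000+ results". For the next page, repeat the same request and add "search_after" with the sort values of the last hit from the previous page; no from parameter is needed.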
I've implemented a fuzzy search algorithm based on an N-closest-neighbors query for the given search terms. Each query returns a preset number of raw results, in my case a maximum of 200 hits per query, sorted descending by score, highest score first.
The raw search already produces good results, but in some rather rare cases not good enough, so I've added another post-processing layer - or, better said, another metric - on top of the raw search results, based on the Damerau-Levenshtein algorithm, which measures the word/phrase distance between the query term(s) and the raw results. The lower the resulting score the better; 0.0 would be an exact match.
Using the Damerau-Levenshtein post-processing score, I sort the results ascending, from lowest to highest.
The quality of the matches is amazingly good and all relevant hits are ranked at the top. Still, I have the bulk of 200 hits from the core search, and I am looking for a smart way to limit the final result set to a maximum of 10-20 hits. I could just add a static limit, which is basically what I do now, but I wonder if there is a better way to do this based on the individual metrics I get with each result set.
I have the following result metrics:
The result score of the fuzzy core search, a value of type float/double. The higher the better.
The Damerau-Levenshtein post-processing weight, another value of type float/double. The lower the better.
And finally, each result set knows its minimum and maximum score limits. Using the Damerau-Levenshtein post-processing algorithm on the raw results, I take the min/max values from there.
The only idea I have so far is to take a sub-range of the result set, something like the top 20% of results, which is simple to achieve. More interesting would be to analyse the top result scores/metrics and find some indication of where it gets too fuzzy. I could use the metrics I gather inside my Damerau-Levenshtein layer, namely the word- and phrase-distance parameters - these values, along with 2 other parameters, make up the final distance score. For example, if the word and/or phrase distance exceeds a certain threshold, skip the result. This way is a bit more complicated but possible.
Well, I wonder if there are more options I could use and just don't see. Once again, I would like to avoid a static limit and make the cutoff adapt to each individual result set.
Any hints or further ideas are greatly appreciated.
I'm using Elasticsearch 2.4 and would like to get distinct counts for various entities in my data. I've played around with a lot of queries, which include two ways of calculating distinct counts. One is a cardinality aggregation, and the other is a terms aggregation followed by getting distinct counts from the number of buckets. With the former approach I've seen the counts be erroneous and inaccurate, but it is faster and relatively simple. My data is huge and will grow over time, so I don't know how the cardinality aggregation will perform, or whether it will become more or less accurate. I wanted to get advice from people who have had this question before and hear which approach they chose.
The cardinality aggregation takes an additional parameter, precision_threshold:
The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.
It offers configurable precision, which decides how to trade memory for accuracy; excellent accuracy on low-cardinality sets; and fixed memory usage: whether there are tens or billions of unique values, memory usage only depends on the configured precision.
In short, cardinality can give you accurate counts up to a cardinality of at most 40000, after which it gives an approximate count. The higher the precision_threshold, the higher the memory cost and the higher the accuracy. For very high cardinalities, it can only give you an approximate count.
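For example, a request along these lines (user_id is a placeholder field name) asks for the maximum precision:

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 40000
      }
    }
  }
}
```

Keep in mind that a higher precision_threshold costs more memory per bucket, so be careful when nesting this under a high-cardinality terms aggregation.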
To add to what Rahul said: cardinality will give you an approximate count, yes, but if you set the precision threshold to its maximum value of 40000, it will give you accurate results up to 40000. Above that the error rate increases, but more importantly it never goes above 1%, even up to 10 million documents (source: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-metrics-cardinality-aggregation.html).
Also, looking at it from the user's perspective: if the count of 10 million documents - or for that matter even a million documents - is off by 1%, it will not make much of a difference and will go unnoticed. And when the user wants to look at the actual data, they will do a search anyway, which will return accurate results.
I'm looking for a setup that returns only the top 10% of results of a certain query. After that, we also want to sort this subset.
Is there an easy way to do this?
Can anyone provide a simple example for this?
I was thinking of scaling the result scores between 0 and 1.0 and basically specifying min_score as 0.9.
I was trying to create function_score queries, but those seem a bit complex for a simple requirement such as this one. Plus, I was not sure how sorting would affect the results, since I want the sort functions to always work on the 10% most relevant articles, of course.
Thanks,
Peter
Since you want to slice the response as a percentage of the overall document count, you need to know that count anyway. The from/size parameters will then cut off the required amount at query time.
Given that, it seems the easiest way to achieve your goal is to make two queries, as sketched below:
Run a filtered query with all of your filters, no queries, and search_type=count to get the overall document count.
Perform your regular matching query, applying {"from": 0, "size": count/10} with the count taken from the first response.
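A rough sketch of the two requests, with placeholder index/field names (on newer Elasticsearch versions, "size": 0 replaces search_type=count):

```json
POST /my-index/_search?search_type=count
{
  "query": { "match": { "title": "your query" } }
}

POST /my-index/_search
{
  "from": 0,
  "size": 200,
  "query": { "match": { "title": "your query" } }
}
```

Here 200 would be hits.total / 10 from the first response (e.g. 2,000 matches). Note that from + size is limited by index.max_result_window (10,000 by default), so for very large result sets you would need to raise that setting or take a different approach.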
As for tweaking the scoring: to me it seems like a bad idea, since getting multiple documents with the same score is a pretty common situation, so cutting the data set with min_score will probably result in skewed data.
I'm trying to use Elasticsearch to do a fuzzy query for strings. According to this link (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/common-options.html#fuzziness), the maximum fuzziness allowed is 2, so the query will only return results that are at most 2 edits away using the Levenshtein edit distance. The site says that the Fuzzy Like This query supports fuzzy searches with a fuzziness greater than 2, but so far using the Fuzzy Like This query has only let me find results within two edits of the search term. Is there any workaround for this constraint?
It looks like this was a bug which was fixed quite a while back. Which Elasticsearch version are you using?
For context, the reason why Edit Distance is now limited to [0,1,2] for most Fuzzy operations has to do with a massive performance improvement of fuzzy/wildcard/regexp matching in Lucene 4 using Finite State Transducers.
Executing a fuzzy query via an FST requires knowing the desired edit-distance at the time the transducer is constructed (at index-time). This was likely capped at an edit-distance of 2 to keep the FST size requirements manageable. But also possibly, because for many applications, an edit-distance of greater than 2 introduces a whole lot of noise.
The previous fuzzy query implementation required visiting each document to calculate edit distance at query-time and was impractical for large collections.
It sounds like Elasticsearch (1.x) is still using the original (non-performant) implementation for the FuzzyLikeThisQuery, which is why the edit-distance can increase beyond 2. However, FuzzyLikeThis has been deprecated as of 1.6 and won't be supported in 2.0.
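If you can live within that limit, the usual replacement is a standard match (or fuzzy) query with the fuzziness parameter; the name field and the query term below are placeholders. "AUTO" picks an edit distance of 0, 1 or 2 based on the length of the term:

```json
POST /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "exmaple",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

Anything that genuinely needs more than 2 edits is usually better handled at index time, e.g. with ngram analysis, rather than by trying to raise fuzziness.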