The texts I query for (and the queries themselves) have on average 11 words (up to about 25). I want my query to return matches only if at least half of the words in the query are matched in the text.
For example, this is what my initial Lucene query looks like (for simplicity it has only 4 words):
jakarta~ apache~ lucene~ stackoverflow~
It will return a match if at least one of the words is fuzzy matched, but I want it to return a match only if at least two (half of 4) of the words are fuzzy matched.
Is it possible in Lucene?
I could split my query like this (OR is default operator in Lucene):
(jakarta~ apache~) AND (lucene~ stackoverflow~)
But that wouldn’t return a match if both jakarta and apache are matched but neither lucene nor stackoverflow is matched.
I could change my query to:
(jakarta~ AND apache~) (jakarta~ AND lucene~) (jakarta~ AND stackoverflow~)
(apache~ AND lucene~) (apache~ AND stackoverflow~) (lucene~ AND stackoverflow~)
Would that be efficient? On average my expression would consist of 462 AND clauses (binomial coefficient of 11 and 6), in the worst case of 5200300 AND clauses (binomial coefficient of 25 and 13).
If it is not possible (or doesn’t make sense performance wise) to do in Lucene, is it possible in Elasticsearch or Solr?
It should work fast (<= 0.5 sec/search) for at least 10 000 texts in database.
It would be even better if I could easily later change the minimum matches percentage (e.g. 40% instead of 50%) but I may not need this.
All three options support a minimum should match functionality among optional query clauses.
Lucene: Set on BooleanQueries via the BooleanQuery.Builder.setMinimumNumberShouldMatch method.
Solr: The DisMax mm parameter.
Elasticsearch: The minimum_should_match parameter, in Bool queries, Multi Match queries, etc.
In Solr, you can use minimum match (mm) parameter with DisMax and eDisMax and you can specify the percentage of the match expected.
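The counting that minimum should match performs can be sketched in a few lines of Python. This is purely illustrative (plain set membership stands in for fuzzy matching, and it is not how Lucene implements the check internally), but it shows why the percentage form also answers the "easily change to 40% later" requirement:

```python
import math

def matches(query_terms, doc_terms, min_should_match_pct=50):
    """True if at least min_should_match_pct of the query terms
    appear in the document (exact matching stands in for fuzzy here)."""
    required = math.ceil(len(query_terms) * min_should_match_pct / 100)
    hits = sum(1 for t in query_terms if t in doc_terms)
    return hits >= required

query = ["jakarta", "apache", "lucene", "stackoverflow"]
doc = {"jakarta", "apache", "tomcat"}
print(matches(query, doc))      # 2 of 4 matched, needs 2 -> True
print(matches(query, doc, 75))  # needs 3 of 4 -> False
```

Because the engine evaluates this as a single threshold over optional clauses, no combinatorial expansion of AND clauses is needed.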
Related
I have gone through [Theory behind relevance scoring][1] and have two related questions:
Q1: The IDF formula is idf(t) = 1 + log ( numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index. Does this mean that each time a new document is added to the index, we need to re-calculate the IDF for each word across all existing documents in the index?
Q2: The link makes the statement below. My question is: is there any reason why the TF/IDF score is calculated against each field instead of the complete document?
When we refer to documents in the preceding formulae, we are actually
talking about a field within a document. Each field has its own
inverted index and thus, for TF/IDF purposes, the value of the field
is the value of the document.
You only calculate the score at query time, not at insert time. Lucene keeps the right statistics to make this a fast calculation, and the values are always fresh.
Term frequency only really makes sense against a single field, since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one: then we're only interested in the frequency in that one. When searching multiple fields you still want control over the individual fields (such as boosting "title" over "body"), or you want to define how to combine them. If you have a use case where this doesn't make sense (I don't have a good example at hand; it's in my opinion far less common), you could combine multiple fields into one with copy_to and search on that.
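To make the answer to Q1 concrete: numDocs and docFreq are simple counters that the index already maintains, so the IDF is just a cheap arithmetic lookup at query time; nothing stored per document is ever recomputed on insert. A sketch of the classic formula from the question:

```python
import math

def idf(num_docs, doc_freq):
    # Classic similarity: idf(t) = 1 + log(numDocs / (docFreq + 1))
    return 1 + math.log(num_docs / (doc_freq + 1))

# Evaluated fresh at query time from the current counters:
print(idf(1000, 9))   # rare term (in 9 docs) -> higher idf
print(idf(1000, 99))  # common term (in 99 docs) -> lower idf
```

Adding a document just bumps the counters; the next query automatically sees the updated IDF.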
I have an index with 500 million documents. Each document is essentially a "keyword" / string of letters and digits (no spaces or punctuation). The strings are on average 10 letters and between 3 and 40 characters long.
I want to be able to swiftly find documents where the keyword field contains a certain substring.
I read that "wildcard" search (*abc*) is slow and does not scale, especially with a leading wildcard.
I have now focused on n-grams. Ideally I figure that I should set "min_gram" and "max_gram" to 3 and 40. But if I set both to 3 and use minimum_should_match: 100% on the query, I can get a good result (without adding the tons of extra storage for ngram sizes 4 to 40). The drawback seems to be that I get some unwanted results, such as a search for "dabc" also matching "abcd".
My question is, how to solve my goal in the best possible way (performance and storage).
Am I trying to reinvent the wheel? Should I just go with ngram min: 3 and max: 40?
You can try indexing the string with several different analysis strategies and then use ngrams to filter out documents that definitely are not part of what you are looking for and then use wildcards for the remaining ones. Your ngram filter will return some false positives but that is OK because your wildcard filter will fix that. You are trading off space vs. performance here. Smaller ngrams means more false positives (but less space used) and more work for your wildcard filter.
I'd suggest experimenting with a few approaches before drawing any conclusions on performance and size.
Instead of a wildcard you could also try a regexp query. This might be a bit cheaper to run than wildcard queries and you can combine it with the ngrams filter approach.
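The two-stage approach above can be sketched in Python. The data here is made up for illustration, and real ngram analysis happens inside the engine, but it shows the division of labour: the trigram filter is cheap and admits false positives; the wildcard-style substring check is exact and only runs on the survivors:

```python
def ngrams(s, n=3):
    """All character n-grams of s (what a 3-gram analyzer would index)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_candidate(keyword, query, n=3):
    """Cheap pre-filter: minimum_should_match 100% means every
    n-gram of the query must occur somewhere in the keyword."""
    return ngrams(query, n) <= ngrams(keyword, n)

def substring_match(keyword, query):
    """The exact 'wildcard' verification step (*query*)."""
    return query in keyword

# A false positive that the verification step removes (hypothetical data):
kw, q = "abcdab", "dabc"
print(ngram_candidate(kw, q))   # True: trigrams {dab, abc} all present
print(substring_match(kw, q))   # False: "dabc" is not a substring
```

Smaller ngrams mean more candidates survive the first stage, trading storage for extra work in the second stage, exactly the trade-off described above.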
I'm using Elasticsearch 5.3.1 and I'm evaluating BM25 and Classic TF/IDF.
I came across the discount_overlaps property which is optional.
Determines whether overlap tokens (Tokens with 0 position increment)
are ignored when computing norm. By default this is true, meaning
overlap tokens do not count when computing norms.
Can someone explain what the above means with an example if possible.
First off, the norm is calculated as boost / √length, and this value is stored at index time. This causes matches on shorter fields to get a higher score (because 1 in 10 is generally a better match than 1 in 1000).
For an example, let's say we have a synonym filter on our analyzer that is going to index a bunch of synonyms in the indexed form of our field. Then we index this text:
The man threw a frisbee
Once the analyzer adds all the synonyms to the field, it contains 12 tokens: the original 5 words plus 7 synonyms stacked at the same positions (position increment 0) as the words they expand.
Now when we search for "The dude pitched a disc", we'll get a match.
The question is, for the purposes the norm calculation above, what is the length?
if discount_overlaps = false, then length = 12
if discount_overlaps = true, then length = 5
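Plugging those two lengths into the norm formula shows the practical effect. A minimal sketch (using the boost / √length formula from above; the 5-vs-12 token counts come from the synonym example):

```python
import math

def field_norm(num_tokens, boost=1.0):
    """Index-time norm: boost / sqrt(field length in tokens)."""
    return boost / math.sqrt(num_tokens)

# "The man threw a frisbee" is 5 tokens; the synonym filter stacks
# 7 more overlap tokens (position increment 0) on top of them.
print(field_norm(12))  # discount_overlaps = false: synonyms inflate length
print(field_norm(5))   # discount_overlaps = true: only real tokens count
```

With discount_overlaps = false the field looks longer than it really is, so matches on it score lower; that is why ignoring overlap tokens is the default.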
I'm trying to use Elasticsearch to do a fuzzy query for strings. According to this link (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/common-options.html#fuzziness), the maximum fuzziness allowed is 2, so the query will only return results that are at most 2 edits away by Levenshtein edit distance. The site says that the Fuzzy Like This Query supports fuzzy searches with a fuzziness greater than 2, but so far it has only let me find results within two edits of the search term. Is there any workaround for this constraint?
It looks like this was a bug which was fixed quite a while back. Which Elasticsearch version are you using?
For context, the reason why Edit Distance is now limited to [0,1,2] for most Fuzzy operations has to do with a massive performance improvement of fuzzy/wildcard/regexp matching in Lucene 4 using Finite State Transducers.
Executing a fuzzy query via an FST requires knowing the desired edit distance when the Levenshtein automaton is constructed, and the precomputed tables that make that construction fast grow very quickly with the edit distance. It was likely capped at 2 to keep the size requirements manageable, but also possibly because, for many applications, an edit distance greater than 2 introduces a whole lot of noise.
The previous fuzzy query implementation required visiting each document to calculate edit distance at query-time and was impractical for large collections.
It sounds like Elasticsearch (1.x) is still using the original (non-performant) implementation for the FuzzyLikeThisQuery, which is why the edit-distance can increase beyond 2. However, FuzzyLikeThis has been deprecated as of 1.6 and won't be supported in 2.0.
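For reference, the edit distance being capped here is the classic Levenshtein measure: the minimum number of single-character insertions, deletions, and substitutions turning one string into the other. A standard dynamic-programming sketch (the example words are made up):

```python
def levenshtein(a, b):
    """Levenshtein edit distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("lucene", "lucine"))  # 1 -> within the fuzziness cap
print(levenshtein("lucene", "lucy"))    # 3 -> beyond the cap of 2
```

The old per-document query implementation ran this kind of computation against every candidate term; the FST approach avoids that, at the cost of fixing the distance up front.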
I'm using fuzzy matching in my project mainly to find misspellings and different spellings of the same names. I need to understand exactly how the fuzzy matching of Elasticsearch works and how it uses the two parameters mentioned in the title.
As I understand it, min_similarity is a percentage by which the queried string matches the string in the database. I couldn't find an exact description of how this value is calculated.
The max_expansions, as I understand it, is the Levenshtein distance by which a search should be executed. If this actually were the Levenshtein distance, it would have been the ideal solution for me. Anyway, it doesn't work that way.
For example, I have the word "Samvel":
queryStr    max_expansions    matches?
samvel      0                 error: "should not be 0" (but the Levenshtein distance can be 0!)
samvel      1                 Yes
samvvel     1                 Yes
samvvell    1                 Yes (but it shouldn't have)
samvelll    1                 Yes (but it shouldn't have)
saamvelll   1                 No (but for some weird reason it matches Samvelian)
saamvelll   >1                No
The documentation says something I actually do not understand:
Add max_expansions to the fuzzy query allowing to control the maximum number
of terms to match. Default to unbounded (or bounded by the max clause count in
boolean query).
So can anyone please explain to me how exactly these parameters affect the search results?
The min_similarity is a value between zero and one. From the Lucene docs:
For example, for a minimumSimilarity of 0.5 a term of the same length
as the query term is considered similar to the query term if the edit
distance between both terms is less than length(term)*0.5
The 'edit distance' that is referred to is the Levenshtein distance.
The way this query works internally is:
it finds all terms in the index that could match the search term, taking the min_similarity into account,
then it searches for all of those terms.
You can imagine how heavy this query could be!
To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.