I have an index with 500 million documents. Each document is essentially a "keyword": a string of letters and digits (no spaces or punctuation). The strings are about 10 characters long on average and range from 3 to 40 characters.
I want to be able to swiftly find documents where the keyword field contains a certain substring.
I read that "wildcard" search (*abc*) with a leading wildcard is slow and does not scale.
I have now focused on n-grams. Ideally, I figure I should set min_gram and max_gram to 3 and 40. But if I set both to 3 and use minimum_should_match: 100% on the query, I can get a good result (without adding tons of extra storage for n-grams of sizes 4 to 40). The drawback seems to be that I get some unwanted results, such as a search for "dabc" also matching "abcd".
My question is: what is the best way to reach my goal in terms of performance and storage?
Am I trying to reinvent the wheel? Should I just go with n-grams at min_gram: 3 and max_gram: 40?
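For concreteness, here is a minimal sketch of that trigram-only setup in Elasticsearch JSON (console-style; the index and field names are invented and a recent ES version is assumed):

    // "keywords" and "keyword_field" are hypothetical names
    PUT keywords
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "trigram_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 3 }
          },
          "analyzer": {
            "trigram_analyzer": {
              "type": "custom",
              "tokenizer": "trigram_tokenizer",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "keyword_field": {
            "type": "keyword",
            "fields": {
              "trigrams": { "type": "text", "analyzer": "trigram_analyzer" }
            }
          }
        }
      }
    }

    GET keywords/_search
    {
      "query": {
        "match": {
          "keyword_field.trigrams": {
            "query": "dabc",
            "minimum_should_match": "100%"
          }
        }
      }
    }

Because the grams carry no ordering information between them, a document that happens to contain all of the query's trigrams can match even when the exact substring is absent, which is the kind of unwanted hit described above.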
You can try indexing the string with several different analysis strategies: use n-grams to filter out documents that definitely are not part of what you are looking for, then use wildcards on the remaining ones. Your n-gram filter will return some false positives, but that is OK because your wildcard filter will fix that. You are trading off space versus performance here: smaller n-grams mean more false positives (but less space used) and more work for your wildcard filter.
I'd suggest experimenting with a few approaches before drawing any conclusions on performance and size.
Instead of a wildcard you could also try a regexp query. This might be a bit cheaper to run than a wildcard query, and you can combine it with the n-gram filter approach.
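A rough sketch of that two-phase idea, reusing the hypothetical mapping from the earlier snippet: the trigram clause narrows the candidate set and the wildcard clause verifies the exact substring (both run in filter context, so no scoring work is done):

    GET keywords/_search
    {
      "query": {
        "bool": {
          "filter": [
            {
              "match": {
                "keyword_field.trigrams": {
                  "query": "dabc",
                  "minimum_should_match": "100%"
                }
              }
            },
            {
              "wildcard": {
                "keyword_field": { "value": "*dabc*" }
              }
            }
          ]
        }
      }
    }

Swapping the wildcard clause for a regexp clause such as { "regexp": { "keyword_field": ".*dabc.*" } } gives the variation mentioned above.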
Related
I need to implement search by substring. It is supposed to work like "Ctrl + F", which highlights a word when the search string is a substring of it.
The search is going to be performed on two fields only:
Name - no more than 255 chars
Id - no more than 200 chars
However, the number of records is going to be pretty large, about a million.
So far I'm using a query_string search with the keywords wrapped in wildcards, but that will definitely lead to performance problems later on once the number of records starts growing.
Do you have any suggestions for a more performant solution?
Searching with leading wildcards is going to be extremely slow on a large index:
Avoid beginning patterns with * or ?. This can increase the iterations
needed to find matching terms and slow search performance.
As written in the documentation, wildcard queries are very slow.
It is better to use an n-gram strategy if you want queries to be fast. For partial matches, word prefixes, or any substring match, the n-gram tokenizer is the better choice and will improve full-text search.
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
Please go through this SO answer, which includes a working example of a partial match using n-grams.
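In the same spirit (the linked answer is not reproduced here), a minimal sketch of such a mapping for the Name field, with invented analyzer names, an assumed max_gram of 10, and ES 7+ syntax; substrings longer than max_gram will not match this way:

    PUT records
    {
      "settings": {
        "index": { "max_ngram_diff": 7 },
        "analysis": {
          "tokenizer": {
            "substring_grams": { "type": "ngram", "min_gram": 3, "max_gram": 10 }
          },
          "analyzer": {
            "substring_index": {
              "type": "custom",
              "tokenizer": "substring_grams",
              "filter": ["lowercase"]
            },
            "substring_search": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "Name": {
            "type": "text",
            "analyzer": "substring_index",
            "search_analyzer": "substring_search"
          }
        }
      }
    }

    // the query is kept as one lowercased token and matched against the indexed grams
    GET records/_search
    {
      "query": {
        "match": { "Name": "unker" }
      }
    }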
In Elasticsearch 5.6.5 I'm searching against a field with the following filter applied:
"filter_shingle":{
"max_shingle_size":"4",
"min_shingle_size":"2",
"output_unigrams":"true",
"type":"shingle"
}
When I perform a search for depreciation tax against a document with that exact text, I see the following explanation of the score:
weight(Synonym(content:depreciation content:depreciation tax)) .... [7.65]
weight(content:tax) ... [6.02]
If I change the search to depreciation taffy against the exact same document with depreciation tax in the content I get this explanation:
weight(Synonym(content:depreciation content:depreciation taffy)) .... [7.64]
This is not what I expected. I thought a match on the bigram token for depreciation tax would get a much higher score than a match on the unigram depreciation. However, this scoring seems to reflect a simple unigram match. There is an extremely small difference, and digging further, this is because termFreq=28 under the depreciation taffy match and termFreq=29 under the depreciation tax match. I'm also not sure how this relates, as I imagine that across the shard holding this document there are very different counts for depreciation, depreciation tax, and depreciation taffy.
Is this expected behavior? Is ES treating all the different sized shingles, including unigrams, with the same IDF value? Do I need to split out each shingle size into different sub fields with different analyzers to get the behavior I expect?
TL;DR
Shingles and Synonyms are broken in Elastic/Lucene and a lot of hacks need to be applied until a fix is released (accurate as of ES 6).
Put unigrams, bigrams, and so on in individual subfields and search them separately, combining the scores for an overall match. Don't use a single shingle filter that emits multiple n-gram sizes on one field.
Don't combine a synonym and shingle filter on the same field.
In my case I do a must match with synonyms on a unigram field, then a series of should matches to boost the score on shingles of each size, without synonyms
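Roughly, that query shape looks like the sketch below (the content.bigrams and content.trigrams subfields are hypothetical, each analyzed with a fixed-size shingle filter and no synonyms):

    GET my_index/_search
    {
      "query": {
        "bool": {
          "must": [
            // unigram field, synonyms applied here only
            { "match": { "content": "depreciation tax" } }
          ],
          "should": [
            // fixed-size shingle subfields, used purely to boost the score
            { "match": { "content.bigrams": "depreciation tax" } },
            { "match": { "content.trigrams": "depreciation tax" } }
          ]
        }
      }
    }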
Details
I got an answer on the elastic support forums:
https://discuss.elastic.co/t/does-elasticsearch-score-different-length-shingles-with-the-same-idf/126653/2
Yep, this is mostly expected.
It's not really the shingles causing the scoring oddness, but the fact
that SynonymQueries do the frequency blending behavior that you're
seeing. They use frequency of the original token for all the
subsequent 'synonym' tokens, as a way to help prevent skewing the
score results. Synonyms are often relatively rare, and would
drastically affect the scoring if they each used their individual
df's.
From the Lucene docs:
For scoring purposes, this query tries to score the terms as if you
had indexed them as one term: it will match any of the terms but only
invoke the similarity a single time, scoring the sum of all term
frequencies for the document.
The SynonymQuery also sets the docFrequency to the maximum
docFrequency of the terms in the document. So for example, if:
"deprecation"df == 5 "deprecation tax"df == 2, "deprecation taffy"df
== 1, it will use 5 as the docFrequency for scoring purposes.
The bigger issue is that Lucene doesn't have a way to differentiate
shingles from synonyms... they both use tokens that overlap the
position of other tokens in the token stream. So if unigrams are mixed
with bi-(or larger)-grams, Lucene is tricked into thinking it's
actually a synonym situation.
The fix is to keep your unigrams and bi-plus-grams in different
fields. That way Lucene won't attempt to use SynonymQueries in these
situations, because the positions won't be overlapping anymore.
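A mapping sketch of that fix, with invented names and ES 7+ (typeless) syntax: the base content field keeps plain unigrams, while a bigrams subfield receives only shingles of size 2 (output_unigrams: false), so the two token streams never overlap positions:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "filter_bigram": {
              "type": "shingle",
              "min_shingle_size": 2,
              "max_shingle_size": 2,
              "output_unigrams": false
            }
          },
          "analyzer": {
            "analyzer_unigram": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase"]
            },
            "analyzer_bigram": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "filter_bigram"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "analyzer_unigram",
            "fields": {
              "bigrams": { "type": "text", "analyzer": "analyzer_bigram" }
            }
          }
        }
      }
    }

A trigram subfield would follow the same pattern with min_shingle_size and max_shingle_size both set to 3. Single-word queries simply produce no tokens on the shingle subfields, so they fall through to the unigram clause.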
Here's another related question that I asked which relates to how actual synonyms also get mangled when combined with shingles. https://discuss.elastic.co/t/es-5-4-synonyms-and-shingles-dont-seem-to-work-together/127552
Elastic/Lucene expands the synonym set, injects the synonyms into the token stream, and then creates shingles. E.g. query: econ supply and demand => econ, economics, supply, demand. Document: ... econ foo ... => econ, foo. Now we get the shingle "econ economics" from the query, and somehow this matches the document. I have no idea why, since I only applied synonyms to the query, not to the document, so I don't see the match. Also, the way the shingles are created from the query is wrong too.
This is a known problem, and it is still not fully resolved. A number
of Lucene filters can't consume graphs as their inputs.
There is currently active work being done on developing a fixed
shingle filter, and also an idea to have a sub-field for indexing
shingles.
The texts I query (and the queries themselves) have on average 11 words (up to about 25). I want my query to return matches only if at least half of the words in the query are matched in the text.
For example, this is how my initial Lucene query looks like (for simplicity it has only 4 words):
jakarta~ apache~ lucene~ stackoverflow~
It will return a match if at least one of the words is fuzzy matched but I want it to return a match only if at least any two (half of 4) of the words are fuzzy matched.
Is it possible in Lucene?
I could split my query like this (OR is default operator in Lucene):
(jakarta~ apache~) AND (lucene~ stackoverflow~)
But that wouldn't return a match if both jakarta and apache are matched but neither lucene nor stackoverflow is.
I could change my query to:
(jakarta~ AND apache~) (jakarta~ AND lucene~) (jakarta~ AND stackoverflow~)
(apache~ AND lucene~) (apache~ and stackoverflow~) (lucene~ AND stackoverflow~)
Would that be efficient? On average my expression would consist of 462 AND clauses (the binomial coefficient of 11 and 6), and in the worst case of 5,200,300 AND clauses (the binomial coefficient of 25 and 13).
If it is not possible (or doesn’t make sense performance wise) to do in Lucene, is it possible in Elasticsearch or Solr?
It should work fast (<= 0.5 sec per search) for at least 10,000 texts in the database.
It would be even better if I could easily change the minimum match percentage later (e.g. 40% instead of 50%), but I may not need this.
All three options support a minimum should match functionality among optional query clauses.
Lucene: set on a BooleanQuery via the BooleanQuery.Builder.setMinimumNumberShouldMatch method.
Solr: The DisMax mm parameter.
Elasticsearch: The minimum_should_match parameter, in Bool queries, Multi Match queries, etc.
In Solr, you can use the minimum match (mm) parameter with DisMax and eDisMax, and you can specify the percentage of matches expected.
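For the Elasticsearch route specifically, a minimal sketch (index and field names are invented) combining fuzzy matching with minimum_should_match; the percentage can later be changed to 40% without reindexing:

    GET texts/_search
    {
      "query": {
        "match": {
          "body": {
            "query": "jakarta apache lucene stackoverflow",
            "fuzziness": "AUTO",
            "minimum_should_match": "50%"
          }
        }
      }
    }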
I am currently using Lucene to search a large amount of documents.
Most commonly it is being searched on the name of the object in the document.
I am using the StandardAnalyzer with a null list of stop words. This means words like 'and' will be searchable.
The search term looks like this: (+keys:bunker +keys:s*)(keys:0x000bunkers*)
The 0x000 is a prefix to make sure that it comes higher up the list of results.
The 'keys' field also contains other information, like the postcode, so the document must match at least one of those clauses.
Now, with the background done, on to the main problem.
For some reason, when I search a term with a single character, whether it is just 's' or bunker 's', it takes around 1.7 seconds, compared to, say, 'bunk', which takes less than 0.5 seconds.
I have sorting; I have tried it with and without that, no difference. I have also tried it with and without the prefix.
Just wondering if anyone else has come across anything like this, or will have any inkling of why it would do this.
Thank you.
The most commonly used terms in your index will be the slowest terms to search on.
You're using StandardAnalyzer which does not remove any stop words. Further, it splits words on punctuation, so John's is indexed as two terms John and s. These splits are likely creating a lot of occurrences of s in your index.
The more occurrences of a term in your index, the more work Lucene has to do at search-time. A term like bunk likely occurs much less in your index by orders of magnitude, thus it requires a lot less work to process at search-time.
I want to understand the implications of using a large setting for max_gram with the ngram tokenizer. I know it will explode the size of the index, but then what? Will it make searches slower? Will it cause things to error out? Etc.
It'll make searches slower for sure, because lots of tokens will be generated for comparison.
In general, you should analyze your use case and find out what n-gram sizes are suitable for your field.
For example, for a product ID you could support n-gram search for up to 20 characters (max_gram=20), because usually people only remember 5 or 6 characters of a product ID, so 20 is good enough.
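For illustration, a sketch of such a capped configuration (names are invented; on ES 7+ the index.max_ngram_diff setting has to be raised to allow the 3-to-20 spread, and you would normally pair this with a non-ngram search_analyzer as in the earlier sketches):

    PUT products
    {
      "settings": {
        "index": { "max_ngram_diff": 17 },
        "analysis": {
          "tokenizer": {
            "product_id_grams": { "type": "ngram", "min_gram": 3, "max_gram": 20 }
          },
          "analyzer": {
            "product_id_analyzer": {
              "type": "custom",
              "tokenizer": "product_id_grams",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "product_id": { "type": "text", "analyzer": "product_id_analyzer" }
        }
      }
    }

Each additional gram size adds roughly one more token per character of the indexed value, which is where both the index growth and the extra comparison work mentioned above come from.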