I need to implement search by substring. It is supposed to work the same way as Ctrl+F: a record should match, and the term be highlighted, if the search term is a substring of the field value.
The search will be performed on two fields only:
Name - no more than 255 chars
Id - no more than 200 chars
However, the number of records is going to be pretty large, around a million.
So far I'm using a query_string search with the keywords wrapped in wildcards, but that will definitely lead to performance problems once the number of records starts growing.
Do you have any suggestions for a more performant solution?
Searching with leading wildcards is going to be extremely slow on a large index. The documentation explicitly warns against it:
Avoid beginning patterns with * or ?. This can increase the iterations needed to find matching terms and slow search performance.
In short, wildcard queries are very slow. It is better to use an n-gram strategy if you want searches to be fast at query time: for partial matches, word prefixes, or arbitrary substring matches, the n-gram tokenizer moves the work to index time and makes the full-text search much faster.
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
Please go through this SO answer, which includes a working example of partial matching with n-grams.
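For a concrete starting point, here is a minimal sketch of that approach (the index, analyzer, and field names are my own, it assumes Elasticsearch 7+ without mapping types, and min_gram/max_gram should be tuned to your data). The same trigram analyzer is applied at index and query time, and minimum_should_match of 100% requires every gram of the search term to be present:

```
# Sketch only: trigram analysis for substring search; all names are invented.
PUT /records
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "tokenizer": "trigram_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "trigram_analyzer" },
      "id":   { "type": "text", "analyzer": "trigram_analyzer" }
    }
  }
}

# Substring search across both fields.
GET /records/_search
{
  "query": {
    "multi_match": {
      "query": "unker",
      "fields": [ "name", "id" ],
      "minimum_should_match": "100%"
    }
  }
}
```

One caveat: search terms shorter than min_gram produce no tokens at all and therefore match nothing, so keep min_gram no larger than the shortest query you need to support.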
Related
I have an index with 500 million documents. Each document is essentially a "keyword": a string of letters and digits (no spaces or punctuation). The strings are 10 characters long on average, and between 3 and 40 characters long.
I want to be able to swiftly find documents where the keyword field contains a certain substring.
I read that "wildcard" search (*abc*) is slow and not scalable, especially with a leading wildcard.
I have now focused on n-grams. Ideally, I figure I should set "min" and "max" to 3 and 40. But if I set both to 3 and use minimum_should_match: 100% on the query, I can get a good result (without adding tons of extra storage for n-grams of size 4 to 40). The drawback seems to be that I get some unwanted results, such as a search for "dabc" also matching "abcd".
My question is: how do I achieve this goal in the best possible way (performance and storage)?
Am I trying to reinvent the wheel? Should I just go with ngram min: 3 and max: 40?
You can try indexing the string with several different analysis strategies: use n-grams to filter out documents that definitely are not what you are looking for, then use wildcards on the remaining ones. Your n-gram filter will return some false positives, but that is OK because your wildcard filter will fix that. You are trading off space vs. performance here: smaller n-grams mean more false positives (but less space used) and more work for your wildcard filter.
I'd suggest experimenting with a few approaches before drawing any conclusions on performance and size.
Instead of a wildcard you could also try a regexp query. This might be a bit cheaper to run than a wildcard query, and you can combine it with the n-gram filter approach.
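To make that concrete, here is a hedged sketch of the combination. The field names are assumptions: keyword_ngrams would be a copy of the string analyzed with a small n-gram analyzer, and keyword.raw a keyword sub-field holding the original value. The n-gram match runs as a cheap, non-scoring filter, and the wildcard verifies the exact substring on whatever survives:

```
# Assumed fields: 'keyword_ngrams' (3-gram analyzed) and 'keyword.raw' (raw keyword).
GET /keywords/_search
{
  "query": {
    "bool": {
      "filter": {
        "match": {
          "keyword_ngrams": {
            "query": "dabc",
            "minimum_should_match": "100%"
          }
        }
      },
      "must": {
        "wildcard": {
          "keyword.raw": { "value": "*dabc*" }
        }
      }
    }
  }
}
```

Swapping the wildcard clause for a regexp query on the same sub-field gives you the regexp variant mentioned above.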
How do I include fuzziness in phrase matching? The Elasticsearch documentation mentions that fuzziness is not supported with phrase matching.
I have documents containing phrases, and I have a text body. I want to find the phrases that the text and the documents have in common, but I need to match phrases that might be spelled wrong.
There are some ways to do this:
Remove the whitespace and index the whole phrase as one token (I think there is a filter for that in Elasticsearch). In your query you would have to do the same.
There is a tokenizer whose name I forget (maybe someone can help out here?) that lets you index more than one word together as a single token. If your phrases have a common maximum length, like 5 words or so, this could do the trick.
Beware that fuzziness only works up to a maximum edit distance of 2, so if you have a very long phrase, a distance of 2 might not be enough and you will have to split it.
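As a rough sketch of the first two points combined (the names are mine, and the shingle token filter is my guess at the tokenizer whose name is forgotten above): shingle two to three words together, drop the separator so each phrase becomes a single token, and run a fuzzy match against it:

```
# Sketch only: glue 2-3 word shingles into single tokens so fuzzy matching can
# compare whole phrases. All names are invented.
PUT /phrases
{
  "settings": {
    "analysis": {
      "filter": {
        "phrase_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true,
          "token_separator": ""
        }
      },
      "analyzer": {
        "phrase_fuzzy": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "phrase_shingles" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "phrase_fuzzy" }
    }
  }
}

# A misspelled phrase can still match via the fused shingle token.
GET /phrases/_search
{
  "query": {
    "match": {
      "body": {
        "query": "fire truk",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

Keep the edit-distance cap of 2 in mind: the longer the fused phrase token, the less a distance of 2 can absorb.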
I've been working with ElasticSearch within an existing code base for a few days, so I expect that the answer is easy once I know what I'm doing. I want to extend a search to yield the same results when I search with a compound word, like "eyewitness", or its component words separated by a whitespace, like "eye witness".
For example, I have a catalog of toy cars that includes both "firetruck" toys and "fire truck" toys. I would like to ensure that if someone searched on either of these terms, the results would include both the "firetruck" and the "fire truck" entries.
I attempted to do this at first with the "fuzziness" of a match, hoping that "fire truck" would be considered one transform away from "firetruck", but that does not work: ES fuzziness is per-word and will not add or remove whitespace characters as a valid transformation.
I know that I could do some brute-forcing before generating the query by trying to come up with additional search terms by breaking big words into smaller words and also joining smaller words into bigger words and checking all of them against a dictionary, but that falls apart pretty quickly when "fuzziness" and proper names are part of the task.
It seems like this is exactly the kind of thing that ES should do well, and that I simply don't have the right vocabulary yet for searching for the solution.
Thanks, everyone.
There are two things you could do:
You could split compound words into their parts, i.e. firetruck would be split into the two tokens fire and truck; see here.
You could use n-grams, i.e. with 4-grams the original firetruck gets split into the tokens fire, iret, retr, etru, truc, ruck. At query time, the scoring function helps you end up with pretty decent results. Check out this.
Always remember to apply the same tokenization on both the index side and the query side.
I would start with the n-grams, and if that is not good enough, go with the compounds and split them yourself - but that's a lot of work, depending on the vocabulary you have under consideration.
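For the n-gram option from point 2, here is a rough sketch (index and analyzer names are made up, 4-grams as in the example). Because firetruck and fire truck share most of their grams, a query in either form scores both documents:

```
# Sketch only: 4-gram token filter on top of the standard tokenizer.
PUT /toys
{
  "settings": {
    "analysis": {
      "filter": {
        "grams_4": { "type": "ngram", "min_gram": 4, "max_gram": 4 }
      },
      "analyzer": {
        "gram_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "grams_4" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "gram_analyzer" }
    }
  }
}

# Either "fire truck" or "firetruck" as the query text matches both entries.
GET /toys/_search
{
  "query": {
    "match": { "name": "fire truck" }
  }
}
```

Note that words shorter than min_gram emit no grams with a filter of size 4, so very short words would need a smaller min_gram (or the preserve_original option, if your version supports it).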
hope the concepts and the links help, fricke
I'm looking for a filter in Elasticsearch that will let me break English compound words into their constituent parts, so that, for a term like eyewitness, both eye witness and eyewitness as queries would match it. I noticed the compound word filter, but this requires explicitly defining a word list, which I couldn't possibly come up with on my own.
First, you need to ask yourself if you really need to break the compound words. Consider a simpler approach like using "edge n-grams" to hit in the leading or trailing edges. It would have the side effect of loosely hitting on fragments like "ey", but maybe that would be acceptable for your situation.
If you do need to break the compounds, and want to explicitly index the word fragments, then you'll need to get a word list. You can download a list of English words; one example is here. The dictionary word list is used to know which fragments of the compound words are actually words themselves. This will add overhead to your indexing, so be sure to test it. An example showing the usage is here.
If your text is German, consider https://github.com/jprante/elasticsearch-analysis-decompound
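For English, a rough sketch of the dictionary approach (the inline word_list is just a stand-in; in practice you would point word_list_path at the downloaded word file):

```
# Sketch only: the inline word_list stands in for a full dictionary file.
PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "english_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [ "eye", "witness", "fire", "truck" ]
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "english_decompounder" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "decompound_analyzer" }
    }
  }
}
```

The decompounder keeps the original token and adds the dictionary fragments it finds, so eyewitness is indexed as eyewitness, eye, and witness, and a query for eye witness matches it as well.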
I am currently using Lucene to search a large amount of documents.
Most commonly it is being searched on the name of the object in the document.
I am using the StandardAnalyzer with a null list of stop words. This means words like 'and' will be searchable.
The search term looks like this: (+keys:bunker +keys:s*)(keys:0x000bunkers*)
The 0x000 is a prefix to make sure that it comes higher up the list of results.
The 'keys' field also contains other information, like the postcode, so the query must match at least one of those clauses.
Now, with the background done, on to the main problem.
For some reason, when I search for a term with a single character, whether it is just 's' or bunker 's', it takes around 1.7 seconds, compared to, say, 'bunk', which takes less than 0.5 seconds.
I have sorting; I have tried the search with and without it, and it makes no difference. I have also tried it with and without the prefix.
Just wondering if anyone else has come across anything like this, or has any inkling of why it behaves this way.
Thank you.
The most commonly used terms in your index will be the slowest terms to search on.
You're using StandardAnalyzer which does not remove any stop words. Further, it splits words on punctuation, so John's is indexed as two terms John and s. These splits are likely creating a lot of occurrences of s in your index.
The more occurrences of a term in your index, the more work Lucene has to do at search time. A term like bunk likely occurs orders of magnitude less often in your index, so it requires far less work to process at search time.
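If you want to confirm what your analyzer actually emits for strings like these, run a sample through it: in plain Lucene you can iterate the Analyzer's TokenStream, and in Elasticsearch, which sits on the same Lucene analysis chain, the _analyze API does it in one call. For example:

```
# Shown in Elasticsearch syntax for brevity; inspect the emitted terms to see
# how often single-character tokens such as 's' would end up in your index.
GET /_analyze
{
  "analyzer": "standard",
  "text": "John's bunkers"
}
```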