Elasticsearch - parts of string with correct word arrangement

I'm wondering how to properly query this scenario:
Field values:
20182199
20182188
20182177
Query-strings (that should match all three):
2018 -> hit
0182 -> fail
821 -> fail
The other requirement is that if more than one word is present in the query string, the whole query string must match as one unit, not each word separately.
That's why I chose a match_phrase_prefix query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html). It just doesn't cover hits on inner parts of a word, which is what I'm now looking for. :-)
I'd appreciate any help. Thank you!

I believe the Elasticsearch docs cover your use case specifically: you are looking to match what Elasticsearch refers to as ngrams.
Partial Matching - a quick introduction
Ngrams for Partial Matching - it's worth noting that Elasticsearch calls a sequence of characters an ngram and a sequence of tokens a shingle (a slight difference from the terminology you may be used to)
Wildcard and Regexp Queries - the same section on partial matching has notes on these queries, which might suffice for you and would not require you to reindex or change your analysis; the ngram route is sketched below
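To make this concrete, here is a minimal sketch of an ngram-based setup. The index name, field name, and gram sizes are illustrative assumptions, not from the question; with min_gram 3 and max_gram 4, inner substrings such as 0182 and 821 are indexed as terms of their own:

PUT /codes
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "code_ngrams",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Setting search_analyzer to standard keeps the query side from being ngrammed as well, so a query like 821 is looked up as a single term against the indexed grams rather than being split again.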

Related

How do I escape the word "And" in Elasticsearch if I want to search by the literal "And"?

I'm trying to search over an index that includes constellation code names, and the code name for the Andromeda constellation is And.
Unfortunately, if I search using And, all results are returned. This is the only one that doesn't work, across dozens of constellation code names, and I assume it's because it's interpreted as the logical operator AND.
(constellation:(And)) returns my entire result set, regardless of the value of constellation.
Is there a way to fix this without doing tricks like indexing with an underscore in front?
Thanks!
I went for a bit of a hack, indexing the constellation as __Foo__ and then changing my search query accordingly by adding the __ prefix and suffix to the selected constellation.
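As an alternative to the prefix/suffix hack: if the mapping can include an unanalyzed keyword sub-field (a common default mapping, but an assumption here), a term query bypasses both analysis and query-string parsing, so And is treated as a literal value rather than an operator or stop word. A sketch, with index and field names assumed:

GET /constellations/_search
{
  "query": {
    "term": {
      "constellation.keyword": "And"
    }
  }
}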

Matching two words as a single word

Consider that I have a document which has a field with the following content: 5W30 QUARTZ INEO MC 3 5L
A user wants to be able to search for MC3 (no space) and get the document; however, search for MC 3 (with spaces) should also work. Moreover, there can be documents that have the content without spaces and that should be found when querying with a space.
I tried indexing without spaces (e.g. 5W30QUARTZINEOMC35L), but that does not really work: with a wildcard search I would match too much, e.g. MC35 would also match, and I only want to match two exact words concatenated together (as well as an exact single word).
So far I'm thinking of additionally indexing all combinations of two words, e.g. 5W30QUARTZ, QUARTZINEO, INEOMC, MC3, 35L. However, does Elasticsearch have a native solution for this?
I'm pretty sure what you want can be done with the shingle token filter. Depending on your mapping, I would imagine you'd need to add a filter looking something like this to your content field to get your tokens indexed in pairs:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":2,
"min_shingle_size":2,
"output_unigrams":"true"
}
Note that this is also already the default configuration for the shingle filter; I just spelled it out for clarity.
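One detail worth flagging: by default the shingle filter joins adjacent tokens with a space, so MC and 3 would index as mc 3, not mc3. Setting token_separator to an empty string produces the concatenated form the question asks for. A sketch of the full analysis setup, with index and field names assumed for illustration:

PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "filter_shingle"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "shingle_analyzer"
      }
    }
  }
}

With this in place, 5W30 QUARTZ INEO MC 3 5L indexes the single words plus the pairs 5w30quartz, quartzineo, ineomc, mc3, 35l, which is exactly the combination list from the question.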

How do I analyze text that doesn't have a separator (eg a domain name)?

I have a bunch of domain names (without the TLD) that I'd like to search, but they don't always have a natural break between words (like a "-"). For instance:
techtarget
americanexpress
theamericanexpress // a non-existent site
thefacebook
What is the best analyzer to use? E.g. if a user types in "american ex" I'd like to prioritize "americanexpress" over "theamericanexpress". A simple prefix query would work in this particular case, but then a user types in "facebook" and that doesn't return anything. ;(
In most cases, including yours, the Standard Analyzer is sufficient. It is also the default analyzer in Elasticsearch, and it provides grammar-based tokenization. For example:
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." will be tokenized into [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
In your case, the domain names are tokenized into the terms [techtarget, americanexpress, theamericanexpress, thefacebook].
Why does a search for facebook not return anything?
Because there is no facebook term stored in the dictionary, the search returns no data. What's going on is that Elasticsearch tries to find the search term facebook in the dictionary, but the dictionary only contains thefacebook, and hence the search returns no results.
Solution:
In order to match the search term facebook with thefacebook, you need a wildcard or regexp query, e.g. the wildcard *facebook* (or the regexp .*facebook) will match thefacebook. However, you should know that wildcard and regexp queries come with a performance overhead.
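A minimal sketch of such a query, assuming the domains are indexed in a field named domain (the index and field names are illustrative):

GET /domains/_search
{
  "query": {
    "wildcard": {
      "domain": {
        "value": "*facebook*"
      }
    }
  }
}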
Another workaround is synonyms. With synonyms you specify a list of alternative search terms, e.g. facebook, thefacebook, facebooksocial, fb, fbook. With these in place, any one of the terms will match any of the others, i.e. if your search term is facebook and the domain is stored as thefacebook, the search will match.
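A sketch of how such a synonym list could be wired into the analysis chain; the index and analyzer names are assumptions, and the synonym list is just the example from the paragraph above:

PUT /domains
{
  "settings": {
    "analysis": {
      "filter": {
        "domain_synonyms": {
          "type": "synonym",
          "synonyms": [
            "facebook, thefacebook, facebooksocial, fb, fbook"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "domain_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}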
Also, for prioritization you first need to understand how scoring works in Elasticsearch, and then you can use boosting.

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" become separate tokens. The "A" token might then be removed by the stop token filter (see Standard Analyzer). So without any custom analyzers, you will typically match any document containing just "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea would be that "PDF/A" gets transformed into something like "pdf_a" at both index and query time; a simple match query will then work just fine. But this is a very simplistic approach, and you might want to consider how '/' characters are used in your content and use slightly more complex regex-based filters, which are also not perfect solutions.
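A sketch of such an analyzer, using a mapping char filter that rewrites '/' to '_' so that "PDF/A" survives standard tokenization as the single token pdf_a; the index and analyzer names are assumptions for illustration:

PUT /docs
{
  "settings": {
    "analysis": {
      "char_filter": {
        "slash_to_underscore": {
          "type": "mapping",
          "mappings": ["/ => _"]
        }
      },
      "analyzer": {
        "slash_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["slash_to_underscore"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}

The same analyzer has to be applied to the field at both index and query time, so the query string pdf/a is rewritten identically on both sides.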
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters, I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).
Since it does not use a strict query parser, it is a bit limited (e.g. no field queries like id:5), but it solves the problem.
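For reference, a minimal sketch of that query; the field name body is an assumption:

GET /docs/_search
{
  "query": {
    "simple_query_string": {
      "query": "pdf/a",
      "fields": ["body"]
    }
  }
}

Unlike query_string, simple_query_string ignores invalid syntax instead of raising a parse error, which is why reserved characters don't need escaping.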

Highlighting a query word in a document

I have a document and a query term. I want to:
1. Find the query term in the document.
2. Pad each occurrence of the query term with a certain text marker.
For example
Text: I solemnly swear that I am upto no good.
Query: swear
Output: I solemnly MATCHSTART swear MATCHEND that I am upto no good.
Assuming that I have multiple query words and a large document, how can I do this efficiently?
I did go over various links on the internet but couldn't find anything very conclusive or definite. Moreover, this is just a programming question and has nothing to do with search engine development or information retrieval.
Any help would be appreciated. Thanks.
If each of your queries is a single word (a substring containing no spaces, tabs, newlines, etc.), and a very low false-positive probability is acceptable (occasionally marking a word that is not in the query set), you can use a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
First, load your query words into the Bloom filter, then scan the document and test each word against the filter. If the test is positive, mark that word.
You can use my implementation of bloom filter: http://olegh.cc.st/src/bloom.c.txt
In Python:
# Read the document text in however you like.
text = "I solemnly swear that I am up to no good."
query = input("Query: ")
# Pad with spaces so only whole-word occurrences are replaced.
result = text.replace(" " + query + " ", " MATCHSTART " + query + " MATCHEND ")
print(result)
OUTPUT:
'I solemnly MATCHSTART swear MATCHEND that I am up to no good.'
You could also use regex, but that's slower, so I just used string concatenation to add whitespace to the beginning and end of the word (so as not to match "swears", "swearing", or "sportswear"). This is easily translatable to whatever language you prefer.
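Since this thread sits in an Elasticsearch context, it's worth adding that Elasticsearch's built-in highlighter can insert such markers natively via pre_tags and post_tags, as an alternative to post-processing the text yourself. A sketch, with index and field names assumed:

GET /books/_search
{
  "query": {
    "match": { "text": "swear" }
  },
  "highlight": {
    "pre_tags": ["MATCHSTART "],
    "post_tags": [" MATCHEND"],
    "fields": {
      "text": {}
    }
  }
}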
