Implement Fuzzy Prefix Full-text matching query in Elasticsearch

I understand that Elasticsearch tries to avoid combining fuzzy search with prefix matching, and that this complexity is why it doesn't natively support such a feature. However, we have a directory search system that relies solely on Elasticsearch as a black-box search engine, and we need the following logic:
For example, say the search terms are "Michael Pierce Chem". We want full-text search on the first two terms (with a match query), and on the last term we want to apply fuzziness first and then a prefix match, so that "Chem" matches "chemistry", "chen", and even "YouTube Chen" thanks to full-text support.
Please give me some advice on the implementation and design. Our current stack is a NodeJS web app with Elasticsearch.
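One way to approximate this, given that Elasticsearch has no single fuzzy-prefix clause, is to split the input in your NodeJS layer into the "complete" leading terms and the trailing partial term, then OR a prefix clause with a fuzzy clause for that last term. A minimal sketch, assuming a placeholder index named directory and a text field named name; note that term-level clauses like prefix and fuzzy are not analyzed, so your app should lowercase the last term first:

POST /directory/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Michael Pierce", "fuzziness": "AUTO" } } }
      ],
      "should": [
        { "prefix": { "name": { "value": "chem" } } },
        { "fuzzy": { "name": { "value": "chem", "fuzziness": "AUTO" } } }
      ],
      "minimum_should_match": 1
    }
  }
}

The prefix clause lets "chem" match "chemistry" while the fuzzy clause lets it match "chen", and because the field is full-text indexed, either can match inside "YouTube Chen". On Elasticsearch 7.2+ you could also look at the match_bool_prefix query, which builds a similar bool query for you and accepts a fuzziness parameter, though there the fuzziness applies to every term except the final prefix term.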

Related

Searching for a term as both a single string and a multi-worded string

I'm setting up my Elastic instance in a schema-less manner (no up-front mappings), and the application requires that users be able to search against a field that contains a word that may or may not be tokenized into multiple strings. For example, the field may contain the word "ONETWO". The spec requires that a user be able to search "ONETWO", "ONE", and "TWO" and retrieve that same document. There doesn't seem to be any easy way to accomplish this, even with a custom tokenizer (and I don't think there SHOULD be an easy way to do this -- or any way at all). Just want to confirm my thoughts.
It's actually quite easy to meet your requirement with a custom analyzer that uses the n-gram tokenizer. You can also pass the output through a lowercase token filter, so that even though your text is "ONETWO", a user searching for "one", "One", or "ONE" still gets a result. Note that for this you need to apply a different analyzer at search time; read more about that at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html.
See https://devticks.com/how-to-improve-your-full-text-search-in-elasticsearch-with-ngram-tokenizer-e346f29f8ddb for more information, and let me know if you need anything else.
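A minimal sketch of such an index, assuming placeholder names (my-index, my_field) and 7.x-style mappings; the gram sizes and the max_ngram_diff setting are assumptions to tune for your data, and note that this does require defining settings up front, in tension with a fully schema-less setup:

PUT /my-index
{
  "settings": {
    "index": { "max_ngram_diff": 5 },
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 8,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Indexing "ONETWO" stores lowercased grams such as "one", "two", and the full "onetwo" (a 6-gram), so standard-analyzed searches for "ONE", "TWO", or "ONETWO" all hit the same document. The max_ngram_diff index setting is needed on recent versions because min_gram and max_gram differ by more than the default allowance of 1.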

Elasticsearch: Return same search results regardless of diacritics/accents

I've got a word in the text (e.g. nagymező), and I want to be able to type either nagymező or nagymezo into the search query and have the text containing that word show up in the results.
How can it be accomplished?
You want to use a Unicode folding strategy, probably the asciifolding filter. I'm not sure which version of Elasticsearch you're on, so here are a couple of documentation links:
asciifolding for ES 2.x (older version, but much more detailed guide)
asciifolding for ES 6.3
The trick is to remove the diacritics at index time so they don't bother you anymore.
Have a look at ignore accents in elastic search with haystack
and also at https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html (look for 'diacritic' on the page).
Then, just because it will probably be useful to someone one day or another, know that the regular expression \p{L} will match any Unicode letter :D
Hope this helps,
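A minimal sketch of such a custom analyzer, assuming placeholder index/field names and a recent (7.x-style, typeless-mapping) cluster:

PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "folding_analyzer" }
    }
  }
}

Because the same analyzer runs at index time and at query time by default, "nagymező" is stored as "nagymezo", and a full-text query for either form folds to the same token and matches.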

ES analyzer that tokenizes numbers and digits as well

I am using Elasticsearch's built-in simple analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-simple-analyzer.html), which uses the lowercase tokenizer, and the text "apple 8 IS Awesome" is tokenized as:
"apple",
"is",
"awesome"
You can clearly see that it misses the number 8, so if I search for just 8, my message will not appear in the results.
I went through all the analyzers available with ES but couldn't find a suitable one that matches my requirement.
How can I tokenize words along with numbers using a custom or built-in ES analyzer?
Your question is about the simple analyzer, but you link to a very old version of the documentation. Try
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html
As Val told you, you are probably looking for the standard analyzer.
If you want to see the difference try the analysis api:
http://localhost:9200/_analyze?analyzer=simple&text=apple%208%20IS%20Awesome
http://localhost:9200/_analyze?analyzer=standard&text=apple%208%20IS%20Awesome
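Those query-string URLs are the old form of the API; recent Elasticsearch versions expect a JSON body instead. A sketch (swap "standard" for "simple" to watch the 8 disappear):

POST /_analyze
{
  "analyzer": "standard",
  "text": "apple 8 IS Awesome"
}

With the standard analyzer this returns the tokens apple, 8, is, and awesome: the standard tokenizer keeps numbers, and the lowercase filter handles the casing.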

Search algorithm options for ontology querying?

I have developed a tool that enables searching of an ontology I authored. It submits the searches as SPARQL queries.
I have received some feedback that my search implementation is all-or-none, or "binary". In other words, if a user's input doesn't exactly match a term in the ontology, they won't get any hit at all.
I have been asked to add some more flexible, or "advanced" search algorithms. Indexing and bag-of-words searching were suggested.
Can anyone give some examples of implementing search methods on an ontology that don't require a literal match?
First of all, what kind of entities are you trying to match (literals, or string casts of URIs?), and what kind of SPARQL queries are you running now? Something like this?
?term ?predicate "user input" .
If you are searching across literals, you can make the search more flexible right off the bat by using case-insensitive regular-expression filtering, although this will probably make your searches slower, and it won't catch cases where some of the word tokens are present but in a different order. In the following example, you should probably constrain the types of ?term and ?predicate first, or even filter on a string datatype on ?userInput:
?term ?predicate ?someLiteral .
FILTER(regex(?someLiteral, "user input", "i"))
Several triplestores offer support for full-text searching and result scoring. These are often extensions to the SPARQL language.
For example, Virtuoso and some others offer a bif:contains predicate. Virtuoso also offers the faceted search web interface (plus a service, I think.) I have been pleased with the web-based full text search in Blazegraph and Stardog, but I can't say anything at this point about using them with a SPARQL query to get a score on a search pattern. Some (GraphDB) even support explicit integration with Lucene or Solr*, so you may be able to take advantage of their search languages.
Finally... are you using a library like the OWL API or RDF4J to access your ontology? If so, you could certainly save the relationships between your terms and any literals in a Java native data structure, and then directly use a fuzzy search component like Lucene to index each literal as a "document" and then search the user input across the index.
Why don't you post your ontology and give an example of a search you would like to perform in a non-binary way? I (or someone else) can try to show you a minimal implementation.
*Solr integration only appears to be offered in the commercially-licensed version of GraphDB

Can Elasticsearch understand nouns, verbs etc?

Hi, I was wondering whether there is any analyzer in Elasticsearch that can identify the grammatical role of words in the text (nouns, verbs, etc.).
For example, when the user searches for "fast smartphone", Elasticsearch should be able to put more emphasis on "smartphone" than on "fast". So I would like Elasticsearch to return results in the following order:
1) docs where both words match "fast smartphone"
2) docs where smartphone matches
3) docs where only "fast" matches. Or maybe docs matching only "fast" should never be returned at all, since the user is mainly looking for smartphones.
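Elasticsearch has no built-in part-of-speech analysis (community NLP plugins exist, e.g. OpenNLP-based ingest plugins), so one workaround is to tag the important noun outside Elasticsearch and boost it at query time. A minimal sketch, assuming placeholder names (a products index with a description field) and that your application has already decided "smartphone" is the head noun:

POST /products/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "description": { "query": "smartphone", "boost": 3 } } },
        { "match": { "description": "fast" } }
      ],
      "minimum_should_match": 1
    }
  }
}

Documents matching both terms score highest, smartphone-only documents come next, and fast-only documents rank last; moving the boosted clause into a must would drop the fast-only documents entirely, per the stricter variant of the ordering above.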
