ES analyzer that tokenizes numbers/digits as well - elasticsearch

I am using Elasticsearch's built-in simple analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-simple-analyzer.html), which uses the lower case tokenizer, and the text apple 8 IS Awesome is tokenized as:
"apple",
"is",
"awesome"
You can clearly see that it skips the number 8, so if I search for just 8, my message will not appear in the results.
I went through all the analyzers available in ES but couldn't find one that matches my requirement.
How can I tokenize words along with numbers using a custom or built-in analyzer in ES?

Your question is about the simple analyzer, but you mention a very old link to documentation. Try
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html
Like Val told you, you are probably looking for the standard analyzer.
If you want to see the difference try the analysis api:
http://localhost:9200/_analyze?analyzer=simple&text=apple%208%20IS%20Awesome
http://localhost:9200/_analyze?analyzer=standard&text=apple%208%20IS%20Awesome
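The difference between the two can be sketched in a few lines of Python. This is a rough approximation of the two tokenizers, not the actual Lucene implementations:

```python
import re

def simple_analyze(text):
    # Simple analyzer ~ lower case tokenizer: splits on anything that is
    # not a letter, so digits like "8" are silently dropped
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

def standard_analyze(text):
    # Very rough stand-in for the standard tokenizer: keeps runs of
    # letters and digits, then lowercases them
    return [t.lower() for t in re.findall(r"\w+", text)]

print(simple_analyze("apple 8 IS Awesome"))    # ['apple', 'is', 'awesome']
print(standard_analyze("apple 8 IS Awesome"))  # ['apple', '8', 'is', 'awesome']
```

The second output keeps the 8, which is exactly what the standard analyzer gives you over the simple one.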

Related

Which Azure Cognitive Search analyzer is equivalent to ES keyword analyzer

The post title is pretty much what I want to ask:
Which Azure Cognitive Search analyzer is equivalent to ES keyword analyzer
I want to store composite values separated by spaces that should always be searched together, like:
Cap TC
Sis B
Act A
Act B
Act C
Using the Java API, I could find the matching keyword analyzer:
LexicalAnalyzerName.KEYWORD
Then I also found the documentation: https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
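For a quick intuition of why the keyword analyzer is the right fit here, the behavior can be sketched in Python (illustrative only; the real analyzers live inside the search services):

```python
def keyword_analyze(text):
    # The keyword analyzer emits the entire input as a single token,
    # so a composite value like "Cap TC" stays together
    return [text]

def word_split_analyze(text):
    # A word-splitting analyzer would break the composite value apart
    return text.lower().split()

print(keyword_analyze("Cap TC"))     # ['Cap TC']
print(word_split_analyze("Cap TC"))  # ['cap', 'tc']
```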

Is there a list of punctuations removed in standard analyzer in Elasticsearch?

In the official documentation of the standard analyzer in Elasticsearch, it is mentioned that "It removes most punctuation".
I need the list of the punctuations that this standard analyzer removes. Can someone point me to any reference or section of the Elasticsearch source code that might be useful in this regard?
The Elasticsearch standard analyzer is based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. There are a couple of rules and regexes that you can find in the documentation in detail.
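Because the tokenizer follows segmentation rules rather than a fixed punctuation list, there is no such list to enumerate: punctuation is whatever falls between word tokens. A crude Python approximation (a plain \w+ split, far simpler than the real UAX #29 rules, which also handle cases like apostrophes inside words):

```python
import re

def standard_like_analyze(text):
    # Crude approximation: keep runs of letters/digits and drop
    # everything between them, which is where punctuation disappears
    return [t.lower() for t in re.findall(r"\w+", text)]

print(standard_like_analyze("Hello, world... (really)?"))  # ['hello', 'world', 'really']
```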

Implement Fuzzy Prefix Full-text matching query in Elasticsearch

I understand that Elasticsearch tries to avoid combining fuzzy search with prefix matching, and this is also why it doesn't natively support such a feature, due to its complexity. However, we have a directory search system that relies solely on Elasticsearch as a black-box search engine, and we need the following logic:
E.g. say the search terms are "Michael Pierce Chem". We want full-text search on the first two terms (with a match query), and on the last term we want to do fuzzy matching first and then a prefix match, so that "Chem" matches "chemistry", "chen", and even "YouTube Chen" thanks to full-text support.
Please give me some advice on the implementation and design. The current stack is a NodeJS web app with Elasticsearch.
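One possible sketch of such a query, built here in Python for readability (the field name `name` is an assumption; the same JSON body can just as easily be produced from NodeJS): split the input, run a match query on the leading terms, and combine a fuzzy match with a match_phrase_prefix on the last term.

```python
def build_query(text, field="name"):
    # Split user input: full-text match on the leading terms,
    # fuzzy + prefix matching on the final (possibly partial) term
    *head, last = text.split()
    query = {
        "bool": {
            "should": [
                # Fuzzy match tolerates typos like "chen" vs "chem"
                {"match": {field: {"query": last, "fuzziness": "AUTO"}}},
                # Phrase-prefix lets "Chem" match "chemistry"
                {"match_phrase_prefix": {field: last}},
            ],
            "minimum_should_match": 1,
        }
    }
    if head:
        query["bool"]["must"] = [{"match": {field: " ".join(head)}}]
    return query

q = build_query("Michael Pierce Chem")
```

Note that fuzziness and prefix expansion cannot be applied to the *same* clause natively, which is why this sketch scores them as alternative should clauses instead.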

Elasticsearch: Return same search results regardless of diacritics/accents

I've got a word in a text (e.g. nagymező), and whether I type nagymező or nagymezo in the search query, the text containing that word should show up in the results.
How can this be accomplished?
You want to use a Unicode folding strategy, probably the asciifolding filter. I'm not sure which version of Elasticsearch you're on, so here are a couple of documentation links:
asciifolding for ES 2.x (older version, but much more detailed guide)
asciifolding for ES 6.3
The trick is to remove the diacritics at index time so they don't get in the way anymore.
Have a look at ignore accents in elastic search with haystack
and also at https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html (look for 'diacritic' on the page).
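To see what the folding does, here is a tiny Python stand-in for the asciifolding filter (the real filter covers many more characters; this just strips combining marks after Unicode decomposition):

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters (NFD), then drop the combining marks,
    # leaving only the base ASCII-ish letters
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("nagymező"))  # nagymezo
```

Index-time folding like this is why a query for nagymezo can match nagymező.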
Then, just because it will probably be useful to someone one day or the other, know that the regular expression \p{L} will match any Unicode letter :D
Hope this helps,

When are Stemmers used in ElasticSearch?

I am confused about when stemmers are used in ElasticSearch.
In the Dealing with Human Language / Reducing Words to Their Root Form section, I see that stemmers are used to strip words down to their root forms. This led me to believe that stemmers are used as a token filter on an analyzer.
But a token filter only filters tokens; it doesn't actually reduce words to their root forms.
So, where are stemmers used?
In fact, you can do stemming with a token filter in an analyzer. That is exactly how stemming works in ES. Have a look at the documentation for Stemmer Token Filter.
ES also provides the Snowball Analyzer, which is a convenient analyzer to use for stemming.
Otherwise, if there is a different type of stemming you would like to use, you can always build your own Custom Analyzer. This gives you complete control over the stemming solution that works best for you, as discussed here in the guide.
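To make the "stemming is just a token filter" point concrete, here is a toy analyzer in Python (the suffix rules are deliberately naive; real stemmers like snowball or porter are far more careful):

```python
import re

def naive_stem(token):
    # Toy token filter: strip a few common English suffixes.
    # Note a "filter" may transform tokens, not just drop them.
    return re.sub(r"(ing|ed|s)$", "", token)

def analyze(text):
    # An analyzer is a tokenizer plus a chain of token filters;
    # the stemmer is simply one more filter applied to each token
    tokens = [t.lower() for t in re.findall(r"[a-zA-Z]+", text)]
    return [naive_stem(t) for t in tokens]

print(analyze("jumping jumped jumps"))  # ['jump', 'jump', 'jump']
```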
Hope this helps!
