When are Stemmers used in ElasticSearch? - elasticsearch

I am confused about when stemmers are used in ElasticSearch.
In the Dealing with Human Language / Reducing Words to Their Root Form section I see that stemmers are used to strip words down to their root forms. This led me to believe that stemmers were used as a token filter in an analyzer.
But a token filter only filters tokens; it does not actually reduce words to their root forms.
So, where are stemmers used?

In fact, you can do stemming with a token filter in an analyzer. That is exactly how stemming works in ES. Have a look at the documentation for Stemmer Token Filter.
ES also provides the Snowball Analyzer, which is a convenient analyzer to use for stemming.
Otherwise, if there is a different type of stemming you would like to use, you can always build your own Custom Analyzer. This gives you complete control over the stemming solution that works best for you, as discussed here in the guide.
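For example, a minimal sketch of such a custom analyzer that wires the stemmer token filter into the analysis chain could look like this (the index, filter, and analyzer names are just placeholders):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "my_stemming_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        }
      }
    }
  }
}

Running the _analyze API with that analyzer on a phrase like "running foxes" should return the stemmed tokens run and fox, which is exactly the "reduce words to their root form" step happening inside a token filter.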
Hope this helps!

Related

Partial word tokenizers vs Word oriented tokenizers Elasticsearch

Reading the link below, I am looking for a use case/example showing when it would be better to use an ngram tokenizer versus the standard tokenizer, with some comparison.
I hope the Elastic documentation will include more examples and comparisons in the future.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
Can someone help me?
Thank you.
The Elastic documentation does include more examples. You can find them on the dedicated page of each tokenizer (here is the standard, here is the ngram).
In general, you might want to use an ngram tokenizer to implement a search-as-you-type functionality, such as the auto-suggest in a search input.
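As a rough sketch (the names and gram sizes are made-up examples), a search-as-you-type setup would typically use the edge_ngram variant of the ngram tokenizer at index time:

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

The standard tokenizer would turn Quick Fox into just the two tokens quick and fox (after lowercasing), whereas this tokenizer also emits the leading fragments qu, qui, quic, fo, and so on, so a prefix typed by the user can match the indexed terms directly.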

Implement Fuzzy Prefix Full-text matching query in Elasticsearch

I understand that Elasticsearch tries to avoid combining fuzzy search with prefix matching, and this is also why it doesn't natively support such a feature, due to its complexity. However, we have a directory search system that relies solely on Elasticsearch as a black-box search engine, and we need the following logic:
E.g. say the terms are "Michael Pierce Chem". We want to support full-text search on the first two terms (with a match query), and on the last term we want to do fuzzy matching first and then a prefix match, so that "Chem" matches "chemistry", "chen", and even "YouTube Chen" thanks to full-text support.
Please give me some advice on the implementation and design suggestions. Current stack is a NodeJS web app with Elasticsearch.
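As a hedged sketch of one way this could be approached (this is not from the original thread; the directory index and the name field are hypothetical), you could split the input in the NodeJS layer and combine a fuzzy match on the leading terms with prefix and fuzzy clauses on the last term:

GET /directory/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Michael Pierce", "fuzziness": "AUTO" } } }
      ],
      "should": [
        { "prefix": { "name": "chem" } },
        { "fuzzy": { "name": { "value": "chem" } } }
      ],
      "minimum_should_match": 1
    }
  }
}

The prefix clause covers the "Chem" → "chemistry" case, while the fuzzy clause covers "Chem" → "chen"; whether the resulting ranking is good enough would have to be verified against real directory data.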

Elasticsearch: Return same search results regardless of diacritics/accents

I've got a word in the text (e.g. nagymező), and I want to be able to type either nagymező or nagymezo into the search query and have the text containing that word show up in the search results.
How can it be accomplished?
You want to use a Unicode folding strategy, probably the asciifolding filter. I'm not sure which version of Elasticsearch you're on, so here are a couple of documentation links:
asciifolding for ES 2.x (older version, but much more detailed guide)
asciifolding for ES 6.3
The trick is to remove the diacritics at index time so they don't get in your way anymore.
Have a look at ignore accents in elastic search with haystack
and also at https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html (look for 'diacritic' on the page).
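A minimal sketch of such a folding analyzer (the analyzer name is just a placeholder) could be:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

If that analyzer is applied to the field at both index and search time, nagymező is indexed and searched as nagymezo, so both spellings of the query find the document.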
And, just because it will probably be useful to someone someday: the regular expression \p{L} matches any Unicode letter :D
Hope this helps,

ES analyzer which tokenizes the numbers/digits as well

I am using Elasticsearch's built-in simple analyzer https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-simple-analyzer.html, which uses the Lowercase Tokenizer, and the text apple 8 IS Awesome is tokenized as:
"apple",
"is",
"awesome"
You can clearly see that it misses the number 8, so if I search for just 8, my message will not appear in the search results.
I went through all the analyzers available in ES but couldn't find a suitable one which matches my requirement.
How can I tokenize the numbers as well as the words, using a custom or built-in ES analyzer?
Your question is about the simple analyzer, but you link to a very old version of the documentation. Try
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html
As Val told you, you are probably looking for the standard analyzer.
If you want to see the difference try the analysis api:
http://localhost:9200/_analyze?analyzer=simple&text=apple%208%20IS%20Awesome
http://localhost:9200/_analyze?analyzer=standard&text=apple%208%20IS%20Awesome
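On recent Elasticsearch versions the query-string form of _analyze is no longer accepted, so the equivalent check would be something like:

POST /_analyze
{
  "analyzer": "standard",
  "text": "apple 8 IS Awesome"
}

With the standard analyzer this should return the tokens apple, 8, is, and awesome, i.e. the number is kept, which is exactly what the question asks for.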

ElasticSearch search for partial alphanumeric values

I have a string field with values like PA2456U or PA23U-RB, and I would like to do a partial match, so that I can search for PA24 and get the first result, or search for PA23U-RB and find the second result (that would be a full match).
I tried using ngram, but it ignores the numeric values, so if I enter pa111 it returns anything that starts with pa.
See this gist for an example.
This may be a separate question, or related, but searching for 12345001 should also match 12345-001
Thanks
Update
The final analyzer I used is here: https://gist.github.com/3803180
Making ngrams looks like a good choice based on your requirements, but I think edge_ngrams should be enough. That way your index would grow a little more slowly, since you'd be indexing fewer terms. The problem is that you should not apply the same analyzer to the query as well; otherwise querying for pa111 would mean querying for all the ngrams that you can make out of it, which would give you a lot more matches than you'd expect.
You just need to change your search_analyzer to an analyzer which doesn't produce ngrams. You can use the same one you already have and remove the ngram token filter (only for the search_analyzer; the index_analyzer is fine).
Regarding the dash question, have a look at the Word delimiter token filter. You need to configure it to make it work as you expect. I guess the generate_number_parts=false, generate_word_parts=false and split_on_numerics=false options should make it work as you want. That way the dash won't be indexed. You need to apply the token filter at both index time and query time.
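Putting the edge_ngram suggestion into a concrete, hedged sketch (for a recent Elasticsearch version; the filter, analyzer, and field names and the gram sizes are made-up examples, not taken from the linked gist):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "code_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "analyzer": {
        "code_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "code_edge_ngram"]
        },
        "code_search_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "part_number": {
        "type": "text",
        "analyzer": "code_index_analyzer",
        "search_analyzer": "code_search_analyzer"
      }
    }
  }
}

With this mapping, PA2456U is indexed as pa, pa2, pa24, ..., pa2456u, so a match query for pa24 finds it, while pa111 produces only the single search-time token pa111 and therefore does not. The word_delimiter filter for the dash handling would then be added to both analyzers' filter chains.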
