Elasticsearch: Return same search results regardless of diacritics/accents

I've got a word in the text (e.g. nagymező) and I want to be able to type in the search query nagymező or nagymezo and it should show this text which contains that word in the search results.
How can it be accomplished?

You want to use a Unicode folding strategy, probably the asciifolding filter. I'm not sure which version of Elasticsearch you're on, so here are a couple of documentation links:
asciifolding for ES 2.x (older version, but much more detailed guide)
asciifolding for ES 6.3

The trick is to strip the diacritics at index time so they don't bother you anymore.
Have a look at ignore accents in elastic search with haystack
and also at https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html (look for 'diacritic' on the page).
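For example, here is a minimal sketch of such a custom analyzer (index name my_index and field name body are placeholders, and the mapping uses the typeless 7.x format; older versions need a mapping type):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "folding" }
    }
  }
}

# Both spellings produce the same folded token "nagymezo":
GET /my_index/_analyze
{
  "analyzer": "folding",
  "text": "nagymező"
}

Since the same analyzer runs at query time by default, searching for either nagymező or nagymezo will match the document.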
Then, just because it will probably be useful to someone one day or another, know that the regular expression \p{L} will match any Unicode letter :D
Hope this helps,

Related

Can ElasticSearch return relevant passages and not the entire document

I'm looking for a search engine that is able to return just relevant passages as a result and not the entire documents. Is ElasticSearch able to do this?
If you are looking to extract part of a long document, look at Highlighting. Specifically, parameters like fragment_size (100 characters by default) or boundary_chars will help you build that functionality.
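For example, a sketch of such a highlight request (index and field names are placeholders); each returned fragment is a relevant passage rather than the whole document:

GET /my_index/_search
{
  "query": { "match": { "body": "relevant passages" } },
  "highlight": {
    "fields": {
      "body": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}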

Implement Fuzzy Prefix Full-text matching query in Elasticsearch

I understand that Elasticsearch tries to avoid combining fuzzy search with prefix matching, and that this complexity is also why it doesn't support such a feature natively. However, we have a directory search system that relies solely on Elasticsearch as a black-box search engine, and we need the following logic:
E.g. say the terms are "Michael Pierce Chem". We want to support full-text search on the first two terms (with a match query), and on the last term we want to do fuzzy matching first and then a prefix match, so that "Chem" matches "chemistry", "chen", and even "YouTube Chen" thanks to the full-text support.
Please give me some implementation advice and design suggestions. Our current stack is a NodeJS web app with Elasticsearch.
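Not a full answer, but on recent Elasticsearch versions (7.2+) a match_bool_prefix query gets close to this; a sketch, with the field name name as a placeholder:

GET /directory/_search
{
  "query": {
    "match_bool_prefix": {
      "name": {
        "query": "Michael Pierce Chem",
        "fuzziness": "AUTO"
      }
    }
  }
}

# Every term is matched as a full term except the last one, which is
# matched as a prefix. Caveat: fuzziness applies to the full terms
# ("michael", "pierce") but not to the final prefix term, so a
# fuzzy-then-prefix match on "Chem" (e.g. hitting "Chen") would still
# need extra work, such as a bool query that ORs in a fuzzy clause on
# the last token, built in the NodeJS app.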

autocomplete and search in Elasticsearch

Is there any way to search on two incomplete words in the same field using Elasticsearch in Rails? I mean the situation where I could successfully find the phrase "victorian buildings" by typing, for example, "vict bui" into the search input (only the beginnings of the words, ideally with fuzziness too).
Partial matching (word_start, text_start etc., available in Searchkick) doesn't work in this project. I've also tried wildcard queries, but they failed too. Maybe writing some custom mappings/settings would be a good idea?
Can I ask you for any suggestions on what to search/read to do this task?
Try this example:
"%#{params[:place]}%"
Since % is a wildcard, doing a LIKE on '%%' (which is what you get when params[:place] is empty) matches everything, and you get all the records in the result.
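For the custom mappings/settings idea from the question, a sketch along these lines might be a starting point (index, analyzer and field names are placeholders; with Searchkick you would pass the equivalent settings/mappings through its configuration options):

PUT /buildings
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 15 }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

# operator "and" makes every typed prefix mandatory, so "vict bui"
# matches "victorian buildings" but not "buildings" alone:
GET /buildings/_search
{
  "query": {
    "match": { "title": { "query": "vict bui", "operator": "and" } }
  }
}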

Elastic Greeklish to Greek conversion

I am new to Elasticsearch and I am trying to find a way to convert greeklish characters to Greek when a search executes.
E.g. the word "papoutsia" should be searchable as "παπουτσια" (shoes).
During my research I found the following plugins:
elasticsearch-analysis-greeklish
elasticsearch-skroutz-greekstemmer
I applied the filters to my index as in the example, but my queries still hit nothing.
Do I have to apply the filter somehow in every query, or write a special one?
Sorry if this question requires a very long/broad answer.
I have been trying to figure out how the whole filtering thing works for a couple of days, to understand whether I am even headed in the right direction or have to find another way to solve this.
Unfortunately, the intention of the greeklish plugin / char filter is the inverse of what you want to achieve:
Using this filter, you can retrieve greek text from a document, using a query that is written in latin characters ("greeklish").
So, for your example, you can add a document with the text παπούτσια and retrieve it using the terms papoutsia, papoutsi, etc.
We have prepared a detailed text pipeline example in the repo's wiki for future reference.
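For reference, a hedged sketch of what the index settings look like, based on the plugin's README (filter type and parameter names may differ per plugin version; index and field names are placeholders). The greeklish filter expands each Greek token into its Latin variants at index time, which is why a Latin-character query like papoutsia can then hit a document containing παπούτσια:

PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "greek_lowercase": { "type": "lowercase", "language": "greek" },
        "my_greeklish": { "type": "greeklish", "max_expansions": 20 }
      },
      "analyzer": {
        "greek_greeklish": {
          "tokenizer": "standard",
          "filter": [ "greek_lowercase", "my_greeklish" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "greek_greeklish" }
    }
  }
}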

ElasticSearch search for partial alphanumeric values

I have a string field with values like PA2456U or PA23U-RB, and I would like to do a partial match so that I can search for PA24 and get the first result, or search for PA23U-RB and find the second result (that one would be a full match).
I tried using ngrams, but they ignore the numeric values, so if I enter pa111 it returns anything that starts with pa.
See this gist for an example.
This may be a separate question, or related, but searching for 12345001 should also match 12345-001.
Thanks
Update
The final analyzer I used is here: https://gist.github.com/3803180
Making ngrams looks like a good choice based on your requirements, but I think edge_ngrams should be enough. That way your index will grow a little more slowly, since you'd be indexing fewer terms. The problem, anyway, is that you shouldn't apply the same analyzer to the query too; otherwise querying for pa111 would mean querying for all the ngrams that you can make out of it, which would lead to a lot more matches than you'd expect.
You just need to change your search_analyzer to an analyzer which doesn't produce ngrams. You can use the same one you already have and remove the ngram token filter (only from the search_analyzer; the index_analyzer is fine).
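As a sketch of that split (analyzer and field names are placeholders; in current Elasticsearch the pair is called analyzer/search_analyzer):

PUT /parts
{
  "settings": {
    "analysis": {
      "filter": {
        "code_edge_ngram": { "type": "edge_ngram", "min_gram": 2, "max_gram": 10 }
      },
      "analyzer": {
        "code_index": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "code_edge_ngram" ]
        },
        "code_search": {
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "part_number": {
        "type": "text",
        "analyzer": "code_index",
        "search_analyzer": "code_search"
      }
    }
  }
}

# Now PA24 is analyzed to the single term "pa24" at query time, which
# matches the edge ngrams produced from PA2456U at index time.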
Regarding the dash question, have a look at the word delimiter token filter. You need to configure it to make it work as you expect. I guess the generate_number_parts=false, generate_word_parts=false and split_on_numerics=false options, together with catenate_numbers=true, should make it work as you want: the dash won't be indexed, and 12345-001 will be indexed as the single token 12345001. You need to apply the token filter at both index time and query time.
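A quick way to sanity-check that configuration before wiring it into the mapping is the _analyze API (option names as in recent Elasticsearch versions):

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "generate_word_parts": false,
      "generate_number_parts": false,
      "split_on_numerics": false,
      "catenate_numbers": true
    }
  ],
  "text": "12345-001"
}

# Expected output: the single token "12345001", so a query for 12345001
# matches the indexed 12345-001.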
