How can I remove one delimiter from elasticsearch tokenizer? - elasticsearch

I am using Elasticsearch 6.8 for text search, and I realised that the Elasticsearch tokenizer breaks text into words using the delimiters listed here: http://unicode.org/reports/tr29/#Default_Word_Boundaries. I am using match_phrase to search one of the fields in my documents, and I'd like to remove one delimiter used by the tokenizer.
I did some searching and found solutions such as using keyword rather than text. That would have a big impact on my search function because keyword doesn't support partial matching.
Another solution is to keep a keyword field but query it with a wildcard to support partial matching. That may hurt query performance, and I'd still like the tokenizer to split on the other delimiters.
A third option is to use tokenize_on_chars to define all the characters used to tokenize text, but that requires me to list every other delimiter. So I am looking for something like tokenize_except_chars.
So is there an easy way to take one character out of the delimiters the tokenizer uses in Elasticsearch 6.8?

I found that Elasticsearch supports protected_words, which can do the job. More info can be found at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/analysis-word-delimiter-tokenfilter.html
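A minimal sketch of what that could look like (the analyzer, field names, and the protected term below are made-up examples): a whitespace tokenizer combined with a word_delimiter filter whose protected_words list keeps the chosen terms intact while everything else is still split.

# Hypothetical index settings: "wi-fi" is protected from being split by the
# word_delimiter filter, while other delimiters still tokenize as before.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "protected_words": ["wi-fi"]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_word_delimiter"]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "my_field": {"type": "text", "analyzer": "my_analyzer"}
            }
        }
    }
}
# e.g. with the elasticsearch-py client: es.indices.create(index="my_index", body=settings)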

Related

Performance of match vs term query in elasticsearch?

I've been using a lot of match queries in my project. Now I have just come across the term query in Elasticsearch. It seems the term query is faster when the keyword you are querying for is exact.
Now I have a couple of questions:
Should I refactor my code (there is a lot of it) and use term instead of match?
How much better is the performance of term compared to match?
using term in my query:
main_query["query"]["bool"]["must"].append({"term":{object[..]:object[...]}})
using match query in my query:
main_query["query"]["bool"]["must"].append({"match":{object[..]:object[...]}})
Elastic discourages using term queries on text fields for obvious reasons (analysis!), but if you know you need to query a keyword field (not analyzed!), definitely go for term/terms queries instead of match. The match query does a lot more work besides analyzing the input, and it will eventually end up executing a term query anyway once it notices that the queried field is a keyword field.
As far as I know, using the match query means your field is mapped as "text" and goes through an analyzer. The indexed values are turned into tokens, and when you run the query your input is also analyzed, and the correspondence is made token by token.
term does an exact match: it does not go through any analyzer and looks for the exact term in the inverted index.
Because it skips analysis, I believe term is faster.
I use term queries to search keyword-like values such as categories and tags, things that don't make sense to analyze.
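To make the difference concrete, a small sketch in the same style as the snippets above, with hypothetical field names ("status" mapped as keyword, "title" mapped as text):

main_query = {"query": {"bool": {"must": []}}}
# term: no analysis, exact lookup in the inverted index -> suited to keyword fields
main_query["query"]["bool"]["must"].append({"term": {"status": "published"}})
# match: the input is analyzed first and then executed as term queries on the
# resulting tokens -> suited to analyzed text fields
main_query["query"]["bool"]["must"].append({"match": {"title": "quick brown fox"}})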

Find documents whose analyzed field contains slash and underscore

My documents have an analyzed field url with content looking like this
http://sub.example.com/data/11/222/333/filename.txt
I would like to find all documents whose filename starts with an underscore. I've tried multiple approaches (wildcard, pattern, query_string, span queries) but I never got the right result. I expect this is because the underscore is a term separator. How can I write such a query? Is it possible at all without changing the field to not analyzed (which I cannot do at the moment)?
It's ElasticSearch 1.5, but we'll be migrating to at least 2.4 in the foreseeable future.
You might be able to write a script that would do that, but it would be amazingly slow.
Your best bet (even though you say you can't right now) is changing the field from analyzed to a multi-field, as sketched below. This way you could have both analyzed and not-analyzed versions to work with.
You could use the Reindex API to migrate all the data from the old version to the new version (assuming you're using ES 2.3 or greater).
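A sketch of what such a multi-field mapping could look like in the pre-5.x syntax the question is on (field and sub-field names are placeholders):

# Hypothetical mapping: "url" stays analyzed for normal search, while
# "url.raw" keeps the original value untouched (not_analyzed), so a wildcard
# query for filenames starting with "_" can target url.raw instead.
url_mapping = {
    "properties": {
        "url": {
            "type": "string",
            "fields": {
                "raw": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}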

Elasticsearch - Autocomplete return word/term/token suggestions instead of whole documents

I am trying to implement a simple auto completion for query terms.
There are many different approaches, but most of them return documents instead of terms, or the authors simply stop explaining at that point and I am not able to adapt.
A user is typing in a query - e.g. phil
What I want is to provide a list of term completion suggestions like philipp, philius, philadelphia, ...
I am able to get document matches via (edge) ngrams, phrase_prefix and so on, but I am stuck at retrieving matching terms (completion suggestions).
Can someone give me a hint?
I have documents like this {"title":"...", "description":"...", "content":"..."}
All fields have larger string values but especially the field content contains fulltext content.
I do not want to suggest the whole title of a document containing e.g. Philadelphia. Just the word "Philadelphia".
Looking for something like that, myself.
In SOLR it was relatively simple to configure (although a pain to build and keep up-to-date) using solr.SpellCheckComponent. Somehow the same underlying Lucene functionality is used differently between SOLR and ElasticSearch, and in ElasticSearch it is geared towards finding whole documents (or whole field values, if you will) or so it seems...
Despite the profusion of "elasticsearch autocomplete" articles, none appears to deal with this particular issue. Like it doesn't exist. Maybe their use case is different and ElasticSearch works for them just fine, who knows?
At this point I think that preparing the exact field values to use with ElasticSearch autocomplete (yes, that's the input field values, not analyzer tokens) may be the only way to solve the problem. Which is terrible, because the performance is going to be very low.
Try the term suggester:
The term suggester suggests terms based on edit distance. The provided suggest text is analyzed before terms are suggested. The suggested terms are provided per analyzed suggest text token. The term suggester doesn't take into account the query that is part of the request.
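For illustration, a sketch of a term-suggester request (the field name and suggest text are placeholders); since it works by edit distance it is geared towards correcting misspellings rather than completing prefixes:

# Hypothetical request body: returns suggested terms per token of the suggest
# text (e.g. corrections for "philadelfia"), not whole documents.
suggest_body = {
    "suggest": {
        "my-suggestion": {
            "text": "philadelfia",
            "term": {"field": "content"}
        }
    }
}
# e.g. with the elasticsearch-py client: es.search(index="my_index", body=suggest_body)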

Full-text search against string (databaseless)

Is there a way to perform a search in a document that I don't want stored anywhere? I've got some experience with Sphinx search and ElasticSearch, and it seems they both operate on a database of some kind. I want to search for a word in a single piece of text, in a string variable.
I ended up using nltk and pymorphy: just tokenizing my text and comparing stems/normalized morphological forms from pymorphy with the search terms. No need for any heavy full-text search weaponry.
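A rough sketch of that approach (assuming pymorphy2 and nltk with its tokenizer data are installed; the function and variable names are made up):

from nltk.tokenize import word_tokenize
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def text_contains(text, query_word):
    # Compare normal forms so that inflected forms of the query word still match.
    target = morph.parse(query_word)[0].normal_form
    return any(morph.parse(token)[0].normal_form == target
               for token in word_tokenize(text))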

retaining case in elasticsearch faceted search

Is there a way to do faceted searches using the Elasticsearch Search API while maintaining case (as opposed to having the results converted to lowercase)?
Thanks in advance, Chuck
Assuming you are using the "terms" facet, the facet entries are exactly the terms in the index. Briefly, analysis is the process of converting a field value into a sequence of terms, and lowercasing is a step in the default analyzer; that's why you're seeing lowercased terms. So you will want to change your analysis configuration (and perhaps introduce a multi_field if you want to run several different analyzers).
There's a great explanation in Lucene in Action (2nd Ed.); it's applicable to ElasticSearch, too.
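For example, a sketch using the old multi_field syntax from the Elasticsearch versions that still had facets (field names are placeholders): the analyzed version is kept for search, and an untouched (not_analyzed) copy preserves the original case for faceting.

# Facet on "category.untouched" instead of "category" to keep the original case.
category_mapping = {
    "properties": {
        "category": {
            "type": "multi_field",
            "fields": {
                "category": {"type": "string", "analyzer": "standard"},
                "untouched": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}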
