Elasticsearch tokenizer to generate tokens for special characters

I have a use case where special characters should also be searchable. I have tried several tokenizers such as char_group, standard, and n-gram.
If I use an n-gram tokenizer I am able to make special characters searchable (since it generates a token for each character).
But n-gram generates far too many tokens, so I am not interested in using it.
For example, if the text is "hey john.s #100 is a test name", then the tokenizer should create the tokens [hey, john, s, #, 100, is, a, test, name].
Please refer to this question for a detailed explanation.
Thank you.

Based on your use case, the best option would be to use the whitespace tokenizer in combination with the word delimiter graph token filter.
For more information, check the official Elasticsearch documentation on the whitespace tokenizer and the word delimiter graph token filter:
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/analysis-whitespace-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
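For the example text in the question, a minimal sketch could look like this (the index, analyzer, and filter names are just placeholders, and the type_table entry that re-types '#' as ALPHA is my assumption for keeping '#' as its own token instead of having it stripped as a delimiter):

PUT special-chars-index
{
  "settings": {
    "analysis": {
      "filter": {
        "special_char_delimiter": {
          "type": "word_delimiter_graph",
          "type_table": ["# => ALPHA"]
        }
      },
      "analyzer": {
        "special_char_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["special_char_delimiter", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "special_char_analyzer" }
    }
  }
}

You can check the output with the Analyze API:

POST special-chars-index/_analyze
{
  "analyzer": "special_char_analyzer",
  "text": "hey john.s #100 is a test name"
}

The whitespace tokenizer only splits on spaces; the word delimiter graph filter should then split john.s into john and s and, with the type_table entry above, #100 into # and 100. Without that entry the # would simply be dropped.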

Related

Elasticsearch - tokenize terms by capitalized character, for example "TheStarTech" => [The, Star, Tech]

Does Elasticsearch have a tokenizer that splits terms on capitalized characters, for example tokenizing TheStarTech into the terms [The, Star, Tech]? The pattern tokenizer seems helpful; any suggestions?
See this: Word Delimiter Token Filter
It does what you want and more. You can pass in whatever parameters fit your needs. Check the split_on_case_change parameter, which is true by default.
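As a quick illustration (using the word delimiter graph variant recommended in the answer above; split_on_case_change is spelled out here even though it is the default):

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "split_on_case_change": true
    }
  ],
  "text": "TheStarTech"
}

This should return the tokens The, Star and Tech.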

How to search for # in Azure Search

Hi, I have a string field which has an nGram analyzer.
And our query goes like this.
$count=true&queryType=full&searchFields=name&searchMode=any&$skip=0&$top=50&search=/(.*)Site#12(.*)/
The text we are searching for contains Site#123.
The above query works with all other alphanumeric characters except #. Any idea how I could make this work?
If you are using the standard tokenizer, the ‘#’ character is removed from indexed documents as it’s considered a separator. For indexing, you can either use a different tokenizer, such as the whitespace tokenizer, or replace the ‘#’ character with another character such as ‘_’ with the mapping character filter (underscore ‘_’ is not considered a separator). You can test the analyzer behavior using the Analyze API: https://learn.microsoft.com/rest/api/searchservice/test-analyzer.
It’s important to know that the query terms of regex queries are not analyzed. This means that the ‘#’ character won’t be removed by the analyzer from the regex expression. You can learn more about query processing in Azure Search here: How full text search works in Azure Search
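A rough sketch of what that could look like in the index definition (the names are made up, and the exact shapes of the @odata.type entries should be double-checked against the Azure Search custom analyzer documentation):

{
  "name": "my-index",
  "fields": [
    {
      "name": "name",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "hash_preserving_analyzer"
    }
  ],
  "charFilters": [
    {
      "name": "hash_to_underscore",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": ["#=>_"]
    }
  ],
  "analyzers": [
    {
      "name": "hash_preserving_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "whitespace",
      "charFilters": ["hash_to_underscore"]
    }
  ]
}

With an analyzer like this, Site#123 would be indexed with the # mapped to _, and the same substitution would need to be reflected on the query side for non-regex queries.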
Your string is being tokenized by spaces and punctuation like #. If you want to search for # and other punctuation characters, you could consider tokenizing only by whitespace. Or perhaps do not apply any tokenization at all and treat the whole string as a single token.

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" are separate tokens. The "A" token might get cut out by the stop token filter (see Standard Analyzer). So without any custom analyzers, you will typically get back any documents containing just "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea would be that "PDF/A" gets transformed into something like "pdf_a" at index and query time, and a simple match query will then work just fine. But this is a very simplistic approach, and you might want to consider how '/' characters are used in your content and use slightly more complex regex filters, which are also not perfect solutions.
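A sketch of that idea (index, analyzer, and char filter names are placeholders; note that the char filter blindly rewrites every '/' to '_' before tokenization, which is exactly the trade-off mentioned above):

PUT docs
{
  "settings": {
    "analysis": {
      "char_filter": {
        "slash_to_underscore": {
          "type": "mapping",
          "mappings": ["/ => _"]
        }
      },
      "analyzer": {
        "slash_safe": {
          "type": "custom",
          "char_filter": ["slash_to_underscore"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "slash_safe" }
    }
  }
}

Because the same analyzer runs at query time, both the indexed "PDF/A" and the query term end up as pdf_a, so no escaping is needed:

GET docs/_search
{
  "query": {
    "match": { "body": "PDF/A" }
  }
}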
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).
Since it does not use a full query parser it is a bit limited (e.g. no field queries like id:5), but it serves the purpose.
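For reference, such a query looks roughly like this (index and field names are placeholders):

GET docs/_search
{
  "query": {
    "simple_query_string": {
      "query": "pdf/a",
      "fields": ["body"]
    }
  }
}

Unlike the query_string query, simple_query_string does not throw parse errors on reserved characters; syntax it cannot interpret is simply ignored.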

What characters does the standard tokenizer delimit on?

I was wondering which characters are used to delimit a string by Elasticsearch's standard tokenizer?
As per the documentation, the standard tokenizer follows the Unicode Text Segmentation algorithm, so I believe the default word boundary rules linked here define where tokens are split: http://unicode.org/reports/tr29/#Default_Word_Boundaries
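You can see the effect directly with the Analyze API; for instance, in line with the other answers here, '#' is treated as a boundary and 'PDF/A' is split in two:

POST _analyze
{
  "tokenizer": "standard",
  "text": "PDF/A Site#123"
}

This should return the tokens PDF, A, Site and 123.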

Tokenizer that splits up words with underscores in them but also retains a full version

I'm implementing an Elasticsearch search for content that has filenames in it, such as
"golf_master_2009.xls". I'd like a tokenizer that splits this up into at least the following tokens: "golf", "master", "golf_master_2009.xls". Right now I have to use wildcards (for example *master*) if I want to search for it without specifying the full filename.
You can apply different analyzers using a multi-field.
See http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html
HTH
You can use your own analyzer with the keyword tokenizer and the word delimiter token filter (with the options generate_word_parts and preserve_original set to true).
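A sketch of that suggestion (index, analyzer, and filter names are placeholders):

PUT files
{
  "settings": {
    "analysis": {
      "filter": {
        "filename_parts": {
          "type": "word_delimiter",
          "generate_word_parts": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "filename_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["filename_parts", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "filename": { "type": "text", "analyzer": "filename_analyzer" }
    }
  }
}

For golf_master_2009.xls this should index the original string plus the parts golf, master, 2009 and xls, so a plain match query for master finds the document without wildcards.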
