How to search for # in Azure Search - elasticsearch

Hi, I have a string field which has an nGram analyzer.
Our query goes like this:
$count=true&queryType=full&searchFields=name&searchMode=any&$skip=0&$top=50&search=/(.*)Site#12(.*)/
The text we are searching for contains Site#123.
The above query works with all other alphanumeric characters, but not with #. Any idea how I could make this work?

If you are using the standard tokenizer, the ‘#’ character is removed from indexed documents because it is considered a separator. For indexing, you can either use a different tokenizer, such as the whitespace tokenizer, or replace the ‘#’ character with another character such as ‘_’ using the mapping character filter (underscore ‘_’ is not considered a separator). You can test the analyzer behavior using the Analyze API: https://learn.microsoft.com/rest/api/searchservice/test-analyzer.
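For example, a custom analyzer in the index definition could map ‘#’ to ‘_’ before tokenization. This is only a sketch: the field and analyzer names are illustrative, other index properties are omitted, and standard_v2 is used just to keep the example small (the question's field actually uses an nGram analyzer).

  {
    "fields": [
      { "name": "name", "type": "Edm.String", "searchable": true, "analyzer": "name_analyzer" }
    ],
    "charFilters": [
      { "name": "hash_to_underscore", "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter", "mappings": [ "#=>_" ] }
    ],
    "analyzers": [
      { "name": "name_analyzer", "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer", "tokenizer": "standard_v2", "charFilters": [ "hash_to_underscore" ] }
    ]
  }

With this mapping, Site#123 is indexed as Site_123. Note that regex query terms are not analyzed (see below), so a regex would have to target the mapped form, e.g. Site_123.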
It’s important to know that the query terms of regex queries are not analyzed. This means that the ‘#’ character won’t be removed by the analyzer from the regex expression. You can learn more about query processing in Azure Search here: How full text search works in Azure Search

Your string is being tokenized on spaces and punctuation such as #. If you want to search for # and other punctuation characters, you could consider tokenizing only on whitespace, or not applying any tokenization at all and treating the whole string as a single token.
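In Elasticsearch terms, that would look roughly like the sketch below (index, analyzer, and field names are illustrative). The built-in whitespace tokenizer splits only on whitespace; swapping it for the keyword tokenizer would keep the whole value as a single token.

  PUT my-index
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "whitespace_only": {
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "name": { "type": "text", "analyzer": "whitespace_only" }
      }
    }
  }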

Related

elasticsearch tokenizer to generate tokens for special characters

I have a use case where special characters should also be searchable. I have tried some tokenizers like char_group, standard, and n-gram.
If I use an n-gram tokenizer I am able to make special characters searchable (since it generates a token for each character).
But n-gram generates too many tokens, so I am not interested in using an n-gram tokenizer.
For example, if the text is "hey john.s #100 is a test name", then the tokenizer should create the tokens [hey, john, s, #, 100, is, a, test, name].
Please refer to this question for a detailed explanation.
Thank you.
Based on your use case, the best option would be to use the whitespace tokenizer in combination with the word delimiter graph filter.
For more information, check the official Elasticsearch documentation on the whitespace tokenizer and the word delimiter graph filter here:
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/analysis-whitespace-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
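A minimal sketch of that combination (index, filter, and analyzer names are illustrative; the type_table entry is an assumption added so that ‘#’ is kept as its own token rather than stripped by the filter, so verify the output on your data with _analyze):

  PUT my-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "split_words": {
            "type": "word_delimiter_graph",
            "type_table": [ "# => ALPHA" ]
          }
        },
        "analyzer": {
          "whitespace_delimited": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [ "split_words" ]
          }
        }
      }
    }
  }

With this, "hey john.s #100 is a test name" should come out close to the token list above, since word_delimiter_graph splits on the punctuation and letter/number transitions that the whitespace tokenizer leaves intact.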

Elasticsearch: highlighting with ngram_tokenizer and pattern_replace character filter

I have a custom analyzer that uses the "ngram" tokenizer and a pattern_replace character filter that removes all non-alphanumeric characters from the input.
So, for example, "+34 123 456 789" turns into "34123456789" and is then split up into 3-grams.
The goal is to allow searching phone numbers and other identifiers while ignoring whitespace and special characters, and to allow substring search as well.
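For reference, the analyzer described above looks roughly like this (a sketch; index, filter, and analyzer names are illustrative):

  PUT phone-index
  {
    "settings": {
      "analysis": {
        "char_filter": {
          "strip_non_alnum": {
            "type": "pattern_replace",
            "pattern": "[^0-9A-Za-z]",
            "replacement": ""
          }
        },
        "tokenizer": {
          "trigrams": { "type": "ngram", "min_gram": 3, "max_gram": 3 }
        },
        "analyzer": {
          "phone_ngram": {
            "type": "custom",
            "char_filter": [ "strip_non_alnum" ],
            "tokenizer": "trigrams"
          }
        }
      }
    }
  }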
Now, this works fine for searching, but highlighting is broken. As it says in the documentation of the pattern_replace character filter:
Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting
So, I tried using the pattern_replace token filter instead, but that filter is applied to the 3-gram tokens, which makes them unusable for searching.
Using the ngram filter instead of the ngram tokenizer doesn't help either, because the filter does not set the token positions correctly, resulting in the entire string being highlighted.
Is there a way to have the correct substring highlighted in this case?

Elasticsearch ignore special characters unless quoted

I am making a search tool to query against text fields.
If the user searches for:
6ES7-820
Then I would like to return all documents containing:
6ES7820
6-ES7820
6ES7-820
...
In other words, I would like to ignore special characters. I could achieve this by removing the special characters in both my search analyzer and my indexing analyzer.
But when the user searches for the same term using quotation marks (or some other marker):
"6ES7-820"
I want to only return the documents containing
6ES7-820
So in that case special characters should not be ignored, which means I cannot remove these characters while indexing.
How could this search method be implemented in Elasticsearch, and which analyzers should I use?
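The first half of that idea, stripping special characters at both index and search time, could look something like the sketch below (index, field, and analyzer names are illustrative); the quoted, exact-match case is the open part of the question and is not covered here.

  PUT parts-index
  {
    "settings": {
      "analysis": {
        "char_filter": {
          "drop_specials": {
            "type": "pattern_replace",
            "pattern": "[^0-9A-Za-z ]",
            "replacement": ""
          }
        },
        "analyzer": {
          "alnum_only": {
            "type": "custom",
            "char_filter": [ "drop_specials" ],
            "tokenizer": "standard",
            "filter": [ "lowercase" ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "part_number": { "type": "text", "analyzer": "alnum_only" }
      }
    }
  }

With this analyzer, 6ES7-820, 6-ES7820 and 6ES7820 all reduce to the same token 6es7820, so any of them matches the others.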

What characters does the standard tokenizer delimit on?

I was wondering which characters are used to delimit a string by Elasticsearch's standard tokenizer?
As per the documentation, I believe this is the list of symbols/characters used for defining token boundaries: http://unicode.org/reports/tr29/#Default_Word_Boundaries
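The standard tokenizer follows those Unicode word-boundary rules, so punctuation such as # and - acts as a delimiter. You can check specific inputs with the _analyze API, for example:

  POST _analyze
  {
    "tokenizer": "standard",
    "text": "Site#123 6ES7-820"
  }

which should return the tokens [Site, 123, 6ES7, 820].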

What does Elasticsearch's auto_generate_phrase_queries do?

In the docs for query string query, auto_generate_phrase_queries is listed as a parameter but the only description is "defaults to false." So what does this parameter do exactly?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
This maps directly to Lucene's org.apache.lucene.queryparser.classic.QueryParserSettings#autoGeneratePhraseQueries. When the analyzer is applied to the query string, this setting lets Lucene generate phrase (quoted) queries instead of plain keyword queries.
Quoting:
SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to
generate phrase queries if multiple tokens are generated from a single
non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11
will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
delimited languages.
where the word delimiter mentioned is Lucene's WordDelimiterFilter (WordDelimiterFilter.html).
The important part to note is single non-quoted analysis string, i.e. the setting only applies if your query string is not quoted. If you are already searching for a quoted phrase, it won't make any difference.
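For illustration, in older Elasticsearch versions the flag is passed directly on the query_string query, roughly like this (a sketch; the index and field names are illustrative):

  GET my-index/_search
  {
    "query": {
      "query_string": {
        "default_field": "text",
        "query": "pdp-11",
        "auto_generate_phrase_queries": true
      }
    }
  }

If the field's analyzer splits pdp-11 into two tokens, the parser then produces the phrase query "pdp 11" rather than pdp OR 11, as described in the quote above.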
