What does Elasticsearch's auto_generate_phrase_queries do? - elasticsearch

In the docs for query string query, auto_generate_phrase_queries is listed as a parameter but the only description is "defaults to false." So what does this parameter do exactly?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

This will directly match to the lucene's org.apache.lucene.queryparser.classic.QueryParserSettings#autoGeneratePhraseQueries. When the analyzer applied on the query string, this setting allows lucene to generate quoted phrases no keywords.
Quoting:
SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to
generate phrase queries if multiple tokens are generated from a single
non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11
will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
delimited languages.
where word delimiter works as WordDelimiterFilter.html
Important thing to note is single non-quoted analysis string, i.e. if your query string is non-quoted. If you are already searching for a quoted phrase then it won't make any sense.

Related

Elasticsearch ingore special characters unless quoted

I am making a search tool to query against text fields.
If the user searches for:
6ES7-820
Than I would like to return all documents containing:
6ES7820
6-ES7820
6ES7-820
...
In other words I would like to ignore special characters. I could achieve this by removing the special characters in my search analyzer and my indexing analyzer.
But when the user would search for the same term using quotation marks (or something else):
"6ES7-820"
I want to only return the documents containing
6ES7-820
So then special characters should not be ignored. This means I can not remove these characters while indexing.
How could this search method be implemented in Elasticsearch, which analysers should I use?

Getting an exact match to the string `#deprecated` in Kibana/ELK

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it matches string with the word "deprecated" without the # sign.
I tried to use escaping for # according to the Lucene Documentation. i.e. message:"\\#deprecated" - without change in results.
How can I query to exact match the #deprecated text exact match only
Why is this happening?
You problem isn't an issue with query syntax, which is what escaping is for, it's with analysis. You analyzer removes punctuation, because it's parsing it as full text. It removes #, in much the same way that it will remove periods and commas.
So, after analysis (assuming standard analysis) of something like: "Class is #deprecated" the token stream generated will have the following tokens: "class", "deprecated" ("is" is a stop word). The indexed form of "#deprecated" and "deprecated" are identical, so it is impossible to have a query that can differentiate between them as it is currently indexed.
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good. If you use WhitespaceAnalyzer, you are going to have to contend with other punctuation as well, and a search for "sentence"
would not find "match at the end of this sentence.", because of the period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries

How to search for # in Azure Search

Hi I have a string field which has an nGram analyzer.
And our query goes like this.
$count=true&queryType=full&searchFields=name&searchMode=any&$skip=0&$top=50&search=/(.*)Site#12(.*)/
The test we are searching for has Site#123
The above query will work with all other alpha numeric charecters except #. Any idea how could I make this work.
If you are using the standard tokenizer, the ‘#’ character was removed from indexed documents as it’s considered a separator. For indexing, you can either use a different tokenizer, such as the whitespace tokenizer, or replace the ‘#’ character with another character such as ‘_’ with the mapping character filter (underscore ‘_’ is not considered a separator). You can test the analyzer behavior using the Analyze API : https://learn.microsoft.com/rest/api/searchservice/test-analyzer.
It’s important to know that the query terms of regex queries are not analyzed. This means that the ‘#’ character won’t be removed by the analyzer from the regex expression. You can learn more about query processing in Azure Search here: How full text search works in Azure Search
Your string is being tokenized by spaces and punctuation like #. If you want to search for # and other punctuation characters, you could consider tokenzing only by whitespace. Or perhaps do not apply any tokenization at all and treat a whole string as a single token.

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should i analyze my query-string and what type of query should i use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" are separate tokens. The "A" token might get cut out by the stop token filter (See Standard Analyzer). So without any custom analyzers, you will typically get any documents with just "PDF".
You can try creating your own analyzer modeled off the standard analyzer that includes a Mapping Char Filter. The idea would that "PDF/A" might get transformed into something like "pdf_a" at index and query time. A simple match query will work just fine. But this is a very simplistic approach and you might want to consider how '/' characters are used in your content and use slightly more complex regex filters which are also not perfect solutions.
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters i now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html)
As of not using a query parser it is a bit limited (e.g. no field-queries like id:5), but it solves the purpose.

What characters does the standard tokenizer delimit on?

I was wondering which characters are used to delimit a string for elastic search's standard tokenizer?
As per the documentation I believe this is the list of symbols/characters used for defining tokens: http://unicode.org/reports/tr29/#Default_Word_Boundaries

Resources