Getting an exact match to the string `#deprecated` in Kibana/ELK - elasticsearch

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it also matches strings containing the word "deprecated" without the # sign.
I tried escaping the # according to the Lucene documentation, i.e. message:"\\#deprecated", but the results did not change.
How can I query for an exact match of the text #deprecated only?
Why is this happening?

Your problem isn't an issue with query syntax, which is what escaping is for; it's an issue with analysis. Your analyzer removes punctuation, because it's parsing the field as full text. It strips the #, in much the same way that it strips periods and commas.
So, after analysis (assuming standard analysis) of something like "Class is #deprecated", the generated token stream will have the tokens "class" and "deprecated" ("is" is a stop word). The indexed forms of "#deprecated" and "deprecated" are identical, so it is impossible to write a query that differentiates between them as currently indexed.
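You can see the effect directly with the _analyze API. A minimal sketch, assuming a local cluster (the host, port, and exact stop-word configuration are assumptions, not from the question):

```python
import json
import urllib.request

# Ask Elasticsearch how the standard analyzer tokenizes the text.
# Depending on configuration, "is" may or may not be stripped as a
# stop word, but the "#" is dropped either way.
req = urllib.request.Request(
    "http://localhost:9200/_analyze",
    data=json.dumps({"analyzer": "standard", "text": "Class is #deprecated"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print([t["token"] for t in json.load(resp)["tokens"]])
    # e.g. ['class', 'is', 'deprecated'] -- no '#' in sight
```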
To fix this, you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good. With WhitespaceAnalyzer you will have to contend with other punctuation as well: a search for "sentence" would not find "match at the end of this sentence.", because of the trailing period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries.
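If you do go that route, here is a minimal sketch of a mapping that applies the whitespace analyzer to the message field (the index name and URL are assumptions for illustration):

```python
import json
import urllib.request

# Create an index whose "message" field is tokenized on whitespace only,
# so "#deprecated" survives analysis as a single token.
mapping = {
    "mappings": {
        "properties": {
            "message": {"type": "text", "analyzer": "whitespace"}
        }
    }
}
req = urllib.request.Request(
    "http://localhost:9200/logs-demo",  # placeholder index name
    data=json.dumps(mapping).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)
```

Keep the caveat above in mind: with whitespace tokenization, "sentence" and "sentence." are now different tokens.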

Related

How do I escape the word "And" in Elasticsearch if I want to search by the literal "And"?

I'm trying to search over an index that includes constellation code names, and the code name for the Andromeda constellation is And.
Unfortunately, if I search using And, all results are returned. This is the only one that doesn't work, across dozens of constellation code names, and I assume it's because it's interpreted as the logical operator AND.
(constellation:(And)) returns my entire result set, regardless of the value of constellation.
Is there a way to fix this without doing tricks like indexing with an underscore in front?
Thanks!
I went for a bit of a hack: indexing the constellation as __Foo__ and then changing my search query accordingly, adding the __ prefix and suffix to the selected constellation.
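A minimal sketch of that workaround (the wrapper helper and field name are just the hack described above, not an established API):

```python
def encode_constellation(code: str) -> str:
    # Wrap the code so reserved words like "And" can't be read as the
    # logical operator AND by the query parser.
    return f"__{code}__"

# Index the wrapped value, e.g. "__And__", then build queries the same way:
query = f"constellation:({encode_constellation('And')})"
print(query)  # constellation:(__And__)
```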

How to search for # in Azure Search

Hi, I have a string field that has an nGram analyzer.
And our query goes like this.
$count=true&queryType=full&searchFields=name&searchMode=any&$skip=0&$top=50&search=/(.*)Site#12(.*)/
The text we are searching for contains Site#123.
The above query works with all other alphanumeric characters except #. Any idea how I could make this work?
If you are using the standard tokenizer, the ‘#’ character is removed from indexed documents because it’s considered a separator. For indexing, you can either use a different tokenizer, such as the whitespace tokenizer, or replace the ‘#’ character with another character, such as ‘_’, using the mapping character filter (underscore ‘_’ is not considered a separator). You can test the analyzer behavior using the Analyze API: https://learn.microsoft.com/rest/api/searchservice/test-analyzer.
It’s important to know that the query terms of regex queries are not analyzed. This means that the ‘#’ character won’t be removed by the analyzer from the regex expression. You can learn more about query processing in Azure Search here: How full text search works in Azure Search
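For illustration, a hedged sketch of calling the Analyze API from Python (the service name, index name, admin key, and api-version are placeholders):

```python
import json
import urllib.request

# Check how the standard Lucene analyzer tokenizes "Site#123".
url = ("https://myservice.search.windows.net/indexes/myindex/analyze"
       "?api-version=2020-06-30")  # placeholders
body = {"text": "Site#123", "analyzer": "standard.lucene"}
req = urllib.request.Request(
    url,
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json", "api-key": "<admin-key>"},
)
with urllib.request.urlopen(req) as resp:
    print([t["token"] for t in json.load(resp)["tokens"]])
    # expected: ['site', '123'] -- the '#' is treated as a separator
```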
Your string is being tokenized on spaces and on punctuation like #. If you want to search for # and other punctuation characters, you could consider tokenizing only on whitespace, or perhaps not applying any tokenization at all and treating the whole string as a single token.

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that, so "PDF" and "A" become separate tokens. The "A" token might then be cut out by the stop token filter (see Standard Analyzer). So without any custom analyzers, you will typically match any document containing just "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea would be that "PDF/A" gets transformed into something like "pdf_a" at both index and query time, after which a simple match query will work just fine. But this is a very simplistic approach: you should consider how '/' characters are used in your content, and slightly more complex regex-based filters are also not perfect solutions.
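A hedged sketch of that idea against the Elasticsearch REST API (the index name, field name, analyzer names, and URL are illustrative assumptions):

```python
import json
import urllib.request

# A custom analyzer modeled on "standard", plus a mapping char filter
# that rewrites "/" to "_", so "PDF/A" is indexed and queried as "pdf_a".
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "slash_to_underscore": {
                    "type": "mapping",
                    "mappings": ["/ => _"],
                }
            },
            "analyzer": {
                "slash_aware": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["slash_to_underscore"],
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"body": {"type": "text", "analyzer": "slash_aware"}}
    },
}
req = urllib.request.Request(
    "http://localhost:9200/docs-demo",  # placeholder index
    data=json.dumps(settings).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)
```

Because the same analyzer runs at query time, a plain match query for pdf/a then compares pdf_a against pdf_a.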
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters, I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).
Since it does not use a query parser, it is a bit limited (e.g. no field queries like id:5), but it serves the purpose.
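For illustration, a minimal simple_query_string search (the index and field names are carried over from the sketch above, so they are still assumptions):

```python
import json
import urllib.request

# simple_query_string tolerates reserved characters instead of raising
# parse errors, so raw user input like "pdf/a" needs no escaping.
query = {
    "query": {
        "simple_query_string": {
            "query": "pdf/a",
            "fields": ["body"],
            "default_operator": "and",
        }
    }
}
req = urllib.request.Request(
    "http://localhost:9200/docs-demo/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["hits"]["total"])
```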

Multi-Line Regex: Find A where B is absent

I have been reading a lot about regex lately and have seen many answers involving matching one word where a second word is absent. I have seen a lot of regex examples where I can have a regex search for a given word (or any more complex regex in its place) and find where another word is missing.
This seems to work very well on a line-by-line basis, but after enabling multi-line mode it still doesn't seem to match properly.
Example: match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$, which is based on the example link. I have been testing with a Ruby regex tester, but I think this is an open-ended regex problem/question. It seems to match smaller pieces; I would like it to either match or not match the entire string as one big chunk.
In the example above, matches seem to be found on a line-by-line basis. What changes need to be made to the regex so it applies over the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve a regex. I am not looking for a solution using other means; I am asking from a theoretical regex point of view. Regex has a multi-line mode (which looks to "work"), and it has negative/positive lookaheads which can be combined on a line-by-line basis, so how come combining these two principles doesn't yield the expected result?
Sawa's answer can be simplified: all that's needed is a positive lookahead and a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
@Sawa makes a good point that the \A is necessary, but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
A regex that matches an entire string that does not include bar is:
/\A(?!.*bar.*).*\z/m
and a regex that matches from the beginning of an entire string that includes foo is:
/\A.*foo/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*foo)(?!.*bar.*).*\z/m
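As a quick sanity check, here is the same conjunction translated to Python, where re.DOTALL plays the role of Ruby's /m:

```python
import re

# \A and \Z anchor the whole string; DOTALL lets "." cross newlines,
# mirroring Ruby's /m flag used above.
pattern = re.compile(r"\A(?=.*foo)(?!.*bar).*\Z", re.DOTALL)

print(bool(pattern.match("line with foo\nanother clean line")))  # True
print(bool(pattern.match("line with foo\nline with bar")))       # False
```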

Ignoring apostrophes in sphinx indexes

In my sphinx config file, I have the following:
ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"
(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)
The expected result is that querying kyles will return all records matching kyles and/or kyle's, since I'm telling sphinx to exclude ' (single quote/apos) from the index (ab'cd -> abcd). However, in practice, this is not happening.
I believe adding it to the ignore_chars has the opposite of the desired effect. This is telling sphinx not to split on that character, but instead it will collapse the word around the characters to be ignored. So, kyle's will become kyles instead of kyle and s.
The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (might need 's in there as well, I can't remember). Sphinx seems to split kyle's into the words kyle and 's. Because match-all mode is on, some documents fail on the match for 's. Adding it to the stop words seems to have the desired effect.
It seems like the normal stemming should take care of this however, so maybe we're both doing something wrong...
