Pattern failure with grok due to a longer integer in a column - elasticsearch

I have used the grok debugger to get the top format working, and it is being parsed fine by Elasticsearch. But when a log line like the one below hits, it comes out tagged with "grokparsefailure", presumably due to the extra space before each integer. Is there a pattern I can use that accepts each column no matter how long or short it is?
0000003B 2015-03-14 07:46:14.618 16117 16121
00000DA1 2015-03-14 07:45:54.609 6382 6382

It's also possible to use the built-in Logstash pattern %{SPACE}, which matches any number of whitespace characters.
%{INT:num1}%{SPACE}%{INT:num2}
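For the sample lines above, the whole message could be matched along these lines (a sketch only; the field names are illustrative, and BASE16NUM and TIMESTAMP_ISO8601 are standard grok patterns that should cover the hex ID and timestamp columns):
%{BASE16NUM:id}%{SPACE}%{TIMESTAMP_ISO8601:timestamp}%{SPACE}%{INT:num1}%{SPACE}%{INT:num2}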

One or more spaces between two integers:
%{INT} +%{INT}
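Either variant drops into a grok filter in the usual way; a minimal sketch, assuming the line arrives in the default message field:
filter {
  grok {
    match => { "message" => "%{INT:num1} +%{INT:num2}" }
  }
}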

I ended up writing a custom pattern, since I knew my values were 4-5 characters long, and then used patterns_dir => "./patterns" in my conf file.
_ID [0-9A-F]{4,5}
_ID2 [0-9A-F]{4,5}
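For reference, a sketch of how a pattern file like that gets wired in (the file layout and field names here are assumptions):
# ./patterns/custom contains the _ID definition above
filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{_ID:id}%{SPACE}%{TIMESTAMP_ISO8601:timestamp}%{SPACE}%{INT:num1}%{SPACE}%{INT:num2}" }
  }
}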
UPDATE: My solution did not work after all, because the number can be anywhere from 3 to 6 characters. The easier solution was provided above. Marked as answer.

Related

How do I escape the word "And" in Elasticsearch if I want to search by the literal "And"?

I'm trying to search over an index that includes constellation code names, and the code name for the Andromeda constellation is And.
Unfortunately, if I search using And, all results are returned. This is the only one that doesn't work, across dozens of constellation code names, and I assume it's because it's interpreted as the logical operator AND.
(constellation:(And)) returns my entire result set, regardless of the value of constellation.
Is there a way to fix this without doing tricks like indexing with an underscore in front?
Thanks!
I went for a bit of a hack, indexing the constellation as __Foo__ and then changing my search query accordingly by adding the __ prefix and suffix to the selected constellation.
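In other words, the query ends up looking something like this (a sketch of the workaround; __And__ stands for the wrapped code name, applied at both index and query time):
(constellation:(__And__))
Since __And__ is no longer the bare word And, the query parser has no reason to read it as the AND operator.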

Getting an exact match to the string `#deprecated` in Kibana/ELK

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it matches strings containing the word "deprecated" without the # sign.
I tried escaping the # according to the Lucene documentation, i.e. message:"\\#deprecated", with no change in the results.
How can I query for an exact match on the text #deprecated only, and why is this happening?
Your problem isn't an issue with query syntax, which is what escaping is for; it's with analysis. Your analyzer removes punctuation because it's parsing your message as full text. It removes # in much the same way that it removes periods and commas.
So, after analysis (assuming standard analysis) of something like: "Class is #deprecated" the token stream generated will have the following tokens: "class", "deprecated" ("is" is a stop word). The indexed form of "#deprecated" and "deprecated" are identical, so it is impossible to have a query that can differentiate between them as it is currently indexed.
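You can see this for yourself with the _analyze API (a sketch for recent Elasticsearch versions; the exact token list depends on your analyzer and stopword configuration):
GET _analyze
{
  "analyzer": "standard",
  "text": "Class is #deprecated"
}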
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good: with WhitespaceAnalyzer you will have to contend with other punctuation as well, and a search for "sentence" would not find "match at the end of this sentence.", because of the trailing period. So if you are searching full text, this will certainly cause far more problems than it solves.
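A minimal mapping sketch along those lines, for recent Elasticsearch versions (the index and field names are placeholders):
PUT my-index
{
  "mappings": {
    "properties": {
      "message": { "type": "text", "analyzer": "whitespace" }
    }
  }
}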
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries.

Split Logstash/grok pattern that has international characters

Running into this issue.
I need to split up URLs to get values from them. This works great when it's all English.
URL = /78965asdvc34/Test/testBasins
Pattern = /%{WORD:org}/(?i)test/%{WORD:name}
I get this in the grok debugger.
{"org":[["78965asdvc34"]],"name":[["testBasins"]]}
If I have international characters, grok does not read them with the pattern above.
/78965asdvc34/Test/浸水Basins
Any thoughts on how to get this to work? This value can be in any language in the logs, and hopefully there is a way to extract it.
Have you already tried
/%{WORD:org}/(?i)test/%{GREEDYDATA:name}
From hurb.
Thanks Hurb. GREEDYDATA worked.
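For reference, the working pattern dropped into a filter (a sketch; the source field name is an assumption):
filter {
  grok {
    match => { "message" => "/%{WORD:org}/(?i)test/%{GREEDYDATA:name}" }
  }
}
GREEDYDATA is defined as .*, which is why it happily consumes the non-ASCII characters that %{WORD} (\b\w+\b) stops at.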

Regex for Git commit message

I'm trying to come up with a regex for enforcing Git commit messages to match a certain format. I've been banging my head against the keyboard modifying the semi-working version I have, but I just can't get it to work exactly as I want. Here's what I have now:
/^([a-z]{2,4}-[\d]{2,5}[, \n]{1,2})+\n{1}^[\w\n\s\*\-\.\:\'\,]+/i
Here's the text I'm trying to enforce:
AB-1432, ABC-435, ABCD-42
Here is the multiline description, following a blank
line after the Jira issue IDs
- Maybe bullet points, with either dashes
* Or asterisks
Currently, it matches that, but it will also match if there's no blank line after the issue IDs, or if there are multiple blank lines after them.
Is there any way to enforce that, or will I just have to live with it?
It's also pretty ugly; I'm sure there's a more succinct way to write it.
Thanks.
Your regex allows for \n as one of the possible characters after the required newline, so that's why it matches when there are multiple.
Here's a cleaned up regex:
/^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+/i
Notes:
This requires at least one [-\w\s*.:',] character before the next newline.
I changed the issue IDs to have one possible comma, space, and newline, in that order (up to one of each). Can you use lookaheads? If so, I added (?=[, \n]) to make sure the issue ID is followed by at least one of those characters.
Also notice that many of the characters don't need to be escaped in a character class.
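If it helps, here is a quick harness to sanity-check the cleaned-up regex (Python purely as a test bed; re.IGNORECASE stands in for the /i flag, and re.MULTILINE gives ^ its per-line behavior):

import re

# Cleaned-up pattern from above.
pattern = re.compile(
    r"^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+",
    re.IGNORECASE | re.MULTILINE,
)

good = ("AB-1432, ABC-435, ABCD-42\n"
        "\n"
        "Here is the multiline description, following a blank\n"
        "line after the Jira issue IDs\n"
        "- Maybe bullet points, with either dashes\n"
        "* Or asterisks\n")

bad = "AB-1432, ABC-435, ABCD-42\nNo blank line before this description\n"

print(bool(pattern.match(good)))  # True
print(bool(pattern.match(bad)))   # False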

Ignoring apostrophes in sphinx indexes

In my sphinx config file, I have the following:
ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"
(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)
The expected result is that querying kyles will return all records matching kyles and/or kyle's, since I'm telling Sphinx to exclude ' (single quote/apostrophe) from the index (ab'cd -> abcd). However, in practice, this is not happening.
I believe adding it to ignore_chars has the opposite of the desired effect. This tells Sphinx not to split on that character; instead it collapses the word around the ignored characters. So kyle's becomes kyles instead of kyle and s.
The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (you might need 's in there as well; I can't remember). Sphinx seems to split kyle's into the words kyle and 's. Because match-all mode is on, some documents then fail on the match for 's; adding it to the stopwords has the desired effect.
It seems like normal stemming should take care of this, though, so maybe we're both doing something wrong...
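For what it's worth, a sphinx.conf sketch of the stopwords approach (the index name and file path are assumptions; the stopwords file would contain s, and possibly 's, one entry per line):
index my_index
{
    # ... source, path, and the charset_table/ignore_chars settings above ...
    ignore_chars = U+0027
    stopwords    = /etc/sphinx/stopwords.txt
}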
