In Logstash how to extract substring in a bigger string? - elasticsearch

I am having difficulty writing grok patterns. Please help.
The service method GetIndicatorsByAnalysisProcessIDServlet is called, and I want to extract only GetIndicatorsByAnalysisProcess from it. The text GetIndicatorsByAnalysisProcess will not always be the same.
The challenge is truncating the string from the end.
I tried the following:
grok {
match => ["destinationid", "(?<fieldname>discard.{7})"]
}
but it only considers a number of characters counted from the start.

If I understand you correctly, you need to have the first word in a variable.
This is achievable via
(?<fieldname>[^\s]*)\s*
with sample output from it
{
"fieldname": [
[
"GetIndicatorsByAnalysisProcessIDServlet"
]
]
}
If your strings have varying beginnings (possibly with optional spaces) but always the exact same ending, the regexp will need to be different.
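A minimal Ruby sketch of the "same ending, different beginning" case. It assumes the fixed ending is the literal suffix IDServlet, which is inferred from the single sample in the question; the helper name strip_suffix is hypothetical. The same named-capture regex could be tried in a grok match.

```ruby
# Hypothetical helper: capture everything before a known, fixed suffix.
# The default suffix "IDServlet" is an assumption based on the one
# sample given in the question.
def strip_suffix(text, suffix = "IDServlet")
  # \S+? is lazy, so the capture ends where the suffix followed by
  # end-of-string can match.
  m = text.match(/\A(?<fieldname>\S+?)#{Regexp.escape(suffix)}\z/)
  m && m[:fieldname]
end

puts strip_suffix("GetIndicatorsByAnalysisProcessIDServlet")
# prints GetIndicatorsByAnalysisProcess
```

If the string does not end with the suffix, the helper returns nil rather than a partial match.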

Related

Substring with grokPattern

With this example data
GigabitEthernet102/0/0/28 = TLU-46356_CAR_ONE_RIVERO_AUTO_CENTER_PRINCIPAL Traffic (SNMP Traffic) Down (The interface is disconnected: ifOperStatus=down (2) (code: PE058))
I'm trying to extract a substring whenever the "TLU" pattern is found and create a new field from it, like "TLU-46356".
I'm using a grok pattern like this:
if ([logMessage] =~ /TLU-/){ grok { match => { "logMessage" => 'TLU=(?<TLU>[0-9a-fx]{8})' } } }
But it doesn't work and the result is "grokparsefailure".
Any ideas, please?
I just tested your example in Grok Debugger in Kibana and this works for me
(?<TLU>[0-9a-fx]{5})
The literal TLU= at the beginning cannot match because the sample contains TLU- (a hyphen, not an equals sign). Grok picks up named capture groups anyway, so even without that prefix you'll get a field called TLU. Note that this capture holds only the digits (46356), not the TLU- prefix.
The other part of the problem is that you're trying to match 8 characters from 0 to 9, a to f, or x, but in your example the number after TLU- is only 5 characters long. So either the logs have a variable number of characters, which you didn't mention, in which case you should use something like {n,m} to define the length of the string you want to capture, or you can just use * and terminate it with _ like this:
(?<TLU>[0-9a-fx]*)_
If there's anything I'm missing, please update the question with additional data and let me know. Good luck!
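To get the whole "TLU-46356" value the question asks for, the prefix can go inside the named group. A minimal Ruby sketch, assuming the ID is always TLU- followed by one or more digits (the question only shows one sample, so this is a guess):

```ruby
# Sample line from the question, shortened after "Down".
line = "GigabitEthernet102/0/0/28 = TLU-46356_CAR_ONE_RIVERO_AUTO_CENTER_PRINCIPAL Traffic (SNMP Traffic) Down"

# Assumption: the ID is the literal "TLU-" followed by digits only.
m = line.match(/(?<TLU>TLU-\d+)/)
tlu = m && m[:TLU]

puts tlu # prints TLU-46356
```

The same expression, (?<TLU>TLU-\d+), should be usable as the grok pattern, since grok accepts regular named capture groups.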

Is there a way to use grok to break up a message using the character numbers?

So for instance the log I need to break apart is something like this
"01234567895467894ACCP 844"
Where
0123456789 is phone number,
5467894 mandate number,
ACCP is the type of mandate, which could be up to 6 characters long, so here it is followed by 2 spaces, and 844 is some other number. What I need to do is separate the line based on character positions, which will always be constant.
So something like %{CHAR 0-10:Phonenumber}%{CHAR 11-18:Mandate}%{CHAR 19-24:Type}. Is there some way to do this using grok? I tried looking but did not find anything like it.
The following regular expression based grok expression allows you to capture what you expect:
(?<Phonenumber>\d{10})(?<Mandate>\d{7})(?<Type>[A-Z\s]{4,})(?<Other>\d{3,})
You'd get this:
{
"Phonenumber": "0123456789",
"Mandate": "5467894",
"Type": "ACCP ",
"Other": "844"
}
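If the columns really are fixed-width, the pattern can also count characters instead of character classes. A minimal Ruby sketch, assuming widths of 10, 7 and 6 characters as described in the question; the helper name split_fixed_width is hypothetical, and the sample line below is written with the two padding spaces the question mentions:

```ruby
# Hypothetical helper: split a line purely by character position.
# Widths (10, 7, 6, rest) are assumptions taken from the question.
def split_fixed_width(line)
  m = line.match(/\A(?<Phonenumber>.{10})(?<Mandate>.{7})(?<Type>.{6})(?<Other>.*)\z/)
  return nil unless m
  # Remove the padding spaces from each captured column.
  m.named_captures.transform_values(&:strip)
end

record = split_fixed_width("01234567895467894ACCP  844")
puts record.inspect
```

The same .{10}, .{7}, .{6} groups can be used in a grok expression; the trimming would then need a separate step such as a mutate filter with strip.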

Using Logstash Ruby filter to parse csv file

I have an elasticsearch index which I am using to index a set of documents.
These documents are originally in csv format and I am looking to parse them using logstash.
My problem is that I have something along the following lines.
field1,field2,field3,xyz,abc
field3 is something like 123456789 and I want to parse it as 4.56(789) using a ruby code filter.
My try:
I tried with stdin and stdout with the following logstash.conf .
input {
stdin {
}
}
filter {
ruby {
code => "
b = event["message"]
string2=""
for counter in (3..(num.size-1))
if counter == 4
string2+= '_'+ num[counter]
elsif counter == 6
string2+= '('+num[counter]
elsif counter == 8
string2+= num[counter] +')'
else
string2+= num[counter]
end
end
event["randomcheck"] = string2
"
}
}
output {
stdout {
codec=>rubydebug
}
}
I am getting a syntax error with this.
My final aim is to use this with my csv file , but first I was trying this with stdin and stdout.
Any help will be highly appreciated.
The reason you're getting a syntax error is most likely because you have unescaped double quotes inside the double quoted string. Either make the string single quoted or keep it double quoted but use single quotes inside. I also don't understand how that code is supposed to work.
But that aside, why use a ruby filter in the first place? You can use a csv filter for the CSV parsing and a couple of standard filters to transform 123456789 to 4.56(789).
filter {
  # Parse the CSV fields and then delete the 'message' field.
  csv {
    remove_field => ["message"]
  }
  # Given an input such as 123456789, extract 4, 56, and 789 into
  # their own fields.
  grok {
    match => [
      "column3",
      "\d{3}(?<intpart>\d)(?<fractionpart>\d{2})(?<parenpart>\d{3})"
    ]
  }
  # Put the extracted fields together into a single field again,
  # then delete the temporary fields.
  mutate {
    replace => ["column3", "%{intpart}.%{fractionpart}(%{parenpart})"]
    remove_field => ["intpart", "fractionpart", "parenpart"]
  }
}
The temporary fields have really bad names in the example above since I don't know what they represent. Also, depending on what the input can look like you may have to adjust the grok expression. As it stands now it assumes nine-digit input.
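The grok and mutate steps amount to a regex capture followed by string formatting. A minimal Ruby sketch of that transformation outside Logstash, assuming nine-digit input as the answer does. The helper name format_field is hypothetical, and the regex is anchored here, which the grok expression is not:

```ruby
# Hypothetical helper: turn "123456789" into "4.56(789)".
# Assumes exactly nine digits; the first three are discarded.
def format_field(value)
  m = value.match(/\A\d{3}(?<intpart>\d)(?<fractionpart>\d{2})(?<parenpart>\d{3})\z/)
  return nil unless m
  "#{m[:intpart]}.#{m[:fractionpart]}(#{m[:parenpart]})"
end

puts format_field("123456789") # prints 4.56(789)
```

Returning nil for input that is not nine digits makes bad rows easy to spot, in the same way a failed grok match is tagged rather than silently rewritten.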

Elasticsearch substring matching without ending

For example, if my search word is "Houses", I want to find the result "House". How can I search without the last 1-2 letters of the word?
I tried the "nGram" filter, but it searches for the full word.
I feel you are taking the wrong approach.
Judging by your example, I think what you are looking for is a stemmer.
Elasticsearch has stemmers like snowball which can convert words to their base forms or stems.
For example, the stemmer can convert
[ "jumping" , "jumped" ] -> "jump"
[ "staying" , "stayed" ] -> "stay"
And so on...
Snowball - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html#analysis-snowball-analyzer

ruby extract string between two string

I have a string as below:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I want to extract the somename string from the above string. The values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change, but the syntax will remain the same as above.
I saw similar posts on this site but don't know how to use regex in the above case.
Any ideas on how to extract it?
Parse the string to JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]
Since you don't know regex, let's leave them out for now and try manual parsing which is a bit easier to understand.
Your original input, without the outer apostrophes and name of variable is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!
First, look at what delimits that value: it has quotes, with a colon to the left and a comma to the right. However, looking at other parts, the same layout is also used near the command and near the pwd. So colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere in the input data except this place. This is just great! That means we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition
Now we know:
where the part starts,
that after the "Name" text there's always a colon, a quote, and then the interesting data,
and that there's always a quote after the interesting data.
Let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
The index method returns the start position of the match, so it returns the position of the : and the position of the \. Since we want the text between them, we must add or subtract a few positions to move past the :\" at the beginning or step back from the \" at the end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
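Put together, the index-based steps can be sketched as one runnable snippet. It assumes inputdata is str1 from the question written on a single line; the helper name value_after_name is hypothetical.

```ruby
# Assumption: str1 from the question, on one line. Single quotes keep
# every backslash literal, so \" is two characters and \0 is two too.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# Hypothetical helper wrapping the manual scan.
def value_after_name(data)
  marker = data.index('\"Name\"')
  raise ArgumentError, "well-known marker was not found" unless marker
  colonquote = data.index(':\"', marker)  # position of the colon
  datastart = colonquote + 3              # skip the colon, backslash, quote
  lastquote = data.index('\"', datastart) # position of the closing backslash
  data[datastart...lastquote]             # exclusive range stops before it
end

puts value_after_name(inputdata) # prints somename
```

The exclusive range (three dots) replaces the explicit lastquote-1 and yields the same slice.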
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says a colon or comma. Now, in parts you will get all the, well, parts, that were detected by cutting the inputdata to pieces at every colon or comma.
If the layout never changes, your interesting data will always be in the 13th position:
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there are 2 chars to be cut from both sides:
value = almostvalue[2..-3]
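The split-based variant as a runnable sketch, again assuming inputdata is str1 from the question on a single line. It relies on the value itself never containing a colon or comma, which the question does not guarantee.

```ruby
# Assumption: str1 from the question, on one line, backslashes literal.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# Cut the input at every colon or comma.
parts = inputdata.split(/[:,]/)

# Index 12 is the 13th piece: the name, still wrapped in \" on each side.
almostvalue = parts[12]

# Drop the two leading and two trailing characters (backslash and quote).
value = almostvalue[2..-3]

puts value # prints somename
```

A different number of colons in the Id field would shift the index, so this only holds while the layout stays exactly as shown.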
OK, another way. Since regex already showed up, let's try it. We know:
the data is prefixed with \"Name\", then a colon and a backslash-quote
the data consists of some text without quotes inside (well, at least I guess so)
the data ends with a backslash-quote
The parts in regex syntax would be, respectively:
\\"Name\\":\\"
[^\"]*
\\"
Together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), hence after a successful match that part is available in the $1 special variable.
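The same match can be written with a named group instead of $1. A minimal sketch, assuming inputdata is str1 from the question on a single line; the group name name is arbitrary, and the character class here also excludes backslashes so the capture cannot run into the closing \".

```ruby
# Assumption: str1 from the question, on one line, backslashes literal.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# In a regex literal, \\ matches one literal backslash.
m = inputdata.match(/\\"Name\\":\\"(?<name>[^"\\]*)\\"/)
value = m && m[:name]

puts value # prints somename
```

When the marker is absent, match returns nil and value stays nil instead of raising.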
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
\"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0\"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? The colon was replaced with =>, and the backslashes before quotes are gone. Oh, and the opening " is gone, and the \0" at the end is gone too. Let's play:
tmp = inputdata[1..-4] # remove the opening " and the closing \0"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about the colons? We cannot just replace : with =>, because that would damage the internal colons of the xx:xx:xx:xx:xx:xx part. But look: all the other colons always have a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it:
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']
Done.
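The whole hash-conversion idea as a runnable sketch. It assumes inputdata is str1 from the question on a single line, starting with one " and ending with \0". Keep in mind that eval executes whatever Ruby code the string contains, so this is only reasonable for input you fully trust.

```ruby
# Assumption: str1 from the question, on one line, backslashes literal.
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

tmp = inputdata[1..-4]         # drop the opening " and the closing \0"
tmp = tmp.gsub('\"', '"')      # replace every \" with just "
tmp = tmp.gsub('":', '"=>')    # quote-colon becomes quote-arrow

# tmp now reads like a Ruby hash literal; eval turns it into a Hash.
# Only do this with trusted input.
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']

puts value # prints somename
```

The colons inside the Id value survive because none of them is directly preceded by a quote.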
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data looked so similar to a Ruby hash because it was obviously formatted as JSON, which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that the JSON library, when asked to read the data, will read all of it, and only then can you ask "what was inside at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes, when your program is very short on time, you cannot afford to read it all. Then scanning with regex or scanning manually for "known markers" may (not must) be much faster and thus preferable, though still much less convenient. Have fun.

Resources