How to write an expression for a special KV string in the logstash kv filter? - logstash-configuration

I have plenty of log lines like this:
uid[118930] pageview h5_act, actTag[cyts] corpId[2] inviteType[0] clientId[3] clientVer[2.3.0] uniqueId[d317de16a78a0089b0d94d684e7a9585565ffa236138c0.85354991] srcId[0] subSrc[]
Most of these are key-value expressions in KEY[VALUE] form.
I have read the documentation but still cannot figure out how to write the configuration.
Any help would be appreciated!

You can simply configure your kv filter using the value_split and trim settings, like below:
filter {
  kv {
    value_split => "\["
    trim => "\]"
  }
}
For the sample log line you've given, you'll get:
{
    "message" => "uid[118930] pageview h5_act, actTag[cyts] corpId[2] inviteType[0] clientId[3] clientVer[2.3.0] uniqueId[d317de16a78a0089b0d94d684e7a9585565ffa236138c0.85354991] srcId[0] subSrc[]",
    "@version" => "1",
    "@timestamp" => "2015-12-12T05:04:00.888Z",
    "host" => "iMac.local",
    "uid" => "118930",
    "actTag" => "cyts",
    "corpId" => "2",
    "inviteType" => "0",
    "clientId" => "3",
    "clientVer" => "2.3.0",
    "uniqueId" => "d317de16a78a0089b0d94d684e7a9585565ffa236138c0.85354991",
    "srcId" => "0",
    "subSrc" => ""
}
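Note: more recent Logstash releases renamed the kv filter's trimming options, so if trim is rejected as an unknown setting, the equivalent configuration should look something like this (trim_value instead of trim):
filter {
  kv {
    value_split => "\["
    trim_value => "\]"
  }
}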

Related

Extracting nested object with Logstash and Ruby

I have this XML structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180000001A1-20180104.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20171219" date-publ="20180104">
  <us-bibliographic-data-application lang="EN" country="US">
    <us-parties>
      <inventors>
        <inventor sequence="00" designation="us-only">
          <addressbook>
            <last-name>Evans</last-name>
            <first-name>Mike</first-name>
            <address>
              <city>Emerald Park</city>
              <country>CA</country>
            </address>
          </addressbook>
        </inventor>
        <inventor sequence="01" designation="us-only">
          <addressbook>
            <last-name>Lucas</last-name>
            <first-name>Lisa</first-name>
            <address>
              <city>Regina</city>
              <country>CA</country>
            </address>
          </addressbook>
        </inventor>
        <inventor sequence="02" designation="us-only">
          <addressbook>
            <last-name>Smith</last-name>
            <first-name>John R.</first-name>
            <address>
              <city>Regina</city>
              <country>CA</country>
            </address>
          </addressbook>
        </inventor>
      </inventors>
    </us-parties>
  </us-bibliographic-data-application>
</us-patent-application>
I would like Logstash to output this structure:
{
  "us-patent-application": {
    "us-bibliographic-data-application": {
      "us-parties": {
        "inventors": [
          "Mike Evans",
          "Lisa Lucas",
          "John R. Smith"
        ]
      }
    }
  }
}
I have attempted to combine the names into one array like this in Logstash, but I can't find a working solution.
As of now, I am focused on using a Ruby script in the Logstash ruby filter plugin. I am using this approach because I was not able to find a working solution using XPath in the Logstash xml filter.
Here is the Logstash 'main.conf' configuration file:
input {
  file {
    path => [
      "/opt/uspto/*.xml"
    ]
    start_position => "beginning"
    #use for testing
    sincedb_path => "/dev/null"
    # set this sincedb path when not testing
    #sincedb_path => "/opt/logstash/tmp/sincedb"
    exclude => "*.gz"
    type => "xml"
    codec => multiline {
      #pattern => "<wo-ocr-published-application"
      pattern => "<?xml version=\"1.0\" encoding=\"UTF-8\"\?>"
      negate => "true"
      what => "previous"
      max_lines => 300000
    }
  }
}
filter {
  if "multiline" in [tags] {
    xml {
      source => "message"
      #store_xml => false # this limits the data indexed to only xpath and grok created fields
      store_xml => true   # saves ALL xml nodes if it can - can be VERY large
      target => "xmldata" # only used with store_xml => true
    }
    ruby {
      path => "/etc/logstash/rubyscripts/inventors.rb"
    }
  }
}
output {
  file {
    path => [ "/tmp/logstash_output_text_file" ]
    codec => rubydebug
  }
}
And here is the inventors.rb script:
# the value of `params` is the value of the hash passed to `script_params`
# in the logstash configuration
def register(params)
  #drop_percentage = params["percentage"]
end

def filter(event)
  # get the number of inventors to loop over
  # convert the array key string number 0 to an integer
  n = event.get('[num_inventors][0]').to_i
  # set a loop number to start with
  i = 0
  # create empty arrays to fill
  firstname = []
  lastname = []
  # loop over inventors until n is reached
  while (i < n) do
    # get the inventor's first name
    fname = event.get('[event][us-patent-application][us-bibliographic-data-application][us-parties][inventors][inventor][addressbook][last-name]')
    #puts "first name #{fname}"
    # push the first name into the firstname array
    firstname.push(fname)
    # get the inventor's last name
    lname = event.get('[event][us-patent-application][us-bibliographic-data-application][us-parties][inventors][inventor][addressbook][last-name]')
    #puts "last name #{lname}"
    # push the last name into the lastname array
    lastname.push(lname)
    # increment i up 1
    i += 1
  end
  # merge firstname and lastname arrays
  names = firstname.zip(lastname)
  # push the names array to the event
  event.set('allnames', names)
  return [event]
end
Finally, here is the Elasticsearch output:
{
    "host" => "localhost.localdomain",
    "allnames" => [],
    "type" => "xml",
    "@timestamp" => 2018-09-20T17:28:05.332Z,
"message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<!DOCTYPE us-patent-application SYSTEM \"us-patent-application-v44-2014-04-03.dtd\" [ ]>\r\n<us-patent-application lang=\"EN\" dtd-version=\"v4.4 2014-04-03\" file=\"US20180000001A1-20180104.XML\" status=\"PRODUCTION\" id=\"us-patent-application\" country=\"US\" date-produced=\"20171219\" date-publ=\"20180104\">\r\n\t<us-bibliographic-data-application lang=\"EN\" country=\"US\">\r\n\t\t<us-parties>\r\n\t\t\t<inventors>\r\n\t\t\t\t<inventor sequence=\"00\" designation=\"us-only\">\r\n\t\t\t\t\t<addressbook>\r\n\t\t\t\t\t\t<last-name>Evans</last-name>\r\n\t\t\t\t\t\t<first-name>Mike</first-name>\r\n\t\t\t\t\t\t<address>\r\n\t\t\t\t\t\t\t<city>Emerald Park</city>\r\n\t\t\t\t\t\t\t<country>CA</country>\r\n\t\t\t\t\t\t</address>\r\n\t\t\t\t\t</addressbook>\r\n\t\t\t\t</inventor>\r\n\t\t\t\t<inventor sequence=\"01\" designation=\"us-only\">\r\n\t\t\t\t\t<addressbook>\r\n\t\t\t\t\t\t<last-name>Lucas</last-name>\r\n\t\t\t\t\t\t<first-name>Lisa</first-name>\r\n\t\t\t\t\t\t<address>\r\n\t\t\t\t\t\t\t<city>Regina</city>\r\n\t\t\t\t\t\t\t<country>CA</country>\r\n\t\t\t\t\t\t</address>\r\n\t\t\t\t\t</addressbook>\r\n\t\t\t\t</inventor>\r\n\t\t\t\t<inventor sequence=\"02\" designation=\"us-only\">\r\n\t\t\t\t\t<addressbook>\r\n\t\t\t\t\t\t<last-name>Smith</last-name>\r\n\t\t\t\t\t\t<first-name>Scott R.</first-name>\r\n\t\t\t\t\t\t<address>\r\n\t\t\t\t\t\t\t<city>Regina</city>\r\n\t\t\t\t\t\t\t<country>CA</country>\r\n\t\t\t\t\t\t</address>\r\n\t\t\t\t\t</addressbook>\r\n\t\t\t\t</inventor>\r\n\t\t\t</inventors>\r\n\t\t</us-parties>\r\n\t</us-bibliographic-data-application>\r",
"#version" => "1",
"tags" => [
[0] "multiline",
[1] "_xmlparsefailure"
],
"path" => "/opt/uspto/test.xml"
}
The behavior I am looking for is to have the "allnames" => [] array look like:
"allnames" => ["Mike Evans",
"Lisa Lucas",
"John R. Smith"],
I cannot figure out how to properly grab the 'first-name' and 'last-name' nodes in my ruby script. I'm pulling my hair out trying to get a working solution. Any ideas are welcome!
Please see if this solves your requirement. I was really hoping this could be solved with xml + xpath alone, but I guess not all XPath functions are supported. :(
input {
  file {
    path => [
      "/opt/uspto/*.xml"
    ]
    start_position => "beginning"
    #use for testing
    sincedb_path => "/dev/null"
    # set this sincedb path when not testing
    #sincedb_path => "/opt/logstash/tmp/sincedb"
    exclude => "*.gz"
    type => "xml"
    codec => multiline {
      #pattern => "<wo-ocr-published-application"
      pattern => "<?xml version=\"1.0\" encoding=\"UTF-8\"\?>"
      negate => "true"
      what => "previous"
      max_lines => 300000
    }
  }
}
filter {
  if "multiline" in [tags] {
    xml {
      source => "message"
      store_xml => false
      target => "xmldata" # only used with store_xml => true
      force_array => false
      xpath => [ "//us-bibliographic-data-application/us-parties/inventors/inventor/addressbook/first-name/text()", "[xmldata][us-bibliographic-data-application][us-parties][inventors][first_name]",
                 "//us-bibliographic-data-application/us-parties/inventors/inventor/addressbook/last-name/text()", "[xmldata][us-bibliographic-data-application][us-parties][inventors][last_name]" ]
    }
    ruby {
      code => '
        first_name = event.get("[xmldata][us-bibliographic-data-application][us-parties][inventors][first_name]")
        last_name = event.get("[xmldata][us-bibliographic-data-application][us-parties][inventors][last_name]")
        event.set("[xmldata][us-bibliographic-data-application][us-parties][inventors][names]", first_name.zip(last_name).map{ |a| a.join(" ") })
      '
    }
    mutate {
      remove_field => ["message", "host", "path", "[xmldata][us-bibliographic-data-application][us-parties][inventors][first_name]", "[xmldata][us-bibliographic-data-application][us-parties][inventors][last_name]" ]
    }
  }
}
output {
  file {
    path => [ "/tmp/logstash_output_text_file" ]
    codec => rubydebug
  }
}
Output is
{
  "@timestamp": "2018-09-21T12:00:26.428Z",
  "@version": "1",
  "tags": [
    "multiline"
  ],
  "type": "xml",
  "xmldata": {
    "us-bibliographic-data-application": {
      "us-parties": {
        "inventors": {
          "names": [
            "Mike Evans",
            "Lisa Lucas",
            "John R. Smith"
          ]
        }
      }
    }
  }
}
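As an aside, the zip/join step in the ruby filter above works like this in plain Ruby (the array values shown are just the ones from the sample document):
# first_name and last_name are the two arrays produced by the xpath expressions
first_name = ["Mike", "Lisa", "John R."]
last_name = ["Evans", "Lucas", "Smith"]
# zip pairs the arrays element by element, then join turns each pair into "First Last"
first_name.zip(last_name).map { |a| a.join(" ") }
# => ["Mike Evans", "Lisa Lucas", "John R. Smith"]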

Parsing out text from a string using a logstash filter

I have an Apache access log from which I would like to parse out some text within the REQUEST field:
GET /foo/bar?contentId=ABC&_=1212121212 HTTP/1.1"
What I would like to do is extract the 1212121212 and assign it to a value, but the value is based on the prefix ABC&_ (so I think I need an if statement or something). The prefix could take on other forms (e.g., DDD&_).
So basically I would like to say
if (prefix == ABC&_)
    ABCID = 1212121212
elseif (prefix == DDD&_)
    DDDID = <whatever value>
else
    do nothing
I have been struggling to build the right filter in logstash to extract the id based on the prefix. Any help would be great.
Thank you
For this you would use a grok filter.
For example:
artur#pandaadb:~/dev/logstash$ ./logstash-2.3.2/bin/logstash -f conf2
Settings: Default pipeline workers: 8
Pipeline main started
GET /foo/bar?contentId=ABC&_=1212121212 HTTP/1.1"
{
    "message" => "GET /foo/bar?contentId=ABC&_=1212121212 HTTP/1.1\"",
    "@version" => "1",
    "@timestamp" => "2016-07-28T15:59:12.787Z",
    "host" => "pandaadb",
    "prefix" => "ABC&_",
    "id" => "1212121212"
}
This is your sample input, with your prefix and id parsed out.
There is no need for an if here, since the regular expression of the grok filter takes care of it.
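For reference, that first output can be produced with nothing more than the grok match on its own, i.e. a stripped-down version of the full filter shown further down:
filter {
  grok {
    match => {"message" => ".*contentId=%{GREEDYDATA:prefix}=%{NUMBER:id}"}
  }
}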
You can however (if you need to put it in different fields) analyse your field and add it to a different one.
This would output like this:
GET /foo/bar?contentId=ABC&_=1212121212 HTTP/1.1"
{
    "message" => "GET /foo/bar?contentId=ABC&_=1212121212 HTTP/1.1\"",
    "@version" => "1",
    "@timestamp" => "2016-07-28T16:05:07.442Z",
    "host" => "pandaadb",
    "prefix" => "ABC&_",
    "id" => "1212121212",
    "ABCID" => "1212121212"
}
GET /foo/bar?contentId=DDD&_=1212121212 HTTP/1.1"
{
    "message" => "GET /foo/bar?contentId=DDD&_=1212121212 HTTP/1.1\"",
    "@version" => "1",
    "@timestamp" => "2016-07-28T16:05:20.026Z",
    "host" => "pandaadb",
    "prefix" => "DDD&_",
    "id" => "1212121212",
    "DDDID" => "1212121212"
}
The filter I used for this looks like this:
filter {
  grok {
    match => {"message" => ".*contentId=%{GREEDYDATA:prefix}=%{NUMBER:id}"}
  }
  if [prefix] =~ "ABC" {
    mutate {
      add_field => {"ABCID" => "%{id}"}
    }
  }
  if [prefix] =~ "DDD" {
    mutate {
      add_field => {"DDDID" => "%{id}"}
    }
  }
}
I hope that illustrates how to go about it. You can use this to test your grok regex:
http://grokdebug.herokuapp.com/
Have fun!
Artur

CSV filter in logstash throwing "_csvparsefailure" error

I asked another question earlier which I think might be related to this one:
JSON parser in logstash ignoring data?
The reason I think it's related is that in the previous question Kibana wasn't displaying results from the JSON parser that have the "PROGRAM" field set to "mfd_status". Now I'm changing the way I do things: I removed the JSON parser in case it was interfering with something, but I still don't have any logs with "mfd_status" in them showing up.
csv
{
columns => ["unixTime", "unixTime2", "FACILITY_NUM", "LEVEL_NUM", "PROGRAM", "PID", "MSG_FULL"]
source => "message"
separator => " "
}
In my filter from the previous question I used two grok filters; now I've replaced them with a csv filter. I also have two date filters and a fingerprint filter, but I think they're irrelevant for this question.
Example log messages:
"1452564798.76\t1452496397.00\t1\t4\tkernel\t\t[ 6252.000246] sonar: sonar_write(): waiting..."
OUTPUT:
"unixTime" => "1452564798.76",
"unixTime2" => "1452496397.00",
"FACILITY_NUM" => "1",
"LEVEL_NUM" => "4",
"PROGRAM" => "kernel",
"PID" => nil,
"MSG_FULL" => "[ 6252.000246] sonar: sonar_write(): waiting...",
"TIMESTAMP" => "2016-01-12T02:13:18.760Z",
"TIMESTAMP_second" => "2016-01-11T07:13:17.000Z"
"1452564804.57\t1452496403.00\t1\t7\tmfd_status\t\t00800F08CFB0\textra\t{\"date\":1452543203,\"host\":\"ABCD1234\",\"inet\":[\"169.254.42.207/16\",\"10.8.207.176/32\",\"172.22.42.207/16\"],\"fb0\":[\"U:1280x800p-60\",32]}"
OUTPUT:
"tags" => [
[0] "_csvparsefailure"
After it says kernel/mfd_status in the logs, there shouldn't be any more delimiters and everything should go under the MSG_FULL field.
So, to summarize: why does one of my log messages parse correctly and the other one not? Also, even if it doesn't parse correctly, I think it should still be sent to Elasticsearch, just with empty fields, so why doesn't it do that either?
You're almost there; you just need to override two more parameters in your csv filter and both lines will be parsed correctly.
The first is skip_empty_columns => true because you have one empty field in your second log line and you need to ignore it.
The second is quote_char => "'" (or anything other than the default double quote ") since your JSON contains double quotes.
csv {
columns => ["unixTime", "unixTime2", "FACILITY_NUM", "LEVEL_NUM", "PROGRAM", "PID", "MSG_FULL"]
source => "message"
separator => " "
skip_empty_columns => true
quote_char => "'"
}
Using this, your first log line parses as:
{
"message" => "1452564798.76\\t1452496397.00\\t1\\t4\\tkernel\\t\\t[ 6252.000246] sonar: sonar_write(): waiting...",
"#version" => "1",
"#timestamp" => "2016-01-12T04:21:34.051Z",
"host" => "iMac.local",
"unixTime" => "1452564798.76",
"unixTime2" => "1452496397.00",
"FACILITY_NUM" => "1",
"LEVEL_NUM" => "4",
"PROGRAM" => "kernel",
"MSG_FULL" => "[ 6252.000246] sonar: sonar_write(): waiting..."
}
And the second log line parses as:
{
"message" => "1452564804.57\\t1452496403.00\\t1\\t7\\tmfd_status\\t\\t00800F08CFB0\\textra\\t{\\\"date\\\":1452543203,\\\"host\\\":\\\"ABCD1234\\\",\\\"inet\\\":[\\\"169.254.42.207/16\\\",\\\"10.8.207.176/32\\\",\\\"172.22.42.207/16\\\"],\\\"fb0\\\":[\\\"U:1280x800p-60\\\",32]}",
"#version" => "1",
"#timestamp" => "2016-01-12T04:21:07.974Z",
"host" => "iMac.local",
"unixTime" => "1452564804.57",
"unixTime2" => "1452496403.00",
"FACILITY_NUM" => "1",
"LEVEL_NUM" => "7",
"PROGRAM" => "mfd_status",
"MSG_FULL" => "00800F08CFB0",
"column8" => "extra",
"column9" => "{\\\"date\\\":1452543203,\\\"host\\\":\\\"ABCD1234\\\",\\\"inet\\\":[\\\"169.254.42.207/16\\\",\\\"10.8.207.176/32\\\",\\\"172.22.42.207/16\\\"],\\\"fb0\\\":[\\\"U:1280x800p-60\\\",32]}"
}

elasticsearch/kibana - analyze and visualize total time for transactions?

I am parsing log files using Logstash; here is what the JSON sent to Elasticsearch looks like:
For log lines containing the transaction start time, I add a db_transaction_commit_begin_time field with the time it is logged.
{
    "message" => "2015-05-27 10:26:47,048 INFO [T:3 ID:26] (ClassName.java:396) - End committing transaction",
    "@version" => "1",
    "@timestamp" => "2015-05-27T15:24:11.594Z",
    "host" => "test.com",
    "path" => "/abc/xyz/log.logstash.test",
    "logTimestampString" => "2015-05-27 10:26:47,048",
    "logLevel" => "INFO",
    "threadInfo" => "T:3 ID:26",
    "class" => "ClassName.java",
    "line" => "396",
    "logMessage" => "End committing transaction",
    "db_transaction_commit_begin_time" => "2015-05-27 10:26:47,048"
}
For log lines containing the transaction end time, I add a db_transaction_commit_end_time field with the time it is logged.
{
    "message" => "2015-05-27 10:26:47,048 INFO [T:3 ID:26] (ClassName.java:396) - End committing transaction",
    "@version" => "1",
    "@timestamp" => "2015-05-27T15:24:11.594Z",
    "host" => "test.com",
    "path" => "/abc/xyz/log.logstash.test",
    "logTimestampString" => "2015-05-27 10:26:47,048",
    "logLevel" => "INFO",
    "threadInfo" => "T:3 ID:26",
    "class" => "ClassName.java",
    "line" => "396",
    "logMessage" => "End committing transaction",
    "db_transaction_commit_end_time" => "2015-05-27 10:26:47,048"
}
Is it possible to calculate the time for a db transaction (db_transaction_commit_end_time - db_transaction_commit_begin_time) where threadInfo is the same? I know aggregation might help, but I am new to this and couldn't figure it out.
If I can somehow get the db_transaction_time calculated and stored in a field, how can I visualize the time taken in a Kibana chart?
Use the elapsed{} filter in logstash.
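Below is a minimal sketch of how that could look, assuming your grok filter already extracts the logMessage and threadInfo fields shown above. The begin-message text ("Start committing transaction") is an assumption, so adjust it to whatever your actual start line says; also note that elapsed ships as a separate plugin, so it may need to be installed first.
filter {
  # tag the two sides of the transaction; the start message below is an
  # assumed example, match it to the real begin line in your logs
  if [logMessage] == "Start committing transaction" {
    mutate { add_tag => ["txn_begin"] }
  }
  if [logMessage] == "End committing transaction" {
    mutate { add_tag => ["txn_end"] }
  }
  # pair begin/end events that share the same threadInfo value and write
  # the difference in seconds into the elapsed_time field of the end event
  elapsed {
    start_tag => "txn_begin"
    end_tag => "txn_end"
    unique_id_field => "threadInfo"
  }
}
Once elapsed_time is indexed, visualizing it in Kibana is straightforward: for example, a line or vertical bar chart with a date histogram on the X-axis and an Average (or Max) aggregation of elapsed_time on the Y-axis.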

Logstash Doesn't Read Entire Line With File Input

I'm using Logstash and I'm having trouble getting a rather simple configuration to work.
input {
  file {
    path => "C:/path/test-data/*.log"
    start_position => beginning
    type => "usage_data"
  }
}
filter {
  if [type] == "usage_data" {
    grok {
      match => { "message" => "^\s*%{NUMBER:lineNumber}\s+%{TIMESTAMP_ISO8601:date},(?<value1>[A-Za-z0-9+/]+),(?<value2>[A-Za-z0-9+/]+),(?<value3>[A-Za-z0-9+/]+),(?<value4>[^,]+),(?<value5>[^\r]*)" }
    }
  }
  if "_grokparsefailure" not in [tags] {
    drop { }
  }
}
output {
  stdout { codec => rubydebug }
}
I call Logstash like this:
SET LS_MAX_MEM=2g
DEL "%USERPROFILE%\.sincedb_*" 2> NUL
"C:\Program Files (x86)\logstash-1.4.1\bin\logstash.bat" agent -p "C:\path\\." -w 1 -f "logstash.conf"
The output:
Using milestone 2 input plugin 'file'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.1/plugin-milestones {:level=>:warn}
{
    "message" => ",",
    "@version" => "1",
    "@timestamp" => "2014-11-20T09:16:08.591Z",
    "type" => "usage_data",
    "host" => "my-machine",
    "path" => "C:/path/test-data/monitor_20141116223000.log",
    "tags" => [
        [0] "_grokparsefailure"
    ]
}
If I parse only C:\path\test-data\monitor_20141116223000.log, all lines are read and there is no _grokparsefailure. If I remove C:\path\test-data\monitor_20141116223000.log, the same _grokparsefailure pops up in another log file:
{
    "message" => "atches in another context\r",
    "@version" => "1",
    "@timestamp" => "2014-11-20T09:14:04.779Z",
    "type" => "usage_data",
    "host" => "my-machine",
    "path" => "C:/path/test-data/monitor_20140829235900.log",
    "tags" => [
        [0] "_grokparsefailure"
    ]
}
The last output in particular shows that Logstash doesn't read the entire line, or attempts to interpret a newline where there is none. It always breaks at the same line at the same position.
Maybe I should add that the log files contain \n as a line separator and I'm running Logstash on Windows. However, I'm not getting a whole lot of errors, just that one, and there are quite a lot of lines in there. They all appear properly when I remove the if "_grokparsefailure" ....
I assume that there is some problem with buffering, but I have no clue how to make this work. Any ideas?
Workaround:
# diff -Nur /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb.orig /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb
--- /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb.orig 2015-02-25 10:46:06.916321816 +0700
+++ /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb 2015-02-12 18:39:34.943833909 +0700
@@ -86,7 +86,9 @@
       _read_file(path, &block)
       @files[path].close
       @files.delete(path)
-      @statcache.delete(path)
+      #@statcache.delete(path)
+      inode = @statcache.delete(path)
+      @sincedb[inode] = 0
     else
       @logger.warn("unknown event type #{event} for #{path}")
     end
