Extracting nested object with Logstash and Ruby

I have this XML structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180000001A1-20180104.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20171219" date-publ="20180104">
<us-bibliographic-data-application lang="EN" country="US">
<us-parties>
<inventors>
<inventor sequence="00" designation="us-only">
<addressbook>
<last-name>Evans</last-name>
<first-name>Mike</first-name>
<address>
<city>Emerald Park</city>
<country>CA</country>
</address>
</addressbook>
</inventor>
<inventor sequence="01" designation="us-only">
<addressbook>
<last-name>Lucas</last-name>
<first-name>Lisa</first-name>
<address>
<city>Regina</city>
<country>CA</country>
</address>
</addressbook>
</inventor>
<inventor sequence="02" designation="us-only">
<addressbook>
<last-name>Smith</last-name>
<first-name>John R.</first-name>
<address>
<city>Regina</city>
<country>CA</country>
</address>
</addressbook>
</inventor>
</inventors>
</us-parties>
</us-bibliographic-data-application>
</us-patent-application>
I would like Logstash to output this structure:
{
"us-patent-application": {
"us-bibliographic-data-application": {
"us-parties": {
"inventors": [
"Mike Evans",
"Lisa Lucas",
"John R. Smith"
]
}
}
}
}
I have attempted to combine these names into one array in Logstash, but I can't find a working solution.
As of now, I am focused on using a Ruby script in the Logstash ruby filter plugin. I am using this approach because I was not able to find a working solution using XPath in the Logstash xml filter.
Here is the Logstash 'main.conf' configuration file:
input {
file {
path => [
"/opt/uspto/*.xml"
]
start_position => "beginning"
#use for testing
sincedb_path => "/dev/null"
# set this sincedb path when not testing
#sincedb_path => "/opt/logstash/tmp/sincedb"
exclude => "*.gz"
type => "xml"
codec => multiline {
#pattern => "<wo-ocr-published-application"
pattern => "<?xml version=\"1.0\" encoding=\"UTF-8\"\?>"
negate => "true"
what => "previous"
max_lines => 300000
}
}
}
filter {
if "multiline" in [tags] {
xml {
source => "message"
#store_xml => false # this limits the data indexed to only xpath and grok created fields
store_xml => true #saves ALL xml nodes if it can - can be VERY large
target => "xmldata" # only used with store_xml => true
}
ruby {
path => "/etc/logstash/rubyscripts/inventors.rb"
}
}
}
output {
file {
path => [ "/tmp/logstash_output_text_file" ]
codec => rubydebug
}
}
And here is the inventors.rb script:
# the value of `params` is the value of the hash passed to `script_params`
# in the logstash configuration
def register(params)
#drop_percentage = params["percentage"]
end
def filter(event)
# get the number of inventors to loop over
# convert the array key string number 0 to an integer
n = event.get('[num_inventors][0]').to_i
# set a loop number to start with
i = 0
#create empty arrays to fill
firstname = []
lastname = []
# loop over inventors until n is reached
while (i < n) do
#get the inventors first name
fname = event.get('[event][us-patent-application][us-bibliographic-data-application][us-parties][inventors][inventor][addressbook][first-name]')
#puts"first name #{fname}"
# push the first name into firstname array
firstname.push(fname)
#get the inventors last name
lname = event.get('[event][us-patent-application][us-bibliographic-data-application][us-parties][inventors][inventor][addressbook][last-name]')
#puts"last name #{lname}"
# push the last name into the lastname array
lastname.push(lname)
#increment i by 1
i += 1
end
#merge firstname and lastname arrays
names = firstname.zip(lastname)
# push the names array to the event
event.set('allnames', names)
return [event]
end
Finally, here is the Elasticsearch output:
{
"host" => "localhost.localdomain",
"allnames" => [],
"type" => "xml",
"#timestamp" => 2018-09-20T17:28:05.332Z,
"message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<!DOCTYPE us-patent-application SYSTEM \"us-patent-application-v44-2014-04-03.dtd\" [ ]>\r\n<us-patent-application lang=\"EN\" dtd-version=\"v4.4 2014-04-03\" file=\"US20180000001A1-20180104.XML\" status=\"PRODUCTION\" id=\"us-patent-application\" country=\"US\" date-produced=\"20171219\" date-publ=\"20180104\">\r\n\t<us-bibliographic-data-application lang=\"EN\" country=\"US\">\r\n\t\t<us-parties>\r\n\t\t\t<inventors>\r\n\t\t\t\t<inventor sequence=\"00\" designation=\"us-only\">\r\n\t\t\t\t\t<addressbook>\r\n\t\t\t\t\t\t<last-name>Evans</last-name>\r\n\t\t\t\t\t\t<first-name>Mike</first-name>\r\n\t\t\t\t\t\t<address>\r\n\t\t\t\t\t\t\t<city>Emerald Park</city>\r\n\t\t\t\t\t\t\t<country>CA</country>\r\n\t\t\t\t\t\t</address>\r\n\t\t\t\t\t</addressbook>\r\n\t\t\t\t</inventor>\r\n\t\t\t\t<inventor sequence=\"01\" designation=\"us-only\">\r\n\t\t\t\t\t<addressbook>\r\n\t\t\t\t\t\t<last-name>Lucas</last-name>\r\n\t\t\t\t\t\t<first-name>Lisa</first-name>\r\n\t\t\t\t\t\t<address>\r\n\t\t\t\t\t\t\t<city>Regina</city>\r\n\t\t\t\t\t\t\t<country>CA</country>\r\n\t\t\t\t\t\t</address>\r\n\t\t\t\t\t</addressbook>\r\n\t\t\t\t</inventor>\r\n\t\t\t\t<inventor sequence=\"02\" designation=\"us-only\">\r\n\t\t\t\t\t<addressbook>\r\n\t\t\t\t\t\t<last-name>Smith</last-name>\r\n\t\t\t\t\t\t<first-name>Scott R.</first-name>\r\n\t\t\t\t\t\t<address>\r\n\t\t\t\t\t\t\t<city>Regina</city>\r\n\t\t\t\t\t\t\t<country>CA</country>\r\n\t\t\t\t\t\t</address>\r\n\t\t\t\t\t</addressbook>\r\n\t\t\t\t</inventor>\r\n\t\t\t</inventors>\r\n\t\t</us-parties>\r\n\t</us-bibliographic-data-application>\r",
"#version" => "1",
"tags" => [
[0] "multiline",
[1] "_xmlparsefailure"
],
"path" => "/opt/uspto/test.xml"
}
The behavior I am looking for is to have the "allnames" => [] array look like:
"allnames" => ["Mike Evans",
"Lisa Lucas",
"John R. Smith"],
I cannot figure out how to properly grab the 'first-name' and 'last-name' nodes in my ruby script. I'm pulling my hair out trying to get a working solution. Any ideas are welcome!
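For reference, here is a minimal sketch of an alternative inventors.rb that skips the xml filter's nested-hash layout and parses the raw [message] field directly with Ruby's bundled REXML library. This is only a sketch; it assumes REXML is loadable on the JRuby that ships with Logstash and that the combined multiline event still carries the whole document in [message]:
# inventors.rb (sketch): build "allnames" straight from the raw XML in [message]
require 'rexml/document'

def register(params)
end

def filter(event)
  xml = event.get('message')
  return [event] if xml.nil?

  begin
    doc = REXML::Document.new(xml)
  rescue REXML::ParseException
    # leave the event untouched if the XML is truncated or malformed
    return [event]
  end

  names = []
  # each <inventor> holds one <addressbook> with <first-name> and <last-name>
  REXML::XPath.each(doc, '//inventors/inventor/addressbook') do |ab|
    first = ab.elements['first-name']
    last = ab.elements['last-name']
    names << [first && first.text, last && last.text].compact.join(' ')
  end

  event.set('allnames', names)
  [event]
end
The same walk could be done over the parsed [xmldata] hash instead, but its exact nesting depends on the xml filter's force_array behaviour, which is why the raw-XML route is easier to reason about here.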

Please see if this solves your requirement. I was really hopeful that this could be solved with xml + xpath alone, but I guess not all XPath functions are supported. :(
input {
file {
path => [
"/opt/uspto/*.xml"
]
start_position => "beginning"
#use for testing
sincedb_path => "/dev/null"
# set this sincedb path when not testing
#sincedb_path => "/opt/logstash/tmp/sincedb"
exclude => "*.gz"
type => "xml"
codec => multiline {
#pattern => "<wo-ocr-published-application"
pattern => "<?xml version=\"1.0\" encoding=\"UTF-8\"\?>"
negate => "true"
what => "previous"
max_lines => 300000
}
}
}
filter {
if "multiline" in [tags] {
xml {
source => "message"
store_xml => false
target => "xmldata" # only used with store_xml => true
force_array => false
xpath => [ "//us-bibliographic-data-application/us-parties/inventors/inventor/addressbook/first-name/text()" , "[xmldata][us-bibliographic-data-application][us-parties][inventors][first_name]" ,
"//us-bibliographic-data-application/us-parties/inventors/inventor/addressbook/last-name/text()", "[xmldata][us-bibliographic-data-application][us-parties][inventors][last_name]" ]
}
ruby {
code => ' first_name = event.get("[xmldata][us-bibliographic-data-application][us-parties][inventors][first_name]")
last_name = event.get("[xmldata][us-bibliographic-data-application][us-parties][inventors][last_name]")
event.set("[xmldata][us-bibliographic-data-application][us-parties][inventors][names]", first_name.zip(last_name).map{ |a| a.join(" ") })
'
}
mutate {
remove_field => ["message", "host", "path", "[xmldata][us-bibliographic-data-application][us-parties][inventors][first_name]", "[xmldata][us-bibliographic-data-application][us-parties][inventors][last_name]" ]
}
}
}
output {
file {
path => [ "/tmp/logstash_output_text_file" ]
codec => rubydebug
}
}
The output is:
{
"#timestamp": "2018-09-21T12:00:26.428Z",
"#version": "1",
"tags": [
"multiline"
],
"type": "xml",
"xmldata": {
"us-bibliographic-data-application": {
"us-parties": {
"inventors": {
"names": [
"Mike Evans",
"Lisa Lucas",
"John R. Smith"
]
}
}
}
}
}

Related

Grok parse error while parsing multiple line messages

I am trying to figure out a grok pattern for parsing multi-line messages such as exception traces, and below is one such log:
2017-03-30 14:57:41 [12345] [qtp1533780180-12] ERROR com.app.XYZ - Exception occurred while processing
java.lang.NullPointerException: null
at spark.webserver.MatcherFilter.doFilter(MatcherFilter.java:162)
at spark.webserver.JettyHandler.doHandle(JettyHandler.java:61)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:189)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
at org.eclipse.jetty.server.Server.handle(Server.java:517)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:302)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:245)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
at java.lang.Thread.run(Thread.java:745)
Here is my logstash.conf
input {
file {
path => ["/debug.log"]
codec => multiline {
# Grok pattern names are valid! :)
pattern => "^%{TIMESTAMP_ISO8601} "
negate => true
what => previous
}
}
}
filter {
mutate {
gsub => ["message", "r", ""]
}
grok {
match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} \[%{NOTSPACE:uid}\] \[%{NOTSPACE:thread}\] %{LOGLEVEL:loglevel} %{DATA:class}\-%{GREEDYDATA:message}" ]
overwrite => [ "message" ]
}
date {
match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss" ]
}
}
output {
elasticsearch { hosts => localhost }
stdout { codec => rubydebug }
}
This works fine for single line logs parsing but fails in
0] "_grokparsefailure"
for multiline exception traces
Can someone please suggest me the correct filter pattern for parsing multiline logs ?
If you are working with multiline logs then please use the multiline filter provided by Logstash. You first need to tell the multiline filter how to recognise the start of a new record; from your logs I can see a new record starts with the timestamp. Below is an example usage:
filter {
multiline {
type => "/debug.log"
pattern => "^%{TIMESTAMP}"
what => "previous"
}
}
You can then use gsub to remove the "\n" and "\r" characters that the multiline filter adds to your record, as sketched below. After that, use grok.
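For example, a minimal sketch of that gsub step (the literal \r and \n patterns and the message field are my assumptions about what is meant here):
filter {
  mutate {
    # strip the carriage returns and newlines that the multiline step
    # stitched into the combined record, before handing it to grok
    gsub => [
      "message", "\r", "",
      "message", "\n", " "
    ]
  }
}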
The above logstash config worked fine after removing
mutate {
gsub => ["message", "r", ""]
}
So here is the working logstash config for parsing both single-line and multi-line inputs for the above log pattern:
input {
file {
path => ["./debug.log"]
codec => multiline {
# Grok pattern names are valid! :)
pattern => "^%{TIMESTAMP_ISO8601} "
negate => true
what => previous
}
}
}
filter {
grok {
match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} \[%{NOTSPACE:uid}\] \[%{NOTSPACE:thread}\] %{LOGLEVEL:loglevel} %{DATA:class}\-%{GREEDYDATA:message}" ]
overwrite => [ "message" ]
}
date {
match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss" ]
}
}
output {
elasticsearch { hosts => localhost }
stdout { codec => rubydebug }
}

How to filter input data of logstash based on date field?

Here is one of my Twitter input tweets:
"_source": {
"created_at": "Wed Aug 10 06:42:48 +0000 2016",
"id": 763264318242783200,
"timestamp_ms": "1470811368891",
"#version": "1",
"#timestamp": "2016-08-10T06:42:48.000Z"
}
And here is my logstash config file, which includes the twitter input plugin, filter, and output:
input {
twitter {
consumer_key => "lvvoeonCRBOHsLAoTPbion9sK"
consumer_secret => "GNHOFzErJhuo0bNq38JUs7xea2BOktMiLa7tunoGwP0oFKCHrY"
oauth_token => "704578110616936448-gfeSklNrITu7fHIZgjw3nwoZ1S0l0Jl"
oauth_token_secret => "IHiyRJRN09jjdUTGrnesALw4DRle35WyX7pdnI3CtEnJ5"
keywords => [ "afghanistan", "TOLOnews", "kabul", "police"]
full_tweet => true
}
}
filter {
date {
match => ["timestamp" , "MMM d YYY HH:mm:ss", "ISO8601"]
}
}
output {
stdout { codec => dots }
elasticsearch {
hosts => "10.20.1.123"
index => "twitter_news"
document_type => "tweets"
}
}
I just want to get new tweets. For example, if today's date is 2016-11-16, then I only want tweets that have @timestamp = 2016-11-16, not @timestamp = 2016-11-15 or earlier days' tweets. But with this configuration I get past tweets as well. Can anyone help me with how to do this?
The idea here is to use Ruby code in the logstash config. I propose using timestamp_ms for comparing dates:
First, convert timestamp_ms to an integer.
Then add today's timestamp in ms with Ruby.
Then compare the two timestamps.
Here is an example:
mutate {
convert => {
"timestamp_ms" => "integer"
}
}
ruby {
code => "
t = Time.now
today_ymd = t.strftime('%Y%m%d')
today_timestamp_ms = DateTime.parse(today_ymd).to_time.to_i*1000
event['@metadata']['today_timestamp_ms'] = today_timestamp_ms
"
}
if [timestamp_ms] < [@metadata][today_timestamp_ms] {
## past days events
mutate {
add_field => { "test" => "past days events" }
}
} else {
# today events
mutate {
add_field => { "test" => "today events" }
}
}
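Note that this example uses the pre-5.x ruby filter event API (event['...']). On Logstash 5.x and later the ruby filter has to go through event.get / event.set instead, so a roughly equivalent sketch (my adaptation, untested) would be:
ruby {
  code => "
    require 'date'
    # midnight of the current local day, in epoch milliseconds
    today_timestamp_ms = Date.today.to_time.to_i * 1000
    event.set('[@metadata][today_timestamp_ms]', today_timestamp_ms)
  "
}
The conditional that follows can keep comparing [timestamp_ms] against [@metadata][today_timestamp_ms] unchanged.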

Logstash config - S3 Filenames?

I'm attempting to use Logstash to collect sales data for multiple clients from multiple vendors.
So far I've got an S3 (Inbox) bucket that I can drop my files (currently CSVs) into and, according to the client code prefix on the file, the data gets pushed into an Elastic index for each client. This all works nicely.
The problem I have is that I have data from multiple vendors and need a way of identifying which file is from which vendor. Adding an extra column to the CSV is not an option, so my plan was to add this to the filename, so I'd end up with a file-naming convention something like clientcode_vendorcode_reportdate.csv.
I may be missing something, but it seems that on S3 I can't get access to the filename inside my config, which seems crazy given that the prefix is clearly being read at some point. I was intending to use a Grok or Ruby filter to simply split the filename on the underscore, giving me three key variables that I can use in my config, but all attempts so far have failed. I can't even seem to get the full source_path or filename as a variable.
My config so far looks something like this - i've removed failed attempts at using Grok/Ruby filters.
input {
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "ABC"
type => "ABC"
}
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "DEF"
type => "DEF"
}
}
filter {
if [type] == "ABC" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "abc" }
}
}
else if [type] == "DEF" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "def" }
}
}
mutate
{
remove_field => [ "message" ]
}
}
output {
elasticsearch {
codec => json
hosts => "myelasticcluster.com:9200"
index => "sales_%{client_code}"
document_type => "sale"
}
stdout { codec => rubydebug }
}
Any guidance from those well versed in Logstash configs would be much appreciated!
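Not a complete answer, but as a sketch of the underscore split described above: if your version of the s3 input exposes the object key on the event (newer releases of logstash-input-s3 are said to put it in [@metadata][s3][key]; older ones may not expose it at all, which could be what you are running into), a grok filter along these lines would pull out the three parts:
filter {
  grok {
    # clientcode_vendorcode_reportdate.csv -> three separate fields;
    # the source field below is an assumption - point it at wherever
    # your plugin version actually surfaces the key/filename
    match => {
      "[@metadata][s3][key]" => "%{DATA:client_code}_%{DATA:vendor_code}_%{DATA:report_date}\.csv"
    }
  }
}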

Logstash Doesn't Read Entire Line With File Input

I'm using Logstash and I'm having trouble getting a rather simple configuration to work.
input {
file {
path => "C:/path/test-data/*.log"
start_position => beginning
type => "usage_data"
}
}
filter {
if [type] == "usage_data" {
grok {
match => { "message" => "^\s*%{NUMBER:lineNumber}\s+%{TIMESTAMP_ISO8601:date},(?<value1>[A-Za-z0-9+/]+),(?<value2>[A-Za-z0-9+/]+),(?<value3>[A-Za-z0-9+/]+),(?<value4>[^,]+),(?<value5>[^\r]*)" }
}
}
if "_grokparsefailure" not in [tags] {
drop { }
}
}
output {
stdout { codec => rubydebug }
}
I call Logstash like this:
SET LS_MAX_MEM=2g
DEL "%USERPROFILE%\.sincedb_*" 2> NUL
"C:\Program Files (x86)\logstash-1.4.1\bin\logstash.bat" agent -p "C:\path\\." -w 1 -f "logstash.conf"
The output:
Using milestone 2 input plugin 'file'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.1/plugin-milestones {:level=>:warn}
{
"message" => ",",
"#version" => "1",
"#timestamp" => "2014-11-20T09:16:08.591Z",
"type" => "usage_data",
"host" => "my-machine",
"path" => "C:/path/test-data/monitor_20141116223000.log",
"tags" => [
[0] "_grokparsefailure"
]
}
If I parse only C:\path\test-data\monitor_20141116223000.log all lines are read and there is no grokparsefailure. If I remove C:\path\test-data\monitor_20141116223000.log the same grokparsefailure pops up in another log-file:
{
"message" => "atches in another context\r",
"#version" => "1",
"#timestamp" => "2014-11-20T09:14:04.779Z",
"type" => "usage_data",
"host" => "my-machine",
"path" => "C:/path/test-data/monitor_20140829235900.log",
"tags" => [
[0] "_grokparsefailure"
]
}
The last output in particular shows that Logstash either doesn't read the entire line or interprets a newline where there is none. It always breaks at the same line, at the same position.
Maybe I should add that the log-files contain \n as a line separator and I'm running Logstash on Windows. However, I'm not getting a whole lot of errors, just that one. And there are quite a lot of lines in there. They all appear properly when I remove the if "_grokparsefailure" ....
I assume that there is some problem with buffering, but I have no clue how to make this work. Any ideas?
Workaround:
# diff -Nur /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb.orig /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb
--- /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb.orig 2015-02-25 10:46:06.916321816 +0700
+++ /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb 2015-02-12 18:39:34.943833909 +0700
@@ -86,7 +86,9 @@
_read_file(path, &block)
@files[path].close
@files.delete(path)
- @statcache.delete(path)
+ #@statcache.delete(path)
+ inode = @statcache.delete(path)
+ @sincedb[inode] = 0
else
@logger.warn("unknown event type #{event} for #{path}")
end

Multiple levels of multiline in logstash

I want to be able to use a multiline filter, and then another multiline at a deeper level.
To be more exact, I first want to build a Java exception stack trace, for example:
2014-06-20 Some-arbitrary-log
java.lang.IndexOutOfBoundsException: Index: 8, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
And then, after that, combine a couple of those together like so using another multiline and throttle:
2014-06-19 Some-arbitrary-log
java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
2014-06-20 Some-arbitrary-log
java.lang.IndexOutOfBoundsException: Index: 8, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
My filter looks like so:
filter {
if [type] =~ /test.+/
{
multiline {
pattern => "(^.+Exception.*)|(^\tat.+)"
negate => false
what => "previous"
}
if ("multiline" in [tags]) {
mutate {
add_field => [ "ERROR_TYPE", "java_exception" ]
}
}
if ([ERROR_TYPE] == "java_exception") {
throttle{
key => ".*"
period => 10
before_count => 2
after_count => -1
add_tag => "throttled"
}
if ("throttled" not in [tags]) {
multiline {
pattern => ".*"
negate => false
what => "previous"
}
}
}
}
}
The first level, combining just one stack trace, works.
That is, this works as intended:
multiline {
pattern => "(^.+Exception.*)|(^\tat.+)"
negate => false
what => "previous"
}
if ("multiline" in [tags]) {
mutate {
add_field => [ "ERROR_TYPE", "java_exception" ]
}
}
Combining multiple stack traces, however, doesn't work. The output I use is as follows:
output {
if [ERROR_TYPE] == "java_exception"{
stdout {codec => rubydebug }
elasticsearch {
cluster => "logstash"
}
}
}
However, there are no combined stack traces, and all of them have the tag "throttled".
To check if there are any non-throttled, I did:
output {
if [ERROR_TYPE] == "java_exception" and "throttled" not in [tags]{
stdout {codec => rubydebug }
elasticsearch {
cluster => "logstash"
}
}
}
And nothing came up. Why does this not get throttled by the before_count in the throttle filter?
Any thoughts anyone?
