Data type conversion using logstash grok - elasticsearch

Basic is a float field. The index mentioned does not exist in Elasticsearch. When I run the config file with logstash -f, I get no exception, yet the data that ends up in Elasticsearch has Basic mapped as a string. How do I rectify this? And how do I do it for multiple fields?
input {
  file {
    path => "/home/sagnik/work/logstash-1.4.2/bin/promosms_dec15.csv"
    type => "promosms_dec15"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  grok {
    match => [
      "Basic", " %{NUMBER:Basic:float}"
    ]
  }
  csv {
    columns => ["Generation_Date","Basic"]
    separator => ","
  }
  ruby {
    code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
  }
}
output {
  elasticsearch {
    action => "index"
    host => "localhost"
    index => "promosms-%{+dd.MM.YYYY}"
    workers => 1
  }
}

You have two problems. First, your grok filter is listed prior to the csv filter and because filters are applied in order there won't be a "Basic" field to convert when the grok filter is applied.
Secondly, unless you explicitly allow it, grok won't overwrite existing fields. In other words,
grok {
  match => [
    "Basic", " %{NUMBER:Basic:float}"
  ]
}
will always be a no-op. Either specify overwrite => ["Basic"] or, preferably, use mutate's type conversion feature:
mutate {
  convert => ["Basic", "float"]
}
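Putting both fixes together, the filter section might look something like this (just a sketch assembled from your own config, not tested):
filter {
  csv {
    columns => ["Generation_Date","Basic"]
    separator => ","
  }
  # Convert after the csv filter has created the "Basic" field
  mutate {
    convert => ["Basic", "float"]
  }
  ruby {
    code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
  }
}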

Related

Setting variables in logstash config and referencing them

I started with ELK a week back to use it for storing multiple CSVs and getting them into Kibana for ease of analysis. One case will involve multiple machines, and one machine will generate many CSVs. These CSVs have a particular naming pattern. I am taking one particular file (BrowsingHistoryView_DMZ-machine1.csv) for reference and setting up the case as the index. To define an index I've chosen to rename files to have a prefix of '__case_number__'. So the file name will be __1__BrowsingHistoryView_DMZ-machine1.csv
Now I want to derive three things out of it:
1. Get the case number __1__ and use 1 as the index. 1, 2, 3 etc. will be used as case numbers.
2. Get the file type (BrowsingHistoryView for example) and add a tag name to the uploaded file.
3. Get the machine name DMZ-machine1 (don't know yet where I'll use it).
I created a config file for it, which is below:
input {
  file {
    path => "/home/kriss/Documents/*.csv" # get the files from Documents
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
  }
  if [path] =~ "BrowsingHistory" {
    mutate { add_tag => ["Browsinghistory"] } # define a new tag for browser history, this worked
    grok {
      match => ["path", "__(?<case>[0-9]+)__(?<category>\w+_)(?<machine>(.+)).csv"] # This regex pattern is to get category(browsingHistory), MachineName
    }
  }
  if [path] =~ "Inbound_RDP_Events" { mutate { add_tag => {"Artifact" => "RDP" } } } # This tagging worked
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "%{category}" # This referencing the category variable didn't work
  }
  stdout {}
}
When I run this config in Logstash, the index generated is %{category}. I need it to capture browser_history as the index for that file. I would also like to convert the category to lowercase, since uppercase letters don't work well in index names. I tried to follow the official documentation but didn't get the complete info that I need.
There's a grok debugger in Dev Tools in Kibana you can use to work on these kinds of problems, or an online one at https://grokdebug.herokuapp.com/ - it's great.
Below is a slightly modified version of your config. I've removed your comments and inserted my own.
The changes are:
The path regex in your config doesn't match the example filename you gave. You might want to change it back, depending on how accurate your example was.
The grok pattern has been tweaked
Changed your Artifact tag to a field, because it looks like you're trying to create a field
I tried to stick to your spacing convention :)
input {
  file {
    path => "/home/kriss/Documents/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
  }
  # I replaced your regex with something that matches your example
  # filename, but given that you said you already had this
  # working, you might want to change it back.
  if [path] =~ "browser_history" {
    mutate { add_tag => ["Browsinghistory"] }
    grok {
      # I replaced custom captures with a more grokish style, and
      # use GREEDYDATA to capture everything up to the last '_'
      match => [ "path", "^_+%{NUMBER:case}_+%{GREEDYDATA:category}_+%{DATA:machine}\.csv$" ]
    }
  }
  # Replaced `add_tag` with `add_field` so that the syntax makes sense
  if [path] =~ "Inbound_RDP_Events" { mutate { add_field => {"Artifact" => "RDP" } } }
  # Added the `mutate` filter's `lowercase` function for "category"
  mutate {
    lowercase => "category"
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "%{category}"
  }
  stdout {}
}
Not tested, but I hope it gives you enough clues.
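As an aside, when deciding between tags and fields: add_tag appends a value to the event's built-in tags array (handy for quick filtering in Kibana), while add_field creates a named field whose value you can reference elsewhere, for example in the index name. A minimal sketch of the two side by side:
filter {
  if [path] =~ "BrowsingHistory" {
    # Appends "Browsinghistory" to the event's `tags` array
    mutate { add_tag => ["Browsinghistory"] }
    # Creates a named field you can reference later, e.g. index => "%{artifact}"
    mutate { add_field => { "artifact" => "Browsinghistory" } }
  }
}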
So, for reference, for anyone who is trying to use custom variables in a Logstash config file, below is the working config:
input {
  file {
    path => "/home/user1/Documents/__1__BrowsingHistoryView_DMZ-machine1.csv" # getting the absolute path (necessary)
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
  }
  if [path] =~ "BrowsingHistory" {
    mutate { add_field => {"artifact" => "Browsinghistory"} } # if BrowsingHistory is found in path, add an artifact field
    grok {
      match => ["path", "__(?<case>[0-9]+)__(?<category>\w+_)(?<machine>(.+)).csv"] # get the caseNumber, logCategory, machineName into variables
    }
  }
  if [path] =~ "Inbound_RDP_Events" { mutate { add_field => {"artifact" => "RDP"} } } # another artifact value if an RDP event file is found in the path
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "%{case}"       # passing the variable value derived from the regex
    # index => "%{category}" # another regex variable
    # index => "%{machine}"  # another regex variable
  }
  stdout {}
}
I wasn't very sure whether to add a new tag or a new field (add_field => {"artifact" => "Browsinghistory"}) for easy identification of a file in Kibana. If someone could provide some info on how to choose between them, that would help.

How to filter {"foo":"bar", "bar": "foo"} with grok to get only the foo field?

I copied
{"name":"myapp","hostname":"banana.local","pid":40161,"level":30,"msg":"hi","time":"2013-01-04T18:46:23.851Z","v":0}
from https://github.com/trentm/node-bunyan and saved it as my logs.json. I am trying to import only two fields (name and msg) into Elasticsearch via Logstash. The problem is that I need a kind of filter that I have not been able to put together. I have successfully imported such a line as a single message, but that is of no use in my real case.
That said, how can I import only name and msg into Elasticsearch? I tested several alternatives using http://grokdebug.herokuapp.com/ to arrive at a useful filter, with no success at all.
For instance, %{GREEDYDATA:message} will bring in the entire line as a single message, but how do I split it and ignore everything other than the name and msg fields?
In the end, I am planning to use it here:
input {
  file {
    type => "my_type"
    path => [ "/home/logs/logs.log" ]
    codec => "json"
  }
}
filter {
  grok {
    match => { "message" => "data=%{GREEDYDATA:request}"}
  }
  #### some extra lines here probably
}
output {
  elasticsearch {
    codec => json
    hosts => "http://127.0.0.1:9200"
    index => "indextest"
  }
  stdout { codec => rubydebug }
}
I have just gone through the list of available Logstash filters. The prune filter should match your need.
Assuming you have installed the prune filter, your config file should look like this:
input {
  file {
    type => "my_type"
    path => [ "/home/logs/logs.log" ]
    codec => "json"
  }
}
filter {
  prune {
    whitelist_names => [
      "@timestamp",
      "type",
      "name",
      "msg"
    ]
  }
}
output {
  elasticsearch {
    codec => json
    hosts => "http://127.0.0.1:9200"
    index => "indextest"
  }
  stdout { codec => rubydebug }
}
Please note that you will want to keep type so that Elasticsearch indexes the data into the correct type. @timestamp is required if you want to view the data in Kibana.
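One caveat to be aware of: the whitelist_names entries are regular expressions, and unanchored patterns match substrings, so "name" would also keep bunyan's hostname field. Anchoring the patterns avoids that (a small tweak of the config above, assuming the regexp matching described in the plugin docs):
prune {
  # Anchored so that "name" does not also match "hostname"
  whitelist_names => [ "^@timestamp$", "^type$", "^name$", "^msg$" ]
}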

How to force encoding for Logstash filters? (umlauts from message not recognized)

I am trying to import historical log data into ElasticSearch (Version 5.2.2) using Logstash (Version 5.2.1) - all running under Windows 10.
Sample log file
The sample log file I am importing looks like this:
07.02.2017 14:16:42 - Critical - General - Ähnlicher Fehler mit übermäßger Ödnis
08.02.2017 14:13:52 - Critical - General - ästhetisch überfällige Fleißarbeit
Working configuration
For starters I tried the following simple Logstash configuration (it's running on Windows so don't get confused by the mixed slashes ;)):
input {
  file {
    path => "D:/logstash/bin/*.log"
    sincedb_path => "C:\logstash\bin\file_clientlogs_lastrun"
    ignore_older => 999999999999
    start_position => "beginning"
    stat_interval => 60
    type => "clientlogs"
  }
}
output {
  if [type] == "clientlogs" {
    elasticsearch {
      index => "logstash-clientlogs"
    }
  }
}
And this works fine - I see that the input gets read line by line into the index I specified, and when I check with Kibana the two lines look as expected (I just omitted the host name).
More complex (not working) configuration
But of course this is still pretty flat data, and I really want to extract the proper timestamps from my lines as well as the other fields, and replace @timestamp and message with those; so I inserted some filter logic involving the grok, mutate and date filters in between input and output, so the resulting configuration looks like this:
input {
  file {
    path => "D:/logs/*.log"
    sincedb_path => "C:\logstash\bin\file_clientlogs_lastrun"
    ignore_older => 999999999999
    start_position => "beginning"
    stat_interval => 60
    type => "clientlogs"
  }
}
filter {
  if [type] == "clientlogs" {
    grok {
      match => [ "message", "%{MONTHDAY:monthday}.%{MONTHNUM2:monthnum}.%{YEAR:year} %{TIME:time} - %{WORD:severity} - %{WORD:granularity} - %{GREEDYDATA:logmessage}" ]
    }
    mutate {
      add_field => {
        "timestamp" => "%{year}-%{monthnum}-%{monthday} %{time}"
      }
      replace => [ "message", "%{logmessage}" ]
      remove_field => ["year", "monthnum", "monthday", "time", "logmessage"]
    }
    date {
      locale => "en"
      match => ["timestamp", "YYYY-MM-dd HH:mm:ss"]
      timezone => "Europe/Vienna"
      target => "@timestamp"
      add_field => { "debug" => "timestampMatched"}
    }
  }
}
output {
  if [type] == "clientlogs" {
    elasticsearch {
      index => "logstash-clientlogs"
    }
  }
}
Now, when I look at those logs in Kibana, I see that the fields I wanted to add do appear and the timestamp and message are replaced correctly, but my umlauts are all gone.
Forcing charset in input and output
I also tried setting
codec => plain {
  charset => "UTF-8"
}
for input and output, but that also did not change anything for the better.
Different output-type
When I change the output to stdout { }, the output seems okay:
2017-02-07T13:16:42.000Z MYPC Ähnlicher Fehler mit übermäßger Ödnis
2017-02-08T13:13:52.000Z MYPC ästhetisch überfällige Fleißarbeit
Querying without Kibana
I also queried against the index using this PowerShell-command:
Invoke-WebRequest -Method POST -Uri 'http://localhost:9200/logstash-clientlogs/_search' -Body '
{
  "query":
  {
    "regexp": {
      "message" : ".*"
    }
  }
}
' | select -ExpandProperty Content
But it also returns the same messed up contents Kibana reveals:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"logstash-clientlogs","_type":"clientlogs","_id":"AVskdTS8URonc
bfBgFwC","_score":1.0,"_source":{"severity":"Critical","debug":"timestampMatched","message":"�hnlicher Fehler mit �berm��ger �dnis\r","type":"clientlogs","path":"D:/logs/Client.log","#timestamp":"2017-02-07T13:16:42.000Z","granularity":"General","#version":"1","host":"MYPC","timestamp":"2017-02-07 14:16:42"}},{"_index":"logstash-clientlogs","_type":"clientlogs","_id":"AVskdTS8UR
oncbfBgFwD","_score":1.0,"_source":{"severity":"Critical","debug":"timestampMatched","message":"�sthetisch �berf�llige Flei�arbeit\r","type":"clientlogs","path":"D:/logs/Client.log","#timestamp":"2017-02-08T13:13:52.000Z","granularity":"General","#version":"1","host":"MYPC","timestamp":"2017-02-08 14:13:52"}}]}}
Has anyone else experienced this and found a solution for this use case? I don't see any setting for grok to specify an encoding (the file I am passing is UTF-8 with BOM), and an encoding for the input itself does not seem necessary, because I get the correct message when I leave out the filter.

Logstash config - S3 Filenames?

I'm attempting to use Logstash to collect sales data for multiple clients from multiple vendors.
So far I've got an S3 (Inbox) bucket that I can drop my files (currently CSVs) into, and according to the client code prefix on the file, the data gets pushed into an Elastic index for each client. This all works nicely.
The problem I have is that I have data from multiple vendors and need a way of identifying which file is from which vendor. Adding an extra column to the CSV is not an option, so my plan was to add this to the filename, so I'd end up with a file naming convention something like clientcode_vendorcode_reportdate.csv.
I may be missing something, but it seems that on S3 I can't get access to the filename inside my config, which seems crazy given that the prefix is clearly being read at some point. I was intending to use a Grok or Ruby filter to simply split the filename on the underscore, giving me three key variables that I can use in my config, but all attempts so far have failed. I can't even seem to get the full source_path or filename as a variable.
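What I had in mind was roughly the split below (just a sketch; the field that would hold the S3 object key is a guess on my part, and that missing piece may be exactly the problem):
filter {
  grok {
    # Hypothetical: assumes the object key ends up in some field such as [@metadata][s3][key],
    # which depends on the version of the s3 input plugin
    match => { "[@metadata][s3][key]" => "^%{DATA:client_code}_%{DATA:vendor_code}_%{DATA:report_date}\.csv$" }
  }
}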
My config so far looks something like this - I've removed failed attempts at using Grok/Ruby filters.
input {
  s3 {
    access_key_id => "MYACCESSKEYID"
    bucket => "sales-inbox"
    backup_to_bucket => "sales-archive"
    delete => "true"
    secret_access_key => "MYSECRETACCESSKEY"
    region => "eu-west-1"
    codec => plain
    prefix => "ABC"
    type => "ABC"
  }
  s3 {
    access_key_id => "MYACCESSKEYID"
    bucket => "sales-inbox"
    backup_to_bucket => "sales-archive"
    delete => "true"
    secret_access_key => "MYSECRETACCESSKEY"
    region => "eu-west-1"
    codec => plain
    prefix => "DEF"
    type => "DEF"
  }
}
filter {
  if [type] == "ABC" {
    csv {
      columns => ["Date","Units","ProductID","Country"]
      separator => ","
    }
    mutate {
      add_field => { "client_code" => "abc" }
    }
  }
  else if [type] == "DEF" {
    csv {
      columns => ["Date","Units","ProductID","Country"]
      separator => ","
    }
    mutate {
      add_field => { "client_code" => "def" }
    }
  }
  mutate {
    remove_field => [ "message" ]
  }
}
output {
  elasticsearch {
    codec => json
    hosts => "myelasticcluster.com:9200"
    index => "sales_%{client_code}"
    document_type => "sale"
  }
  stdout { codec => rubydebug }
}
Any guidance from those well versed in Logstash configs would be much appreciated!

Logstash - find length of split result inside mutate

I'm a newbie with Logstash. Currently I'm trying to parse a log in CSV format. I need to split a field on a whitespace delimiter, then I'll add new field(s) based on the split result.
Here is the filter I need to create:
filter {
  ...
  mutate {
    split => ["user", " "]
    if [user.length] == 2 {
      add_field => { "sourceUsername" => "%{user[0]}" }
      add_field => { "sourceAddress" => "%{user[1]}" }
    }
    else if [user.length] == 1 {
      add_field => { "sourceAddress" => "%{user[0]}" }
    }
  }
  ...
}
I got an error after the if script.
Please advise, is there any way to capture the length of the split result inside the mutate plugin?
Thanks,
Heri
According to your code example, I suppose that you are done with the CSV parsing and already have a field user whose value contains either a sourceAddress alone, or a sourceUsername and a sourceAddress separated by whitespace.
Now, there are a lot of filters that can be used to retrieve further fields. You don't need to use the mutate filter to split the field. In this case, a more flexible approach would be the grok filter.
Filter:
grok {
  match => {
    "user" => [
      "%{WORD:sourceUsername} %{IP:sourceAddress}",
      "%{IP:sourceAddress}"
    ]
  }
}
A field "user" => "192.168.0.99" would result in
"sourceAddress" => "191.168.0.99".
A field "user" => "Herry 192.168.0.99" would result in
"sourceUsername" => "Herry", "sourceAddress" => "191.168.0.99"
Of course you can change IP to WORD if your sourceAddress is not an IP.
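If you'd still rather split the field and branch on the length, note that conditionals can't be placed inside a mutate block; a ruby filter can do the whole thing instead (a rough sketch, assuming a Logstash version with the event.get/event.set API, i.e. 5.x or later):
filter {
  ruby {
    code => "
      parts = (event.get('user') || '').split(' ')
      if parts.length == 2
        event.set('sourceUsername', parts[0])
        event.set('sourceAddress', parts[1])
      elsif parts.length == 1
        event.set('sourceAddress', parts[0])
      end
    "
  }
}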
