Logstash config - S3 Filenames? - ruby

I'm attempting to use Logstash to collect sales data for multiple clients from multiple vendors.
So far i've an S3 (Inbox) bucket that I can drop my files (currently CSVs) into and according the client code prefix on the file, the data gets pushed into an Elastic index for each client. This all works nicely.
The problem I have is that I have data from multiple vendors and need a way of identifying which file is from which vendor. Adding an extra column to the CSV is not an option, so my plan was to add this to the filename, so i'd end up with a filenaming convention something like clientcode_vendorcode_reportdate.csv.
I may be missing something, but it seems that on S3 I can't get access to the filename inside my config, which seems crazy given that the prefix is clearly being read at some point. I was intending to use a Grok or Ruby filter to simply split the filename on the underscore, giving me three key variables that I can use in my config, but all attempts so far have failed. I can't even seem to get the full source_path or filename as a variable.
My config so far looks something like this - i've removed failed attempts at using Grok/Ruby filters.
input {
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "ABC"
type => "ABC"
}
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "DEF"
type => "DEF"
}
}
filter {
if [type] == "ABC" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "abc" }
}
}
else if [type] == "DEF" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "def" }
}
}
mutate
{
remove_field => [ "message" ]
}
}
output {
elasticsearch {
codec => json
hosts => "myelasticcluster.com:9200"
index => "sales_%{client_code}"
document_type => "sale"
}
stdout { codec => rubydebug }
}
An guidance from those well versed in Logstash configs would be much appreciated!

Related

Read a CSV in Logstash level and filter on basis of the extracted data

I am using Metricbeat to get process-level data and push it to Elastic Search using Logstash.
Now, the aim is to categorize the processes into 2 tags i.e the process running is either a browser or it is something else.
I am able to do that statically using this block of code :
input {
beats {
port => 5044
}
}
filter{
if [process][name]=="firefox.exe" or [process][name]=="chrome.exe" {
mutate {
add_field => { "process.type" => "browsers" }
convert => {
"process.type" => "string"
}
}
}
else {
mutate {
add_field => { "process.type" => "other" }
}
}
}
output {
elasticsearch {
hosts => "localhost:9200"
# manage_template => false
index => "metricbeatlogstash"
}
}
But when I try to make that if condition dynamic by reading the process list from a CSV, I am not getting any valid results in Kibana, nor a error on my LogStash level.
The CSV config file code is as follows :
input {
beats {
port => 5044
}
file{
path=>"filePath"
start_position=>"beginning"
sincedb_path=>"NULL"
}
}
filter{
csv{
separator=>","
columns=>["processList","IT"]
}
if [process][name] in [processList] {
mutate {
add_field => { "process.type" => "browsers" }
convert => {
"process.type" => "string"
}
}
}
else {
mutate {
add_field => { "process.type" => "other" }
}
}
}
output {
elasticsearch {
hosts => "localhost:9200"
# manage_template => false
index => "metricbeatlogstash2"
}
}
What you are trying to do does not work that way in logstash, the events in a logstash pipeline are independent from each other.
The events received by your beats input have no knowledge about the events received by your csv input, so you can't use fields from different events in a conditional.
To do what you want you can use the translate filter with the following config.
translate {
field => "[process][name]"
destination => "[process][type]"
dictionary_path => "process.csv"
fallback => "others"
refresh_interval => 300
}
This filter will check the value of the field [process][name] against a dictionary, loaded into memory from the file process.csv, the dictionary is a .csv file with two columns, the first is the name of the browser process and the second is always browser.
chrome.exe,browser
firefox.exe,browser
If the filter got a match, it will populate the field [process][type] (not process.type) with the value from the second column, in this case, always browser, if there is no match, it will populate the field [process][type] with the value of the fallback config, in this case, others, it will also reload the content of the process.csv file every 300 seconds (5 minutes)

how filter {"foo":"bar", "bar": "foo"} with grok to get only foo field?

I copied
{"name":"myapp","hostname":"banana.local","pid":40161,"level":30,"msg":"hi","time":"2013-01-04T18:46:23.851Z","v":0}
from https://github.com/trentm/node-bunyan and save it as my logs.json. I am trying to import only two fields (name and msg) to ElasticSearch via LogStash. The problem is that I depend on a sort of filter that I am not able to accomplish. Well I have successfully imported such line as a single message but certainly it is not worth in my real case.
That said, how can I import only name and msg to ElasticSearch? I tested several alternatives using http://grokdebug.herokuapp.com/ to reach an useful filter with no success at all.
For instance, %{GREEDYDATA:message} will bring the entire line as an unique message but how to split it and ignore all other than name and msg fields?
At the end, I am planing to use here:
input {
file {
type => "my_type"
path => [ "/home/logs/logs.log" ]
codec => "json"
}
}
filter {
grok {
match => { "message" => "data=%{GREEDYDATA:request}"}
}
#### some extra lines here probably
}
output
{
elasticsearch {
codec => json
hosts => "http://127.0.0.1:9200"
index => "indextest"
}
stdout { codec => rubydebug }
}
I have just gone through the list of available Logstash filters. The prune filter should match your need.
Assume you have installed the prune filter, your config file should look like:
input {
file {
type => "my_type"
path => [ "/home/logs/logs.log" ]
codec => "json"
}
}
filter {
prune {
whitelist_names => [
"#timestamp",
"type",
"name",
"msg"
]
}
}
output {
elasticsearch {
codec => json
hosts => "http://127.0.0.1:9200"
index => "indextest"
}
stdout { codec => rubydebug }
}
Please be noted that you will want to keep type for Elasticsearch to index it into a correct type. #timestamp is required if you will view the data on Kibana.

Logstash - find length of split result inside mutate

I'm newbie with Logstash. Currently i'm trying to parse a log in CSV format. I need to split a field with whitespace delimiter, then i'll add new field(s) based on split result.
Here is the filter i need to create:
filter {
...
mutate {
split => ["user", " "]
if [user.length] == 2 {
add_field => { "sourceUsername" => "%{user[0]}" }
add_field => { "sourceAddress" => "%{user[1]}" }
}
else if [user.length] == 1 {
add_field => { "sourceAddress" => "%{user[0]}" }
}
}
...
}
I got error after the if script.
Please advice, is there any way to capture the length of split result inside mutate plugin.
Thanks,
Heri
According to your code example I suppose that you are done with csv parsing and you already have a field user which has either a value that contains a sourceAddress or a value that contains a sourceUsername sourceAddress (separated by whitespace).
Now, there are a lot of filters that can be used to retrieve further fields. You don't need to use the mutate filter to split the field. In this case, a more flexible approach would be the grok filter.
Filter:
grok {
match => {
"user" => [
"%{WORD:sourceUsername} %{IP:sourceAddress}",
"%{WORD:sourceUsername}"
]
}
}
A field "user" => "192.168.0.99" would result in
"sourceAddress" => "191.168.0.99".
A field "user" => "Herry 192.168.0.99" would result in
"sourceUsername" => "Herry", "sourceAddress" => "191.168.0.99"
Of course you can change IP to WORD if your sourceAddress is not an IP.

Data type conversion using logstash grok

Basic is a float field. The mentioned index is not present in elasticsearch. When running the config file with logstash -f, I am getting no exception. Yet, the data reflected and entered in elasticsearch shows the mapping of Basic as string. How do I rectify this? And how do I do this for multiple fields?
input {
file {
path => "/home/sagnik/work/logstash-1.4.2/bin/promosms_dec15.csv"
type => "promosms_dec15"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
grok{
match => [
"Basic", " %{NUMBER:Basic:float}"
]
}
csv {
columns => ["Generation_Date","Basic"]
separator => ","
}
ruby {
code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
}
}
output {
elasticsearch {
action => "index"
host => "localhost"
index => "promosms-%{+dd.MM.YYYY}"
workers => 1
}
}
You have two problems. First, your grok filter is listed prior to the csv filter and because filters are applied in order there won't be a "Basic" field to convert when the grok filter is applied.
Secondly, unless you explicitly allow it, grok won't overwrite existing fields. In other words,
grok{
match => [
"Basic", " %{NUMBER:Basic:float}"
]
}
will always be a no-op. Either specify overwrite => ["Basic"] or, preferably, use mutate's type conversion feature:
mutate {
convert => ["Basic", "float"]
}

reparsing a logstash record? fix extracts?

I'm taking a JSON message (Cloudtrail, many objects concatenated together) and by the time I'm done filtering it, Logstash doesn't seem to be parsing the message correctly. It's as if the hash was simply dumped into a string.
Anyhow, here's the input and filter.
input {
s3 {
bucket => "stanson-ops"
delete => false
#snipped unimportant bits
type => "cloudtrail"
}
}
filter {
if [type] == "cloudtrail" {
json { # http://logstash.net/docs/1.4.2/filters/json
source => "message"
}
ruby {
code => "event['RecordStr'] = event['Records'].join('~~~')"
}
split {
field => "RecordStr"
terminator => "~~~"
remove_field => [ "message", "Records" ]
}
}
}
By the time I'm done, elasticsearch entries include a RecordStr key with the following data. It doesn't have a message field, nor does it have a Records field.
{"eventVersion"=>"1.01", "userIdentity"=>{"type"=>"IAMUser", "principalId"=>"xxx"}}
Note that is not JSON style, it's been parsed. (which is important for the concat->split thing to work).
So, the RecordStr key looks not quite right as one value. Further, in Kibana, filterable fields include RecordStr (no subfields). It includes some entries that aren't there anymore: Records.eventVersion, Records.userIdentity.type.
Why is that? How can I get the proper fields?
edit 1 here's part of the input.
{"Records":[{"eventVersion":"1.01","userIdentity":{"type":"IAMUser",
It's unprettified JSON. It appears the body of the file (the above) is in the message field, json extracts it and I end up with an array of records in the Records field. That's why I join and split it- I then end up with individual documents, each with a single RecordStr entry. However, the template(?) doesn't seem to understand the new structure.
I've worked out a method that allows for indexing the appropriate CloudTrail fields as you requested. Here are the modified input and filter configs:
input {
s3 {
backup_add_prefix => \"processed-logs/\"
backup_to_bucket => \"test-bucket\"
bucket => \"test-bucket\"
delete => true
interval => 30
prefix => \"AWSLogs/<account-id>/CloudTrail/\"
type => \"cloudtrail\"
}
}
filter {
if [type] == \"cloudtrail\" {
json {
source => \"message\"
}
ruby {
code => \"event.set('RecordStr', event.get('Records').join('~~~'))\"
}
split {
field => \"RecordStr\"
terminator => \"~~~\"
remove_field => [ \"message\", \"Records\" ]
}
mutate {
gsub => [
\"RecordStr\", \"=>\", \":\"
]
}
mutate {
gsub => [
\"RecordStr\", \"nil\", \"null\"
]
}
json {
skip_on_invalid_json => true
source => \"RecordStr\"
target => \"cloudtrail\"
}
mutate {
add_tag => [\"cloudtrail\"]
remove_field=>[\"RecordStr\", \"#version\"]
}
date {
match => [\"[cloudtrail][eventTime]\",\"ISO8601\"]
}
}
}
The key observation here is that once the split is done we no longer possess valid json in the event and are therefore required to execute the mutate replacements ('=>' to ':' and 'nil' to 'null'). Additionally, I found it useful to get the timestamp out of the CloudTrail eventTime and do some cleanup of unnecessary fields.

Resources