Logstash : Mutate filter does not work - elasticsearch

I have the following filter
filter {
grok {
break_on_match => false
match => { 'message' => '\[(?<log_time>\d{0,2}\/\d{0,2}\/\d{2} \d{2}:\d{2}:\d{2}:\d{3} [A-Z]{3})\]%{SPACE}%{BASE16NUM}%{SPACE}%{WORD:system_stat}%{GREEDYDATA}\]%{SPACE}%{LOGLEVEL}%{SPACE}(?<log_method>[a-zA-Z\.]+)%{SPACE}-%{SPACE}%{GREEDYDATA:log_message}%{SPACE}#%{SPACE}%{IP:app_host}:%{INT:app_port};%{SPACE}%{GREEDYDATA}Host:%{IPORHOST:host_name}:%{POSINT:host_port}' }
match => { 'message' => '\[(?<log_time>\d{0,2}\/\d{0,2}\/\d{2} \d{2}:\d{2}:\d{2}:\d{3} [A-Z]{3})\]'}
}
kv{
field_split => "\n;"
value_split => "=:"
trimkey => "<>\[\],;\n"
trim => "<>\[\],;\n"
}
date{
match => [ "log_time","MM/dd/YY HH:mm:ss:SSS z" ]
target => "log_time"
locale => "en"
}
mutate {
convert => {
"line_number" => "integer"
"app_port" => "integer"
"host_port" => "integer"
"et" => "integer"
}
#remove_field => [ "message" ]
}
mutate {
rename => {
"et" => "execution_time"
"URI" => "uri"
"Method" => "method"
}
}
}
I can get results out of the grok and kv filters, but neither of the mutate filters works. Is it because of the kv filter?
EDIT: Purpose
My problem is that my log contains heterogeneous log records. For example:
[9/13/16 15:01:18:301 EDT] 89798797 SystemErr jbhsdbhbdv [vjnwnvurnuvuv] INFO djsbbdyebycbe - Filter.doFilter(..) took 0 ms.
[9/13/16 15:01:18:302 EDT] 4353453443 SystemErr sdgegrebrb [dbebtrntn] INFO sverbrebtnnrb - [SECURITY AUDIT] Received request from: "null" # wrvrbtbtbtf:000222; Headers=Host:vervreertherg:1111
Connection:keep-alive
User-Agent:Mozilla/5.0
Accept:text/css,*/*;q=0.1
Referer:https:kokokfuwnvuwnev/ikvdwninirnv/inwengi
Accept-Encoding:gzip
Accept-Language:en-US,en;q=0.8
; Body=; Method=GET; URI=dasd/wgomnwiregnm/iwenviewn; et=10ms; SC=200
All I care about is capturing the timestamp at the beginning of each record and a few other fields if they are present. I want Method, et, Host, loglevel and URI. If these fields are not present, I still want to capture the event with the loglevel and the message being logged.
Is it advisable to capture such events using the same Logstash process? Should I be running two Logstash processes? The problem is that I don't know the structure of the logs beforehand, apart from the few fields that I do want to capture.
Multiline config
path => ["path to log"]
start_position => "beginning"
ignore_older => 0
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^\[\d{0,2}\/\d{0,2}\/\d{2} \d{2}:\d{2}:\d{2}:\d{3} [A-Z]{3}\]"
negate => "true"
what => "previous"
}

Maybe it is because some fields (line_number, et, URI, Method) aren't being created during the initial grok. For example, I see you define "log_method" but in mutate->rename, you refer to "Method". Is there a json codec or something applied in the input block that adds these extra fields?
If you post sample logs, I can test them with your filter and help you more. :)
EDIT:
I see that the log you sent has multiple lines. Are you using a multiline filter on input? Could you share your input block as well?
You definitely don't need to run two Logstash processes. One Logstash instance can take care of multiple log formats. You can use conditionals, supply multiple grok match patterns, or mark optional parts of a pattern with ( )?.
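For example, a rough sketch of that approach (reusing patterns from your question; the conditional and field names are just placeholders) could look like this:
filter {
  grok {
    # break_on_match defaults to true, so the first pattern that matches wins:
    # try the detailed pattern first, then fall back to timestamp + raw message.
    match => { "message" => [
      "\[(?<log_time>\d{0,2}\/\d{0,2}\/\d{2} \d{2}:\d{2}:\d{2}:\d{3} [A-Z]{3})\]%{GREEDYDATA}Host:%{IPORHOST:host_name}:%{POSINT:host_port}",
      "\[(?<log_time>\d{0,2}\/\d{0,2}\/\d{2} \d{2}:\d{2}:\d{2}:\d{3} [A-Z]{3})\]%{SPACE}%{GREEDYDATA:log_message}"
    ] }
  }
  # Only touch optional fields when they were actually captured.
  if [host_port] {
    mutate { convert => { "host_port" => "integer" } }
  }
}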
MORE EDIT:
I'm getting output that implies that your mutate filters work:
"execution_time" => 10,
"uri" => "dasd/wgomnwiregnm/iwenviewn",
"method" => "GET"
once I changed trimkey => "<>\[\],;\n" to trimkey => "<>\[\],;( )?\n". I noticed that those fields (et, Method) were being prefixed with a space, which is why the rename was not finding them.
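In other words, the kv block becomes (same options as in the question, only trimkey changed):
kv {
  field_split => "\n;"
  value_split => "=:"
  trimkey => "<>\[\],;( )?\n"
  trim => "<>\[\],;\n"
}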
Note: I'm using the following multiline filter for testing; if yours is different, it could affect the outcome. Let me know if that helps.
codec => multiline {
pattern => "\n"
negate => true
what => previous
}

Related

How to filter data with Logstash before storing parsed data in Elasticsearch

I understand that Logstash is for aggregating and processing logs. I have NGINX logs and had my Logstash config set up as:
filter {
grok {
match => [ "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
overwrite => [ "message" ]
}
mutate {
convert => ["response", "integer"]
convert => ["bytes", "integer"]
convert => ["responsetime", "float"]
}
geoip {
source => "clientip"
target => "geoip"
add_tag => [ "nginx-geoip" ]
}
date {
match => [ "timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
remove_field => [ "timestamp" ]
}
useragent {
source => "agent"
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "weblogs-%{+YYYY.MM}"
document_type => "nginx_logs"
}
stdout { codec => rubydebug }
}
This would parse the unstructured logs into a structured form of data, and store the data into monthly indexes.
What I discovered is that the majority of logs were contributed by robots/web-crawlers. In Python I would filter them out with:
browser_names = browser_names[~browser_names.str.\
match('^[\w\W]*(google|bot|spider|crawl|headless)[\w\W]*$', na=False)]
However, I would like to filter them out with Logstash so I can save a lot of disk space on the Elasticsearch server. Is there a way to do that? Thanks in advance!
Thanks to LeBigCat for generously giving a hint. I solved this problem by adding the following under the filter block:
if [browser_names] =~ /(?i)^[\w\W]*(google|bot|spider|crawl|headless)[\w\W]*$/ {
drop {}
}
The (?i) flag is for case-insensitive matching.
In your filter you can ask for a drop (https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html). As you already have your pattern, it should be pretty fast ;)

Logstash Filtering and Parsing Deis Output

Environment
Ubuntu 16.04
Logstash 5.2.1
ElasticSearch 5.1
I've configured our Deis platform to send logs to our Logstash node with no issues. However, I'm still new to Ruby, and regexes are not my strong suit.
Log Example:
2017-02-15T14:55:24UTC deis-logspout[1]: 2017/02/15 14:55:24 routing all to udp://x.x.x.x:xxxx\n
Logstash Configuration:
input {
tcp {
port => 5000
type => syslog
codec => plain
}
udp {
port => 5000
type => syslog
codec => plain
}
}
filter {
json {
source => "syslog_message"
}
}
output {
elasticsearch { hosts => ["foo.somehost"] }
}
Elasticsearch output:
"#timestamp" => 2017-02-15T14:55:24.408Z,
"#version" => "1",
"host" => "x.x.x.x",
"message" => "2017-02-15T14:55:24UTC deis-logspout[1]: 2017/02/15 14:55:24 routing all to udp://x.x.x.x:xxxx\n",
"type" => "json"
Desired outcome:
"#timestamp" => 2017-02-15T14:55:24.408Z,
"#version" => "1",
"host" => "x.x.x.x",
"type" => "json"
"container" => "deis-logspout"
"severity level" => "Info"
"message" => "routing all to udp://x.x.x.x:xxxx\n"
How can I extract the information out of the message into their individual fields?
Unfortunately your assumptions about what you are trying to do are slightly off, but we can fix that!
You set up a json filter, but you are not parsing JSON. You are parsing a log that is bastardized syslog (see syslogStreamer in the source) but is not in fact in syslog format (either RFC 5424 or 3164). Logstash afterwards provides JSON output.
Let's break down the message, which becomes the source that you parse. The key is that you have to parse the message from front to back.
Message:
2017-02-15T14:55:24UTC deis-logspout[1]: 2017/02/15 14:55:24 routing all to udp://x.x.x.x:xxxx\n
2017-02-15T14:55:24UTC: The timestamp is a common grok pattern. This one mostly follows TIMESTAMP_ISO8601, but not quite: the trailing time zone has to be matched separately.
deis-logspout[1]: This would be your logsource, which you can name container. You can use the grok pattern URIHOST.
routing all to udp://x.x.x.x:xxxx\n: Since the interesting message text usually sits at the end of the line, you can use the grok pattern GREEDYDATA, which is the equivalent of .* in a regular expression.
2017/02/15 14:55:24: Another timestamp (why?) that doesn't match common grok patterns.
With grok filters, you can map a syntax (an abstraction over regular expressions) to a semantic (a name for the value that you extract), for example %{URIHOST:container}.
You'll see I did some hacking together of the grok filters to make the formatting work. You have to match parts of the text even if you don't intend to capture the results. If you can't change the formatting of the timestamps to match standards, create a custom pattern.
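For example, recent versions of the grok filter let you define a custom pattern inline via pattern_definitions (older versions need a patterns_dir file instead). This is only a sketch; DEIS_TS is a made-up name for the second, non-standard timestamp:
grok {
  # DEIS_TS names the "2017/02/15 14:55:24" style timestamp so the main pattern stays readable.
  pattern_definitions => { "DEIS_TS" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
  match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}(UTC|CST|EST|PST) %{URIHOST:container}\[%{NUMBER}\]: %{DEIS_TS} %{GREEDYDATA:msg}" }
}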
Configuration:
input {
tcp {
port => 5000
type => deis
}
udp {
port => 5000
type => deis
}
}
filter {
grok {
match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}(UTC|CST|EST|PST) %{URIHOST:container}\[%{NUMBER}\]: %{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME} %{GREEDYDATA:msg}" }
}
}
output {
elasticsearch { hosts => ["foo.somehost"] }
}
Output:
{
"container" => "deis-logspout",
"msg" => "routing all to udp://x.x.x.x:xxxx",
"#timestamp" => 2017-02-22T23:55:28.319Z,
"port" => 62886,
"#version" => "1",
"host" => "10.0.2.2",
"message" => "2017-02-15T14:55:24UTC deis-logspout[1]: 2017/02/15 14:55:24 routing all to udp://x.x.x.x:xxxx",
"timestamp" => "2017-02-15T14:55:24"
"type" => "deis"
}
You can additionally mutate the event to drop @timestamp, host, etc., as these are provided by Logstash by default. Another suggestion is to use the date filter to convert any timestamps found into usable formats (better for searching).
Depending on the log formatting, you may have to slightly alter the pattern; I only had one example to go off of. This approach also keeps the original full message intact, because writing a captured value into an existing field of the same name would overwrite it.
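For instance, a date filter along these lines would populate @timestamp from the parsed value. This is only a sketch: it uses the timestamp field captured above, which has had its time-zone suffix stripped, so the zone here is an assumption.
date {
  # "timestamp" was captured without its zone suffix; assuming UTC, adjust as needed.
  match => [ "timestamp", "yyyy-MM-dd'T'HH:mm:ss" ]
  timezone => "UTC"
  target => "@timestamp"
}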
Resources:
Grok
Grok Patterns
Grok Debugger

Logstash config - S3 Filenames?

I'm attempting to use Logstash to collect sales data for multiple clients from multiple vendors.
So far I've got an S3 (inbox) bucket that I can drop my files (currently CSVs) into, and according to the client code prefix on the file, the data gets pushed into an Elasticsearch index for each client. This all works nicely.
The problem I have is that I have data from multiple vendors and need a way of identifying which file is from which vendor. Adding an extra column to the CSV is not an option, so my plan was to add it to the filename, ending up with a file-naming convention something like clientcode_vendorcode_reportdate.csv.
I may be missing something, but it seems that on S3 I can't get access to the filename inside my config, which seems crazy given that the prefix is clearly being read at some point. I was intending to use a Grok or Ruby filter to simply split the filename on the underscore, giving me three key variables that I can use in my config, but all attempts so far have failed. I can't even seem to get the full source_path or filename as a variable.
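For illustration, the kind of filter I had in mind looks roughly like this (a sketch only; s3_key is just a placeholder for the filename field I haven't been able to get hold of):
filter {
  grok {
    # Split clientcode_vendorcode_reportdate.csv into its three parts.
    # "s3_key" is a placeholder - this is exactly the value I can't access.
    match => { "s3_key" => "(?<client_code>[^_]+)_(?<vendor_code>[^_]+)_(?<report_date>[^.]+)\.csv" }
  }
}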
My config so far looks something like this - I've removed failed attempts at using Grok/Ruby filters.
input {
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "ABC"
type => "ABC"
}
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "DEF"
type => "DEF"
}
}
filter {
if [type] == "ABC" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "abc" }
}
}
else if [type] == "DEF" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "def" }
}
}
mutate
{
remove_field => [ "message" ]
}
}
output {
elasticsearch {
codec => json
hosts => "myelasticcluster.com:9200"
index => "sales_%{client_code}"
document_type => "sale"
}
stdout { codec => rubydebug }
}
Any guidance from those well versed in Logstash configs would be much appreciated!

Logstash - find length of split result inside mutate

I'm a newbie with Logstash. Currently I'm trying to parse a log in CSV format. I need to split a field on a whitespace delimiter, then I'll add new field(s) based on the split result.
Here is the filter I need to create:
filter {
...
mutate {
split => ["user", " "]
if [user.length] == 2 {
add_field => { "sourceUsername" => "%{user[0]}" }
add_field => { "sourceAddress" => "%{user[1]}" }
}
else if [user.length] == 1 {
add_field => { "sourceAddress" => "%{user[0]}" }
}
}
...
}
I get an error after the if script.
Please advise: is there any way to capture the length of the split result inside the mutate plugin?
Thanks,
Heri
Judging by your code example, I assume you are done with the CSV parsing and already have a field user whose value contains either just a sourceAddress, or a sourceUsername and a sourceAddress separated by whitespace.
Now, there are a lot of filters that can be used to retrieve further fields. You don't need to use the mutate filter to split the field. In this case, a more flexible approach would be the grok filter.
Filter:
grok {
match => {
"user" => [
"%{WORD:sourceUsername} %{IP:sourceAddress}",
"%{WORD:sourceUsername}"
]
}
}
A field "user" => "192.168.0.99" would result in
"sourceAddress" => "191.168.0.99".
A field "user" => "Herry 192.168.0.99" would result in
"sourceUsername" => "Herry", "sourceAddress" => "191.168.0.99"
Of course you can change IP to WORD if your sourceAddress is not an IP.

Data type conversion using logstash grok

Basic is a float field. The mentioned index is not present in Elasticsearch. When running the config file with logstash -f, I get no exception, yet the data indexed into Elasticsearch shows the mapping of Basic as a string. How do I rectify this? And how do I do this for multiple fields?
input {
file {
path => "/home/sagnik/work/logstash-1.4.2/bin/promosms_dec15.csv"
type => "promosms_dec15"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
grok{
match => [
"Basic", " %{NUMBER:Basic:float}"
]
}
csv {
columns => ["Generation_Date","Basic"]
separator => ","
}
ruby {
code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
}
}
output {
elasticsearch {
action => "index"
host => "localhost"
index => "promosms-%{+dd.MM.YYYY}"
workers => 1
}
}
You have two problems. First, your grok filter is listed prior to the csv filter and because filters are applied in order there won't be a "Basic" field to convert when the grok filter is applied.
Secondly, unless you explicitly allow it, grok won't overwrite existing fields. In other words,
grok{
match => [
"Basic", " %{NUMBER:Basic:float}"
]
}
will always be a no-op. Either specify overwrite => ["Basic"] or, preferably, use mutate's type conversion feature:
mutate {
convert => ["Basic", "float"]
}
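Putting that together (and covering the "multiple fields" part of the question), the filter section could be reordered along the following lines. This is only a sketch based on the original config, with the Generation_Date handling left as-is:
filter {
  csv {
    columns => ["Generation_Date","Basic"]
    separator => ","
  }
  # Convert after the csv filter has created the fields; additional
  # field/type pairs can be listed the same way.
  mutate {
    convert => ["Basic", "float"]
    # convert => ["AnotherNumericColumn", "integer"]   # hypothetical extra column
  }
  ruby {
    code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
  }
}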
