Logstash Doesn't Read Entire Line With File Input - filter

I'm using Logstash and I'm having trouble getting a rather simple configuration to work.
input {
file {
path => "C:/path/test-data/*.log"
start_position => beginning
type => "usage_data"
}
}
filter {
if [type] == "usage_data" {
grok {
match => { "message" => "^\s*%{NUMBER:lineNumber}\s+%{TIMESTAMP_ISO8601:date},(?<value1>[A-Za-z0-9+/]+),(?<value2>[A-Za-z0-9+/]+),(?<value3>[A-Za-z0-9+/]+),(?<value4>[^,]+),(?<value5>[^\r]*)" }
}
}
if "_grokparsefailure" not in [tags] {
drop { }
}
}
output {
stdout { codec => rubydebug }
}
I call Logstash like this:
SET LS_MAX_MEM=2g
DEL "%USERPROFILE%\.sincedb_*" 2> NUL
"C:\Program Files (x86)\logstash-1.4.1\bin\logstash.bat" agent -p "C:\path\\." -w 1 -f "logstash.conf"
The output:
Using milestone 2 input plugin 'file'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.1/plugin-milestones {:level=>:warn}
{
"message" => ",",
"#version" => "1",
"#timestamp" => "2014-11-20T09:16:08.591Z",
"type" => "usage_data",
"host" => "my-machine",
"path" => "C:/path/test-data/monitor_20141116223000.log",
"tags" => [
[0] "_grokparsefailure"
]
}
If I parse only C:\path\test-data\monitor_20141116223000.log all lines are read and there is no grokparsefailure. If I remove C:\path\test-data\monitor_20141116223000.log the same grokparsefailure pops up in another log-file:
{
"message" => "atches in another context\r",
"#version" => "1",
"#timestamp" => "2014-11-20T09:14:04.779Z",
"type" => "usage_data",
"host" => "my-machine",
"path" => "C:/path/test-data/monitor_20140829235900.log",
"tags" => [
[0] "_grokparsefailure"
]
}
The last output in particular shows that Logstash doesn't read the entire line, or interprets a newline where there is none. It always breaks the same line at the same position.
Maybe I should add that the log files use \n as the line separator and that I'm running Logstash on Windows. However, I'm not getting a whole lot of errors, just that one, and there are quite a lot of lines in there. They all appear properly when I remove the if "_grokparsefailure" ... conditional.
I assume that there is some problem with buffering, but I have no clue how to make this work. Any ideas?
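One way to rule the grok pattern itself in or out, leaving only the file input's read behaviour as a suspect, is to feed the offending file to the same filter through a stdin input. A minimal sketch (the config file name and the use of the Windows type command are my own assumptions, not part of the original setup):
type "C:\path\test-data\monitor_20141116223000.log" | "C:\Program Files (x86)\logstash-1.4.1\bin\logstash.bat" agent -f "stdin-test.conf"
with stdin-test.conf containing:
input {
  stdin { type => "usage_data" }
}
filter {
  grok {
    # same pattern as in the file-input configuration above
    match => { "message" => "^\s*%{NUMBER:lineNumber}\s+%{TIMESTAMP_ISO8601:date},(?<value1>[A-Za-z0-9+/]+),(?<value2>[A-Za-z0-9+/]+),(?<value3>[A-Za-z0-9+/]+),(?<value4>[^,]+),(?<value5>[^\r]*)" }
  }
}
output {
  stdout { codec => rubydebug }
}
If every line parses cleanly this way, the pattern is fine and the problem lies in how the file input hands lines to the pipeline.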

Workaround:
# diff -Nur /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb.orig /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb
--- /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb.orig 2015-02-25 10:46:06.916321816 +0700
+++ /opt/logstash/vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/lib/filewatch/tail.rb 2015-02-12 18:39:34.943833909 +0700
@@ -86,7 +86,9 @@
         _read_file(path, &block)
         @files[path].close
         @files.delete(path)
-        @statcache.delete(path)
+        #@statcache.delete(path)
+        inode = @statcache.delete(path)
+        @sincedb[inode] = 0
       else
         @logger.warn("unknown event type #{event} for #{path}")
       end
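In plain terms, the patch keeps the inode of the file that just disappeared and resets its sincedb offset to zero, so the content is read again from the start if the file reappears. The three changed lines, with my own comments added, amount to:
#@statcache.delete(path)            # original line, kept but commented out
inode = @statcache.delete(path)     # remember the inode of the vanished file
@sincedb[inode] = 0                 # reset its sincedb offset so it is read again from byte 0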

Related

Logstash aggregate fields

I am trying to configure Logstash to aggregate similar syslog messages based on the message field within a specific time window.
To make my case clear, here is an example of what I would like to do.
Example: I have these junk syslog messages coming through my Logstash:
timestamp   message
13:54:24    hello
13:54:35    hello
What I would like is a condition that checks whether the messages are the same and occur within a specific timespan (for example 10 minutes); if so, I would like to aggregate them into one row and increase the count.
The output I am expecting to see is as follows:
timestamp   message   count
13:54:35    hello     2
I know there is a way to aggregate fields, but I was wondering whether this aggregation can be done based on a specific time range.
If anyone can help me I would be extremely grateful, as I am new to Logstash; my server is receiving tons of junk syslog messages and I would like to reduce that amount.
So far I have done some cleaning with this configuration:
input {
syslog {
port => 514
}
}
filter {
prune {
whitelist_names =>["timestamp","message","newfield"]
}
mutate {
add_field => {"newfield" => "%{@timestamp}%{message}"}
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash_index"
}
stdout {
codec => rubydebug
}
}
Now I just need to do the aggregation.
Thank you so much for your help guys
EDIT:
Following the documentation, I put in place this configuration:
input {
syslog {
port => 514
}
}
filter {
prune {
whitelist_names =>["timestamp","message","newfield"]
}
mutate {
add_field => {"newfield" => "%{@timestamp}%{message}"}
}
if [message] =~ "MESSAGE FROM" {
aggregate {
task_id => "%{message}"
code => "map['message'] ||= 0; map['message'] += 1;"
push_map_as_event_on_timeout => true
timeout_task_id_field => "message"
timeout => 60
inactivity_timeout => 50
timeout_tags => ['_aggregatetimeout']
timeout_code => "event.set('count_message', event.get('message') > 1)"
}
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash_index"
}
stdout {
codec => rubydebug
}
}
I don't get any errors, but the output is not what I am expecting.
What actually happens is that a tags field is created (good), containing an array with _aggregatetimeout and _aggregateexception:
{
"message" => "<88>MESSAGE FROM\r\n",
"tags" => [
[0] "_aggregatetimeout",
[1] "_aggregateexception"
],
"#timestamp" => 2021-07-23T12:10:45.646Z,
"#version" => "1"
}
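The _aggregateexception is most likely caused by the counter and the task id sharing the field name message: the code block stores an integer under map['message'], timeout_task_id_field then writes the original message string into that same field of the pushed event, and timeout_code ends up comparing a string with a number. A sketch that keeps the count in a field of its own and uses the 10-minute window from the question (the field name count and the 600-second timeout are assumptions on my part):
filter {
  if [message] =~ "MESSAGE FROM" {
    aggregate {
      task_id => "%{message}"
      code => "
        map['count'] ||= 0
        map['count'] += 1
        event.cancel()                      # drop the individual events; only the summary survives
      "
      push_map_as_event_on_timeout => true
      timeout_task_id_field => "message"    # the summary event gets the original message back
      timeout => 600                        # 10-minute window
      timeout_tags => ['_aggregatetimeout']
    }
  }
}
The pushed summary event then carries message, count, and the timeout tag, which is close to the timestamp / message / count row described above.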

Grok parse error while parsing multiple line messages

I am trying to figure out a grok pattern for parsing multi-line messages such as exception traces; below is one such log:
2017-03-30 14:57:41 [12345] [qtp1533780180-12] ERROR com.app.XYZ - Exception occurred while processing
java.lang.NullPointerException: null
at spark.webserver.MatcherFilter.doFilter(MatcherFilter.java:162)
at spark.webserver.JettyHandler.doHandle(JettyHandler.java:61)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:189)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
at org.eclipse.jetty.server.Server.handle(Server.java:517)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:302)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:245)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
at java.lang.Thread.run(Thread.java:745)
Here is my logstash.conf
input {
file {
path => ["/debug.log"]
codec => multiline {
# Grok pattern names are valid! :)
pattern => "^%{TIMESTAMP_ISO8601} "
negate => true
what => previous
}
}
}
filter {
mutate {
gsub => ["message", "r", ""]
}
grok {
match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} \[%{NOTSPACE:uid}\] \[%{NOTSPACE:thread}\] %{LOGLEVEL:loglevel} %{DATA:class}\-%{GREEDYDATA:message}" ]
overwrite => [ "message" ]
}
date {
match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss" ]
}
}
output {
elasticsearch { hosts => localhost }
stdout { codec => rubydebug }
}
This works fine for parsing single-line logs, but fails with
[0] "_grokparsefailure"
for multi-line exception traces.
Can someone please suggest the correct filter pattern for parsing multi-line logs?
If you are working with multi-line logs, use the multiline filter provided by Logstash. You first need to tell the multiline filter how to recognize the start of a new record; from your logs I can see that a new record starts with a timestamp. Below is an example usage:
filter {
multiline {
type => "/debug.log"
pattern => "^%{TIMESTAMP}"
what => "previous"
}
}
You can then use gsub to replace the "\n" and "\r" that the multiline filter adds to your record. After that, use grok.
The above logstash config worked fine after removing
mutate {
gsub => ["message", "r", ""]
}
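Note the pattern in the removed block: "r" with no backslash strips every literal letter r from the message, not the carriage return the answer above refers to. If stripping \r is still wanted, a correctly escaped version would look like this (a sketch, not taken from the original question):
mutate {
  # remove carriage returns rather than the letter "r"
  gsub => ["message", "\r", ""]
}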
So here is the working Logstash config for parsing both single-line and multi-line input for the above log pattern:
input {
file {
path => ["./debug.log"]
codec => multiline {
# Grok pattern names are valid! :)
pattern => "^%{TIMESTAMP_ISO8601} "
negate => true
what => previous
}
}
}
filter {
grok {
match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} \[%{NOTSPACE:uid}\] \[%{NOTSPACE:thread}\] %{LOGLEVEL:loglevel} %{DATA:class}\-%{GREEDYDATA:message}" ]
overwrite => [ "message" ]
}
date {
match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss" ]
}
}
output {
elasticsearch { hosts => localhost }
stdout { codec => rubydebug }
}

How to force encoding for Logstash filters? (umlauts from message not recognized)

I am trying to import historical log data into ElasticSearch (Version 5.2.2) using Logstash (Version 5.2.1) - all running under Windows 10.
Sample log file
The sample log file I am importing looks like this:
07.02.2017 14:16:42 - Critical - General - Ähnlicher Fehler mit übermäßger Ödnis
08.02.2017 14:13:52 - Critical - General - ästhetisch überfällige Fleißarbeit
Working configuration
For starters I tried the following simple Logstash configuration (it's running on Windows so don't get confused by the mixed slashes ;)):
input {
file {
path => "D:/logstash/bin/*.log"
sincedb_path => "C:\logstash\bin\file_clientlogs_lastrun"
ignore_older => 999999999999
start_position => "beginning"
stat_interval => 60
type => "clientlogs"
}
}
output {
if [type] == "clientlogs" {
elasticsearch {
index => "logstash-clientlogs"
}
}
}
And this works fine: I see that the input gets read line by line into the index I specified, and when I check with Kibana those two lines show up as expected (host name omitted).
More complex (not working) configuration
But of course this is still pretty flat data, and I really want to extract the proper timestamps and the other fields from my lines and replace @timestamp and message with them; so I inserted some filter logic involving the grok, mutate, and date filters between input and output, so that the resulting configuration looks like this:
input {
file {
path => "D:/logs/*.log"
sincedb_path => "C:\logstash\bin\file_clientlogs_lastrun"
ignore_older => 999999999999
start_position => "beginning"
stat_interval => 60
type => "clientlogs"
}
}
filter {
if [type] == "clientlogs" {
grok {
match => [ "message", "%{MONTHDAY:monthday}.%{MONTHNUM2:monthnum}.%{YEAR:year} %{TIME:time} - %{WORD:severity} - %{WORD:granularity} - %{GREEDYDATA:logmessage}" ]
}
mutate {
add_field => {
"timestamp" => "%{year}-%{monthnum}-%{monthday} %{time}"
}
replace => [ "message", "%{logmessage}" ]
remove_field => ["year", "monthnum", "monthday", "time", "logmessage"]
}
date {
locale => "en"
match => ["timestamp", "YYYY-MM-dd HH:mm:ss"]
timezone => "Europe/Vienna"
target => "#timestamp"
add_field => { "debug" => "timestampMatched"}
}
}
}
output {
if [type] == "clientlogs" {
elasticsearch {
index => "logstash-clientlogs"
}
}
}
Now, when I look at those logs in Kibana, I see that the fields I wanted to add do appear and that the timestamp and message are replaced correctly, but my umlauts are all gone.
Forcing charset in input and output
I also tried setting
codec => plain {
charset => "UTF-8"
}
for input and output, but that also did not change anything for the better.
Different output-type
When I change the output to stdout { }, the output seems okay:
2017-02-07T13:16:42.000Z MYPC Ähnlicher Fehler mit übermäßger Ödnis
2017-02-08T13:13:52.000Z MYPC ästhetisch überfällige Fleißarbeit
Querying without Kibana
I also queried against the index using this PowerShell-command:
Invoke-WebRequest -Method POST -Uri 'http://localhost:9200/logstash-clientlogs/_search' -Body '
{
"query":
{
"regexp": {
"message" : ".*"
}
}
}
' | select -ExpandProperty Content
But it also returns the same messed up contents Kibana reveals:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"logstash-clientlogs","_type":"clientlogs","_id":"AVskdTS8URonc
bfBgFwC","_score":1.0,"_source":{"severity":"Critical","debug":"timestampMatched","message":"�hnlicher Fehler mit �berm��ger �dnis\r","type":"clientlogs","path":"D:/logs/Client.log","#timestamp":"2017-02-07T13:16:42.000Z","granularity":"General","#version":"1","host":"MYPC","timestamp":"2017-02-07 14:16:42"}},{"_index":"logstash-clientlogs","_type":"clientlogs","_id":"AVskdTS8UR
oncbfBgFwD","_score":1.0,"_source":{"severity":"Critical","debug":"timestampMatched","message":"�sthetisch �berf�llige Flei�arbeit\r","type":"clientlogs","path":"D:/logs/Client.log","#timestamp":"2017-02-08T13:13:52.000Z","granularity":"General","#version":"1","host":"MYPC","timestamp":"2017-02-08 14:13:52"}}]}}
Has anyone else experienced this and has a solution for this use-case? I don't see any setting for grok to specify any encoding (the file I am passing is UTF-8 with BOM) and encoding for input itself does not seem necessary, because it gets me the correct message when I leave out the filter.
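One avenue worth testing, assuming the files are actually written in a Windows codepage despite looking like UTF-8 with BOM, is to declare that encoding on the file input's codec so the transcoding happens before any filter runs. A sketch; the Windows-1252 value is an assumption and should be replaced with whatever encoding the files really use:
input {
  file {
    path => "D:/logs/*.log"
    sincedb_path => "C:\logstash\bin\file_clientlogs_lastrun"
    start_position => "beginning"
    type => "clientlogs"
    codec => plain {
      charset => "Windows-1252"   # assumption: the encoding the log files are actually saved in
    }
  }
}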

Logstash multiline failing with custom json parsing

I have a Kafka queue with JSON objects, which I am filling with a Java-based offline producer. An example of the structure of the JSON objects:
{
"key": "999998",
"message" : "dummy \n Messages \n Line 1 ",
"type" : "app_event",
"stackTrace" : "dummyTraces",
"tags" : "dummyTags"
}
Note the \n in the "message".
I loaded the queue with a million objects and started Logstash with the following configuration:
input {
kafka {
zk_connect => "localhost:2181"
topic_id => "MemoryTest"
type => "app_event"
group_id => "dash_prod"
}
}
filter{
if [type] == "app_event" {
multiline {
pattern => "^\s"
what => "previous"
}
}
}
output {
if [type] == "app_event" {
stdout {
codec => rubydebug
}
elasticsearch {
host => "localhost"
protocol => "http"
port => "9200"
index => "app_events"
index_type => "event"
}
}
}
The multiline filter is expected to remove the \n from the message field. When I start Logstash, I run into two issues:
None of the events is pushed into Elasticsearch; I get a _jsonparsefailure error. Notice also that the message of one event 'gobbles up' consecutive events:
{
"message" => "{ \n\t\"key\": \"146982\", \n\t\"message\" : \"dummy \n Messages \n Line 1 \", \n\t\"type\" : \"app_event\", \n\t\"stackTrace\" : \"dummyTraces\", \n\t\"tags\" : \"dummyTags\" \n \t \n}\n{ \n\t\"key\": \"146983\", \n\t\"message\" : \"dummy \n Messages \n Line 1 \", \n\t\"type\" : \"app_event\", \n\t\"stackTrace\" : \"dummyTraces\", \n\t\"tags\" : \"dummyTags\" \n \t \n}\n{ \n\t\"key\": \"146984\", \n\t\"message\" : \"dummy \n Messages \n Line 1 \", \n\t\"type\" : \"app_event\", \n\t\"stackTrace\" : \"dummyTraces\", \n\t\"tags\" : \"dummyTags\" \n \t \n},
"tags" => [
[0] "_jsonparsefailure",
1 "multiline"
],
"#version" => "1",
"#timestamp" => "2015-09-21T18:38:32.005Z",
"type" => "app_event"
}
After a few minutes, the available heap memory reached a cap and Logstash stopped.
A memory profile is attached to this issue: after 13 minutes, Logstash hit the memory cap and stopped responding.
I am trying to understand how to get multiline working for this scenario and what causes the memory crash.
To replace part of a string, use mutate->gsub{}.
filter {
mutate {
gsub => [
# replace all forward slashes with underscore
"fieldname", "/", "_",
# replace backslashes, question marks, hashes, and minuses
# with a dot "."
"fieldname2", "[\\?#-]", "."
]
}
}
multiline is, as you've discovered, for combining several events into one.
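Applied to the Kafka example above, that would mean dropping the multiline filter, flattening the embedded newlines with gsub, and only then parsing the payload with a json filter. A rough sketch, under the assumption that each Kafka record carries exactly one JSON object:
filter {
  if [type] == "app_event" {
    mutate {
      # turn the raw newlines and tabs inside the record into spaces so it becomes one-line JSON
      gsub => ["message", "[\n\t]+", " "]
    }
    json {
      source => "message"    # parse the cleaned-up string into event fields
    }
  }
}
With no multiline filter buffering events that never get flushed, the unbounded memory growth should also go away.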

Logstash date parsing as timestamp using the date filter

Well, after looking around quite a lot, I could not find a solution to my problem, as it "should" work but obviously doesn't.
I'm using Logstash 1.4.2-1-2-2c0f5a1 on an Ubuntu 14.04 LTS machine, and I am receiving messages such as the following one:
2014-08-05 10:21:13,618 [17] INFO Class.Type - This is a log message from the class:
BTW, I am also multiline
In the input configuration, I do have a multiline codec and the event is parsed correctly. I also separate the event text in several parts so that it is easier to read.
In the end, I obtain, as seen in Kibana, something like the following (JSON view):
{
"_index": "logstash-2014.08.06",
"_type": "customType",
"_id": "PRtj-EiUTZK3HWAm5RiMwA",
"_score": null,
"_source": {
"#timestamp": "2014-08-06T08:51:21.160Z",
"#version": "1",
"tags": [
"multiline"
],
"type": "utg-su",
"host": "ubuntu-14",
"path": "/mnt/folder/thisIsTheLogFile.log",
"logTimestamp": "2014-08-05;10:21:13.618",
"logThreadId": "17",
"logLevel": "INFO",
"logMessage": "Class.Type - This is a log message from the class:\r\n BTW, I am also multiline\r"
},
"sort": [
"21",
1407315081160
]
}
You may have noticed that I put a ";" in the timestamp. The reason is that I want to be able to sort the logs using the timestamp string, and apparently logstash is not that good at that (e.g.: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-fields.html).
I have unsuccessfully tried to use the date filter in multiple ways, and it apparently did not work:
date {
locale => "en"
match => ["logTimestamp", "YYYY-MM-dd;HH:mm:ss.SSS", "ISO8601"]
timezone => "Europe/Vienna"
target => "#timestamp"
add_field => { "debug" => "timestampMatched"}
}
Since I read that the Joda library may have problems if the string is not strictly ISO 8601-compliant (very picky and expects a T, see https://logstash.jira.com/browse/LOGSTASH-180), I also tried to use mutate to convert the string to something like 2014-08-05T10:21:13.618 and then use "YYYY-MM-dd'T'HH:mm:ss.SSS". That also did not work.
I do not want to have to manually put a +02:00 on the time because that would give problems with daylight saving.
In all of these cases, the event goes to Elasticsearch, but the date filter apparently does nothing, as @timestamp and logTimestamp are different and no debug field is added.
Any idea how I could make the logTime strings properly sortable? I focused on converting them to a proper timestamp, but any other solution would also be welcome.
When sorting over @timestamp, Elasticsearch can do it properly, but since this is not the "real" log timestamp, but rather the time the Logstash event was read, I (obviously) need to be able to sort over logTimestamp as well. The output of that, however, is obviously not that useful.
Any help is welcome! Just let me know if I forgot some information that may be useful.
Update:
Here is the filter config file that finally worked:
# Filters messages like this:
# 2014-08-05 10:21:13,618 [17] INFO Class.Type - This is a log message from the class:
# BTW, I am also multiline
# Take only type- events (type-componentA, type-componentB, etc)
filter {
# You cannot write an "if" outside of the filter!
if "type-" in [type] {
grok {
# Parse timestamp data. We need the "(?m)" so that grok (Oniguruma internally) correctly parses multi-line events
patterns_dir => "./patterns"
match => [ "message", "(?m)%{TIMESTAMP_ISO8601:logTimestampString}[ ;]\[%{DATA:logThreadId}\][ ;]%{LOGLEVEL:logLevel}[ ;]*%{GREEDYDATA:logMessage}" ]
}
# The timestamp may have commas instead of dots. Convert so as to store everything in the same way
mutate {
gsub => [
# replace all commas with dots
"logTimestampString", ",", "."
]
}
mutate {
gsub => [
# make the logTimestamp sortable. With a space, it is not! This does not work that well, in the end
# but somehow apparently makes things easier for the date filter
"logTimestampString", " ", ";"
]
}
date {
locale => "en"
match => ["logTimestampString", "YYYY-MM-dd;HH:mm:ss.SSS"]
timezone => "Europe/Vienna"
target => "logTimestamp"
}
}
}
filter {
if "type-" in [type] {
# Remove already-parsed data
mutate {
remove_field => [ "message" ]
}
}
}
I have tested your date filter. It works for me!
Here is my configuration:
input {
stdin{}
}
filter {
date {
locale => "en"
match => ["message", "YYYY-MM-dd;HH:mm:ss.SSS"]
timezone => "Europe/Vienna"
target => "#timestamp"
add_field => { "debug" => "timestampMatched"}
}
}
output {
stdout {
codec => "rubydebug"
}
}
And I use this input:
2014-08-01;11:00:22.123
The output is:
{
"message" => "2014-08-01;11:00:22.123",
"#version" => "1",
"#timestamp" => "2014-08-01T09:00:22.123Z",
"host" => "ABCDE",
"debug" => "timestampMatched"
}
So, please make sure that your logTimestamp has the correct value.
It is probably some other problem. Or can you provide your log event and Logstash configuration for more discussion? Thank you.
This worked for me - with a slightly different datetime format:
# 2017-11-22 13:00:01,621 INFO [AtlassianEvent::0-BAM::EVENTS:pool-2-thread-2] [BuildQueueManagerImpl] Sent ExecutableQueueUpdate: addToQueue, agents known to be affected: []
input {
file {
path => "/data/atlassian-bamboo.log"
start_position => "beginning"
type => "logs"
codec => multiline {
pattern => "^%{TIMESTAMP_ISO8601} "
charset => "ISO-8859-1"
negate => true
what => "previous"
}
}
}
filter {
grok {
match => [ "message", "(?m)^%{TIMESTAMP_ISO8601:logtime}%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}\[%{DATA:thread_id}\]%{SPACE}\[%{WORD:classname}\]%{SPACE}%{GREEDYDATA:logmessage}" ]
}
date {
match => ["logtime", "yyyy-MM-dd HH:mm:ss,SSS", "yyyy-MM-dd HH:mm:ss,SSS Z", "MMM dd, yyyy HH:mm:ss a" ]
timezone => "Europe/Berlin"
}
}
output {
elasticsearch { hosts => ["localhost:9200"] }
stdout { codec => rubydebug }
}
