Grok pattern optimisation in logstash - elasticsearch

Earlier I had only one type of log for an index, but recently I changed the logs pattern. Now my grok pattern looks like
grok {
match => { "message" => "%{DATA:created_timestamp},%{DATA:request_id},%{DATA:tenant},%{DATA:username},%{DATA:job_code},%{DATA:stepname},%{DATA:quartz_trigger_timestamp},%{DATA:execution_level},%{DATA:facility_name},%{DATA:channel_code},%{DATA:status},%{DATA:current_step_time_ms},%{DATA:total_time_ms},\'%{DATA:error_message}\',%{DATA:tenant_mode},%{GREEDYDATA:channel_src_code},\'%{GREEDYDATA:jobSpecificMetaData}\'" }
match => { "message" => "%{DATA:created_timestamp},%{DATA:request_id},%{DATA:tenant},%{DATA:username},%{DATA:job_code},%{DATA:stepname},%{DATA:quartz_trigger_timestamp},%{DATA:execution_level},%{DATA:facility_name},%{DATA:channel_code},%{DATA:status},%{DATA:current_step_time_ms},%{DATA:total_time_ms},%{DATA:error_message},%{DATA:tenant_mode},%{GREEDYDATA:channel_src_code}" }
}
and sample logs are
2023-01-11 15:16:20.932,edc71ada-62f5-46be-99a4-3c8b882a6ef0,geocommerce,null,UpdateInventoryTask,MQ_TO_EVENTHANDLER,Wed Jan 11 15:16:13 IST 2023,TENANT,null,AMAZON_URBAN_BASICS,SUCCESSFUL,5903,7932,'',LIVE,AMAZON_IN,'{"totalCITCount":0}'
2023-01-11 15:16:29.368,fedca039-e834-4393-bbaa-e1903c3c92e6,bellacasa,null,UpdateInventoryTask,MQ_TO_EVENTHANDLER,Wed Jan 11 15:16:03 IST 2023,TENANT,null,FLIPKART_SMART,SUCCESSFUL,24005,26368,'',LIVE,FLIPKART_SMART,'{"totalCITCount":0}'
2023-01-11 15:16:31.684,762b8b46-2d21-437b-83fc-a1cc40737c84,ishitaknitfab,null,UpdateInventoryTask,MQ_TO_EVENTHANDLER,Wed Jan 11 15:15:48 IST 2023,TENANT,null,FLIPKART_SMART,SUCCESSFUL,41442,43684,'',LIVE,FLIPKART_SMART,'{"totalCITCount":0}'
2023-01-11 15:15:58.739,1416f5f2-a67b-416a-8e38-6bd7de457f6a,kapiva,null,PickingReplanner,MQ_TO_JOBSERVICE,Wed Jan 11 15:15:56 IST 2023,FACILITY,Non Sellable Bengaluru Return,null,SUCCESSFUL,393,2739,Task completed successfully,LIVE,null
2023-01-11 15:15:58.743,1416f5f2-a67b-416a-8e38-6bd7de457f6a,kapiva,null,PickingReplanner,MQ_TO_JOBSERVICE,Wed Jan 11 15:15:56 IST 2023,FACILITY,Delhi Main,null,SUCCESSFUL,371,2743,Task completed successfully,LIVE,null
2023-01-11 15:15:58.744,1416f5f2-a67b-416a-8e38-6bd7de457f6a,kapiva,null,PickingReplanner,MQ_TO_JOBSERVICE,Wed Jan 11 15:15:56 IST 2023,FACILITY,Bengaluru D2C,null,SUCCESSFUL,388,2744,Task completed successfully,LIVE,null
Logstash has to process approximately 150,000 events in 5 minutes for this index and approximately 400,000 events for the other index.
Now, whenever I try to change the grok, the CPU usage of the Logstash server reaches 100%.
I don't know how to optimize my grok.
Can anyone help me with this?

The first step to improving this grok would be to anchor the patterns: grok is slow when it fails to match, not when it matches. More details on how much anchoring matters can be found in this blog post from Elastic.
The second step would be to define a custom pattern to use instead of DATA, such as
pattern_definitions => { "NOTCOMMA" => "[^,]*" }
That prevents DATA from attempting to consume more than one field while the match is failing.
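Putting the two together, an anchored version of the first pattern could look something like this (a sketch only, reusing the field names from the original filter; the second pattern would be anchored the same way):

grok {
  pattern_definitions => { "NOTCOMMA" => "[^,]*" }
  match => { "message" => "^%{NOTCOMMA:created_timestamp},%{NOTCOMMA:request_id},%{NOTCOMMA:tenant},%{NOTCOMMA:username},%{NOTCOMMA:job_code},%{NOTCOMMA:stepname},%{NOTCOMMA:quartz_trigger_timestamp},%{NOTCOMMA:execution_level},%{NOTCOMMA:facility_name},%{NOTCOMMA:channel_code},%{NOTCOMMA:status},%{NOTCOMMA:current_step_time_ms},%{NOTCOMMA:total_time_ms},'%{NOTCOMMA:error_message}',%{NOTCOMMA:tenant_mode},%{NOTCOMMA:channel_src_code},'%{GREEDYDATA:jobSpecificMetaData}'$" }
}

With the leading ^ and trailing $ in place, a non-matching line is rejected after a single left-to-right pass instead of triggering the backtracking that stacked DATA patterns cause.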

Related

Missing Indices in Elasticsearch after 6.30pm UTC

We have an ingestion pipeline that creates indices every 2 hours, e.g. index-2022-05-10-0 at 12am UTC, index-2022-05-10-1 at 2am UTC, and so on. The problem is that after 7pm UTC no index is seen in Elasticsearch. Is it due to a timezone issue? But as far as I know Elasticsearch uses UTC, and the ES servers are also configured for UTC.
What might be the issue? The new index for the next day is created at 12am UTC correctly, and if I look at the index creation time in IST, it is 5.30am.
Since I am working from India, which is 5.30 hours ahead of UTC, at 7pm UTC the day has already changed in IST (it is 12.30am). Is that the timezone issue due to which further indices are not created? Could someone please help?
Below is the pipeline code
...
"script": { "lang": "painless", "source": "Date d=new Date((long)(timestampfield)*1000); DateFormat f = new SimpleDateFormat("HH"); String crh=(Integer.parseInt(f.format(d))/2).toString(); String nvFormat="yyyy-MM-dd-"+crh; DateFormat f2=new SimpleDateFormat(nvFormat); ctx['_index']="index-"+f2.format(d);"

Set year on syslog events read into logstash after new year

Question
When reading syslog events with Logstash, how can one set a proper year where:
Syslog events still by default lack the year
Logstash processing can be late in processing - logs arriving late, logstash down for maintenance, syslog queue backing up
In short: events can arrive out of order, and all or most of them lack the year.
The Logstash date filter will successfully parse a Syslog date, and use the current year by default. This can be wrong.
One constraint: Logs will never be from the future, not counting TimeZone +/- 1 day.
How can logic be applied to logstash to:
Check if a parsed date appears to be in the future?
Handle "Feb 29" if parsed in the year after the actual leap year.
Date extraction and parsing
I've used the GROK filter plugin to extract the SYSLOGTIMESTAMP from the message into a syslog_timestamp field.
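A minimal sketch of that grok step, assuming the timestamp leads each syslog line, could be:

grok {
  match => { "message" => "^%{SYSLOGTIMESTAMP:syslog_timestamp} %{GREEDYDATA:syslog_message}" }
}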
Then I use the Logstash date filter plugin to parse syslog_timestamp into the @timestamp field.
#
# Map the syslog date into the elasticsearch @timestamp field
#
date {
  match => [ "syslog_timestamp",
             "MMM dd HH:mm:ss",
             "MMM d HH:mm:ss",
             "MMM dd yyyy HH:mm:ss",
             "MMM d yyyy HH:mm:ss" ]
  timezone => "Europe/Oslo"
  target => "@timestamp"
  add_tag => [ "dated" ]
  tag_on_failure => [ "_dateparsefailure" ]
}
# Check if a localized date filter can read the date.
if "_dateparsefailure" in [tags] {
  date {
    match => [ "syslog_timestamp",
               "MMM dd HH:mm:ss",
               "MMM d HH:mm:ss",
               "MMM dd yyyy HH:mm:ss",
               "MMM d yyyy HH:mm:ss" ]
    locale => "no_NO"
    timezone => "Europe/Oslo"
    target => "@timestamp"
    add_tag => [ "dated" ]
    tag_on_failure => [ "_dateparsefailure_locale" ]
  }
}
Background
We are storing syslog events into Elasticsearch using Logstash. The input comes from a wide variety of servers both of different OS and OS versions, several hundred in total.
On the logstash server the logs are read from file. Servers ship their logs using the standard syslog forwarding protocol.
The standard Syslog event still only has the month and date in each log, and configuring all servers to also add the year is out of scope for this question.
Problem
From time to time an event will occur where a server's syslog queue backs up. The queue will then (mostly) be released after a syslog or server restart. The patching regime ensures that all servers are rebooted several times a year, so (most likely) any received events will be at most a year old.
In addition, any delay in processing across the year boundary, such as between 31/12 (December) and 1/1 (January), makes an event belong to a different year than the one in which it is processed.
From time to time you will also need to re-read some logs, and then there is the leap-year issue of February 29th - 29/02 - "Feb 29".
Examples:
May 25 HH:MM:SS
May 27 HH:MM:SS
May 30 HH:MM:SS
May 31 HH:MM:SS
Mai 31 HH:MM:SS # Localized
In sum: Logs may be late, and we need to handle it.
More advanced DateTime logic can be done with the Logstash Ruby filter plugin.
Leap year
The 29th of February only exists every four years, which makes "Feb 29" a valid date in 2020 but not in 2021.
The date is saved in syslog_timestamp and run through the date filters shown above.
The following Ruby code will:
Check if this year is a leap year (probably not since parsing failed)
Check if last year was a leap year.
If the date falls outside these checks we can't rightly assume anything else, so this check falls into the "I know and accept the risk."
#
# Handle old leap syslog messages, typically from the previous year, while in a non-leap-year
# Ruby comes with a price, so don't run it unless the date filter has failed and the date is "Feb 29".
#
if "_dateparsefailure" in [tags] and "_dateparsefailure_locale" in [tags] and [syslog_timestamp] =~ /^Feb 29/ {
ruby {
code => "
today = DateTime.now
last_year = DateTime.now().prev_year
if not today.leap? then
if last_year.leap? then
timestamp = last_year.strftime('%Y') + event.get('syslog_timestamp')
event.set('[#metadata][fix_leapyear]', LogStash::Timestamp.new(Time.parse(timestamp)))
end
end
"
}
#
# Overwrite the `#timestamp` field if successful and remove the failure tags
#
if [#metadata][fix_leapyear] {
mutate {
copy => { "[#metadata][fix_leapyear]" => "#timestamp" }
remove_tag => ["_dateparsefailure", "_dateparsefailure_locale"]
add_tag => ["dated"]
}
}
}
Date in the future
Dates "in the future" occurs if you get i.e. Nov 11 in a log parsed after New Year.
This Ruby filter will:
Set a tomorrow date variable two days in the future (ymmv)
Check if the parsed event date #timestamp is after (in the future) tomorrow
When reading syslog we assume that logs from the future does not exist. If you run test servers to simulate later dates you must of course adapt to that, but that is outside the scope.
# Fix Syslog date without YEAR.
# If the date is "in the future" we assume it is really in the past by one year.
#
if ![@metadata][fix_leapyear] {
  ruby {
    code => "
      #
      # Create a Time object for two days from the current time by adding 172800 seconds.
      # Depends on [event][timestamp] being set before any 'date' filter; alternatively use Ruby's `Time.now`
      #
      tomorrow = event.get('[event][timestamp]').time.localtime() + 172800
      #
      # Read the @timestamp set by the 'date' filter
      #
      timestamp = event.get('@timestamp').time.localtime()
      #
      # If the event timestamp is _newer_ than two days from now
      # we assume that this is syslog, and a really old message, and that it is really from
      # last year. We cannot be sure that it is not even older, hence the 'assume'.
      #
      if timestamp > tomorrow then
        if defined?(timestamp.usec_with_frac) then
          new_date = LogStash::Timestamp.new(Time.new(timestamp.year - 1, timestamp.month, timestamp.day, timestamp.hour, timestamp.min, timestamp.sec, timestamp.usec_with_frac))
        else
          new_date = LogStash::Timestamp.new(Time.new(timestamp.year - 1, timestamp.month, timestamp.day, timestamp.hour, timestamp.min, timestamp.sec))
        end
        event.set('@timestamp', new_date)
        event.set('[event][timestamp_datefilter]', timestamp)
      end
    "
  }
}
Caveat: I'm by no means a Ruby expert, so other answers or comments on how to improve the Ruby code or logic would be greatly appreciated.
In the hope that this can help or inspire others.

How do I create a cron expression running in Kibana on weekdays?

I would like my watcher to run from Monday to Friday only. So I'm trying to use this schedule:
"trigger": {
"schedule" : { "cron" : "0 0 0/4 * * MON-FRI" }
},
"input": {
...
However, I'm getting
Error
Watcher: [parse_exception] could not parse [cron] schedule
when I'm trying to save the watcher. Removing MON-FRI does help, but I need it.
This expression works:
0 0 0/4 ? * MON-FRI
But I'm not sure I understand why ? is required for either the day_of_week or the day_of_month field.
Thank you!
I believe this is what you are looking for:
"0 0 0/4 ? * MON-FRI"
You can use croneval to check your cron expressions:
$ /usr/share/elasticsearch/bin/x-pack/croneval "0 0 0/4 ? * MON-FRI"
Valid!
Now is [Mon, 20 Aug 2018 13:32:26]
Here are the next 10 times this cron expression will trigger:
1. Mon, 20 Aug 2018 09:00:00
2. Mon, 20 Aug 2018 13:00:00
3. Mon, 20 Aug 2018 17:00:00
4. Mon, 20 Aug 2018 21:00:00
5. Tue, 21 Aug 2018 01:00:00
6. Tue, 21 Aug 2018 05:00:00
7. Tue, 21 Aug 2018 09:00:00
8. Tue, 21 Aug 2018 13:00:00
9. Tue, 21 Aug 2018 17:00:00
10. Tue, 21 Aug 2018 21:00:00
For the first expression you'll get the following Java exception:
java.lang.IllegalArgumentException: support for specifying both a day-of-week AND a day-of-month parameter is not implemented.
You can also use Crontab guru to get human-readable descriptions like:
At every minute past every 4th hour from 0 through 23 on every day-of-week from Monday through Friday.
The question mark means 'No Specific value'. From the documentation on Quartz's website:
? (“no specific value”) - useful when you need to specify something in one of the two fields in which the character is allowed, but not the other. For example, if I want my trigger to fire on a particular day of the month (say, the 10th), but don’t care what day of the week that happens to be, I would put “10” in the day-of-month field, and “?” in the day-of-week field. See the examples below for clarification.
http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/crontrigger.html
I suppose that since you want your schedule to run every 4 hours, Monday to Friday, the actual day of the month is irrelevant, so the ? expresses exactly that. *, on the other hand, would mean 'all values', which would not make sense since you are restricting the day of the week to MON-FRI.
Hope that helps!

I don't know how to filter my log file with grok and logstash

I have a small Java app that loads logs similar to the ones below:
Fri May 29 12:10:34 BST 2015 Trade ID: 2 status is :received
Fri May 29 14:12:36 BST 2015 Trade ID: 4 status is :received
Fri May 29 17:15:39 BST 2015 Trade ID: 3 status is :received
Fri May 29 21:19:43 BST 2015 Trade ID: 3 status is :Parsed
Sat May 30 02:24:48 BST 2015 Trade ID: 8 status is :received
Sat May 30 08:30:54 BST 2015 Trade ID: 3 status is :Data not found
Sat May 30 15:38:01 BST 2015 Trade ID: 3 status is :Book not found
Sat May 30 23:46:09 BST 2015 Trade ID: 6 status is :received
I want to use ELK stack to analyse my logs and filter them.
I would like to extract at least 3 fields: date and time, trade ID, and status.
In the filter part of my logstash configuration file here is what I did:
filter {
  grok {
    match => { "message" => "%{DAY} %{MONTH} %{DAY} %{TIME} BST %{YEAR} Trade ID: %{NUMBER:tradeId} status is : %{WORD:status}" }
  }
}
And for the moment I can't filter my logs as I want.
You have some extra spaces in the pattern, and for the status you want to capture the rest of the message, so using GREEDYDATA instead of WORD is the better choice.
filter {
  grok {
    match => { "message" => "%{DAY:day} %{MONTH:month} %{MONTHDAY:monthday} %{TIME:time} BST %{YEAR:year} Trade ID: %{NUMBER:tradeId} status is :%{GREEDYDATA:status}" }
  }
}
For this log line:
Sat May 30 15:38:01 BST 2015 Trade ID: 3 status is :Book not found
You will end up with a json like:
{
    "message" => "Sat May 30 15:38:01 BST 2015 Trade ID: 3 status is :Book not found",
    "@version" => "1",
    "@timestamp" => "2015-08-18T18:28:47.195Z",
    "host" => "Gabriels-MacBook-Pro.local",
    "day" => "Sat",
    "month" => "May",
    "monthday" => "30",
    "time" => "15:38:01",
    "year" => "2015",
    "tradeId" => "3",
    "status" => "Book not found"
}
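If you also want a proper @timestamp built from those pieces, one way to do it (a sketch on top of the answer above; the temporary log_timestamp field is just an illustrative name) is to reassemble the parts and hand them to the date filter:

filter {
  # Reassemble the date parts captured by grok into one temporary field
  mutate {
    add_field => { "log_timestamp" => "%{month} %{monthday} %{year} %{time}" }
  }
  # Parse it into @timestamp; the second pattern covers single-digit days
  date {
    match => [ "log_timestamp", "MMM dd yyyy HH:mm:ss", "MMM d yyyy HH:mm:ss" ]
    timezone => "Europe/London"   # the sample lines are BST; adjust if needed
    remove_field => [ "log_timestamp" ]
  }
}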

How to save a timezone correctly with Ruby and MongoId?

Please excuse me if this is a bit of a noob issue:
I have an app where users can set their own Timezones in their profile.
When someone adds a Lineup (app specific terminology), I do the following:
time = ActiveSupport::TimeZone.new(user.timezone).parse(
  "Wednesday, 26 October, 2011 13:30:00"
)
# This outputs: 2011-10-26 13:30:00 +0200 - valid according to the user selected TZ
I then save the Lineup:
Lineup.create({
  :date => time.gmtime,
  :uid => user._id,
  :pid => product._id
})
This should (in theory) save the date as gmtime, but I get the following when viewing the record:
{
  "_id": ObjectId("4e9c6613e673454f93000002"),
  "date": "Wed, 26 Oct 2011 13:30:00 +0200",
  "uid": "4e9b81f6e673454c8a000001",
  "pid": "4e9c6613e673454f93000001",
  "created_at": "Mon, 17 Oct 2011 19:29:55 +0200"
}
As you can see, the date field is wrong - it is still in the user's timezone, when it should be GMT, not timezone-specific.
If I output time.gmtime, I get the right time (that should be saved):
2011-10-26 11:30:00 UTC (correct)
Any ideas how to save the date so that it is actually stored as GMT?
It looks like you need to specify the field type of your date attribute. I would use a Time field if you want mongoid to handle the zones properly.
class Lineup
  include Mongoid::Document
  field :date, type: Time
end
You will also probably want to set the following in config/mongoid.yml
defaults: &defaults
use_utc: false
use_activesupport_time_zone: true
This sounds counterintuitive, but this is the current way to make mongoid use UTC as the default timezone.
Finally, have a look at the mongoid-metastamp gem. It will give you much better support for querying across multiple timezones, while still seamlessly working like a native Time field.
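As a quick sanity check of the round trip (a sketch only; the field names are the ones from the question, and the stored value shown assumes the field type and config above):

# With `field :date, type: Time`, Mongoid serializes the value to UTC in MongoDB.
time = ActiveSupport::TimeZone.new(user.timezone).parse("Wednesday, 26 October, 2011 13:30:00")

lineup = Lineup.create({
  :date => time.gmtime,   # persisted as 2011-10-26 11:30:00 UTC
  :uid => user._id,
  :pid => product._id
})

lineup.reload.date        # read back via ActiveSupport's configured time zone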
