How to monitor log files and set up an alert based on keywords using Elasticsearch?

I recently started learning Elasticsearch and found that it provides several monitoring and alerting functions. For example, users can create a watcher to monitor a certain metric; an alert is triggered if the metric exceeds a threshold.
Meanwhile, Elasticsearch is designed for full-text search.
I wonder whether I can monitor log files and set up an alert based on keywords? For example, an alert triggered by the keyword "test": if the incoming log contains the word "test", the alert fires.
If anybody has done anything related or has a clue about this, please give me a hint!

Related

Enrich CloudWatch and CloudTrail with custom Lambda invocation input

Problem:
I have an application with many lambda functions. However, most of them never log anything. That makes it hard to retrieve anything when there's a problem.
We use CloudWatch and CloudTrail. But the CloudWatch logs are often empty (just the start/stop is shown).
When we do find an event, it's difficult to get a full invocation trail, because each lambda has its own log group, so we often have to look through multiple log groups. This is basically something CloudTrail could help us with ...
However, CloudTrail isn't of much use either, because there are more than 1000 invocations each minute. While all events are unique, most of them look identical inside CloudTrail, which makes them hard to filter. (E.g. there's no URL to filter on, as most of our events are first queued in SQS and only later handled by a lambda. Because of that, there isn't any URL to search on in CloudTrail.)
On the positive side, for events that come from SQS, we have a DLQ configured, which we can poll to see what the failed events look like. But even then, it's hard to find the matching CloudTrail record.
Question:
To get more transparency,
is there a convenient way to log the input body of all lambda invocations to CloudWatch? That would solve half of the problem.
And while doing so, is there a possibility to make recurring fields of the input searchable in CloudTrail?
Adding more metadata to a CloudTrail record would help us:
It would actually make it possible to filter, without hitting the 1000 results limit.
It would be easier to find the full CloudTrail for a given CloudWatch event or DLQ message.
Ideally, can any of this be done without changing the code of the existing lambda functions? (Simply, because there are so many of them.)
Have you considered emitting JSON logs from your Lambdas and using CloudWatch Logs Insights to search them? If you need additional custom metrics, I’d look at the Embedded Metric Format: https://aws.amazon.com/blogs/mt/enhancing-workload-observability-using-amazon-cloudwatch-embedded-metric-format/
I’d also recommend taking a look at some of the capabilities provided by Lambda Power Tools: https://awslabs.github.io/aws-lambda-powertools-python/2.5.0/
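For reference, the Embedded Metric Format mentioned above is just structured JSON written to stdout, which CloudWatch turns into real metrics. A minimal hand-rolled sketch in Python (the namespace, dimension, and metric names are invented for illustration):

import json
import time

def log_emf_metric(records_processed):
    # EMF is plain JSON on stdout; CloudWatch parses the _aws block into metrics.
    # Namespace, dimensions, and metric names below are illustrative only.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "ProcessedRecords", "Unit": "Count"}],
            }],
        },
        "FunctionName": "my-lambda",
        "ProcessedRecords": records_processed,
    }))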
There are a few things in here so I'll attempt to break them down one by one:
Searching across multiple log groups
As @jaredcnance recommended, CloudWatch Logs Insights will enable you to easily and quickly search across multiple log groups. You can likely get started with a simple filter @message like /my pattern/ query.
I suggest testing with 1-2 log groups and a small-ish time window so that you can get your queries correct. Once you're happy, query all of your log groups and save the queries so that you can quickly and easily run them in the future.
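If you prefer to script the search instead of using the console, the same Logs Insights query can be run through the API; a minimal boto3 sketch (the log group names are placeholders):

import time
import boto3

logs = boto3.client("logs")
now = int(time.time())

# Start one Insights query across several log groups at once.
start = logs.start_query(
    logGroupNames=["/aws/lambda/func-a", "/aws/lambda/func-b"],  # placeholders
    startTime=now - 3600,  # last hour
    endTime=now,
    queryString="filter @message like /my pattern/ | sort @timestamp desc | limit 50",
)

# Poll until the query finishes, then print the matching rows.
while True:
    resp = logs.get_query_results(queryId=start["queryId"])
    if resp["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
for row in resp["results"]:
    print({field["field"]: field["value"] for field in row})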
Logging Lambda event payloads
Yes, you can easily do this with Lambda Power Tools. If you're not using Python, check the landing page to see if your runtime is supported. If you are using a Lambda runtime that doesn't have LPT support, you can log JSON output yourself.
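A hand-rolled version that also captures the invocation payload (the first ask in the question) can be as small as this sketch; the handler and field names are illustrative:

import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Log the full invocation payload as one JSON line so it is queryable later.
    logger.info(json.dumps({"message": "invocation", "event": event}))
    # ... existing business logic ...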
When you log with JSON it's trivial to query with CW Logs Insights. For example, a Python statement like this:
from aws_lambda_powertools import Logger
logger = Logger()
logger.info({
    "action": "MOVE",
    "game_id": game.id,
    "player1": game.player_x.id,
    "player2": game.player_o.id,
})
enables queries like this:
fields @timestamp, correlation_id, message.action, session_id, location
| filter ispresent(message.action) AND message.action = 'MOVE'
| sort @timestamp desc
Updating Lambda functions
Lambda runs your code and will not update itself. If you want to emit logs, you have to update your code. There is no way around that.
CloudTrail
CloudTrail is designed as a security and governance tool. What you are trying to do is operational in nature (debugging). As such, logging and monitoring solutions like CW Logs are going to be your friends. While some of the data plane operations may end up in CloudTrail, CloudWatch or other logging solutions are better suited.

Is Elastic/Metricbeat suitable for process monitoring and alerting?

Do you use Elastic and Metricbeat for process monitoring and alerting? How did you configure your data gathering and alerting?
I am currently trying to set this up, and running into some basic issues. These issues are making me question whether Elastic is a suitable tool for alerting. Here is my planned setup:
Use Metricbeat to gather process data
Create an Elastic dashboard/lens for certain processes
If the process.cpu.start_time from Metricbeat is very recent (e.g. the process has only been running for under 5 minutes), alert!
I have been working my way through this using the following approach:
From Metricbeat, the process documents include process.cpu.start_time as a text string in ISO date format. Elastic Lens queries are very limited with dates.
Workaround: use Logstash to create a filter field process.cpu.start_epoch, which is an integer - the Unix epoch: "seconds since January 1, 1970".
Create a dashboard lens, querying only my process, and only the last metric. This works and gives me "the time that the process started, as a Unix epoch".
I next need to calculate the time difference between now and that integer. However I don't see anything in the lens documentation about doing date math. So I'm stuck.
The difficulties I am encountering are making me wonder if I am "doing it wrong"? Is Elastic/Metricbeats a suitable tool for what I am trying to achieve?
Answer: find the right hammer!
What I needed is called "Elastic runtime fields". There's a step-by-step writeup here: https://elastic-content-share.eu/elastic-runtime-field-example-repository/
Summary:
open the index pattern
click the "dots"
choose "add field to index pattern"
set the output field name as desired
for me this is process.cpu.start.age
set the output type
for me this is "long"
write your script in Painless
for me this is emit(new Date().getTime() - doc['process.cpu.start_time'].value.toEpochMilli());
PS: I deleted my logstash filters, because they were superfluous.
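For reference, the same runtime field can also be defined at the index level rather than on the Kibana index pattern, via the runtime mappings API. A sketch assuming the elasticsearch Python client, a local cluster, and a metricbeat-* index pattern:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Add a runtime field to the mapping; Elasticsearch computes it at query time.
es.indices.put_mapping(
    index="metricbeat-*",  # assumed index pattern
    runtime={
        "process.cpu.start.age": {
            "type": "long",
            "script": {
                "source": "emit(new Date().getTime() - doc['process.cpu.start_time'].value.toEpochMilli());"
            },
        }
    },
)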

Kibana Watcher query for searching text

I am looking for pointers to create a Kibana watcher where I want to look at my logs and I want to send an alert if I see the text "Security Alert" in my logs more than 10 times within any 30 mins period.
I am referring to this article
https://www.elastic.co/guide/en/kibana/current/watcher-ui.html#watcher-create-threshold-alert
It's not clear from the doc how I can 1) read through, filter, and parse the string, and 2) set up counts for the same.
For this requirement you should use an advanced watcher rather than the simpler (and less powerful) threshold watcher. In the Kibana Watcher UI you can choose between both types.
See
https://www.elastic.co/guide/en/kibana/current/watcher-ui.html#watcher-create-advanced-watch for an introduction and
https://www.elastic.co/guide/en/elasticsearch/reference/current/how-watcher-works.html for the syntax and the overall behaviour of advanced watchers.
So based on the requirements you described in your question, here's how you would implement the watcher (conceptually, in a nutshell; a concrete sketch follows the list):
The 30 minutes would be the trigger interval.
The input section has to be an appropriate Elasticsearch query that matches the "Security Alert" text.
The condition would be something like "numberOfHits gte 10". So the watcher gets triggered every 30 mins, but the actions are executed only when the condition is met.
In the actions section you would need to choose between the available options (log, mail, Slack messages etc.). If you want to send mails, you need to set up mail accounts first.
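Putting those pieces together, a minimal advanced watch could look like the sketch below. The index pattern, message field, and logging action are assumptions to adapt to your own mapping; the body is shown via the elasticsearch Python client, but the same JSON can be PUT to _watcher/watch/<id> directly.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster

watch = {
    # Trigger: evaluate every 30 minutes.
    "trigger": {"schedule": {"interval": "30m"}},
    # Input: count log documents containing "Security Alert" in the last 30 minutes.
    "input": {
        "search": {
            "request": {
                "indices": ["logs-*"],  # assumed index pattern
                "body": {
                    "size": 0,
                    "query": {
                        "bool": {
                            "filter": [
                                {"match_phrase": {"message": "Security Alert"}},
                                {"range": {"@timestamp": {"gte": "now-30m"}}},
                            ]
                        }
                    },
                },
            }
        }
    },
    # Condition: run the actions only when there were 10 or more hits.
    "condition": {"compare": {"ctx.payload.hits.total": {"gte": 10}}},
    # Action: write to the Elasticsearch log; swap in mail or Slack as needed.
    "actions": {
        "log_hits": {
            "logging": {
                "text": "Saw 'Security Alert' {{ctx.payload.hits.total}} times in the last 30 minutes"
            }
        }
    },
}

es.watcher.put_watch(id="security-alert-watch", body=watch)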
I hope I could help you.

Datadog event monitor aggregation

I have created a multi alert event monitor:
events('sources:rds event_source:db-instance').by('dbinstanceidentifier').rollup('count').last('1d') >= 1
I wanted it to be aggregated by "dbinstanceidentifier", but it shows the cumulative count for "* (Entire Infrastructure)". Basically, it doesn't see any groups, even though I can see them in the infrastructure. Is it a problem with Datadog? Maybe it's only available in a kind of "premium" subscription?
I've tried this query, which is similar to yours:
events('sources:rds priority:all tags:event_source:db-instance').by('dbinstanceidentifier').rollup('count').last('1d') > 1
And it does seem to give me counts grouped by dbinstanceidentifier. Note that it differs from yours in the priority:all attribute and the tags: prefix on event_source.
Do you have more information to provide? Maybe an event list and a monitor result screenshot?

Logstash aggregation based on 'temporary id'

I'm not sure if this sort of aggregation is best done after being indexed by elasticsearch or if logstash is a good place to do it.
We are logging information about commands run against a server. Each set of metrics regarding a single command is logged as a single log event, and there are multiple 'metric sets' per command. Each metric set is of its own document type in ES (currently, at least). So we will have multiple events across multiple document types regarding one command run against the server.
Each of these events will have a 'cmdno' field which is a temporary id given to the command we are logging about. Once the command has finished with all events logged, the 'cmdno' may be reused for other commands.
Is it possible to use the Logstash 'aggregate' plugin (or any other plugin) to link the events of a single command together using the 'cmdno'?
All events that pertain to a single command will have the same timestamp + cmdno. I would like to add a UUID to the events as a permanent unique id for that command, so that a single query will give us all events for that single command.
Was thinking along the lines of:
if [cmdno] {
  aggregate {
    task_id => "%{cmdno}"
    # One UUID per cmdno task; every event in the task gets the same cmdid.
    code => "
      require 'securerandom'
      map['cmdid'] ||= SecureRandom.uuid
      event.set('cmdid', map['cmdid'])
    "
    timeout => 120  # after this many seconds a reused cmdno starts a fresh task/UUID
  }
}
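With every event carrying the same cmdid, the end goal of "a single query gives us all events for that command" would then be something like this sketch (the endpoint, index pattern, and UUID are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
resp = es.search(
    index="logs-*",  # placeholder index pattern
    query={"term": {"cmdid": "0f8fad5b-d9cb-469f-a165-70867728950e"}},  # placeholder UUID
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])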
I've just started learning the ELK stack, so I'm not entirely sure yet what programming constructs Logstash affords me.
I don't know if there is a better way to relate these events; this seemed the most suitable for our needs. If there are more ELK'y methods, please let me know. They do need to stay as separate documents of different types, though.
Any help much appreciated, let me know if I am missing anything.
Cheers,
Brett
