I have created a multi-alert event monitor:
events('sources:rds event_source:db-instance').by('dbinstanceidentifier').rollup('count').last('1d') >= 1
I wanted it to be aggregated by "dbinstanceidentifier", but it shows the cumulative count for "* (Entire Infrastructure)". Basically it doesn't see any groups, even though I can see them in the infrastructure. Is it a problem with Datadog? Maybe it's only available in some kind of "premium" subscription?
I've tried with this query which is similar to yours:
events('sources:rds priority:all tags:event_source:db-instance').by('dbinstanceidentifier').rollup('count').last('1d') > 1
And this seems to give me counts by dbinstanceidentifier.
Do you have more information to provide? Maybe an event list and a monitor result screenshot?
Problem:
I have an application with many lambda functions. However, most of them never log anything. That makes it hard to retrieve anything when there's a problem.
We use CloudWatch and CloudTrail. But the CloudWatch logs are often empty (just the start/stop is shown).
When we do find an event, it's difficult to get a full invocation trail, because each lambda has its own log group, so we often have to look through multiple log files. This is basically something CloudTrail could help us with ...
However, CloudTrail isn't of much use either, because there are more than 1000 invocations each minute. While all events are unique, most of them look identical inside CloudWatch. That makes it hard to filter them. (e.g. There's no URL to filter on, as most of our events are first queued in SQS, and only later handled by a lambda. Because of that, there isn't any URL to search on in CloudTrail.)
On the positive side, for events that come from SQS we have a DLQ configured, which we can poll to see what the failed events look like. Even then, though, it's hard to find the matching CloudTrail record.
Question:
To get more transparency,
is there a convenient way to log the input body of all lambda invocations to CloudWatch? That would solve half of the problem.
And while doing so, is there a possibility to make recurring fields of the input searchable in CloudTrail?
Adding more metadata to a CloudTrail record would help us:
It would actually make it possible to filter, without hitting the 1000 results limit.
It would be easier to find the full CloudTrail for a given CloudWatch event or DLQ message.
Ideally, can any of this be done without changing the code of the existing lambda functions? (Simply, because there are so many of them.)
Have you considered emitting JSON logs from your Lambdas and using CloudWatch Logs Insights to search them? If you need additional custom metrics, I’d look at the Embedded Metric Format: https://aws.amazon.com/blogs/mt/enhancing-workload-observability-using-amazon-cloudwatch-embedded-metric-format/
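For reference, an Embedded Metric Format log line is just structured JSON printed to stdout; from a Python handler it could look roughly like the sketch below (the namespace, metric name and extra fields are made-up placeholders):

import json
import time

def handler(event, context):
    # Printing an EMF-formatted JSON line makes CloudWatch extract a custom
    # metric from it, while the full payload stays searchable in Logs Insights.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",  # placeholder namespace
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "ProcessedEvents", "Unit": "Count"}],
            }],
        },
        "FunctionName": context.function_name,
        "ProcessedEvents": 1,
        "payload": event,  # the raw input body, so it can be searched later
    }, default=str))  # default=str guards against non-JSON-serializable values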
I’d also recommend taking a look at some of the capabilities provided by Lambda Power Tools: https://awslabs.github.io/aws-lambda-powertools-python/2.5.0/
There are a few things in here so I'll attempt to break them down one by one:
Searching across multiple log groups
As @jaredcnance recommended, CloudWatch Logs Insights will enable you to easily and quickly search across multiple log groups. You can likely get started with a simple filter @message like /my pattern/ query.
I suggest testing with 1-2 log groups and a small-ish time window so that you can get your queries correct. Once you're happy, query all of your log groups and save the queries so that you can quickly and easily run them in the future.
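A starting query could look roughly like this (the pattern and the limit are placeholders):

fields @timestamp, @logStream, @message
| filter @message like /my pattern/
| sort @timestamp desc
| limit 50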
Logging Lambda event payloads
Yes, you can easily do this with Lambda Power Tools. If you're not using Python, check the landing page to see if your runtime is supported. If you are using a Lambda runtime that doesn't have LPT support, you can log JSON output yourself.
When you log with JSON it's trivial to query with CW Logs Insights. For example, a Python statement like this:
from aws_lambda_powertools import Logger

logger = Logger()

logger.info({
    "action": "MOVE",
    "game_id": game.id,
    "player1": game.player_x.id,
    "player2": game.player_o.id,
})
enables queries like this:
fields @timestamp, correlation_id, message.action, session_id, location
| filter ispresent(message.action) AND message.action = 'MOVE'
| sort @timestamp desc
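If you also want the raw input of every invocation in the logs (the first half of your question), the Powertools Logger can capture that for you: as far as I understand its docs, passing log_event=True to the decorator logs the whole incoming event, and there is also a POWERTOOLS_LOGGER_LOG_EVENT environment variable that toggles the same behaviour. A minimal sketch:

from aws_lambda_powertools import Logger

logger = Logger()

# log_event=True makes Powertools log the full incoming event payload
# for every invocation, so the input body ends up in CloudWatch Logs.
@logger.inject_lambda_context(log_event=True)
def handler(event, context):
    ...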
Updating Lambda functions
Lambda runs your code and will not update itself. If you want to emit logs, you have to update your code. There is no way around that.
CloudTrail
CloudTrail is designed as a security and governance tool. What you are trying to do is operational in nature (debugging). As such, logging and monitoring solutions like CW Logs are going to be your friends. While some of the data plane operations may end up in CloudTrail, CloudWatch or other logging solutions are better suited.
I am looking for pointers to create a Kibana watcher that looks at my logs and sends an alert if the text "Security Alert" appears in my logs more than 10 times within any 30-minute period.
I am referring to this article
https://www.elastic.co/guide/en/kibana/current/watcher-ui.html#watcher-create-threshold-alert
It's not clear in the doc how I can (1) read through, filter, and parse the string, and (2) set up counts for it.
For this requirement you should use an advanced watcher rather than the simpler (and less powerful) threshold watchers. In the Kibana Watcher UI you can choose between both types.
See
https://www.elastic.co/guide/en/kibana/current/watcher-ui.html#watcher-create-advanced-watch for an introduction and
https://www.elastic.co/guide/en/elasticsearch/reference/current/how-watcher-works.html for the syntax and the overall behaviour of advanced watchers.
So based on the requirements you described in your question, here's how you would implement the watcher, conceptually in a nutshell (a rough sketch follows the list):
The 30 minutes would be the trigger interval.
The input section has to be an appropriate Elasticsearch query that matches the "Security Alert" text.
The condition would be something like "numberOfHits gte 10". So the watcher gets triggered every 30 minutes, but the actions are only executed when the condition is met.
In the actions section you choose between the available options (log, mail, Slack messages, etc.). If you want to send mails, you need to set up mail accounts first.
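Pulled together, the watch body could look roughly like this. It's only a sketch: the index pattern "my-logs-*", the "message" field and the logging action are assumptions you would adapt, and you could equally paste the same JSON into the Kibana advanced watch editor instead of using the Python client.

from elasticsearch import Elasticsearch

watch_body = {
    # run every 30 minutes
    "trigger": {"schedule": {"interval": "30m"}},
    # search the last 30 minutes of logs for the phrase "Security Alert"
    "input": {
        "search": {
            "request": {
                "indices": ["my-logs-*"],  # assumption: adapt to your indices
                "body": {
                    "query": {
                        "bool": {
                            "filter": [
                                {"match_phrase": {"message": "Security Alert"}},
                                {"range": {"@timestamp": {"gte": "now-30m"}}},
                            ]
                        }
                    }
                },
            }
        }
    },
    # only run the actions when there were 10 or more hits
    "condition": {"compare": {"ctx.payload.hits.total": {"gte": 10}}},
    # simplest action: write to the Elasticsearch log; mail or Slack work the same way
    "actions": {
        "log_hits": {
            "logging": {
                "text": "Security Alert seen {{ctx.payload.hits.total}} times in the last 30 minutes"
            }
        }
    },
}

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster
es.watcher.put_watch(id="security_alert_watch", body=watch_body)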
I hope I could help you.
Running Flink 1.9.0 on YARN (2.6.0-cdh5.11.1), but the Flink web UI metrics don't work, as shown below:
I guess you are looking at the wrong metrics. Since no data flows from one task to another (you can see only one box in the UI), there is nothing to show. The metrics you are looking at only show the data that flows from one Flink task to another. In your example everything happens within that single task.
Look at this example:
You can see two tasks sending data to the map task, which emits the data to another task. Therefore you see incoming and outgoing data.
But on the other hand, a source task never has incoming data (I must admit that this is confusing at first glance):
The number of records received is 0, but it sends a couple of records to the downstream task.
Back to your problem: what you can do is have a look at the operator metrics. If you look at the metrics tab (the one at the very right), you can select some operator metrics besides the task metrics. These metrics have a name like 0.Map.numRecordsIn.
The name is assembled as <slot>.<operatorName>.<metricName>. But be aware that these metrics are not persisted: there is no historic data, and once you leave the tab or remove a metric, the data collected until that point is gone. I would recommend using a proper metrics backend like InfluxDB, Prometheus, or Graphite; a rough config sketch follows. You can find a description in the Flink docs.
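For example, in Flink 1.9 wiring up the Prometheus reporter boils down to copying the flink-metrics-prometheus jar from opt/ to lib/ and adding something like the following to flink-conf.yaml (the port range here is just an example):

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260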
Hope that helped.
Recently I started to learn Elasticsearch and found that it provides several monitoring and alerting functions. For example, users can create a watcher to monitor a certain metric, and an alert will be triggered if it exceeds the threshold.
Meanwhile, Elasticsearch is designed for full-text searching.
I wonder whether I can monitor log files and set up an alert based on keywords? For example, an alert triggered by the keyword "test": if the incoming log file contains the word "test", the alert will be triggered.
If somebody has done anything related or has a clue about this, please give me a hint!
I'm not sure if this sort of aggregation is best done after being indexed by Elasticsearch, or if Logstash is a good place to do it.
We are logging information about commands run against a server. Each set of metrics regarding a single command is logged as a single log event, and there are multiple 'metric sets' per command. Each metric is its own document type in ES (currently, at least). So we will have multiple events across multiple documents regarding one command run against the server.
Each of these events will have a 'cmdno' field which is a temporary id given to the command we are logging about. Once the command has finished with all events logged, the 'cmdno' may be reused for other commands.
Is it possible to use the Logstash 'aggregate' plugin to link the events of a single command together using the 'cmdno'? (Or any other plugin?)
All events that pertain to a single command will have the same timestamp + cmdno. I would like to add a UUID to the events as a permanent unique id for that command, so that a single query will give us all events for that single command.
Was thinking along the lines of:
if [cmdno] {
  aggregate {
    task_id => "%{cmdno}"
    # one UUID per cmdno; only tag events whose timestamp matches the first one seen,
    # so a reused cmdno with a different timestamp isn't treated as the same command
    code => "require 'securerandom'; map['cmdid'] ||= SecureRandom.uuid; map['ts'] ||= event['@timestamp']; event['cmdid'] = map['cmdid'] if event['@timestamp'] == map['ts']"
  }
}
Just started learning the ELK stack, so I'm not entirely sure yet what programming constructs Logstash affords me.
I don't know if there is a better way to relate these events; this seemed the most suitable for our needs. If there are more ELK'y methods, please let me know. They do need to stay as separate documents of different types, though.
Any help much appreciated, let me know if I am missing anything.
Cheers,
Brett