Enrich CloudWatch and CloudTrail with custom Lambda invocation input - aws-lambda

Problem:
I have an application with many Lambda functions. However, most of them never log anything, which makes it hard to retrieve anything when there's a problem.
We use CloudWatch and CloudTrail, but the CloudWatch logs are often empty (just the start/stop is shown).
When we do find an event, it's difficult to get a full invocation trail, because each Lambda has its own log group, so we often have to look through multiple log files, which is something CloudTrail could in principle help us with...
However, CloudTrail isn't of much use either, because there are more than 1000 invocations each minute. While all events are unique, most of them look identical in CloudTrail, which makes them hard to filter. (For example, there's no URL to filter on: most of our events are first queued in SQS and only later handled by a Lambda, so there isn't any URL to search on in CloudTrail.)
On the positive side, for events that come from SQS we have a DLQ configured, which we can poll to see what the failed events look like. Even then, though, it's hard to find the matching CloudTrail record.
Question:
To get more transparency,
is there a convenient way to log the input body of all lambda invocations to CloudWatch? That would solve half of the problem.
And while doing so, is there a possibility to make recurring fields of the input searchable in CloudTrail?
Adding more metadata to a CloudTrail record would help us:
It would actually make it possible to filter, without hitting the 1000 results limit.
It would be easier to find the full CloudTrail for a given CloudWatch event or DLQ message.
Ideally, can any of this be done without changing the code of the existing Lambda functions? (Simply because there are so many of them.)

Have you considered emitting JSON logs from your Lambdas and using CloudWatch Logs Insights to search them? If you need additional custom metrics, I’d look at the Embedded Metric Format: https://aws.amazon.com/blogs/mt/enhancing-workload-observability-using-amazon-cloudwatch-embedded-metric-format/
I’d also recommend taking a look at some of the capabilities provided by Lambda Power Tools: https://awslabs.github.io/aws-lambda-powertools-python/2.5.0/

There are a few things in here so I'll attempt to break them down one by one:
Searching across multiple log groups
As #jaredcnance recommended, CloudWatch Logs Insights will enable you to easily and quickly search across multiple log groups. You can likely get started with a simple filter @message like /my pattern/ query.
I suggest testing with 1-2 log groups and a small-ish time window so that you can get your queries correct. Once you're happy, query all of your log groups and save the queries so that you can quickly and easily run them in the future.
Logging Lambda event payloads
Yes, you can easily do this with Lambda Power Tools. If you're not using Python, check the landing page to see if your runtime is supported. If you are using a Lambda runtime that doesn't have LPT support, you can log JSON output yourself.
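For example, a minimal hand-rolled sketch of logging the payload yourself (the handler and field names are illustrative, not from the question):

```python
import json

def handler(event, context):
    # Log the raw invocation payload as a single JSON line so that
    # CloudWatch Logs Insights can parse and filter on its fields.
    print(json.dumps({"level": "INFO", "message": "invocation", "event": event}))
    # ... existing business logic goes here ...
```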
When you log with JSON it's trivial to query with CW Logs Insights. For example, a Python statement like this:
from aws_lambda_powertools import Logger

logger = Logger()
logger.info({
    "action": "MOVE",
    "game_id": game.id,
    "player1": game.player_x.id,
    "player2": game.player_o.id,
})
enables queries like this:
fields @timestamp, correlation_id, message.action, session_id, location
| filter ispresent(message.action) AND message.action = 'MOVE'
| sort @timestamp desc
Updating Lambda functions
Lambda runs your code and will not update itself. If you want to emit logs, you have to update your code; there is no way around that.
CloudTrail
CloudTrail is designed as a security and governance tool. What you are trying to do is operational in nature (debugging). As such, logging and monitoring solutions like CW Logs are going to be your friends. While some of the data plane operations may end up in CloudTrail, CloudWatch or other logging solutions are better suited.

Related

Orchestrating lambda functionality similar to a strategy pattern

Is there a good way to orchestrate lambda functionality that changes based on a queue message? I was thinking about taking an approach similar to the one described in the strategy pattern.
The lambda function is polling an SQS queue. The queue message would contain some context that is passed into a lambda telling it what workflow needs to be executed. Based on this message, the lambda would execute some corresponding script.
The idea behind this is that I can write code for different ad hoc jobs and use the same queue + lambda function for these jobs but have it delegate the work. This way, I can track unsuccessful jobs in a dead letter queue. Are there any red flags here or potential pitfalls I should be aware of when you hear this? Any advice would be appreciated. TIA!
EDIT: For some additional context, the different workflows triggered by this lambda will vary in the compute resources they need. An example is ingesting a large dataset from an API call and doing some custom schematization on the contents before making another API call.
This is indeed possible, but there's a variety of approaches you may take. These depend on what type of workflow/processing you require.
As you highlight, Lambda could be used for this. It's worth noting that Lambda functions do not work well for computationally intensive or long-running tasks (executions are capped at 15 minutes).
If you were looking to perform a workflow with some complexity, you should consider AWS Step Functions. Suppose you had three "tasks" to choose from: you could define a Step Function for each, then use Lambda to (1) receive the message and work out which task is required, and (2) start an execution of the desired Step Function.
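The dispatch step can be sketched in plain Python (the WORKFLOWS registry, message shape, and workflow names are assumptions, not from the question; in practice each entry might instead start a Step Function execution via boto3):

```python
import json

# Hypothetical registry mapping a workflow name from the queue message
# to the callable (or, in practice, Step Function ARN) that handles it.
WORKFLOWS = {
    "ingest_dataset": lambda payload: f"ingested {payload['source']}",
    "reschematize": lambda payload: f"reschematized {payload['table']}",
}

def handler(event, context):
    results = []
    for record in event["Records"]:  # batch shape of an SQS-triggered Lambda
        msg = json.loads(record["body"])
        workflow = WORKFLOWS.get(msg["workflow"])
        if workflow is None:
            # Unknown job type: raising lets the message end up in the DLQ
            # once the redrive policy's max receive count is reached.
            raise ValueError(f"unknown workflow {msg['workflow']!r}")
        results.append(workflow(msg["payload"]))
    return results
```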
FYI, you don't need to make your Lambda function poll the SQS queue; instead, you can set up SQS to automatically trigger Lambda once a new message is added to the queue. See AWS Docs - Configuring a queue to trigger an AWS Lambda function.
If you edit your question with more info on what you're looking to do (processing-wise) with each message, people will be able to better help with your use-case.
Best of luck! :)

How to count metrics from executions of AWS lambdas?

I have all sorts of metrics I would like to count and later query. For example I have a lambda that processes stuff from a queue, and for each batch I would like to save a count like this:
{
    "processes_count": 6,
    "timestamp": 1695422215,
    "count_by_type": {
        "type_a": 4,
        "type_b": 2
    }
}
I would like to dump these pieces somewhere and later have the ability to query how many were processed within a time range.
So these are the options I considered:
write the JSON to the logs, and later have a component (Beats?) that processes these logs and sends them to a time-series DB.
at the end of each execution, send it directly to a time-series DB (like Elasticsearch).
What is better in terms of cost / scalability? Are there more options I should consider?
I think CloudWatch Embedded Metric Format (EMF) would be a good fit here. There are client libraries for Node.js, Python, Java, and C#.
CW EMF allows you to push metrics out of Lambda into CloudWatch in a managed, async way, so it's a cost-effective and low-effort way of producing metrics.
The client library writes a particular JSON format to stdout; when CloudWatch sees a message of this type, it automatically creates the metrics for you from it.
You can also include key-value pairs in the EMF payload, which allows you to go back and query the data with these keys in the future.
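Since EMF is just a JSON envelope printed to stdout, you can also emit it without a client library; a minimal sketch (the namespace, metric, and dimension names are illustrative):

```python
import json
import time

def emit_emf(processed_count, queue_name):
    # Minimal EMF envelope: CloudWatch parses the _aws block and turns
    # ProcessedCount into a metric, with QueueName as a dimension. The
    # plain top-level keys remain queryable via Logs Insights.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [["QueueName"]],
                "Metrics": [{"Name": "ProcessedCount", "Unit": "Count"}],
            }],
        },
        "QueueName": queue_name,
        "ProcessedCount": processed_count,
    }))
```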
High-level clients are available with Lambda Powertools in Python and Java.

List error logs from Stackdriver matching a pattern

I am evaluating approaches for a scenario where I need to fetch a list of logs from Stackdriver. There can be multiple filter criteria (e.g. the payload contains the word 'retry' and the log is of type 'Warning').
With the help of the GCP SDK I was able to query Stackdriver, but I'm not sure how efficient this approach is. Kindly suggest other approaches where I can use an Elasticsearch client to query Stackdriver and list matching logs.
It looks like you have multiple sets of logs that you wish to consume separately and each of those log sets can be described with a Stackdriver filter. This is a good start since running filters against Stackdriver is an effective way to sort your data. And you are right that running the same filter against Stackdriver over and over again would be pretty inefficient.
The following approach uses Stackdriver log sinks and this is how we manage logs on our GCP account. Our monitoring team is pretty happy with it and it's easy to maintain.
You can read up on log sinks here and aggregated log sinks here.
The general idea is to have Google automatically filter and export the logs for you using multiple log sinks (one sink per filter). The export destination can be Google Storage, BigQuery, or Pub/Sub. Each sink should export to a different location and will do so continuously as long as the sink exists. Also, log sinks can be set up per project or at the organization level (where it can inherit all projects underneath).
For example, let's say you want to set up three log sinks. Each sink uses a different filter and different export location (but all to the same bucket):
Log Sink 1 (compute logs) -> gs://my-example-log-dump-bucket/compute-logs/
Log Sink 2 (network logs) -> gs://my-example-log-dump-bucket/network-logs/
Log Sink 3 (linux logs) -> gs://my-example-log-dump-bucket/linux-logs/
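Creating a sink like these can be sketched with the gcloud CLI (the sink name, bucket, and filter are illustrative; note that a Cloud Storage sink's destination is a bucket, and Google writes date-partitioned objects under it automatically):

```
gcloud logging sinks create compute-logs-sink \
  storage.googleapis.com/my-example-log-dump-bucket \
  --log-filter='resource.type="gce_instance"'
```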
Once this is set up, your code's SDK can just access each location based on what logs it currently needs. This eliminates the need for your code to do the filtering since Google has already handled it for you in the background.
One thing to note: log exports to BigQuery and Pub/Sub are instant, but exports to Google Storage occur at the top of every hour. So if you need a fast turnaround on the logs, avoid Google Storage and go with either BigQuery or Pub/Sub.
Hope this helps!

How to efficiently log metrics from API Gateway when using cache?

My scenario is this:
An API Gateway with a single endpoint serves roughly 250 million requests per month, backed by a Lambda function.
Caching is enabled, and 99% of the requests hit the cache.
The request contains query parameters which we want to derive statistics from.
Since cache is used, most requests never hit the Lambda function. We have currently enabled full request/response logging in API Gateway to capture the query parameters in CloudWatch. Once a week, we run a script to parse the logs and compile the statistics we are interested in.
Challenges with this setup:
Our script takes ~5 hours to run, and only gives a snapshot for the last week. We would ideally be interested in tracking the statistics continuously over time, say every 5 minutes or every hour.
Using full request/response logging produces HUGE amounts of logs, most of which do not contain anything that we are interested in.
Ideally we would like to turn off full request/response logging but still get the statistics we are interested in. I have considered logging to CloudWatch from Lambda@Edge to be able to capture the query parameters before the request hits the cache, and then using a metric filter, or perhaps Kinesis, to get the statistics we want.
Would this be a viable solution, or can you propose another setup that could solve our problems more efficiently without costing too much?
You can configure access logging on your API (https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-logging.html), which lets you select the portions of the request and response you care about and publish more structured logs to CloudWatch.
You can then use CloudWatch filter patterns (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html) to generate metrics, or feed the logs to your analytics engine (or run a script as you do now).
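For example, an access log format that selects only a few fields (a sketch; these are common $context variables, so check the docs for the full list available to your API type):

```
{
  "requestId": "$context.requestId",
  "requestTime": "$context.requestTime",
  "httpMethod": "$context.httpMethod",
  "resourcePath": "$context.resourcePath",
  "status": "$context.status",
  "sourceIp": "$context.identity.sourceIp"
}
```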

Logstash aggregation based on 'temporary id'

I'm not sure if this sort of aggregation is best done after being indexed by elasticsearch or if logstash is a good place to do it.
We are logging information about commands run against a server. Each set of metrics regarding a single command is logged as a single log event, there are multiple 'metric sets' per command. Each metric is of its own document type in ES (currently at least). So we will have multiple events across multiple documents regarding one command run against the server.
Each of these events will have a 'cmdno' field which is a temporary id given to the command we are logging about. Once the command has finished with all events logged, the 'cmdno' may be reused for other commands.
Is it possible to use logstash 'aggregate' plugin to link the events of a single command together using the 'cmdno'? (or any plugin)
All events that pertain to a single command will have the same timestamp + cmdno. I would like to add a UUID to the events as a permanent unique id for that command, so that a single query will give us all events for that single command.
Was thinking along the lines of:
if [cmdno] {
  aggregate {
    task_id => "%{cmdno}"
    # give the first event for a cmdno a permanent UUID, then stamp
    # every later event in the same task with it
    code => "
      map['cmdid'] ||= SecureRandom.uuid
      event.set('cmdid', map['cmdid'])
    "
    timeout => 120  # flush the task so the cmdno can be reused later
  }
}
Just started learning the ELK stack, so I'm not entirely sure of the programming constructs Logstash affords me yet.
I don't know if there is a better way to relate these events, this seemed the most suitable for our needs, if there are more ELK'y methods please let me know, they do need to stay as separate documents of different types though.
Any help much appreciated, let me know if I am missing anything.
Cheers,
Brett
