List error logs from Stackdriver matching a pattern - elasticsearch

I am evaluating approaches for a scenario where I need to fetch a list of logs from Stackdriver. There can be multiple filter criteria (e.g. the payload contains the word 'retry' in logs of type 'Warning' ...)
With the help of the GCP SDK I was able to query Stackdriver, but I am not sure how efficient this approach is. Kindly suggest other approaches where I can use an Elasticsearch client to query Stackdriver and list matching logs.

It looks like you have multiple sets of logs that you wish to consume separately and each of those log sets can be described with a Stackdriver filter. This is a good start since running filters against Stackdriver is an effective way to sort your data. And you are right that running the same filter against Stackdriver over and over again would be pretty inefficient.
The following approach uses Stackdriver log sinks and this is how we manage logs on our GCP account. Our monitoring team is pretty happy with it and it's easy to maintain.
You can read up on log sinks here and aggregated log sinks here.
The general idea is to have Google automatically filter and export the logs for you using multiple log sinks (one sink per filter). The export destination can be Google Storage, BigQuery, or Pub/Sub. Each sink should export to a different location and will do so continuously as long as the sink exists. Also, log sinks can be set up per project or at the organization level (where it can inherit all projects underneath).
For example, let's say you want to set up three log sinks. Each sink uses a different filter and different export location (but all to the same bucket):
Log Sink 1 (compute logs) -> gs://my-example-log-dump-bucket/compute-logs/
Log Sink 2 (network logs) -> gs://my-example-log-dump-bucket/network-logs/
Log Sink 3 (linux logs) -> gs://my-example-log-dump-bucket/linux-logs/
Once this is set up, your code can use the SDK to access each location based on what logs it currently needs. This eliminates the need for your code to do the filtering, since Google has already handled it for you in the background.
One thing to note: log exports to BigQuery and Pub/Sub are instant, but exports to Google Storage occur at the top of every hour. So if you need a fast turnaround on the logs, avoid Google Storage and go with either BigQuery or Pub/Sub.
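If it helps, here is a minimal sketch of how consuming code could read one log set straight from its export prefix. It assumes the google-cloud-storage client library and the example bucket layout above; prefix_for is just an illustrative helper:

```python
# Sketch: read pre-filtered logs from a sink's export location.
# Assumes the example bucket layout above (gs://my-example-log-dump-bucket/).

def prefix_for(log_type):
    """Map a log set name (e.g. 'compute') to its export prefix in the bucket."""
    return "{}-logs/".format(log_type)

def read_exported_logs(bucket_name, log_type):
    """Yield the raw contents of each exported log object for one log set."""
    from google.cloud import storage  # pip install google-cloud-storage
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in bucket.list_blobs(prefix=prefix_for(log_type)):
        yield blob.download_as_bytes()
```

For example, read_exported_logs("my-example-log-dump-bucket", "network") would iterate over everything Log Sink 2 has exported so far, with no filtering left to do in your code.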
Hope this helps!

Related

Enrich CloudWatch and CloudTrail with custom Lambda invocation input

Problem:
I have an application with many lambda functions. However, most of them never log anything. That makes it hard to retrieve anything when there's a problem.
We use CloudWatch and CloudTrail. But the CloudWatch logs are often empty (just the start/stop is shown).
When we do find an event, it's difficult to get a full invocation trail, because each lambda has its own log group, so we often have to look through multiple log files. That is basically something CloudTrail could help us with ...
However, CloudTrail isn't of much use either, because there are more than 1000 invocations each minute. While all events are unique, most of them look identical inside CloudWatch. That makes it hard to filter them. (e.g. There's no URL to filter on, as most of our events are first queued in SQS, and only later handled by a lambda. Because of that, there isn't any URL to search on in CloudTrail.)
On a positive side, for events that are coming from an SQS, we have a DLQ configured, which we can poll to see what the failed events look like. However, then still, it's hard to find the matching CloudTrail record.
Question:
To get more transparency,
is there a convenient way to log the input body of all lambda invocations to CloudWatch? That would solve half of the problem.
And while doing so, is there a possibility to make recurring fields of the input searchable in CloudTrail?
Adding more metadata to a CloudTrail record would help us:
It would actually make it possible to filter, without hitting the 1000 results limit.
It would be easier to find the full CloudTrail for a given CloudWatch event or DLQ message.
Ideally, can any of this be done without changing the code of the existing lambda functions? (Simply, because there are so many of them.)
Have you considered emitting JSON logs from your Lambdas and using CloudWatch Logs Insights to search them? If you need additional custom metrics, I’d look at the Embedded Metric Format: https://aws.amazon.com/blogs/mt/enhancing-workload-observability-using-amazon-cloudwatch-embedded-metric-format/
I’d also recommend taking a look at some of the capabilities provided by Lambda Power Tools: https://awslabs.github.io/aws-lambda-powertools-python/2.5.0/
There are a few things in here so I'll attempt to break them down one by one:
Searching across multiple log groups
As @jaredcnance recommended, CloudWatch Logs Insights will enable you to easily and quickly search across multiple log groups. You can likely get started with a simple filter @message like /my pattern/ query.
I suggest testing with 1-2 log groups and a small-ish time window so that you can get your queries correct. Once you're happy, query all of your log groups and save the queries so that you can quickly and easily run them in the future.
Logging Lambda event payloads
Yes, you can easily do this with Lambda Power Tools. If you're not using Python, check the landing page to see if your runtime is supported. If you are using a Lambda runtime that doesn't have LPT support, you can log JSON output yourself.
When you log with JSON it's trivial to query with CW Logs Insights. For example, a Python statement like this:
from aws_lambda_powertools import Logger
logger = Logger()
logger.info({
    "action": "MOVE",
    "game_id": game.id,
    "player1": game.player_x.id,
    "player2": game.player_o.id,
})
enables queries like this:
fields @timestamp, correlation_id, message.action, session_id, location
| filter ispresent(message.action) AND message.action = 'MOVE'
| sort @timestamp desc
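If your runtime isn't covered by Lambda Power Tools, a stdlib-only JSON logger is enough to get the same queryability. This is just a sketch; the field names and helper are illustrative:

```python
# Minimal structured (JSON) logging with only the standard library.
# Each call emits one JSON line, which CW Logs Insights can query by field.
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_json(message, **fields):
    """Serialize the message plus any extra fields as a single JSON line."""
    record = {"message": message}
    record.update(fields)
    line = json.dumps(record)
    logger.info(line)
    return line

# Example: logs {"message": "player moved", "action": "MOVE", "game_id": 42}
log_json("player moved", action="MOVE", game_id=42)
```

Because every log line is valid JSON, Logs Insights discovers the fields automatically and queries like the one above work unchanged.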
Updating Lambda functions
Lambda runs your code and will not update itself. If you want to emit logs, you have to update your code. There is no way around that.
CloudTrail
CloudTrail is designed as a security and governance tool. What you are trying to do is operational in nature (debugging). As such, logging and monitoring solutions like CW Logs are going to be your friends. While some of the data plane operations may end up in CloudTrail, CloudWatch or other logging solutions are better suited.

using open telemetry UI components (jaeger/zipkin) with legacy storage/formats

We have a large set of deployed applications that implement our own form of instrumentation, which happens to be very similar to the OpenTelemetry standard.
Using NLog, we trace usage, activity, exceptions, and metrics.
Our traces include logger context, time fields, duration (span), and transaction-id. We do not trace parent-id.
All of that massive payload is stored in Elasticsearch.
We use different viewers and Kibana to display the log entries, but I'd like to view them as trees and full spans. For that purpose I want to use the Jaeger/Zipkin front ends as a display layer (with their search components).
Is it possible to do that?
Do those UIs offer a way to configure and map both our field names to the OpenTelemetry standard and our Elasticsearch indices as sources?

Flink web UI: Monitor Metrics doesn't work

run with flink-1.9.0 on yarn (2.6.0-cdh5.11.1), but the Flink web UI metrics don't work, as shown below:
I guess you are looking at the wrong metrics. Since no data flows from one task to another (you can see only one box in the UI), there is nothing to show. The metrics you are looking at only show the data that flows from one Flink task to another. In your example, everything happens within that one task.
Look at this example:
You can see two tasks sending data to the map task, which emits this data to another task. Therefore you see both incoming and outgoing data.
But on the other hand, a source task never has incoming data (I must admit that this is confusing at first glance):
The number of records received is 0, but it sends a couple of records to the downstream task.
Back to your problem: what you can do is have a look at the operator metrics. If you look at the metrics tab (the one at the very right), you can select, besides the task metrics, some operator metrics as well. These metrics have a name like 0.Map.numRecordsIn.
The name is assembled like this: <slot>.<operatorName>.<metricsname>. But be aware that these metrics are not recorded: you don't have any historic data, and once you leave this tab or remove a metric, the data collected until that point is gone. I would recommend using a proper metrics backend like InfluxDB, Prometheus, or Graphite. You can find a description in the Flink docs.
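For example, wiring up the Prometheus reporter only takes a couple of lines in flink-conf.yaml (the reporter name and port here are illustrative, and the flink-metrics-prometheus jar must be on Flink's classpath; check the metrics section of the docs for your Flink version):

```yaml
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249
```

After a restart, Flink exposes all task and operator metrics on that port for Prometheus to scrape, so you keep the history that the web UI throws away.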
Hope that helped.

How to efficiently log metrics from API Gateway when using cache?

My scenario is this:
API Gateway which has a single endpoint serves roughly 250 million requests per month, backed by a Lambda function.
Caching is enabled, and 99% of the requests hits the cache.
The request contains query parameters which we want to derive statistics from.
Since cache is used, most requests never hit the Lambda function. We have currently enabled full request/response logging in API Gateway to capture the query parameters in CloudWatch. Once a week, we run a script to parse the logs and compile the statistics we are interested in.
Challenges with this setup:
Our script takes ~5 hours to run, and only gives a snapshot for the last week. We would ideally be interested in tracking the statistics continuously over time, say every 5 minutes or every hour.
Using full request/response logging produces HUGE amounts of logs, most of which do not contain anything that we are interested in.
Ideally we would like to turn off full request/response logging but still get the statistics we are interested in. I have considered logging to CloudWatch from Lambda@Edge to be able to capture the query parameters before the request hits the cache, and then using a metric filter, or perhaps Kinesis, to get the statistics we want.
Would this be a viable solution, or can you propose another setup that could solve our problems more efficiently without costing too much?
You can configure access logging on your API (https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-logging.html), which gives you a way to select portions of the request and response and publish more structured logs to CloudWatch.
You can then use CloudWatch filter patterns (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html) to generate metrics, or feed the logs to your analytics engine (or keep running the script you run now).
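As a sketch, an access log format along these lines (supplied when enabling access logging; which $context variables are available depends on your API type, so verify against the access logging docs) gives you one compact JSON line per request instead of a full request/response dump:

```json
{
  "requestId": "$context.requestId",
  "ip": "$context.identity.sourceIp",
  "requestTime": "$context.requestTime",
  "httpMethod": "$context.httpMethod",
  "resourcePath": "$context.resourcePath",
  "status": "$context.status",
  "responseLength": "$context.responseLength"
}
```

Because each line is JSON, both metric filters and Logs Insights can pick out individual fields directly, which should make the weekly statistics script much cheaper (or unnecessary).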

Using my own Cassandra driver to write aggregations results

I'm trying to create a simple application which writes the page views of each web page on my site to Cassandra. Every 5 minutes, I want to write the page views accumulated since the start of the logical hour.
My code for this looks something like this:
KTable<Windowed<String>, Long> hourlyPageViewsCounts = keyedPageViews
    .groupByKey()
    .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(60)), "HourlyPageViewsAgg");
Where I also set my commit interval to 5 minutes via the COMMIT_INTERVAL_MS_CONFIG property. To my understanding, that should aggregate over a full hour and output the intermediate accumulated state every 5 minutes.
My questions now are two:
Given that I have my own Cassandra driver, how do I write the 5 min intermediate results of the aggregation to Cassandra? Tried to use foreach but that doesn't seem to work.
I need a write only after 5 min of aggregation, not on each update. Is it possible? Reading here suggests it might not without using low-level API, which I'm trying to avoid as it seems like a simple enough task to be accomplished with the higher level APIs.
Committing and producing/writing output are two different concepts in the Kafka Streams API. In the Kafka Streams API, output is produced continuously, and commits are used to "mark progress" (i.e., to commit consumer offsets, including flushing all stores and buffered producer records).
You might want to check out this blog post for more details: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
1) To write to Cassandra, it is recommended to write the result of your application back into a topic (via .to("topic-name")) and use Kafka Connect to get the data into Cassandra.
Compare: External system queries during Kafka Stream processing
2) Using the low-level API is the only way to go (as you pointed out already) if you want strict 5-minute intervals. Note that the next release (Kafka 1.0) will include wall-clock-time punctuations, which should make it easier for you to achieve your goal.
