In Performance Monitor, what is the difference between a counter log and a trace log? Are there any guidelines on when we should use each of them?
Counter logs sample their counters at a fixed, configurable interval (e.g. every minute), whereas trace logs record an event only when something actually happens.
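To make the distinction concrete, here is a minimal sketch of the two styles. It is not how Performance Monitor is implemented; the class names, the `simulate` helper, the "request_arrived" event and the queue-length counter are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CounterLog:
    """Samples a value at a fixed interval, whether or not anything changed."""
    interval: int
    samples: list = field(default_factory=list)

    def tick(self, now, value):
        # Only record on interval boundaries (polling).
        if now % self.interval == 0:
            self.samples.append((now, value))


@dataclass
class TraceLog:
    """Records an entry only when an event actually occurs."""
    events: list = field(default_factory=list)

    def on_event(self, now, name):
        self.events.append((now, name))


def simulate(duration=60, interval=15, event_times=(7, 8, 41)):
    """Hypothetical workload: a queue that grows by one at each event time."""
    counter, trace = CounterLog(interval), TraceLog()
    queue_length = 0
    for now in range(duration):
        if now in event_times:
            queue_length += 1
            trace.on_event(now, "request_arrived")
        counter.tick(now, queue_length)
    return counter.samples, trace.events


samples, events = simulate()
print(samples)  # one sample per interval, regardless of activity
print(events)   # one entry per event, with its exact time
```

The counter log gives a regular, cheap picture of a value over time but cannot tell you exactly when it changed; the trace log gives exact event times but produces output proportional to activity.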
I'm using Aspects Pointcut to trace the execution time of my SQL call.
Similar to - no logging for PerformanceMonitorInterceptor.
I'm able to see the trace logs with the respective response times. I see that the logged time is in nanoseconds. Is that the default unit, or is there a way to force the response time to be printed in milliseconds?
Please assist!
I have been running Lambdas using C# with the serverless.com framework for some months now, and I consistently notice holes in the CloudWatch logs. So far it has only been an annoyance. I have been looking around for an explanation, but it is getting to the point where I need to understand and fix the problem.
For instance, today I can see that the Lambda monitor shows hundreds to thousands of executions between 7 AM and 8 AM, but the CloudWatch logs show log files up until 7:19 AM and then nothing again until 8:52 AM.
What is going on here?
Logs are written per invocation of the Lambda, while log group links are created per concurrent execution. If you look at your Lambda metrics, you will see a stat called ConcurrentExecutions. This is the total number of simultaneous serverless Lambda containers you have running at any given moment, and it is NOT the same as Invocations. The headless project I'm on is doing about 5k invocations an hour, and we've never been above 5 concurrent executions for any of our 25 or so Lambdas (it helps that, once started up, they all run in about 300 ms).
So if you have 100 invocations in 10 seconds, but they all take less than a second to run, then once a given Lambda container is spun up it will be reused for as long as it keeps receiving events. This is how AWS works around the 'cold start' problem as much as possible, where a given Lambda may take 10-15 seconds or more to start up. By trying to predict traffic flow (and you can manipulate these settings as well), AWS attempts to have a warm Lambda ready to go whenever you need it.
As volume drops off, these concurrent executions are gradually shut down and their calls are routed to the ones that are still active.
What this means for Log Group logs is twofold:
You may see large 'gaps' in the times, but if you look closely, any given log group will have multiple invocations in it.
Log groups are delayed by several seconds to several minutes depending on the server load, so at any given time you may not actually be seeing all the logs for a given moment.
The other possibility is that your logging is not set up correctly (Python Lambdas in particular have difficulty logging properly to CloudWatch: the default logging handler doesn't play nicely with the way Lambda boots up a handler to attach it to the log group), or that you are getting a ton of hits that do not actually do anything, such as pings or keep-alive events that never trigger any of your log statements. In that case you will generally see only the concurrent startup/shutdown log statements, which, as stated above, are far fewer.
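As a rough sanity check on the invocation vs. concurrency numbers above, average concurrency can be estimated as arrival rate times average duration (Little's law). This is only a steady-state average under the assumption of evenly spread traffic, so bursts will push the real peak higher; the function name is made up for illustration.

```python
def average_concurrency(invocations_per_hour, avg_duration_seconds):
    """Estimate average concurrent executions as rate * duration.

    Assumes invocations are spread evenly over the hour; real traffic
    is bursty, so the observed peak can be well above this average.
    """
    arrival_rate_per_second = invocations_per_hour / 3600.0
    return arrival_rate_per_second * avg_duration_seconds


# The figures quoted above: about 5k invocations an hour at about 300 ms each.
print(round(average_concurrency(5000, 0.3), 2))

# 100 invocations in 10 seconds (36000 per hour) at up to 1 second each.
print(average_concurrency(100 * 360, 1.0))
```

With these figures the average comes out well below one concurrent execution, which fits with never seeing more than 5 in practice, and it explains why thousands of invocations can end up in only a handful of logs.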
What do you mean by gaps in log groups?
A log group receives its logs through log streams, and all invocations handled by one and the same Lambda container write to the same log stream. So the most recent log stream in your log group is not necessarily the one that has the latest log entry.
Here you can read more about it:
https://dashbird.io/blog/how-to-save-hundreds-hours-debugging-lambda/
While trying to edit my question with screenshots and tallies of the data, I came upon the answer. I thought it would be helpful for this to be a separate answer as it is extremely specific and enlightening.
The crux of the problem is that I didn't expect such huge gaps between invocation times and log write times. 12 minutes is an eternity compared to the work I have done in the past.
Consider this graph:
12:59 UTC should be 7:59 AM Central time (UTC-5, so strictly CDT rather than CST). Counting the invocations between 12:59 and 13:08, I get roughly 110.
CloudWatch shows these log streams:
Looking at these log streams, there seems to be a large gap. The timestamp on the log stream is the "file close" time. The log stream for 8:08:37 includes events from 12 minutes before.
So the timestamps on the log streams are not very useful for finding debug data. The "search all" feature has not been very helpful so far either: it is slow and very limited. I will look into some other method for crunching logs.
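One possible approach is to select streams by the time range of the events they contain rather than by the stream's displayed timestamp. The sketch below works on plain dictionaries shaped like the entries that the CloudWatch Logs DescribeLogStreams API returns (`logStreamName`, `firstEventTimestamp`, `lastEventTimestamp`, in epoch milliseconds). The stream names, the date and the helper names are hypothetical, and note that `lastEventTimestamp` may lag behind reality.

```python
from datetime import datetime, timezone


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def streams_overlapping(streams, start, end):
    """Return names of streams whose event range overlaps [start, end]."""
    start_ms, end_ms = to_epoch_ms(start), to_epoch_ms(end)
    return [
        s["logStreamName"]
        for s in streams
        if s["firstEventTimestamp"] <= end_ms
        and s["lastEventTimestamp"] >= start_ms
    ]


def utc(hour, minute):
    # Hypothetical date; only the time of day matters for the example.
    return datetime(2020, 1, 1, hour, minute, tzinfo=timezone.utc)


# Made-up stream metadata for illustration.
streams = [
    {"logStreamName": "stream-a",
     "firstEventTimestamp": to_epoch_ms(utc(12, 56)),
     "lastEventTimestamp": to_epoch_ms(utc(13, 8))},
    {"logStreamName": "stream-b",
     "firstEventTimestamp": to_epoch_ms(utc(12, 40)),
     "lastEventTimestamp": to_epoch_ms(utc(12, 50))},
    {"logStreamName": "stream-c",
     "firstEventTimestamp": to_epoch_ms(utc(13, 5)),
     "lastEventTimestamp": to_epoch_ms(utc(13, 20))},
]

print(streams_overlapping(streams, utc(12, 59), utc(13, 8)))
```

In practice the stream metadata would come from the API (for example boto3's `describe_log_streams`), and `filter_log_events` with `startTime`/`endTime` can search by event time across all streams in a group.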
I'm setting up monitoring for a process using New Relic. The process itself is an AWS Lambda that finishes running in around 15 seconds. Any time this process fails, I want an alert to be triggered and an email to be sent to me per the policy I've configured.
For testing purposes I'm causing the Lambda to fail in a QA environment multiple times in a row to see what gets picked up by New Relic, although in production the failure would only occur a couple of times (fewer than 3) per week, potentially a few days apart.
Here is the chart that depicts all of the failures, the NRQL query, and the thresholds. As we can see, the summed errors are well above the threshold but for some reason the alert email is not being dispatched. Any ideas?
Try increasing your evaluation offset under Condition Settings > Advanced Settings > Evaluation offset.
New Relic polls for Lambda metrics every 5 minutes, so if your offset is lower than this you may find that the alert doesn't fire.
In reality I've found this quite unreliable and I'd suggest setting quite a high offset initially to test the alert - maybe 20 or 30 minutes.
As far as I can tell, the red highlighted area is the timeframe in which the alert condition is being violated. The alert should have been triggered, so check your notification channel and try sending a test notification.
I've added traces to measure the execution of some code running on GCP. The traces appear in the trace list on the StackDriver page and I can see their duration. What I cannot find is the number of times each trace was issued. Where does this number appear?
So the comments confirm that StackDriver Trace is not usable for collecting information about the number (density) of events along the timeline. I'd expected to get this data from the trace, but I will have to find other tools to obtain it.
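If the timestamps of the traced calls are also recorded by the application itself, counting them per time bucket is straightforward. This is only a sketch of that idea, independent of StackDriver; the function name and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def events_per_minute(timestamps):
    """Count events per minute; returns (minute, count) pairs in time order."""
    buckets = Counter(
        ts.replace(second=0, microsecond=0) for ts in timestamps
    )
    return sorted(buckets.items())


# Hypothetical timestamps collected at the points where traces are issued.
recorded = [
    datetime(2020, 1, 1, 10, 0, 5),
    datetime(2020, 1, 1, 10, 0, 42),
    datetime(2020, 1, 1, 10, 2, 1),
]

for minute, count in events_per_minute(recorded):
    print(minute.isoformat(), count)
```

Minutes with no events simply do not appear in the result, so gaps in the timeline have to be filled in separately if a continuous density plot is needed.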
I have a process that crashes unexpectedly.
At about the same time the crash occurs, I see an error in the log infrastructure process, which then shuts down gracefully.
I'm trying to understand which of the processes is causing the problem: is the log infra causing my process to crash, or is it the other way around?
In order to do that, I'm looking at the crash dump my process produced (taken with adplus) and trying to work out at exactly what time the first exit-related method was called, so that I can compare it with the log infra error time and shutdown time.
How can I do that? Is there a way to get timestamps for the method calls in the stack?
Thanks.
Attach WinDbg or start your app with WinDbg and change the show time stamps parameter:
.echotimestamps 1
This will insert timestamps into the output for all events such as exceptions, thread creations, etc. See this MSDN link.
I would also write a log to disk immediately once WinDbg attaches:
.logopen c:\temp\mylog.txt
to capture the output. This should achieve what you want.