Get timestamps for stack calls in a crash dump - debugging

I have a process that crashes unexpectedly.
Around the time of the crash, I see an error in the log infrastructure process, and then it shuts down gracefully.
I'm trying to understand which process is causing the problem: is the log infra reacting to my process's crash, or the other way around?
To do that, I'm looking at the crash dump my process produced (taken with ADPlus) and trying to determine exactly when the first exit-related method was called, so I can compare it with the time of the log infra error and shutdown.
How can I do that? Is there a way to get timestamps for the method calls in a stack?
Thanks.

Attach WinDbg, or start your app under WinDbg, and turn on timestamps in the output:
.echotimestamps 1
This will insert timestamps into the output for all events such as exceptions, thread creation and so on; see the MSDN documentation for .echotimestamps.
I would also start writing a log to disk as soon as WinDbg attaches:
.logopen c:\temp\mylog.txt
to capture the output. This should achieve what you want.
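Putting it together, a minimal session might look like this once WinDbg is attached (the log path is the one from above; use any writable location):
$$ timestamp every debugger event (exceptions, thread creation/exit, module loads)
.echotimestamps 1
$$ mirror all debugger output to a file on disk
.logopen c:\temp\mylog.txt
$$ resume the target and let it run until the crash
g
$$ once you are done, close the log so everything is flushed to disk
.logclose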


Why are there holes in my CloudWatch logs?

I have been running Lambdas written in C# with the serverless.com framework for some months now, and I consistently notice holes in the CloudWatch logs. So far it has only been an annoyance, but it is getting to the point where I need to understand and fix the problem.
For instance, today the Lambda monitor shows hundreds to thousands of executions between 7AM and 8AM, but CloudWatch shows log entries only up until 7:19AM and then nothing again until 8:52AM.
What is going on here?
Logs are written per invocation of the Lambda, while log streams correspond to concurrent executions. If you look at your Lambda metrics, you will see a stat called ConcurrentExecutions - this is the total number of Lambda containers you have running at any given moment - and that is NOT the same as Invocations. The headless project I'm on does about 5k invocations an hour, and we've never been above 5 concurrent executions across any of our 25-ish Lambdas (it helps that, once started up, they all run in about 300ms).
So if you have 100 invocations in 10 seconds, but they all take less than a second to run, then once a given Lambda container is spun up it will be reused for as long as it keeps receiving events. This is how AWS works around the 'cold start' problem as much as possible, where a given Lambda may take 10-15 seconds or more to start up. By trying to predict traffic flow (and you can tune these settings as well), AWS attempts to have a warm Lambda ready to go whenever you need one.
These concurrent executions are gradually shut down as their volume drops off, with their traffic routed back to the containers that are still active.
What this means for log group logs is twofold:
you may see large 'gaps' in the times, but if you look closely, any given log group will have multiple invocations in it.
log groups are delayed by several seconds to several minutes depending on the server load, so at any given time you may not actually be seeing all the logs of a given moment.
The other possibility is that your logging is not set up correctly (Python Lambdas in particular have difficulty logging properly to CloudWatch - the default logging handler doesn't play nicely with the way Lambda boots up a handler and attaches it to the log group), or that what you are getting is a ton of hits that are not actually doing anything - only pings/keep-alive events that never reach any of your log statements - in which case you will generally only see the container start-up/shutdown log statements (which, as stated above, are far fewer).
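For the Python logging case mentioned above, the usual workaround is to reuse the handler that the Lambda runtime already attaches to the root logger and just adjust the level; a minimal sketch (the handler body is illustrative):
# Sketch for a Python Lambda: reuse the root logger that the AWS runtime
# already configured instead of adding your own handler.
import logging

logger = logging.getLogger()      # root logger, handler pre-attached by Lambda
logger.setLevel(logging.INFO)     # default level is WARNING, so INFO lines never show up

def handler(event, context):
    logger.info("invocation received: %s", event)
    return {"statusCode": 200}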
What do you mean by gaps in log groups?
A log group gets its logs from log streams, and invocations handled by the same Lambda container use the same log stream. So the most recent log stream in your log group may not be the one with the latest log entry.
Here you can read more about it:
https://dashbird.io/blog/how-to-save-hundreds-hours-debugging-lambda/
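If you want to see which stream actually received the latest events, you can order the streams by last event time rather than by name; a small boto3 sketch (the log group name is a placeholder):
# List log streams ordered by when they last received events.
import boto3

logs = boto3.client("logs")
resp = logs.describe_log_streams(
    logGroupName="/aws/lambda/my-function",   # placeholder log group
    orderBy="LastEventTime",
    descending=True,
)
for stream in resp["logStreams"]:
    print(stream["logStreamName"], stream.get("lastEventTimestamp"))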
While trying to edit my question with screenshots and tallies of the data, I came upon the answer. I thought it would be helpful for this to be a separate answer as it is extremely specific and enlightening.
The crux of the problem is that I didn't expect such huge gaps between invocation times and log write times. 12 minutes is an eternity compared to the work I have done in the past.
Consider this graph:
12:59 UTC should be 7:59AM CST. Counting the invocations between 12:59 and 13:08, I get roughly 110.
CloudWatch shows these log streams:
Looking at these log streams, there seems to be a large gap. The timestamp on a log stream is the "file close" time, and the log stream stamped 8:08:37 includes events from 12 minutes earlier.
So the timestamps on the log streams are not very useful for finding debug data. The "search all" feature has not been very helpful so far either: slow and very limited. I will look into some other method for crunching logs.

Need to write a single logfile from different SPs running concurrently in different sessions in PL/SQL

I want to write to a single log file (which gets created on a daily basis) from multiple SPs running in different sessions.
This is what I have done.
create or replace PKG_LOG:
procedure SP_LOGFILE_OPEN
step 1) Open the logfile:
LF_LOG := UTL_FILE.FOPEN(LV_FILE_LOC,O_LOGFILE,'A',32760);
end SP_LOGFILE_OPEN;
procedure SP_LOGFILE_write
step 1) Write the logs as per application need:
UTL_FILE.PUT_LINE(LF_LOG,'whatever i want to write');
step 2) Flush the content, as I want the logs to be written in real time:
UTL_FILE.FFLUSH(LF_LOG);
end SP_LOGFILE_write;
Now whenever I want to write a log from any stored procedure, I first call SP_LOGFILE_OPEN and then SP_LOGFILE_write (as many times as I want).
The problem is, if there are two stored procedures, say SP1 and SP2, and both of them try to open the file concurrently, it never throws an error or waits for the other to finish. Instead the file gets opened in both sessions, the ones where SP1 and SP2 are executing.
The content of SP1 (if it started running first) is written completely into the logfile, but the content from SP2 is only partially written. SP2 starts writing only when SP1's execution stops. Also, the initial content that SP2 was trying to write to the logfile gets lost due to FFLUSH.
As per my requirement, I don't want to lose SP2's content while SP1 is running.
Any suggestions please? I don't want to drop the idea of FFLUSH as I need the logs in real time.
Thanks.
You could use DBMS_LOCK to acquire a custom lock (or wait until the lock becomes available), then do your write, then release the lock. The writes have to be serialized.
But this will make your concurrency problem even worse. You're basically saying that all calls to this procedure must get in line and be processed one by one. And remember that disk I/O is slow, so your database is now only as fast as your disk.
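A minimal sketch of that approach, assuming a lock name of your choosing and the existing PKG_LOG procedures (the timeout value and write signature are illustrative):
-- Serialize access to the logfile with a named user lock.
DECLARE
  l_handle VARCHAR2(128);
  l_status INTEGER;
BEGIN
  -- map a lock name of your choosing to a lock handle
  DBMS_LOCK.ALLOCATE_UNIQUE(lockname => 'PKG_LOG_FILE_LOCK', lockhandle => l_handle);
  -- wait up to 10 seconds for exclusive access
  l_status := DBMS_LOCK.REQUEST(lockhandle        => l_handle,
                                lockmode          => DBMS_LOCK.X_MODE,
                                timeout           => 10,
                                release_on_commit => FALSE);
  IF l_status = 0 THEN
    PKG_LOG.SP_LOGFILE_WRITE('whatever i want to write');  -- your existing PUT_LINE + FFLUSH (signature assumed)
    l_status := DBMS_LOCK.RELEASE(l_handle);
  END IF;
END;
/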
Yours is a bad idea. Instead of writing directly to a file, simply enqueue the log message onto an Oracle Advanced Queue (AQ) and create a job that runs very frequently (every few seconds) to dequeue from the AQ. It is the procedure invoked by the job that actually writes to the file. This way you synchronize the different SP executions that are trying to log concurrently to the same file: the actual logging is done by one single SP invoked by the job.
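A rough sketch of that setup, with illustrative names for the payload type, queue table, and queue (the dequeue job that drains the queue and does the single UTL_FILE write is not shown):
-- one-time setup: payload type, queue table, queue
CREATE TYPE t_log_msg AS OBJECT (msg VARCHAR2(4000));
/
BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(queue_table => 'LOG_QT', queue_payload_type => 'T_LOG_MSG');
  DBMS_AQADM.CREATE_QUEUE(queue_name => 'LOG_Q', queue_table => 'LOG_QT');
  DBMS_AQADM.START_QUEUE(queue_name => 'LOG_Q');
END;
/
-- producers enqueue instead of touching the file
CREATE OR REPLACE PROCEDURE sp_log(p_msg VARCHAR2) IS
  l_opts  DBMS_AQ.ENQUEUE_OPTIONS_T;
  l_props DBMS_AQ.MESSAGE_PROPERTIES_T;
  l_msgid RAW(16);
BEGIN
  DBMS_AQ.ENQUEUE(queue_name         => 'LOG_Q',
                  enqueue_options    => l_opts,
                  message_properties => l_props,
                  payload            => t_log_msg(p_msg),
                  msgid              => l_msgid);
  COMMIT;
END;
/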

Oracle streams - waiting for redo when the redo file is gone forever

I have Streams configured, and it stopped working after a while.
To resync, I stopped all capture/apply processes and exported the tables from source to target.
After starting everything up again, it still says it is waiting for that same redo file.
Is it possible to "restart" Streams from a current file?
Do you want to "shift" your capture process ahead in time? You could try to switch your capture process to a certain SCN by invoking
DBMS_CAPTURE_ADM.ALTER_CAPTURE('YOUR_CAPTURE_NAME', start_scn => :SCN);
where :SCN is a valid system change number from which you want your messages to be captured. Take a look at DBMS_CAPTURE_ADM description.
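If what you want is to skip the missing redo entirely and start from "now", one possible sketch (capture name as in the call above; be aware that any changes in the skipped redo will not be replicated):
DECLARE
  l_scn NUMBER;
BEGIN
  -- current SCN on the source database
  l_scn := DBMS_FLASHBACK.GET_SYSTEM_CHANGE_NUMBER;
  DBMS_CAPTURE_ADM.ALTER_CAPTURE(capture_name => 'YOUR_CAPTURE_NAME',
                                 start_scn    => l_scn);
END;
/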

Erlang "system" memory section keeps growing

I have an application with the following pattern:
- 2 long-running processes that go into hibernation after some idle time, and their memory consumption goes down as expected
- N (0 < N < 100) worker processes that do some work and hibernate when idle for more than 10 seconds, or terminate if idle for more than two hours
During the night, when there is no activity, the process memory goes back to almost the same value it had at application start, which is expected as all the workers have died.
The issue is that "system" section keeps growing (around 1GB/week).
My question is how can I debug what is stored there or who's allocating memory in that area and is not freeing it.
I've already tested lists:keysearch/3 and it doesn't seem to leak memory, as that is the only native thing I'm using (no ports, no drivers, no NIFs, no BIFs, nothing). Erlang version is R15B03.
Here is the current erlang:memory() output (slight traffic, app started on Feb 03):
[{total,378865650},
{processes,100727351},
{processes_used,100489511},
{system,278138299},
{atom,1123505},
{atom_used,1106100},
{binary,4493504},
{code,7960564},
{ets,489944},
{maximum,402598426}]
This is a 64-bit system. As you can see, the "system" section is at ~270MB and "processes" is at around 100MB (which drops to ~16MB during the night).
It seems that I've found the issue.
I have a "process_killer" gen_server where processes can subscribe for periodic GC or kill. Its subscribe functions are called on each message received by some processes to postpone the GC/kill (something like re-arm).
This process performs an erlang:monitor if not already monitored to catch a dead process and remove it from watch list. If I comment our the re-subscription line on each handled message, "system" area seems to behave normally. That means it is a bug in my process_killer that does leak monitor refs (remember you can call erlang:monitor multiple times and each call creates a reference).
I was lead to this idea because I've tested a simple module which was calling erlang:monitor in a loop and I have seen ~13 bytes "system" area grow on each call.
The workers themselves were OK because they would die anyway taking their monitors along with them. There is one long running (starts with the app, stops with the app) process that dispatches all the messages to the workers that was calling GC re-arm on each received message, so we're talking about tens of thousands of monitors spawned per hour and never released.
I'm writing this answer here for future reference.
TL;DR: make sure you are not leaking monitor refs in a long-running process.
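For reference, a minimal sketch of the fix (function names are illustrative; dict-based since this is R15): keep exactly one monitor ref per pid and drop it when the subscription goes away.
%% Re-arm without creating a new monitor each time.
maybe_monitor(Pid, Monitored) ->
    case dict:is_key(Pid, Monitored) of
        true  -> Monitored;                           % already watched, reuse the existing ref
        false -> Ref = erlang:monitor(process, Pid),
                 dict:store(Pid, Ref, Monitored)
    end.

%% Drop the subscription and release the monitor ref.
forget(Pid, Monitored) ->
    case dict:find(Pid, Monitored) of
        {ok, Ref} -> erlang:demonitor(Ref, [flush]),
                     dict:erase(Pid, Monitored);
        error     -> Monitored
    end.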

How long does it take for Windows to close kernel handles when an application crashes?

I know Windows closes kernel handles when an application crashes, but if I want to wait on this event, can I be sure it will happen within milliseconds, or might it take a while? I would like to trigger a function the moment an application crashes, and I'm checking whether its handle is NULL, but it seems like I can't get a NULL value in this case.
How long it takes may vary depending upon many factors, including the implementation, the type of crash, etc. It might take a while.
If you want to know when a process has crashed, you should set up a "watchdog" thread or process that waits on the application's process handle using a function such as WaitForSingleObject. When the process dies, the handle will be signaled and you can act accordingly.
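A minimal sketch of such a watchdog (Win32 C; the PID is a placeholder):
/* Wait on the target's process handle and react as soon as it terminates. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD pid = 1234;  /* PID of the process to watch (placeholder) */
    DWORD exitCode = 0;
    HANDLE hProcess = OpenProcess(SYNCHRONIZE | PROCESS_QUERY_LIMITED_INFORMATION,
                                  FALSE, pid);
    if (hProcess == NULL) {
        printf("OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    /* Blocks until the process terminates, whether it crashed or exited normally. */
    WaitForSingleObject(hProcess, INFINITE);
    GetExitCodeProcess(hProcess, &exitCode);
    printf("process %lu terminated, exit code 0x%08lX\n", pid, exitCode);

    CloseHandle(hProcess);
    return 0;
}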
Windows does not close handles when an application "crashes" - it closes them when the process terminates, no matter how the process terminates. By the time this happens, the variables don't exist any more because the user-mode address space has been torn down.
What are you trying to do?
