Procdump: How to capture usable IIS crash dump - windows

I have been trying to get a w3wp crash dump to see the crash callstack. I got two dumps, but both of them contain only a single thread - it seems almost as if the AppDomain has already been recycled and there is nothing useful left in the process by the time the dump was saved.
Command used: "procdump -mm -e -n 1 -l pt <PID>"
Also tried -ma for full dump but result is the same:
0:000> ~
. 0 Id: fa4.1dc8 Suspend: -1 Teb: 000000b1`77a78000 Unfrozen
I am not sure if I am missing something in the command, or IIS does not provide usable managed dumps when capturing them with procdump - any inputs are highly appreciated!
Additional detail: I was seeing a STACK_OVERFLOW exception being logged by procdump, which apparently needs a different method to capture a useful dump. See my own answer below for details.
It only took a few hours - hopefully this will save some time for others like me.

Found a way to do this:
procdump -mm -e 1 -l -f C00000FD.STACK_OVERFLOW -g <PID>
It works! Thanks to the unknown fellow member whose hint led me to this. I was reading too many pages and missed saving the page link to post an acknowledgement here.
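For reference, a hedged variant of the same command that writes a full dump to an explicit folder instead of a minidump (the folder is just an example; check procdump -? for the exact flag semantics of your version):
procdump -ma -e 1 -f C00000FD.STACK_OVERFLOW -g <PID> c:\dumps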

Related

How do I get handles information from a full-dump of a windows process?

I'm trying to debug an issue with what may be a handle leak. I have a dump created on a remote windows machine and I would like to see the handle information. I'm using WinDbg. I have seen some articles from MSDN and from other sources, like https://www.codeproject.com/Articles/6988/Debug-Tutorial-Part-5-Handle-Leaks, but I can't get it to work, so I need some help.
I tried the next
Using !handles or !devhandles - these have failed. I either get "no export handles found" or a failure to load kdexts; apparently kernel debugging is not enabled.
I found winxp/kdexts.dll in my path (given by the .chain command) but it wouldn't load - .load kdexts yields DebugExtensionInitialize failed with error code 0x80004005.
Using !handle - with !handle -? I get help for the command but when I try something else, I get "Unable to read handle information". For example,
!handle - I expected a full list of handles
!handle 0 0
!handle 0 0 file
My setup
The remote process is Windows server 2012 (64bit), just as my own machine
I'm using the latest WinDbg from the windows sdk 10
I have a full dump, created by right-clicking the process in Task Manager
I need some help if possible
Do I need to do kernel-debugging in order to view the list of handles from the dump?
Is it possible at all to do kernel-debugging on a full dump created from task-manager? Or is it required that the dump be taken differently?
How can I know if a given dump file includes the handle information?
How can I use the !handle command properly?
Is there any other alternative, such as using visual studio, another utility, etc.?
I'd appreciate any help
Your dump was probably taken without handle information.
You may use dumpchk.exe, which comes with the WinDbg installation, to see if a handle stream exists in the dump.
If you have control over dump creation, check how to use .dump /ma with WinDbg, or explore Sysinternals procdump.exe.
Also make sure you are using the correctly bitted debugger for the dump in question.
A sample path:
D:\>dir /s /b "c:\Program Files (x86)\Windows Kits\10\Debuggers\cdb.exe"
c:\Program Files (x86)\Windows Kits\10\Debuggers\arm\cdb.exe
c:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe
c:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe
c:\Program Files (x86)\Windows Kits\10\Debuggers\x86\cdb.exe
Here is a sample dump creation with and without the handle stream in the dump:
0:000> .dump /ma d:\madump.dmp
Creating d:\madump.dmp - mini user dump
Dump successfully written
0:000> .dump d:\nomadump.dmp
Creating d:\nomadump.dmp - mini user dump
Dump successfully written
0:000> q
Analysing both dumps with dumpchk and checking which streams are present:
dumpchk nomadump.dmp > nomachk.txt
dumpchk madump.dmp > machk.txt
D:\>type machk.txt |grep -i number.*stream
NumberOfStreams 17
D:\>type nomachk.txt |grep -i number.*stream
NumberOfStreams 13
diff
D:\>diff -y machk.txt nomachk.txt
Microsoft (R) Windows Debugger Version 10.0.17763.132 AMD64 Microsoft (R) Windows Debugger Version 10.0.17763.132 AMD64
Loading Dump File [D:\madump.dmp] | Loading Dump File [D:\nomadump.dmp]
User Mini Dump File with Full Memory: Only application d | User Mini Dump File: Only registers, stack and portions of me
----- User Mini Dump Analysis ----- User Mini Dump Analysis
MINIDUMP_HEADER: MINIDUMP_HEADER:
Version A793 (A063) Version A793 (A063)
NumberOfStreams 17 | NumberOfStreams 13
Flags 441826 | Flags 40000
0002 MiniDumpWithFullMemory <
0004 MiniDumpWithHandleData <
0020 MiniDumpWithUnloadedModules <
0800 MiniDumpWithFullMemoryInfo <
1000 MiniDumpWithThreadInfo <
40000 MiniDumpWithTokenInformation 40000 MiniDumpWithTokenInformation
400000 MiniDumpWithIptTrace <
If you feel enterprising, take a look here for some hints on deciphering a dump without WinDbg/dbgeng.
I forgot to post the result of doing !handle on both dumps:
D:\>cdb -c "!handle;q" -z nomadump.dmp |awk "/Reading/,/quit/"
0:000> cdb: Reading initial command '!handle;q'
ERROR: !handle: extension exception 0x80004002.
"Unable to read handle information"
quit:
D:\>cdb -c "!handle;q" -z madump.dmp |awk "/Reading/,/quit/"
0:000> cdb: Reading initial command '!handle;q'
Handle 0000000000000004
Type File
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxSNIPxxxxxxxxx
Handle 0000000000000128
Type Mutant
Handle 000000000000012c
Type
Handle 0000000000000180
Type File
70 Handles
Type Count
None 27
Event 13
File 8
Directory 2
Mutant 1
Semaphore 2
Key 6
IoCompletion 2
TpWorkerFactory 2
ALPC Port 1
WaitCompletionPacket 6
quit:
Check the tool which was used for creating the crash dump. Perhaps it provides an option to include handle data.
Task Manager includes handle data by default
Visual Studio includes handle data by default
In WinDbg, .dump can be used with the /mh switch to include handle data. /ma is a shortcut for /mfFhut, so it also includes handle data.
ProcDump automatically includes handle data.
Windows Error Reporting LocalDumps can be configured with a Registry value called CustomDumpFlags (see the sketch after this list).
If you create the dump programmatically yourself with MiniDumpWriteDump(), use MINIDUMP_TYPE::MiniDumpWithHandleData.
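As a rough sketch of the LocalDumps route, assuming the standard LocalDumps registry key (the dump folder is just an example, and 6 = MiniDumpWithFullMemory (0x2) + MiniDumpWithHandleData (0x4), the flags visible in the diff above):
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpFolder /t REG_EXPAND_SZ /d "d:\dumps" /f
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpType /t REG_DWORD /d 0 /f
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v CustomDumpFlags /t REG_DWORD /d 6 /f
DumpType 0 tells WER to use CustomDumpFlags instead of one of the predefined dump types; verify the exact values against the WER documentation before relying on them.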

How to trace dynamic instruction in spike (on RISC-V)

I'm new to Spike and RISC-V. I'm trying to do some dynamic instruction tracing with Spike. The instructions come from a simple.c file. I have tried the following commands:
$ riscv64-unknown-elf-gcc simple.c -g -o simple.out
$ riscv64-unknown-elf-objdump -d --line-numbers -S simple.out
But these commands display the assembled instructions in an output file, which is not what I want. I need to trace the dynamically executed instructions at runtime. I found only two relevant options among Spike's host options:
-g - track histogram of PCs
-l - generate a log of execution
I'm not sure whether either of these produces the result I described above.
Does anyone have an idea how to do the dynamic instruction trace in spike?
Thanks a lot!
Yes, you can call spike with -l to get a trace of all executed instructions.
Example:
$ spike -l --isa=RV64gc ~/riscv/pk/riscv64-unknown-elf/bin/pk ./hello 2> ins.log
Note that this trace also contains all instructions executed by the proxy-kernel - rather than just the trace of your user program.
The trace can still be useful, e.g. you can search for the start address of your code (i.e. look it up in the objdump output) and consume the trace from there.
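For example, a quick sketch of that lookup (the symbol main and the address placeholder are just illustrative; the exact trace format can differ between Spike versions):
$ riscv64-unknown-elf-objdump -d simple.out | grep '<main>:'    # entry address of main
$ grep -n '<address-of-main>' ins.log | head -n 1               # first point where the trace reaches it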
Also, when your program invokes a syscall you see something like this in the trace:
[.. inside your program ..]
core 0: 0x0000000000010088 (0x00000073) ecall
core 0: exception trap_user_ecall, epc 0x0000000000010088
core 0: 0x0000000080001938 (0x14011173) csrrw sp, sscratch, sp
[.. inside the pk ..]
sret
[.. inside your program ..]
That means you can skip the syscall instructions (which are executed in the pk) by searching for the next sret.
Alternatively, you can call spike with -d to enter debug mode. Then you can set a breakpoint on the first instruction of interest in your program (until pc 0 YOURADDRESS - look up the address in the objdump output) and single step from there (by hitting return multiple times). See also the help screen by entering h at the spike prompt.
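A rough sketch of such a debug-mode session, with <address-of-main> standing in for the address you looked up in the objdump output:
$ spike -d --isa=RV64gc ~/riscv/pk/riscv64-unknown-elf/bin/pk ./hello
: until pc 0 <address-of-main>
: pc 0
: reg 0 a0
Here until pc runs core 0 until it reaches the given address, pc and reg print the current program counter and a register, and an empty line single-steps; check the h help screen for the command set of your Spike build.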

Bash commands putting out extra information which results into issues with scripts

Okay, hopefully I can explain this correctly as I have no idea what's causing this or how to resolve this.
For some reason bash commands (on a CentOS 6.x server) are displaying more information than they normally do, and that causes issues with certain scripts. I have no clue if there is a name for this, but hopefully someone knows a solution.
First example.
Correct / good server:
[root@goodserver ~]# vzctl enter 3567
entered into CT 3567
[root@example /]#
(this is the correct behaviour)
Incorrect / bad server:
[root@badserver /]# vzctl enter 3127
Entering CT
entered into CT 3127
Open /dev/pts/0
[root@example /]#
With the "bad" server it will display more information as usual, like:
Entering CT
Open /dev/pts/0
It's as if it's printing extra information about what it's doing.
Of course the above is purely cosmetic; however, with several bash scripts we use, this extra output causes real problems.
A part of the script we use contains the following command (there are more, but this is mainly an example of what's wrong):
DOMAIN=`vzctl exec $VEID 'hostname -d'`
The result of the above command is then inserted into /etc/named.conf.
On the GOOD server it would be added in the named.conf like this:
zone "example.com" {
type master;
file "example.com";
allow-transfer {
200.190.100.10;
200.190.101.10;
common-allow-transfer;
};
};
The above is correct.
On the BAD server it would be added in the named.conf like this:
zone "Executing command: hostname -d
example.com" {
type master;
file "Executing command: hostname -d
example.com";
allow-transfer {
200.190.100.10;
200.190.101.10;
common-allow-transfer;
};
};
So it adds output describing the action it performs, in this example "Executing command: hostname -d".
Here is another example of running the command on a good server and on the bad server.
Bad server:
[root@bad-server /]# DOMAIN=`vzctl exec 3333 'hostname -d'`
[root@bad-server /]# echo $DOMAIN
Executing command: hostname -d example.com
Good server:
[root@good-server ~]# DOMAIN=`vzctl exec 4444 'hostname -d'`
[root@good-server ~]# echo $DOMAIN
example.com
My knowledge is limited, but I have tried several things checking rsyslog and the grub.conf, but nothing seems out of the ordinary.
I have no clue why it's displaying the extra information.
Probably it's something simple / stupid, but I have been trying to solve this for hours now and I really have no clue...
So any help is really appreciated.
Added information:
Both servers use: kernel.printk = 7 4 1 7
(I don't know if that's useful)
Well (thanks to Aaron for pointing me in the right direction) I finally found the little culprit that was causing all the issues I experienced with this script (which worked for every other server, so there was no need to change it).
The issues were caused by the VERBOSE level set in vz.conf (located in the /etc/vz/ directory). There is an option in there called VERBOSE, and in my case it was set to 3.
According to OpenVZ's website it does the following:
Increments logging level up from the default. Can be used multiple times.
Default value is set to the value of VERBOSE parameter in the global
configuration file vz.conf(5), or to 0 if not set by VERBOSE parameter.
After I changed VERBOSE=3 to VERBOSE=0 my script worked fine once again (as it did for every other server). :-)
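For reference, a minimal sketch of checking and resetting that value (assuming the stock /etc/vz/vz.conf location mentioned above):
# show the current logging verbosity
grep '^VERBOSE' /etc/vz/vz.conf
# reset it to the default of 0
sed -i 's/^VERBOSE=.*/VERBOSE=0/' /etc/vz/vz.conf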
So a big shoutout to Aaron for pointing me in the right direction. The answer is easy when you know where to look!
Sorry to say, but I am kinda disappointed by ndim's reaction. This is the 2nd time he was very unhelpful and rude in his response after that. He clearly didn't read the issue I posted correctly. Oh well.
I would make sure to properly parse the output of the command. In this case, we are only interested in lines of the form
entered into CT 12345
One way of doing this would be to pipe everything through sed and have sed print only the number when the line looks as above (in sed's basic regular expressions, the grouping parentheses and repetition braces need a leading backslash):
whateverthecommand | sed -n 's/^entered into CT \([0-9]\{1,\}\)$/\1/p'
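Applied to the verbose output shown for the bad server above, only the container ID survives:
$ printf 'Entering CT\nentered into CT 3127\nOpen /dev/pts/0\n' | sed -n 's/^entered into CT \([0-9]\{1,\}\)$/\1/p'
3127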

Random corruption in file created/updated from shell script on a singular client to NFS mount

We have a bash script (job wrapper) that writes to a file, launches a job, then at job completion appends information about the job to the file. The wrapper runs on one of several thousand batch nodes, but the problem has only cropped up with several batch machines (I believe RHEL6) accessing one NFS server, plus at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others take minutes.
In the same time period that this has occurred, there seems to be 10-50 issues out of 100,000+ jobs.
Here is what I believe to effectively be the (distilled) version of the job wrapper:
#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper
gotEXIT()
{
## end of script, however gotEXIT is called because we trap EXIT
END="EndTime: `date`\nStatus: Ended”
echo -e "${END}" >> job_info
cat job_info | sendmail jobtracker#example.com
}
trap gotEXIT EXIT
function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar
MSG="${email_metadata}\n${job_metadata}"
echo -e "${MSG}\nStatus: Started" | sendmail jobtracker@example.com
echo -e "${MSG}" > job_info
## At the job’s end, the output from `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
## 10-360 minutes later…
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
So I think there are two possibilities:
echo -e "${MSG}" > job_info
This command writes out corrupt data.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
This corrupts the existing data, then appends its own data correctly.
However, some jobs (but not all) call jobSetVar, and that output doesn't end up corrupt.
So, I dug into time.c (from GNU time 1.7) to see when the file is opened. To summarize, time.c is effectively this:
FILE *outfp;
void main (int argc, char** argv) {
const char **command_line;
RESUSE res;
/* internally, getargs opens "job_info", so outfp = fopen ("job_info", "a") */
command_line = getargs (argc, argv);
/* run_command doesn't care about outfp */
run_command (command_line, &res);
/* internally, summarize calls fprintf and putc on outfp FILE pointer */
summarize (outfp, output_format, command_line, &res);
fflush (outfp);
}
So, time has FILE *outfp (the job_info handle) open for the entire duration of the job. It then writes the summary at the end of the job and doesn't actually appear to close the file (I'm not sure whether that is necessary after fflush?). I've no clue if bash also has the file handle open concurrently.
EDIT:
A corrupted file typically consists of the corrupted part followed by the non-corrupted part. The corrupted section, which occurs before the non-corrupted section, is largely a run of 0x0000, with maybe some cyclic garbage mixed in. Here's an example hexdump:
40000000 00000000 00000000 00000000
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00]
Then, at the 409th byte, it continues with the non-corrupted section:
Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec 6 15:29:27 PST 2013
Status: Ended
Another file looks like this:
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
Followed by the non corrupted section:
Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended
There are other files that have much more random corruption sections, but more than a few were cyclical similar to above.
EDIT 2: The first email, sent from the echo -e statement goes through fine. The last email is never sent due to no email metadata from corruption. So MSG isn't corrupted at that point. It's assumed that job_info probably isn't corrupt at that point either, but we haven't been able to verify that yet. This is a production system which hasn't had major code modifications and I have verified through audit that no jobs have been ran concurrently which would touch this file. The problem seems to be somewhat recent (last 2 months), but it's possible it's happened before and slipped through. This error does prevent reporting which means jobs are considered failed, so they are typically resubmitted, but one user in specific has ~9 hour jobs in which this error is particularly frustrating. I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody has maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to see what updates the NFS servers at the time of these issues (RHEL6 I believe) were running.
Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual). You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slipping through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably the MSG goes out as a single packet and there might be something about it that makes checksum conflicts with the garbage you see more likely. Of course it could also be an NFS bug (client or server), a server-side filesystem bug, busted piece of RAM... possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file, making it a jumble.
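To check the NFS-over-UDP point above, a quick sketch on an affected client (the exact fields vary a bit between distributions):
# show negotiated options (look for proto=tcp or proto=udp) for each NFS mount
nfsstat -m
# or read the options straight from the mount table
grep nfs /proc/mounts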
You could also try using a different file for the 'time' output and then merging it into job_info at the end of the script. That may help isolate the problem further.
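A minimal sketch of that isolation idea, reusing the wrapper from the question (time_info is just an example name):
# write the time summary to a separate file instead of appending to job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -o time_info job_command
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
# merge the summary back into job_info at the very end
cat time_info >> job_info && rm -f time_info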
The shell opens the 'job_info' file for writing, outputs MSG, and should then close its file descriptor before launching the main job. The 'time' program opens the same file for append as a stream, and I suspect the seek over NFS is not done correctly, which may produce that garbage. I can't explain why, but normally this should not happen (and does not happen). Such rare occurrences may point to a race condition somewhere; they can be caused by out-of-sequence packet delivery (due to a network latency spike) or retransmits that produce duplicate data, or a bug somewhere. At first look I would suspect a bug, but it may be triggered by some network behaviour, e.g. an unusually large delay or a spike of packet loss.
File access from different processes is serialized by the kernel, but as an additional safeguard it may be worth adding some artificial delays - sleep timers between the outputs, for example.
The network is not transparent, especially a large one. There can be WAN optimization devices, which are known to cause application issues sometimes. CIFS and NFS are good candidates for WAN optimization with local caching of filesystem operations. It might be worth asking the network admins about recent changes.
Another thing to try, although it can be difficult given the rare occurrences, is capturing the interesting NFS sessions via tcpdump or wireshark. In really tough cases we do simultaneous captures on both the client and the server side and then compare the protocol logic to prove that the network is or is not working correctly. That's a whole topic in itself; it requires thorough preparation and luck, but it is usually the last resort of desperate troubleshooting :)
It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.
A bug fix was supplied to the linux-nfs implementation:
http://www.spinics.net/lists/linux-nfs/msg41357.html

Using logman to collect data

I am trying to use logman instead of the DDK tracelog for collecting *.etl data produced by my application, which uses WPP, but I was not able to see any data (in the *.etl) after reading the etl file and decoding the *.fmt information with traceview.
What am I doing wrong? I generate the *.etl like this:
logman start "Session" -o "Trace.etl" -p "{28EE579B-CF67-43b6-9D19-8930E7AAA131}" -ets
logman stop "Session" -ets
When opening the generated Trace.etl with traceview, it shows no errors, only that there is no collected data.
EDIT: I should add that I registered my generated *.mof file on the system using the MOF compiler, and that by using traceview directly I can see data.
The problem was that I didn't specify any flags; since WPP messages are logged with flags set, I was not seeing any data:
logman start "Session" -o "Trace.etl" -p "{28EE579B-CF67-43b6-9D19-8930E7AAA131}" 0xFFFF -ets
logman stop "Session" -ets
I was searching for the same issue and found this useful documentation on MSDN:
CLR ETW Keywords and Levels
The levels have the following meanings:
0x5 - Verbose
0x4 - Informational
0x3 - Warning
0x2 - Error
0x1 - Critical
0x0 - LogAlways
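Putting the two answers together, a hedged example that enables all keyword flags at Verbose level; the GUID is the one from the question, and which keywords your provider actually honours depends on your WPP setup:
logman start "Session" -o "Trace.etl" -p "{28EE579B-CF67-43b6-9D19-8930E7AAA131}" 0xFFFF 0x5 -ets
logman stop "Session" -ets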
