Meaning of unwind_code operation 6 - windows

In windows SEH data, there exists a facility to define for a given RUNTIME_FUNCTION some pseudo-operations which describe its effect on the stack. They are defined here in the documentation for UNWIND_CODE: http://msdn.microsoft.com/en-us/library/ck9asaa9.aspx
However, Microsoft have only documented version 1! In some things that ship with windows (such as notepad), the given version for the exception handling data is 2. It looks to me as though they are very similar (see also http://sourceforge.net/mailarchive/message.php?msg_id=31650579), except that there is a mysterious and very common operation code 6, which is not defined in the version 1 documentation.
From my admittedly small sample, it appears that the op info is always 0 or 1, and the operation appears to consume only one slot in the unwind op data.
It seems that in ReactOS, it is taken to mean UWOP_SAVE_XMM (see http://doxygen.reactos.org/d4/d85/rtl_2amd64_2unwind_8c_a552b9ed5f66cb1a0015f7523bfce2eb9.html#a552b9ed5f66cb1a0015f7523bfce2eb9). This is an unlikely meaning for it in this case (the "version 2" case), since the following slots contain valid unwind operations rather than sane memory offsets.
What is the meaning of unwind operation 6?

Related

what is meaning of pgd_bad, pmd_bad, pud_bad, while converting kernel addressess?

Can somebody please explain below macros in kernel page tables?
#define pgd_bad(pgd) (!(pgd_val(pgd) & 2))
#define pmd_bad(pmd) (!(pmd_val(pmd) & 2))
#define pud_bad(pud) (!(pud_val(pud) & 2))
As originally defined in the now rather out of date (but excellent) Understanding the Linux Virtual Memory Manager _bad() macros determine whether a specified page table entry is not in a suitable state for modification.
However it turns out the macros are slightly nebulous across architectures, and might better be described in arm64 as determining whether an entry does not contain a reference to a page containing the table for the next level page table - see this commit for some details.
Expanding on #Notlikethat's answer, the macros do refer to the 'table' bit, which in turn (at least in linux kernel use, there might be more detailed architecture-specific details at play) determines whether the entry's physical address refers to a huge page (64KiB - see the arm64 memory layout doc) or not.
If we look at pte_huge() which determines whether a page entry refers to a huge page or not we see:
#define pte_huge(pte) (!(pte_val(pte) & PTE_TABLE_BIT))
This indicates that if the bit is set the page size is 4KiB, if it's cleared it's 64KiB and therefore huge.
This leads to the conclusion that, in arm64, the pXX_bad() macros return true if huge pages are used.
However, the definitions you give are actually not used if huge tables are enabled :)
Looking again at the memory layout doc referenced above, it turns out that the huge page table layout uses only 2 levels. The way linux deals with architectures which have less than 4 page table levels is to 'fold in' non-existent levels into existing tables, and let the compiler remove all of the unnecessary code.
And if we look at arch/arm64/include/asm/pgtable.h above the pgd_bad() definition (on line 445 at the time of writing), we see:
#if CONFIG_PGTABLE_LEVELS > 3
And above the pud_bad() definition (on line 392 at the time of writing) we see:
#if CONFIG_PGTABLE_LEVELS > 2
So in fact since CONFIG_PGTABLE_LEVELS == 2 in the huge table case, different definitions are used.
In arch/arm64/include/asm/pgtable-types.h we see (at line 89 at the time of writing):
#if CONFIG_PGTABLE_LEVELS == 2
#include <asm-generic/pgtable-nopmd.h>
#elif CONFIG_PGTABLE_LEVELS == 3
#include <asm-generic/pgtable-nopud.h>
#endif
So in fact definitions from include/asm-generic/pgtable-nopmd.h are used, which itself imports include/asm-generic/pgtable-nopud.h, giving us:
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pud_bad(pud_t pud) { return 0; }
And we already had:
#define pmd_bad(pmd) (!(pmd_val(pmd) & 2))
Which means that any huge page PMD entry which has pmd_bad() called on it will return true. However, if you look at the commit I mentioned above, you'll see that prior to the commit, a _bad() returning true would result in code handling the 'section map' case, where presumably the PMD contains different metadata (I don't want to dive too deep here :) and separate code is used, now that's handled via pmd_sect() and we don't care about what pmd_bad() says, if the entry is a section map.
Looking at arch/arm64/mm/mmu.c, pmd_set_huge() (line 828 at the time of writing), it seems like huge table PMDs are set as section maps meaning that we don't need to care about pmd_bad() (I may be mistaken here, I've not looked deep or gone through with a debugger, but it seems this way.)
So it seems overall the _bad() case now means something different - it just indicates that a bug has resulted in some code that's meant to be handling a non-section map/huge page table entry is handling a huge section map page table entry, which clearly needs to be indicated.
NOTE: Since I created an account just to reply here, I don't have enough reputation to give more links :) a kind person with more rep might want to add links where they make sense to, e.g. the 'at line XX at the time of writing' entries.)
In the ARMv8 64-bit page table descriptor format, for a valid (i.e. bit 0 set) level 0, 1, or 2 entry, bit 1 discriminates between table and block (hugepage) entries. Thus these macros will return false if the given entry has bit 1 set, indicating the expected table entry, or true if it is clear indicating a block or invalid entry. The p*d_val() accessors are simply wrappers for optionally enforcing type-safety.

WinDbg not showing register values

Basically, this is the same question that was asked here.
When performing kernel debugging of a machine running Windows 7 or older, with WinDbg version 6.2 and up, the debugger doesn't show anything in the registers window. Pressing the Customize... button results in a message box that reads Registers are not yet known.
At the same time, issuing the r command results in perfectly valid register values being printed out.
What is the reason for this behaviour, and can it be fixed?
TL;DR: I wrote an extension DLL that fixes the bug. Available here.
The Problem
To understand the problem, we first need to understand that WinDbg is basically just a frontend to Microsoft's Windows Symbolic Debugger Engine, implemented inside dbgeng.dll. Other frontends include the command-line kd.exe (kernel debugger) and cdb.exe (user-mode debugger).
The engine implements everything we expect from a debugger: working with symbol files, read and writing memory and registers, setting breakpoitns, etc. The engine then exposes all of this functionality through COM-like interfaces (they implement IUnknown but are not registered components). This allows us, for instance, to write our own debugger (like this person did).
Armed with this knowledge, we can now make an educated guess as to how WinDbg obtains the values of the registers on the target machine.
The engine exposes the IDebugRegisters interface for manipulating registers. This interface declares the GetValues method for retrieving the values of multiple registers in one go. But how does WinDbg know how many registers are there? That why we have the GetNumberRegisters method.
So, to retrieve the values of all registers on the target, we'll have to do something like this:
Call IDebugRegisters::GetNumberRegisters to get the total number of registers.
Call IDebugRegisters::GetValues with the Count parameter set to the total number of registers, the Indices parameter set to NULL, and the Start parameter set to 0.
One tiny problem, though: the second call fails with E_INVALIDARG.
Ehm, excuse me? How can it fail? Especially puzzling is the documentation for this return value:
The value of the index of one of the registers is greater than the number of registers on the target machine.
But I just asked you how many registers there are, so how can that value be out of range? Okay, let's continue reading the docs anyway, maybe something will become clear:
If the return value is not S_OK, some of the registers still might have been read. If the target was not accessible, the return type is E_UNEXPECTED and Values is unchanged; otherwise, Values will contain partial results and the registers that could not be read will have type DEBUG_VALUE_INVALID.
(Emphasis mine.)
Aha! So maybe the engine just couldn't read one of the registers! But which one? Turns out that the engine chokes on the xcr0 register. From the Intel 64 and IA-32 Architectures Software Developer’s Manual:
Extended control register XCR0 contains a state-component bitmap that specifies the user state components that software has enabled the XSAVE feature set to manage. If the bit corresponding to a state component is clear in XCR0, instructions in the XSAVE feature set will not operate on that state component, regardless of the value of the instruction mask.
Okay, so the register controls the operation of the XSAVE instruction, which saves the state of the CPU's extended features (like XMM and AVX). According to the last comment on this page, this instruction requires some support from the operating system. Although the comment states that Windows 7 (that's what the VM I was testing on was running) does support this instruction, it seems that the issue at hand is related to the OS anyway, as when the target is Windows 8 everything works fine.
Really, it's unclear whether the bug is within the debugger engine, which reports more registers than it can retrieve values for, or within WinDbg, which refuses to show any values at all if the engine fails to produce all of them.
The Solution
We could, of course, bite the bullet and just use an older version of WinDbg for debugging older Windows versions. But where's the challenge in that?
Instead, I present to you a debugger extension that solves this problem. It does so by hooking (with the help of this library) the relevant debugger engine methods and returning S_OK if the only register that failed was xcr0. Otherwise, it propagates the failure. The extension supports runtime unload, so if you experience problems you can always disable the hooks.
That's it, have fun!

Cuda kernel launch failure

I am trying to call two kernels as shown below
for (t=0; t<=time_total; t++)
{
//kernel calls
kernel1<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now when I execute this code, checkCudaError does not report anything and the code is executed giving some output but visual studio gives the following exception
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check on Nsight it says kernel 2 is having the following error
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that var array in kernel 2 is giving some of the rows correct some are copies of other row values and some are garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3
A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken.
Now, regarding your issue, out of resources is frequently due to a code requesting too many registers (too many registers per thread times the number of threads per threadblock requested.) Try re-compiling your code specifying -Xptxas -v to get verbose output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.

need explanation on why does EXCEPTION_ACCESS_VIOLATION occur

Hi I know that this error which I'm going to show can't be fixed through code. I just want to know why and how is it caused and I also know its due to JVM trying to access address space of another program.
A fatal error has been detected by the Java Runtime Environment:
EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x6dcd422a, pid=4024, tid=3900
JRE version: 6.0_14-b08
Java VM: Java HotSpot(TM) Server VM (14.0-b16 mixed mode windows-x86 )
Problematic frame:
V [jvm.dll+0x17422a]
An error report file with more information is saved as:
C:\PServer\server\bin\hs_err_pid4024.log
If you would like to submit a bug report, please visit:
http://java.sun.com/webapps/bugreport/crash.jsp
The book "modern operating systems" from Tanenbaum, which is available online here:
http://lovingod.host.sk/tanenbaum/Unix-Linux-Windows.html
Covers the topic in depth. (Chapter 4 is in Memory Management and Chapter 4.8 is on Memory segmentation). The short version:
It would be very bad if several programs on your PC could access each other's memory. Actually even within one program, even in one thread you have multiple areas of memory that must not influence one other. Usually a process has at least one memory area called the "stack" and one area called the "heap" (commonly every process has one heap + one stack per thread. There MAY be more segments, but this is implementation dependent and it does not matter for the explanation here). On the stack things like you function's arguments and your local variables are saved. On the heap are variables saved that's size and lifetime cannot be determined by the compiler at compile time (that would be in Java everything that you use the "new"-Operator on. Example:
public void bar(String hi, int myInt)
{
String foo = new String("foobar");
}
in this example are two String objects: (referenced by "foo" and "hi"). Both these objects are on the heap (you know this, because at some point both Strings were allocated using "new". And in this example 3 values are on the stack. This would be the value of "myInt", "hi" and "foo". It is important to realize that "hi" and "foo" don't really contain Strings directly, but instead they contain some id that tells them were on the heap they can find the String. (This is not as easy to explain using java because java abstracts a lot. In C "hi" and "foo" would be a Pointer which is actually just an integer which represents the address in the heap where the actual value is stored).
You might ask yourself why there is a stack and a heap anyway. Why not put everything in the same place. Explaining that unfortunately exceeds the scope of this answer. Read the book I linked ;-). The short version is that stack and heap are differently managed and the separation is done for reasons of optimization.
The size of stack and heap are limited. (On Linux execute ulimit -a and you'll get a list including "data seg size" (heap) and "stack size" (yeah... stack :-)).).
The stack is something that just grows. Like an array that gets bigger and bigger if you append more and more data. Eventually you run out of space. In this case you may end up writing in the memory area that does not belong to you anymore. And that would be extremely bad. So the operating systems notices that and stops the program if that happens. On Linux you get a "Segmenation fault" and on Windows you get an "Access violation".
In other languages like in C, you need to manage your memory manually. A tiny error can easily cause you to accidentally write into some space that does not belong to you. In Java you have "automatic memory management" which means that the JVM does all this for you. You don't need to care and that takes loads from your shoulders as a developer (it usually does. I bet there are people out there who would disagree about the "loads" part ;-)). This means that it /should/ be impossible to produce segmentation faults with java. Unfortunatelly the JVM is not perfect. Sometimes it has bugs and screws up. And then you get what you got.

CreateThread() fails on 64 bit Windows, works on 32 bit Windows. Why?

Operating System: Windows XP 64 bit, SP2.
I have an unusual problem. I am porting some code from 32 bit to 64 bit. The 32 bit code works just fine. But when I call CreateThread() for the 64 bit version the call fails. I have three places where this fails. 2 call CreateThread(). 1 calls beginthreadex() which calls CreateThread().
All three calls fail with error code 0x3E6, "Invalid access to memory location".
The problem is all the input parameters are correct.
HANDLE h;
DWORD threadID;
h = CreateThread(0, // default security
0, // default stack size
myThreadFunc, // valid function to call
myParam, // my param
0, // no flags, start thread immediately
&threadID);
All three calls to CreateThread() are made from a DLL I've injected into the target program at the start of the program execution (this is before the program has got to the start of main()/WinMain()). If I call CreateThread() from the target program (same params) via say a menu, it works. Same parameters etc. Bizarre.
If I pass NULL instead of &threadID, it still fails.
If I pass NULL as myParam, it still fails.
I'm not calling CreateThread from inside DllMain(), so that isn't the problem. I'm confused and searching on Google etc hasn't shown any relevant answers.
If anyone has seen this before or has any ideas, please let me know.
Thanks for reading.
ANSWER
Short answer: Stack Frames on x64 need to be 16 byte aligned.
Longer answer:
After much banging my head against the debugger wall and posting responses to the various suggestions (all of which helped in someway, prodding me to try new directions) I started exploring what-ifs about what was on the stack prior to calling CreateThread(). This proved to be a red-herring but it did lead to the solution.
Adding extra data to the stack changes the stack frame alignment. Sooner or later one of the tests gets you to 16 byte stack frame alignment. At that point the code worked. So I retraced my steps and started putting NULL data onto the stack rather than what I thought was the correct values (I had been pushing return addresses to fake up a call frame). It still worked - so the data isn't important, it must be the actual stack addresses.
I quickly realised it was 16 byte alignment for the stack. Previously I was only aware of 8 byte alignment for data. This microsoft document explains all the alignment requirements.
If the stackframe is not 16 byte aligned on x64 the compiler may put large (8 byte or more) data on the wrong alignment boundaries when it pushes data onto the stack.
Hence the problem I faced - the hooking code was called with a stack that was not aligned on a 16 byte boundary.
Quick summary of alignment requirements, expressed as size : alignment
1 : 1
2 : 2
4 : 4
8 : 8
10 : 16
16 : 16
Anything larger than 8 bytes is aligned on the next power of 2 boundary.
I think Microsoft's error code is a bit misleading. The initial STATUS_DATATYPE_MISALIGNMENT could be expressed as a STATUS_STACK_MISALIGNMENT which would be more helpful. But then turning STATUS_DATATYPE_MISALIGNMENT into ERROR_NOACCESS - that actually disguises and misleads as to what the problem is. Very unhelpful.
Thank you to everyone that posted suggestions. Even if I disagreed with the suggestions, they prompted me to test in a wide variety of directions (including the ones I disagreed with).
Written a more detailed description of the problem of datatype misalignment here: 64 bit porting gotcha #1! x64 Datatype misalignment.
The only reason that 64bit would make a difference is that threading on 64bit requires 64bit aligned values. If threadID isn't 64bit aligned, you could cause this problem.
Ok, that idea's not it. Are you sure it's valid to call CreateThread before main/WinMain? It would explain why it works in a menu- because that's after main/WinMain.
In addition, I'd triple-check the lifetime of myParam. CreateThread returns (this I know from experience) long before the function you pass in is called.
Post the thread routine's code (or just a few lines).
It suddenly occurs to me: Are you sure that you're injecting your 64bit code into a 64bit process? Because if you had a 64bit CreateThread call and tried to inject that into a 32bit process running under WOW64, bad things could happen.
Starting to seriously run out of ideas. Does the compiler report any warnings?
Could the bug be due to a bug in the host program, rather than the DLL? There's some other code, such as loading a DLL if you used __declspec(import/export), that occurs before main/WinMain. If that DLLMain, for example, had a bug in it.
I ran into this issue today. And I checked every argument feed into _beginthread/CreateThread/NtCreateThread via rohitab's Windows API Monitor v2. Every argument is aligned properly (AFAIK).
So, where does STATUS_DATATYPE_MISALIGNMENT come from?
The first few lines of NtCreateThread validate parameters passed from user mode.
ProbeForReadSmallStructure (ThreadContext, sizeof (CONTEXT), CONTEXT_ALIGN);
for i386
#define CONTEXT_ALIGN (sizeof(ULONG))
for amd64
#define STACK_ALIGN (16UI64)
...
#define CONTEXT_ALIGN STACK_ALIGN
On amd64, if the ThreadContext pointer is not aligned to 16 bytes, NtCreateThread will return STATUS_DATATYPE_MISALIGNMENT.
CreateThread (actually CreateRemoteThread) allocated ThreadContext from stack, and did nothing special to guarantee the alignment requirement is satisfied. Things will work smoothly if every piece of your code followed Microsoft x64 calling convention, which unfortunately not true for me.
PS: The same code may work on newer Windows (say Vista and newer). I didn't check though. I'm facing this issue on Windows Server 2003 R2 x64.
I'm in the business of using parallel threads under windows
for calculations. No funny business, no dll-calls, and certainly
no call-back's. The following works in 32 bits windows. I set up the stack for my calculation, well within the area reserved for my program.
All releveant data about area's and start addresses is contained in
a data structure that is passed to CreateThread as parameter 3.
The address that is called contains a small assembler routine
that uses this data stucture.
Indeed this routine finds the address to return to on the stack,
then the address of the data structure.
There is no reason to go far into this. It just works and it calculates
the number of primes below 2,000,000,000 just fine, in one thread,
in two threads or in 20 threads.
Now CreateThread in 64 bits doesn't push the address of the data
structure. That seems implausible so I show you the smoking gun,
a dump of a debug session.
In the subwindow at the bottom right you see the stack, and
there is merely the return address, amidst a sea of zeroes.
The mechanism I use to fill in parameters is portable between 32 and 64 bits.
No other call exhibits a difference between word-sizes.
Moreover why would the code address work but not the data address?
The bottom line: one would expect that CreateThread passes the data parameter on the stack in the same way in 64 bits as in 32 bits, then does a subroutine call. At the assembler level it doesn't work that way. If there are any hidden requirements to e.g. RSP that are automatically fullfilled in C++ that would be very nasty.
P.S. No there are no 16 byte alignment problems. That lies ages behind me.
Try using _beginthread() or _beginthreadex() instead, you shouldn't be using CreateThread directly.
See this previous question.

Resources