What is a context register in golang?

What is a context register in golang? - go

What is a context register and how does it change the way Go code is compiled?
Context: I've seen uses of stub functions in some parts of GOROOT, i.e. reflect, but am not quite sure how these work.

The expression "context register" first appeared in commit b1b67a3 (Feb. 2013, Go 1.1rc2) for implementing step 3 of the Go 1.1 Function Calls
Change the reflect.MakeFunc implementation to avoid run-time code generation as well
It was picked up in commit 4a000b9 in Feb. 2014, Go 1.3beta1, for assembly and system calls for Native Client x86-64, where sigreturn is managed as:
NaCl has abidcated its traditional operating system responsibility and declined to implement 'sigreturn'.
Instead the only way to return to the execution of our program is to restore the registers ourselves.
Unfortunately, that is impossible to do with strict fidelity, because there is no way to do the final update of PC that ends the sequence without either
(1) jumping to a register, in which case the register ends holding the PC value instead of its intended value, or
(2) storing the PC on the stack and using RET, which imposes the requirement that SP is valid and that is okay to smash the word below it.
The second would normally be the lesser of the two evils, except that on NaCl, the linker must rewrite RET into "POP reg; AND $~31, reg; JMP reg", so either way we are going to lose a register as a result of the incoming signal.
Similarly, there is no way to restore EFLAGS; the usual way is to use POPFL, but NaCl rejects that instruction.
We could inspect the bits and execute a sequence of instructions designed to recreate those flag settings, but that's a lot of work.
Thankfully, Go's signal handlers never try to return directly to the executing code, so all the registers and EFLAGS are dead and can be smashed.
The only registers that matter are the ones that are setting up for the simulated call that the signal handler has created.
Today those registers are just PC and SP, but in case additional registers are relevant in the future (for example DX is the Go func context register) we restore as many registers as possible.
Much more recently (Q4 2016), for Go 1.8, we have commit d5bd797 and commit bf9c71c, for eliminating stack rescanning:
morestack writes the context pointer to gobuf.ctxt, but since
morestack is written in assembly (and has to be very careful with
state), it does not invoke the requisite write barrier for this
write. Instead, we patch this up later, in newstack, where we invoke
an explicit write barrier for ctxt.
This already requires some subtle reasoning, and it's going to get a
lot hairier with the hybrid barrier.
Fix this by simplifying the whole mechanism.
Instead of writing gobuf.ctxt in morestack, just pass the value of the context register to newstack and let it write it to gobuf.ctxt. This is a normal Go pointer write, so it gets the normal Go write barrier. No subtle reasoning required.

Related

How does the tct command work under the hood?

The windbg command tct executes a program until it reaches a call instruction or a ret instruction. I am wondering how the debugger implements this functionality under the hood.
I could imagine that the debugger scans the instructions from the current instructions for the next call or ret and sets according breakpoints on the found instructions. However, I think this is unlikely because it would also have to take into account jmp instructions so that there are an arbitrary number of possible call or ret instructions where such a breakpoint would have to be set.
On the other hand, I wonder if the x86/x64 CPU provides a functionality that raises an exception to be caught by the debugger whenever the CPU is about to process a call or ret instruction. Yet, I have not heard of such a functionality.

I'd guess that it single-steps repeatedly, until the next instruction is a call or ret, instead of trying to figure out where to set a breakpoint. (Which in the general case could be as hard as solving the Halting Problem.)
It's possible it could optimize that by scanning forward over "straight line" code and setting a breakpoint on the next jmp/jcc/loop or other control-transfer instruction (e.g. xabort), and also catching signals/exceptions that could transfer control to an SEH handler.
I'm also not aware of any HW support for breaking on a certain type of instruction or opcode: the x86 debug registers DR0..7 allow hardware breakpoints at code addresses without rewriting machine code to int3, and also hardware watchpoints (to trap data load/store to a specific address or range of addresses). But not filtering by opcode.

In a Linux system call, are system call parameters preserved in registers after the syscall finished (at the sys_exit tracepoint)?

Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?
sysdig driver is a kernel module to capture syscall using kernel static tracepoint. In this project some of system call parameters are read at sys_enter tracepoint, and some other parameters are read at sys_exit (return value of course, and contents in userspace to avoid pagefault).
Why not read all parameters at sys_exit? Is this because some parameters may be not be available at sys_exit?

Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?
Yes... and no, we need to distinguish parameters from registers. Linux syscalls should preserve all general purpose userspace registers, except the register used for the return value (and on some architectures also a second register to indicate if an error occurred). However, this does not mean that the input parameters of the syscall cannot change between entry and exit: if a register holds the value of a pointer to some data, while the register itself does not change, the data it points to could very well change.
Looking at the code for the static tracepoint sys_exit, you can see that only the syscall number (id) and its return value (ret) are traced. See note at the bottom of my answer for more.
Why not read all parameters at sys_exit? Is this because some parameters may be not available at sys_exit?
Yes, I would say that ensuring the correctness of the traced parameters is the main reason why tracing only at the exit would be a bad idea. Even if you get the values of the register, you cannot know the real parameters at syscall exit. Even if a syscall per se is guaranteed to save and restore the state of user registers, the syscall itself can alter the data that is being passed as argument. For example, the recvmsg syscall takes a pointer to a struct msghdr in memory which is used both as an input and an output parameter; the poll syscall does the same with a pointer to struct pollfd. Furthermore, another thread or program could have very well modified the memory of the program while it was making a syscall, therefore altering the data.
Under specific circumstances a syscall can also take a very long time before returning (think for example of a sleep, or a blocking read on your terminal, an accept on a listening socket, etc). If you only trace at the exit, you will have very incorrect timing information, and most importantly you will have to wait a lot before any meaningful information can be captured, even though that information is already available at the entry point.
Note on sys_exit tracepoint
Although you could thecnically extract the values of the saved registers of the current task, I am not entirely sure about the semantics of doing so while in the sys_exit tracepoint. I searched for some documentation on this specific case, but had no luck, and kernel code is well... complex.
The chain of calls to reach the exit hook should be:
Arch specific entry point (e.g. entry_INT80_32 for x86 int 0x80)
Arch specific entry handler (e.g. do_int80_syscall_32() for x86 int 0x80)
syscall_exit_to_user_code()
syscall_exit_to_user_mode_prepare()
syscall_exit_work()
trace_sys_exit()
If a deadly signal is delivered to a process during a syscall, while the actual process will never reach the exit of the syscall (i.e. no value is ever returned to user space), the tracepoint will still be hit. When a signal delivery of this kind happens, a special internal return value is used, like -ERESTARTSYS (see here). This value is not an actual syscall return value (it is not returned to user space), but rather it is only meant to be used by kernel. So it looks like the sys_exit tracepoint is being hit with the special -ERESTARTSYS if a deadly signal is received by the process. This does not happen for example in the case of SIGSTOP + SIGCONT. Take this with a grain of salt though, since I was not able to find proper documentation for this.

Is it valid to write below ESP?

For a 32-bit windows application is it valid to use stack memory below ESP for temporary swap space without explicitly decrementing ESP?
Consider a function that returns a floating point value in ST(0). If our value is currently in EAX we would, for example,
PUSH EAX
FLD [ESP]
ADD ESP,4 // or POP EAX, etc
// return...
Or without modifying the ESP register, we could just :
MOV [ESP-4], EAX
FLD [ESP-4]
// return...
In both cases the same thing happens except that in the first case we take care to decrement the stack pointer before using the memory, and then to increment it afterwards. In the latter case we do not.
Notwithstanding any real need to persist this value on the stack (reentrancy issues, function calls between PUSHing and reading the value back, etc) is there any fundamental reason why writing to the stack below ESP like this would be invalid?

TL:DR: no, there are some SEH corner cases that can make it unsafe in practice, as well as being documented as unsafe. #Raymond Chen recently wrote a blog post that you should probably read instead of this answer.
His example of a code-fetch page-fault I/O error that can be "fixed" by prompting the user to insert a CD-ROM and retry is also my conclusion for the only practically-recoverable fault if there aren't any other possibly-faulting instructions between store and reload below ESP/RSP.
Or if you ask a debugger to call a function in the program being debugged, it will also use the target process's stack.
This answer has a list of some things you'd think would potentially step on memory below ESP, but actually don't, which might be interesting. It seems to be only SEH and debuggers that can be a problem in practice.
First of all, if you care about efficiency, can't you avoid x87 in your calling convention? movd xmm0, eax is a more efficient way to return a float that was in an integer register. (And you can often avoid moving FP values to integer registers in the first place, using SSE2 integer instructions to pick apart exponent / mantissa for a log(x), or integer add 1 for nextafter(x).) But if you need to support very old hardware, then you need a 32-bit x87 version of your program as well as an efficient 64-bit version.
But there are other use-cases for small amounts of scratch space on the stack where it would be nice to save a couple instructions that offset ESP/RSP.
Trying to collect up the combined wisdom of other answers and discussion in comments under them (and on this answer):
It is explicitly documented as being not safe by Microsoft: (for 64-bit code, I didn't find an equivalent statement for 32-bit code but I'm sure there is one)
Stack Usage (for x64)
All memory beyond the current address of RSP is considered volatile: The OS, or a debugger, may overwrite this memory during a user debug session, or an interrupt handler.
So that's the documentation, but the interrupt reason stated doesn't make sense for the user-space stack, only the kernel stack. The important part is that they document it as not guaranteed safe, not the reasons given.
Hardware interrupts can't use the user stack; that would let user-space crash the kernel with mov esp, 0, or worse take over the kernel by having another thread in the user-space process modify return addresses while an interrupt handler was running. This is why kernels always configure things so interrupt context is pushed onto the kernel stack.
Modern debuggers run in a separate process, and are not "intrusive". Back in 16-bit DOS days, without a multi-tasking protected-memory OS to give each task its own address space, debuggers would use the same stack as the program being debugged, between any two instructions while single-stepping.
#RossRidge points out that a debugger might want to let you call a function in the context of the current thread, e.g. with SetThreadContext. This would run with ESP/RSP just below the current value. This could obviously have side-effects for the process being debugged (intentional on the part of the user running the debugger), but clobbering local variables of the current function below ESP/RSP would be an undesirable and unexpected side-effect. (So compilers can't put them there.)
(In a calling convention with a red-zone below ESP/RSP, a debugger could respect that red-zone by decrementing ESP/RSP before making the function call.)
There are existing program that intentionally break when being debugged at all, and consider this a feature (to defend against efforts to reverse-engineer them).
Related: the x86-64 System V ABI (Linux, OS X, all other non-Windows systems) does define a red-zone for user-space code (64-bit only): 128 bytes below RSP that is guaranteed not to be asynchronously clobbered. Unix signal handlers can run asynchronously between any two user-space instructions, but the kernel respects the red-zone by leaving a 128 byte gap below the old user-space RSP, in case it was in use. With no signal handlers installed, you have an effectively unlimited red-zone even in 32-bit mode (where the ABI does not guarantee a red-zone). Compiler-generated code, or library code, of course can't assume that nothing else in the whole program (or in a library the program called) has installed a signal handler.
So the question becomes: is there anything on Windows that can asynchronously run code using the user-space stack between two arbitrary instructions? (i.e. any equivalent to a Unix signal handler.)
As far as we can tell, SEH (Structured Exception Handling) is the only real obstacle to what you propose for user-space code on current 32 and 64-bit Windows. (But future Windows could include a new feature.)
And I guess debugging if you happen ask your debugger to call a function in the target process/thread as mentioned above.
In this specific case, not touching any other memory other than the stack, or doing anything else that could fault, it's probably safe even from SEH.
SEH (Structured Exception Handling) lets user-space software have hardware exceptions like divide by zero delivered somewhat similarly to C++ exceptions. These are not truly asynchronous: they're for exceptions triggered by instructions you ran, not for events that happened to come after some random instruction.
But unlike normal exceptions, one thing a SEH handler can do is resume from where the exception occurred. (#RossRidge commented: SEH handlers are are initially called in the context of the unwound stack and can choose to ignore the exception and continue executing at the point where the exception occurred.)
So that's a problem even if there's no catch() clause in the current function.
Normally HW exceptions can only be triggered synchronously. e.g. by a div instruction, or by a memory access which could fault with STATUS_ACCESS_VIOLATION (the Windows equivalent of a Linux SIGSEGV segmentation fault). You control what instructions you use, so you can avoid instructions that might fault.
If you limit your code to only accessing stack memory between the store and reload, and you respect the stack-growth guard page, your program won't fault from accessing [esp-4]. (Unless you reached the max stack size (Stack Overflow), in which case push eax would fault, too, and you can't really recover from this situation because there's no stack space for SEH to use.)
So we can rule out STATUS_ACCESS_VIOLATION as a problem, because if we get that on accessing stack memory we're hosed anyway.
An SEH handler for STATUS_IN_PAGE_ERROR could run before any load instruction. Windows can page out any page it wants to, and transparently page it back in if it's needed again (virtual memory paging). But if there's an I/O error, your Windows attempts to let your process handle the failure by delivering a STATUS_IN_PAGE_ERROR
Again, if that happens to the current stack, we're hosed.
But code-fetch could cause STATUS_IN_PAGE_ERROR, and you could plausibly recover from that. But not by resuming execution at the place where the exception occurred (unless we can somehow remap that page to another copy in a highly fault-tolerant system??), so we might still be ok here.
An I/O error paging in the code that wants to read what we stored below ESP rules out any chance of reading it. If you weren't planning to do that anyway, you're fine. A generic SEH handler that doesn't know about this specific piece of code wouldn't be trying to do that anyway. I think usually a STATUS_IN_PAGE_ERROR would at most try to print an error message or maybe log something, not try to carry on whatever computation was happening.
Accessing other memory in between the store and reload to memory below ESP could trigger a STATUS_IN_PAGE_ERROR for that memory. In library code, you probably can't assume that some other pointer you passed isn't going to be weird and the caller is expecting to handle STATUS_ACCESS_VIOLATION or PAGE_ERROR for it.
Current compilers don't take advantage of space below ESP/RSP on Windows, even though they do take advantage of the red-zone in x86-64 System V (in leaf functions that need to spill / reload something, exactly like what you're doing for int -> x87.) That's because MS says it isn't safe, and they don't know whether SEH handlers exist that could try to resume after an SEH.
Things that you'd think might be a problem in current Windows, and why they're not:
The guard page stuff below ESP: as long as you don't go too far below the current ESP, you'll be touching the guard page and trigger allocation of more stack space instead of faulting. This is fine as long as the kernel doesn't check user-space ESP and find out that you're touching stack space without having "reserved" it first.
kernel reclaim of pages below ESP/RSP: apparently Windows doesn't currently do this. So using a lot of stack space once ever will keep those pages allocated for the rest of your process lifetime, unless you manually VirtualAlloc(MEM_RESET) them. (The kernel would be allowed to do this, though, because the docs say memory below RSP is volatile. The kernel could effectively zero it asynchronously if it wants to, copy-on-write mapping it to a zero page instead of writing it to the pagefile under memory pressure.)
APC (Asynchronous Procedure Calls): They can only be delivered when the process is in an "alertable state", which means only when inside a call to a function like SleepEx(0,1). calling a function already uses an unknown amount of space below E/RSP, so you already have to assume that every call clobbers everything below the stack pointer. Thus these "async" callbacks are not truly asynchronous with respect to normal execution the way Unix signal handlers are. (fun fact: POSIX async io does use signal handlers to run callbacks).
Console-application callbacks for ctrl-C and other events (SetConsoleCtrlHandler). This looks exactly like registering a Unix signal handler, but in Windows the handler runs in a separate thread with its own stack. (See RbMm's comment)
SetThreadContext: another thread could change our EIP/RIP asynchronously while this thread is suspended, but the whole program has to be written specially for that to make any sense. Unless it's a debugger using it. Correctness is normally not required when some other thread is messing around with your EIP unless the circumstances are very controlled.
And apparently there are no other ways that another process (or something this thread registered) can trigger execution of anything asynchronously with respect to the execution of user-space code on Windows.
If there are no SEH handlers that could try to resume, Windows more or less has a 4096 byte red-zone below ESP (or maybe more if you touch it incrementally?), but RbMm says nobody takes advantage of it in practice. This is unsurprising because MS says not to, and you can't always know if your callers might have done something with SEH.
Obviously anything that would synchronously clobber it (like a call) must also be avoided, again same as when using the red-zone in the x86-64 System V calling convention. (See https://stackoverflow.com/tags/red-zone/info for more about it.)

in general case (x86/x64 platform) - interrupt can be executed at any time, which overwrite memory bellow stack pointer (if it executed on current stack). because this, even temporary save something bellow stack pointer, not valid in kernel mode - interrupt will be use current kernel stack. but in user mode situation another - windows build interrupt table (IDT) suchwise that when interrupt raised - it will be always executed in kernel mode and in kernel stack. as result user mode stack (below stack pointer) will be not affected. and possible temporary use some stack space bellow it pointer, until you not do any functions calls. if exception will be (say by access invalid address) - also space bellow stack pointer will be overwritten - cpu exception of course begin executed in kernel mode and kernel stack, but than kernel execute callback in user space via ntdll.KiDispatchExecption already on current stack space. so in general this is valid in windows user mode (in current implementation), but you need good understand what you doing. however this is very rarely i think used
of course, how correct noted in comments that we can, in windows user mode, write below stack pointer - is just the current implementation behavior. this not documented or guaranteed.
but this is very fundamental - unlikely will be changed: interrupts always will be executed in privileged kernel mode only. and kernel mode will be use only kernel mode stack. the user mode context not trusted at all. what will be if user mode program set incorrect stack pointer ? say by
mov rsp,1 or mov esp,1 ? and just after this instruction interrupt will be raised. what will be if it begin executed on such invalid esp/rsp ? all operation system just crashed. exactly because this interrupt will be executed only on kernel stack. and not overwrite user stack space.
also need note that stack is limited space (even in user mode), access it bellow 1 page (4Kb)already error (need do stack probing page by page, for move guard page down).
and finally really there is no need usually access [ESP-4], EAX - in what problem decrement ESP first ? even if we need access stack space in loop huge count of time - decrement stack pointer need only once - 1 additional instruction (not in loop) nothing change in performance or code size.
so despite formal this is will be correct work in windows user mode, better (and not need) use this
of course formal documentation say:
Stack Usage
All memory beyond the current address of RSP is considered volatile
but this is for common case, including kernel mode too. i wrote about user mode and based on current implementation
possible in future windows and add "direct" apc or some "direct" signals - some code will be executed via callback just after thread enter to kernel (during usual hardware interrupt). after this all below esp will be undefined. but until this not exist. until this code will be work always(in current builds) correct.

In general (not specifically related to any OS); it's not safe to write below ESP if:
It's possible for the code to be interrupted and the interrupt handler will run at the same privilege level. Note: This is typically very unlikely for "user-space" code, but extremely likely for kernel code.
You call any other code (where either the call or the stack used by the called routine can trash the data you stored below ESP)
Something else depends on "normal" stack use. This can include signal handling, (language based) exception unwinding, debuggers, "stack smashing protector"
It's safe to write below ESP if it's not "not safe".
Note that for 64-bit code, writing below RSP is built into the x86-64 ABI ("red zone"); and is made safe by support for it in tool chains/compilers and everything else.

When a thread gets created, Windows reserves a contiguous region of virtual memory of a configurable size (the default is 1 MB) for the thread's stack. Initially, the stack looks like this (the stack grows downwards):
--------------
| committed |
--------------
| guard page |
--------------
| . |
| reserved |
| . |
| . |
| |
--------------
ESP will be pointing somewhere inside the committed page. The guard page is used to support automatic stack growth. The reserved pages region ensures that the requested stack size is available in virtual memory.
Consider the two instructions from the question:
MOV [ESP-4], EAX
FLD [ESP-4]
There are three possibilities:
The first instruction executes successfully. There is nothing that uses the user-mode stack that can execute between the two instructions. So the second instruction will use the correct value (#RbMm stated this in the comments under his answer and I agree).
The first instruction raises an exception and an exception handler does not return EXCEPTION_CONTINUE_EXECUTION. As long as the second instruction is immediately after the first one (it is not in the exception handler or placed after it), then the second instruction will not execute. So you're still safe. Execution continues from stack frame where the exception handler exists.
The first instruction raises an exception and an exception handler returns EXCEPTION_CONTINUE_EXECUTION. Execution continues from the same instruction that raised the exception (potentially with a context modified by the handler). In this particular example, the first will be re-executed to write a value below ESP. No problem. If the second instruction raised an exception or there are more than two instructions, then the exception might occur a place after a value is written below ESP. When the exception handler gets called, it may overwrite the value and then return EXCEPTION_CONTINUE_EXECUTION. But when execution resumes, the value written is assumed to still be there, but it's not anymore. This is a situation where it's not safe to write below ESP. This applies even if all of the instructions are placed consecutively. Thanks to #RaymondChen for pointing this out.
In general, if the two instructions are not placed back-to-back, if you are writing to locations beyond ESP, there is no guarantee that the written values won't get corrupted or overwritten. One case that I can think of where this might happen is structured exception handling (SEH). If a hardware-defined exception (such as divide by zero) occurs, the kernel exception handler will be invoked (KiUserExceptionDispatcher) in kernel-mode, which will invoke the user-mode side of the handler (RtlDispatchException). When switching from user-mode to kernel-mode and then back to user-mode, whatever value was in ESP will be saved and restored. However, the user-mode handler itself uses the user-mode stack and will iterate over a registered list of exception handlers, each of which uses the user-mode stack. These functions will modify ESP as required. This may lead to losing the values you've written beyond ESP. A similar situation occurs when using software-define exceptions (throw in VC++).
I think you can deal with this by registering your own exception handler before any other exception handlers (so that it is called first). When your handler gets called, you can save your data beyond ESP elsewhere. Later, during unwinding, you get the cleanup opportunity to restore your data to the same location (or any other location) on the stack.
You need also to similarly watch out for asynchronous procedure calls (APCs) and callbacks.

Several answers here mention APCs (Asynchronous Procedure Calls), saying that they can only be delivered when the process is in an "alertable state", and are not truly asynchronous with respect to normal execution the way Unix signal handlers are
Windows 10 version 1809 introduces Special User APCs, which can fire at any moment just like Unix signals. See this article for low level details.
The Special User APC is a mechanism that was added in RS5 (and exposed through NtQueueApcThreadEx), but lately (in an insider build) was exposed through a new syscall - NtQueueApcThreadEx2. If this type of APC is used, the thread is signaled in the middle of the execution to execute the special APC.

WinDbg not showing register values

Basically, this is the same question that was asked here.
When performing kernel debugging of a machine running Windows 7 or older, with WinDbg version 6.2 and up, the debugger doesn't show anything in the registers window. Pressing the Customize... button results in a message box that reads Registers are not yet known.
At the same time, issuing the r command results in perfectly valid register values being printed out.
What is the reason for this behaviour, and can it be fixed?

TL;DR: I wrote an extension DLL that fixes the bug. Available here.
The Problem
To understand the problem, we first need to understand that WinDbg is basically just a frontend to Microsoft's Windows Symbolic Debugger Engine, implemented inside dbgeng.dll. Other frontends include the command-line kd.exe (kernel debugger) and cdb.exe (user-mode debugger).
The engine implements everything we expect from a debugger: working with symbol files, read and writing memory and registers, setting breakpoitns, etc. The engine then exposes all of this functionality through COM-like interfaces (they implement IUnknown but are not registered components). This allows us, for instance, to write our own debugger (like this person did).
Armed with this knowledge, we can now make an educated guess as to how WinDbg obtains the values of the registers on the target machine.
The engine exposes the IDebugRegisters interface for manipulating registers. This interface declares the GetValues method for retrieving the values of multiple registers in one go. But how does WinDbg know how many registers are there? That why we have the GetNumberRegisters method.
So, to retrieve the values of all registers on the target, we'll have to do something like this:
Call IDebugRegisters::GetNumberRegisters to get the total number of registers.
Call IDebugRegisters::GetValues with the Count parameter set to the total number of registers, the Indices parameter set to NULL, and the Start parameter set to 0.
One tiny problem, though: the second call fails with E_INVALIDARG.
Ehm, excuse me? How can it fail? Especially puzzling is the documentation for this return value:
The value of the index of one of the registers is greater than the number of registers on the target machine.
But I just asked you how many registers there are, so how can that value be out of range? Okay, let's continue reading the docs anyway, maybe something will become clear:
If the return value is not S_OK, some of the registers still might have been read. If the target was not accessible, the return type is E_UNEXPECTED and Values is unchanged; otherwise, Values will contain partial results and the registers that could not be read will have type DEBUG_VALUE_INVALID.
(Emphasis mine.)
Aha! So maybe the engine just couldn't read one of the registers! But which one? Turns out that the engine chokes on the xcr0 register. From the Intel 64 and IA-32 Architectures Software Developer’s Manual:
Extended control register XCR0 contains a state-component bitmap that specifies the user state components that software has enabled the XSAVE feature set to manage. If the bit corresponding to a state component is clear in XCR0, instructions in the XSAVE feature set will not operate on that state component, regardless of the value of the instruction mask.
Okay, so the register controls the operation of the XSAVE instruction, which saves the state of the CPU's extended features (like XMM and AVX). According to the last comment on this page, this instruction requires some support from the operating system. Although the comment states that Windows 7 (that's what the VM I was testing on was running) does support this instruction, it seems that the issue at hand is related to the OS anyway, as when the target is Windows 8 everything works fine.
Really, it's unclear whether the bug is within the debugger engine, which reports more registers than it can retrieve values for, or within WinDbg, which refuses to show any values at all if the engine fails to produce all of them.
The Solution
We could, of course, bite the bullet and just use an older version of WinDbg for debugging older Windows versions. But where's the challenge in that?
Instead, I present to you a debugger extension that solves this problem. It does so by hooking (with the help of this library) the relevant debugger engine methods and returning S_OK if the only register that failed was xcr0. Otherwise, it propagates the failure. The extension supports runtime unload, so if you experience problems you can always disable the hooks.
That's it, have fun!

Why doesn't gcc handle volatile register?

I'm working on a timing loop for the AVR platform where I'm counting down a single byte inside an ISR. Since this task is a primary function of my program, I'd like to permanently reserve a processor register so that the ISR doesn't have to hit a memory barrier when its usual code path is decrement, compare to zero, and reti.
The avr-libc docs show how to bind a variable to a register, and I got that working without a problem. However, since this variable is shared between the main program (for starting the timer countdown) and the ISR (for actually counting and signaling completion), it should also be volatile to ensure that the compiler doesn't do anything too clever in optimizing it.
In this context (reserving a register across an entire monolithic build), the combination volatile register makes sense to me semantically, as "permanently store this variable in register rX, but don't optimize away checks because the register might be modified externally". GCC doesn't like this, however, and emits a warning that it might go ahead and optimize away the variable access anyway.
The bug history of this combination in GCC suggests that the compiler team is simply unwilling to consider the type of scenario I'm describing and thinks it's pointless to provide for it. Am I missing some fundamental reason why the volatile register approach is in itself a Bad Idea, or is this a case that makes semantic sense but that the compiler team just isn't interested in handling?

The semantics of volatile are not exactly as you describe "don't optimize away checks because the register might be modified externally" but are actually more narrow: Try to think of it as "don't cache the variable's value from RAM in a register".
Seen this way, it does not make any sense to declare a register as volatile because the register itself cannot be 'cached' and therefore cannot possibly be inconsistent with the variable's 'actual' value.
The fact that read accesses to volatile variables are usually not optimzed away is merely a side effect of the above semantics, but it's not guaranteed.
I think GCC should assume by default that a value in a register is 'like volatile' but I have not verified that it actually does so.
Edit:
I just did a small test and found:
avr-gcc 4.6.2 does not treat global register variables like volatiles with respect to read accesses, and
the Naggy extension for Atmel Studio detects an error in my code: "global register variables are not supported".
Assuming that global register variables are actually considered "unsupported" I am not surprised that gcc treats them just like local variables, with the known implications.
My test code looks like this:
uint8_t var;
volatile uint8_t volVar;
register uint8_t regVar asm("r13");
#define NOP asm volatile ("nop\r\n":::)
int main(void)
{
var = 1; // <-- kept
if ( var == 0 ) {
NOP; // <-- optimized away, var is not volatile
}
volVar = 1; // <-- kept
if ( volVar == 0 ) {
NOP; // <-- kept, volVar *is* volatile
}
regVar = 1; // <-- optimized away, regVar is treated like a local variable
if ( regVar == 0 ) {
NOP; // <-- optimized away consequently
}
for(;;){}
}

The reason you would use the volatile keyword on AVR variables is to, as you said, avoid the compiler optimizing access to the variable. The question now is, how does this happen though?
A variable has two places it can reside. Either in the general purpose register file or in some location in RAM. Consider the case where the variable resides in RAM. To access the latest value of the variable, the compiler loads the variable from RAM, using some form of the ld instruction, say lds r16, 0x000f. In this case, the variable was stored in RAM location 0x000f and the program made a copy of this variable in r16. Now, here is where things get interesting if interrupts are enabled. Say that after loading the variable, the following occurs inc r16, then an interrupt triggers and its corresponding ISR is run. Within the ISR, the variable is also used. There is a problem, however. The variable exists in two different versions, one in RAM and one in r16. Ideally, the compiler should use the version in r16, but this one is not guaranteed to exist, so it loads it from RAM instead, and now, the code does not operate as needed. Enter then the volatile keyword. The variable is still stored in RAM, however, the compiler must ensure that the variable is updated in RAM before anything else happens, thus the following assembly may be generated:
cli
lds r16, 0x000f
inc r16
sei
sts 0x000f, r16
First, interrupts are disabled. Then, the the variable is loaded into r16. The variable is increased, interrupts are enabled and then the variable is stored. It may appear confusing for the global interrupt flag to be enabled before the variable is stored back in RAM, but from the instruction set manual:
The instruction following SEI will be executed before any pending interrupts.
This means that the sts instruction will be executed before any interrupts trigger again, and that the interrupts are disabled for the minimum amount of time possible.
Consider now the case where the variable is bound to a register. Any operations done on the variable are done directly on the register. These operations, unlike operations done to a variable in RAM, can be considered atomic, as there is no read -> modify -> write cycle to speak of. If an interrupt triggers after the variable is updated, it will get the new value of the variable, since it will read the variable from the register it was bound to.
Also, since the variable is bound to a register, any test instructions will utilize the register itself and will not be optimized away on the grounds the compiler may have a "hunch" it is a static value, given that registers by their very nature are volatile.
Now, from experience, when using interrupts in AVR, I have sometimes noticed that the global volatile variables never hit RAM. The compiler kept them on the registers all the time, bypassing the read -> modify -> write cycle alltogether. This was due, however, to compiler optimizations, and it should not be relied on. Different compilers are free to generate different assembly for the same piece of code. You can generate a disassembly of your final file or any particular object files using the avr-objdump utility.
Cheers.

Reserving a register for one variable for a complete compilation unit is probably too restrictive for a compiler's code generator. That is, every C routine would have to NOT use that register.
How do you guarantee that other called routines do NOT use that register once your code goes out of scope? Even stuff like serial i/o routines would have to NOT use that reserved register. Compilers do NOT recompile their run-time libraries based on a data definition in a user program.
Is your application really so time sensitive that the extra delay for bringing memory up from L2 or L3 can be detected? If so, then your ISR might be running so frequently that the required memory location is always available (i.e. it doesn't get paged back down thru the cache) and thus does NOT hit a memory barrier (I assume by memory barrier you are referring to how memory in a cpu really operates, through caching, etc.). But for this to really be true the up would have to have a fairly large L1 cache and the ISR would have to run at a very high frequency.
Finally, sometimes an application's requirements make it necessary to code it in ASM in which case you can do exactly what you are requesting!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio