EXC_GUARD exception - macOS

An OS X app of mine crashes when I try to close a socket handle. It worked fine on all previous platforms, but it appears to crash on Yosemite.
The line where it crashes is
-(void)stopPacketReceiver
{
close(sd);
}
In Xcode it pauses all the threads and shows an EXC_GUARD exception. What kind of exception is this? Any ideas?
Thanks,
Ahmed
EDIT:
Here are the exception codes that I get:
Exception Type: EXC_GUARD
Exception Codes: 0x4000000100000000, 0x08fd4dbfade2dead

From a post in Apple's old developer forums by Quinn "The Eskimo" (Apple Developer Relations, Developer Technical Support, Core OS/Hardware), edited by me to remove things specific to that particular case:
EXC_GUARD is a change in 10.9 designed to help you detect file
descriptor problems. Specifically, the system can now flag specific
file descriptors as being guarded, after which normal operations on
those descriptors will trigger an EXC_GUARD crash (when it wants to
operate on these file descriptors, the system uses special 'guarded'
private APIs).
We added this to the system because we found a lot of apps were
crashing mysteriously after accidentally closing a file descriptor
that had been opened by a system library. For example, if an app
closes the file descriptor used to access the SQLite file backing a
Core Data store, Core Data would then crash mysteriously much later
on. The guard exception gets these problems noticed sooner, and thus
makes them easier to debug.
For an EXC_GUARD crash, the exception codes break down as follows:
o The first exception code … contains three bit
fields:
The top three bits … indicate [the type of guard].
The remainder of the top 32 bits … indicate [which operation was disallowed].
The bottom 32 bits indicate the descriptor in question ….
o The second exception code is a magic number associated with the
guard. …
Your code is closing a socket it doesn't own. Maybe sd contains the descriptor number for a descriptor that you once owned but is now a dangling reference, because you already closed your descriptor and that number has now been reused for somebody else's descriptor. Or maybe sd just has a junk value somehow.
We can decode some more information from the exception codes, but most likely you just have to trace exactly what you're doing with sd over its life.
Update:
From the edited question, I see that you've posted the exception codes. Using the constants from the kernel source, the type of guard is GUARD_TYPE_FD, the operation that was disallowed was kGUARD_EXC_CLOSE (i.e. close()), and the descriptor was 0 (STDIN_FILENO).
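For reference, here is a small C sketch of how those values fall out of the bit-field layout quoted above (top 3 bits = guard type, remainder of the top 32 bits = disallowed operation, bottom 32 bits = descriptor). The numeric meanings noted in the comments are my reading of the kernel sources, not a stable API.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t code = 0x4000000100000000ULL;  /* first exception code from the crash */

    unsigned type   = (unsigned)(code >> 61);                 /* 2 -> GUARD_TYPE_FD    */
    unsigned flavor = (unsigned)((code >> 32) & 0x1fffffff);  /* 1 -> kGUARD_EXC_CLOSE */
    unsigned fd     = (unsigned)(code & 0xffffffff);          /* 0 -> the descriptor   */

    printf("guard type %u, operation %u, descriptor %u\n", type, flavor, fd);
    return 0;
}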
So, in all probability, your stopPacketReceiver was called when the sd instance variable was uninitialized and had the default 0 value that all instance variables get when an object is first allocated.
The magic value is 0x08fd4dbfade2dead, which according to the original developer forums post, "indicates that the guard was applied by SQLite". That seems strange. Descriptor 0 would normally be open from process launch (perhaps referencing /dev/null). So, SQLite should not own that.
I suspect what has happened is that your code has actually closed descriptor 0 twice. The first time it was not guarded. It's legal to close STDIN_FILENO. Programs sometimes do it to reopen that descriptor to reference something else (such as /dev/null) if they don't want or need the original standard input. In your case, it would have been an accident but would not have raised an exception. Once it was closed, the descriptor would have been available to be reallocated to the next thing which opened a descriptor. I guess that was SQLite. At that time, SQLite put a guard on the descriptor. Then, your code tried to close it again and got the EXC_GUARD exception.
If I'm right, then it's somewhat random that your code got the exception (although it was always doing something bad). The fact that file descriptor 0 got assigned to a subsystem that applied a guard to it could be a race condition or it could be a change in order of operations between versions of the OS.
You need to be more careful to not close descriptors that you didn't open. You should initialize any instance variable meant to hold a file descriptor to -1, not 0. Likewise, if you close a descriptor that you did own, you should set the instance variable back to -1.
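A minimal sketch of that pattern, in plain C for brevity: -1 marks "no descriptor", the close is guarded, and the variable is reset after closing so a stale value can never be closed twice. Only the name sd comes from the question; the surrounding structure is illustrative.

#include <unistd.h>

static int sd = -1;                 /* -1, never 0, means "not open"  */

void stopPacketReceiver(void)
{
    if (sd >= 0) {                  /* only close a descriptor we own */
        close(sd);
        sd = -1;                    /* guard against a double close   */
    }
}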

Firstly, that sounds awesome - it sounds like it caught what would have been EXC_BAD_ACCESS (but this is a guess).
My guess is that sd isn't a valid descriptor. It's possible an API changed in Yosemite that's causing the place where you create the descriptor to return an invalid value (-1), or it's possible a change in the event timeline in Yosemite means it has already been cleaned up.
Debugging tip here: trace back sd all the way to its creation.

Related

ReadFile !=0, lpNumberOfBytesRead=0 but offset is not at the end of the file

We're struggling to understand the source of the following bug:
We have a call to ReadFile (synchronous) that returns a non-zero value (success) but sets the lpNumberOfBytesRead parameter to 0. In theory, that indicates that the offset is outside the file, but in practice that is not true. GetLastError returns ERROR_SUCCESS (0).
The files in question are all on a shared network drive (Windows Server 2016 + DFS, Windows 8-10 clients, SMBv3). The files are used in shared mode. In-file locking (LockFileEx) is used to handle concurrent file access (we're just locking the first byte of the file before any read/write).
The handle used is not fresh: it isn't created locally in the functions but retrieved from an application-wide "file handle cache manager". This means that it could have been created (unused) some time ago. However, everything we did indicates the handle is valid at the moment of the call: GetLastError returns 0, and GetFileInformationByHandle returns TRUE and a valid structure.
The error is logged to a file that is located on the same file server as the problematic files.
We have done a lot of logging and testing around this issue. Here are the additional facts we gathered:
Most (but not all) of the problematic reads happen at the very tail of the file: we're reading the last record. However, the read is still within the file and GetLastError does not return ERROR_HANDLE_EOF. If the program is restarted, the same read with the same parameters works.
The issue is not temporary: repeated calls yield the same result even if we let the program loop indefinitely. Restarting the program, however, does not automatically lead to the issue being raised again immediately.
We are sure the offset is inside the file: we check the actual file pointer location after the failure and compare it with the expected value as well as the size of the file as reported by the OS: everything matches across multiple retries.
The issue only shows up randomly: there is no real pattern to the program working as expected and the program failing. It occurs 2-4 times a day in our office (about 20 people).
The issue does not only occur on our network: we've seen the symptoms and the log entries in multiple locations, although we have no clear view of the OSes involved in those cases.
We just deployed a new version of the program that will attempt to re-open the file in case of failure, but that is a workaround, not a fix: we need to understand what is happening here, and I must admit I have found no rational explanation for it.
Any suggestion about what could be the cause of this error or what other steps could be taken to find out will be welcome.
Edit 2
(In the light of keeping this clear, I removed the code: the new evidence gives a better explanation of the issue)
We managed to get a procmon trace while the problem was happening and we got the following sequence of events that we simply cannot explain:
Text version:
"Time of Day","Process Name","PID","Operation","Path","Result","Detail","Command Line"
"9:43:24.8243833 AM","wacprep.exe","33664","ReadFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","END OF FILE","Offset: 7'091'712, Length: 384, Priority: Normal","O:\WinEUR\wacprep.exe /company:GIT18"
"9:43:24.8244011 AM","wacprep.exe","33664","QueryStandardInformationFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","SUCCESS","AllocationSize: 7'094'272, EndOfFile: 7'092'864, NumberOfLinks: 1, DeletePending: False, Directory: False","O:\WinEUR\wacprep.exe /company:GIT18"
(there are thousands of these logged since the application is in an infinite loop.)
As we understand this, the ReadFile call should succeed: the offset is well within the boundary of the file. Yet, it fails. ProcMon reports END OF FILE, although I suspect that's just because ReadFile returned != 0 and reported 0 bytes read.
While the loop was running, we managed to unblock it by increasing the size of the file from a different machine:
"Time of Day","Process Name","PID","Operation","Path","Result","Detail","Command Line"
"9:46:58.6204637 AM","wacprep.exe","33664","ReadFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","END OF FILE","Offset: 7'091'712, Length: 384, Priority: Normal","O:\WinEUR\wacprep.exe /company:GIT18"
"9:46:58.6204810 AM","wacprep.exe","33664","QueryStandardInformationFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","SUCCESS","AllocationSize: 7'094'272, EndOfFile: 7'092'864, NumberOfLinks: 1, DeletePending: False, Directory: False","O:\WinEUR\wacprep.exe /company:GIT18"
"9:46:58.7270730 AM","wacprep.exe","33664","ReadFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","SUCCESS","Offset: 7'091'712, Length: 384, Priority: Normal","O:\WinEUR\wacprep.exe /company:GIT18"

How an assembler instruction could not read the memory it is placed at

Using some software in Windows XP that works as a Windows service and doing a restart from the logon screen I see an infamous error message
The instruction at "00x..." referenced memory at "00x...". The memory
could not be read.
I reported the problem to the developers, but looking at the message once again, I noticed that the addresses are the same. So
The instruction at "00xdf3251" referenced memory at "00xdf3251". The memory
could not be read.
Whether or not this is a bug in the program, what state of the memory, access rights, or something else prevents an instruction from reading the memory it is placed at? Is it something specific to services?
I would guess there was an attempt to execute an instruction at the address 0xdf3251 and that location wasn't backed up by a readable and executable page of memory (perhaps, completely unmapped).
If that's the case, the exception (page fault, in fact) originates from that instruction and the exception handler has its address on the stack (the location to return to, in case the exception can be somehow resolved and the faulting instruction restarted when the handler returns). And that's the first address you're seeing.
The CR2 register that the page fault handler reads, which is the second address you're seeing, also has the same address because it has to contain the address of an inaccessible memory location irrespective of whether the page fault has been caused by:
complete absence of mapping (there's no page mapped at all)
lack of write permission (the page is read-only)
lack of execute permission (the page has the no-execute bit set) OR
lack of kernel privilege (the page is marked as accessible only in the kernel)
and irrespective of whether it was during a data access or while fetching an instruction (the latter being our case).
That's how you can get the instruction and memory access addresses equal.
Most likely the code had a bug resulting in a memory corruption and some pointer (or a return address on the stack) was overwritten with a bogus value pointing to an inaccessible memory location. And then one way or the other the CPU was directed to continue execution there (most likely using one of these instructions: jmp, call, ret). There's also a chance of having a race condition somewhere.
This kind of crash is most typically caused by stack corruption. A very common kind is a stack buffer overflow. Write too much data in an array stored on the stack and it overwrites a function's return address with the data. When the function then returns, it jumps to the bogus return address and the program falls over because there's no code at the address. They'll have a hard time fixing the bug since there's no easy way to find out where the corruption occurred.
This is a rather infamous kind of bug and a major attack vector for malware, since it can commandeer a program into jumping to arbitrary code supplied as data. You ought to have a sit-down with these devs and point this out; it is a major security risk. The cure is easy enough: they should update their tools. Countermeasures against buffer overflows are built into compilers these days.
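For illustration only, a deliberately broken C sketch of the mechanism described above (a stack buffer overflow clobbering the saved return address); everything in it is made up.

#include <string.h>

static void parse_input(const char *input)
{
    char buf[16];            /* small fixed-size stack buffer            */
    strcpy(buf, input);      /* no bounds check: extra bytes overwrite   */
}                            /* the saved return address; the final ret  */
                             /* then jumps to a bogus, unmapped address  */

int main(void)
{
    /* 40 bytes into a 16-byte buffer: a classic stack smash. Modern
       compilers typically abort with a stack-protector error instead of
       producing the "memory could not be read" crash described above. */
    parse_input("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
    return 0;
}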

SEH setup for fibers with exception chain validation (SEHOP) active

I'm working on a native fiber/coroutine implementation – fairly standard, for each fiber, a separate stack is allocated, and to switch contexts, registers are pushed onto the source context stack and popped from the target stack. It works well, but now I hit a little problem:
I need SEH to work within a fiber (it's okay if the program terminates or strange things start to happen when an exception goes unhandled until the fiber's last stack frame, it won't). Just saving/restoring FS:[0] (along with FS:[4] and FS:[8], obviously) during the context switch and initially setting FS:[0] for newly allocated fibers to 0xFFFFFFFF (so that the exception handler set after the context switch will be the root of the chain) almost works.
To be precise, it works on all non-server Windows OSes I tested – the problem is that Windows Server 2008 and 2008 R2 have the exception chain validation (SEHOP, SEH overwrite protection) feature enabled by default, which makes RaiseException check if the original handler (somewhere in ntdll.dll) is still the root of the chain, and immediately terminates the program as if no handlers were installed otherwise.
Thus, I'm facing the problem of constructing an appropriate root frame on the stack to keep the validation code happy. Are there any (hidden?) API functions I can call to do that, or do I have to figure out what is needed to keep RtlDispatchException and friends happy and construct the appropriate _EXCEPTION_REGISTRATION entry myself? I can't just reuse the Windows-supplied one from the creating thread because it would be at the wrong address (the SEH implementation also checks if the handler address is in the boundaries given by FS:[4] and FS:[8], and possibly also if the address order is consistent).
Oh, and I'd strongly prefer not to resort to the CreateFiber WinAPI family of functions.
The approach I mentioned in the comments – generating a fake EXCEPTION_REGISTRATION entry pointing to ntdll!FinalExceptionHandler – does indeed seem to work in practice. At least, that's what we have in the D runtime now, and so far there have been no reports of problems:
https://github.com/D-Programming-Language/druntime/blob/c39de42dd11311844c0ef90953aa65f333ea55ab/src/core/thread.d#L4027
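A rough x86-only C sketch of the same idea, assuming the fiber stacks are allocated manually. The public SDK only forward-declares the SEH registration record, so a local equivalent is defined here; the function names and the stack layout are hypothetical. The root handler is found by walking the creating thread's own chain down to the terminator ntdll installed, rather than via any exported symbol.

#include <windows.h>

typedef struct SEH_FRAME {
    struct SEH_FRAME *Next;     /* next registration, (SEH_FRAME*)-1 at the end */
    void             *Handler;  /* exception handler routine                    */
} SEH_FRAME;

/* Return the handler ntdll installed at the root of this thread's chain. */
static void *find_final_handler(void)
{
    NT_TIB    *tib = (NT_TIB *)NtCurrentTeb();
    SEH_FRAME *rec = (SEH_FRAME *)tib->ExceptionList;
    while (rec->Next != (SEH_FRAME *)-1)
        rec = rec->Next;
    return rec->Handler;
}

/* Place a fake root registration at the base (highest address) of a newly
 * allocated fiber stack, so SEHOP's chain validation finds the handler it
 * expects. The returned pointer is what the fiber's saved FS:[0] should
 * initially contain instead of 0xFFFFFFFF. */
static SEH_FRAME *install_root_frame(char *stack_base)
{
    SEH_FRAME *root = (SEH_FRAME *)(stack_base - sizeof *root);
    root->Next    = (SEH_FRAME *)-1;        /* end-of-chain marker */
    root->Handler = find_final_handler();
    return root;
}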

Windows: TCP/IP: force close connection: avoid memleaks in kernel/user-level

A question for Windows network programming experts.
When I use pseudo-code like this:
reconnect:
    s = socket(...);
    // more code...
read_reply:
    recv(...);
    // merge received data
    if (high_level_protocol_error) {
        // whoops, there was a deviation from protocol, like overflow
        // need to reset connection and discard data right now!
        closesocket(s);
        goto reconnect;
    }
Does the kernel un-associate and free all data "physically" received from the NIC (since it must already be there, in kernel memory, waiting for user level to read it with recv()) when I call closesocket()? Logically it should, since the data is no longer associated with any internal object, right?
I don't really want to waste an unknown amount of time on a clean shutdown like "call recv() until it returns an error". That does not make sense: what if it never returns an error because the server continues to send data forever and never closes the connection, even though that is bad behaviour?
I'm wondering about this since I don't want my application to cause memory leaks anywhere. Is this way of forcibly resetting a connection that is still expected to deliver an unknown amount of data correct?
// optional addition to question: if this method is considered correct for Windows, can it be considered correct (with closesocket() changed to close()) for a UNIX-compliant OS?
Kernel drivers in Windows (or any OS really), including tcpip.sys, are supposed to avoid memory leaks in all circumstances, regardless of what you do in user mode. I would think that the developers have charted the possible states, including error states, to make sure that resources aren't leaked. As for user mode, I'm not exactly sure but I wouldn't think that resources are leaked in your process either.
Sockets are just file objects in Windows. When you close the last handle to a file, the I/O manager sends an IRP_MJ_CLEANUP request to the driver that owns the file to clean up resources associated with it. The receive buffers associated with the socket would be freed along with the file object.
It does say in the closesocket documentation that pending operations are canceled but that async operations may complete after the function returns. It sounds like closing the socket while in use is a supported scenario and wouldn't lead to a memory leak.
There will be no leak and you are under no obligation to read the stream to EOS before closing. If the sender is still sending after you close it will eventually get a 'connection reset'.
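As an added illustration (not something the answers above prescribe): if you want the peer to see the reset immediately, rather than after any unacknowledged data drains, Winsock lets you request an abortive close with SO_LINGER before closesocket().

#include <winsock2.h>

/* Abortive close: the connection is reset at once and any data still
 * queued in the kernel for this socket is discarded and freed. */
static void abort_connection(SOCKET s)
{
    struct linger lg;
    lg.l_onoff  = 1;    /* enable linger                     */
    lg.l_linger = 0;    /* zero timeout => send RST on close */
    setsockopt(s, SOL_SOCKET, SO_LINGER, (const char *)&lg, (int)sizeof lg);
    closesocket(s);
}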

What is 0x%08lx?

I've been getting a lot of blue screens on my XP box at work recently. So many, in fact, that I downloaded Debugging Tools for Windows (x86) and have been analyzing the crash dumps. So many, in fact, that I've changed the dumps to mini only, or else I would probably end up tanking half a work day each week just waiting for the blue screen to finish recording the detailed crash log.
Almost without exception, every dump tells me that the cause of the blue screen is some kind of memory misallocation or misreference, and that the memory at 0x%08lx referenced 0x%08lx and could not be %s.
Out of idle curiosity I put "0x%08lx" into Google and found that quite a few crash dumps include this bizarre message. Am I to take it that 0x%08lx is a placeholder for something that should be meaningful? "%s", which is part of the concluding sentence "The memory could not be %s", definitely looks like it's missing a variable or something.
Does anyone know the provenance of this message? Is it actually supposed to be useful and what is it supposed to look like?
It's not a major thing; I have always worked around it. It's just strange that so many people see this in so many crash dumps and nobody ever says: "Oh, the crash dump didn't complete that message properly; it's supposed to read..."
I'm just curious as to whether anyone knows the purpose of this strange error message artefact.
0x%08lx and %s are almost certainly format specifiers for the C function sprintf. But it looks like the driver developers did as good a job on their error handling code as they did on the critical code, because you should never see these specifiers in the GUI -- they should be replaced with meaningful values.
0x%08lx should turn into something like "0xE001D4AB", a hexadecimal 32-bit pointer value.
%s should be replaced by another string, in this case a description. Something like
the memory at 0xE001D4AB referenced
0xE005123F and could not be read.
Note that I made up the values. Basically, a kernel mode access violation occurred. Hopefully in the mini dumps you can see which module caused it and uninstall / update / whatever it.
I believe it is just the placeholder for the memory address. 0x is a string prefix that tells the user the value is hexadecimal, while %08lx is the actual placeholder for a long int (l) converted to hexadecimal (x) and zero-padded to 8 digits (08).
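A tiny C illustration of how the template should have been filled in before being displayed; the addresses are made up, as above.

#include <stdio.h>

int main(void)
{
    unsigned long instr = 0xE001D4ABUL;   /* faulting instruction address */
    unsigned long addr  = 0xE005123FUL;   /* inaccessible memory address  */
    const char   *op    = "read";         /* or "written"                 */

    /* The bug-check template with its placeholders actually substituted: */
    printf("The instruction at 0x%08lx referenced memory at 0x%08lx. "
           "The memory could not be %s.\n", instr, addr, op);
    return 0;
}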
