According to the LockFileEx() documentation, the file offset is specified in lpOverlapped->Offset/OffsetHigh. But while debugging winword.exe to analyze its file system behavior, I saw it call LockFileEx() on a 122-byte file with Offset=0xfffffffb and OffsetHigh=0xffffffff, i.e. a combined 64-bit offset of 0xFFFFFFFFFFFFFFFB, and the call completed successfully. Apparently this is not a valid offset; what does this mean?
From MSDN:
Locking a region that goes beyond the current end-of-file position is not an error.
They are most likely using the lock as a flag or for synchronization: a region locked far beyond any real data never collides with ordinary reads and writes, so it can carry out-of-band state between processes that have the file open.
We're struggling to understand the source of the following bug:
We have a call to "ReadFile" (synchronous) that returns a non-zero value (success) but sets the lpNumberOfBytesRead parameter to 0. In theory that indicates the offset is beyond the end of the file, but in practice it is not. GetLastError returns ERROR_SUCCESS (0).
The files in question are all on a shared network drive (Windows Server 2016 + DFS, Windows 8-10 clients, SMBv3). The files are used in shared mode. In-file locking (LockFileEx) is used to handle concurrent file access (we simply lock the first byte of the file before any read/write).
The handle used is not fresh: it isn't created locally in the functions but retrieved from an application-wide "file handle cache manager". This means it could have been created (and left unused) some time ago. However, everything we checked indicates the handle is valid at the moment of the call: GetLastError returns 0, and GetFileInformationByHandle returns TRUE and a valid structure.
The error is logged to a file that is located on the same file server as the problematic files.
We have done a lot of logging and testing around this issue. Here are the additional facts we gathered:
Most (but not all) of the problematic reads happen at the very tail of the file: we are reading the last record. The read is still within the file, though: GetLastError does not return ERROR_HANDLE_EOF. If the program is restarted, the same read with the same parameters works.
The issue is not transient: repeated calls yield the same result even if we let the program loop indefinitely. Restarting the program, however, does not automatically lead to the issue reappearing immediately.
We are sure the offset is inside the file: we check the actual file pointer location after the failure and compare it with the expected value as well as the file size reported by the OS; everything matches across multiple retries.
The issue only shows up randomly: there is no real pattern to when the program works and when it fails. It occurs 2-4 times a day in our office (about 20 people).
The issue is not confined to our network: we have seen the symptoms and the log entries at multiple sites, although we have no clear view of the OS versions involved in those cases.
We just deployed a new version of the program that attempts to re-open the file in case of failure, but that is a workaround, not a fix: we need to understand what is happening here, and I must admit I have found no rational explanation for it.
Any suggestion about what could be the cause of this error or what other steps could be taken to find out will be welcome.
Edit 2
(To keep this clear, I removed the code: the new evidence gives a better explanation of the issue.)
We managed to get a procmon trace while the problem was happening and we got the following sequence of events that we simply cannot explain:
Text version:
"Time of Day","Process Name","PID","Operation","Path","Result","Detail","Command Line"
"9:43:24.8243833 AM","wacprep.exe","33664","ReadFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","END OF FILE","Offset: 7'091'712, Length: 384, Priority: Normal","O:\WinEUR\wacprep.exe /company:GIT18"
"9:43:24.8244011 AM","wacprep.exe","33664","QueryStandardInformationFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","SUCCESS","AllocationSize: 7'094'272, EndOfFile: 7'092'864, NumberOfLinks: 1, DeletePending: False, Directory: False","O:\WinEUR\wacprep.exe /company:GIT18"
(there are thousands of these logged since the application is in an infinite loop.)
As we understand this, the ReadFile call should succeed: the offset is well within the bounds of the file (7'091'712 + 384 = 7'092'096, which is less than the reported EndOfFile of 7'092'864). Yet it fails. ProcMon reports END OF FILE, although I suspect that is just because ReadFile returned nonzero while reporting 0 bytes read.
While the loop was running, we managed to unblock it by increasing the size of the file from a different machine:
"Time of Day","Process Name","PID","Operation","Path","Result","Detail","Command Line"
"9:46:58.6204637 AM","wacprep.exe","33664","ReadFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","END OF FILE","Offset: 7'091'712, Length: 384, Priority: Normal","O:\WinEUR\wacprep.exe /company:GIT18"
"9:46:58.6204810 AM","wacprep.exe","33664","QueryStandardInformationFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","SUCCESS","AllocationSize: 7'094'272, EndOfFile: 7'092'864, NumberOfLinks: 1, DeletePending: False, Directory: False","O:\WinEUR\wacprep.exe /company:GIT18"
"9:46:58.7270730 AM","wacprep.exe","33664","ReadFile","\\office.git.ch\dfs\Data\EURDATA\GIT18\JNLS.DTA","SUCCESS","Offset: 7'091'712, Length: 384, Priority: Normal","O:\WinEUR\wacprep.exe /company:GIT18"
I've just found out by accident that calling GetModuleHandle("ntdll.dll") works without a previous call to LoadLibrary("ntdll.dll").
This means ntdll.dll is already loaded in my process.
Is it safe to assume that ntdll.dll is always loaded in Win32 applications, so that a call to LoadLibrary is not necessary?
From MSDN on LoadLibrary() (emphasis mine):
The system maintains a per-process reference count on all loaded
modules. Calling LoadLibrary increments the reference count. Calling
the FreeLibrary or FreeLibraryAndExitThread function decrements the
reference count. The system unloads a module when its reference count
reaches zero or when the process terminates (regardless of the
reference count).
In other words, continue to call LoadLibrary() and make sure you get your handle to ntdll.dll, to be safe -- but since the module is already loaded, the system will almost certainly just bump its reference count.
As for "is it really always loaded?", see Windows Internals on the Image Loader (the short answer is yes, ntdll.dll is part of the loader itself and is always present).
The relevant paragraph is:
The image loader lives in the user-mode system DLL Ntdll.dll and not in the kernel library. Therefore, it behaves just like standard code that is part of a DLL, and it is subject to the same restrictions in terms of memory access and security rights. What makes this code special is the guarantee that it will always be present in the running process (Ntdll.dll is always loaded) and that it is the first piece of code to run in user mode as part of a new application. (When the system builds the initial context, the program counter, or instruction pointer, is set to an initialization function inside Ntdll.dll.)
An OS X app crashes when I try to close a socket handle. It worked fine on all previous platforms, but it appears to crash on Yosemite.
The line where it crashes is:
-(void)stopPacketReceiver
{
close(sd);
}
In Xcode it pauses all the threads and shows an EXC_GUARD exception. What kind of exception is this? Any ideas?
Thanks,
Ahmed
EDIT:
Here are the exception codes that I get:
Exception Type: EXC_GUARD
Exception Codes: 0x4000000100000000, 0x08fd4dbfade2dead
From a post in Apple's old developer forums by Quinn "The Eskimo" (Apple Developer Relations, Developer Technical Support, Core OS/Hardware), edited by me to remove things that were specific to that particular case:
EXC_GUARD is a change in 10.9 designed to help you detect file
descriptor problems. Specifically, the system can now flag specific
file descriptors as being guarded, after which normal operations on
those descriptors will trigger an EXC_GUARD crash (when it wants to
operate on these file descriptors, the system uses special 'guarded'
private APIs).
We added this to the system because we found a lot of apps were
crashing mysteriously after accidentally closing a file descriptor
that had been opened by a system library. For example, if an app
closes the file descriptor used to access the SQLite file backing a
Core Data store, Core Data would then crash mysteriously much later
on. The guard exception gets these problems noticed sooner, and thus
makes them easier to debug.
For an EXC_GUARD crash, the exception codes break down as follows:
o The first exception code … contains three bit
fields:
The top three bits … indicate [the type of guard].
The remainder of the top 32 bits … indicate [which operation was disallowed].
The bottom 32 bits indicate the descriptor in question ….
o The second exception code is a magic number associated with the
guard. …
Your code is closing a socket it doesn't own. Maybe sd contains the descriptor number for a descriptor that you once owned but is now a dangling reference, because you already closed your descriptor and that number has now been reused for somebody else's descriptor. Or maybe sd just has a junk value somehow.
We can decode some more information from the exception codes, but most likely you just have to trace exactly what you're doing with sd over its life.
Update:
From the edited question, I see that you've posted the exception codes. Using the constants from the kernel source, the type of guard is GUARD_TYPE_FD, the operation that was disallowed was kGUARD_EXC_CLOSE (i.e. close()), and the descriptor was 0 (STDIN_FILENO).
So, in all probability, your stopPacketReceiver was called when the sd instance variable was uninitialized and had the default 0 value that all instance variables get when an object is first allocated.
The magic value is 0x08fd4dbfade2dead, which according to the original developer forums post, "indicates that the guard was applied by SQLite". That seems strange. Descriptor 0 would normally be open from process launch (perhaps referencing /dev/null). So, SQLite should not own that.
I suspect what has happened is that your code has actually closed descriptor 0 twice. The first time it was not guarded; it's legal to close STDIN_FILENO. Programs sometimes do that in order to reopen the descriptor to reference something else (such as /dev/null) when they don't want or need the original standard input. In your case it would have been an accident, but it would not have raised an exception. Once closed, the descriptor would have been available for reallocation to the next thing that opened a descriptor. I guess that was SQLite. At that point, SQLite put a guard on the descriptor. Then your code tried to close it again and got the EXC_GUARD exception.
If I'm right, then it's somewhat random that your code got the exception (although it was always doing something bad). The fact that file descriptor 0 got assigned to a subsystem that applied a guard to it could be a race condition or it could be a change in order of operations between versions of the OS.
You need to be more careful to not close descriptors that you didn't open. You should initialize any instance variable meant to hold a file descriptor to -1, not 0. Likewise, if you close a descriptor that you did own, you should set the instance variable back to -1.
Firstly, that sounds awesome - it sounds like it caught what would have been EXC_BAD_ACCESS (but this is a guess).
My guess is that sd isn't a valid descriptor. It's possible an API changed in Yosemite so that the place where you create the descriptor now returns an invalid value (-1), or it's possible a change in the event timeline in Yosemite causes it to have already been cleaned up.
Debugging tip here: trace back sd all the way to its creation.
What is the purpose of this flag (from the OS side)?
Which functions use this flag besides IsDebuggerPresent?
Thanks a lot
It's effectively the same, but reading the PEB doesn't require a trip through kernel mode.
More explicitly, the IsDebuggerPresent API is documented and stable; the PEB structure is not, and could, conceivably, change across versions.
Also, the IsDebuggerPresent API (or flag) only checks for user-mode debuggers; kernel debuggers aren't detected via this function.
Why put it in the PEB? It saves some time, which was more important in early versions of NT. (There are a bunch of user-mode functions that check this flag before doing some runtime validation, and will break to the debugger if set.)
If you change the PEB field to 0, then IsDebuggerPresent will also return 0, although I believe that CheckRemoteDebuggerPresent will not.
As you have found, the IsDebuggerPresent API reads this flag from the PEB. As far as I know the PEB structure is not an official API, but IsDebuggerPresent is, so you should stick to that layer.
The uses of this method are quite limited if you are after copy protection to prevent debugging of your app. As you have found, it is only a flag in your own process space: if somebody debugs your application, all he needs to do is zero out the flag in the PEB and let your app run.
You can raise the bar by using CheckRemoteDebuggerPresent, passing in your own process handle. That method goes into the kernel and checks for the existence of a special debug structure that is associated with your process when it is being debugged. A user-mode process cannot fake this one, but you know there are always ways around it, such as simply removing your check...
I'm currently writing a simple "multicaster" module.
Only one process can open a proc filesystem file for writing, and the rest can open it for reading.
To do so I use the inode_operations .permission callback: I check the requested access, and when I detect that someone is opening the file for writing I set a flag ON.
I need a way to detect when the process that opened the file for writing closes it, so I can set the flag OFF and someone else can open it for writing.
Currently, when someone opens the file for writing, I save that process's current->pid, and when the .close callback is called I check whether the closing process is the one I saved earlier.
Is there a better way to do this, without saving the pid? Perhaps by checking which files the current process has opened, and their access modes...
Thanks!
No, it's not safe. Consider a few scenarios:
Process A opens the file for writing, and then fork()s, creating process B. Now both A and B have the file open for writing. When Process A closes it, you set the flag to 0 but process B still has it open for writing.
Process A has multiple threads. Thread X opens the file for writing, but Thread Y closes it. Now the flag is stuck at 1. (Remember that ->pid in kernel space is actually the userspace thread ID).
Rather than doing things at the inode level, you should be doing things in the .open and .release methods of your file_operations struct.
Your inode's private data should contain a struct file *current_writer;, initialised to NULL. In the file_operations.open method, if it's being opened for write then check the current_writer; if it's NULL, set it to the struct file * being opened, otherwise fail the open with EPERM. In the file_operations.release method, check if the struct file * being released is equal to the inode's current_writer - if so, set current_writer back to NULL.
PS: Bandan is also correct that you need locking, but using the inode's existing i_mutex should suffice to protect current_writer.
I hope I understood your question correctly: when someone wants to write to your proc file, you set a variable called flag to 1 and also save current->pid in a global variable. Then, when any close() entry point is called, you compare the current->pid of that close() with your saved value, and if it matches you turn flag off. Right?
Consider this situation: process A wants to write to your proc resource, so you run the permission callback. You see that flag is 0, so A may proceed, and you are about to set it to 1. But at that moment, before the flag is written, the scheduler decides process A has used up its time slice and switches to a different process (flag is still 0!). Some time later, process B comes along wanting to write to your proc resource too, sees that flag is 0, sets it to 1, and starts writing to the file. Unfortunately, process A now gets scheduled to run again; since it had already seen flag as 0 before being preempted, it also sets flag to 1 and starts writing. End result: the data in your proc resource is corrupted.
You should use a proper locking mechanism provided by the kernel for this kind of check-then-set sequence. Based on your requirements, I think RCU fits best: have a look at the RCU locking mechanism documentation.