Implicit interprocess shared memory on Windows? - winapi

What I would like to do is mark a specific area of memory as being automatically shared between processes of the same image/binary, similar to __declspec(allocate)... and __pragma(section...).
I know that I can use named pipes or equivalent, but for this purpose I would like to avoid system calls or additional overhead. I'm just unsure if there is any way to inform the NT kernel to map a specific range of pages automatically for each process of an image. I haven't found anything on MSDN, though MSDN doesn't cover undocumented functionality (by definition), which I am fine with using.
I also don't see any specific PE section names/flags which would indicate such, though it is possible that I am missing something.
Edit: I've noticed that there is actually a PE section flag, IMAGE_SCN_MEM_SHARED, though I need to investigate how it works.

You can use #pragma comment(linker, "/SECTION:.shared,RWS") and #pragma data_seg(".shared") to declare things in a shared memory segment (only works in Visual Studio). See Sharing Variables Between Win32 Executables.
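For example, a minimal sketch (MSVC-specific; the counter variable is illustrative):

    // Variables placed in the ".shared" section are shared by every process
    // running this same binary. They must be explicitly initialized;
    // uninitialized globals would land in .bss instead and not be shared.
    #pragma data_seg(".shared")
    volatile long g_counter = 0;
    #pragma data_seg()
    // Mark the section Read/Write/Shared:
    #pragma comment(linker, "/SECTION:.shared,RWS")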
Otherwise, if that is not an option for you, the only other way to share memory between processes is to use a Memory Mapped File via CreateFileMapping() and MapViewOfFile/Ex(). See Creating Named Shared Memory.
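A minimal sketch of that route (the mapping name and size are illustrative; error handling omitted):

    #include <windows.h>

    // Page-file-backed named shared memory: every process that opens the
    // same mapping name sees the same bytes.
    int main()
    {
        HANDLE map = CreateFileMappingW(
            INVALID_HANDLE_VALUE,       // backed by the page file, not a file
            nullptr, PAGE_READWRITE,
            0, 4096,                    // size (high/low DWORDs)
            L"Local\\MyShared");        // same name => same memory
        void* view = MapViewOfFile(map, FILE_MAP_ALL_ACCESS, 0, 0, 0);
        // ... read/write through 'view'; cooperating processes see the
        // same contents ...
        UnmapViewOfFile(view);
        CloseHandle(map);
        return 0;
    }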


How did Wine64 manage to handle macOS?

It has been a major obstacle for a decade. It was reported as impossible. Forum discussions referred to issues with setting and restoring GS. The Wine HQ FAQ still refers to an ABI incompatibility page, which is not a live wiki page but a news-archive link.
Wine 2.0 announced macOS 64-bit support. But how? Isn't this something that all macOS hackers should know? Maybe there is some elegant (or dirty) trick that is interesting in its own right for any x86-64 hacker.
The primary obstacle is a conflict over the GS segment base address (GS.base) maintained by the CPU under the control of the OS.
On 64-bit Windows, GS.base is used to hold the address of the Thread Environment Block (TEB) structure for each thread. Windows apps expect to access the TEB using %gs-relative addresses. This is hard-coded into the app code rather than being behind an API function.
On macOS, GS.base is used to hold the base of the thread-local storage area of the thread's struct _pthread, an internal implementation detail of the Pthreads implementation. It's less common for Mac apps to have hard-coded %gs-relative accesses baked into them, but some do and so do the system libraries.
On Linux, GS.base is available for 64-bit apps to use for their own purposes. So, there, Wine simply sets it using the OS-provided mechanism. Wine can't do that on macOS. Not only does the OS not provide any mechanism to do it but, if Wine could, it would break the system libraries. (It would also pose potential problems for the kernel on context switches, and/or the kernel might fail to restore any value Wine might have set.)
The solution we figured out is only a partial solution. The most commonly accessed fields of the TEB structure are the "self" field (%gs:0x30) and a field for the thread-local storage implementation (%gs:0x58). Often, if apps need to access other fields, they first read the self field and then reference off of that.
On macOS, %gs:0x30 and %gs:0x58 correspond to particular slots of the thread-local storage area. They are in a part that's reserved to Apple (rather than, say, application uses). We found that one of those slots was unused. The other was used for the ttyname() function in the C library. As it happens, Wine never calls that function and there's little reason to expect any of the system libraries that it uses to do so, either.
So, Wine simply pokes the appropriate values at those %gs-relative locations. Therefore, when 64-bit Windows app code reads them, it gets what it needs. The actual TEB that Wine has allocated is located elsewhere (in heap-allocated memory), but apps find the address of the TEB in the place they expect to be the TEB self field, so they find it that way.
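To make the access pattern concrete, here's a sketch of what compiled app code effectively does when reading the TEB self field (illustrative, not Wine's actual code):

    #include <windows.h>
    #include <winternl.h>
    #include <intrin.h>

    // 64-bit Windows app code reads the TEB "self" field at %gs:0x30
    // directly; no API call is involved, which is why Wine must make this
    // exact location hold the right value.
    TEB* current_teb()
    {
        return reinterpret_cast<TEB*>(__readgsqword(0x30));
    }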
Apple has since graciously reserved both of those slots permanently for uses like Wine's. ttyname() now uses a different slot.
That said, as mentioned above, this solution is only partial. Some apps access other fields of the TEB directly using %gs-relative addresses at offsets other than 0x30 or 0x58. When they do so, they get junk values and/or overwrite values used by other parts of the system. So, Wine's support for 64-bit Windows apps is not complete on macOS. Some such apps will crash or otherwise misbehave. Luckily, it happens infrequently enough that it's not much of a problem in practice.
For reference, here are the commits that implement this solution:
http://source.winehq.org/git/wine.git/?a=commit;h=7501942008f91a9a137fe598ce5ce7cb47de5522
http://source.winehq.org/git/wine.git/?a=commit;h=3d8efb238808a519902e047d8673237debb0f0a2

Is it possible to write a libPOSIX for Windows (Win32) without requiring a background service or DLL that's always loaded?

I know about Cygwin, and I know of its shortcomings. I also know about the slowness of fork, but not why on Earth it's not possible to work around that. I also know Cygwin requires a DLL. I also understand POSIX defines a whole environment (shell, etc...), that's not really what I care about here.
My question is asking if there is another way to tackle the problem. I see more and more POSIX functionality being implemented by the MinGW projects, but there's no complete solution providing full-blown POSIX functionality (comparable to the Linux/Mac/BSD implementation status).
The question really boils down to:
Can the Win32 API (as of MSVC20??) be efficiently used to provide a complete POSIX layer over the Windows API?
Perhaps this will turn out to be a full libc that only taps into the OS library for low-level things like filesystem access, threads, and process control. But I don't know exactly what else POSIX consists of. I doubt a library can turn Win32 into a POSIX-compliant entity.
POSIX <> Win32.
If you're trying to write apps that target POSIX, why are you not using some variant of *N*X? If you prefer to run Windows, you can run Linux/BSD/whatever inside Hyper-V/VMWare/Parallels/VirtualBox on your PC/laptop/etc.
Windows used to have a POSIX-compliant environment that ran alongside the Win32 subsystem, but it was discontinued after NT4 due to lack of demand. Microsoft bought Interix and released Services For Unix (SFU). While it's still available for download, SFU 3.5 is now deprecated and no longer developed or supported.
As to why fork is so slow: you need to understand that fork isn't just "create a new process"; it's "create a new process (itself an expensive operation) which is a duplicate of the calling process along with all its memory".
In *N*X, the forked process is mapped to the same memory pages as the parent (i.e., it is pretty quick) and is only given new pages as and when it tries to modify a shared page. This is known as copy-on-write. It is largely achievable because in UNIX there is no hard barrier between the parent and forked processes.
In NT, on the other hand, all processes are separated by a barrier enforced by CPU hardware. In NT, the easiest way to spawn a parallel activity which has access to your process' memory and resources, is to create a thread. Threads run within the memory space of the creating process and have access to all of the process' memory and resources.
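A minimal sketch of the thread route (the Worker function is an illustrative name):

    #include <windows.h>

    DWORD WINAPI Worker(LPVOID param)
    {
        // Runs in the creating process's address space: it can touch any of
        // the process's memory and resources directly.
        return 0;
    }

    int main()
    {
        HANDLE h = CreateThread(nullptr, 0, Worker, nullptr, 0, nullptr);
        WaitForSingleObject(h, INFINITE);   // join
        CloseHandle(h);
        return 0;
    }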
You can also share data between processes via various forms of IPC (RPC, named pipes, mailslots, memory-mapped files), but each technique has its own complexities, performance characteristics, etc. Read this for more details.
Because it tries to mimic UNIX, Cygwin's fork operation creates a new child process (in its own isolated memory space) and has to duplicate every page of memory in the parent process within the newly forked child. This can be a very costly operation.
Again, if you want to write POSIX code, do so in *N*X, not NT.
How about this
Most of the Unix API is implemented by the POSIX.DLL dynamically loaded (shared) library. Programs linked with POSIX.DLL run under the Win32 subsystem instead of the POSIX subsystem, so programs can freely intermix Unix and Win32 library calls.
From http://en.wikipedia.org/wiki/UWIN
The UWIN environment may be what you're looking for, but note that it is hosted at research.att.com; while UWIN is distributed under a liberal license, it is not the GNU license. Also, as it is research for AT&T, and only secondarily something they distribute for use, there are a lot of issues with documentation.
For more info, see my write-up in the last answer to Regarding 'for' loop in KornShell.
Hmm, the main UWIN link in that post is broken; try
http://www2.research.att.com/sw/download/
Also, you can look at
https://mailman.research.att.com/pipermail/uwin-users/
OR
https://mailman.research.att.com/pipermail/uwin-developers/
to get a sense of the features vs. the issues.
I hope this helps.
The question really boils down to: Can the Win32 API (as of MSVC20??) be efficiently used to provide a complete POSIX layer over the Windows API?
Short answer: No.
"Complete POSIX" means fork(), mmap(), signal() and such, and these are [almost] impossible to implement on NT.
To drive the point home: GNU Hurd has problems with fork() as well, because the Hurd kernel is not POSIX.
NT is not POSIX, either.
Another difference is persistence:
In POSIX-compliant systems it is possible to create system objects and leave them there. Examples of such objects are named pipes and shared memory objects (shms). You can create a named pipe or a shm, and leave it in the filesystem (or in a special filesystem-like place) where other processes will be able to access it. The downside is that a process might die and fail to clean up after itself, leaving unused objects behind (you know about zombie processes? same thing).
In NT every object is reference-counted, and is destroyed as soon as its last handle is closed. Files are among the few objects that persist.
Symlinks are a filesystem feature and don't exactly depend on the NT kernel, but the current implementation (in Vista and later) is incapable of creating object-type-agnostic symlinks. That is, a symlink is either a file symlink or a directory symlink, and must point to a file or a directory, respectively. If the target has the wrong type, the symlink won't work. You can give the symlink the right type if the target exists when you create it, but POSIX requires that symlinks can be created without their target existing. I can't imagine a use case for a symlink that points first to a file and later to a directory, but POSIX says this should work, and if it doesn't, you're not completely POSIX-compliant. And if your symlinking API/utility must be given an option specifying the right type when the target doesn't exist, that also breaks POSIX compatibility.
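To make the distinction concrete, here is a sketch of the API in question (the paths are illustrative; note that creating symlinks normally requires the SeCreateSymbolicLinkPrivilege):

    #include <windows.h>

    // The caller must decide up front whether the (possibly nonexistent)
    // target is a file or a directory; there is no type-agnostic option.
    int main()
    {
        CreateSymbolicLinkW(L"C:\\tmp\\file.lnk", L"C:\\tmp\\target.txt",
                            0);                             // file symlink
        CreateSymbolicLinkW(L"C:\\tmp\\dir.lnk", L"C:\\tmp\\targetdir",
                            SYMBOLIC_LINK_FLAG_DIRECTORY);  // directory symlink
        return 0;
    }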
It is possible to replicate some POSIX features to some degree (such as "integer descriptors in a single namespace, referencing any I/O object, and being select()able") without sacrificing [much] performance, but it is still a major undertaking, and the POSIX interface is really restrictive (that is, if you could just add one more argument to that function, it would have been possible to Do The Right Thing... but you can't, unless you want to throw POSIX compliance away).
Your best bet is not to rely on POSIX features that are difficult to port to non-POSIX systems, or to abstract in such a way that lower levels can have separate implementations for different OSes while the upper levels do not care about the details.

In Windows, should I use CreateFile or fopen, portability aside?

What are the differences, and in what cases one or the other would prove superior in some way?
First of all, the function fopen can be used only for simple, portable operations on files.
CreateFile, on the other hand, can be used not only for operations on files, but also on directories (with use of the corresponding options), pipes, and various Windows devices.
CreateFile has a lot of additional useful switches, like FILE_FLAG_NO_BUFFERING, FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_SEQUENTIAL_SCAN, which can be very useful in different scenarios.
You can use CreateFile with a filename longer than MAX_PATH characters. This can be important for some server applications, or for ones which must be able to open any file (a virus scanner or a backup application, for example). This is enabled by the "\\?\" path-namespace semantics, though this mode has its own concerns, like the ability to actually create a file named ".." or L"\xfeff\x20\xd9ab" (good luck trying to delete them later).
You can use CreateFile in different security scenarios, and I mean more than just the use of security attributes. If the current process has the SE_BACKUP_NAME or SE_RESTORE_NAME privilege (as Administrators typically do) and enables it, one can use CreateFile to open any file, even one to which the security descriptor denies you access.
If you only want to read the content of a file, you can use CreateFile, CreateFileMapping and MapViewOfFile to create a file mapping. Then you can work with the file as with a block of memory, which may increase your application's speed.
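A minimal sketch of that read-only mapping pattern (the path is illustrative; error handling omitted):

    #include <windows.h>

    // Map a file and read its contents through a pointer instead of
    // issuing ReadFile calls.
    int main()
    {
        HANDLE file = CreateFileW(L"C:\\tmp\\data.bin", GENERIC_READ,
                                  FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                  FILE_ATTRIBUTE_NORMAL, nullptr);
        HANDLE map = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        const char* bytes = static_cast<const char*>(
            MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0));
        // ... access the file contents directly through 'bytes' ...
        UnmapViewOfFile(bytes);
        CloseHandle(map);
        CloseHandle(file);
        return 0;
    }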
There are also other uses of the function, which are described in detail in the corresponding MSDN article.
So I can summarize: only if you have hard portability requirements, or if you need to pass a FILE* to some external library, do you have to use fopen. In all other cases I would recommend using CreateFile.
For best results, I would also advise learning the Windows API specifically, as it has many features that you can put to good use.
UPDATE: Not directly related to your question, but I also recommend taking a glance at the transactional I/O functions, which are supported starting with Windows Vista. Using this feature, you can commit a bunch of operations on files, directories, or the registry as one transaction that cannot be interrupted. It is a very powerful and interesting tool. If you are not ready to use the transactional I/O functions now, you can start with CreateFile and port your application to transactional I/O later.
That really depends on what type of program you are writing. If it is supposed to be portable, fopen will make your life easier. fopen will call CreateFile "behind the scenes".
Some more advanced options (cache control, file access control, etc) are only available if you are using the Win32 API (they depend on the Win32 file handle, as opposed to the FILE pointer in stdio), so if you are writing a pure Win32 application, you may want to use CreateFile.
CreateFile lets you:
Open a file for asynchronous I/O (a sketch follows this list)
Pass optimization hints like FILE_FLAG_SEQUENTIAL_SCAN
Set security and inheritance settings without threading issues
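A minimal sketch of an asynchronous read (the path is illustrative; error handling omitted):

    #include <windows.h>

    int main()
    {
        HANDLE file = CreateFileW(L"C:\\tmp\\data.bin", GENERIC_READ,
                                  FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                  FILE_FLAG_OVERLAPPED, nullptr); // async handle
        char buf[4096];
        OVERLAPPED ov = {};                              // read from offset 0
        ReadFile(file, buf, sizeof buf, nullptr, &ov);   // returns immediately
        // ... do other work while the read is in flight ...
        DWORD read = 0;
        GetOverlappedResult(file, &ov, &read, TRUE);     // wait for completion
        CloseHandle(file);
        return 0;
    }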
They don't return the same handle type; with fopen/FILE objects you can call other runtime functions, such as fputs (as well as convert them to a "native" file handle).
Whenever possible, prefer object-oriented wrappers that support RAII, like fstream or the Boost file I/O objects.
You should, of course, care about the share mode, so fopen() and the STL are insufficient.

Finding undocumented APIs in Windows

I was curious as to how does one go about finding undocumented APIs in Windows.
I know the risks involved in using them but this question is focused towards finding them and not whether to use them or not.
Use a tool to dump the export table from a shared library (for example, a .dll such as kernel32.dll). You'll see the named entry points and/or the ordinal entry points. Generally, for Windows, the named entry points are unmangled (extern "C"). You will most likely need to do some peeking at the assembly code and derive the parameters (types, number, order, calling convention, etc.) from the stack frame (if there is one) and register usage. If there is no stack frame, it is a bit more difficult, but still doable. See the following links for references:
http://www.sf.org.cn/symbian/Tools/symbian_18245.html
http://msdn.microsoft.com/en-us/library/31d242h4.aspx
Check out tools such as dumpbin for investigating export sections.
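Beyond dumpbin, you can enumerate exports programmatically by walking the PE export directory. A minimal sketch (ntdll.dll is just an example module; error checks omitted):

    #include <windows.h>
    #include <cstdio>

    // Walk the export directory of an already-loaded module and print the
    // names of its exported entry points (many of which are undocumented).
    int main()
    {
        BYTE* base = reinterpret_cast<BYTE*>(GetModuleHandleA("ntdll.dll"));
        auto dos = reinterpret_cast<IMAGE_DOS_HEADER*>(base);
        auto nt  = reinterpret_cast<IMAGE_NT_HEADERS*>(base + dos->e_lfanew);
        auto& dir = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
        auto exp = reinterpret_cast<IMAGE_EXPORT_DIRECTORY*>(base + dir.VirtualAddress);
        auto names = reinterpret_cast<DWORD*>(base + exp->AddressOfNames);
        for (DWORD i = 0; i < exp->NumberOfNames; ++i)
            std::printf("%s\n", reinterpret_cast<char*>(base + names[i]));
        return 0;
    }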
There are also sites and books out there that try to keep updated lists of undocumented Windows APIs:
The Undocumented Functions
A Primer of the Windows Architecture
How To Find Undocumented Constants Used by Windows API Functions
Undocumented Windows
Windows API
Edit:
These same principles work on a multitude of operating systems; however, you will need to swap out the tool you use to dump the export table. For example, on Linux you could use nm to dump an object file and list its exports section (among other things). You could also use gdb to set breakpoints and step through the assembly code of an entry point to determine what the arguments should be.
IDA Pro is your best bet here, but please please double please don't actually use them for anything ever.
They're internal because they change; they can (and do) even change as a result of a Hotfix, so you're not even guaranteed your undocumented API will work for the specific OS version and Service Pack level you wrote it for. If you ship a product like that, you're living on borrowed time.
Everybody here so far is missing a substantial chunk of functionality that comprises hugely undocumented portions of the Windows OS: RPC. RPC operations (think rpcrt4.dll, lsass.exe, csrss.exe, etc.) occur very frequently across all subsystems, via LPC ports or other interfaces, and their functionality is buried in the mystic incantations of various type/sub-type/struct-typedefs. They are substantially more difficult to debug, due to their asynchronous nature, or because they are destined for processes which, if you were to debug them by single-stepping, would lock up the entire system by blocking keyboard or other I/O ;)
ReactOS is probably the most expedient way to investigate undocumented APIs. They have a fairly mature kernel and other executive components built up. IDA is fairly time-intensive, and it's unlikely you will find anything the ReactOS people have not found already.
Here's a blurb from the linked page:
ReactOS® is a free, modern operating system based on the design of Windows® XP/2003. Written completely from scratch, it aims to follow the Windows® architecture designed by Microsoft from the hardware level right through to the application level. This is not a Linux based system, and shares none of the unix architecture.

The main goal of the ReactOS project is to provide an operating system which is binary compatible with Windows. This will allow your Windows applications and drivers to run as they would on your Windows system. Additionally, the look and feel of the Windows operating system is used, such that people accustomed to the familiar user interface of Windows® would find using ReactOS straightforward. The ultimate goal of ReactOS is to allow you to remove Windows® and install ReactOS without the end user noticing the change.
When I am investigating some rarely seen Windows construct, ReactOS is often the only credible reference.
Look at the system DLLs and what functions they export. Every API function, whether documented or not, is exported by one of them (user, kernel, ...).
For user-mode APIs you can open kernel32.dll, user32.dll, gdi32.dll, and especially ntdll.dll in Dependency Walker and find all the exported APIs. But, of course, you will not have the documentation.
Just found a good article on native APIs by Mark Russinovich.

Windows malloc replacement (e.g., tcmalloc) and dynamic crt linking

A C++ program that uses several DLLs and Qt should be equipped with a malloc replacement (like tcmalloc) for performance problems that can be verified to be caused by the Windows malloc. With Linux, there is no problem, but with Windows, there are several approaches, and I find none of them appealing:
1. Put new malloc in lib and make sure to link it first (Other SO-question)
This has the disadvantage that, for example, strdup will still use the old malloc, and a free may crash the program.
2. Remove malloc from the static libcrt library with lib.exe (Chrome)
This is tested/used(?) for Chrome/Chromium, but it has the disadvantage that it only works with a statically linked CRT. Static linking has the problem that, if one system library is linked dynamically against msvcrt, there may be mismatches in heap allocation/deallocation. If I understand it correctly, tcmalloc could be linked dynamically so that there is a common heap for all self-compiled DLLs (which is good).
3. Patch crt-source code (firefox)
Firefox's jemalloc apparently patches the Windows CRT source code and builds a new CRT. This again has the static/dynamic linking problem above.
One could think of using this approach to generate a dynamic MSVCRT, but I think that is not possible, because the license forbids providing a patched MSVCRT with the same name.
4. Dynamically patching loaded CRT at run time
Some commercial memory allocators can do such magic. tcmalloc can, too, but this seems rather ugly. It had some issues, but they have been fixed. Currently, this does not work with tcmalloc under 64-bit Windows.
Are there better approaches? Any comments?
Q: A C++ program that is split across several DLLs should:
A) replace malloc?
B) ensure that allocation and deallocation happen in the same DLL module?
A: The correct answer is B. A C++ application design that incorporates multiple DLLs SHOULD ensure that a mechanism exists such that things allocated on the heap in one DLL are freed by the same DLL module.
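A sketch of what such a mechanism can look like in its simplest form, a matching create/destroy pair exported from the owning DLL (Widget, CreateWidget, and DestroyWidget are hypothetical names):

    // Allocation and deallocation stay inside the module that owns them.
    struct Widget { int state = 0; };

    // --- exported by the DLL ---
    extern "C" __declspec(dllexport) Widget* CreateWidget()
    {
        return new Widget;      // allocated on this DLL's CRT heap
    }
    extern "C" __declspec(dllexport) void DestroyWidget(Widget* w)
    {
        delete w;               // freed by the same CRT that allocated it
    }

    // --- on the EXE side (usage) ---
    // Widget* w = CreateWidget();
    // ...
    // DestroyWidget(w);        // never a plain 'delete w' in the EXE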
Why would you split a C++ program into several DLLs anyway? By C++ program I mean that the objects and types you are dealing with are C++ templates, STL objects, classes, etc. You CAN'T pass C++ objects across DLL boundaries without either a lot of very careful design and lots of compiler-specific magic, or suffering massive duplication of object code in the various DLLs and, as a result, an application that is extremely version-sensitive. Any small change to a class definition will force a rebuild of all EXEs and DLLs, removing at least one of the major benefits of the DLL approach to app development.
Either stick to a straight C interface between app and DLLs, suffer hell, or just compile the entire C++ app as one EXE.
It's a bold claim that a C++ program "should be equipped with a malloc replacement (like tcmalloc) for performance problems...."
"[In] 6 out of 8 popular benchmarks ... [real-sized applications] replacing back the custom allocator, in which people had invested significant amounts of time and money, ... with the system-provided dumb allocator [yielded] better performance. ... The simplest custom allocators, tuned for very special situations, are the only ones that can provide gains." --Andrei Alexandrescu
Most system allocators are about as good as a general purpose allocator can be. You can do better only if you have a very specific allocation pattern.
Typically, such special patterns apply only to a portion of the program, in which case, it's better to apply the custom allocator to the specific portion that can benefit than it is to globally replace the allocator.
C++ provides a few ways to selectively replace the allocator. For example, you can provide an allocator to an STL container or you can override new and delete on a class by class basis. Both of these give you much better control than any hack which globally replaces the allocator.
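For instance, a minimal sketch of the class-by-class route (the pass-through pool_alloc/pool_free functions are placeholders for a real custom allocator):

    #include <cstdlib>

    // Only Node's allocations go through the custom functions; the rest of
    // the program keeps the default allocator. Here the "custom" allocator
    // just forwards to malloc/free as a stand-in.
    void* pool_alloc(std::size_t n) { return std::malloc(n); }
    void  pool_free(void* p)        { std::free(p); }

    struct Node {
        static void* operator new(std::size_t n)       { return pool_alloc(n); }
        static void  operator delete(void* p) noexcept { pool_free(p); }
        int value = 0;
    };

    int main()
    {
        Node* n = new Node;   // allocated via pool_alloc
        delete n;             // released via pool_free
        return 0;
    }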
Note also that replacing malloc and free will not necessarily change the allocator used by operators new and delete. While the global new operator is typically implemented using malloc, there is no requirement that it do so. So replacing malloc may not even affect most of the allocations.
If you're using C, chances are you can wrap or replace key malloc and free calls with your custom allocator just where it matters and leave the rest of the program to use the default allocator. (If that's not the case, you might want to consider some refactoring.)
System allocators have decades of development behind them. They are stable and well-tested. They perform extremely well for general cases (in terms of raw speed, thread contention, and fragmentation). They have debugging versions for leak detection and support for tracking tools. Some even improve the security of your application by providing defenses against heap buffer overrun vulnerabilities. Chances are, the libraries you want to use have been tested only with the system allocator.
Most of the techniques to replace the system allocator forfeit these benefits. In some cases, they can even increase memory demand (because they can't be shared with the DLL runtime possibly used by other processes). They also tend to be extremely fragile in the face of changes in the compiler version, runtime version, and even OS version. Using a tweaked version of the runtime prevents your users from getting benefits of runtime updates from the OS vendor. Why give all that up when you can retain those benefits by applying a custom allocator just to the exceptional part of the program that can benefit from it?
Where does your premise "A C++ program that uses several DLLs and QT should be equipped with a malloc replacement" come from?
On Windows, if all the DLLs use the shared MSVCRT, then there is no need to replace malloc. By default, Qt builds against the shared MSVCRT DLL.
One will run into problems if they:
1) mix DLLs that link the CRT statically with ones that use the shared VCRT,
2) AND free memory in a module other than the one it was allocated from (i.e., free memory in a statically linked DLL that was allocated by the shared VCRT, or vice versa).
Note that adding your own ref-counted wrapper around a resource can help mitigate the problems associated with resources that need to be deallocated in a particular way (i.e., a wrapper that disposes of one type of resource via a callback to the originating DLL, a different wrapper for a resource originating from another DLL, etc.).
nedmalloc? Also note that smplayer uses a special patch to override malloc, which may be the direction you're headed in.
