Working around fls limitations with too many statically linked CRTs?

Working around fls limitations with too many statically linked CRTs? - windows

When loading external DLLs (not under our control) via LoadLibrary, we're hitting a problem where the statically linked CRT in those DLLs are failing to allocate fiber-local storage. This is similar to mskb 193462, except that this is FLS and there's only 128 of them.
Are there any useful ways to work around the problem? The CRT is using GetProcAddress to find FlsAlloc anyway (since that apparently never existed in XP), so does it even really need it?
(This is on Vista, where FlsAlloc actually exists; the DLLs appear to be using MSVC8)

There is frankly no solution here, short of loading less dlls.
You could hook the dll's import address table - but that will happen too late as you can only install an IAT hook when LoadLibrary returns, and the CRT initialization code probably executes in response to DllProcessAttach which will already have been processed.
You could I guess find the kernel32.dll module in memory, and patch the export address for GetProcAddress or perhaps FlsAlloc to point to your implementation. But that approach is getting seriously hackish.

Related

actual machine code to execute what Win APIs do stays in OS kernel memory space or compiled together as part of the app?

If this question deals with too basic a matter, please forgive me.
As a somewhat-close-to-beginner-level programmer, I really wonder about this--whether the underlying code of every win API function is compiled altogether at the time of writing an app, or whether the machine code for executing win APIs stays in the memory as part of the OS since the pc is booted up, and only the app uses them?
All the APIs for an OS are used by many apps by means of function call. So I thought that rather than making every individual app include the API machine code on their own, apps just contain the header or signature to call the APIs and the API machine code addresses are mapped when launching the app.
I am sorry that I failed to make this question succinct due to my poor English. I really would like to get your insights. Thank you.

The implementation for (most) API calls is provided by the system by way of compiled modules (Portable Executable images). Application code only contains enough information so that the system can identify and load the required modules, and resolve the respective imports.
As an example consider the following code that shows a message box, waits for it to close, and then exits the program:
#include <Windows.h>
int main()
{
::MessageBoxW(nullptr, L"Foo", L"Bar", MB_OK);
}
Given the function signature (declared in WinUser.h, which gets pulled in from Windows.h) the compiler can almost generate a call instruction. It knows the number of arguments, their expected types, and the order and location the callee expects them in. What's missing is the actual target address inside user32.dll, that's only known after a process was fully initialized, and had the user32.dll module mapped into its address space.
Clearly, the compiler cannot postpone code generation until after load time. It needs to generate a call instruction now. Since we know that "all problems in computer science can be solved by another level of indirection" that's what the compiler does, too: Instead of emitting a direct call instruction it generates an indirect call. The difference is that, while a direct call immediately needs to provide the target address, an indirect call can specify the address at which the target address is stored.
In x86 assembly, instead of having to say
call _MessageBoxW#16 ; uh-oh, not yet known
the compiler can conveniently delegate the call to the Import Address Table (IAT):
call dword ptr [__imp__MessageBoxW#16]
Disaster averted, we've bought us just enough time to fix things up before the code actually executes.
Once a process object is created the system hands over control to its primary thread to finish initialization. Part of that initialization is loading dependencies (such as user32.dll here). Once that has completed, the system finally knows the load address (and ultimately the address of imported symbols, such as _MessageBoxW#16), and can overwrite the IAT entry at address __imp__MessageBoxW#16 with the imported function address.
And that is approximately how the system provides implementations for system services without requiring client applications to know where (physically) they will find them.
I'm saying "approximately" because things are somewhat more involved in reality. If that is something you'll want to learn about, I'll leave it up to Raymond Chen. He has published a series of blog entries covering this topic in far more detail:
How were DLL functions exported in 16-bit Windows?
How were DLL functions imported in 16-bit Windows?
How are DLL functions exported in 32-bit Windows?
Exported functions that are really forwarders
Rethinking the way DLL exports are resolved for 32-bit Windows
Calling an imported function, the naive way
How a less naive compiler calls an imported function
Issues related to forcing a stub to be created for an imported function
What happens when you get dllimport wrong?
Names in the import library are decorated for a reason
Why can't I GetProcAddress a function I dllexport'ed?

Linking with TCMalloc but the CRT malloc always called

I would like to experiment a bit with TCMalloc on Windows. I have built the VisualStudio solution which is part of the gperftools package I downloaded. But when I run any of the test apps which also came with the download, say tcmalloc_minimal_unittest.exe, all the memory allocation calls go to the standard malloc. Has anybody seen this already and knows what I should do? Many thanks.

Ok, I answer my own question. This may be useful to someone else. I was seeing on the VS debugger that the CRT malloc was invoked, but looking at the assembler code I see that the beginning of the function is patched, showing a jump to Perftools_malloc. So, apparently, instead of presenting a different API, TCMalloc hooks into regular calls to the CRT.

Whether the APIs in kernel32.dll (or others) have subrutines

I was wondering that whether the APIs in kernel32.dll (or others) have subrutines.
For example the CopyFile function, it should take different action to copy file from C: to D: and from a netshare path (\HOSTNAME\SHAREDFOLDER\FILENAME) to somewhere, or trigger the windows server 2012 (hyper-v) new feature ODX.
So in the definition of the CopyFile function, there should be some if/else branch, and call some sub function, isn't it?
If the subrutines exist. Is it possible to call the these sub functions directly, and is it possible to hook them?
Thanks.

As far as I know, the current implementation of kernel32.dll calls functions in ntdll.dll. The functions in ntdll.dll then do a syscall into the kernel somehow.
To answer your question, yes, it calls subroutines, and they probably can be hooked, but most of the logic about how specifically to read from and write to filesystems in different ways is probably buried in the kernel.
Keep in mind that you're probably not supposed to be digging into the internals of these DLLs — it's best to use the public interface. Relying on implementation details makes your code more fragile and likely to break with operating system upgrades.

Is it safe to call LoadLibrary from DllMain if you've used a kernel driver to ensure yours is the first library loaded?

I've been looking at some hooking code which selectively loads a library into certain processes and then hooks certain native API functions (using Detours). The chain of events looks like this:
Kernel driver loads A.dll into every process.
A.dll::DllMain() decides whether to load B.dll (LoadLibraryEx) which contains actual Detours hooks.
B.dll runs for the duration of the process hooking said functions.
The second bullet here appears to break the DllMain rules specified here, but I'm trying to work out if the way the driver loads A.dll works around the limitations. Specifically, the kernel driver uses PsSetLoadImageNotifyRoutine to get notifications when each process starts and then queues an APC to call LoadLibraryEx on A.dll which means it's pretty much the first DLL loaded when the process starts. Does this circumvent the problems with calling LoadLibrary within DllMain?

Doesn't matter how the LoadLibraryEx was triggered. Once triggered, the DLL loading process is the same, and the same rules apply.

The documentation very specifically says not to call LoadLibrary in DllMain. Even in the unlikely event that you figured out a safe way to make it work, it may not work in the next version (or even the next service pack) of Windows.

dynamic link library

I know that dynamic link library are loaded in memory when an application loaded, the reference is resolved by operation system loader. For example, in windows kernel32.dll, user32.dll and gdi32 dll, so if my application reference a function in a kernel32.dll, for example CreateWindow, is that the whole dll must be loaded in the process, or just part of the dll?
Thanks

whole thing, but don't worry, it's not re-loading the dll over and over, there is one instance for all the programs that use it....another name for dll is so....or shared object, and that's the whole point, to share.
http://en.wikipedia.org/wiki/Dynamic_link_library

You reference one function, you get the whole DLL. You can't load just part of a DLL.
It's annoying because you get all of Shell32.dll just to find where someone's home directory is. Sigh.

Don't worry about this so much, when you "load" a DLL, it's really just a mapped memory file; the Windows OS uses the page fault mechanism to bring in pages on-demand; so if you only use a small piece of the DLL you aren't actually going to fault the whole thing in.

Only the functions you use in that DLL is required, do not worry about cramping the memory, as most of these DLL's are standard and not alone that they are dynamic, the very reason why only certain functions that your code uses are loaded, not the entire dll.
Hope this helps,
Best regards,
Tom.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Working around fls limitations with too many statically linked CRTs? - windows

Related

actual machine code to execute what Win APIs do stays in OS kernel memory space or compiled together as part of the app?

Linking with TCMalloc but the CRT malloc always called

Whether the APIs in kernel32.dll (or others) have subrutines

Is it safe to call LoadLibrary from DllMain if you've used a kernel driver to ensure yours is the first library loaded?

dynamic link library

Categories

Resources