Drawbacks of using /LARGEADDRESSAWARE for 32-bit Windows executables? - windows

We need to link one of our executables with this flag because it uses lots of memory.
But why give one EXE file special treatment? Why not standardize on /LARGEADDRESSAWARE?
So the question is: is there anything wrong with using /LARGEADDRESSAWARE even if you don't need it? Why not use it as standard for all EXE files?

Blindly applying the LargeAddressAware flag to your 32-bit executable deploys a ticking time bomb!
By setting this flag you are testifying to the OS:
"Yes, my application (and all DLLs loaded at runtime) can cope with memory addresses up to 4 GB,
so don't restrict the VAS for the process to 2 GB but unlock the full range (of 4 GB)."
But can you really guarantee that?
Do you take responsibility for all the system DLLs, Microsoft redistributables, and third-party modules your process may use?
Usually, memory allocation returns virtual addresses in low-to-high order. So, unless your process consumes a lot of memory (or has a very fragmented virtual address space), it will never use addresses beyond the 2 GB boundary, which hides bugs related to high addresses.
If such bugs exist, they are hard to identify. They will show up sporadically, "sooner or later"; it's just a matter of time.
Luckily, there is an extremely handy system-wide switch built into the Windows OS:
for testing purposes, use the MEM_TOP_DOWN registry setting.
This forces all memory allocations to go from the top down, instead of the normal bottom up.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"AllocationPreference"=dword:00100000
(This is hex 0x100000 and requires a Windows reboot, of course.)
With this switch enabled you will identify issues "sooner" rather than "later";
ideally you'll see them "right from the beginning".
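For quick spot checks without rebooting, individual allocations can also be forced high: MEM_TOP_DOWN is a documented per-call VirtualAlloc flag. A minimal sketch (not part of the original answer) that verifies whether an LAA build actually receives addresses above the 2 GB boundary:

// Verify that an LAA-flagged 32-bit build really gets high addresses.
// With the AllocationPreference registry switch set, even the plain
// allocation should land above 0x80000000; MEM_TOP_DOWN forces it for a
// single call without touching the registry.
#include <windows.h>
#include <cstdio>

int main()
{
    void *normal = VirtualAlloc(nullptr, 64 * 1024,
                                MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    void *topDown = VirtualAlloc(nullptr, 64 * 1024,
                                 MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN, PAGE_READWRITE);
    std::printf("normal allocation:   %p\n", normal);
    std::printf("top-down allocation: %p\n", topDown);  // expect >= 0x80000000 with LAA
    VirtualFree(normal, 0, MEM_RELEASE);
    VirtualFree(topDown, 0, MEM_RELEASE);
    return 0;
}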
Side note: for a first analysis I strongly recommend the SysInternals tool VMMap.
Conclusions:
When applying the LAA flag to your 32-bit executable, it is mandatory to fully test it on an x64 OS with the top-down AllocationPreference switch set.
Issues in your own code you may be able to fix.
Just to name one very obvious example: use unsigned instead of signed integers for memory pointers.
When encountering issues with third-party modules, you need to ask the author to fix their bugs. Until that is done, you had better remove the LargeAddressAware flag from your executable.
A note on testing:
the MemTopDown registry switch does not achieve the desired results for unit tests that are executed by a "test runner" which is itself not LAA-enabled.
See: Unit Testing for x86 LargeAddressAware compatibility
PS:
Also very "related" and quite interesting is the migration from 32-bit code to 64-bit.
For examples see:
As a programmer, what do I need to worry about when moving to 64-bit windows?
https://www.sec.cs.tu-bs.de/pubs/2016-ccs.pdf (twice the bits, twice the trouble)

Because lots of legacy code is written with the expectation that "negative" pointers are invalid. Anything in the top 2 GB of a 32-bit process has the MSB set.
As such, it's far easier for Microsoft to play it safe and require that applications which (a) need the full 4 GB and (b) have been developed and tested in a large-memory scenario simply set the flag.
It's not - as you have noticed - that hard.
Raymond Chen, in his blog The Old New Thing, covers the issues with turning it on for all (32-bit) applications.
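A minimal sketch (hypothetical code, not from the answer) of the kind of legacy check that breaks once valid pointers can have the MSB set:

// Legacy "negative pointer" check: once LAA unlocks addresses above
// 0x80000000, perfectly valid pointers make the signed cast go negative
// and the sanity check rejects them.
#include <cstdint>

bool looks_valid(const void *p)
{
    return reinterpret_cast<std::intptr_t>(p) > 0;   // wrong: fails for addresses >= 2 GB
}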

No, "legacy code" in this context (C/C++) is not exclusively code that plays ugly tricks with the MSB of pointers.
It also includes all the code that uses 'int' to store the difference between two pointer, or the length of a memory area, instead of using the correct type 'size_t' : 'int' being signed has 31 bits, and can not handle a value of more than 2 Gb.
A way to cure a good part of your code is to go over it and correct all of those innocuous "mixing signed and unsigned" warnings. It should do a good part of the job, at least if you haven't defined function where an argument of type int is actually a memory length.
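A small sketch (hypothetical code) of the int-as-length bug described above:

// Storing a memory length in int: once a buffer can exceed INT_MAX bytes,
// the signed length typically goes negative and the copy is silently skipped.
#include <cstddef>
#include <cstring>

void copy_buffer(char *dst, const char *src, std::size_t n)
{
    int len = static_cast<int>(n);                 // wrong: negative once n exceeds INT_MAX
    if (len <= 0)
        return;                                     // "nothing to do" - data silently lost
    std::memcpy(dst, src, static_cast<std::size_t>(len));
}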
However that "legacy code" will probably apparently work right for quite a while, even if you correct nothing.
You'll only break when you'll allocate more than 2 Gb in one block. Or when you'll compare two unrelated pointers that are more than 2 Gb away from each other.
As comparing unrelated pointers is technically an undefined behaviour anyway, you won't encounter that much code that does it (but you can never be sure).
And very frequently even if in total you need more than 2Gb, your program actually never makes single allocations that are larger than that. In fact in Windows, even with LARGEADDRESSAWARE you won't be able by default to allocate that much given the way the memory is organized. You'd need to shuffle the system DLL around to get a continuous block of more than 2Gb
But Murphy's laws says that kind of code will breaks one day, it's just that it will happen very long after you've enable LARGEADDRESSAWARE without checking, and when nobody will remember this has been done.

Related

x64 allows less threads per block than Win32?

When executing some CUDA kernels, I noticed that for many of my own CUDA kernels the x64 build would cause a failure, whereas the Win32 build would not.
I am very confused because the CUDA source code is the same and the build succeeds. It is just that when the x64 build executes, it says it requested too many resources to launch. But shouldn't x64 conceptually allow more resources than Win32?
I normally like to use 1024 threads per block where possible. So to make the x64 code work, I have to downsize the block to 256.
Does anyone have any idea?
Yes, it's possible. Presumably the issue you are talking about is a registers-per-thread issue.
In 32-bit mode, all pointers are 32 bits and require only one 32-bit register for storage on the GPU. With the exact same source code compiled in 64-bit mode, those pointers require 64 bits and therefore effectively occupy two 32-bit registers (and, as @njuffa points out below, certain other types can change size as well, requiring double the registers). The number of available 32-bit registers is a hardware limit that does not change whether you compile for 32-bit or 64-bit mode, but pointer storage will use twice as many registers in 64-bit mode.
Pointer arithmetic (or arithmetic involving any of the types that increase in size) may likewise be impacted, as some of it may need to be done using 64-bit arithmetic instead of 32-bit arithmetic.
If these per-thread register increases in 64-bit mode push your overall usage over the limit, you will have to use one of a variety of methods to manage it. You've mentioned one already: reduce the number of threads. You can also investigate the nvcc -maxrregcount ... switch and/or the __launch_bounds__ directive.
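For reference, a minimal CUDA C++ sketch (the kernel name and body are placeholders, not from the question) of the __launch_bounds__ approach: it tells the compiler the maximum block size you intend to launch, so per-thread register usage is capped accordingly.

// Tell the compiler this kernel will be launched with at most 1024 threads
// per block, so it limits register usage enough to keep that launch legal.
__global__ void __launch_bounds__(1024)
scale_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Launch with the full block size:
//   scale_kernel<<<(n + 1023) / 1024, 1024>>>(d_in, d_out, n);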

Forcing Windows to load DLL's at places so that memory is minimally fragmented

My application needs lots of memory and big data structures in order to do its work.
Often the application needs more than 1 GB of memory, and in some cases my customers really need to use the 64-bit version of the application because they have several gigabytes of memory.
In the past, I could easily explain to the user that if memory usage reached 1.6 to 1.7 GB, the application was 'out of memory' or really close to it, and that they needed to reduce their data or move to the 64-bit version of the application.
Over the last year I have noticed that the application often runs out of memory after using only about 1 GB. After some investigation it appeared that the cause of this problem is memory fragmentation. I used VMMap (a SysInternals utility) to look at the memory usage of my application and saw something like this:
The orange areas are memory allocated by my application. The purple areas are executable code.
As you can see in the bottom half of the image, the purple areas (which are the DLLs) are loaded at many different addresses, causing my memory to be fragmented. This isn't really a problem if my customer does not have a lot of data, but if my customer has data sets that take more than 1 GB and a part of the application needs a big block of memory (e.g. 50 MB), it can result in a memory allocation failure, causing my application to crash.
Most of my data structures are STL-based and often don't require big chunks of contiguous memory, but in some cases (e.g. very big strings) a contiguous block of memory really is needed. Unfortunately, it is not always possible to change the code so that it doesn't need such a contiguous block.
The questions are:
How can I influence where DLLs are loaded in memory, without explicitly running REBASE on all the DLLs on the customer's computer and without loading all DLLs explicitly?
Is there a way to specify load addresses of DLLs in your own application's manifest file?
Or is there a way to tell Windows (via the manifest file?) not to scatter the DLLs around (I think this scattering is called ASLR)?
Of course, the best solution is one that I can influence from within my application's manifest file, since I rely on the automatic/dynamic loading of DLLs by Windows.
My application is a mixed mode (managed+unmanaged) application, although the major part of the application is unmanaged.
Any suggestions?
First, your virtual address space fragmentation should not necessarily cause the out-of-memory condition. That would only be the case if your application had to allocate contiguous memory blocks of the appropriate size; otherwise the impact of the fragmentation should be minor.
You say most of your data is "STL-based", but if for example you allocate a huge std::vector you'll need a contiguous memory block.
AFAIK there is no way to specify the preferred mapping address of a DLL when it is loaded. So there are only two options: either rebase it (the DLL file), or implement DLL loading yourself (which is not trivial, of course).
Usually you don't need to rebase the standard Windows API DLLs; they are loaded into your address space very tightly. Fragmentation may come from third-party DLLs (such as Windows hooks, antivirus injections, etc.).
You cannot do this with a manifest; it must be done with the linker's /BASE option. Linker + Advanced + Base Address in the IDE. The most flexible way is to use the /BASE:@filename,key syntax so that the linker reads the base address from a text file.
The best way to get the text file filled in is from the Debug + Windows + Modules window. Get the Release build of your program loaded in the debugger and load up the whole shebang. Debug + Break All, bring up the window, and copy-paste it into the text file. Edit it to match the required format, calculating load addresses from the Address column. Leave enough space between the DLLs so that you don't constantly have to tweak the text file.
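A sketch of what such a base-address file might look like (the module names, addresses, and sizes below are made up); each of your DLLs is then linked with /BASE:@dllbases.txt,<its key>:

; dllbases.txt - one module per line: key, base address, optional maximum size
mymodule     0x60000000   0x00400000
helperlib    0x60400000   0x00200000
thirdparty   0x60600000   0x00800000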
If you are able to execute some of your own code before the libraries in question are loaded, you could reserve yourself a nice big chunk of address space ahead of time to allocate from.
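A minimal sketch of that early-reservation idea (assuming VirtualAlloc and a size picked to taste; not from the original answer):

// Reserve (not commit) one big contiguous range before late-loading DLLs,
// hooks, etc. can fragment the address space; commit pieces of it later
// when the large allocations are actually needed.
#include <windows.h>

static void        *g_bigRegion    = nullptr;
static const SIZE_T kBigRegionSize = 512 * 1024 * 1024;   // 512 MB, adjust as needed

void ReserveBigRegionEarly()
{
    g_bigRegion = VirtualAlloc(nullptr, kBigRegionSize, MEM_RESERVE, PAGE_NOACCESS);
}

void *CommitFromBigRegion(SIZE_T offset, SIZE_T size)
{
    return VirtualAlloc(static_cast<char *>(g_bigRegion) + offset,
                        size, MEM_COMMIT, PAGE_READWRITE);
}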
Otherwise, you need to identify the DLLs responsible, to determine why they are being loaded. For example, are they part of .NET, the language's runtime library, your own code, or third party libraries you are using?
For your own code the most sensible solution is probably to use static instead of dynamic linking. This should also be possible for the language runtime and may be possible for third party libraries.
For third party libraries, you can change from using implicit loading to explicit loading, so that loading only takes place after you've reserved your chunk of address space.
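A sketch of the implicit-to-explicit switch (the DLL name and export are placeholders): link without the import library and load the module only after the address-space reservation has been made.

// Explicitly load a third-party DLL after the big region has been reserved.
#include <windows.h>

typedef int (__cdecl *DoWorkFn)(int);

int CallThirdPartyDoWork(int arg)
{
    HMODULE mod = LoadLibraryW(L"thirdparty.dll");           // deferred load
    if (mod == nullptr)
        return -1;
    DoWorkFn doWork = reinterpret_cast<DoWorkFn>(GetProcAddress(mod, "DoWork"));
    int result = doWork ? doWork(arg) : -1;
    // FreeLibrary(mod) once the DLL is no longer needed.
    return result;
}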
I don't know if there's anything you can do about .NET libraries; but since most of your code is unmanaged, it might be possible to eliminate the managed components in order to get rid of .NET. Or perhaps you could split the .NET parts into a separate process.

How portable is mmap?

I've been considering using mmap for file reading, and was wondering how portable that is.
I'm developing on a Linux platform, but would like my program to work on Mac OS X and Windows.
Can I assume mmap is working on these platforms?
The mmap() function is a POSIX call. It works fine on MacOS X (and Linux, and HP-UX, and AIX, and Solaris).
The problem area will be Windows. I'm not sure whether there is an _mmap() call in the POSIX 'compatibility' sub-system. It is likely to be there — but will have the name with the leading underscore because Microsoft has an alternative view on namespaces and considers mmap() to intrude on the user name space, even if you ask for POSIX functionality. You can find a definition of an alternative Windows interface MapViewOfFile() and discussion about performance in another SO question (mmap() vs reading blocks).
If you try to map large files on a 32-bit system, you may find there isn't enough contiguous space to allocate the whole file in memory, so the memory mapping will fail. Do not assume it will work; decide what your fallback strategy is if it fails.
Using mmap for reading files isn't portable if you rely on mapping large parts of large files into your address space - 32-bit systems can easily lack a single large usable region of address space (say 1 GB), so an mmap of a 1 GB file would fail quite often.
The principle of a memory-mapped file is fairly portable, but you don't have mmap() on Windows (things like MapViewOfFile() exist instead). You could take a peek at the Python mmap module's C code to see how they do it for various platforms.
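A minimal read-only mapping sketch for both families of APIs (error handling trimmed to the essentials; this is an illustration, not the Python module's actual code):

#include <cstddef>

#ifdef _WIN32
#include <windows.h>

// Map a whole file read-only; returns nullptr on failure.
void *map_file_readonly(const char *path, std::size_t *len)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    void *view = mapping ? MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0) : nullptr;
    if (view) *len = static_cast<std::size_t>(size.QuadPart);
    if (mapping) CloseHandle(mapping);   // the view keeps the mapping alive
    CloseHandle(file);
    return view;
}
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Same idea with POSIX mmap(); returns nullptr on failure.
void *map_file_readonly(const char *path, std::size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    fstat(fd, &st);
    void *view = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                            // the mapping stays valid after close
    if (view == MAP_FAILED) return nullptr;
    *len = static_cast<std::size_t>(st.st_size);
    return view;
}
#endif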
I consider memory-mapped I/O on UNIX systems unusable for interactive applications, as it may result in a SIGSEGV/SIGBUS (in case the file has been truncated in the meantime by some other process). Ignoring such sick "solutions" as setjmp/longjmp, there is nothing one can do other than terminate the process after getting SIGSEGV/SIGBUS. The new G++ feature to convert such signals into exceptions seems to be intended mainly for Apple's OS, since the description states that one needs runtime support for this G++ feature, and there is no information to be found about it anywhere. We will probably have to wait a couple of years until structured exception handling, as it has existed on Windows for more than 20 years, makes its way into the UNIX world.

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a Linux environment, there are two processes that attach multiple shared memory segments. The first process periodically loads a new set of files to be shared, and writes the shared memory id (shmid) into a location in the "master" shared memory segment. The second process continually reads this "master" location and uses the shmid to attach the other shared segments.
On a multi-CPU host, it seems to me it might be implementation-dependent as to what happens if one process tries to read the memory while it's being written by the other. But perhaps hardware-level bus locking prevents mangled bits on the wire? It wouldn't matter if the reading process got a very-soon-to-be-changed value; it would only matter if the read were corrupted to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything definitive in this area.
I strongly suspect it's not safe or sane, and what I'd really like is some pointers to articles that describe the problems in detail.
It is legal - as in, the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There won't be "mangled bits on the wire"; they will come out as either ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them, and there are NO guarantees on how fast they'll be written versus how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens correctly.
Locking will make it safe, and it's not that hard to do. So just do it.
@unknown - the question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library, and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in that case the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs. Both are still popular and don't necessarily work the way you think.
So, actually, you can have a situation where "some other CPU core picks up a word-sized value half-written". You should write your code as if this type of thing is expected to happen when you are not using synchronization.
Now, there are ways to perform your writes to make sure you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers - especially if you are interested in portability (which you should be, even if you never port your code).
The problem is actually worse than some people here have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core; other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location 'b' modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor), the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits. This can complicate things immensely.
Use a memory barrier and you'll be OK.
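In modern C++ terms, the memory-barrier advice corresponds to release/acquire ordering. A minimal sketch (assuming the 32-bit slot in the master segment can be treated as a lock-free std::atomic<int>; the answers here predate C++11):

#include <atomic>

// The 32-bit "master" slot; set up to point at the slot inside the shared segment.
std::atomic<int> *published_shmid;

// Writer: populate the new segment completely, then publish its id.
void publish(int shmid)
{
    // ... write all data into the segment identified by shmid ...
    published_shmid->store(shmid, std::memory_order_release);
}

// Reader: the acquire load guarantees that once a new id is seen,
// the writes to that segment are visible as well.
int poll_latest()
{
    return published_shmid->load(std::memory_order_acquire);
}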
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-Pentium Pro Intel CPU. From what I just read, Intel made its later CPUs essentially ignore the LOCK prefix on machine code. Instead, the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine instructions, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via the cache are not guaranteed ever to become visible to another CPU, or to occur in any particular order.
So just because something works on your Core 2 system doesn't mean that your code is correct. If you want to check portability, try your code on other SMP architectures as well, like PPC (an older Mac Pro or a Cell blade), Itanium, IBM POWER, or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then it probably can), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or when the values being written cannot be written atomically. You will never get corrupted (half-written) values. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also on ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it will see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier.)
Also, it's not clear how you can safely free the shared memory region with this design.
With a recent kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine.) A simple lock around the shared variable would mean you don't have to worry about arcane memory-barrier issues.
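A minimal sketch of such a process-shared mutex (the Master layout is hypothetical; link with -lpthread):

#include <pthread.h>

// Layout of the "master" segment, protected by a process-shared mutex.
struct Master {
    pthread_mutex_t lock;
    int             latest_shmid;
};

// Done once, by the process that creates the master segment.
void init_master(Master *m)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

void publish_shmid(Master *m, int shmid)
{
    pthread_mutex_lock(&m->lock);
    m->latest_shmid = shmid;
    pthread_mutex_unlock(&m->lock);
}

int read_shmid(Master *m)
{
    pthread_mutex_lock(&m->lock);
    int id = m->latest_shmid;
    pthread_mutex_unlock(&m->lock);
    return id;
}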
I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed. I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumers/producers usually need queues for synchronization anyway.
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page, particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure: non-locked access to shared resources can, even for atomic resources, cause a misreporting/misrepresentation of the confidence that an action was done. Essentially, in the time period between checking whether or not it CAN modify the resource, the resource gets externally modified, and therefore the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have on the bus, especially on bus-bandwidth-constrained systems.
Here is an article about this issue in some depth; it discusses some alternative scheduling algorithms which reduce the overall demand for exclusive access over the bus, which in some cases increases total throughput by over 60% compared with a naive scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and is not much in the way of real code (darn academics), but it is worth the read and worth considering for this problem.
More recent CPU ABIs provide alternative operations beyond a simple lock-whatever.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, two instructions added with SSE3, where a simple test case identified an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
The Art of Assembly also uses synchronization without the use of the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get into trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky, because multithreading can do more interleaving than signals can.
If the shmid crosses cache lines (partly in one cache line and partly in another) then, while the writing CPU is writing, you are sure to find a reading CPU reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with memory bus internally.

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU-specific instructions within a DLL?
Slightly longer version:
When downloading (32-bit) DLLs from, say, Microsoft, it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (i.e. the minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface from the DLL but utilise CPU-specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
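A minimal sketch of the jump-table idea (hypothetical function names; it detects SSE2 with __cpuid rather than by reading the registry key above):

#include <windows.h>
#include <intrin.h>

// One exported entry point, routed through a pointer that DllMain fills in
// based on the CPU features detected at load time.
typedef void (*TransformFn)(float *data, int n);

static void transform_scalar(float *data, int n)   // lowest common denominator
{
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}

static void transform_sse2(float *data, int n)     // stand-in for an SSE2-tuned version
{
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}

static TransformFn g_transform = transform_scalar;

BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID)
{
    if (reason == DLL_PROCESS_ATTACH) {
        int info[4];
        __cpuid(info, 1);
        if (info[3] & (1 << 26))                    // EDX bit 26: SSE2 support
            g_transform = transform_sse2;
    }
    return TRUE;
}

extern "C" __declspec(dllexport) void Transform(float *data, int n)
{
    g_transform(data, n);
}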
The DLL is expected to work on every computer that Win32 runs on, so you are stuck with the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used is basically as follows:
- determine CPU features like MMX and SSE at runtime
- if they are present, use them; if not, have fallback code ready
Because you cannot let your compiler optimise for anything other than i386, you will have to write the code that uses the specific instruction sets in inline assembler. I don't know if there are higher-level language toolkits for this. Determining the CPU features is straightforward, but may also need to be done in assembler.
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about a fallback - there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it too. (OK, you get two cakes - your DLL will be bigger.) And even MSVC 2005 can do it for very specific cases (e.g. memcpy() can use SSE4).
There are many ways to switch between different versions. A DLL is loaded because the loading process needs functions from it. Function names are converted into addresses. One solution is to make this lookup depend not just on the function name, but also on processor features. Another method uses the fact that the name-to-address lookup uses a table of pointers as an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions, so that foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if this has changed), Microsoft used to optimize its DLLs for size rather than speed. This is because the reduction in the overall size of the DLL gave a higher performance boost than any other optimization the compiler could generate: the speed-up from micro-optimization would be decidedly small compared to the speed-up from not having the CPU wait for memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro-optimizations, simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall into this category. You can rest assured that such critical loops would already be optimized in assembler by Microsoft's software engineers to some level, not leaving much behind for the compiler to find. (I know it's expecting too much, but I hope they do this.)
As you can see, there would be only drawbacks from the increased DLL size that comes with including additional versions of code tuned for different architectures, when most of that code is rarely used and never part of the critical code that consumes most of your CPU cycles.
