Visual Studio parameter alignment restrictions and Windows x64 ABI

With Visual C++ on Win32 there's a long-standing problem with functions that take four or more SSE parameters, e.g.
__m128i foo4(__m128i m0, __m128i m1, __m128i m2, __m128i m3) {}
generates an error:
align.c(8) : error C2719: 'm3': formal parameter with __declspec(align('16')) won't be aligned
To compound the problem, Visual C++ still needlessly imposes the ABI restriction even if the function is __inline.
I'm wondering if this is still a problem on 64-bit Windows? Does the ABI restriction still apply on x64?
(I don't have access to a 64-bit Windows system, otherwise I'd try it myself, and an extensive Google search hasn't turned up anything definitive.)

You can pass as many 128-bit SSE intrinsic parameters as you like under x64. The x64 ABI was designed with these types in mind.
From the MSDN documentation:
__m128 types, arrays and strings are never passed by immediate value but rather a pointer is passed to memory allocated by the caller. Structs/unions of size 8, 16, 32, or 64 bits and __m64 are passed as if they were integers of the same size. Structs/unions other than these sizes are passed as a pointer to memory allocated by the caller. For these aggregate types passed as a pointer (including __m128), the caller-allocated temporary memory will be 16-byte aligned.
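As a minimal sketch (my own illustration, not from the MSDN page): the four-parameter function from the question compiles cleanly for x64, because each __m128i argument is passed as a pointer to 16-byte-aligned, caller-allocated memory.
#include <emmintrin.h>
// Rejected with C2719 when compiled for 32-bit x86, accepted for x64:
__m128i sum4(__m128i m0, __m128i m1, __m128i m2, __m128i m3)
{
    return _mm_add_epi32(_mm_add_epi32(m0, m1), _mm_add_epi32(m2, m3));
}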

Related

what type is DWORD_PTR, LONG_PTR, UINT_PTR

I'm asking if it's an integer, string, character, or boolean.
I'm trying to use them like normal variable types. It would be best if someone answered them separately and consecutively. I know it would be better to ask Microsoft itself, but they say,
"The following data types are always the size of a pointer, that is, 32 bits wide in 32-bit applications, and 64 bits wide in 64-bit applications. The size is determined at compile time. When a 32-bit application runs on 64-bit Windows, these data types are still 4 bytes wide."
And I have no idea whatever that means.
"The following data types are always the size of a pointer,
Pointers are generally 4-bytes on 32-bit applications and 8-bytes on 64-bit applications
that is, 32 bits wide in 32-bit applications, and 64 bits wide in
64-bitapplications.
What I just said.
32-bitapplication runs on 64-bit Windows, these data types are still 4
bytes wideThe size is determined at compile time. When a
The pointer size on a 32-bit application running on 64-bit Windows is still 4 bytes.
Basically, it breaks down like this:
For 32-bit apps (running on 32-bit or 64-bit OS):
DWORD_PTR => unsigned long
LONG_PTR => long
UINT_PTR => unsigned int
// sizeof(int) and sizeof(long) are both 4 on 32-bit windows
For 64-bit apps (which only run on 64-bit Windows):
DWORD_PTR => unsigned __int64
LONG_PTR => __int64
UINT_PTR => unsigned __int64
Back in the day, before 64-bit Windows, Microsoft recognized that developers had a habit of smuggling pointers to data (often C++ class instances) across different APIs and libraries by casting them to integers.
Case in point: seeing this smack in the middle of some code that was trying to send some data between different windows in the same application:
SendMessage(hWnd, WM_MY_CUSTOM_MESSAGE, (WPARAM)ptrToSomeData, (LPARAM)ptrToSomeObject);
When 64-bit Windows came along, a lot of these APIs and programming techniques were inherently broken because of the assumption that "DWORD" or "unsigned int" was always big enough to hold a pointer. For other app-compat and code reasons, they wouldn't just blindly make all these types 8 bytes wide in 64-bit builds. So the *_PTR types were invented for developers who wanted coding patterns that let a caller freely pass either an absolute integer or a pointer.
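As a quick sketch (my own illustration): casting through UINT_PTR round-trips a pointer losslessly on both 32-bit and 64-bit builds, which a cast through DWORD cannot guarantee.
#include <windows.h>
#include <cstdio>
int main()
{
    static_assert(sizeof(UINT_PTR) == sizeof(void*), "always pointer-sized");
    int x = 42;
    UINT_PTR as_int = (UINT_PTR)&x; // pointer -> integer, no truncation
    int* back = (int*)as_int;       // integer -> pointer, same address
    printf("%d\n", *back);          // prints 42 on both x86 and x64
}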
"And I have no idea whatever that means."
I hope you do now.

bad_alloc with unordered_map initializer_list and MMX instruction, possible heap corruption?

I am getting a bad_alloc thrown from the code below, compiled with gcc (tried 4.9.3, 5.4.0 and 6.2). gdb tells me it happens on the last line, with the initializer_list for the unordered_map. If I comment out the MMX intrinsic _m_maskmovq, there is no error. Similarly, if I comment out the initialization of the unordered_map, there is no error. Only when invoking the MMX intrinsic and initializing the unordered_map with an initializer_list do I get the bad_alloc. If I default-construct the unordered_map and call map.emplace(1,1), there is also no error.
I've run this on a CentOS 7 machine with 48 cores (Intel Xeon) and 376 GB RAM, and also on a Dell laptop (Intel Core i7) under Ubuntu WSL, with the same result. What is going on here? Is the MMX instruction corrupting the heap? Valgrind didn't seem to identify anything useful.
Compiler command and output:
$g++ -g -std=c++11 main.cpp
$./a.out
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
Source code (main.cpp):
#include <immintrin.h>
#include <unordered_map>

int main()
{
    __m64 a_64 = _mm_set_pi8(0,0,0,0,0,0,0,0);
    __m64 b_64 = _mm_set_pi8(0,0,0,0,0,0,0,0);
    char dest[8] = {0};
    _m_maskmovq(a_64, b_64, dest);
    std::unordered_map<int, int> map{{1, 1}};
}
Update:
The _mm_empty() workaround does fix this example, but it doesn't seem like a viable solution for multithreaded code where one thread is doing vector instructions and another is using an unordered_map. Another interesting point: if I turn optimization on with -O3, the bad_alloc goes away. Fingers crossed we never hit this error in production (cringe).
There is no heap corruption. This happens because std::unordered_map uses long double internally for computing the bucket count from the number of elements in the initializer list (see _Prime_rehash_policy::_M_bkt_for_elements in the libstdc++ sources). After the MMX code runs, the x87 register stack is left in MMX state, so the long double arithmetic produces nonsense, and the resulting absurd bucket count makes the allocation fail with bad_alloc.
It is necessary to call _mm_empty before switching from MMX code to FPU (x87) code. This has to do with a historic decision to reuse the FPU registers for the MMX register file (sort of the opposite of register renaming in modern CPUs).
The exception goes away if the _mm_empty call is added:
…
_m_maskmovq(a_64, b_64, dest);
_mm_empty();
std::unordered_map<int, int> map{{ 1, 1}};
…
See GCC PR 88998, as identified by cpplearner.
There is ongoing work to implement the MMX intrinsics with SSE on x86-64, which will make this issue disappear because SSE instructions do not affect the FPU state and vice versa.
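As an aside (my own sketch, not part of the answer): on targets with SSE2, the 128-bit _mm_maskmoveu_si128 intrinsic performs the same masked-store operation without touching the x87/MMX register file, so no _mm_empty call is needed at all.
#include <emmintrin.h>
#include <unordered_map>
int main()
{
    __m128i a = _mm_setzero_si128();
    __m128i mask = _mm_setzero_si128(); // all mask bits clear: stores nothing
    char dest[16] = {0};
    _mm_maskmoveu_si128(a, mask, dest); // SSE2 maskmovdqu, no x87 state change
    std::unordered_map<int, int> map{{1, 1}}; // safe without _mm_empty()
}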

32 byte aligned allocator for aligned eigen map

I use the Eigen Map<> template class to reinterpret chunks of C++ arrays as Eigen fixed-size arrays. It seems that the Eigen-provided allocator performs 16-byte-aligned allocation. What is the proper way to deal with AVX? Should I build my own allocator?
using Block = Eigen::Array<float, 8, 1>;
using Map = Eigen::Map<Block, Eigen::Aligned32>;
template <class T> using allocator = Eigen::aligned_allocator<T>;
std::vector<float, allocator<float>> X(10000);
Map myMap(&X[0]); // should be 32-byte aligned for AVX
Thank you for your help
The documentation is outdated: internally, Eigen::aligned_allocator is a wrapper around Eigen::internal::aligned_malloc, which returns memory aligned as defined by EIGEN_MAX_ALIGN_BYTES:
EIGEN_MAX_ALIGN_BYTES - Must be a power of two, or 0. Defines an upper
bound on the memory boundary in bytes on which dynamically and
statically allocated data may be aligned by Eigen. If not defined, a
default value is automatically computed based on architecture,
compiler, and OS. This option is typically used to enforce binary
compatibility between code/libraries compiled with different SIMD
options. For instance, one may compile AVX code and enforce ABI
compatibility with existing SSE code by defining
EIGEN_MAX_ALIGN_BYTES=16. In the other way round, since by default AVX
implies 32 bytes alignment for best performance, one can compile SSE
code to be ABI compatible with AVX code by defining
EIGEN_MAX_ALIGN_BYTES=32.
So basically, if you compile with -mavx, you'll get 32-byte-aligned pointers.
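A quick way to check (my own sketch, assuming compilation with -mavx so that EIGEN_MAX_ALIGN_BYTES defaults to 32): allocate through Eigen::aligned_allocator and assert the 32-byte alignment before constructing the Aligned32 map.
#include <Eigen/Dense>
#include <cassert>
#include <cstdint>
#include <vector>
int main()
{
    using Block = Eigen::Array<float, 8, 1>;
    std::vector<float, Eigen::aligned_allocator<float>> X(10000);
    // With -mavx, aligned_malloc hands back 32-byte-aligned memory:
    assert(reinterpret_cast<std::uintptr_t>(X.data()) % 32 == 0);
    Eigen::Map<Block, Eigen::Aligned32> myMap(X.data());
    myMap.setZero(); // touch the mapped memory through the aligned view
}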

MinGW Windows GCC can't compile C program with 2GB of global data

MinGW's GCC/G++ gives relocation errors when building applications with large global or static data.
From "Understanding the x64 code models":
"References to both code and data on x64 are done with instruction-relative (RIP-relative in x64 parlance) addressing modes. The offset from RIP in these instructions is limited to 32 bits. The small code model promises to the compiler that 32-bit relative offsets should be enough for all code and data references in the compiled object. The large code model, on the other hand, tells it not to make any assumptions and to use absolute 64-bit addressing modes for code and data references. To make things more interesting, there's also a middle road, called the medium code model."
For the example program below, the code fails to compile despite adding the options -mcmodel=medium or -mcmodel=large:
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];

int main(){
    return 0;
}
gcc -mcmodel=medium example.c fails to compile with MinGW/Cygwin on Windows, and likewise with Intel/MSVC on Windows.
You are limited to 32 bits for an offset, but this is a signed offset, so in practice you are actually limited to 2 GiB. You asked why this is not possible, but your two arrays alone total 2 GiB, and there are things in the data segment other than just your arrays.
C is a high-level language. You get the ease of just being able to define a main function, and you get all of these other things for free: standard input and output, etc. The C runtime implements this for you, and all of it consumes stack space and room in your data segment. For example, if I build this on x86_64-pc-linux-gnu, my .bss is 0x80000020 bytes in size -- an additional 32 bytes beyond the two 1 GiB arrays. (I've erased PE information from my brain, so I don't remember how those are laid out.)
I don't remember much about the various machine models, but it's probably helpful to note that the x86_64 instruction set doesn't even contain instructions (that I'm aware of, although I'm not an x86 assembly expert) to access any register-relative address beyond a signed 32-bit offset. For example, when you want to cram that much stuff onto the stack, gcc has to do weird things like this stack-pointer adjustment:
movabsq $-10000000016, %r11
addq %r11, %rsp
You can't addq $-10000000016, %rsp because it's more than a signed 32-bit offset. The same applies to RIP-relative addressing:
movq 10000000016(%rip), %rax # No such addressing mode: the displacement must fit in 32 bits
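One common workaround (my own sketch, not part of the original answers): allocate the arrays dynamically, so that only two pointer-sized variables live in static storage and no special code model is needed.
#define SIZE 16384
int main(){
    // Each array is 1 GiB; on the heap they don't count against the
    // 2 GiB RIP-relative limit on the .data/.bss segments.
    auto a = new float[SIZE][SIZE]();
    auto b = new float[SIZE][SIZE]();
    a[0][0] = 1.0f;
    b[0][0] = 2.0f;
    delete[] a;
    delete[] b;
    return 0;
}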

Is x86 32-bit assembly code valid x86 64-bit assembly code?

Is all x86 32-bit assembly code valid x86 64-bit assembly code?
I've wondered whether 32-bit assembly code is a subset of 64-bit assembly code, i.e., whether every piece of 32-bit assembly code can run in a 64-bit environment.
I guess the answer is yes, because 64-bit Windows is capable of executing 32-bit programs, but then again I've seen that 64-bit processors support a 32-bit compatibility mode.
If not, please provide a small example of 32-bit assembly code that isn't valid 64-bit assembly code and explain how the 64-bit processor executes the 32-bit assembly code.
A modern x86 CPU has three main operation modes (this description is simplified):
In real mode, the CPU executes 16 bit code with paging and segmentation disabled. Memory addresses in your code refer to physical addresses; the content of each segment register is shifted and added to the address to form an effective address.
In protected mode, the CPU executes 16 bit or 32 bit code depending on the segment selector in the CS (code segment) register. Segmentation is enabled, paging can (and usually is) enabled. Programs can switch between 16 bit and 32 bit code by far jumping to an appropriate segment. The CPU can enter the submode virtual 8086 mode to emulate real mode for individual processes from inside a protected mode operating system.
In long mode, the CPU executes 64 bit code. Segmentation is mostly disabled, paging is enabled. The CPU can enter the sub-mode compatibility mode to execute 16 bit and 32 bit protected mode code from within an operating system written for long mode. Compatibility mode is entered by far-jumping to a CS selector with the appropriate bits set. Virtual 8086 mode is unavailable.
Wikipedia has a nice table of x86-64 operating modes including legacy and real modes, and all 3 sub-modes of long mode. Under a mainstream x86-64 OS, after booting the CPU cores will always all be in long mode, switching between different sub-modes depending on 32 or 64-bit user-space. (Not counting System Management Mode interrupts...)
Now what is the difference between 16 bit, 32 bit, and 64 bit mode?
16-bit and 32-bit mode are basically the same thing except for the following differences:
In 16 bit mode, the default address and operand width is 16 bit. You can change these to 32 bit for a single instruction using the 0x67 and 0x66 prefixes, respectively. In 32 bit mode, it's the other way round.
In 16 bit mode, the instruction pointer is truncated to 16 bits; jumping to addresses higher than 65535 can lead to weird results.
VEX/EVEX encoded instructions (including those of the AVX, AVX2, BMI, BMI2 and AVX512 instruction sets) aren't decoded in real or Virtual 8086 mode (though they are available in 16 bit protected mode).
16 bit mode has fewer addressing modes than 32 bit mode, though it is possible to override to a 32 bit addressing mode on a per-instruction basis if the need arises.
Now, 64 bit mode is somewhat different. Most instructions behave just like in 32 bit mode, with the following differences:
There are eight additional registers named r8, r9, ..., r15. Each register can be used as a byte, word, dword, or qword register. The family of REX prefixes (0x40 to 0x4f) encodes whether an operand refers to an old or new register. Eight additional SSE/AVX registers xmm8, xmm9, ..., xmm15 are also available.
you can only push/pop 64 bit and 16 bit quantities (though you shouldn't do the latter); 32 bit quantities cannot be pushed/popped.
The single-byte inc reg and dec reg instructions are unavailable, their instruction space has been repurposed for the REX prefixes. Two-byte inc r/m and dec r/m is still available, so inc reg and dec reg can still be encoded.
A new instruction-pointer relative addressing mode exists, using the shorter of the 2 redundant ways 32-bit mode had to encode a [disp32] absolute address.
The default address width is 64 bit, a 32 bit address width can be selected through the 0x67 prefix. 16 bit addressing is unavailable.
The default operand width is 32 bit. A width of 16 bit can be selected through the 0x66 prefix, a 64 bit width can be selected through an appropriate REX prefix independently of which registers you use.
It is not possible to use ah, bh, ch, and dh in an instruction that requires a REX prefix. A REX prefix causes those register numbers to mean instead the low 8 bits of registers si, di, sp, and bp.
writing to the low 32 bits of a 64 bit register clears the upper 32 bits, avoiding false dependencies for out-of-order exec (demonstrated in the sketch after this list). (Writing 8 or 16-bit partial registers still merges with the 64-bit old value.)
as segmentation is nonfunctional, segment overrides are meaningless no-ops except for the fs and gs overrides (0x64, 0x65) which serve to support thread-local storage (TLS).
also, many instructions that specifically deal with segmentation are unavailable. These are: push/pop seg (except push/pop fs/gs), arpl, call far (only the 0xff encoding is valid), les, lds, jmp far (only the 0xff encoding is valid),
instructions that deal with decimal arithmetic are unavailable, these are: daa, das, aaa, aas, aam, aad,
additionally, the following instructions are unavailable: bound (rarely used), pusha/popa (not useful with the additional registers), salc (undocumented),
the 0x82 instruction alias for 0x80 is invalid.
on early amd64 CPUs, lahf and sahf are unavailable.
And that's basically all of it!
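To illustrate the zero-extension point from the list above, here's a minimal sketch (my own, assuming GCC or Clang on x86-64 with GNU inline assembly):
#include <cstdio>
int main()
{
    unsigned long long x = 0xffffffffffffffffULL;
    // movl writes the 32-bit low half of the register holding x;
    // the CPU zero-extends the result, clearing the upper 32 bits.
    asm("movl $1, %k0" : "+r"(x));
    printf("%llx\n", x); // prints 1, not ffffffff00000001
}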
No, it isn't.
While there is a large amount of overlap, 64-bit assembly code is not a superset of 32-bit assembly code and so 32-bit assembly is not in general valid in 64-bit mode.
This applies both to the mnemonic assembly source (which an assembler turns into binary machine code) and to the binary machine-code format itself.
This question covers in some detail instructions that were removed, but there are also many encoding forms whose meanings were changed.
For example, Jester in the comments gives the example of push eax not being valid in 64-bit code. Based on this reference, you can see that the 32-bit push is marked N.E., meaning not encodable. In 64-bit mode, that encoding is used to represent push rax (an 8-byte push) instead. So the same sequence of bytes has a different meaning in 32-bit mode versus 64-bit mode.
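As a quick demonstration (my own sketch, assuming GCC with GNU inline assembly): the same source assembles with -m32 but is rejected with -m64, because push with a 32-bit register operand is not encodable in 64-bit mode.
// g++ -m32 push.cpp assembles; g++ -m64 push.cpp fails with
// "operand type mismatch for `push'".
int main()
{
    asm volatile("push %eax\n\t"
                 "pop  %eax");
    return 0;
}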
In general, you can browse the list of instructions on that site and find many which are listed as invalid or not encodable in 64-bit.
If not, please provide a small example of 32-bit assembly code that
isn't valid 64-bit assembly code and explain how the 64-bit processor
executes the 32-bit assembly code.
As above, push eax is one such example. I think what's missing is that 64-bit CPUs support directly running 32-bit binaries. They don't do it via compatibility between 32-bit and 64-bit instructions at the machine-language level, but simply by having a 32-bit mode in which the decoders (in particular) interpret the instruction stream as 32-bit x86 rather than x86-64, alongside the so-called long mode for running 64-bit instructions. When such 64-bit chips were first released, it was common to run a 32-bit operating system, which pretty much means the chip stays permanently in this mode (it never enters 64-bit mode).
More recently, it is typical to run a 64-bit operating system, which is aware of the modes and which will put the CPU into 32-bit mode when the user launches a 32-bit process (still very common: until very recently my browser was 32-bit).
All the details and proper terminology for the modes can be found in fuz's answer, which is really the one you should read.
