32-byte aligned allocator for an aligned Eigen Map

I use the Eigen Map<> template class to reinterpret chunks of C++ arrays as
fixed-size Eigen arrays. It seems that Eigen::aligned_allocator provides
16-byte aligned allocation. What is the proper way to deal with AVX?
Should I build my own allocator?
using Block = Eigen::Array<float, 8, 1>;
using Map = Eigen::Map<Block, Eigen::Aligned32>;
template <class T> using allocator = Eigen::aligned_allocator<T>;
std::vector<float, allocator<float>> X(10000);
Map myMap(&X[0]); // should be 32-byte aligned for AVX
Thank you for your help

The documentation is outdated. Internally, Eigen::aligned_allocator is a wrapper around Eigen::internal::aligned_malloc, which returns memory aligned as defined here:
EIGEN_MAX_ALIGN_BYTES - Must be a power of two, or 0. Defines an upper
bound on the memory boundary in bytes on which dynamically and
statically allocated data may be aligned by Eigen. If not defined, a
default value is automatically computed based on architecture,
compiler, and OS. This option is typically used to enforce binary
compatibility between code/libraries compiled with different SIMD
options. For instance, one may compile AVX code and enforce ABI
compatibility with existing SSE code by defining
EIGEN_MAX_ALIGN_BYTES=16. In the other way round, since by default AVX
implies 32 bytes alignment for best performance, one can compile SSE
code to be ABI compatible with AVX code by defining
EIGEN_MAX_ALIGN_BYTES=32.
So basically, if you compile with -mavx, you'll get 32-byte aligned pointers.

Related

Are bytecode commands aligned?

I know that compilers perform data structure alignment and padding on 4-byte (for 32-bit systems) or 8-byte (for 64-bit systems) boundaries.
But do interpreters align bytecode commands when they generate bytecode? If a command is encoded in 1 byte and its operands in 1, 2, 4, or 8 bytes, it seems unfavorable for the processor to fetch that data when the bytecode is interpreted in a dispatch loop. What do you think?
P.S. I'm not asking about interpreters that perform JIT compilation.
In general, the answer is no, but the JVM does require 32-bit alignment for the data portions of the lookupswitch and tableswitch instructions. Up to 3 bytes of padding (zeros) must be encoded to ensure proper alignment.

MinGW Windows GCC can't compile C program with 2GB global data

MinGW's GCC/G++ gives relocation errors when building applications with large global or static data.
Understanding the x64 code models
References to both code and data on x64 are done with
instruction-relative (RIP-relative in x64 parlance) addressing modes.
The offset from RIP in these instructions is limited to 32 bits.
small code model promises to the compiler that 32-bit relative offsets
should be enough for all code and data references in the compiled
object. The large code model, on the other hand, tells it not to make
any assumptions and use absolute 64-bit addressing modes for code and
data references. To make things more interesting, there's also a
middle road, called the medium code model.
For the example program below, the code fails to compile despite adding the option -mcmodel=medium or -mcmodel=large:
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];

int main() {
    return 0;
}
gcc -mcmodel=medium example.c fails to compile with MinGW/Cygwin on Windows, and with Intel/MSVC on Windows.
You are limited to 32 bits for an offset, but this is a signed offset, so in practice you are limited to 2 GiB. You asked why this is not possible: your two arrays alone total 2 GiB, and there are things in the data segment other than just your arrays.
C is a high-level language. You get the ease of just being able to define a main function, and you get all of these other things for free -- standard input and output, etc. The C runtime implements this for you, and all of it consumes stack space and room in your data segment. For example, if I build this on x86_64-pc-linux-gnu, my .bss is 0x80000020 bytes in size -- an additional 32 bytes. (I've erased PE information from my brain, so I don't remember how those sections are laid out.)
I don't remember much about the various code models, but it's probably helpful to note that the x86_64 instruction set doesn't even contain instructions (that I'm aware of, although I'm not an x86 assembly expert) to access a register-relative address beyond a signed 32-bit offset. For example, when you want to cram that much onto the stack, gcc has to do odd things like this stack-pointer adjustment:
movabsq $-10000000016, %r11
addq %r11, %rsp
You can't addq $-10000000016, %rsp because it's more than a signed 32-bit offset. The same applies to RIP-relative addressing:
movq $10000000016(%rip), %rax # No such addressing mode

Why is the register length static in a CPU?

Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it was adjustable this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking whether it is possible to have a CPU with no definitive register size. That makes no sense, since the number and size of the registers is a physical property of the hardware and cannot be changed.
However, some architectures let the programmer work on a smaller part of a register, or pair registers together.
x86 does both, for example: add al, 9 uses only 8 bits of the 64-bit rax, and div rbx pairs rdx:rax to form a 128-bit register.
The reason this scheme is not more widespread is that it comes with a lot of trade-offs.
More addressable registers mean more bits needed to address them; simply put, longer instructions.
Longer instructions mean lower code density, more complex decoders, and less performance.
Furthermore, most elementary operations (the logical ones, addition, subtraction) are already implemented to operate on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time; we cannot issue eight 8-bit additions in a 64-bit ALU simultaneously.
So there would be no improvement in either latency or throughput.
Accessing partial registers is useful for effectively increasing the number of available registers: for example, if an algorithm works with 16-bit data, the programmer can use a single physical 64-bit register to store four items and operate on them independently (but not in parallel).
ISAs with variable-length instructions can also benefit from partial registers because they usually allow smaller immediate values: for example, an instruction that sets a register to a specific value usually has an immediate operand matching the size of the register being loaded (though RISC ISAs usually sign-extend or zero-extend it).
Architectures like ARM (and presumably others as well) support half-precision floats. The idea is to do what you were speculating and what @Margaret explained: with half-precision floats, you can pack two values into a single register, using less bandwidth at the cost of reduced precision.
Reference:
[1] ARM
[2] GCC

Which [neon/vfp/vfpv3] should I specify for -mfpu when evaluating and comparing float performance on ARM processors?

I want to evaluate the float performance of several different ARM processors. I use lmbench and pi_css5, and I am confused about the float test.
From cat /proc/cpuinfo (below), I guess there are three kinds of float features: neon, vfp, and vfpv3. From this question & answer, it seems to depend on the compiler.
Still, I don't know which one I should specify in the compile flag (-mfpu=neon/vfp/vfpv3), whether I should compile the program with each of them, or whether I should just not specify -mfpu at all.
cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 532.00
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 4
It might be even a little bit more complicated than you anticipated. GCC's ARM options page doesn't explain the FPU versions, but ARM's manual for their own compiler does. You should also notice that Linux doesn't tell the whole story about FPU features, reporting only vfp, vfpv3, vfpv3d16, or vfpv4.
Back to your question: you should select the greatest common factor among them, compile your code for it, and compare the results. On the other hand, if one CPU has vfpv4 and the other has vfpv3, which one would you expect to be better?
If your question is as simple as selecting between neon, vfp, or vfpv3, select neon (source).
-mfpu=neon selects VFPv3 with NEON coprocessor extensions.
From the gcc manual,
If the selected floating-point hardware includes the NEON extension
(e.g. -mfpu=neon), note that floating-point operations will
not be used by GCC's auto-vectorization pass unless
`-funsafe-math-optimizations' is also specified. This is because
NEON hardware does not fully implement the IEEE 754 standard for
floating-point arithmetic (in particular denormal values are
treated as zero), so the use of NEON instructions may lead to a
loss of precision.
See for instance, Subnormal IEEE-754 floating point numbers support on ios... for more on this topic.
I have tried each of them, and it seems that using -mfpu=neon together with -march=armv7-a and -mfloat-abi=softfp is the proper configuration.
Besides, a reference (ARM Cortex-A8 vs. Intel Atom) is very useful for ARM benchmarking.
Another helpful article, on ARM Cortex-A processors and GCC command lines, clarifies the SIMD coprocessor configuration.

Visual Studio parameter alignment restrictions and Windows x64 ABI

With Visual C++ on WIN32 there's a long-standing problem with functions with 4 or more SSE parameters, e.g.
__m128i foo4(__m128i m0, __m128i m1, __m128i m2, __m128i m3) { return m3; }
generates an error:
align.c(8) : error C2719: 'm3': formal parameter with __declspec(align('16')) won't be aligned
To compound the problem, Visual C++ still needlessly imposes the ABI restriction even if the function is __inline.
I'm wondering whether this is still a problem on 64-bit Windows. Does the ABI restriction still apply on x64?
(I don't have access to a 64-bit Windows system, otherwise I'd try it myself, and an extensive Google search hasn't turned up anything definitive.)
You can pass as many 128-bit SSE intrinsic parameters as you like under x64. The x64 ABI was designed with these types in mind.
From the MSDN documentation:
__m128 types, arrays and strings are never passed by immediate value but rather a pointer is passed to memory allocated by the caller. Structs/unions of size 8, 16, 32, or 64 bits and __m64 are passed as if they were integers of the same size. Structs/unions other than these sizes are passed as a pointer to memory allocated by the caller. For these aggregate types passed as a pointer (including __m128), the caller-allocated temporary memory will be 16-byte aligned.
