How does Eigen's vectorisation fallback work?

How does Eigen's vectorisation fallback work? - eigen

The Eigen web site says:
Explicit vectorization is performed for SSE 2/3/4, AVX, FMA, AVX512, ARM NEON (32-bit and 64-bit), PowerPC AltiVec/VSX (32-bit and 64-bit) instruction sets, and now S390x SIMD (ZVector) with graceful fallback to non-vectorized code.
Does that mean if you compile, e.g. with FMA, and the CPU you are running on does not support it it will fall back to completely unvectorised code? Or will it fall back to the best vectorisation available?
If not, is there any way to have Eigen compile for all or several SIMD ISAs and automatically pick the best at runtime?
Edit: To be clear I'm talking about run-time fallbacks.

There is absolutely no run-time dispatching in Eigen. Everything that happens happens at compile-time. This is where you need to make all of the choices, either via preprocessor macros that control the library's behavior, or using optimization settings in your C++ compiler.
In order to implement run-time dispatching, you would either need to check and see what CPU features were supported on each and every call into the library and branch into the applicable implementation, or you would need to do this check once at launch and set up a table of function pointers (or some other method to facilitate dynamic dispatching). Eigen can't do the latter because it is a header-only library (just a collection of types and functions), with no "main" function that gets called upon initialization where all of this setup code could be localized. So the only option would be the former, which would result in a significant performance penalty. The whole point of this library is speed; introducing this type of performance penalty to each of the library's functions would be a disaster.
The documentation also contains a breakdown of how Eigen works, using a simple example. This page says:
The goal of this page is to understand how Eigen compiles it, assuming that SSE2 vectorization is enabled (GCC option -msse2).
which lends further credence to the claim that static, compile-time options determine how the library will work.
Whichever instruction set you choose to target, the generated code will have those instructions in it. If you try to execute that code on a processor that does not support these instructions (for example, you compile Eigen with AVX optimizations enabled, but you run it on an Intel Nehalem processor that doesn't support the AVX instruction set), then you will get an invalid instruction exception (presented to your program in whatever form the operating system passes CPU exceptions through). This will happen as soon as your CPU encounters an unrecognized/unsupported instruction, which will probably be deep in the bowels of one of the Eigen functions (i.e., not immediately upon startup).
However, as I said in the comments, there is a fallback mechanism of sorts, but it is a static one, all done at compile time. As the documentation indicates, Eigen supports multiple vector instruction sets. If you choose a lowest-common-denominator instruction set, like SSE2, you will still get some degree of vectorization. For example, although SSE3 may provide a specialized instruction for some particular task, if you can't target SSE3, all hope is not lost. There is, in many places, code that uses a series of SSE2 instructions to accomplish the same task. These are probably not as efficient as if SSE3 were available, but they'll still be faster than having no vector code whatsoever. You can see examples of this by digging into the source code, specifically the arch folder that contains the tidbits that have been specifically optimized for the different instruction sets (often by the use of intrinsics). In other words, you get the best code it can give you for your target architecture/instruction set.

Related

Is there way to automatically replace avx512 with avx2?

Following the advice of Linus Torvalds (and cross platform performance), I wish to not use avx512. Is there a flag I can specify to the compiler (both gcc and msvc) such that all avx512 instructions are split into pairs of avx2 instructions if a library I am using tries to use axv512 either from intrinsics or compiler optimiszation?

No, compile your code not to use AVX-512 in the first place by telling the compiler it can't; you only have to do anything about code using intrinsics that require AVX-512.
However, if you're compiling for a CPU that supports AVX-512, it's often worth using it, especially with 256-bit vectors to avoid the turbo-frequency and other penalties that come with 512-bit vectors. GCC's default tuning is already -mprefer-vector-width=256 for CPUs like -march=skylake-avx512.
If you want to make a binary that can run on CPUs without AVX-512, then yes obviously you need to make sure it never executes and instructions that would fault without it. e.g. gcc -O3 -march=znver2 or -march=skylake or whatever. Neither of those target arch options include AVX-512. Or -march=native if compiling for whatever CPU you have.
But if you do have a CPU that supports AVX-512, and you want to not use it, you can use something like -march=native -mno-avx512f (All other AVX-512 extensions depend on the "Foundation" AVX-512F, so disabling that also prevents even AVX-512VL for 128 and 256-bit vectors.)
(Part of the benefit of -march=native and then disabling stuff is to also set tuning options. If you want a binary that runs well on both Skylake and Zen2, I'm not sure what to recommend; probably -march=skylake or -march=znver2 are both ok; there's the default "tune=generic" but it cares too much about really old CPUs that don't even support AVX2, like Sandybridge: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Intrinsics
Even with intrinsics, GCC will only ever emit instructions supported by the target options, so -mno-avx512f can let you be sure you didn't miss anything. You'll get compile time errors, instead of EVEX instructions slipping through the cracks.
(MSVC is different and is designed around a single-binary model where using new instruction-sets is done in functions that you only call if the CPU supports it, so it won't stop you from using AVX-512. AFAIK, MSVC still doesn't even have an option to auto-vectorize with AVX-512, only /arch:AVX2. But anyway, MSVC won't emit AVX-512 instructions on its own if you don't tell it to, if you don't use any option like /arch:AVX512 if such a thing exists; AFAIK it doesn't have a /arch:native unfortunately. With MSVC you do have to be sure you caught all uses of intrinsics, although compiling with GCC can help to make sure your codebase doesn't do that.)
If you still want to compile code that uses _mm512_add_epi32 or _mm256_ternlog_epi32 or whatever, you'll need a version of immintrin.h that defines __m512i as a struct/class with two __m256i members and emulates all the intrinsics. Some AVX512 intrinsics won't be cheap to emulate, especially masked operations, and the whole concept of compare-into-mask to get an integer instead of a vector. So it's probably a bad idea to try to make this happen fully transparently; instead just get GCC to stop you from using any AVX-512 instructions while you make AVX2-only versions of any intrinsics code that didn't already have AVX2 versions.
Last time this came up, Coding on insufficient hardware, I was able to find an avxintrin-emu.h that let you develop for AVX while only compiling for SSE4. But I didn't find an equivalent for AVX-512. (Normally you would compile an AVX-512 binary and test it on an emulator like SDE that emulates at runtime, not compile-time.)
Agner Fog's VectorClass wrapper library (https://www.agner.org/optimize/#vectorclass) has support for basic operations like + - * /, and shuffles and blends, and has versions 512-bit vectors emulated with a pair of AVX2 vectors. (And VCL types are implicitly convertible to __m256i or __m512i and so on, so for operations it doesn't have its own functions for, you can use Intel intrinsics. But then you're back in the same boat of needing a library that emulates __m256_ternlog_epi32 with only AVX2 instructions.)
This won't stop libc from possibly using hand-written AVX-512 instructions in functions like strcmp or log/exp, since dynamic CPU dispatching happens at run-time, and you can't stop your CPU from reporting that it supports AVX-512. (Except with a VM, or by telling the kernel not to enable AVX-512 at boot, if Linux has an option for that.)

How can an Operating System be coded in high level languages?

I just started diving into the world of operating systems and I've learned that processes have a certain memory space they can address which is handled by the operating system. I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.

You have caught the bug and there is no cure for it :-)
The language you use to write your OS has very little to do with the way your OS operates. Yes, most people use C/C++, but there are others. As for the language, you do need a language that will let you directly communicate with the hardware you plan to manage, assembly being the main choice for this part. However, this is less than 5% of the whole project.
The code that you write must not rely upon any existing operating system. i.e.: you must code all of the function yourself, or call existing libraries. However, these existing libraries must be written so that they don't rely upon anything else.
Once you have a base, you can write your OS in any language you choose, with the minor part in assembly, something a high level language won't allow. In fact, in 64-bit code, some compilers no longer allow inline assembly, so this makes that 5% I mentioned above more like 15%.
Find out what you would like to do and then find out if that can be done in the language of choice. For example, the main operating system components can be written in C, while the actual processor management (interrupts, etc) must be done in assembly. Your boot code must be in assembly as well, at least most of it.
As mentioned in a different post, I have some early example code that you might want to look at. The boot is done in assembly, while the loader code, both Legacy BIOS and EFI, are mostly C code.

To clarify fysnet's answer, the reason you have to use at least a bit of assembly is that you can only explicitly access addressable memory in C/C++ (through pointers), while hardware registers (such as the program counter or stack pointer) often don't have memory addresses. Not only that, but some registers have to be manipulated with CPU architecture-dependent special instructions, and that, too, is only possible in machine language.
I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.
As described above, depending on the architecture, this could be achieved by having special instructions to manage the MMU, TLB etc. INVLPG is one example of such an instruction in the x86 architecture. Note that having a special instruction requiring kernel privileges is probably the simplest way to implement such a feature in hardware in a secure manner, because then it is simply sufficient to check if the CPU is in kernel mode in order to determine whether the instruction can be executed or not.

Compilers turn high-level languages into asm / machine code for you, so you don't have to write asm yourself. You pick a compiler that handles memory the way you want your OS to; e.g. using the callstack for automatic storage, and not implicitly calling malloc / free (because those won't exist in your kernel).
To link your compiled C/C++ into a kernel, you typically have to know more about the ABI it targets, and the toolchain especially the linker.
The ISO C standard treats implementation details very much as a black box. But real compilers that people use for low level stuff work in well-known ways (i.e. make the expected/useful implementation choices) that kernel programmers depend on, in terms of compiling code and static data into contiguous blocks that can be linked into a single kernel executable that can be loaded all as one chunk.
As for actually managing the system's memory, you write code yourself to do that, with a bit of inline asm where necessary for special instructions like invlpg as other answers mention.
The entry point (where execution starts) will normally be written in pure asm, to set up a callstack with the stack pointer register pointing to it.
And set up virtual memory and so on so code is executable, data is read/write, and read-only data is readable. All of this before jumping to any compiled C code. The first C you jump to is probably more kernel init code, e.g. initializing data structures for an allocator to manage all the memory that isn't already in use by static code/data.
Creating a stack and mapping code/data into memory is the kind of setup that's normally done by an OS when starting a user-space program. The asm emitted by a compiler will assume that code, static data, and the stack are all there already.

Generate code for multiple SIMD architectures

I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And of of course introduce high-level branches
I am thinking similar to how the compiler generates code for functions, where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned macro.
What is best practice? Multiple binaries? Naming?

As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.

If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (ie you're not explicitly vectorising using intrinsics, or got lots of code that the compiler will spot and auto-vectorise) then I should warn you that AVX, AVX2 or AVX512 compiling your entire codebase will probably run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: #nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.

compiler options to increase optimization performance of the code

I am porting the code from INTEL architecture to ARM architecture. The same code I am trying to build with arm cross compilers on centos. Are there any compiler options that increase the performance of executable image because the image on INTEL might not give similar performance on ARM. Is there a way to achieve this?

A lot of optimization options exist in GCC. By default the compiler tries to make the compilation process as short as possible and produce an object code that makes debugging easy. However, gcc also provides several options for optimization.
Four general levels of incremental optimizations for performance are available in GCC and can be activated by passing one of the options -O0, -O1, -O2 or -O3. Each of these levels activates a set of optimizations that can be also activated manually by specifying the corresponding command line option. For instance with -O1 the compiler performs branches using the decrement and branch instruction (if available and applicable) rather then decrementing the register, comparing it to zero and branching in separate instructions. This decrement and branch can be also specified manually by passing the -fbranch-count-reg option. Consider that optimizations performed at each level also depend on the target architecture, you can get the list of available and enabled optimization by running GCC with the -Q --help=optimizers option.
Generally speaking the levels of optimization correspond to (notice that at each level also the optimization of the previous ones are applied):
-O0: the default level, the compiler tries to reduce compilation time and produce an object code that can be easily processed by a debugger
-O1: the compiler tries to reduce both code size and execution time. Only optimizations that do not take a lot of compile time are performed. Compilation may take considerably more memory.
-O2: the compiler applies almost all optimizations available that do not affect the code size. Compiling takes more but performance should improve.
-O3: the compiler applies also optimization that might increase the code size (for instance inlining functions)
For a detailed description of all optimization options you can have a look here.
As a general remark consider that compiler optimization are designed to work in the general case but their effectiveness will depend a lot on both you program and the architecture you are running it on.
Edit:
If you are interested in memory paging optimization there is the -freorder-blocks-and-partition option (activated also with -O2). This option reorders basic blocks inside each function to partition them in hot blocks (called frequently) and cold blocks (called rarely). Hot blocks are then placed in contiguous memory locations. This should increasing cache locality and paging performance.

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU specific
instructions within a DLL?
Slightly longer version:
When downloading (32bit) DLLs from, say, Microsoft it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (ie. the
minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise
CPU specific code behind the scenes to get optimal performance? And if so, how is it done?

I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor

The DLL is expected to work on every computer WIN32 runs on, so you are stuck to the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX, SSE in runtime
- if they are present, use them, if not, have fallback code ready
Because you cannot let your compiler optimise for anything else than i386, you will have to write the code using the specific instruction sets in inline assembler. I don't know if there are higher-language toolkits for this. Determining the CPU features is straight forward, but could also need to be done in assembler.

An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.

Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes - your DLL will be bigger). And even MSVC2005 can do it for very specific cases (E.g. memcpy() can use SSE4)
There are many ways to switch between different versions. A DLL is loaded, because the loading process needs functions from it. Function names are converted into addresses. One solution is to let this lookup depend on not just function name, but also processor features. Another method uses the fact that the name to address function uses a table of pointers in an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions; so foo() calls foo__sse4 when that's faster.

DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed) Microsoft used to optimize its DLLs for size rather than speed. This is because the reduction in the overall size of the DLL gave a higher performance boost than any other optimization that the compiler could generate. This is because speed ups from micro optimization would be decidedly low compared to speed ups from not having the CPU wait for the memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from the increased DLL code that includes additional versions of code that are tuned for different architectures when most of this code is rarely used / are never part of the critical code that consumes most of your CPU cycles.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio