How does compiler auto-vectorization work? - compilation

Different architectures have their own SIMD instruction extensions, and even on x86, Intel and AMD have different extensions. How do compilers manage this in order to generate machine code that leverages the SIMD extensions?

Related

What would the optimal march settings be for modern CPUs?

I have seen a similar question on this, but it was specific to P4s and Core2s. What I am looking for is a good setting for most modern CPUs, both AMD and Intel. It seems to me that i686 is a little out of date. I am leaning towards Pentium 4 for the extra instruction sets such as SSE. What is the best target that would be compatible with modern CPUs, both Intel and AMD, for either just -march or both -march and -mtune?
I'm currently using 32-bit GCC 5.3.0 on Windows 7.

GNU Fortran architecture dependent compiler option

Is there a GNU Fortran compiler (v5.3.0) option to tune the code for a particular architecture? I'm especially interested in Intel Core i7. I could not find anything related to code tuning in the official option summary at GNU Fortran 5.3.0 Option Summary. I remember in the past there used to be an option -march=.... Thank you.
Edit:
I have found out the processor architecture with cat /proc/cpuinfo and visited the Intel CPU Specifications website to find out that I have Sandy Bridge CPUs. In my case the correct GNU option would be -march=sandybridge.
i7 is not an architecture. Sandy Bridge, Ivy Bridge, Haswell and similar are the architectures of Intel CPUs, and each of these architectures can be sold in i3, i5, i7 or Xeon variants.
You can have two i7 CPUs, one older and one more recent and they can have different architectures.
GCC (the whole suite for C, C++, Fortran...) has the options -march and -mtune (see https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options ). With -march the compiled code will only run on the specified architecture and newer. With -mtune it will still run on older CPUs, but will be tuned for the specified one.
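Because code built with -march may use instructions an older CPU lacks, one common way to stay safe is to check the CPU at run time before taking a fast path. Here is a minimal sketch of that idea, assuming GCC or Clang on x86, where the builtins __builtin_cpu_init and __builtin_cpu_supports are available. The function names cpu_has_avx2 and pick_kernel are hypothetical, made up for illustration.

```c
/* Returns 1 if the running CPU reports AVX2 support, 0 otherwise.
   __builtin_cpu_init and __builtin_cpu_supports are GCC/Clang builtins
   that exist on x86 targets only; every other target falls back to 0. */
int cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical helper: names the code path a program would choose. */
const char *pick_kernel(void)
{
    return cpu_has_avx2() ? "avx2" : "baseline";
}
```

A program would typically call something like pick_kernel() once at startup and then dispatch to a routine compiled with the matching flags. The result depends on the machine the program runs on, not the one it was built on.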
You can use -march=native and the compiler will target the architecture of the CPU you are compiling on. Or you can specify an architecture manually, like -march=haswell, -march=ivybridge or -march=core-avx-i.
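To see what a given -march value actually switched on, you can look at the macros GCC predefines for each enabled instruction set. The following is only a sketch: it checks a handful of the well-known macros (__SSE2__, __AVX__, __AVX2__ and so on), and the function name compiled_simd_level is made up for this example.

```c
/* Reports the widest x86 SIMD level this translation unit was compiled for,
   based on macros that GCC and Clang predefine from -march / -m<isa> flags.
   This reflects the compile-time settings, not the CPU the program runs on. */
const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "no x86 SIMD macro defined";
#endif
}
```

Building the same file once with -march=native and once without should show the difference. You can also dump the full macro list without writing any code: gcc -march=native -dM -E - < /dev/null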
Be aware that you need a recent version of the compiler to optimize for new CPU architectures.
All the information you are looking for is in the man page of gcc, not in the man page of gfortran:
man gcc
I assume -march=native does not work?
Edit: I tried hello world with gcc 5.3 and it does compile with the option. I don't know whether it improves anything, though.

GCC support for Intel AVX intrinsics (dvec.h)

Does GCC support dvec.h, and if not, what can I do to port code written for ICC to work with GCC?
I am getting errors:
fatal error: dvec.h: No such file or directory
#include <dvec.h>
Alternatively, GCC cannot find F32vec8.
See Agner Fog's manual Optimizing software in C++. See section 12.5 Using vector classes.
Agner's Vector Class Library (VCL) is far more powerful than Intel's dvec.h, it works on more compilers (including GCC and Clang), and it's free. However, it requires C++.
Another option is to use Yeppp!. Yeppp! works for C, C++, C#, Java, and FORTRAN and not just C++. However, it's an actual library that you must link in. The VCL is only a set of header files.
Another difference between Yeppp! and the VCL is that Yeppp! is built from assembly whereas the VCL uses intrinsics. This is one reason Yeppp! needs to be linked in (MSVC 64-bit mode does not allow inline assembly).
One disadvantage of intrinsics is that the compiler can implement them differently than you expect. This is not normally a problem with ICC and GCC. They are excellent when it comes to intrinsics. However, MSVC with AVX and especially FMA is disappointing (though with SSE it's normally fine). So the performance using the VCL with GCC compared to MSVC may be quite different with AVX and FMA.
With assembly you always get what you want. However, since Yeppp! is not inline assembly you have to deal with the function calling overhead. In my case most of the time I want something like inline assembly which is what intrinsics mostly achieve.
I don't know Yeppp! well but the documentation of the VCL library is excellent and the source code is very clear.
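If you only need something like F32vec8 with element-wise arithmetic and would rather stay in plain C, GCC's own vector extensions are a third route. To be clear, this is neither dvec.h nor the VCL, just a sketch of a different technique: the vector_size attribute is a GCC/Clang extension, and the names f32x8 and madd8 are made up for this example.

```c
#include <string.h>

/* Eight packed floats, similar in spirit to F32vec8.
   vector_size is a GCC/Clang extension, not standard C. */
typedef float f32x8 __attribute__((vector_size(32)));

/* Computes out[i] = a[i] * b[i] + c[i] for i in 0..7. */
void madd8(const float *a, const float *b, const float *c, float *out)
{
    f32x8 va, vb, vc, vr;

    /* memcpy keeps the loads valid for unaligned input pointers. */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    memcpy(&vc, c, sizeof vc);

    /* Element-wise arithmetic; the compiler chooses the instructions
       according to the -march / -m<isa> flags in effect. */
    vr = va * vb + vc;

    memcpy(out, &vr, sizeof vr);
}
```

With -mavx this can map to 256-bit instructions. Without it the compiler is free to split the work into narrower operations, so the same source still builds for older targets.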

What is the minimum target CPU architecture for the various versions of Visual Studio?

What is the minimum target processor architecture (indicated with _M_IX86 predefined macros) supported by every version of Visual Studio 2008, 2010 and 2012?
For example, MSVS 2012 supports only Pentium Pro and higher.
The classic switch for this was /G. Your available options differed for different versions of the compiler (with newer versions dropping older options, albeit continuing to accept them for compatibility reasons). Here's what you got:
/G3 built code that was optimized for 386 processors (_M_IX86 was set to 300)
/G4 for the 486 processor (_M_IX86 was set to 400)
/G5 built code that was optimized for the Pentium (_M_IX86 was set to 500)
/G6 built code that was optimized for the Pentium Pro, II, and III (_M_IX86 was set to 600)
/G7 built code that was optimized for the Pentium 4 or AMD Athlon (_M_IX86 was set to 700)
/GB specified either "blend" mode or the lowest common denominator that was reasonable when that version of the compiler was released. This was the default option if no other was specified.
And of course, it bears explicit mention that setting this option to optimize for a newer processor architecture did not prevent your code from running on an older processor architecture. It just wasn't optimized for that architecture and might run more slowly.
However, if you look up this compiler option in a current version of the documentation, you'll see no mention of any of this. All you see is something about Itanium processors (which we'll put aside). That's because the compiler shipping with VC++ 2005 dropped the /G3–/G7 compiler options altogether:
[The] /G3, /G4, /G5, /G6, /G7, and /GB compiler options have been removed. The compiler now uses a "blended model" that attempts to create the best output file for all architectures.
So, although many of us remember it well from VC++ 6, this code generation setting was a historical curiosity only even as far back as VC++ 2008. Therefore I'm not sure where you get the impression that VS 2012 supports only the Pentium Pro. I can't find mention of that anywhere in the official documentation or elsewhere online. The limiting factor for version 2012 of the compiler is not the processor architecture but the OS version. If you've patched the compiler, libraries, and all the other accoutrements to support targeting Windows XP, then you will be able to run your application on an original Pentium-233, onto which you've masochistically shoe-horned Windows XP.
The purpose of the _M_IX86 macro is really just an indicator that you're targeting the Intel IA-32 processor family—more commonly known as good old 32-bit x86—in contrast to one of the other supported target architectures, like _M_AMD64 for 64-bit x86. You should just treat it as a defined/undefined value now.
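In practice, treating it as defined/undefined looks something like the sketch below. _M_IX86 and _M_AMD64 are the MSVC macros discussed above. I have added the GCC/Clang equivalents __i386__ and __x86_64__ so the snippet also builds with those compilers, and the function name target_arch is just made up for illustration.

```c
/* Identifies the target family at compile time.
   Only the presence of each macro is tested, never its numeric value. */
const char *target_arch(void)
{
#if defined(_M_AMD64) || defined(__x86_64__)
    return "x86-64";
#elif defined(_M_IX86) || defined(__i386__)
    return "x86 (32-bit)";
#else
    return "other";
#endif
}
```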
Yes, the old table of values for _M_IX86 still appears in the latest version of the preprocessor documentation, but it is utterly obsolete. You'll note that other obsolete symbols appear there as well, such as _M_PPC: what was the last version of MSVC++ that shipped with a PowerPC compiler? 4.2?
But that is only part of the story. There are still other compiler options that govern code generation with respect to target architectures.
For example, the /arch switch. From the latest version of the documentation, you have the following options:
/arch:IA32 which essentially sets the lowest common denominator, using x87 for floating point
/arch:SSE which turns on SSE instructions
/arch:SSE2 which turns on SSE2 instructions (and is the default for x86)
/arch:AVX which turns on Intel Advanced Vector Extensions
/arch:AVX2 which turns on Intel Advanced Vector Extensions 2
If you read the Remarks section, you'll also see that these options can imply more than just the specified instruction set. For example, since all processors that support SSE instructions also support the CMOV instruction, the CMOV instruction will be generated when /arch:SSE or higher is specified. The CMOV instruction has nothing to do with SSE; in fact, SSE was introduced with the Pentium III while CMOV was introduced way back with the Pentium Pro. But it's guaranteed to be supported on any architectures that support SSE.
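If you want to confirm from inside the code which /arch setting is in effect, MSVC exposes it through predefined macros: _M_IX86_FP (0, 1 or 2 on 32-bit x86 builds) plus __AVX__ and __AVX2__. Here is a rough sketch. The function name msvc_arch_level is hypothetical, and I'm assuming you only care about the five options listed above.

```c
/* Maps MSVC's predefined macros back to the /arch option in effect.
   _M_IX86_FP is only defined for 32-bit x86 MSVC builds; __AVX__ and
   __AVX2__ are also defined by GCC/Clang when -mavx / -mavx2 is active. */
const char *msvc_arch_level(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(_M_IX86_FP) && _M_IX86_FP == 2
    return "SSE2";
#elif defined(_M_IX86_FP) && _M_IX86_FP == 1
    return "SSE";
#elif defined(_M_IX86_FP)
    return "IA32";
#else
    return "unknown (not a 32-bit x86 MSVC build)";
#endif
}
```

Like the /arch switch itself, this only describes what the compiler was permitted to generate. It says nothing about the CPU the program ends up running on.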
The other relevant option is controlled by the /favor switch. This was new starting with VC++ 2008, and was presumably the replacement for the old /G3–/G7 options. As the documentation says:
/favor:blend is the default and produces code with no unique optimizations
/favor:INTEL64 generates code specific for Intel's implementation of x86-64
/favor:AMD64 generates code specific for AMD's implementation of x86-64
/favor:ATOM generates code specific for Intel's Atom processor

LAPACK for Windows for a Quad core machine

I have been searching for a precompiled LAPACK library for Windows. I have found this,
but my question is:
Is there a precompiled version of LAPACK for a quad-core machine with a 32-bit Intel processor?
I want to get the most efficient computations on this machine. Or is the only way to go to compile the libraries on the quad-core computer itself?
My company has used Intel MKL for several years, and we are very satisfied with its performance. It is a commercial product developed by Intel; a single-user license costs $399 ($129 if you are a student).
Another option is AMD ACML. It is available for free, but when we profiled it (five years ago) we found that Intel MKL had better performance.
Both Intel MKL and AMD ACML work with both Intel and AMD processors. If the price is OK use Intel MKL, otherwise go with AMD ACML.

Resources