So I have some heavyweight algorithms which I would prefer to run on the VPU, but since there's so much going on, the VPUs tend to get saturated. Is there any way to do something like "use the VPU, but if the VPU is overloaded, use the FPU instead" so I get maximal throughput?
Thanks
re: comment. Either it's a problem with mixing SSE & AVX without vzeroupper (maybe you compiled the rest of your code with -march=native or something, and double-precision math is using AVX). Or your SSE version is bigger, and causes I-cache misses.
Or maybe your microbenchmark was bogus, and some of your SSE routine was optimized away.
To answer this, a lot more detail about your code is needed. Like whether you're sure your FPU code was really x87, and not just scalar SSE.
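For reference, a minimal sketch (my own illustration, not the asker's code) of the SSE/AVX transition issue: if 256-bit AVX code runs without a vzeroupper before legacy (non-VEX) SSE code, some CPUs pay a large transition or false-dependency penalty. A compiler building everything with -mavx normally inserts vzeroupper for you; the intrinsic mostly matters for hand-written or mixed-build code. The callback name here is made up.

    #include <immintrin.h>

    /* Hypothetical example: 256-bit AVX work followed by a call into code
     * that may have been compiled as legacy (non-VEX) SSE. */
    void scale_avx_then_call_sse(float *dst, const float *src, int n,
                                 void (*legacy_sse_fn)(float *, int))
    {
        __m256 k = _mm256_set1_ps(2.0f);
        for (int i = 0; i + 8 <= n; i += 8)
            _mm256_storeu_ps(dst + i, _mm256_mul_ps(_mm256_loadu_ps(src + i), k));

        _mm256_zeroupper();        /* emits vzeroupper: dirty upper halves are cleared */
        legacy_sse_fn(dst, n);     /* no SSE/AVX transition stall in the callee */
    }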
Related
Title says it all. What are the differences and tradeoffs between -march=haswell, -march=core-avx2, and -mavx2 for compiling AVX2 intrinsics?
I know that -mavx2 is a flag and -march=haswell/core-avx2 are architectures which just translate to a bunch of flags. So -mavx2 is a subset of the other two. But beyond that, how do I choose the right one for my application?
In particular, if I'm compiling for Skylake and adding -mtune=skylake, but I can't actually do -march=skylake, how do I decide and why? (That's my particular case, but also interested in the more general tradeoffs).
-march sets -mtune, and enables other useful stuff you forgot like -mfma and -mbmi / -mbmi2 which aren't implied by -mavx2.
Unfortunately GCC's -mtune=generic doesn't adapt itself to the subset of CPUs that support the extensions enabled. Most notably, -mavx2 rules out Sandy Bridge, but doesn't imply -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?). Out of all AVX2 CPUs, I think only Excavator would really benefit much from those options; it's not a big fraction of market share, and I think misaligned 32-byte load/store isn't disastrous on it like it was on Sandy Bridge. GCC11 and later have an -mtune=generic that does do unaligned 32-byte load/store (https://godbolt.org/z/qr9cEcne3), so this specific issue is no longer critical if you only care about AVX2 and newer CPUs. But you still generally want a -mtune= option somewhat appropriate for the set of CPUs you're targeting, usually erring on the side of less loop unrolling, since too much can hurt I-cache / uop cache hit rate.
AVX2 CPUs have been widespread for long enough that -mtune=generic happens to be mostly appropriate for them in GCC11 and later, fortunately. (Especially CPUs like Haswell and Zen2 that have 32-byte-wide load/store units). But in general, with other CPU features and other far-future cases, that won't always be true.
But still don't just use -mavx2: that will leave out -mfma (which all AVX2 CPUs support except for one Via model), and leave out -mbmi / -mbmi2, which I think all AVX2 CPUs also support. More efficient variable-count shifts are always nice, and so are lzcnt/tzcnt and efficient x &= x-1 lowest-set-bit idioms.
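To make that concrete, here's a small sketch (mine, not from the question) of code where -march=haswell buys you something that plain -mavx2 doesn't:

    #include <stdint.h>

    double muladd(double a, double b, double c)
    {
        return a * b + c;     /* with -mfma this can contract to vfmadd132sd
                                 (subject to -ffp-contract) */
    }

    int popcount_loop(uint64_t x)
    {
        int n = 0;
        while (x) {
            x &= x - 1;       /* with -mbmi this idiom compiles to blsr */
            n++;
        }
        return n;
    }

Comparing the asm from -mavx2 vs. -march=haswell (e.g. on Godbolt) shows the difference.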
-march=haswell -mtune=skylake could be useful if you want a Haswell baseline for compatibility, but tuned for Skylake. (There's very little difference between the tuning heuristics for those microarchitectures, and probably no new instructions that GCC would use automatically. For SKL specifically, see also How can I mitigate the impact of the Intel jcc erratum on gcc?; that workaround isn't implied by -march=skylake.)
-march=core-avx2 and similar are an obsolete naming convention
It's just a synonym for -march=haswell, AFAIK, although it may leave -mtune=generic in place, unlike -march=haswell which sets -mtune=haswell. How to correctly determine -march and -mtune for Intel processors? has more details.
-march=core-i7 is I think just a synonym for Nehalem, which is really dumb because Nehalem (1st-gen i7) is a different microarchitecture from Sandy Bridge and later (2nd-gen i7). Intel unfortunately kept that branding through SnB's major re-architecting of the internals, although the 3-level cache with shared inclusive L3 has been a constant.
There's no such thing as "a generic i7", because i7 includes two different microarchitectures, one with a uop cache, one without.
Good riddance to those stupid ambiguous confusing -march= / -mtune names that are trying to be generic but actually aren't.
It would be nice if there was a -march=generic-avx2 to tune for Haswell and later, Zen1 and later, etc., using the common subset of instructions they all support efficiently. And setting tuning options appropriate for all of them, not going too crazy unrolling. But there isn't.
Or more generally, -mtune=generic-enabled or something that would look at the enabled -m options and not care about CPUs that couldn't run this code. e.g. not doing rep ret if you enabled any features that a Phenom II couldn't run, such as AVX. GCC bug 78762 is basically a feature request for that.
(Current GCC -mtune=generic no longer uses rep ret; Phenom II is old enough that GCC doesn't spend extra code-size to avoid potholes there. Most binary releases of software built with GCC use a GCC version that was in development a year or more before that, so there's some lead time in terms of what's appropriate for -mtune=generic. The CPU with a glass jaw / performance pothole you're working around doesn't have to be totally gone before a current nightly build of GCC stops tuning for it.)
I'm a hobbyist who likes to run my own programs in Go, and as Xeon Phi processors become older they're also becoming extremely cheap. So cheap I can build a dual socket machine from 2015/16 for <$1000
I'm trying to find out if I can run Go programs on these. From what I've seen, this thread says they won't run (and to try gccgo), because Go partially relies on the x87 ISA. Confusingly, the Go release notes say they're dropping x87 support in 1.16, implying it was supported in the past. I've seen in other threads that all programs will run on the compatibility layer, but that's an extremely slow layer which only has access to a small portion of the CPU's cache.
I feel like I'm moving farther and farther out of my element; I was wondering if someone who's used Xeon Phi knows if it will run Go code? Or just in general, after booting up Ubuntu (or FreeBSD, something that I've seen done and is listed in motherboard specs) what sort of things aren't going to work and what will?
I appreciate any and all help!
You're basing your Knight's Landing worries on this quote about Knight's Corner:
The Knight's Corner processor is based on an x86-64 foundation, yes, but it in fact has its own floating-point instruction set—no x87, no AVX, no SSE, no MMX... Oh, and then you can throw all that away when Knight's Landing (KNL) comes out.
By "throw all that away", they mean all the worries and incompatibilities. KNL is based on Silvermont and is fully x86-64 compatible (including x87, SSE, and SSE2 for both standard ways of doing FP math). It also supports AVX-512F, AVX-512ER, and a few other AVX-512 extensions, along with AVX and AVX2 and SSE up to SSE4.2. A lot like a Skylake-server CPU, except a different set of AVX-512 extensions.
The point of this is exactly to solve the problem you're worried about: so any legacy binary can run on KNL. To get good performance out of it, you want to be running code vectorized with AVX-512 vectors in the loops that do the heavy lifting, but all the surrounding code and other programs in the rest of the Linux distro or whatever can be running ordinary bog-standard code that uses whatever x87 and/or SSE.
Knight's Corner (first-gen commercial Xeon Phi) has its own variant / precursor of AVX-512 in a core based on P5-Pentium, and no other FP hardware.
Knight's Landing (second-gen commercial Xeon Phi) is based on Silvermont, with AVX-512, and is the first that can act as a "host" processor (bootable) instead of just a coprocessor.
This "host" mode is another reason for including enough hardware to decode and execute x87 and SSE: if you're running a whole system on KNL, you're much more likely to want to execute some legacy binaries for non-perf-sensitive tasks, not only binaries compiled specifically for it.
Its x87 performance is not great, though: about one scalar x87 fmul per 2 clocks (https://agner.org/optimize), vs. 2-per-clock SSE mulsd (0.5c reciprocal throughput). The same 0.5c throughput applies to other SSE/AVX math, including AVX-512 vfmadd132ps with ZMM registers to do 16 single-precision fused multiply-adds in one instruction.
So hopefully Go's compiler doesn't use x87 much. The normal way to do scalar math in 64-bit mode (that C compilers and their math libraries use) is SSE, in XMM registers. x86-64 C compilers only use x87 for types like long double.
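A tiny sketch (mine) of what a typical x86-64 C compiler does, as a point of comparison; Go's compiler is a separate question, but the same ISA choices apply:

    double mul_double(double a, double b)
    {
        return a * b;              /* SSE2 scalar: mulsd xmm0, xmm1 */
    }

    long double mul_long_double(long double a, long double b)
    {
        return a * b;              /* x87: fmulp, the slow path on KNL */
    }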
Yes:
Xeon Phi is a series of x86 manycore processors designed and made by Intel. It is intended for use in supercomputers, servers, and high-end workstations. Its architecture allows use of standard programming languages and application programming interfaces (APIs) such as...
See also https://en.wikipedia.org/wiki/Xeon_Phi
If you can compile Go on an x86 processor, then you will be able to compile it on that specific x86 processor manufactured by Intel.
Xeon is not Itanium :)
On such systems you would also be able to compile Go itself; you would just need to provide a suitable C compiler...
What makes you think you would otherwise not be able to compile Go on, say... an Atari or perhaps an Arduino?
If you can elaborate on that perhaps I can improve my terrible answer further.
I have seen a lot of assembly with AVX (all three flavors), and in all the cases I have seen, the more the code sticks to one kind of instruction, the better it performs. But take, for example, things like doing a load into a 32-bit register and then broadcasting it into all lanes for AVX-512 or AVX2... isn't it better/faster to just load a vector register and then permute away?
In particular, is there a penalty for mixing x86_64 and the AVX family of instructions? I know there is such a penalty for SSE and AVX mixing in general.
I know I could test this, but I'd rather poke the knowledge of the masses before I take on that mini-project. I am sure someone already knows this.
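To illustrate the two approaches being contrasted (my own sketch, AVX2 intrinsics; the function names are made up):

    #include <immintrin.h>

    __m256 splat_from_memory(const float *p)
    {
        return _mm256_broadcast_ss(p);    /* vbroadcastss ymm, [mem]: typically just a load uop */
    }

    __m256 splat_from_register(__m256 v)
    {
        /* "load, then permute away": broadcast lane 0 of a vector already in
         * a register; this needs a shuffle uop (vpermps). */
        return _mm256_permutevar8x32_ps(v, _mm256_setzero_si256());
    }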
I am developing a native library for Android where I use ARM assembly optimizations and multithreading in order to get maximum performance on the dual-core ARM chipset MSM8660. While doing some measurements I noticed the following:
The single-threaded library with NEON optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
The multi-threaded library with ARMv6 optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
The multi-threaded library with NEON optimizations is slower than the single-threaded library with NEON optimizations (definitely not expected!).
I have tried searching all over the net for an explanation for why this is but have so far not found any. It almost seems like all the cores share the same NEON pipeline or something like that, but all schematics seem to indicate that each core should have its own NEON unit. Does anyone know why this is happening?
First of all, what library are you using?
You're correct: each core has its own NEON unit. It is, however, Qualcomm's own proprietary 'VeNum' unit, and not much information is published about it. It was designed for the Cortex-A8-based Scorpion in the 8x50 and was considerably better than ARM's own NEON SIMD implementation. The good news is that Qualcomm designs their hardware to be compatible with the base reference design, so most code written for a Cortex-A8 will work just fine on Scorpion, albeit with some performance hit due to possibly different instruction timings.
If you're compiling your program as "softfp", you will have an overhead of approximately 20 cycles for every call to a function that takes floating-point arguments and/or uses the NEON unit, because transferring register data between the ARM core and the NEON unit is quite slow and can sometimes stall the core for many cycles while waiting for the pipeline to flush.
Also, for a threaded program using the floating-point unit, the kernel has to save the FP/NEON registers during a context switch, which incurs an additional penalty for threads, since we already know that moving registers between NEON and the ARM core is slow and can stall the pipeline.
Additionally, many other factors can lead to this, such as poor compiler optimization, cache misses, not using Scorpion's dual-issue capability, bad instruction scheduling, and your threads being migrated repeatedly from one core to another.
It's probably because of cache misses. It's hard to tell without more information.
My guess would be that it is because of the extra cycle penalty involved in flushing the NEON pipeline. The NEON pipeline sits behind the rest of the core, so you see an extra cycle penalty for mispredicted branches and so on.
If the threads have to synchronize quite often, or if you have a lot of locks, I think you are going to see big penalties with NEON.
The only way you are going to leverage NEON for an overall gain in performance with multi-threaded code is if the code is embarrassingly parallel and there is very little and infrequent communication between the threads.
I can understand this requirement for the old PPC RISC systems, and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4-byte boundaries only. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then it should ensure the alignments are correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance because every call site must manage this requirement. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 versions of the OS
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item is that for calling conventions that rely on the callee cleaning the stack, the above requirements really "uglify" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e. any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answers have been consistency, but to me that's a bit too easy an answer. I have well over 20 years of experience with the x86 architecture, and if consistency, rather than performance or something else concrete, is really the reason, then I respectfully suggest that it is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tool vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!
From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf
I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movaps vs movups), and any x86 Mac has at least SSE2 for Mac OS X. It can be taken care of by the application, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc...
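For illustration (my own sketch), the movaps-vs-movups distinction in intrinsics terms:

    #include <xmmintrin.h>

    __m128 sum4_aligned(const float *p)      /* p must be 16-byte aligned */
    {
        return _mm_add_ps(_mm_load_ps(p), _mm_load_ps(p + 4));   /* movaps */
    }

    __m128 sum4_unaligned(const float *p)    /* no alignment requirement */
    {
        return _mm_add_ps(_mm_loadu_ps(p), _mm_loadu_ps(p + 4)); /* movups */
    }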
I believe it's to keep it in line with the x86-64 ABI.
First, note that the 16 bytes alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries use SSE or Altivec extensions which require 16-byte alignment. I found an explicit reference in the libgmalloc man page.
You can perfectly handle your stack frame the way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.
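If only a few entry points are affected (e.g. callbacks invoked from code with an unknown stack alignment), GCC also has a per-function attribute that realigns the stack on entry; a hedged sketch, with a made-up callback:

    #include <xmmintrin.h>

    /* Hypothetical callback: realign the stack on entry so SSE locals are safe
     * even if the caller didn't keep 16-byte alignment. */
    __attribute__((force_align_arg_pointer))
    void my_callback(float *out)
    {
        __m128 v = _mm_set1_ps(1.0f);
        _mm_storeu_ps(out, v);
    }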
This is an efficiency issue.
Re-aligning the stack in every function that uses the new SSE instructions would add a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (cost measured in instructions at least). It only involves changing a constant in the prologue of the function.
Wasting stack space is cheap; the stack is probably the hottest part of the cache.
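A small sketch of why the guarantee is cheap for the compiler to exploit (my illustration; consume is a made-up external function):

    #include <xmmintrin.h>

    void consume(__m128 *p);             /* hypothetical callee that assumes alignment */

    void caller(void)
    {
        __m128 tmp = _mm_setzero_ps();   /* a 16-byte local on the stack */
        consume(&tmp);                   /* &tmp is 16-byte aligned "for free":
                                            the prologue only rounds the frame size,
                                            no and-masking of esp needed */
    }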
My guess is that Apple believes everyone just uses XCode (gcc) which aligns the stack for you. So requiring the stack to be aligned so the kernel doesn't have to is just a micro-optimization.
While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.
Hmm, didn't the OS X ABI also do funny RISC-like things, like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall API also aligns 64-bit values (e.g. in lseek and mmap).
In order to maintain consistency in the kernel. This allows the same kernel to be booted on multiple architectures without modification.
Not sure why no one has considered the possibility of easy portability from the legacy PowerPC-based platform.
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
Then zoom in on "32-bit PowerPC Function Calling Conventions" and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy; reading further, the rule is mentioned explicitly (under "Prolog and Epilog"):
The called function is responsible for allocating its own stack frame, making sure to preserve 16-byte alignment in the stack. This operation is accomplished by a section of code called the prolog, which the compiler places before the body of the subroutine. After the body of the subroutine, the compiler places an epilog to restore the processor to the state it was prior to the subroutine call.