OpenMP SIMD on Power8

Is there any compiler (gcc, xlc, etc.) on Power8 that supports the OpenMP SIMD constructs? I tried with XL (13.1) but couldn't compile successfully; probably it doesn't support the simd construct yet.
I could compile with gcc 4.9.1 (with the flags -fopenmp, -fopenmp-simd, and -O1). The differences between the two generated assembly files are shown below.
Can I say that gcc 4.9 is able to generate AltiVec code? What should I do to optimize further? (I already tried -O3 and restrict-qualified pointers.)
My code is very simple:
int *x, *y, *z;
x = (int*) malloc(n * sizeof(int));
y = (int*) malloc(n * sizeof(int));
z = (int*) malloc(n * sizeof(int));
#pragma omp simd
for(i = 0; i < n; ++i)
z[i] = a * x[i] + y[i];
The generated assembly is shown here:
.L7:
lwz 9,124(31)
extsw 9,9
std 9,104(31)
lfd 0,104(31)
stfd 0,104(31)
ld 8,104(31)
sldi 9,8,2
ld 10,152(31)
add 9,10,9
lwz 10,124(31)
extsw 10,10
std 10,104(31)
lfd 0,104(31)
stfd 0,104(31)
ld 7,104(31)
sldi 10,7,2
ld 8,136(31)
add 10,8,10
lwz 10,0(10)
extsw 10,10
lwz 8,132(31)
mullw 10,8,10
extsw 8,10
lwz 10,124(31)
extsw 10,10
std 10,104(31)
lfd 0,104(31)
stfd 0,104(31)
ld 7,104(31)
sldi 10,7,2
ld 7,144(31)
add 10,7,10
lwz 10,0(10)
extsw 10,10
add 10,8,10
extsw 10,10
stw 10,0(9)
lwz 9,124(31)
addi 9,9,1
stw 9,124(31)
GCC with -O1 -fopenmp-simd
.L7:
lwz 9,108(31)
mtvsrwa 0,9
mfvsrd 8,0
sldi 9,8,2
ld 10,136(31)
add 9,10,9
lwz 10,108(31)
mtvsrwa 0,10
mfvsrd 7,0
sldi 10,7,2
ld 8,120(31)
add 10,8,10
lwz 10,0(10)
extsw 10,10
lwz 8,116(31)
mullw 10,8,10
extsw 8,10
lwz 10,108(31)
mtvsrwa 0,10
mfvsrd 7,0
sldi 10,7,2
ld 7,128(31)
add 10,7,10
lwz 10,0(10)
extsw 10,10
add 10,8,10
extsw 10,10
stw 10,0(9)
lwz 9,108(31)
addi 9,9,1
stw 9,108(31)
To clarify and understand the details, I have one more application, an n^2 n-body simulation. This time my question concerns both compilers (gcc 4.9 and XL 13.1) and both architectures (Intel and Power).
I put all the code into a gist: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-input-c (the full version of the input code, input.c).
Power8 & XLC: The compiler says "was not SIMD vectorized because it contains function calls" (there is a sqrtf call), which is reasonable. But in the assembly I can see xsnmsubmdp; is that normal? (Assembly: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-xlc-noinnersimd-asm)
Power8 & gcc: I compiled it in two ways (with the omp simd construct and without), and the assembly changed; is that normal? (According to OpenMP, the loop should not contain function calls.) (Assemblies: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-gcc-noinnersimd-asm & https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-gcc-innersimd-asm)
i7-4820K & gcc: I did the same test with and without omp simd. The output code differs as well. Does FMA affect this code block? (Assemblies: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-i74820k-gcc-noinnersimd-asm & https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-i74820k-gcc-innersimd-asm)
Thanks in advance

The XL compiler on POWER Linux currently only supports a subset of the OpenMP 4.0 features. The SIMD construct feature is not supported at the moment, so the compiler will not recognize the construct in your source code.
However, if vectorization is what you're looking for, then the good news is that the XL compiler should already automatically vectorize your code as long as you use at least the following optimization options:
-O3 -qhot -qarch=pwr8 -qtune=pwr8
These options will enable high-order loop transformations along with POWER8 specific optimizations, including loop auto-vectorization for your loop.
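For example (a sketch; the source file name is just for illustration):
xlc -O3 -qhot -qarch=pwr8 -qtune=pwr8 -c saxpy.c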
Afterwards, you should see some VMX & VSX instructions in the generated assembly code similar to the following:
188: 19 2e 80 7c lxvw4x vs36,0,r5
18c: 84 09 a6 10 vslw v5,v6,v1
190: 10 00 e7 38 addi r7,r7,16
194: 10 00 a5 38 addi r5,r5,16
198: 40 28 63 10 vadduhm v3,v3,v5
19c: 80 20 63 10 vadduwm v3,v3,v4
1a0: 19 4f 66 7c stxvw4x vs35,r6,r9
1a4: 14 02 86 41 beq cr1,3b8 <foo+0x3b8>
1a8: 10 00 20 39 li r9,16
1ac: 19 4e 27 7d lxvw4x vs41,r7,r9
1b0: 19 3e a0 7c lxvw4x vs37,0,r7
By the way, you can also get an optimization report from the XL compilers by using the -qreport option. This will explain which loops were vectorized and which loops were not and for what reason. e.g.
1586-542 (I) Loop (loop index 1 with nest-level 0 and iteration count
100) at test.c was SIMD vectorized.
or
1586-549 (I) Loop (loop index 2) at test.c was not SIMD
vectorized because a data dependence prevents SIMD vectorization.
Hope this helps!

I don't have access to a Power-based machine right now, but some experimentation with the AST dumper on x86 shows that GCC 4.9.2 starts producing SIMD code only once the level of optimisation reaches O1, i.e. the following options should do the trick:
-fopenmp-simd -O1
The same is true for GCC 5.1.0.
Also note that the vectoriser applies a cost model that might prevent it from actually producing vectorised code in some cases. See -fsimd-cost-model and related options in the GCC documentation for how to override that behaviour.
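As a concrete sketch (the function and file names are made up for illustration), a restrict-qualified version of the loop from the question could look like this:
/* axpy.c -- illustrative only */
void axpy(int n, int a, const int *restrict x,
          const int *restrict y, int *restrict z)
{
    int i;
    #pragma omp simd
    for (i = 0; i < n; ++i)
        z[i] = a * x[i] + y[i];
}
Built with something like
gcc -std=c99 -O2 -mcpu=power8 -mvsx -fopenmp-simd -fsimd-cost-model=unlimited -S axpy.c
where -fsimd-cost-model=unlimited tells the vectoriser to ignore its cost model for loops annotated with the simd pragma.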

Related

What do the constraints "Rah" and "Ral" mean in extended inline assembly?

This question is inspired by a question asked by someone on another forum. In the following code, what do the extended inline assembly constraints Rah and Ral mean? I haven't seen these before:
#include<stdint.h>
void tty_write_char(uint8_t inchar, uint8_t page_num, uint8_t fg_color)
{
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
}
void tty_write_string(const char *string, uint8_t page_num, uint8_t fg_color)
{
while (*string)
tty_write_char(*string++, page_num, fg_color);
}
/* Use the BIOS to print the first command line argument to the console */
int main(int argc, char *argv[])
{
if (argc > 1)
tty_write_string(argv[1], 0, 0);
return 0;
}
In particular, note the use of Rah and Ral as constraints in this code:
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
The GCC documentation doesn't list an l or h constraint among either the simple constraints or the x86/x86-64 machine constraints. R is any legacy register and a is the AX/EAX/RAX register.
What am I not understanding?
What you are looking at is code that is intended to run in real mode on an x86-based PC with a BIOS. Int 0x10 is a BIOS service that can write to the console. In particular, Int 0x10/AH=0x0E writes a single character to the TTY (terminal).
That in itself doesn't explain what the constraints mean. To understand the constraints Rah and Ral, you have to understand that this code isn't being compiled by a standard version of GCC/Clang. It is being compiled by a GCC port called ia16-gcc, a special port that targets the 8086/80186/80286 and compatible processors. It doesn't generate 386 instructions or use 32-bit registers in code generation. This experimental version of GCC targets 16-bit environments like DOS (FreeDOS, MS-DOS) and ELKS.
The documentation for ia16-gcc is hard to find online in HTML format but I have produced a copy for the recent GCC 6.3.0 versions of the documentation on GitHub. The documentation was produced by building ia16-gcc from source and using make to generate the HTML. If you review the machine constraints for Intel IA-16—config/ia16 you should now be able to see what is going on:
Ral The al register.
Rah The ah register.
This version of GCC doesn't understand the R constraint by itself anymore. The inline assembly you are looking at matches that of the parameters for Int 0x10/Ah=0xe:
VIDEO - TELETYPE OUTPUT
AH = 0Eh
AL = character to write
BH = page number
BL = foreground color (graphics modes only)
Return:
Nothing
Desc: Display a character on the screen, advancing the cursor
and scrolling the screen as necessary
Other Information
The documentation does list all the constraints that are available for the IA16 target:
Intel IA-16—config/ia16/constraints.md
a
The ax register. Note that for a byte operand,
this constraint means that the operand can go into either al or ah.
b
The bx register.
c
The cx register.
d
The dx register.
S
The si register.
D
The di register.
Ral
The al register.
Rah
The ah register.
Rcl
The cl register.
Rbp
The bp register.
Rds
The ds register.
q
Any 8-bit register.
T
Any general or segment register.
A
The dx:ax register pair.
j
The bx:dx register pair.
l
The lower half of pairs of 8-bit registers.
u
The upper half of pairs of 8-bit registers.
k
Any 32-bit register group with access to the two lower bytes.
x
The si and di registers.
w
The bx and bp registers.
B
The bx, si, di and bp registers.
e
The es register.
Q
Any available segment register—either ds or es (unless one or both have been fixed).
Z
The constant 0.
P1
The constant 1.
M1
The constant -1.
Um
The constant -256.
Lbm
The constant 255.
Lor
Constants 128 … 254.
Lom
Constants 1 … 254.
Lar
Constants -255 … -129.
Lam
Constants -255 … -2.
Uo
Constants 0xXX00 except -256.
Ua
Constants 0xXXFF.
Ish
A constant usable as a shift count.
Iaa
A constant multiplier for the aad instruction.
Ipu
A constant usable with the push instruction.
Imu
A constant usable with the imul instruction except 257.
I11
The constant 257.
N
Unsigned 8-bit integer constant (for in and out instructions).
There are many new constraints and some repurposed ones.
In particular, the a constraint for the AX register doesn't work like it does in other versions of GCC that target 32-bit and 64-bit code. The compiler is free to choose either AH or AL for the a constraint if the value being passed is an 8-bit value. This means it is possible for the a constraint to appear twice in an extended inline assembly statement.
You could have compiled your code to a DOS EXE with this command:
ia16-elf-gcc -mcmodel=small -mregparmcall -march=i186 \
-Wall -Wextra -std=gnu99 -O3 int10h.c -o int10h.exe
This targets the 80186. You can generate 8086-compatible code by omitting -march=i186. The generated code for main would look something like:
00000000 <main>:
0: 83 f8 01 cmp ax,0x1
3: 7e 1d jle 22 <tty_write_string+0xa>
5: 56 push si
6: 89 d3 mov bx,dx
8: 8b 77 02 mov si,WORD PTR [bx+0x2]
b: 8a 04 mov al,BYTE PTR [si]
d: 20 c0 and al,al
f: 74 0d je 1e <tty_write_string+0x6>
11: 31 db xor bx,bx
13: b4 0e mov ah,0xe
15: 46 inc si
16: cd 10 int 0x10
18: 8a 04 mov al,BYTE PTR [si]
1a: 20 c0 and al,al
1c: 75 f7 jne 15 <main+0x15>
1e: 31 c0 xor ax,ax
20: 5e pop si
21: c3 ret
22: 31 c0 xor ax,ax
24: c3 ret
When run with the command line int10h.exe "Hello, world!", it should print:
Hello, world!
Special Note: The IA16 port of GCC is very experimental and does have some code generation bugs, especially when higher optimization levels are used. I wouldn't use it for mission-critical applications at this point in time.

arm-none-eabi-gcc not inferring floating point multiply-accumulate from code

The ARM FPv5 instruction set supports double-precision floating point operations, including single-cycle multiply-accumulate instructions (VMLA/VMLS), as detailed in the ISA documentation.
Unfortunately, I can't get my code to use this instruction from within any C application.
Here is a simple example:
float64_t a=0, b=0, c=0;
while(1)
{
b += 1.643;
c += 3.901;
a += b * c; // multiply accumulate???
do_stuff(a); // use the MAC result
}
The code above generates the following assembly for (what I believe should be) a MAC operation:
170 a += b * c;
00000efe: vldr d6, [r7, #64] ; 0x40
00000f02: vldr d7, [r7, #56] ; 0x38
00000f06: vmul.f64 d7, d6, d7
00000f0a: vldr d6, [r7, #72] ; 0x48
00000f0e: vadd.f64 d7, d6, d7
00000f12: vstr d7, [r7, #72] ; 0x48
As you can see, it does the multiply and the addition as separate steps. Is there a good reason why the compiler can't use the VMLA.F64 instruction here?
Target: ARM Cortex M7 (NXP iMXRT1051)
Toolchain: arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 8-2018-q4-major) 8.2.1 20181213 (release) [gcc-8-branch revision 267074]
Solved: it was the optimization level. When set to -O3, the instructions changed to properly use the MAC.
I thought that taking advantage of hardware acceleration (e.g. the FPU) wouldn't depend on the optimization level since it's essentially "free", but I guess I was wrong.
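For reference, a build line along these lines should allow the contraction (a sketch; the file name is illustrative, and the FPU/ABI flags assume a Cortex-M7 with the double-precision FPv5 unit):
arm-none-eabi-gcc -mcpu=cortex-m7 -mfpu=fpv5-d16 -mfloat-abi=hard \
    -O3 -ffp-contract=fast -c mac_example.c
-ffp-contract=fast explicitly permits fusing the multiply and add; it is usually already the default in GNU C mode, so the optimization level (-O3 here) is the decisive part.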

What is the benefits of using vaddss instead of addss in scalar matrix addition?

I have implemented a scalar matrix addition kernel.
#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>
//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000
float __attribute__(( aligned(32))) A[N][M],
__attribute__(( aligned(32))) B[N][M],
__attribute__(( aligned(32))) C[N][M];
int main()
{
int w=0, i, j;
struct timespec tStart, tEnd; // used to record the processing time
double tTotal, tBest=10000; // the minimum total time will be assigned to the best time
do{
clock_gettime(CLOCK_MONOTONIC,&tStart);
for( i=0;i<N;i++){
for(j=0;j<M;j++){
C[i][j]= A[i][j] + B[i][j];
}
}
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
} while(w++ < NUM_LOOP);
printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);
return 0;
}
In this case, I've compiled the program with different compiler flag and the assembly output of the inner loop is as follows:
gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix
movss xmm1, DWORD PTR A[rcx+rax]
addss xmm1, DWORD PTR B[rcx+rax]
movss DWORD PTR C[rcx+rax], xmm1
gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix
vmovss xmm1, DWORD PTR A[rcx+rax]
vaddss xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss DWORD PTR C[rcx+rax], xmm1
AVX version gcc -O2 -mavx:
__m256 vec256;
for(i=0;i<N;i++){
for(j=0;j<M;j+=8){
vec256 = _mm256_add_ps( _mm256_load_ps(&A[i+1][j]) , _mm256_load_ps(&B[i+1][j]));
_mm256_store_ps(&C[i+1][j], vec256);
}
}
SSE version gcc -O2 -msse4.2:
__m128 vec128;
for(i=0;i<N;i++){
for(j=0;j<M;j+=4){
vec128= _mm_add_ps( _mm_load_ps(&A[i][j]) , _mm_load_ps(&B[i][j]));
_mm_store_ps(&C[i][j], vec128);
}
}
In the scalar program, the speedup of -mavx over -msse4.2 is 2.7x. I know AVX improved the ISA, and that might explain the improvement. But when I implemented the program with intrinsics for both AVX and SSE, the speedup is a factor of 3x. The question: scalar AVX is 2.7x faster than scalar SSE, yet when I vectorize, the speedup is only 3x (the matrix size is 128x128 for this question). Does that make sense? Using AVX versus SSE in scalar mode yields a 2.7x speedup, but the vectorized method should do much better because I process eight elements with AVX compared to four with SSE. All programs have less than 4.5% cache misses, as reported by perf stat.
Using gcc -O2, Linux Mint, Skylake.
UPDATE: Briefly, scalar AVX is 2.7x faster than scalar SSE, but vectorized AVX-256 is only 3x faster than vectorized SSE-128. I think it might be because of pipelining: in scalar code I have three vector ALUs that might not all be usable in vectorized mode. I might be comparing apples to oranges instead of apples to apples, and that might be why I can't understand the result.
The problem you are observing is explained here. On Skylake systems, if the upper half of an AVX register is dirty, then non-VEX-encoded SSE operations have a false dependency on the upper half of that register. In your case it seems there is a bug in your version of glibc, 2.23. On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use
__asm__ __volatile__ ( "vzeroupper" : : : );
to clean the upper half of the AVX registers. I don't think you can use an intrinsic such as _mm256_zeroupper to fix this because GCC will say it's SSE code and not recognize the intrinsic. The option -mvzeroupper won't work either because GCC once again thinks it's SSE code and will not emit the vzeroupper instruction.
BTW, it's Microsoft's fault that the hardware has this problem.
Update:
Other people are apparently encountering this problem on Skylake. It has been observed after printf, memset, and clock_gettime.
If your goal is to compare 128-bit with 256-bit operations, you could consider using -mprefer-avx128 -mavx (which is particularly useful on AMD). But then you would be comparing AVX-256 vs AVX-128 and not AVX-256 vs SSE. AVX-128 and SSE both use 128-bit operations, but their implementations are different. If you benchmark, you should mention which one you used.
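For example, one could build the two variants like this (a sketch; the source file name is illustrative):
gcc -O2 -mavx -mprefer-avx128 matadd.c -o matadd_avx128
gcc -O2 -mavx matadd.c -o matadd_avx256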

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER.
Now I have a fairly good idea of what VZEROUPPER does and I think it should not matter at all to this code when there are no VEX-coded instructions and no calls to any function which might contain them. The fact that it does not matter on other AVX-capable CPUs appears to support this. So does Table 11-2 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
So what is going on?
The only theory I have left is that there's a bug in the CPU and it's incorrectly triggering the "save the upper half of the AVX registers" procedure where it shouldn't. Or something else just as strange.
This is main.cpp:
#include <immintrin.h>
int slow_function( double i_a, double i_b, double i_c );
int main()
{
/* DAZ and FTZ, does not change anything here. */
_mm_setcsr( _mm_getcsr() | 0x8040 );
/* This instruction fixes performance. */
__asm__ __volatile__ ( "vzeroupper" : : : );
int r = 0;
for( unsigned j = 0; j < 100000000; ++j )
{
r |= slow_function(
0.84445079384884236262,
-6.1000481519580951328,
5.0302160279288017364 );
}
return r;
}
and this is slow_function.cpp:
#include <immintrin.h>
int slow_function( double i_a, double i_b, double i_c )
{
__m128d sign_bit = _mm_set_sd( -0.0 );
__m128d q_a = _mm_set_sd( i_a );
__m128d q_b = _mm_set_sd( i_b );
__m128d q_c = _mm_set_sd( i_c );
int vmask;
const __m128d zero = _mm_setzero_pd();
__m128d q_abc = _mm_add_sd( _mm_add_sd( q_a, q_b ), q_c );
if( _mm_comigt_sd( q_c, zero ) && _mm_comigt_sd( q_abc, zero ) )
{
return 7;
}
__m128d discr = _mm_sub_sd(
_mm_mul_sd( q_b, q_b ),
_mm_mul_sd( _mm_mul_sd( q_a, q_c ), _mm_set_sd( 4.0 ) ) );
__m128d sqrt_discr = _mm_sqrt_sd( discr, discr );
__m128d q = sqrt_discr;
__m128d v = _mm_div_pd(
_mm_shuffle_pd( q, q_c, _MM_SHUFFLE2( 0, 0 ) ),
_mm_shuffle_pd( q_a, q, _MM_SHUFFLE2( 0, 0 ) ) );
vmask = _mm_movemask_pd(
_mm_and_pd(
_mm_cmplt_pd( zero, v ),
_mm_cmple_pd( v, _mm_set1_pd( 1.0 ) ) ) );
return vmask + 1;
}
The function compiles down to this with clang:
0: f3 0f 7e e2 movq %xmm2,%xmm4
4: 66 0f 57 db xorpd %xmm3,%xmm3
8: 66 0f 2f e3 comisd %xmm3,%xmm4
c: 76 17 jbe 25 <_Z13slow_functionddd+0x25>
e: 66 0f 28 e9 movapd %xmm1,%xmm5
12: f2 0f 58 e8 addsd %xmm0,%xmm5
16: f2 0f 58 ea addsd %xmm2,%xmm5
1a: 66 0f 2f eb comisd %xmm3,%xmm5
1e: b8 07 00 00 00 mov $0x7,%eax
23: 77 48 ja 6d <_Z13slow_functionddd+0x6d>
25: f2 0f 59 c9 mulsd %xmm1,%xmm1
29: 66 0f 28 e8 movapd %xmm0,%xmm5
2d: f2 0f 59 2d 00 00 00 mulsd 0x0(%rip),%xmm5 # 35 <_Z13slow_functionddd+0x35>
34: 00
35: f2 0f 59 ea mulsd %xmm2,%xmm5
39: f2 0f 58 e9 addsd %xmm1,%xmm5
3d: f3 0f 7e cd movq %xmm5,%xmm1
41: f2 0f 51 c9 sqrtsd %xmm1,%xmm1
45: f3 0f 7e c9 movq %xmm1,%xmm1
49: 66 0f 14 c1 unpcklpd %xmm1,%xmm0
4d: 66 0f 14 cc unpcklpd %xmm4,%xmm1
51: 66 0f 5e c8 divpd %xmm0,%xmm1
55: 66 0f c2 d9 01 cmpltpd %xmm1,%xmm3
5a: 66 0f c2 0d 00 00 00 cmplepd 0x0(%rip),%xmm1 # 63 <_Z13slow_functionddd+0x63>
61: 00 02
63: 66 0f 54 cb andpd %xmm3,%xmm1
67: 66 0f 50 c1 movmskpd %xmm1,%eax
6b: ff c0 inc %eax
6d: c3 retq
The generated code is different with gcc but it shows the same problem. An older version of the Intel compiler generates yet another variation of the function which shows the problem too, but only if main.cpp is not built with the Intel compiler, as it then inserts calls to initialize some of its own libraries, which probably end up executing VZEROUPPER somewhere.
And of course, if the whole thing is built with AVX support so the intrinsics are turned into VEX coded instructions, there is no problem either.
I've tried profiling the code with perf on linux and most of the runtime usually lands on 1-2 instructions but not always the same ones depending on which version of the code I profile (gcc, clang, intel). Shortening the function appears to make the performance difference gradually go away so it looks like several instructions are causing the problem.
EDIT: Here's a pure assembly version, for linux. Comments below.
.text
.p2align 4, 0x90
.globl _start
_start:
#vmovaps %ymm0, %ymm1 # This makes SSE code crawl.
#vzeroupper # This makes it fast again.
movl $100000000, %ebp
.p2align 4, 0x90
.LBB0_1:
xorpd %xmm0, %xmm0
xorpd %xmm1, %xmm1
xorpd %xmm2, %xmm2
movq %xmm2, %xmm4
xorpd %xmm3, %xmm3
movapd %xmm1, %xmm5
addsd %xmm0, %xmm5
addsd %xmm2, %xmm5
mulsd %xmm1, %xmm1
movapd %xmm0, %xmm5
mulsd %xmm2, %xmm5
addsd %xmm1, %xmm5
movq %xmm5, %xmm1
sqrtsd %xmm1, %xmm1
movq %xmm1, %xmm1
unpcklpd %xmm1, %xmm0
unpcklpd %xmm4, %xmm1
decl %ebp
jne .LBB0_1
mov $0x1, %eax
int $0x80
Ok, so as suspected in comments, using VEX coded instructions causes the slowdown. Using VZEROUPPER clears it up. But that still does not explain why.
As I understand it, not using VZEROUPPER is supposed to involve a cost to transition to old SSE instructions but not a permanent slowdown of them. Especially not such a large one. Taking loop overhead into account, the ratio is at least 10x, perhaps more.
I have tried messing with the assembly a little and float instructions are just as bad as double ones. I could not pinpoint the problem to a single instruction either.
You are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions - even though your entire visible application doesn't obviously use any AVX instructions!
Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used vex to code that didn't, or vice-versa. That is, you never paid an ongoing penalty for whatever happened in the past unless you were actively mixing VEX and non-VEX. In Skylake, however, there is a state where non-VEX SSE instructions pay a high ongoing execution penalty, even without further mixing.
Straight from the horse's mouth, here's Figure 11-1 [1], the old (pre-Skylake) transition diagram:
As you can see, all of the penalties (red arrows) bring you to a new state, at which point there is no longer a penalty for repeating that action. For example, if you get to the dirty upper state by executing some 256-bit AVX and you then execute legacy SSE, you pay a one-time penalty to transition to the preserved non-INIT upper state, but you don't pay any penalties after that.
In Skylake, everything is different per Figure 11-2:
There are fewer penalties overall, but critically for your case, one of them is a self-loop: the penalty for executing a legacy SSE instruction (Penalty A in Figure 11-2) in the dirty upper state keeps you in that state. That's what happens to you: any AVX instruction puts you in the dirty upper state, which slows all further SSE execution down.
Here's what Intel says (section 11.3) about the new penalty:
The Skylake microarchitecture implements a different state machine
than prior generations to manage the YMM state transition associated
with mixing SSE and AVX instructions. It no longer saves the entire
upper YMM state when executing an SSE instruction when in “Modified
and Unsaved” state, but saves the upper bits of individual register.
As a result, mixing SSE and AVX instructions will experience a penalty
associated with partial register dependency of the destination
registers being used and additional blend operation on the upper bits
of the destination registers.
So the penalty is apparently quite large: it has to blend the top bits all the time to preserve them, and it also makes instructions which are apparently independent become dependent, since there is a dependency on the hidden upper bits. For example, xorpd xmm0, xmm0 no longer breaks the dependence on the previous value of xmm0, since the result actually depends on the hidden upper bits of ymm0, which aren't cleared by the xorpd. That latter effect is probably what kills your performance, since you'll now have very long dependency chains that you wouldn't expect from the usual analysis.
This is among the worst type of performance pitfall: where the behavior/best practice for the prior architecture is essentially opposite of the current architecture. Presumably the hardware architects had a good reason for making the change, but it does just add another "gotcha" to the list of subtle performance issues.
I would file a bug against the compiler or runtime that inserted that AVX instruction and didn't follow up with a VZEROUPPER.
Update: Per the OP's comment below, the offending (AVX) code was inserted by the runtime linker ld and a bug already exists.
[1] From Intel's optimization manual.
I just made some experiments (on a Haswell). The transition between clean and dirty states is not expensive, but the dirty state makes every non-VEX vector operation dependent on the previous value of the destination register. In your case, for example movapd %xmm1, %xmm5 will have a false dependency on ymm5 which prevents out-of-order execution. This explains why vzeroupper is needed after AVX code.
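As a minimal sketch of the usual remedy (not code from the question; names are illustrative): an AVX region compiled with -mavx can end with _mm256_zeroupper() before control returns to legacy-SSE code. Compilers targeting AVX normally insert vzeroupper at function boundaries themselves, so this mostly matters when the AVX code is hand-written or injected by a library or runtime:
#include <immintrin.h>

/* Compile this translation unit with -mavx. Illustrative only. */
void scale_by_two_avx(float *dst, const float *src, int n)
{
    const __m256 two = _mm256_set1_ps(2.0f);
    int i;
    for (i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(_mm256_loadu_ps(src + i), two));
    for (; i < n; ++i)            /* scalar tail */
        dst[i] = 2.0f * src[i];
    _mm256_zeroupper();           /* mark the upper halves clean so later non-VEX
                                     SSE code does not hit the dirty-upper penalty */
}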

Do C/C++ compilers such as GCC generally optimize modulo by a constant power of 2?

Let's say I have something like:
#define SIZE 32
/* ... */
unsigned x;
/* ... */
x %= SIZE;
Would the x % 32 generally be reduced to x & 31 by most C/C++ compilers such as GCC?
Yes, any respectable compiler should perform this optimization. Specifically, a % X operation, where X is a constant power of two, will become the equivalent of an & (X-1) operation.
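In C terms the rewrite is the following identity, which holds for unsigned operands (a sketch of the transformation, not literal compiler output):
unsigned mod32(unsigned x) { return x % 32; }        /* what the source says     */
unsigned and31(unsigned x) { return x & (32 - 1); }  /* what the compiler emits  */
/* mod32(x) == and31(x) for every unsigned x, because 32 is a power of two. */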
GCC will even do this with optimizations turned off:
Example (gcc -c -O0 version 3.4.4 on Cygwin):
unsigned int test(unsigned int a) {
return a % 32;
}
Result (objdump -d):
00000000 <_test>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 5d pop %ebp
7: 83 e0 1f and $0x1f,%eax ;; Here
a: c3 ret