Why is my SSE assembly slower in release builds?

I've been playing around with some x64 assembly and the XMM registers to do some float math, and I'm seeing some performance that is puzzling me.
As a self-learning exercise, I wrote some SSE assembly to approximate the 'sin' function (using the Taylor series), and called this from some basic C++ in a loop to compare to the standard library version. Code is below, and I've pasted the output for some typical runs after that. (I'm not looking for a critique of the code or approach here, just trying to understand the perf numbers).
What I don't get is why the "Release" build, where the actual running assembly is identical (I've stepped through the debugger to double-check), is consistently about 40 - 50 cycles slower. (Uncommenting the LFENCE instructions adds about 100 cycles to both Debug and Release, so the delta remains the same.) As a bonus question, why is the very first iteration typically in the thousands!!
I get that this stuff is very complex and subtly impacted by numerous factors, but every potential cause that pops into my head just doesn't make sense here.
I've checked the MXCSR flags in both runs, and they are identical across builds too (with the default value of 1F80h, which has all exceptions masked).
Any idea what would cause this? What further analysis could I do to figure this out at an even deeper level?
Assembly
_RDATA segment
pi real4 3.141592654
rf3 real4 0.1666666667
rf5 real4 0.008333333333
rf7 real4 0.0001984126984
_RDATA ends
_TEXT segment
; float CalcSin(float rads, int* cycles)
CalcSin PROC
; "leaf" function - doesn't use the stack or any non-volatile registers
mov r8, rdx ; Save the 'cycles' pointer into R8
rdtsc ; Get current CPU cycles in EDX:EAX
; lfence ; Ensure timer is taken before executing the below
mov ecx, eax ; Save the low 32 bits of the timer into ECX
movss xmm2, xmm0
mulss xmm2, xmm2 ; x^2
movss xmm3, xmm0
mulss xmm3, xmm2 ; x^3
movss xmm4, rf3 ; 1/3!
mulss xmm4, xmm3 ; x^3 / 3!
subss xmm0, xmm4 ; x - x^3 / 3!
mulss xmm3, xmm2 ; x^5
movss xmm4, rf5 ; 1/5!
mulss xmm4, xmm3 ; x^5 / 5!
addss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5!
mulss xmm3, xmm2 ; x^7
movss xmm4, rf7 ; 1/7!
mulss xmm4, xmm3 ; x^7 / 7!
subss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5! - x^7 / 7!
; lfence ; Ensure above completes before taking the timer again
rdtsc ; Get the timer now
sub eax, ecx ; Get the difference in cycles
mov dword ptr [r8], eax
ret
CalcSin ENDP
_TEXT ends
END
C++
#include <stdio.h>
#include <math.h>
#include <vector>

const float PI = 3.141592654f;

extern "C" float CalcSin(float rads, int* cycles);

void DoCalcs(float rads) {
    int cycles;
    float result = CalcSin(rads, &cycles);
    printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
    printf("C library = %.8f\n", sin(rads));
}

int main(int argc, char* argv[]) {
    std::vector<float> inputs{ PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2 };
    for (auto val : inputs) {
        DoCalcs(val);
    }
    return 0;
}
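(For reference, the polynomial the asm evaluates, written out as straight C++. This is only a sketch of the same math with the same coefficients as the _RDATA constants; CalcSinRef is a made-up name and it isn't part of the benchmark.)

// x - x^3/3! + x^5/5! - x^7/7!, same coefficients as _RDATA above.
static float CalcSinRef(float x) {
    float x2 = x * x;   // x^2
    float x3 = x * x2;  // x^3
    float x5 = x3 * x2; // x^5
    float x7 = x5 * x2; // x^7
    return x - x3 * 0.1666666667f     // 1/3!
             + x5 * 0.008333333333f   // 1/5!
             - x7 * 0.0001984126984f; // 1/7!
}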
With a "Debug" build (I'm using Visual Studio 2019), I typically see the below timing reported:
Sin(0.00314159) = 0.00314159. Took 3816 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 18 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 18 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 18 cycles
C library = 1.00000000
With the exact same code in a "Release" build, I typically see the below:
Sin(0.00314159) = 0.00314159. Took 4426 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 70 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 62 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 64 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 62 cycles
C library = 1.00000000
====UPDATE 1====
I changed the code to load the constants as immediates instead of referencing the .rdata segment, as Peter mentioned, and this got rid of the slow first iteration, i.e. I replaced the commented-out line with the two lines following it:
; movss xmm4, rf5 ; 1/5!
mov eax, 3C088889h ; 1/5! float representation
movd xmm4, eax
Warming up the CPU didn't help, but I did notice the first iteration in Release was now just as fast as Debug, and the rest were still slow. As printf isn't called until after the first calculation, I wondered if it had an impact. I changed the code to just store the results as it ran and print them once complete, and now Release is just as fast, i.e.:
Updated C++ code
extern "C" float CalcSin(float rads, int* cycles);
std::vector<float> values;
std::vector<int> rdtsc;
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
values.push_back(result);
rdtsc.push_back(cycles);
// printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
// printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
auto cycle_iter = rdtsc.begin();
auto value_iter = values.begin();
for (auto& input : inputs) {
printf("Sin(%.8f) = %.8f. Took %d cycles\n", input, *value_iter++, *cycle_iter++);
printf("C library = %.8f\n", sin(input));
}
return 0;
}
And now Release is pretty much identical to debug, i.e. around 18 - 24 cycles consistently on each call.
I'm not sure what the printf call is doing in Release builds, or how it was linked/optimized with Release settings, but it's strange that it negatively impacted the identical, separate assembly calls the way it did.
Sin(0.00314159) = 0.00314159. Took 18 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 24 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 20 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 24 cycles
C library = 1.00000000
====UPDATE 2====
To rule out CPU frequency ramp-up/down, I went in and tweaked a few BIOS settings (disabled Turbo, set a consistent core voltage, etc.), and can now see via the "AI Suite" ASUS app for the motherboard that the CPU runs at a consistent 3600MHz. (I'm running an Intel Core i9-9900K @ 3.6GHz on Windows 10 x64.)
After setting that... still no change.
The next thing that occurred to me is that with the printf I have a call out to the C runtime library between each loop iteration, and that's a different DLL between Debug and Release builds. To remove any other variation I started building from the command line instead of VS. Compiling with maximum speed optimizations and the release CRT DLLs (/O2 and /MD respectively), I still see the same slow-down. Switching to the debug CRT DLLs, I see some improvement. If I switch to statically linking the CRT, then it doesn't matter whether I use the debug or release versions, or whether I compile with optimizations or not; I regularly see 24 cycles per call, i.e.:
ml64 /c ..\x64simd.asm
cl.exe /Od /MT /Feapp.exe ..\main.cpp x64simd.obj
>app.exe
Sin(0.00314159) = 0.00314159. Took 24 cycles
Sin(1.56765473) = 0.99984086. Took 24 cycles
Sin(0.78539819) = 0.70710647. Took 24 cycles
Sin(0.00010000) = 0.00010000. Took 24 cycles
Sin(1.57079637) = 0.99984306. Took 24 cycles
So it's definitely something in calling out to the CRT release DLLs that causes the slow-down. I'm still puzzled as to why, especially as the Debug build in VS also uses the CRT via DLLs.

You're timing in reference cycles with rdtsc, not core clock cycles. It's probably the same speed both times, in core clock cycles, but with the CPU running at different frequencies.
Probably a debug build gives the CPU time to ramp up to max turbo (more core cycles per reference cycle) before your function gets called, because the calling code compiles to slower asm. And especially with MSVC, a debug build adds extra stuff like poisoning the stack frame to catch use of uninitialized vars, plus overhead for incremental linking.
None of this slows down your hand-written function itself, it's just "warm up" that you neglected to do manually in your microbenchmark.
See How to get the CPU cycle count in x86_64 from C++? for lots more details about RDTSC.
A factor of ~3 between idle CPU clock and max turbo (or some higher clock) is very plausible for modern x86 CPUs. My i7-6700k idles at 0.8GHz with a rated frequency of 4.0GHz and a max single-core turbo of 4.2GHz. But many laptop CPUs have a much lower non-turbo max (and might only ramp to non-turbo initially, not max turbo right away, depending on the energy_performance_preference HW governor, or especially a software governor on older CPUs).
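If you want to rule out frequency ramp-up without touching the BIOS, a minimal warm-up sketch (WarmUpCpu is illustrative code of mine, not from the question, and the 10ms figure is a guess):

#include <chrono>

// Spin for ~10ms so frequency scaling can ramp the core up to max turbo
// before the first timed CalcSin call.
static void WarmUpCpu() {
    volatile unsigned sink = 0;
    auto start = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - start < std::chrono::milliseconds(10))
        sink = sink + 1;  // volatile store keeps the spin loop from being optimized away
}

Call WarmUpCpu() at the top of main, before the first DoCalcs call.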
As a bonus question, why is the very first iteration typically in the thousands!!
Probably dTLB miss and cache miss for loading rf3 from data memory. You could try loading those from C (by declaring extern volatile float rf3) to prime the TLB + cache for that block of constants, assuming they're all in the same cache line.
Possibly also an I-cache miss after the rdtsc, but the first load is probably before the end of an I-cache line so those could happen in parallel. (Putting the rdtsc inside your asm function means we probably aren't waiting for an iTLB miss or i-cache miss inside the timed region to even fetch the first byte of the function).
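A sketch of the priming idea from two paragraphs up (this assumes the constant label is declared PUBLIC in the .asm module so the linker can resolve it from C++; PrimeConstants is a made-up helper):

extern "C" volatile float rf3;  // hypothetical: requires "PUBLIC rf3" in the .asm

static void PrimeConstants() {
    float dummy = rf3;  // volatile read: faults in the dTLB entry and cache line
    (void)dummy;
}

Since pi, rf3, rf5, and rf7 are 16 contiguous bytes, one read almost certainly touches the whole block of constants.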
Code review:
Don't use movss between XMM registers unless you want to blend the low 4 bytes into the old value of the destination. Use movaps xmm2, xmm0 to copy the whole register; it's much more efficient.
movaps can be handled by register renaming without needing any back-end execution unit, vs. movss only running on one execution unit in Intel CPUs, port 5. https://agner.org/optimize/. Also, movaps avoids a false dependency on the old value of the register because it overwrites the full reg, allowing out-of-order exec to work properly.
movss xmm, [mem] is fine, though: as a load it zero-extends into the full register.

Related

Better understanding of timing and pipelining [duplicate]

This question already has answers here:
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? (1 answer)
How to get the CPU cycle count in x86_64 from C++? (5 answers)
How many CPU cycles are needed for each assembly instruction? (5 answers)
Assembly - How to score a CPU instruction by latency and throughput (1 answer)
Closed 1 year ago.
In this code, I'm just looping through the set of instructions a bunch of times. Regardless of the iteration count (100, 1000, 1000000), the timing using RDTSC shows (outputs) 6 clock cycles per loop iteration. I'm on a Coffee Lake i9-9900K.
There are 13 instructions in the loop, so I would have thought the minimum RDTSC delta would be 13.
Would someone be able to educate me as to how this seems to run twice as fast as I expected? I'm clearly misunderstanding something basic, or I've made a ridiculous mistake.
Thank you!
rng.SetFloatScale(2.0f / 8.0f);
00C010AE vmovups ymm4,ymmword ptr [__ymm@3e0000003e0000003e0000003e0000003e0000003e0000003e0000003e000000 (0C02160h)]
Vec8f sum = 0;
const size_t loopLen = 1000;
auto start = __rdtsc();
00C010BB rdtsc
00C010BD mov esi,eax
sum += rng.NextScaledFloats();
00C010F0 vpslld ymm0,ymm2,xmm5
00C010F4 vpxor ymm1,ymm0,ymm2
00C010F8 vpsrld ymm0,ymm1,xmm6
00C010FC vpxor ymm1,ymm0,ymm1
00C01100 vpslld ymm0,ymm1,xmm7
00C01104 vpxor ymm2,ymm0,ymm1
00C01108 vpand ymm0,ymm2,ymmword ptr [__ymm@007fffff007fffff007fffff007fffff007fffff007fffff007fffff007fffff (0C02140h)]
00C01110 vpor ymm0,ymm0,ymmword ptr [__ymm@4000000040000000400000004000000040000000400000004000000040000000 (0C021A0h)]
00C01118 vmovups ymm1,ymm4
00C0111C vfmsub213ps ymm1,ymm0,ymmword ptr [__ymm@3e8000003e8000003e8000003e8000003e8000003e8000003e8000003e800000 (0C02180h)]
00C01125 vaddps ymm3,ymm1,ymm3
for (size_t i = 0; i < loopLen; i++)
00C01129 sub eax,1
00C0112C jne main+80h (0C010F0h)
auto end = __rdtsc();
00C0112E rdtsc
00C01130 mov edi,eax
00C01132 mov ecx,edx
printf("\n\nAverage: %f\nAverage RDTSC: %ld\n", fsum, (end - start) / loopLen);

Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop?

I have the following code:
#include <iostream>
#include <chrono>

#define ITERATIONS "10000"

int main()
{
    /*
    ======================================
    The first case: the MOV is outside the loop.
    ======================================
    */
    auto t1 = std::chrono::high_resolution_clock::now();
    asm("mov $100, %eax\n"
        "mov $200, %ebx\n"
        "mov $" ITERATIONS ", %ecx\n"
        "lp_test_time1:\n"
        "  add %eax, %ebx\n" // 1
        "  add %eax, %ebx\n" // 2
        "  add %eax, %ebx\n" // 3
        "  add %eax, %ebx\n" // 4
        "  add %eax, %ebx\n" // 5
        "loop lp_test_time1\n");
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    std::cout << time;

    /*
    ======================================
    The second case: the MOV is inside the loop (faster).
    ======================================
    */
    t1 = std::chrono::high_resolution_clock::now();
    asm("mov $100, %eax\n"
        "mov $" ITERATIONS ", %ecx\n"
        "lp_test_time2:\n"
        "  mov $200, %ebx\n"
        "  add %eax, %ebx\n" // 1
        "  add %eax, %ebx\n" // 2
        "  add %eax, %ebx\n" // 3
        "  add %eax, %ebx\n" // 4
        "  add %eax, %ebx\n" // 5
        "loop lp_test_time2\n");
    t2 = std::chrono::high_resolution_clock::now();
    time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    std::cout << '\n' << time << '\n';
}
I compiled it with GCC 9.2.0 (target: x86_64-pc-linux-gnu):
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
and its output is
14474
5837
(the first case, then the second). I also compiled it with Clang, with the same result.
So why is the second case faster (an almost 3x speedup)? Is it related to some microarchitectural detail? If it matters, I have an AMD CPU: "AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G".
mov $200, %ebx inside the loop breaks the loop-carried dependency chain through ebx, allowing out-of-order execution to overlap the chain of 5 add instructions across multiple iterations.
Without it, the chain of add instructions bottlenecks the loop on the latency of the add critical path (1 cycle per add) instead of on throughput (4 adds/cycle on Excavator, improved from 2/cycle on Steamroller). Your CPU is an Excavator core.
AMD since Bulldozer has an efficient loop instruction (only 1 uop), unlike Intel CPUs where loop would bottleneck either loop at 1 iteration per 7 cycles. (https://agner.org/optimize/ for instruction tables, microarch guide, and more details on everything in this answer.)
With loop and mov taking front-end slots (and back-end execution units) away from add, a 3x instead of 4x speedup looks about right: the first loop is latency-bound at about 5 cycles per iteration (five dependent 1-cycle adds), while the second is throughput-bound at roughly 7 uops per iteration / 4-wide ≈ 1.75 cycles.
See this answer for an intro to how CPUs find and exploit Instruction Level Parallelism (ILP).
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for some in-depth details about overlapping independent dep chains.
BTW, 10k iterations is not many. Your CPU might not even ramp up out of idle speed in that time, or it might jump to max speed for most of the 2nd loop but none of the first. So be careful with microbenchmarks like this.
Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX; you step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did.
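For example, the first block with its clobbers declared might look like this (a sketch; note that once an asm statement has operand sections, every % must be written %%):

asm volatile("mov $100, %%eax\n"
             "mov $200, %%ebx\n"
             "mov $" ITERATIONS ", %%ecx\n"
             "lp_test_time1:\n"
             "  add %%eax, %%ebx\n"
             "  add %%eax, %%ebx\n"
             "  add %%eax, %%ebx\n"
             "  add %%eax, %%ebx\n"
             "  add %%eax, %%ebx\n"
             "loop lp_test_time1\n"
             :                             // no outputs
             :                             // no inputs
             : "eax", "ebx", "ecx", "cc"); // registers and flags the asm overwrites

This doesn't fix the benchmark's other problems, but at least the compiler now knows those registers and the flags don't hold its values across the statement.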

What is the benefits of using vaddss instead of addss in scalar matrix addition?

I have implemented a scalar matrix addition kernel.
#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>

// loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000

float __attribute__((aligned(32))) A[N][M],
      __attribute__((aligned(32))) B[N][M],
      __attribute__((aligned(32))) C[N][M];

int main()
{
    int w = 0, i, j;
    struct timespec tStart, tEnd;  // used to record the processing time
    double tTotal, tBest = 10000;  // the minimum total time is kept as the best time
    do {
        clock_gettime(CLOCK_MONOTONIC, &tStart);
        for (i = 0; i < N; i++) {
            for (j = 0; j < M; j++) {
                C[i][j] = A[i][j] + B[i][j];
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
        if (tTotal < tBest)
            tBest = tTotal;
    } while (w++ < NUM_LOOP);
    printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n", tBest, w, N, M);
    return 0;
}
In this case, I've compiled the program with different compiler flags, and the assembly output of the inner loop is as follows:
gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix
movss xmm1, DWORD PTR A[rcx+rax]
addss xmm1, DWORD PTR B[rcx+rax]
movss DWORD PTR C[rcx+rax], xmm1
gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix
vmovss xmm1, DWORD PTR A[rcx+rax]
vaddss xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss DWORD PTR C[rcx+rax], xmm1
AVX version (gcc -O2 -mavx):
__m256 vec256;
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j += 8) {
        vec256 = _mm256_add_ps(_mm256_load_ps(&A[i][j]), _mm256_load_ps(&B[i][j]));
        _mm256_store_ps(&C[i][j], vec256);
    }
}
SSE version (gcc -O2 -msse4.2):
__m128 vec128;
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j += 4) {
        vec128 = _mm_add_ps(_mm_load_ps(&A[i][j]), _mm_load_ps(&B[i][j]));
        _mm_store_ps(&C[i][j], vec128);
    }
}
In the scalar program, the speedup of -mavx over -msse4.2 is 2.7x. I know AVX improved the ISA efficiency, and it might be because of those improvements. But when I implemented the program with intrinsics for both AVX and SSE, the speedup is a factor of 3x. The question is: AVX scalar is 2.7x faster than SSE, yet when I vectorize it the speedup is only 3x (matrix size is 128x128 for this question). Does it make any sense that scalar AVX vs scalar SSE yields 2.7x, when the vectorized method ought to do much better given that I process eight elements with AVX compared to four with SSE? All programs have less than 4.5% cache misses, as perf stat reported.
Using gcc -O2, Linux Mint, Skylake.
UPDATE: Briefly, scalar AVX is 2.7x faster than scalar SSE, but AVX-256 is only 3x faster than SSE-128 when vectorized. I think it might be because of pipelining: in scalar code I have 3 vector ALUs that might not all be usable in vectorized mode. I might be comparing apples to oranges instead of apples to apples, and this might be the point I cannot understand.
The problem you are observing is explained here. On Skylake systems, if the upper half of an AVX register is dirty then there is a false dependency for non-VEX-encoded SSE operations on the upper half of the AVX register. In your case it seems there is a bug in your version of glibc 2.23. On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use
__asm__ __volatile__ ( "vzeroupper" : : : );
to clean the upper halves of the AVX registers. I don't think you can use an intrinsic such as _mm256_zeroupper to fix this, because GCC will say it's SSE code and not recognize the intrinsic. The option -mvzeroupper won't work either, because GCC once again thinks it's SSE code and will not emit the vzeroupper instruction.
BTW, it's Microsoft's fault that the hardware has this problem.
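For example, in the question's benchmark the workaround could be dropped in right before the timed loop (a sketch; the key point is that it runs after any library call that may have dirtied the upper halves):

clock_gettime(CLOCK_MONOTONIC, &tStart);
__asm__ __volatile__ ("vzeroupper" : : : );  /* clear dirty upper YMM state left by library calls */
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        C[i][j] = A[i][j] + B[i][j];  /* non-VEX SSE code, now free of the false dependency */
    }
}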
Update:
Other people are apparently encountering this problem on Skylake. It has been observed after printf, memset, and clock_gettime.
If your goal is to compare 128-bit with 256-bit operations, consider using -mprefer-avx128 -mavx (which is particularly useful on AMD). But then you would be comparing AVX256 vs AVX128, and not AVX256 vs SSE. AVX128 and SSE both use 128-bit operations, but their implementations are different. If you benchmark, you should mention which one you used.

Aligned and unaligned memory access with AVX/AVX2 intrinsics

According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.
vaddps ymm0,ymm0,YMMWORD PTR [rax]
the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as
vmovaps ymm0,YMMWORD PTR [rax]
the load address has to be aligned (to multiples of 32), otherwise an exception is raised.
What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:
#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1L << 26)
#define OFFSET 1

int main() {
    float *data;
    assert(!posix_memalign((void**)&data, 32, SIZE*sizeof(float)));
    for (unsigned i = 0; i < SIZE; i++) data[i] = drand48();
    float res[8] __attribute__ ((aligned(32)));
    __m256 sum = _mm256_setzero_ps(), elem;
    for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
        elem = _mm256_load_ps(d);
        // sum = _mm256_add_ps(elem, elem);
        sum = _mm256_add_ps(sum, elem);
    }
    _mm256_store_ps(res, sum);
    for (int i = 0; i < 8; i++) printf("%g ", res[i]);
    printf("\n");
    return 0;
}
(Yes, I know the code is faulty, since I use an aligned load on unaligned addresses, but bear with me...)
I compile the code with
g++ -Wall -O3 -march=native -o memtest memtest.C
on a CPU with AVX. If I check the code generated by g++ by using
objdump -S -M intel-mnemonic memtest | more
I see that the compiler does not generate an aligned load instruction, but loads the data directly in the vector addition instruction:
vaddps ymm0,ymm0,YMMWORD PTR [rax]
The code executes without any problem, even though the memory addresses are not aligned (OFFSET is 1). This is clear since vaddps tolerates unaligned addresses.
If I uncomment the line with the second addition intrinsic, the compiler cannot fuse the load and the addition since vaddps can only have a single memory source operand, and generates:
vmovaps ymm0,YMMWORD PTR [rax]
vaddps ymm1,ymm0,ymm0
vaddps ymm0,ymm1,ymm0
And now the program seg-faults, since a dedicated aligned load instruction is used, but the memory address is not aligned. (The program doesn't seg-fault if I use _mm256_loadu_ps, or if I set OFFSET to 0, by the way.)
This leaves the programmer at the mercy of the compiler and makes the behavior partly unpredictable, in my humble opinion.
My question is: Is there a way to force the C compiler to either generate a direct load in a processing instruction (such as vaddps) or to generate a dedicated load instruction (such as vmovaps)?
There is no way to explicitly control the folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding, then you have to use assembly.
In previous versions of GCC I was able to control the folding to some degree by using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). I mean, for example, in the function AddDot4x4_vec_block_8wide here, the loads are folded:
vmulps ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps ymm8, ymm9, ymm8
However, in a previous version of GCC the loads were not folded:
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps ymm9, ymm0, ymm9
vaddps ymm8, ymm8, ymm9
The correct solution is, obviously, to only use aligned loads when you know the data is aligned, and if you really want to explicitly control the folding, to use assembly.
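For the question's unaligned case specifically, a sketch of the safe variant, using the question's own variable names: _mm256_loadu_ps has no alignment requirement, so it cannot fault even with OFFSET set to 1.

for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
    elem = _mm256_loadu_ps(d);   // unaligned load: works at any address
    sum = _mm256_add_ps(sum, elem);
}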
In addition to Z boson's answer, I can tell that the problem can be caused by the compiler assuming the memory region to be aligned (because of the __attribute__((aligned(32))) marking on the array). At runtime that attribute may not hold for values on the stack, because the stack is only 16-byte aligned (see this bug, which is still open at the time of this writing, though some fixes have made it into GCC 4.6). The compiler is within its rights to choose the instructions that implement intrinsics, so it may or may not fold the memory load into the computational instruction, and it is also within its rights to use vmovaps when the folding does not occur (because, as noted before, the memory region is supposed to be aligned).
You can try forcing the compiler to realign the stack to 32 bytes upon entry to main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here), but it will incur a performance overhead.

Determine CPUID as listed in the Intel Intrinsics Guide

In the Intel Intrinsics Guide there is 'Latency and Throughput Information' at the bottom of several intrinsics, listing the performance for several CPUIDs.
For example, the table in the Intrinsics Guide looks as follows for the intrinsic _mm_hadd_pd:
CPUID(s)                Parameters    Latency    Throughput
0F_03                                 13         4
06_2A                   xmm1, xmm2    5          2
06_25/2C/1A/1E/1F/2E    xmm1, xmm2    5          2
06_17/1D                xmm1, xmm2    6          1
06_0F                   xmm1, xmm2    5          2
Now: how do I determine what ID my CPU has?
I'm using Kubuntu 12.04 and tried with sudo dmidecode -t 4 and also with the little program cpuid from the Ubuntu packages, but their output isn't really useful.
I cannot find any of the strings listed in the Intrinsics Guide anywhere in the output of the commands above.
You can get that information using the CPUID instruction, where:

The extended family, bit positions 20 through 27, is used in conjunction with the family code, specified in bit positions 8 through 11, to indicate whether the processor belongs to the Intel386, Intel486, Pentium, Pentium Pro or Pentium 4 family of processors. P6 family processors include all processors based on the Pentium Pro processor architecture and have an extended family equal to 00h and a family code equal to 06h. Pentium 4 family processors include all processors based on the Intel NetBurst® microarchitecture and have an extended family equal to 00h and a family code equal to 0Fh.

The extended model, specified in bit positions 16 through 19, in conjunction with the model number, specified in bits 4 through 7, is used to identify the model of the processor within the processor's family.

See page 22 in "Intel Processor Identification and the CPUID Instruction" for further details.
The actual CPUID is then "family_model".
The following code should do the job:
#include "stdio.h"
int main () {
int ebx = 0, ecx = 0, edx = 0, eax = 1;
__asm__ ("cpuid": "=b" (ebx), "=c" (ecx), "=d" (edx), "=a" (eax):"a" (eax));
int model = (eax & 0x0FF) >> 4;
int extended_model = (eax & 0xF0000) >> 12;
int family_code = (eax & 0xF00) >> 8;
int extended_family_code = (eax & 0xFF00000) >> 16;
printf ("%x %x %x %x \n", eax, ebx, ecx, edx);
printf ("CPUID: %02x %x\n", extended_family_code | family_code, extended_model | model);
return 0;
}
For my computer I get:
CPUID: 06_25
Hope it helps.
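Alternatively, a sketch using the <cpuid.h> helper that GCC and Clang ship, which avoids hand-written inline asm (__get_cpuid returns 0 if the requested leaf is unsupported):

#include <stdio.h>
#include <cpuid.h>

int main() {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))  // leaf 1: processor signature
        return 1;
    unsigned int family = (((eax >> 20) & 0xFF) << 4) | ((eax >> 8) & 0xF);
    unsigned int model = (((eax >> 16) & 0xF) << 4) | ((eax >> 4) & 0xF);
    printf("CPUID: %02x_%x\n", family, model);
    return 0;
}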
