Did not get expected performance speed up [duplicate] - performance

This question already has answers here:
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
(1 answer)
Idiomatic way of performance evaluation?
(1 answer)
Add+Mul become slower with Intrinsics - where am I wrong?
(2 answers)
Closed 1 year ago.
I am trying to see the performance speedup of AVX instructions. Below is the example code I am running:
#include <iostream>
#include <stdio.h>
#include <string.h>
#include <cstdlib>
#include <algorithm>
#include <immintrin.h>
#include <chrono>
#include <complex>
//using Type = std::complex<double>;
using Type = double;
int main()
{
    size_t b_size = 1;
    b_size = (1ul << 30) * b_size;
    Type *d_ptr = (Type*)malloc(sizeof(Type)*b_size);
    for(int i = 0; i < b_size; i++)
    {
        d_ptr[i] = 0;
    }
    std::cout << "malloc finishes!" << std::endl;
#ifndef AVX512
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = i*0.1;
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "No avx takes " << diff << std::endl;
#else
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1,0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1,tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m256d tmp2 = _mm256_set_pd(0.1*(i+3),0.1*(i+2),0.1*(i+1),0.1*i);
        __m256d tmp3 = _mm256_add_pd(tmp1,tmp2);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
}
I have tested this code on both Haswell and Cascade Lake machines; the cases without and with AVX produce quite similar execution times.
---Edit---
Here is the simple compiler command I used:
Without AVX
g++ test_avx512_performance.cpp -march=native -o test_avx512_performance_noavx
With AVX
g++ test_avx512_performance.cpp -march=native -DAVX512 -o test_avx512_performance
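(Note that neither command enables optimization. For the -O3 results quoted below, the same commands would presumably just add -O3; the output names here are only illustrative.)
g++ -O3 -march=native test_avx512_performance.cpp -o test_noavx_o3
g++ -O3 -march=native -DAVX512 test_avx512_performance.cpp -o test_avx_o3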
---Edit Again---
I have run the above code on the Haswell machine again. The results are surprising:
Without AVX and compiled with O3:
~$ ./test_avx512_auto_noavx
malloc finishes!
1.07374e+08
No avx takes 3824740
With AVX and compiled without any optimization flags:
~$ ./test_avx512_auto
malloc finishes!
1.07374e+08
avx takes 2121917
With AVX and compiled with O3:
~$ ./test_avx512_auto_o3
malloc finishes!
1.07374e+08
avx takes 6307190
This is the opposite of what we expected.
Also, I have implemented a manually vectorized version (similar to Add+Mul become slower with Intrinsics - where am I wrong?); see the code below:
#else
    auto a = std::chrono::high_resolution_clock::now();
    __m256d tmp2 = _mm256_set1_pd(0.1);
    __m256d base = _mm256_set_pd(-1.0,-2.0,-3.0,-4.0);
    __m256d tmp3 = _mm256_set1_pd(4.0);
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1,0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1,tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        base = _mm256_add_pd(base,tmp3);
        __m256d tmp5 = _mm256_mul_pd(base,tmp2);
        tmp1 = _mm256_add_pd(tmp1,tmp5);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp1);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
On the same machine, this gives me:
With AVX and without any optimization flags
~$ ./test_avx512_manual
malloc finishes!
1.07374e+08
avx takes 2151390
With AVX and with O3:
~$ ./test_avx512_manual_o3
malloc finishes!
1.07374e+08
avx takes 5965288
I am not sure where the problem is. Why does -O3 give worse performance?

Related

Why does a loop over an array run faster without optimization vs. gcc -O3? Array was initialized with malloc + zeroing loop [duplicate]

This question already has answers here:
Idiomatic way of performance evaluation?
(1 answer)
Difference between malloc and calloc?
(14 answers)
Why is iterating though `std::vector` faster than iterating though `std::array`?
(2 answers)
Performance: memset
(2 answers)
Closed 1 year ago.
I am sorry to post this question again with some updates. The previous one has been closed. I am trying to see the performance speedup of AVX instructions. Below is the example code I am running:
#include <iostream>
#include <stdio.h>
#include <string.h>
#include <cstdlib>
#include <algorithm>
#include <immintrin.h>
#include <chrono>
#include <complex>
//using Type = std::complex<double>;
using Type = double;
int main()
{
    size_t b_size = 1;
    b_size = (1ul << 30) * b_size;
    Type *d_ptr = (Type*)malloc(sizeof(Type)*b_size);
    for(int i = 0; i < b_size; i++)
    {
        d_ptr[i] = 0;
    }
    std::cout << "malloc finishes!" << std::endl;
#ifndef AVX512
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = i*0.1;
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "No avx takes " << diff << std::endl;
#else
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1,0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1,tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m256d tmp2 = _mm256_set_pd(0.1*(i+3),0.1*(i+2),0.1*(i+1),0.1*i);
        __m256d tmp3 = _mm256_add_pd(tmp1,tmp2);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
}
I have run the above code on the Haswell machine. The results are surprising:
Without AVX and compiled with O3:
~$ ./test_avx512_auto_noavx
malloc finishes!
1.07374e+08
No avx takes 3824740
With AVX and compiled without any optimization flags:
~$ ./test_avx512_auto
malloc finishes!
1.07374e+08
avx takes 2121917
With AVX and compiled with O3:
~$ ./test_avx512_auto_o3
malloc finishes!
1.07374e+08
avx takes 6307190
This is the opposite of what we expected.
Also, I have implemented a manually vectorized version (similar to Add+Mul become slower with Intrinsics - where am I wrong?); see the code below:
#else
    auto a = std::chrono::high_resolution_clock::now();
    __m256d tmp2 = _mm256_set1_pd(0.1);
    __m256d base = _mm256_set_pd(-1.0,-2.0,-3.0,-4.0);
    __m256d tmp3 = _mm256_set1_pd(4.0);
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1,0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1,tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        base = _mm256_add_pd(base,tmp3);
        __m256d tmp5 = _mm256_mul_pd(base,tmp2);
        tmp1 = _mm256_add_pd(tmp1,tmp5);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp1);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
On the same machine, this gives me:
With AVX and without any optimization flags
~$ ./test_avx512_manual
malloc finishes!
1.07374e+08
avx takes 2151390
With AVX and with O3:
~$ ./test_avx512_manual_o3
malloc finishes!
1.07374e+08
avx takes 5965288
I am not sure where the problem is. Why does -O3 give worse performance?
Editor's note: in the executable names,
_avx512_ seems to mean built with -march=native, even though Haswell only has AVX2.
_manual vs. _auto seems to mean -DAVX512 (the manually vectorized AVX1 intrinsics) vs. the compiler's auto-vectorization of the scalar code, which only writes with = instead of doing += like the intrinsics do.
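To make that last point concrete, a scalar loop matching what the intrinsic version actually computes (a load-add-store, reusing b_size and d_ptr from the code above) would look roughly like this:
    // read-modify-write per element, like the intrinsic loop, rather than the plain store d_ptr[i] = i*0.1;
    for (size_t i = 0; i < b_size; i++)
        d_ptr[i] += 0.1 * i;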

How to find median value in 2d array for each column with CUDA? [duplicate]

I found the 'vectorized/batch sort' and 'nested sort' methods at the link below: How to use Thrust to sort the rows of a matrix?
When I tried these methods with 500 rows of 1000 elements each, the results were:
vectorized/batch sort : 66ms
nested sort : 3290ms
I am using a 1080 Ti HOF for this operation, but it takes too long compared to your case.
In the link below, however, it takes less than 10 ms, and even as little as roughly 100 microseconds.
(How to find median value in 2d array for each column with CUDA?)
Could you recommend how to optimize this method to reduce operation time?
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>
#define NSORTS 500
#define DSIZE 1000
int my_mod_start = 0;
int my_mod() {
    return (my_mod_start++) / DSIZE;
}
bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2) {
    return thrust::equal(d1.begin(), d1.end(), d2.begin());
}
struct sort_functor
{
    thrust::device_ptr<int> data;
    int dsize;
    __host__ __device__
    void operator()(int start_idx)
    {
        thrust::sort(thrust::device, data + (dsize*start_idx), data + (dsize*(start_idx + 1)));
    }
};
#include <time.h>
#include <windows.h>
unsigned long long dtime_usec(LONG start) {
    SYSTEMTIME timer2;
    GetSystemTime(&timer2);
    LONG end = (timer2.wSecond * 1000) + timer2.wMilliseconds;
    return (end-start);
}
int main() {
    for (int i = 0; i < 3; i++) {
        SYSTEMTIME timer1;
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16 * DSIZE*NSORTS));
        thrust::host_vector<int> h_data(DSIZE*NSORTS);
        thrust::generate(h_data.begin(), h_data.end(), rand);
        thrust::device_vector<int> d_data = h_data;
        // first time a loop
        thrust::device_vector<int> d_result1 = d_data;
        thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
        GetSystemTime(&timer1);
        LONG time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        for (int i = 0; i < NSORTS; i++)
            thrust::sort(r1ptr + (i*DSIZE), r1ptr + ((i + 1)*DSIZE));
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
        // vectorized sort
        thrust::device_vector<int> d_result2 = d_data;
        thrust::host_vector<int> h_segments(DSIZE*NSORTS);
        thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
        thrust::device_vector<int> d_segments = h_segments;
        GetSystemTime(&timer1);
        time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
        thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
        if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;
        // nested sort
        thrust::device_vector<int> d_result3 = d_data;
        sort_functor f = { d_result3.data(), DSIZE };
        thrust::device_vector<int> idxs(NSORTS);
        thrust::sequence(idxs.begin(), idxs.end());
        GetSystemTime(&timer1);
        time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        thrust::for_each(idxs.begin(), idxs.end(), f);
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
        if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
    }
    return 0;
}
The main takeaway from your Thrust experience is that you should never compile a debug project, or with the device debug switch (-G), when you are interested in performance. Compiling device debug code causes the compiler to omit many performance optimizations. The difference in your case was quite dramatic: about a 30x improvement going from debug to release code.
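As a rough illustration (the file name is only a placeholder), the two kinds of builds would look like:
nvcc -G -o t_debug example.cu     # device debug build: many optimizations disabled
nvcc -O3 -o t_release example.cu  # release build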
Here is a segmented CUB sort, where we are launching 500 blocks and each block is handling a separate 1024-element array. The CUB code is lifted from here.
$ cat t1761.cu
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
#include <iostream>
const int ipt=8;
const int tpb=128;
__global__ void ExampleKernel(int *data)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 8 integer items each
    typedef cub::BlockRadixSort<int, tpb, ipt> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[ipt];
    // just create some synthetic data in descending order 1023 1022 1021 1020 ...
    for (int i = 0; i < ipt; i++) thread_keys[i] = (tpb-1-threadIdx.x)*ipt+i;
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    __syncthreads();
    // write results to output array
    for (int i = 0; i < ipt; i++) data[blockIdx.x*ipt*tpb + threadIdx.x*ipt+i] = thread_keys[i];
}
int main(){
    const int blks = 500;
    int *data;
    cudaMalloc(&data, blks*ipt*tpb*sizeof(int));
    ExampleKernel<<<blks,tpb>>>(data);
    int *h_data = new int[blks*ipt*tpb];
    cudaMemcpy(h_data, data, blks*ipt*tpb*sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) std::cout << h_data[i] << " ";
    std::cout << std::endl;
}
$ nvcc -o t1761 t1761.cu -I/path/to/cub/cub-1.8.0
$ CUDA_VISIBLE_DEVICES="2" nvprof ./t1761
==13713== NVPROF is profiling process 13713, command: ./t1761
==13713== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
0 1 2 3 4 5 6 7 8 9
==13713== Profiling application: ./t1761
==13713== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 60.35% 308.66us 1 308.66us 308.66us 308.66us [CUDA memcpy DtoH]
39.65% 202.79us 1 202.79us 202.79us 202.79us ExampleKernel(int*)
API calls: 98.39% 210.79ms 1 210.79ms 210.79ms 210.79ms cudaMalloc
0.72% 1.5364ms 1 1.5364ms 1.5364ms 1.5364ms cudaMemcpy
0.32% 691.15us 1 691.15us 691.15us 691.15us cudaLaunchKernel
0.28% 603.26us 97 6.2190us 400ns 212.71us cuDeviceGetAttribute
0.24% 516.56us 1 516.56us 516.56us 516.56us cuDeviceTotalMem
0.04% 79.374us 1 79.374us 79.374us 79.374us cuDeviceGetName
0.01% 13.373us 1 13.373us 13.373us 13.373us cuDeviceGetPCIBusId
0.00% 5.0810us 3 1.6930us 729ns 2.9600us cuDeviceGetCount
0.00% 2.3120us 2 1.1560us 609ns 1.7030us cuDeviceGet
0.00% 748ns 1 748ns 748ns 748ns cuDeviceGetUuid
$
(CUDA 10.2.89, RHEL 7)
Above I am running on a Tesla K20x, which has performance that is "closer" to your 1080ti than a Tesla V100. We see that the kernel execution time is ~200us. If I run the exact same code on a Tesla V100, the kernel execution time drops to ~35us:
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t1761
==13814== NVPROF is profiling process 13814, command: ./t1761
0 1 2 3 4 5 6 7 8 9
==13814== Profiling application: ./t1761
==13814== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 82.33% 163.43us 1 163.43us 163.43us 163.43us [CUDA memcpy DtoH]
17.67% 35.073us 1 35.073us 35.073us 35.073us ExampleKernel(int*)
API calls: 98.70% 316.92ms 1 316.92ms 316.92ms 316.92ms cudaMalloc
0.87% 2.7879ms 1 2.7879ms 2.7879ms 2.7879ms cuDeviceTotalMem
0.19% 613.75us 97 6.3270us 389ns 205.37us cuDeviceGetAttribute
0.19% 601.61us 1 601.61us 601.61us 601.61us cudaMemcpy
0.02% 72.718us 1 72.718us 72.718us 72.718us cudaLaunchKernel
0.02% 59.905us 1 59.905us 59.905us 59.905us cuDeviceGetName
0.01% 37.886us 1 37.886us 37.886us 37.886us cuDeviceGetPCIBusId
0.00% 4.6830us 3 1.5610us 546ns 2.7850us cuDeviceGetCount
0.00% 1.9900us 2 995ns 587ns 1.4030us cuDeviceGet
0.00% 677ns 1 677ns 677ns 677ns cuDeviceGetUuid
$
You'll note there is no "input" array; I'm just synthesizing data in the kernel, since we are primarily interested in performance. If you need to handle an array size like 1000, you should probably just pad each array out to 1024 (e.g. pad with a very large number, then ignore the last numbers in the sorted result).
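A minimal host-side sketch of that padding idea (the function name and sizes are illustrative; it assumes int keys and INT_MAX as the sentinel value):
#include <vector>
#include <algorithm>
#include <climits>
#include <cstddef>
// pad each real_size-element row out to padded_size with INT_MAX; the sentinel values
// sort to the end of each segment and can simply be ignored in the result
std::vector<int> pad_rows(const std::vector<int> &input, int nrows,
                          int real_size, int padded_size)
{
    std::vector<int> padded(static_cast<std::size_t>(nrows) * padded_size, INT_MAX);
    for (int r = 0; r < nrows; r++)
        std::copy(input.begin() + static_cast<std::size_t>(r) * real_size,
                  input.begin() + static_cast<std::size_t>(r + 1) * real_size,
                  padded.begin() + static_cast<std::size_t>(r) * padded_size);
    return padded;
}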
This code is largely lifted from external documentation. It is offered for instructional purposes. I'm not suggesting it is defect-free or suitable for any particular purpose. Use it at your own risk.

wrong result boost gmp float

I need to compute 5^64 with the Boost multiprecision library, which should yield 542101086242752217003726400434970855712890625, but boost::multiprecision::pow() takes mpfloat and gives 542101086242752217003726392492611895881105408.
However, if I loop and repeatedly multiply using mpint, I get the correct result.
Is it a bug? Am I using boost::multiprecision::pow() in a wrong way? Or is there an alternative to boost::multiprecision::pow()?
#include <iostream>
#include <string>
#include <boost/multiprecision/gmp.hpp>
typedef boost::multiprecision::mpz_int mpint;
typedef boost::multiprecision::number<boost::multiprecision::gmp_float<4> > mpfloat;
int main(){
    mpfloat p = boost::multiprecision::pow(mpfloat(5), mpfloat(64));
    std::cout << p.template convert_to<mpint>() << std::endl;
    mpint res(1);
    for(int i = 0; i < 64; ++i){
        res = res * 5;
    }
    std::cout << res << std::endl;
}
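For what it's worth, if an exact integer result is what is needed, one alternative (a sketch, assuming Boost.Multiprecision's integer pow overload that takes an unsigned exponent) avoids floating point entirely:
#include <iostream>
#include <boost/multiprecision/gmp.hpp>
typedef boost::multiprecision::mpz_int mpint;
int main(){
    // exact integer exponentiation, no floating-point rounding involved
    mpint p = boost::multiprecision::pow(mpint(5), 64u);
    std::cout << p << std::endl;  // expected: 542101086242752217003726400434970855712890625
}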

Giving up ownership of a memory without releasing it by shared_ptr

Is there a way I can make the shared pointer point to a different memory location without releasing the memory it currently points to?
Please consider the code:
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include <iostream>
int
main()
{
    int *p = new int();
    *p = 10;
    int *q = new int();
    *q = 20;
    boost::shared_ptr<int> ps(p);
    // This leads to a compiler error
    ps = boost::make_shared<int>(q);
    std::cout << *p << std::endl;
    std::cout << *q << std::endl;
    return 0;
}
You can't.
Of course, you can release and re-attach while changing the deleter to a no-op.
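A sketch of that idea (with a no-op deleter, ps never owns p, so re-pointing it frees nothing; the caller stays responsible for both allocations):
#include <boost/shared_ptr.hpp>
#include <iostream>
int main()
{
    int *p = new int(10);
    int *q = new int(20);
    boost::shared_ptr<int> ps(p, [](int*){ /* no-op deleter: do not free p */ });
    ps.reset(q, [](int*){ /* still non-owning */ });
    std::cout << *p << " " << *ps << std::endl;   // 10 20
    delete p;   // the caller still owns and releases both allocations
    delete q;
    return 0;
}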
To be honest, it looks like you'd just want
ps = boost::make_shared<int>(*q);
Prints (live on Coliru):
0
20

Console output order slows down multi-threaded program

When compiling the following code
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>
std::mutex cout_mut;
void task()
{
    for(int i=0; i<10; i++)
    {
        double d=0.0;
        for(size_t cnt=0; cnt<200000000; cnt++) d += 1.23456;
        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        // std::cout << "(Help)" << d << std::endl;
    }
}
int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());
    auto t_begin = std::chrono::high_resolution_clock::now();
    for(auto& t : all_t) t = std::thread{task};
    for(auto& t : all_t) t.join();
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}
Under MinGW 4.8.1 it takes roughly 2.5 seconds to execute on my box. That is approximately the time it takes to execute the task function just once, single-threaded.
However, when I uncomment the line in the middle and comment out the line before it (that is, when I swap the order in which d and "(Help)" are written to std::cout), the whole thing now takes 8-9 seconds.
What is the explanation?
I tested again and found out that I only have the problem with MinGW-build x32-4.8.1-win32-dwarf-rev3 but not with MinGW build x64-4.8.1-posix-seh-rev3. I have a 64-bit machine. With the 64-bit compiler both versions take three seconds. However, using the 32-bit compiler, the problem remains (and is not due to release/debug version confusion).
It has nothing to do with multi-threading. It is a problem of loop optimization. I have rearranged the original code to get something minimalistic demonstrating the issue:
#include <iostream>
#include <chrono>
#include <mutex>
int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for(int i=0; i<2; i++)
    {
        double d=0.0;
        for(int j=0; j<100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}
When compiled and executed with:
g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output is:
a123456
a123456
Took : 931
123456a
123456a
Took : 373
Most of the difference in timing is explained by the assembly code generated for the inner loop: the fast case accumulates directly in xmm0 while the slow case accumulates into xmm1 - leading to 2 extra movsd instructions.
Now, when compiled with the '-ftree-loop-linear' option:
g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output becomes:
a123456
a123456
Took : 340
123456a
123456a
Took : 346
