When compiling the following code:

#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>

std::mutex cout_mut;

void task()
{
    for (int i = 0; i < 10; i++)
    {
        double d = 0.0;
        for (size_t cnt = 0; cnt < 200000000; cnt++) d += 1.23456;
        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        // std::cout << "(Help)" << d << std::endl;
    }
}

int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());
    auto t_begin = std::chrono::high_resolution_clock::now();
    for (auto& t : all_t) t = std::thread{task};
    for (auto& t : all_t) t.join();
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}
Under MinGW 4.8.1 it takes roughly 2.5 seconds to execute on my box. That is approximately the time it takes to execute just the task function alone, single-threaded.
However, when I uncomment the commented line and comment out the line above it (that is, when I swap the order in which d and "(Help)" are written to std::cout), the whole thing now takes 8-9 seconds.
What is the explanation?
I tested again and found that the problem only occurs with MinGW-build x32-4.8.1-win32-dwarf-rev3, not with MinGW-build x64-4.8.1-posix-seh-rev3. I have a 64-bit machine. With the 64-bit compiler both versions take three seconds. However, with the 32-bit compiler the problem remains (and it is not due to release/debug confusion).
It has nothing to do with multi-threading; it is a loop-optimization problem. I have reduced the original code to a minimal example that demonstrates the issue:
#include <iostream>
#include <chrono>
#include <mutex>

int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 2; i++)
    {
        double d = 0.0;
        for (int j = 0; j < 100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}
When compiled and executed with:
g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output is:
a123456
a123456
Took : 931
123456a
123456a
Took : 373
Most of the difference in timing is explained by the assembly generated for the inner loop: in the fast case the loop accumulates directly in xmm0, while in the slow case it accumulates into xmm1, leading to two extra movsd instructions.
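You can inspect the generated inner loops yourself by dumping the assembly for both builds; for example (assuming GCC, as used above):
g++ -std=c++11 -O3 -S -DSLOW b.cpp -o slow.s ; g++ -std=c++11 -O3 -S b.cpp -o fast.s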
Now, when compiled with the '-ftree-loop-linear' option:
g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast
The output becomes:
a123456
a123456
Took : 340
123456a
123456a
Took : 346
I have a problem when I try to build and run my main.cpp, and it happens only with the Mac gcc/clang/g++ toolchain.
Here is the code:
random.h
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <ctime>

void initialize();
float randomFloat(int, int, int);

random.cpp

#include "random.h"

void initialize() { srand(time(NULL)); }

float randomFloat(int min, int max, int p) {
    int intPart = rand() % (max - (min - 1)) + min;
    if (intPart == max) {
        intPart--;
    }
    float decimal = (float)(rand() % (int)pow(10, p)) / pow(10, p);
    return intPart + decimal;
}
util.h
#include <iostream>
int askInteger(const char *, bool);
util.cpp
#include "util.h"
using namespace std;
int askInteger(const char *message, bool onlyPositive) {
int number;
if (onlyPositive) {
while (cout << "Type a correct " << message << ": ",
!(cin >> number) || number < 0) {
cerr << "Input error, try again. \n";
if (cin.fail()) {
cin.clear();
cin.ignore();
}
}
} else {
while (cout << "Type a correct " << message << ": ", !(cin >> number)) {
cerr << "Input error, try again. \n";
if (cin.fail()) {
cin.clear();
cin.ignore();
}
}
}
return number;
}
main.cpp
#include "random/random.h"
#include "util/util.h"
using namespace std;
int main() {
int n, min, max, p;
n = askInteger("numbers quantity", true);
p = askInteger("precession", true);
min = askInteger("min value (included)", false);
max = askInteger("max value (included)", false);
while (max <= min) {
cout << "max value should be greather than " << min << "\n";
max = askInteger("max value (included)", false);
}
initialize();
for (int i = 0; i < n; i++) {
cout << randomFloat(min, max, p) << "\n";
}
}
And it gives me this error:
Undefined symbols for architecture arm64:
"askInteger(char const*, bool)", referenced from:
_main in main-2552a5.o
"initialize()", referenced from:
_main in main-2552a5.o
"randomFloat(int, int, int)", referenced from:
_main in main-2552a5.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
VS Code runs this command to build and run the code:
Random-Numbers % cd "/Users/user/Downloads/Random-Numbers/" && g++ main.cpp -o main && "/Users/user/Downloads/Random-Numbers/"main
Here is my project structure:
Random-Numbers/
    main.cpp
    random/
        random.h
        random.cpp
    util/
        util.h
        util.cpp
I have already tried:
gcc *.h
gcc -c *.h
g++ *.cpp
g++ -o *.cpp
g++ -c *.cpp
AT LAST! I got it to run.
The VS Code default C++ build command doesn't match what the Mac g++ compiler needs: it compiles only main.cpp. This is the default command VS Code runs to build main:
user#MacBook-Pro-de-user Random-Numbers % cd "/Users/user/Downloads/Random-Numbers/" && g++ main.cpp -o main && "/Users/user/Downloads/Random-Numbers/"main
Instead, you have to compile all three source files from the terminal:
user#MacBook-Pro-de-user Random-Numbers % g++ -o ./main.exe random/random.cpp util/util.cpp main.cpp
user#MacBook-Pro-de-user Random-Numbers % ./main.exe
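Equivalently, you can compile each translation unit separately and link the objects afterwards; this is a sketch (the paths follow the project structure shown above):
g++ -c random/random.cpp -o random.o
g++ -c util/util.cpp -o util.o
g++ -c main.cpp -o main.o
g++ random.o util.o main.o -o main.exe
./main.exe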
This question already has answers here:
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? (1 answer)
Idiomatic way of performance evaluation? (1 answer)
Add+Mul become slower with Intrinsics - where am I wrong? (2 answers)
Closed 1 year ago.
I am trying to see the performance speedup of AVX instructions. Below is the example code I am running:
#include <iostream>
#include <stdio.h>
#include <string.h>
#include <cstdlib>
#include <algorithm>
#include <immintrin.h>
#include <chrono>
#include <complex>

//using Type = std::complex<double>;
using Type = double;

int main()
{
    size_t b_size = 1;
    b_size = (1ul << 30) * b_size;
    Type *d_ptr = (Type*)malloc(sizeof(Type)*b_size);
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = 0;
    }
    std::cout << "malloc finishes!" << std::endl;
#ifndef AVX512
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = i*0.1;
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "No avx takes " << diff << std::endl;
#else
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1, 0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1, tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m256d tmp2 = _mm256_set_pd(0.1*(i+3), 0.1*(i+2), 0.1*(i+1), 0.1*i);
        __m256d tmp3 = _mm256_add_pd(tmp1, tmp2);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
}
I have tested this code on both Haswell and Cascade Lake machines; the cases without and with AVX produce quite similar execution times.
---Edit---
Here is the simple compiler command I used:
Without AVX
g++ test_avx512_performance.cpp -march=native -o test_avx512_performance_noavx
With AVX
g++ test_avx512_performance.cpp -march=native -DAVX512 -o test_avx512_performance
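(Editor's note: judging by the binary names in the results below, the _o3 builds presumably add -O3 to these commands; this is an assumption, not stated by the poster. For example:
g++ test_avx512_performance.cpp -O3 -march=native -DAVX512 -o test_avx512_auto_o3)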
--Edit Again--
I have run the above code on the Haswell machine again. The results are surprising:
Without AVX and compiled with O3:
~$ ./test_avx512_auto_noavx
malloc finishes!
1.07374e+08
No avx takes 3824740
With AVX and compiled without any optimization flags:
~$ ./test_avx512_auto
malloc finishes!
1.07374e+08
avx takes 2121917
With AVX and compiled with O3:
~$ ./test_avx512_auto_o3
malloc finishes!
1.07374e+08
avx takes 6307190
This is the opposite of what I expected.
Also, I have implemented a manually-vectorized version (similar to Add+Mul become slower with Intrinsics - where am I wrong?); see the code below:
#else
    auto a = std::chrono::high_resolution_clock::now();
    __m256d tmp2 = _mm256_set1_pd(0.1);
    __m256d base = _mm256_set_pd(-1.0, -2.0, -3.0, -4.0);
    __m256d tmp3 = _mm256_set1_pd(4.0);
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1, 0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1, tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        base = _mm256_add_pd(base, tmp3);       // advance the index vector by 4
        __m256d tmp5 = _mm256_mul_pd(base, tmp2);
        tmp1 = _mm256_add_pd(tmp1, tmp5);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp1);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
On the same machine, this gives me:
With AVX and without any optimization flags
~$ ./test_avx512_manual
malloc finishes!
1.07374e+08
avx takes 2151390
With AVX and with O3:
~$ ./test_avx512_manual_o3
malloc finishes!
1.07374e+08
avx takes 5965288
Not sure where the problem is. Why does -O3 give worse performance?
Editor's note: in the executable names,
_avx512_ seems to mean built with -march=native, even though Haswell only has AVX2.
_manual vs. _auto seems to correspond to -DAVX512, i.e. the manually-vectorized AVX1 code versus the compiler's auto-vectorization of the scalar loop, which only writes with = instead of += like the intrinsics do.
I have some code using a mutex, written for self-learning. It is based on this tutorial: https://baptiste-wicht.com/posts/2012/04/c11-concurrency-tutorial-advanced-locking-and-condition-variables.html
I wrote the example main_deadlock.cpp:
#include <iostream>
#include <thread>
#include <mutex>

struct Complex {
    std::mutex mutex;
    int i;

    Complex() : i(0) {}

    void mul(int x) {
        std::cout << "mul : before lock_guard" << std::endl;
        std::lock_guard<std::mutex> lock(mutex);
        std::cout << "mul : after lock_guard, before operation" << std::endl;
        i *= x;
        std::cout << "mul : after operation" << std::endl;
    }

    void div(int x) {
        std::cout << "div : before lock_guard" << std::endl;
        std::lock_guard<std::mutex> lock(mutex);
        std::cout << "div : after lock_guard, before operation" << std::endl;
        i /= x;
        std::cout << "div : after operation" << std::endl;
    }

    void both(int x, int y) {
        std::cout << "both : before lock_guard" << std::endl;
        std::lock_guard<std::mutex> lock(mutex);
        std::cout << "both : after lock_guard, before mul()" << std::endl;
        mul(x);
        std::cout << "both : after mul(), before div()" << std::endl;
        div(y);
        std::cout << "both : after div()" << std::endl;
    }
};

int main() {
    std::cout << "main : starting" << std::endl;
    Complex complex;
    std::cout << "main : calling both()" << std::endl;
    complex.both(32, 23);
    return 0;
}
I would expect this code to deadlock when both() calls mul(), because both() has already acquired the mutex, so mul() should block.
I am using Ubuntu 17.10.1 with g++ (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0 (output of g++ --version).
If I use the compile command:
user#user: g++ -o out_deadlock main_deadlock.cpp
I get no deadlock at all!
But if I use the compile command:
user#user: g++ -std=c++11 -pthread -o out_deadlock main_deadlock.cpp
Everything works as expected, meaning I do see the deadlock.
Can you explain?
Also, how does the first command even compile? I didn't pass -pthread and didn't pass -std=c++11, even though the code uses the C++11 threading library. I would have expected compilation or linking to fail.
Thanks.
The answer is that if you do not compile and link with -pthread, then you are not using the actual pthread locking functions.
The GNU/Linux C library is set up that way so that libraries can call all of the locking functions unconditionally, but unless they are actually linked into a multithreaded program, none of the locks actually do anything.
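One way to see this on a glibc system from that era (before glibc 2.34 merged libpthread into libc) is to check whether the binary links against libpthread at all; this check is a sketch, not taken from the original post:
$ g++ -o out_deadlock main_deadlock.cpp
$ ldd out_deadlock | grep pthread        # prints nothing: only the libc stubs are used
$ g++ -std=c++11 -pthread -o out_deadlock main_deadlock.cpp
$ ldd out_deadlock | grep pthread        # libpthread.so.0 => ...
As for -std=c++11: g++ 7 defaults to -std=gnu++14, which includes C++11, so the code compiles without an explicit standard flag. Note also that locking a std::mutex the calling thread already holds is undefined behavior, so a deadlock is possible but not guaranteed; std::recursive_mutex is the type that defines recursive locking.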
I need to compute 5^64 with the Boost multiprecision library, which should yield 542101086242752217003726400434970855712890625, but boost::multiprecision::pow() takes mpfloat and gives 542101086242752217003726392492611895881105408.
However, if I loop and repeatedly multiply using mpint, I get the correct result.
Is it a bug? Or am I using boost::multiprecision::pow() in the wrong way? Or is there an alternative to boost::multiprecision::pow()?
#include <iostream>
#include <string>
#include <boost/multiprecision/gmp.hpp>

typedef boost::multiprecision::mpz_int mpint;
typedef boost::multiprecision::number<boost::multiprecision::gmp_float<4> > mpfloat;

int main() {
    mpfloat p = boost::multiprecision::pow(mpfloat(5), mpfloat(64));
    std::cout << p.convert_to<mpint>() << std::endl;

    mpint res(1);
    for (int i = 0; i < 64; ++i) {
        res = res * 5;
    }
    std::cout << res << std::endl;
}
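For what it's worth, gmp_float<4> requests only 4 decimal digits of precision (GMP rounds the requested precision up internally, which is why a couple dozen leading digits still come out right), while 5^64 has 45 digits, so the floating-point pow cannot be exact. A sketch of an exact alternative, assuming Boost.Multiprecision's integer pow overload that takes an unsigned exponent:

#include <iostream>
#include <boost/multiprecision/gmp.hpp>

typedef boost::multiprecision::mpz_int mpint;

int main() {
    // pow over an arbitrary-precision integer backend is exact.
    mpint p = boost::multiprecision::pow(mpint(5), 64);
    std::cout << p << std::endl;  // 542101086242752217003726400434970855712890625
}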