I have a problem with OpenMP. I've written some computational code and parallelized it using OpenMP, but the sequential and parallel versions give me different results.
Here is the code:
for (i = 0; i < grid_number; i++)
{
    double norm = 0;
    const double alpha = gsl_vector_get(valpha, i);
    for (j = 0; j < n_sim; j++)
    {
        gsl_matrix_complex *sub_data = gsl_matrix_complex_calloc(n_obs, 1);
        struct cmatrix cm;
        cm.H0 = gsl_matrix_complex_calloc(n_obs-1, nWeight);
        cm.optMatrix = gsl_matrix_complex_calloc(n_obs-1, n_obs-1);
        for (k = 0; k < 3; k++)
        {
            gsl_vector_set(sub_b02, k, gsl_matrix_get(b02, j, k));
        }
        for (k = 0; k < n_obs; k++)
        {
            const gsl_complex z = gsl_complex_rect(gsl_matrix_get(data2, k, j), 0);
            gsl_matrix_complex_set(sub_data, k, 0, z);
        }
        gsl_vector *theta = gsl_vector_calloc(3);
        c_matrix(sub_b02, sub_data, 1, cm, alpha);
        fminsearch(sub_b02, sub_data, cm.optMatrix, cm.H0, theta);
        gsl_vector_sub(theta, theta1);
        norm += gsl_blas_dnrm2(theta);
        gsl_matrix_complex_free(sub_data);
        gsl_matrix_complex_free(cm.H0);
        gsl_matrix_complex_free(cm.optMatrix);
        gsl_vector_free(theta);
    }
    double mse = total_weight * norm / (double)n_sim;
    printf("alpha:%f, MSE:%.12e\n", alpha, mse);
    mses[i] = mse;
    alphas[i] = alpha;
}
Running this code gives this result:
alpha:0.000010, MSE:1.368646778831e-01
alpha:0.000076, MSE:1.368646778831e-01
alpha:0.000142, MSE:1.368646778831e-01
alpha:0.000208, MSE:1.368646778831e-01
alpha:0.000274, MSE:1.368646778831e-01
alpha:0.000340, MSE:1.368646778831e-01
alpha:0.000406, MSE:1.368646778831e-01
alpha:0.000472, MSE:1.368646778831e-01
alpha:0.000538, MSE:1.368646778831e-01
alpha:0.000604, MSE:1.368646778831e-01
alpha:0.000670, MSE:1.368646778831e-01
alpha:0.000736, MSE:1.368646778831e-01
alpha:0.000802, MSE:1.368646778831e-01
alpha:0.000868, MSE:1.368646778831e-01
alpha:0.000934, MSE:1.368646778831e-01
Then I tried to parallelize the code using OpenMP:
#pragma omp parallel for private(j,k)
for (i = 0; i < grid_number; i++)
{
    double norm = 0;
    const double alpha = gsl_vector_get(valpha, i);
    for (j = 0; j < n_sim; j++)
    {
        gsl_matrix_complex *sub_data = gsl_matrix_complex_calloc(n_obs, 1);
        struct cmatrix cm;
        cm.H0 = gsl_matrix_complex_calloc(n_obs-1, nWeight);
        cm.optMatrix = gsl_matrix_complex_calloc(n_obs-1, n_obs-1);
        for (k = 0; k < 3; k++)
        {
            gsl_vector_set(sub_b02, k, gsl_matrix_get(b02, j, k));
        }
        for (k = 0; k < n_obs; k++)
        {
            const gsl_complex z = gsl_complex_rect(gsl_matrix_get(data2, k, j), 0);
            gsl_matrix_complex_set(sub_data, k, 0, z);
        }
        gsl_vector *theta = gsl_vector_calloc(3);
        c_matrix(sub_b02, sub_data, 1, cm, alpha);
        fminsearch(sub_b02, sub_data, cm.optMatrix, cm.H0, theta);
        gsl_vector_sub(theta, theta1);
        norm += gsl_blas_dnrm2(theta);
        gsl_matrix_complex_free(sub_data);
        gsl_matrix_complex_free(cm.H0);
        gsl_matrix_complex_free(cm.optMatrix);
        gsl_vector_free(theta);
    }
    double mse = total_weight * norm / (double)n_sim;
    printf("alpha:%f, MSE:%.12e\n", alpha, mse);
    mses[i] = mse;
    alphas[i] = alpha;
}
And the parallel result:
alpha:0.000934, MSE:1.368646778831e-01
alpha:0.000802, MSE:1.368646778831e-01
alpha:0.000274, MSE:1.368646778831e-01
alpha:0.000670, MSE:1.368646778831e-01
alpha:0.000010, MSE:1.368646778831e-01
alpha:0.000538, MSE:1.368646778831e-01
alpha:0.000406, MSE:1.368646778831e-01
alpha:0.000142, MSE:1.368646778831e-01
alpha:0.000736, MSE:1.368646778831e-01
alpha:0.000604, MSE:1.368646778831e-01
alpha:0.000208, MSE:1.368388509959e-01
alpha:0.000340, MSE:1.368646778831e-01
alpha:0.000472, MSE:1.369194416804e-01
alpha:0.000868, MSE:1.368691005950e-01
alpha:0.000076, MSE:1.369461873652e-01
Why do the two results differ for some values of alpha?
Different results between a sequential and a parallel version of a program virtually always mean one thing: race conditions. In your case it is hard to pinpoint the cause because you did not supply a minimal working example (shame on you).
However, I have managed to reverse engineer some of what was missing, and I would claim that your problem is with the variable sub_b02. It is defined outside of the parallel block, which makes it shared by default, but you call gsl_vector_set on it, which makes different threads write to the same memory location. Since it is a pointer, you will probably have to allocate it inside the parallel block.
I can't say that there aren't more things wrong, especially since I can't see into c_matrix and fminsearch. But what you should do is take a second to think about which variables should be shared and which private to each thread, then add default(none) to the pragma and write out the shared/private variables explicitly. This should give you a better idea of what you're missing.
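For illustration, here is a minimal sketch of how the loop could look with default(none), with j and k moved into the loops, and with a per-thread copy of sub_b02 allocated inside the parallel region. The shared list assumes that grid_number, n_sim, n_obs, nWeight, total_weight and the other listed names are ordinary variables visible at this point (adjust it to your real declarations), and it of course still assumes that c_matrix and fminsearch do not touch hidden shared state:
#pragma omp parallel for default(none) \
        shared(grid_number, n_sim, n_obs, nWeight, valpha, b02, data2, theta1, total_weight, mses, alphas)
for (int i = 0; i < grid_number; i++)
{
    double norm = 0;
    const double alpha = gsl_vector_get(valpha, i);
    /* per-thread working vector instead of the shared sub_b02 */
    gsl_vector *sub_b02_local = gsl_vector_calloc(3);
    for (int j = 0; j < n_sim; j++)
    {
        gsl_matrix_complex *sub_data = gsl_matrix_complex_calloc(n_obs, 1);
        struct cmatrix cm;
        cm.H0 = gsl_matrix_complex_calloc(n_obs-1, nWeight);
        cm.optMatrix = gsl_matrix_complex_calloc(n_obs-1, n_obs-1);
        for (int k = 0; k < 3; k++)
            gsl_vector_set(sub_b02_local, k, gsl_matrix_get(b02, j, k));
        for (int k = 0; k < n_obs; k++)
            gsl_matrix_complex_set(sub_data, k, 0,
                                   gsl_complex_rect(gsl_matrix_get(data2, k, j), 0));
        gsl_vector *theta = gsl_vector_calloc(3);
        c_matrix(sub_b02_local, sub_data, 1, cm, alpha);
        fminsearch(sub_b02_local, sub_data, cm.optMatrix, cm.H0, theta);
        gsl_vector_sub(theta, theta1);
        norm += gsl_blas_dnrm2(theta);
        gsl_matrix_complex_free(sub_data);
        gsl_matrix_complex_free(cm.H0);
        gsl_matrix_complex_free(cm.optMatrix);
        gsl_vector_free(theta);
    }
    gsl_vector_free(sub_b02_local);
    mses[i] = total_weight * norm / (double)n_sim;
    alphas[i] = alpha;
}
If the results still differ after this change, the remaining shared state is almost certainly inside c_matrix or fminsearch.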
Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now since it is flexible and many existing functions can handle std::vector.
I thought that in small cases the speed difference could be ignored, but it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().
Let A and B be matrices whose size is size x size. Assuming square matrices is just for simplicity. (We don't intend to limit the discussion to the square-matrix case.) We initialize each matrix in a deterministic way and finally calculate C = A * B.
We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.
Python3 code
import numpy as np
import time
import sys

if (len(sys.argv) != 2):
    print("Pass `size` as an argument.", file=sys.stderr);
    sys.exit(1);

size = int(sys.argv[1]);

A = np.ndarray((size, size));
B = np.ndarray((size, size));
for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B);
print("{:.3e}".format(time.time() - start), file=sys.stderr);
C++ code
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>  // atoi

using namespace std;

int main(int argc, char **argv) {

    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }

    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    auto start = chrono::system_clock::now();

    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";
}
C++ code (multithreaded)
We also wrote a multithreaded version of the C++ code, since numpy.dot() is automatically computed in parallel.
You can get all the code from GitHub.
Result
C++ version is 10+ times slower than Python 3 (with numpy) version.
matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
C++ (1 thread): 8.66e-03
C++ (2 threads): 4.68e-03
C++ (3 threads): 3.14e-03
C++ (4 threads): 2.43e-03
Python 3: 4.07e-04
-----------------------------------------------
matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
C++ (1 thread): 6.985e-02
C++ (2 threads): 3.647e-02
C++ (3 threads): 2.462e-02
C++ (4 threads): 1.915e-02
Python 3: 1.466e-03
-----------------------------------------------
Question
Is there any way to make the C++ implementation faster?
Optimizations I Tried
Swap the calculation order -> at most 3.5 times faster (than the original C++ code, not than the numpy code).
Optimization 1 plus partial unrolling -> at most 4.5 times faster. I first thought this could be done only when size is known in advance, but as pointed out in this comment, that is not needed: we can just limit the max value of the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.
Optimization 2, plus minimizing the calls to C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is non-negligibly slow.
Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3, of course.)
Optimization 3, plus reducing the calls to operator[] by introducing a pointer to an element of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a tiny bit faster than Optimization 4. The code is shown below.
The g++ -funroll-loops flag to unroll for loops -> no change.
The g++-specific #pragma GCC unroll n -> no change.
The g++ -flto flag to turn on link-time optimization -> no change.
Block algorithm -> no change.
Transposing B to avoid cache misses -> no change.
A long linear std::vector instead of nested std::vector<std::vector>, with swapped calculation order, the block algorithm, and partial unrolling -> at most 2.2 times faster.
Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster.
Optimization 3, plus PGO -> same as Optimization 3.
Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3.
Current Status
(originally) 13.06 times slower -> (currently) 2.10 times slower
Again, you can get all the code on GitHub. But let us quote some of it; all of the functions below are called from the multithreaded version of the C++ code.
Original Code (GitHub)
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int j = 0; j < j_max; ++j) {
            for (int k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
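For context, since the threaded driver itself is only on GitHub, here is a minimal sketch of how a function with this row_start/row_end signature is typically driven from multiple threads. The name multiply_threaded and the simple row-splitting scheme are assumptions for illustration, not the repository's exact code (compile with -pthread):
#include <vector>
#include <thread>
#include <functional>
#include <algorithm>
using namespace std;

// assumed to have the same signature as f above
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end);

void multiply_threaded(const vector<vector<double>> &A, const vector<vector<double>> &B,
                       vector<vector<double>> &C, unsigned num_threads) {
    const unsigned n = A.size();
    const unsigned rows_per_thread = (n + num_threads - 1) / num_threads;
    vector<thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const unsigned row_start = t * rows_per_thread;
        const unsigned row_end = min(n, row_start + rows_per_thread);
        if (row_start >= row_end) break;
        // each thread fills a disjoint range of rows of C, so no locking is needed
        workers.emplace_back(f, cref(A), cref(B), ref(C), row_start, row_end);
    }
    for (auto &w : workers) w.join();
}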
Current Best Code (GitHub)
This is the implementation of Optimization 5 above.
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    static const unsigned num_unroll = 5;
    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (int j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (int k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (int j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }
}
We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and have finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the cases we handle are not large, and each operation on std::vector is so expensive that, as we've seen, just reducing the calls to operator[] already improved performance considerably.
We (want to) believe we can still make it better, though.
Matrix multiplication is relatively easy to optimize. However, if you want to reach decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:
Use SIMD instructions
Use register blocking and fetch multiple data at once
Optimize for your cache lines (mainly L2 and L3)
Parallelize your code to use multiple threads (a small illustrative sketch is given at the end of this answer)
This link is a very good resource that explains all the nasty details:
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
If you want more in-depth advice, leave a comment.
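To make the first three points a bit more concrete, here is a minimal sketch of a cache-blocked, OpenMP-parallelized kernel that computes C += A * B on flat row-major arrays. The flat layout, the block size BS and the i-k-j loop order are assumptions chosen for illustration, not the gist's exact code; compile with -fopenmp -O3 -march=native:
#include <vector>
#include <algorithm>

// C += A * B for size x size row-major matrices stored in flat vectors.
// The i-k-j order makes the innermost loop a contiguous AXPY, which the
// compiler can auto-vectorize; BS-sized tiles keep pieces of B in cache.
void matmul_blocked(const std::vector<double> &A, const std::vector<double> &B,
                    std::vector<double> &C, unsigned size) {
    const unsigned BS = 64;  // block size, tune for your cache
    #pragma omp parallel for
    for (unsigned ii = 0; ii < size; ii += BS) {
        for (unsigned kk = 0; kk < size; kk += BS) {
            for (unsigned jj = 0; jj < size; jj += BS) {
                const unsigned i_end = std::min(ii + BS, size);
                const unsigned k_end = std::min(kk + BS, size);
                const unsigned j_end = std::min(jj + BS, size);
                for (unsigned i = ii; i < i_end; ++i) {
                    for (unsigned k = kk; k < k_end; ++k) {
                        const double a = A[i * size + k];
                        for (unsigned j = jj; j < j_end; ++j) {
                            C[i * size + j] += a * B[k * size + j];
                        }
                    }
                }
            }
        }
    }
}
Explicit SIMD intrinsics and register blocking, as described in the gist, are the usual next steps to get closer to a BLAS-backed numpy.dot().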
I am trying to sum all the elements of an array which is initialized in the same code. As every element is independent of the others, I tried to perform the sum in parallel. My code is shown below:
#include <iostream>
#include <cmath>
using namespace std;

int main(int argc, char** argv)
{
    cout.precision(20);
    double sumre = 0., Mre[11];
    int n = 11;
    for (int i = 0; i < n; i++)
        Mre[i] = 2. * exp(-10 * M_PI * i / (1. * n));
    #pragma omp parallel for reduction(+:sumre)
    for (int i = 0; i < n; i++)
    {
        sumre += Mre[i];
    }
    cout << sumre << "\n";
}
which I compile and run with:
g++ -O3 -o sum sumparallel.cpp -fopenmp
./sum
respectively. My problem is that the output differs every time I run it. Sometimes it gives
2.1220129388411006488
or
2.1220129388411002047
Does anyone have an idea what is happening here?
Some of these comments hint at the problem here, but there could be two distinct issues:
Double-precision numbers do not have 20 decimal digits of precision
If you want to print the maximum precision of sumre, use something like this:
#include <float.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    ...
    printf("%.*g", DBL_DECIMAL_DIG, number);
    return 0;
}
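Since your program prints with cout rather than printf, the C++ equivalent would be something along these lines (a small sketch; max_digits10 plays the same role as DBL_DECIMAL_DIG, and the value of number here is only a placeholder):
#include <iostream>
#include <limits>

int main()
{
    double number = 2.1220129388411006;  // placeholder value for illustration
    // print with enough digits to round-trip a double exactly
    std::cout.precision(std::numeric_limits<double>::max_digits10);
    std::cout << number << "\n";
    return 0;
}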
Floating-point addition is not associative
The effect of this property is roundoff error that depends on the order of summation. In fact, the function you have defined, a rapidly decaying exponential, is especially prone to roundoff in summation. Considering that the workload distribution of an OpenMP parallel for is unspecified, you may receive different answers when you run it. To get around this you may use the Kahan summation algorithm. An implementation with OpenMP would look something like this:
...
double sum = 0.0, c = 0.0;
#pragma omp parallel for reduction(+:sum, c)
for (int i = 0; i < n; i++)
{
    double y = Mre[i] - c;
    double t = sum + y;
    c = (t - sum) - y;
    sum = t;
}
sum = sum - c;
...
...
So, this time, I have a matrix zvalue[10][10], and I would like to draw a pm3d map with a normal x-y grid, where zvalue[i][j] is the value at point (i,j).
In gnuplot, I just use:
set pm3d map
splot zvalue matrix using 1:2:3 with pm3d
And in C++ gnuplot pipes, I can use a function gp.file1d() to achieve this.
gp << "splot" << gp.file1d(matrix) << "matrix using 2:1:3" << "with pm3d\n"
But now I am using a C gnuplot pipe. Of course, I can write the matrix into a file called zvalue.txt and use the following:
fprintf(gp, "splot \"zvalue.txt\" matrix using 1:2:3 with pm3d\n");
But is there another way? I tried @Christ's suggestion for a normal splot with matrix, that is, doing something like:
#include <stdio.h>
#define GNUPLOT "gnuplot -persist"

int main(void)
{
    FILE *gp = popen(GNUPLOT, "w");
    fprintf(gp, "splot '-' matrix using 1:2:3\n");
    int i, j;
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++)
            fprintf(gp, "%d ", i*j);
        fprintf(gp, "\n");
    }
    pclose(gp);
    return 0;
}
But when it's pm3d, it does not work well. I also tried splotting point by point, which works well for a normal splot with matrix, but here with pm3d nothing works.
The following program works fine with pm3d:
#include <stdio.h>
#define GNUPLOT "gnuplot -persist"

int main(void)
{
    FILE *gp = popen(GNUPLOT, "w");
    fprintf(gp, "set pm3d map\n");
    fprintf(gp, "splot '-' matrix using 1:2:3 with pm3d\n");
    int i, j;
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++)
            fprintf(gp, "%d ", i*j);
        fprintf(gp, "\n");
    }
    pclose(gp);
    return 0;
}
So if you have a program which makes use of pm3d and doesn't work, it would really help to see exactly that full file (so that one can copy&paste the code to compile and test it).
I've created a simple C program using GSL (the GNU Scientific Library) and OpenMP. In this simple program, I want to compare the execution time of the sequential and the parallel part. Here is the program snippet, main.c:
#include <omp.h>
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <time.h>
#include <sys/time.h>   /* gettimeofday */

int main()
{
    omp_set_num_threads(4);
    int n1 = 10000, n2 = 10000;
    gsl_matrix *A = gsl_matrix_alloc(n1, n2);
    int i, j;
    struct timeval tv1, tv2, tv3, tv4;

    gettimeofday(&tv1, 0);
    for (i = 0; i < n1; i++)
    {
        for (j = 0; j < n2; j++)
        {
            gsl_matrix_set(A, i, j, i*j*1000000);
        }
    }
    gettimeofday(&tv2, 0);
    long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
    printf("Sequential Duration:%ldms\n", elapsed);

    gettimeofday(&tv3, 0);
    #pragma omp parallel for private(i,j)
    for (i = 0; i < n1; i++)
    {
        for (j = 0; j < n2; j++)
        {
            gsl_matrix_set(A, i, j, i*j*1000000);
        }
    }
    gettimeofday(&tv4, 0);
    elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
    printf(" Parallel Duration:%ldms\n", elapsed);

    return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result:
Sequential Duration:11980106ms
Parallel Duration:20624043ms
Why is the parallel part slower than the sequential part? How can I optimize this code? Thanks.
As you have written it, the j variable is shared between all threads, so the threads constantly overwrite each other's state, leading to them iterating over values they have already covered.
You should always minimize the scope of variables when trying to parallelize with OpenMP. Either move the declaration of j into the loop or mark it as private explicitly (a corrected sketch is given below):
#pragma omp parallel for private(j)
Also, clock() counts processor time, not wall-clock time; you probably want to use gettimeofday().
Your matrix is too small to benefit much from parallelization; the threading overhead will dominate. Increase it to ~10000x10000 to start seeing something.
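Putting the first two points together, the parallel part could look like this (a minimal sketch; it assumes A, n1 and n2 as in your program, and relies on the fact that each thread writes distinct elements of the matrix):
#pragma omp parallel for
for (int i = 0; i < n1; i++)
{
    /* j is declared inside the parallel region, so it is private to each thread */
    for (int j = 0; j < n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}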
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know if it is thread safe. To change one element in that matrix you supply the whole matrix to the routine instead of only the indices of the element. This smells like false sharing (see e.g. this answer).
I would try writing to the element directly through the matrix's data block instead:
A->data[i * A->tda + j] = i*j*1000000;
If that does not work, and what you are interested in is only the time difference between serial and parallel, I would just use a plain array and do
A[i][j] = i*j*1000000;
In the threaded part, try this:
#pragma omp parallel private(i,j)
for (i = 0; i < n1; i++)
{
    for (j = 0; j < n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
or
#pragma omp parallel for
for (i = 0; i < n1; i++)
{
    for (j = 0; j < n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
Is there a way to make modulo by 511 (and 127) faster than using the "%" operator?
int c = 758 % 511;
int d = 423 % 127;
Here is a way to do fast modulo by 511, assuming that x is at most 32767. It's about twice as fast as x % 511. It does the modulo in five steps: two multiplications, an addition, a subtraction, and a shift.
inline int fast_mod_511(int x) {
    int y = (513*x+64)>>18;
    return x - 511*y;
}
Here is the theory of how I arrived at this. I posted the code I used to test this at the end.
Let's consider
y = x/511 = x/(512-1) = x/512 * 1/(1-1/512).
Let's define z = 512; then
y = x/z * 1/(1-1/z).
Using the Taylor expansion
y = x/z * (1 + 1/z + 1/z^2 + 1/z^3 + ...).
Now if we know that x has a limited range, we can cut the expansion. Let's assume x is always less than 2^15 = 32768. Then, keeping the first two terms, we can write
512*512*y = (1+512)*x = 513*x.
After looking at which digits are significant, we arrive at
y = (513*x+64)>>18 // 512^2 = 2^18.
So we can compute x/511 (assuming x is less than 32768) in three steps:
multiply,
add,
shift.
Here is the code I used to profile this in MSVC 2013 64-bit release mode on an Ivy Bridge core.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

inline int fast_mod_511(int x) {
    int y = (513*x+64)>>18;
    return x - 511*y;
}

int main() {
    unsigned int i, x;
    volatile unsigned int r;
    double dtime;

    dtime = omp_get_wtime();
    for (i = 0; i < 100000; i++) {
        for (int j = 0; j < 32768; j++) {
            r = j % 511;
        }
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    dtime = omp_get_wtime();
    for (i = 0; i < 100000; i++) {
        for (int j = 0; j < 32768; j++) {
            r = fast_mod_511(j);
        }
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);
}
You can use a lookup table with the solutions pre-stored. If you create an array of a million integers looking up is about twice as fast as actually doing modulo in my C# app.
// fill an array
var mod511 = new int[1000000];
for (int x = 0; x < 1000000; x++) mod511[x] = x % 511;
and instead of using
c = 758 % 511;
you use
c = mod511[758];
This will cost you (possibly a lot of) memory, and will obviously not work if you want to use it for very large numbers also. But it is faster.
If you have to repeat those two modulus operations on a large amount of data and your CPU supports SIMD (for example Intel's SSE/AVX/AVX2), then you can vectorize the operations, i.e., perform the operation on many data elements in parallel. You can do this by using intrinsics or inline assembly. Yes, the solution will be platform-specific, but maybe that is fine...
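For illustration, here is a minimal sketch combining this idea with the fast_mod_511() trick from the answer above, using SSE4.1 intrinsics to compute four remainders at once (it assumes, as above, that the inputs are non-negative and less than 32768):
#include <smmintrin.h>   /* SSE4.1: _mm_mullo_epi32 */

/* Compute x % 511 for four ints at once, assuming 0 <= x < 32768. */
static inline __m128i fast_mod_511_sse(__m128i x)
{
    const __m128i c513 = _mm_set1_epi32(513);
    const __m128i c511 = _mm_set1_epi32(511);
    const __m128i c64  = _mm_set1_epi32(64);

    /* y = (513*x + 64) >> 18 */
    __m128i y = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi32(x, c513), c64), 18);
    /* result = x - 511*y */
    return _mm_sub_epi32(x, _mm_mullo_epi32(y, c511));
}
An AVX2 version would use the 256-bit _mm256_* counterparts to process eight values at once, and constants derived the same way would handle modulo 127.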