Is this a PLINQ bug? - parallel-processing

Why is the PLINQ output different from sequential processing and a Parallel.For loop?
I want to compute the sum of the square roots of 10,000,000 numbers. Here is the code for the three cases:
Sequential for loop:
double sum = 0.0;
for (int i = 1; i < 10000001; i++)
    sum += Math.Sqrt(i);
Output of this is: 21081852648.717
Now using a Parallel.For loop:
object locker = new object();
double total = 0.0;
Parallel.For(1, 10000001,
    () => 0.0,
    (i, state, local) => local + Math.Sqrt(i),
    (local) =>
    {
        lock (locker) { total += local; }
    }
);
Output of this is: 21081852648.7199
Now using PLINQ
double tot = ParallelEnumerable.Range(1, 10000000)
.Sum(i => Math.Sqrt(i));
Output of this is: 21081852648.72
Why is there a difference between the PLINQ output and the outputs of the Parallel.For and sequential for loops?

I strongly suspect it's because arithmetic with doubles isn't truly associative. Information is potentially lost while summing values, and exactly what information is lost will depend on the order of the operations.
Here's an example showing that effect:
using System;

class Test
{
    static void Main()
    {
        double d1 = 0d;
        for (int i = 0; i < 10000; i++)
        {
            d1 += 0.00000000000000001;
        }
        d1 += 1;
        Console.WriteLine(d1);

        double d2 = 1d;
        for (int i = 0; i < 10000; i++)
        {
            d2 += 0.00000000000000001;
        }
        Console.WriteLine(d2);
    }
}
In the first case, we can add very small numbers lots of times until they become big enough to still be relevant when added to 1.
In the second case, adding 0.00000000000000001 to 1 always just results in 1 as there isn't enough information in a double to represent 1.00000000000000001 - so the final result is still just 1.
EDIT: I've thought of another aspect which could be confusing things. For local variables, the JIT compiler is able to (and allowed to) use the 80-bit FP registers, which means arithmetic can be performed with less information loss. That's not the case for instance variables which definitely have to be 64-bit. In your Parallel.For case, the total variable will actually be an instance variable in a generated class because it's captured by a lambda expression. This could change the results - but it may well depend on computer architecture, CLR version etc.

Related

How do we calculate the number of cache reads/misses in this code snippet?

I'm trying to understand how to calculate the cache misses in the code below, taken from a textbook example linked on this page. I can see where the calculations come from, but because the loop bounds in the example are the same (32), I cannot work out how to do the calculation when the bounds of the two loops differ. Using differently sized loops, what would the calculations be?
for (i = 32; i >= 0; i--) {
    for (j = 128; j >= 0; j--) {
        total_x += grid[i][j].x;
    }
}

for (i = 128; i >= 0; i--) {
    for (j = 32; j >= 0; j--) {
        total_y += grid[i][j].y;
    }
}
If we had a matrix with 128 rows and 24 columns (instead of the 32 x 32 in the example), using 32-bit integers, and with each memory block able to hold 16 bytes, how do we calculate the number of compulsory misses on the top loop?
Also, if we use a direct-mapped cache holding 256 bytes of data, how would we calculate the number of all the data cache misses when running the top loop?
Finally, if we flip it and use the bottom loop, how does the maths change (if it does) for the points above?
Apologies as this is all new to me and I just want to understand the maths behind it so I can answer the problem, rather than just be given an answer.
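For what it's worth, compulsory (cold) misses are just the number of distinct memory blocks the loop touches when the cache starts empty. Below is a small hypothetical worked example for the 128-row, 24-column case, under the assumption that each grid element stores both a 32-bit x and a 32-bit y (8 bytes per element) and is laid out contiguously in row-major order:
// Hypothetical worked example for the compulsory-miss count; the 8-byte
// element size is an assumption (32-bit x plus 32-bit y per grid cell).
#include <cstdio>

int main() {
    const int rows = 128, cols = 24;     // matrix shape from the question
    const int elem_bytes = 8;            // assumed: 32-bit x + 32-bit y
    const int block_bytes = 16;          // block size from the question

    const int bytes_touched     = rows * cols * elem_bytes;    // 24576
    const int elems_per_block   = block_bytes / elem_bytes;    // 2
    const int compulsory_misses = bytes_touched / block_bytes; // 1536

    std::printf("%d elements per block, ~%d compulsory misses\n",
                elems_per_block, compulsory_misses);
    return 0;
}
Conflict and capacity misses for the 256-byte direct-mapped cache cannot be read off a single formula in the same way; they depend on which addresses map to the same cache line under the actual traversal order, so they have to be counted from the access pattern.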

Blitz++, armadillo and OpenMP very slow

I have been trying for a very long time to learn how to parallelise, and I have been reading lots of notes on OpenMP. So I tried to use it, and everywhere I tried to parallelise is about 5 times slower than the serial case, and I am wondering why...
My code is the following:
toevaluate is a Blitz++ matrix with rows rows and two columns.
storecallj and storecallk are just two Blitz++ vectors that I use to store the calls and avoid extra function calls.
matrix is a square Armadillo matrix of size cols x rows (cols = rows); I will use it later for something else, and it is more convenient to define it as an Armadillo matrix.
externfunction1 is a function defined elsewhere which computes the (x,y) values of the function f(a,b), where (a,b) is the input and (x,y) is the output.
problem is a string variable and normal is a boolean.
resultj and resultk are (x,y) vectors storing the output of that function.
externfunction2 is another function defined elsewhere which computes a double by evaluating funcci (i=1,2), a Blitz++ polynomial. The polynomial is evaluated at a double, innerprod, to give another double, wvaluei (i=1,2).
I think these are all the generalities. The code is below.
{
    omp_set_dynamic(0);
    OMP_NUM_THREADS = 4;
    omp_set_num_threads(OMP_NUM_THREADS);
    int chunk = int(floor(cols/OMP_NUM_THREADS));
    #pragma omp parallel shared(matrix, storecallj, toevaluate, resultj, cols, rows, storecallk, atzero, innerprod, wvalue1, wvalue2, funcc1, funcc2, storediff, checking, problem, normal, chunk) private(tid, j, k)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            printf("Initializing parallel process...\n");
        }
        #pragma omp for collapse(2) schedule(dynamic, chunk) nowait
        for (j = 0; j < cols; ++j)
        {
            for (k = 0; k < rows; ++k)
            {
                storecallj = toevaluate(j, All);
                externfunction1(problem, normal, storecallj, resultj);
                storecallk = toevaluate(k, All);
                storediff = storecallj - storecallk;
                if (k == j) {
                    matrix(j,k) = -atzero*sum(resultj*resultj);
                } else {
                    innerprod = sqrt(sum(storediff*storediff));
                    checking = 1.0 - c*innerprod;
                    if (checking > 0.0)
                    {
                        externfunction1(problem, normal, storecallk, resultk);
                        externfunction2(c, innerprod, funcc1, wvalue1);
                        externfunction2(c, innerprod, funcc2, wvalue2);
                        matrix(j,k) = -wvalue2*sum(storediff*resultj)*sum(storediff*resultk) - wvalue1*sum(resultj*resultk);
                    }
                }
            }
        }
    }
}
and it is very slow. Another thing that I have tried is:
for (j = 0; j < cols; ++j)
{
    storecallj = toevaluate(j, All);
    externfunction1(problem, normal, storecallj, resultj);
    #pragma omp for collapse(2) schedule(dynamic, chunk) nowait
    for (k = 0; k < rows; ++k)
    {
        storecallk = toevaluate(k, All);
        storediff = storecallj - storecallk;
        ...
I am using dynamic because I want to avoid problems in case the total number of points to evaluate is not a multiple of 4.
Could someone please give me a hand to understand why this is so slow?
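One thing that stands out in the posted code (offered only as an observation, not a tested fix) is that the scratch variables storecallj, storecallk, storediff, resultj, resultk, innerprod, checking, wvalue1 and wvalue2 are listed as shared, yet every thread assigns to them inside the loop, so the threads race on the same memory. Here is a minimal self-contained sketch of the same loop shape with every per-iteration temporary declared inside the loop body (hence private to each thread) and the default static schedule; the placeholder arithmetic stands in for externfunction1/externfunction2 and the Blitz++/Armadillo types:
// Illustrative sketch only: plain C++/OpenMP analogue of the loop structure,
// with every per-iteration temporary declared inside the loop so each thread
// owns its copies. The computation is a made-up placeholder.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int cols = 2000, rows = 2000;
    std::vector<double> matrix(static_cast<size_t>(cols) * rows, 0.0);
    std::vector<double> points(cols * 2);             // stand-in for toevaluate
    for (int j = 0; j < cols * 2; ++j) points[j] = 0.001 * j;

    #pragma omp parallel for collapse(2) schedule(static)
    for (int j = 0; j < cols; ++j) {
        for (int k = 0; k < rows; ++k) {
            // Private temporaries: no shared scratch variables, no races.
            double dx = points[2 * j]     - points[2 * k];
            double dy = points[2 * j + 1] - points[2 * k + 1];
            double innerprod = std::sqrt(dx * dx + dy * dy);
            matrix[static_cast<size_t>(j) * rows + k] =
                (k == j) ? 0.0 : -innerprod;          // placeholder formula
        }
    }
    std::printf("matrix(0,1) = %f\n", matrix[1]);
    return 0;
}
Whether this alone explains the 5x slowdown in the real code depends on how expensive externfunction1/externfunction2 are, but shared scratch variables written by every thread are a data race and a source of heavy cache-line contention.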

Eigen - return type of .cwiseProduct?

I am writing a function in RcppEigen for weighted covariances. In one of the steps I want to take column i and column j of a matrix, X, and compute the cwiseProduct, which should return some kind of vector. The output of cwiseProduct will go into an intermediate variable which can be reused many times. From the docs it seems cwiseProduct returns a CwiseBinaryOp, which itself takes two types. My cwiseProduct operates on two column vectors, so I thought the correct return type should be Eigen::CwiseBinaryOp<Eigen::ColXpr, Eigen::ColXpr>, but I get the error: no member named 'ColXpr' in namespace 'Eigen'.
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
Rcpp::List Crossprod_sparse(Eigen::MappedSparseMatrix<double> X, Eigen::Map<Eigen::MatrixXd> W) {
    int K = W.cols();
    int p = X.cols();
    Rcpp::List crossprods(W.cols());
    for (int i = 0; i < p; i++) {
        for (int j = i; j < p; j++) {
            Eigen::CwiseBinaryOp<Eigen::ColXpr, Eigen::ColXpr> prod = X.col(i).cwiseProduct(X.col(j));
            for (int k = 0; k < K; k++) {
                //double out = prod.dot(W.col(k));
            }
        }
    }
    return crossprods;
}
I have also tried saving into a SparseVector
Eigen::SparseVector<double> prod = X.col(i).cwiseProduct(X.col(j));
as well as computing, but not saving at all
X.col(i).cwiseProduct(X.col(j));
If I don't save the product at all, the function returns very quickly, hinting that cwiseProduct is not an expensive function. When I save it into a SparseVector, the function is extremely slow, making me think that SparseVector is not the right return type and Eigen is doing extra work to get it into that type.
Recall that Eigen relies on expression templates, so if you don't assign an expression then this expression is essentially a no-op. In your case, assigning it to a SparseVector is the right thing to do. Regarding speed, make sure to compile with compiler optimizations ON (like -O3).
Nonetheless, I believe there is a faster way to write your overall computation. For instance, are you sure that all X.col(i).cwiseProduct(X.col(j)) are non-empty? If not, then the second loop should be rewritten to iterate over the sparse set of overlapping columns only. Loops could also be interchanged to leverage efficient matrix products.
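As a concrete illustration of that advice (assign the expression once, then reuse it), here is a minimal sketch using plain Eigen with small made-up matrices rather than the real Rcpp-mapped X and W. On the original error: for dense matrices the nested typedef is spelled Eigen::MatrixXd::ColXpr rather than Eigen::ColXpr, and for sparse expressions the exact type is awkward enough that auto or an explicit SparseVector target is the usual choice.
// Sketch only: plain Eigen (no Rcpp), small hand-filled matrices standing in
// for X (sparse) and W (dense). Compile with optimizations (e.g. -O3).
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <iostream>

int main() {
    Eigen::SparseMatrix<double> X(4, 3);
    X.insert(0, 0) = 1.0;  X.insert(2, 0) = 2.0;
    X.insert(0, 1) = 3.0;  X.insert(3, 2) = 4.0;
    X.makeCompressed();

    Eigen::MatrixXd W = Eigen::MatrixXd::Ones(4, 2);

    for (int i = 0; i < X.cols(); ++i) {
        for (int j = i; j < X.cols(); ++j) {
            // Materialise the product once so it can be reused for every k;
            // an auto variable would keep it as an unevaluated expression.
            Eigen::SparseVector<double> prod = X.col(i).cwiseProduct(X.col(j));
            for (int k = 0; k < W.cols(); ++k) {
                double out = prod.dot(W.col(k));
                std::cout << i << "," << j << "," << k << ": " << out << "\n";
            }
        }
    }
    return 0;
}
The dot product then only walks the non-zeros of prod, so materialising it once per (i, j) pair and reusing it across all columns of W is normally the cheap part, provided the build is optimised as noted above.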

How to parallelize multidimensional for-loops with the calculation done in the outer loops in OpenCL?

I am trying to parallelize a code that looks like this using OpenCL run on a GPU:
int pos = 0;
for (int x = 0; x < x_len; x++) {
    // do some expensive calculation, like
    int value = pow(24, x);
    for (int y = 0; y < y_len; y++) {
        for (int z = 0; z < z_len; z++) {
            // do some expensive calculation that depends on value,
            // but is very unlikely to have a positive result that needs
            // to be saved to global memory - something like
            if ((value * y * z) % 456 == 0) {
                // save the position that had a positive result
                output[pos] = x*y_len*z_len + y*z_len + z;
                pos++;
            }
        }
    }
}
(y_len and z_len are small enough that a (1, y_len, z_len) local workgroup is possible, just in case that is important)
My current solution has 2 kernels: one performs the outer calculation and saves the result to global memory, and the second uses that calculated data and performs the inner calculation (using atomic_add for pos). That works fine, but my actual code needs more data to be saved than just one integer per x iteration (it is actually 2 integers and 2 longs), so global memory is used up quite fast. That means I need to split the kernel calls and iterate over the calls in the host code many times.
So my question is: is there a better way to parallelize this code?

Unexpected slowdown using omp

I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for (i = 0; i < size; i++) {
    for (j = 0; j < 500; j++) {
        count = count + setList[i].count(val[j]);
    }
}
Then I thought I could maybe get a speedup by moving the setList[i] subexpression up one level of nesting and saving it in a temp variable, as follows:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for (i = 0; i < size; i++) {
    currSet = setList[i];
    for (j = 0; j < 500; j++) {
        count = count + currSet.count(val[j]);
    }
}
I had thought this would maybe save a load each iteration of the "j" for loop, and get a speedup, but it actually SLOWED DOWN by about 3x. By this I mean the entire kernel took about 3 times as long to run. Thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing - all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing that variable, which will considerably impact your performance.
I just noticed that you set the schedule to be dynamic. You shouldn't. This workload can be divided at compile time already. So don't specify a schedule.
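Putting those two points together, here is a minimal self-contained sketch (getData(), size and val are made-up stand-ins for whatever the original code uses) of the first loop with a reduction on count and no explicit schedule clause; it also takes setList[i] by const reference, since assigning it to a plain unordered_set temporary, as in the second version, copies the whole set on every outer iteration:
// Sketch of the first loop rewritten with a reduction and the default
// (static) schedule; the data here is invented just to make it runnable.
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

std::vector<std::unordered_set<uint16_t>> getData() {
    std::vector<std::unordered_set<uint16_t>> data(1000);
    for (std::size_t i = 0; i < data.size(); ++i)
        for (uint16_t v = 0; v < 200; ++v)
            data[i].insert(static_cast<uint16_t>(v * (i + 1)));
    return data;
}

int main() {
    std::vector<std::unordered_set<uint16_t>> setList = getData();
    std::vector<uint16_t> val(500);
    for (uint16_t j = 0; j < 500; ++j) val[j] = j;

    const long long size = static_cast<long long>(setList.size());
    long long count = 0;

    // reduction(+:count) gives each thread a private counter and adds them up
    // at the end, so the threads never contend on a single shared variable.
    #pragma omp parallel for reduction(+:count)
    for (long long i = 0; i < size; i++) {
        const auto& currSet = setList[i];   // reference: no copy of the set
        for (int j = 0; j < 500; j++) {
            count += currSet.count(val[j]);
        }
    }
    std::printf("count = %lld\n", count);
    return 0;
}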
As has already been stated, inter-loop dependencies - i.e. threads waiting for data from other threads, or data being accessed by multiple threads successively - can cause a parallelised program to slow down and should be avoided as a rule of thumb. Built-in constructs like reductions collect the individual results and combine them in an optimised fashion.
Here is a good example of a reduction being used in a case similar to yours, from the University of Utah:
int array[8] = {1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
    sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
As an aside: a variable is automatically private when it is a local variable declared inside the parallel region. In both cases it is also not necessary for i to be declared private, since the loop iteration counter of an OpenMP for construct is private by default.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?
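To tie the quoted rules together, here is a tiny sketch (not taken from the question) showing a loop counter that is predetermined private and a variable declared inside the loop that is private simply because each thread has its own copy; nothing needs a private clause:
// Minimal illustration of the data-sharing defaults quoted above.
#include <cstdio>

int main() {
    int results[8] = {0};              // shared, but each index written once

    #pragma omp parallel for           // i is predetermined private
    for (int i = 0; i < 8; i++) {
        int square = i * i;            // declared inside the loop -> private
        results[i] = square;           // disjoint writes, so no race
    }

    for (int i = 0; i < 8; i++)
        std::printf("results[%d] = %d\n", i, results[i]);
    return 0;
}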

Resources