parallelization of nested loop in threading building blocks - algorithm

I am new to threading building blocks and trying to encode the FFT algorithm in TBB to get some hands on experience. In case of this algorithm i can only parallelize the innermost loop. but by doing so the performance has degraded up to unacceptable degree (more than thousand times). I have tried for array size up to 2^20 but same result. My code is given below
for(stage1=0;stage1 < lim;stage1 ++)
{
butterfly_index1 = butterfly_index2;
butterfly_index2 = butterfly_index2 + butterfly_index2;
k = -1*2*PI/butterfly_index2;
k_next = 0.0;
for(stage2 = 0 ; stage2 < butterfly_index1;stage2++)
{
sine=sin(k_next);
cosine = cos(k_next);
k_next = k_next + k;
FFT. sine = &sine;
FFT.cosine = &cosine;
FFT.n2 = &butterfly_index2;
FFT.loop_init = &stage2;
FFT.n1 = &butterfly_index1;
parallel_for(blocked_range<int>(
stage2,SIZE,SIZE/4),FFT,simple_partitioner());
}
}
and body for parallel_loop is
void operator()(const blocked_range<int> &r)const
{
for(int k = r.begin(); k != r.end(); k++)
{
if(k != *loop_init)
{
if((k - (*loop_init))% (* n2)!= 0)
continue;
}
temp_real = (*cosine) * X[k + *n1].real - (*sine) * X[k + *n1].img;
temp_img = (*sine)* X[k + *n1].real + (*cosine) * X[k + *n1].img;
X[k + *n1].real = X[k].real - temp_real;
X[k + *n1].img = X[k].img - temp_img;
X[k].real = X[k].real + temp_real;
X[k].img = X[k].img + temp_img;
}
}
If i replace it with normal loop then things are correct.

For very short workloads, the huge slowdown can happen due to the overhead on thread creation. For 2^20 elements in the array, I do not believe such a huge performance drop happens.
Another significant source of performance degradation is compiler's inability to optimize (in particular, vectorize) the code after it has been TBBfied. Find out whether your compiler may produce a vectorization report, and look for differences between serial and TBB versions.
A possible source of slowdown might be the copy constructor of parallel_for's body class, because the bodies are copied a lot. But the given code does not look suspicious in this regard: seems the body contains a few pointers. Anyway, look whether it could be a problem.
Another usual source of significant overhead is too fine-grained parallelism, i.e. a lot of tasks each containing very little work. But here this is not the case either, because grainsize SIZE/4 in the 3rd parameter to blocked_range specifies that body's operator()() will be invoked at most 4 times per the algorithm.
I would recommend to not specify simple_partitioner and grainsize for initial experiments, and instead let TBB distribute the work dynamically. You can tune it later if necessary.

the problem with my code was the decrease in workload with increase in n2 variable. so as the out loop proceeds the workload for parallel_for become half and after just a few iteration it becomes too small to give performance boost by using TBB. So solution is to parallelize the inner loop only for iterations when the workload for inner loop is sufficient ans serialize the inner loop for rest of iterations. secondly the "for" loop header that contains the condition check (k != r.end()) also degrades the performance. solution is to replace the r.end() with a locally defined variable that is initialized to r.end()
Thanks to intel software forum for their assistance in solving this problem

Related

Vectorization or alternative to speed up MATLAB loop

I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase especially N and vec1 by 1 or 2 orders of magnitude and in such a loop it will take quite longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.

Eigen: Reduction operations very slow with custom scalar type

This is an issue that came up when profiling my convex optimization library. It uses a non-primitive custom Scalar type. We found that reduction operations like sum() and squaredNorm() are very slow compared to raw loops.
Here's a minimal example to reproduce the issue:
#include "epigraph.hpp"
size_t T = 200;
cvx::OptimizationProblem qp;
cvx::MatrixX u = qp.addVariable("u", 1, T);
// this takes very long
cvx::Scalar u_sum = u.sum();
// a raw loop is about 5x faster
cvx::Scalar sum = 0.;
for (int i = 0; i < u.size(); i++)
{
sum += u(i);
}
I did some profiling but could not get to the root of the issue since the reduction implementations are somewhat convoluted. I suspect that this has to do with internal optimizations that turn out to be slower in this case lead to a lot of allocations. Is there a way to turn those optimizations off?
Profiler Output

GPU sorting vs CPU sorting

I made a very naive implementation of the mergesort algorithm, which i turned to work on CUDA with very minimal implementation changes, the algorith code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
int i,j,k;
for(i=m+1;i>l;i--){
aux[i-1]=data[i-1];
}
//Copy in reverse order the second subarray
for(j=m;j<r;j++){
aux[r+m-j]=data[j+1];
}
//Merge
for(k=l;k<=r;k++){
if(aux[j]<aux[i] || i==(m+1))
data[k]=aux[j--];
else
data[k]=aux[i++];
}
}
//What this code do is performing a local merge
//of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int tn = n / (blockDim.x*gridDim.x);
int l = i * tn;
int r = l + tn;
//printf("Thread %d: %d,%d: \n",i,l,r);
for(int i{1};i<=(tn/2)+1;i*=2)
for(int j{l+i};j<(r+1);j+=2*i)
{
merge(aux,data,j-i,j-1,j+i-1);
}
__syncthreads();
if(i==0){
//Complete the merge
do{
for(int i{tn};i<(n+1);i+=2*tn)
merge(aux,data,i-tn,i-1,i+tn-1);
tn*=2;
}while(tn<(n/2)+1);
}
}
The problem is that no matter how many threads i launch on my GTX 760, the sorting performance is always much much more worst than the same code on CPU running on 8 threads (My CPU have hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements on CPU takes some hundred milliseconds, on GPU up to 10 minutes (even with 1024 threads per block)! Clearly i'm missing some important point here, can you please provide me with some comment? I strongly suspect the the problem is in the final merge operation performed by the first thread, at that point we have a certain amount of subarray (the exact amount depend on the number of threads) which are sorted and need to me merged, this is completed by just one thread (one tiny GPU thread).
I think i should use come kind of reduction here, so each thread perform in parallel further more merge, and the "Complete the merge" step just merge the last two sorted subarray..
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link, I must admit I still need some time to learn better CUDA before taking full advantage of that material.. Anyway, I was able to rewrite the sorting function in order to take advantage as long as possible of multiple threads, my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by only one multiprocessor.
Now after the first merge, I use each time up to (1/2)*(n/b) threads, where n is the amount of data to sort and b is the size of the chunk of data sorted by each threads.
The improvement in performance is surprising, using only 1024 threads it takes about ~10 seconds to sort 30 milion element.. Well, this is still a poor result unfortunately! The problem is in the threads syncronization, but first things first, let's see the code:
__global__
void basic_merge(int* aux, int* data,int n)
{
int k = blockIdx.x*blockDim.x + threadIdx.x;
int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
b = pow( (float)2, b);
int l=k*b;
int r=min(l+b-1,n-1);
__syncthreads();
for(int m{1};m<=(r-l);m=2*m)
{
for(int i{l};i<=r;i+=2*m)
{
merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
}
}
__syncthreads();
do{
if(k<=(n/b)*.5)
{
l=2*k*b;
r=min(l+2*b-1,n-1);
merge(aux,data,l,min(r,l+b-1),r);
}else break;
__syncthreads();
b*=2;
}while((r+1)<n);
}
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads instead of the 65000 and more I can run on my CUDA device, the problem is that __syncthreads does not work as sync primitive at grid level, but only at block level!
So i can syncronize up to 1024 threads,that is the amount of threads supported per block. Without a proper syncronization each thread mess up the data of the other, and the merging procedure does not work.
In order to boost the performance I need some kind of syncronization between all the threads in the grid, seems that no API exist for this purpose, and i read about a solution which involve multiple kernel launch from the host code, using the host as barrier for all the threads.
I have a certain plan on how to implement this tehcnique in my mergesort function, I will provide you with the code in the near future. Did you have any suggestion on your own?
Thanks
It looks like all the work is being done in __global __ memory. Each write takes a long time and each read takes a long time making the function slow. I think it would help to maybe first copy your data to __shared __ memory first and then do the work in there and then when the sorting is completed(for that block) copy the results back to global memory.
Global memory takes about 400 clock cycles (or about 100 if the data happens to be in L2 cache). Shared memory on the other hand only takes 1-3 clock cycles to write and read.
The above would help with performance a lot. Some other super minor things you can try are..
(1) remove the first __syncthreads(); It is not really doing anything because no data is being past in between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. This is being calculated over and over when it really only needs to be calculated once.
I tried to follow along on your algorithm but was not able to. The variable names were hard to follow...or... your code is above my head and I cannot follow. =) Hope the above helps.

exponential moving average

I have an exponential moving average that gets called millions of times, and thus is the most expensive part of my code:
double _exponential(double price[ ], double smoothingValue, int dataSetSize)
{
int i;
double cXAvg;
cXAvg = price[ dataSetSize - 2 ] ;
for (i= dataSetSize - 2; i > -1; --i)
cXAvg += (smoothingValue * (price[ i ] - cXAvg)) ;
return ( cXAvg) ;
}
Is there a more efficient way to code this to speed things up? I have a multi-threaded app and am using Visual C++.
Thank you.
Ouch!
Sure, multithreading can help. But you can almost assuredly improve the performance on a single threaded machine.
First, you are calculating it in the wrong direction. Only the most modern machines can do negative stride prefetching. Nearly all machihnes are faster for unit strides. I.e. changing the direction of the array so that you scan from low to high rather than high to low is almost always better.
Next, rewriting a bit - please allow me to shorten the variable names to make it easier to type:
avg = price[0]
for i
avg = s * (price[i] - avg)
By the way, I will start using shorthands p for price and s for smoothing, to save typing. I'm lazy...
avg0 = p0
avg1 = s*(p1-p0)
avg2 = s*(p2-s*(p1-p0)) = s*(p2-s*(p1-avg0))
avg3 = s*(p3-s*(p2-s*(p1-p0))) = s*p3 - s*s*p2 + s*s*avg1
and, in general
avg[i] = s*p[i] - s*s*p[i-1] + s*s*avg[i-2]
precalculating s*s
you might do
avg[i] = s*p[i] - s*s*(p[i-1] + s*s*avg[i-2])
but it is probably faster to do
avg[i] = (s*p[i] - s*s*p[i-1]) + s*s*avg[i-2])
The latency between avg[i] and avg[i-2] is then 1 multiply and an add, rather than a subtract and a multiply between avg[i] and avg[i-1]. I.e. more than twice as fast.
In general, you want to rewrite the recurrence so that avg[i] is calculated in terms of avg[j]
for j as far back as you can possibly go, without filling up the machine, either execution units or registers.
You are basically doing more multiplies overall, in order to get fewer chains of multiples (and subtracts) on the critical path.
Skipping from avg[i-2] to avg[i[ is easy, you can probably do three and four. Exactly how far
depends on what your machine is, and how many registers you have.
And the latency of the floating point adder and multiplier. Or, better yet, the flavour of combined multiply-add instruction you have - all modern machines have them. E.g. if the MADD or MSUB is 7 cycles long, you can do up to 6 other calculations in its shadow, even if you have only a single floating point unit. Fully pipelined. And so on. Less if pipelined every otherr cycle, as is common for double precision on older chips and GPUs. The assembly code should be software pipelined so that different loop iterations overlap. A good compiler should do that for you, but you might have to rewrite the C code to get the best performance.
By the way: I do NOT mean to suggest that you should be creating an array of avg[]. Instead, you would need two averages if avg[i] is calculated in terms of avg[i-2], and so on.
You can use an array of avg[i] if you want, but I think that you only need to have 2 or 4 avgs, called, creatively, avg0 and avg1 (2, 3...), and "rotate" them.
avg0 = p0
avg1 = s*(p1-p0)
/*avg2=reuses*/avg0 = s*(p2-s*(p1-avg0))
/*avg3=reusing*/avg3 = s*p3 - s*s*p2 + s*s*avg1
for i from 2 to N by 2 do
avg0 = s*p3 - s*s*p2 + s*s*avg0
avg1 = s*p3 - s*s*p2 + s*s*avg1
This sort of trick, splitting an accumulator or average into two or more,
combining multiple stages of the recurrence, is common in high performance code.
Oh, yes: precalculate s*s, etc.
If I have done it right, in infinite precision this would be identical. (Double check me, please.)
However, in finite precision FP your results may differ, hopefully only slightly, because of different roundings. If the unrolling is correct and the answers are significantly different, you probably have a numerically unstable algorithm. You're the one who wouyld know.
Note: floating point rounding errors will change the low bits of your answer.
Both because of rearranging the code, and using MADD.
I think that is probably okay, but you have to decide.
Note: the calculations for avg[i] and avg[i-1] are now independent. So you can use a SIMD
instruction set, like Intel SSE2, which permits operation on two 64 bit values in a 128 bit wide register at a time.
That'll be good for almost 2X, on a machine that has enough ALUs.
If you have enough registers to rewrite avg[i] in terms of avg[i-4]
(and I am sure you do on iA64), then you can go 4X wide,
if you have access to a machine like 256 bit AVX.
On a GPU... you can go for deeper recurrences, rewriting avg[i] in terms of avg[i-8], and so on.
Some GPUs have instructions that calculate AX+B or even AX+BY as a single instruction.
Although that's more common for 32 bit than for 64 bit precision.
At some point I would probably start asking: do you want to do this on multiple prices at a time?
Not only does this help you with multithreading, it will also suit it to running on a GPU. And using wide SIMD.
Minor Late Addition
I am a bit embarassed not to have applied Horner's Rule to expressions like
avg1 = s*p3 - s*s*p2 + s*s*avg1
giving
avg1 = s*(p3 - s*(p2 + avg1))
slightly more efficient. slightly different results with rounding.
In my defence, any decent compiler should do this for you.
But Hrner's rule makes the dependency chain deeper in terms of multiplies.
You might need to unroll and pipelined the loop a few more times.
Or you can do
avg1 = s*p3 - s2*(*p2 + avg1)
where you precalculate
s2 = s*s
Here is the most efficient way, albeit in C#, you would need to port it over to C++, which should be very simple. It calculates the EMA and Slope on the fly in the most efficient way.
public class ExponentialMovingAverageIndicator
{
private bool _isInitialized;
private readonly int _lookback;
private readonly double _weightingMultiplier;
private double _previousAverage;
public double Average { get; private set; }
public double Slope { get; private set; }
public ExponentialMovingAverageIndicator(int lookback)
{
_lookback = lookback;
_weightingMultiplier = 2.0/(lookback + 1);
}
public void AddDataPoint(double dataPoint)
{
if (!_isInitialized)
{
Average = dataPoint;
Slope = 0;
_previousAverage = Average;
_isInitialized = true;
return;
}
Average = ((dataPoint - _previousAverage)*_weightingMultiplier) + _previousAverage;
Slope = Average - _previousAverage;
//update previous average
_previousAverage = Average;
}
}

OpenMP - running things in parallel and some in sequence within them

I have a scenario like:
for (i = 0; i < n; i++)
{
for (j = 0; j < m; j++)
{
for (k = 0; k < x; k++)
{
val = 2*i + j + 4*k
if (val != 0)
{
for(t = 0; t < l; t++)
{
someFunction((i + t) + someFunction(j + t) + k*t)
}
}
}
}
}
Considering this is block A, Now I have two more similar blocks in my code. I want to put them in parallel, so I used OpenMP pragmas. However I am not able to parallelize it, because I am a tad confused that which variables would be shared and private in this case. If the function call in the inner loop was an operation like sum += x, then I could have added a reduction clause.
In general, how would one approach parallelizing a code using OpenMP, when we there is a nested for loop, and then another inner for loop doing the main operation.
I tried declaring a parallel region, and then simply putting pragma fors before the blocks, but definitely I am missing a point there!
Thanks,
Sayan
I'm more of a Fortran programmer than C so my knowledge of OpenMP in C-style is poor, and I'll leave the syntax to you.
Your easiest approach here is probably (I'll qualify this later) to simply parallelise the outermost loop. By default OpenMP will regard variable i as private, all the rest as shared. This is probably not what you want, you probably want to make j and k and t private too. I suspect that you want val private also.
I'm a bit puzzled by the statement at the bottom of your nest of loops (ie someFunction...), which doesn't seem to return any value at all. Does it work by side-effects ?
So, you shouldn't need to declare a parallel region enclosing all this code, and you should probably only parallelise the outermost loop. If you were to parallelise the inner loops too you might find your OpenMP installation either ignoring them, spawning more processes than you have processors, or complaining bitterly.
I say that your easiest approach is probably to parallelise the outermost loop because I've made some assumptions about what your program (fragment) is doing. If the assumptions are wrong you might want to parallelise one of the inner loops. Another point to check is that the number of executions of the loop(s) you parallelise is much greater than the number of threads you use. You don't want to have OpenMP run loops with a trip count of, say, 7, on 4 threads, the load balance would be very poor.
You're correct, the innermost statement would rather be someFunction((i + t) + someFunction2(j + t) + k*t).

Resources