I'm trying to reduce the number of calls to std::max in my inner loop, as I'm calling it millions of times (no exaggeration!) and that's making my parallel code run slower than the sequential code. The basic idea (yes, this IS for an assignment) is that the code calculates the temperature at each grid point, iteration by iteration, until the maximum change is no more than a certain very small number (e.g. 0.01). The new temperature is the average of the temperatures in the cells directly above, below and beside it. Each cell changes by a different amount as a result, and I want to return the largest change in any cell for a given chunk of the grid.
I've got the code working, but it's slow because I'm making an excessive number of calls to std::max in the inner loop, and it's O(n*n). I have used a 1D domain decomposition.
Notes: tdiff doesn't depend on anything but what's in the matrix
the inputs of the reduction function are the result of the lambda function
diff is the greatest change in a single cell in that chunk of the grid over 1 iteration
blocked range is defined earlier in the code
t_new is new temperature for that grid point, t_old is the old one
max_diff = parallel_reduce(range, 0.0,
    //lambda function returns local max
    [&](blocked_range<size_t> range, double diff)-> double
    {
        for (size_t j = range.begin(); j<range.end(); j++)
        {
            for (size_t i = 1; i < n_x-1; i++)
            {
                tdiff = fabs(t_old[j*n_x+i] - t_new[j*n_x+i]);
                diff = std::max(diff, tdiff);
            }
        }
        return diff; //return biggest value of tdiff for that iteration - once per 'i'
    },
    //reduction function - takes in all the max diffs for each iteration, picks the largest
    [&](double a, double b)-> double
    {
        convergence = std::max(a,b);
        return convergence;
    }
);
How can I make my code more efficient? I want to make fewer calls to std::max but need to maintain the correct values. Using gprof I get:
Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
61.66      3.47      3.47   3330884     0.00     0.00  double const& std::max<double>(double const&, double const&)
38.03      5.61      2.14      5839     0.37     0.96  _ZZ4mainENKUlN3tbb13blocked_rangeImEEdE_clES1_d
ETA: 61.66% of the time spent executing my code is in the std::max calls, which are made over 3 million times. The reduce function is called for every output of the lambda function, so reducing the number of calls to std::max in the lambda function will also reduce the number of calls to the reduce function.
First of all, I would expect std::max to be inlined into its caller, so it's suspicious that gprof points it out as a separate hotspot. Are you perhaps profiling a debug configuration?
Also, I do not think that std::max is a culprit here. Unless some special checks are enabled in its implementation, I believe it should be equivalent to (diff<tdiff)?tdiff:diff. Since one of the arguments to std::max is the variable that you update, you can try if (tdiff>diff) diff = tdiff; instead, but I doubt it will give you much (and perhaps compilers can do such optimization on their own).
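To make the comparison above concrete, here is a minimal serial sketch of the two update styles. The function names and the use of std::vector are assumptions for illustration and are not taken from the question's code; both loops should return the same value, and an optimizing build may well compile them to very similar code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute per-cell change, updated with std::max.
double max_diff_with_std_max(const std::vector<double>& t_old,
                             const std::vector<double>& t_new)
{
    assert(t_old.size() == t_new.size());
    double diff = 0.0;
    for (std::size_t k = 0; k < t_old.size(); ++k) {
        const double tdiff = std::fabs(t_old[k] - t_new[k]);
        diff = std::max(diff, tdiff);
    }
    return diff;
}

// Same computation with an explicit compare-and-assign instead of std::max.
double max_diff_with_branch(const std::vector<double>& t_old,
                            const std::vector<double>& t_new)
{
    assert(t_old.size() == t_new.size());
    double diff = 0.0;
    for (std::size_t k = 0; k < t_old.size(); ++k) {
        const double tdiff = std::fabs(t_old[k] - t_new[k]);
        if (tdiff > diff)
            diff = tdiff;
    }
    return diff;
}
```

Timing both in an optimized build is the only reliable way to tell whether the change matters on a given compiler.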
Most likely, std::max is highlighted as the result of sampling skid; i.e. the real hotspot is in computations above std::max, which makes perfect sense, due to both more work and accesses to non-local data (arrays) that might have longer latency, especially if the corresponding locations are not in CPU cache.
Depending on the size of the rows (n_x) in your grid, processing it by rows like you do can be inefficient, cache-wise. It's better to reuse data from t_old as much as possible while those are in cache. Processing by rows, you either don't re-use a point from t_old at all until the next row (for i+1 and i-1 points) or only reuse it once (for two neighbors in the same row). A better approach is to process the grid by rectangular blocks, which helps to re-use data that are hot in cache. With TBB, the way to do that is to use blocked_range2d. It will need minimal changes in your code; basically, changing the range type and two loops inside the lambda: the outer and inner loops should iterate over range.rows() and range.cols(), respectively.
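The blocking idea can be illustrated without TBB. The sketch below is plain serial C++ and only shows the tile-by-tile traversal order; the tile edge of 64, the restriction to interior rows, and all names are assumptions. In the TBB version the two outer loops would disappear and the bounds of the inner loops would come from range.rows() and range.cols().

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest |t_old - t_new| over the interior of an n_y x n_x grid,
// visiting the grid tile by tile instead of one full row at a time.
double max_diff_tiled(const std::vector<double>& t_old,
                      const std::vector<double>& t_new,
                      std::size_t n_x, std::size_t n_y,
                      std::size_t tile = 64) // assumed tuning value
{
    assert(t_old.size() == n_x * n_y && t_new.size() == n_x * n_y);
    assert(tile > 0 && n_x >= 2 && n_y >= 2);
    double diff = 0.0;
    for (std::size_t j0 = 1; j0 < n_y - 1; j0 += tile) {
        const std::size_t j1 = std::min(j0 + tile, n_y - 1);
        for (std::size_t i0 = 1; i0 < n_x - 1; i0 += tile) {
            const std::size_t i1 = std::min(i0 + tile, n_x - 1);
            // One tile: rows [j0, j1), columns [i0, i1).
            for (std::size_t j = j0; j < j1; ++j)
                for (std::size_t i = i0; i < i1; ++i)
                    diff = std::max(diff,
                                    std::fabs(t_old[j * n_x + i] - t_new[j * n_x + i]));
        }
    }
    return diff;
}
```

A good tile size depends on the cache sizes of the machine, so it is worth measuring a few values.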
I ended up using parallel_for:
parallel_for(range, [&](blocked_range<size_t> range)
{
    double loc_max = 0.0;
    double tdiff;
    for (size_t j = range.begin(); j<range.end(); j++) {
        for (size_t i = 1; i < n_x-1; i++) {
            tdiff = fabs(t_old[j*n_x+i] - t_new[j*n_x+i]);
            loc_max = std::max(loc_max, tdiff);
        }
    }
    //reduction function - takes in all the max diffs for each iteration, picks the largest
    max_diff = std::max(max_diff, loc_max);
});
And now my code runs in under 2 seconds for an 8000x8000 grid :-)
I am writing some data to a bitmap file, and I have a loop that calculates the data. It runs 480,000 times, once for each pixel at 800 * 600 resolution, so each iteration has different arguments (coordinates) and a different return value, which is then stored in an array of size 480,000. This array is then used for further calculation of colours.
All these iterations combined take a lot of time, around a minute at runtime in Visual Studio (for different values at each execution). How can I ensure that the time is greatly reduced? It's really stressing me out.
Is it the fault of my machine (i5 9th gen, 8GB RAM)? Visual Studio 2019? Or the algorithm entirely? If it's the algorithm, what can I do to reduce its time?
Here's the loop that runs for each individual iteration:
int getIterations(double x, double y) //x and y are coordinates
{
    complex<double> z = 0; //These are complex numbers, imagine a pair<double>
    complex<double> c(x, y);
    int iterations = 0;
    while (iterations < max_iterations) // max_iterations has to be 1000 to get decent image quality
    {
        z = z * z + c;
        if (abs(z) > 2) // abs(z) = square root of the sum of squares of both elements in the pair
            return iterations;
        iterations++;
    }
    return iterations;
}
I don't know exactly how your abs(z) works, but based on your description it might be slowing your program down considerably.
You describe taking the sum of the squares of both elements of the complex number and then taking the square root of that. However your square root is implemented, it is likely more expensive than the multiplications and additions around it.
Instead, compare the squared magnitude directly: complex.x * complex.x + complex.y * complex.y > 4. That avoids computing the square root only to compare the result with 2.
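With std::complex, the squared magnitude is available as std::norm, so the comparison can be written without any square root. This is a minimal sketch: the name get_iterations, passing max_iterations as a parameter, and returning the iteration count when the loop runs out are assumptions about the parts of the question's code that are not shown.

```cpp
#include <cassert>
#include <complex>

// Escape-time count for the point c = (x, y).
// max_iterations is a parameter here, which is an assumption.
int get_iterations(double x, double y, int max_iterations = 1000)
{
    assert(max_iterations >= 0);
    std::complex<double> z(0.0, 0.0);
    const std::complex<double> c(x, y);
    int iterations = 0;
    while (iterations < max_iterations) {
        z = z * z + c;
        // std::norm(z) is re*re + im*im, so |z| > 2 becomes norm(z) > 4
        // and no square root is needed.
        if (std::norm(z) > 4.0)
            return iterations;
        ++iterations;
    }
    return iterations;
}
```

Points inside the set still cost the full max_iterations, so the saving per iteration matters most for exactly those pixels.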
Is there a reason the above has to be done at run time?
I mean: the result of this loop seems to depend only on "x" and "y" (which are just coordinates), so you could try to make all of these calculations constexpr and have them done at compile time, producing a precomputed map of results...
At the very least, try to build that map once during run-time initialisation.
Let's look at a simplified example function in GLSL:
void foo() {
    vec2 localData[16];
    // ...
    int i = ...; // somehow dependent on dynamic data (not known at compile time)
    localData[i] = x; // THE IMPORTANT LINE
}
It writes some value x to a dynamically determined index in a local array.
Now, replacing the line localData[i] = x; with
for( int j = 0; j < 16; ++j )
if( i == j )
localData[j] = x;
makes the code significantly faster. In several tested examples (different shaders) the execution time almost halved, even though there was much more going on in them than this write.
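The two forms store to the same element, which a C++ rendering can make explicit. This is only a semantic illustration with assumed names (Local, write_direct, write_looped) and float standing in for vec2; the timing difference reported here concerns the GPU shader compiler and is not expected to carry over to a CPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Local = std::array<float, 16>;

// Direct write: the index is only known at run time.
void write_direct(Local& localData, std::size_t i, float x)
{
    assert(i < localData.size());
    localData[i] = x;
}

// Looped write: every store is addressed by the loop counter, which
// becomes a compile-time constant once the loop is unrolled.
void write_looped(Local& localData, std::size_t i, float x)
{
    for (std::size_t j = 0; j < localData.size(); ++j)
        if (i == j)
            localData[j] = x;
}
```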
For example: in an order-independent transparency shader which, among other things, fetches 16 texels the timings are 39ms with the direct write and 23ms with the looped write. Nothing else changed!
The test hardware is a GTX 1080. The assembly returned by glGetProgramBinary is still too high-level. It contains one line in the first case and a loop+if surrounding an identical line in the second.
Why does this performance issue happen?
Is this true for all vendors?
Guess: localData is stored in 8 vec4 registers (the assembly does not say anything about that). I further assume that registers cannot be addressed with an index. If both are true, then the final binary must use some branch construct. The loop variant might be unrolled and result in a switch-like pattern, which is faster. But is that common for all vendors? Why can't the compiler use whatever results from the for loop as the default for such writes?
Further experiments have shown that the reason is the use of a different memory type for the array. The (unrolled) looped variant uses registers, while the random access variant switches to local memory.
Local memory is usually placed in global memory, but is private to each thread. It is likely that accesses to this local array are going to be cached (L2?).
The experiments to verify this reasoning were the following:
Manual versions of unrolled loops (measured in an insertion sort with 16 elements over 1M pixels):
Base line: localData[i] = x 33ms
For loop: for j + if i=j 16.8ms
Switch: switch(i) { case 0: localData[0] ...: 16.92ms
If else tree (splitting in halves): 16.92ms
If list (plain manual unrolled): 16.8ms
=> All kinds of branch constructs result in more or less the same timings. So it is not a bad branching behavior as initially guessed.
Multiple vs. one vs no random access (32 element insertion sort)
2x localData[i] = x 47ms
1x localData[i] = x 45ms
0x localData[i] = x 16ms
=> As long as there is at least one random access the performance will be bad. This means there is a global decision changing the behavior of localData -- most likely the use of a different memory. Using more than one random access does not make things worse much, because of caching.
I made a very naive implementation of the mergesort algorithm, which I adapted to run on CUDA with very minimal changes to the implementation. The algorithm code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
int i,j,k;
//Copy in reverse order the second subarray
if(aux[j]<aux[i] || i==(m+1))
//What this code do is performing a local merge
//of the array
void basic_merge(int* aux, int* data,int n)
int i = blockIdx.x*blockDim.x + threadIdx.x;
int tn = n / (blockDim.x*gridDim.x);
int l = i * tn;
int r = l + tn;
//printf("Thread %d: %d,%d: \n",i,l,r);
for(int i{1};i<=(tn/2)+1;i*=2)
for(int j{l+i};j<(r+1);j+=2*i)
//Complete the merge
for(int i{tn};i<(n+1);i+=2*tn)
The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always much, much worse than the same code on the CPU running on 8 threads (my CPU has hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements on the CPU takes a few hundred milliseconds; on the GPU it takes up to 10 minutes (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please give me some comments? I strongly suspect that the problem is in the final merge operation performed by the first thread. At that point we have a certain number of subarrays (the exact number depends on the number of threads) which are sorted and need to be merged, and this is done by just one thread (one tiny GPU thread).
I think I should use some kind of reduction here, so that each thread performs further merges in parallel and the "Complete the merge" step just merges the last two sorted subarrays.
I'm very new to CUDA.
Thanks for the link. I must admit I still need some time to learn CUDA better before I can take full advantage of that material. Anyway, I was able to rewrite the sorting function so that it uses multiple threads for as long as possible; my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by only one multiprocessor.
Now, after the first merge, I use up to (1/2)*(n/b) threads each time, where n is the amount of data to sort and b is the size of the chunk of data sorted by each thread.
The improvement in performance is surprising: using only 1024 threads, it takes about 10 seconds to sort 30 million elements. Well, this is still a poor result, unfortunately! The problem is in the thread synchronization, but first things first, let's see the code:
void basic_merge(int* aux, int* data,int n)
int k = blockIdx.x*blockDim.x + threadIdx.x;
int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
b = pow( (float)2, b);
int l=k*b;
int r=min(l+b-1,n-1);
for(int m{1};m<=(r-l);m=2*m)
for(int i{l};i<=r;i+=2*m)
}else break;
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads instead of the 65000 or more I can run on my CUDA device, because __syncthreads does not work as a synchronization primitive at grid level, only at block level!
So I can synchronize up to 1024 threads, which is the number of threads supported per block. Without proper synchronization each thread messes up the data of the others, and the merging procedure does not work.
In order to boost the performance I need some kind of synchronization between all the threads in the grid. It seems that no API exists for this purpose, and I read about a solution which involves multiple kernel launches from the host code, using the host as a barrier for all the threads.
I have a plan for how to implement this technique in my mergesort function, and I will provide you with the code in the near future. Do you have any suggestions of your own?
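The multiple-launch scheme can be sketched on the CPU to show where the barrier falls. This is plain C++ rather than CUDA, and the names merge_pass and bottom_up_merge_sort are hypothetical: each call to merge_pass stands in for one kernel launch, each iteration of its loop for the work of one thread, and the return from the call for the grid-wide barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One pass: merge neighbouring sorted runs of length `width` from src into dst.
// Each loop iteration touches a disjoint slice, so the iterations are
// independent and could each be handled by a separate thread.
void merge_pass(const std::vector<int>& src, std::vector<int>& dst, std::size_t width)
{
    assert(src.size() == dst.size() && width > 0);
    const std::size_t n = src.size();
    for (std::size_t lo = 0; lo < n; lo += 2 * width) {
        const std::size_t mid = std::min(lo + width, n);
        const std::size_t hi = std::min(lo + 2 * width, n);
        std::merge(src.begin() + lo, src.begin() + mid,
                   src.begin() + mid, src.begin() + hi,
                   dst.begin() + lo);
    }
}

// Host-side driver: the end of each pass acts as the barrier between passes.
void bottom_up_merge_sort(std::vector<int>& data)
{
    std::vector<int> aux(data.size());
    for (std::size_t width = 1; width < data.size(); width *= 2) {
        merge_pass(data, aux, width); // would be one kernel launch
        data.swap(aux);               // output of this pass feeds the next
    }
}
```

The number of independent merges halves with every pass, so the last few passes still have little parallelism; that part of the problem is not addressed by this sketch.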
It looks like all the work is being done in global memory. Each write and each read takes a long time, which makes the function slow. I think it would help to first copy your data to __shared__ memory, do the work there, and then copy the results back to global memory once the sorting is completed for that block.
A global memory access takes about 400 clock cycles (or about 100 if the data happens to be in the L2 cache). Shared memory, on the other hand, only takes 1-3 clock cycles to write and read.
The above would help with performance a lot. Some other, very minor things you can try are:
(1) Remove the first __syncthreads(); it is not really doing anything, because no data is being passed between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. This is being calculated over and over by every thread when it really only needs to be calculated once.
I tried to follow your algorithm but was not able to. The variable names were hard to follow... or your code is above my head and I cannot follow it. =) Hope the above helps.
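As a sketch of suggestion (2), the chunk size could be computed once on the host with integer arithmetic and passed to the kernel as an argument. The function name chunk_size is hypothetical, and the loop is intended to match the log2/pow expression only up to floating-point rounding in the original.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two strictly greater than ceil(n / total_threads),
// where total_threads corresponds to blockDim.x * gridDim.x.
std::size_t chunk_size(std::size_t n, std::size_t total_threads)
{
    assert(n > 0 && total_threads > 0);
    const std::size_t per_thread = (n + total_threads - 1) / total_threads; // ceiling division
    std::size_t b = 1;
    while (b <= per_thread)
        b *= 2;
    return b;
}
```

For example, 8192 elements spread over 1024 threads gives 8 per thread, and the function returns 16, the same value the in-kernel expression produces for that case.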
I recently read a post about for loops over a range of integers being slower than the corresponding while loops, which is true, but wanted to see if the same held up for iterating over existing sequences and was surprised to find the complete opposite by a large margin.
First and foremost, I'm using the following function for timing:
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
and I'm using a simple sequence of Integers:
val seq = List.range(0, 10000)
(I also tried creating this sequence a few other ways in case the way this sequence was accessed affected the run time. Using the Range type certainly did. This should ensure that each item in the sequence is an independent object.)
I ran the following:
time {
  for(item <- seq) {
    println(item)
  }
}
time {
  var i = 0
  while(i < seq.size) {
    println(seq(i))
    i += 1
  }
}
I printed the results so to ensure that we're actually accessing the values in both loops. The first code snippet runs in an average of 33 ms on my machine. The second takes an average of 305 ms.
I tried adding the mutable variable i to the for loop, but it only adds a few milliseconds. The map function gets similar performance to a for loop, as expected. For whatever reason, this doesn't seem to occur if I use an array (converting the above defined seq with seq.toArray). In such a case, the for loop takes 90 ms and the while loop takes 40 ms.
What is the reason for this major performance difference?
The reason is: complexity. seq(i) is Θ(i) for List, which means your whole loop takes quadratic time. The foreach method, however, is linear.
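The cost difference can be made visible by counting node-to-node steps on a linked list. The sketch below uses C++ std::list as a stand-in for Scala's List, which is an analogy and not the same implementation; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Counts steps when every element is reached by position, restarting
// from the head each time, as positional access on a linked list must.
std::size_t hops_indexed(const std::list<int>& seq)
{
    std::size_t hops = 0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        auto it = seq.begin();
        for (std::size_t k = 0; k < i; ++k) {
            ++it;
            ++hops;
        }
        assert(it != seq.end()); // `it` now points at element i
    }
    return hops;
}

// Counts steps for one front-to-back traversal, as a foreach would do.
std::size_t hops_traversal(const std::list<int>& seq)
{
    std::size_t hops = 0;
    for (auto it = seq.begin(); it != seq.end(); ++it)
        ++hops;
    return hops;
}
```

For n elements the indexed version takes n(n-1)/2 steps, which is 49,995,000 for the 10,000-element sequence in the question, against 10,000 for the single traversal.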
If you compile with -optimize, the for loop version will likely be even faster, because List.foreach should be inlined, eliminating the cost of the lambda.
I'm a beginner at CUDA and am having some difficulties with it.
If I have an input vector A and a result vector B, both of size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to parallelise both the outer and inner loops simultaneously.
edit: Have a device with cc 2.0
// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
    double ai = a[i];
    for(j=0; j<1000; j++) {
        double aj = a[j];
        if (i == j)
            continue; // skip a[i] itself
        result += ai - aj;
    }
}
I have this at the moment:
//in host
int i;
for(i=0; i<1000; i++) {
    kernelFunc <<<2, 500>>> (i, d_a);
}
Is there a way to eliminate the serial loop?
Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const int length){
    unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < length){
        double my_a = a[idx];
        double result = 0.0;
        for (int j=0; j<length; j++)
            result += my_a - a[j];
        b[idx] = result;
    }
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements ?) so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we are subtracting a[i] from itself, as that should be approximately zero anyway, and should not disturb your results. If you're concerned about that, you can special-case it out, easily enough.
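A host-side reference can be handy for checking the kernel output and for confirming that the self term makes no difference. This is a plain C++ sketch with a hypothetical name (diffs_reference) and std::vector in place of device pointers; the skip_self flag switches the special case on and off.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// b[i] = sum over j of (a[i] - a[j]); skip_self drops the j == i term.
void diffs_reference(const std::vector<double>& a, std::vector<double>& b,
                     bool skip_self)
{
    assert(b.size() == a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        double result = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j) {
            if (skip_self && i == j)
                continue;
            result += a[i] - a[j];
        }
        b[i] = result;
    }
}
```

For finite inputs a[i] - a[i] is exactly 0.0, so both settings of skip_self give the same b.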
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store its result in c[row*len+col]. I would then launch a second (1D) kernel to sum the columns of c (each thread has a loop to sum a column) to create the result vector b. However, I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.