Grid size in phase #4 of Harris' reduction optimization

I am learning about unrolling loops to optimize kernel computation.
This is a code snippet from the book Professional CUDA C Programming:
if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}
In my understanding, each thread works on four data blocks, processing a single element from each block.
So when we launch the kernel, grid.x is changed compared with the kernel without unrolling: the configuration becomes
reduceSmemUnroll<<<grid.x / 4, block>>>.
Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
My question: how should the grid size be determined when launching this kernel? Should it be grid.x/2 compared to the configuration without the multiple loads?

Yes, it should be half the number of blocks; Mark's presentation says so on the slide where the code snippet you quoted first occurs - slide 18:
Halve the number of blocks, and replace single load:
[code snippet]
with two loads and [the] first add of the reduction
Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while multiple elements remain. In real life that is not the case, so you may need to allow for slack (e.g. "half the grid size, plus one if it was odd").
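For concreteness, here is a minimal host-side sketch of that launch arithmetic (my own illustration; the kernel name reduceMultiLoad and its parameters are assumptions, not from the presentation). Rounding up instead of dividing exactly provides the slack for lengths that are not a power of 2:
int blockSize = 128;                                     // threads per block
int elemsPerBlock = 2 * blockSize;                       // each block consumes twice blockSize elements on its first load
int gridSize = (n + elemsPerBlock - 1) / elemsPerBlock;  // round up: the slack for non-power-of-2 n
// assumed kernel: one int of shared memory per thread for the in-block reduction
reduceMultiLoad<<<gridSize, blockSize, blockSize * sizeof(int)>>>(g_idata, g_odata, n);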

Related

Fast random/mutation algorithms (vector to vector)

I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
#include <array>
#include <cmath>
#include <cstdint>
#include <random>
#include <glm/glm.hpp>

class GradientNoise {
    std::uint64_t m_seed;
    // note: char-sized IntTypes are not portable for uniform_int_distribution
    std::uniform_int_distribution<std::uint8_t> distribution;
    const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0),
                                                    glm::vec2(1.0, -1.0), glm::vec2(-1.0, -1.0)};
public:
    GradientNoise(std::uint64_t seed) {
        m_seed = seed;
        distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
    }
    // 0 -> 1
    // just passes the value through, originally was perlin noise activation
    double nonLinearActivationFunction(double value) {
        //return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
        return value;
    }
    // 0 -> 1
    // cosine interpolation
    double interpolate(double a, double b, double t) {
        double mu2 = (1 - cos(t * M_PI)) / 2;
        return (a * (1 - mu2) + b * mu2);
    }
    double noise(double x, double y) {
        std::mt19937_64 rng;
        // first get the bottom left corner associated with these coordinates
        int corner_x = std::floor(x);
        int corner_y = std::floor(y);
        // then get the respective distance from that corner
        double dist_x = x - corner_x;
        double dist_y = y - corner_y;
        double corner_0_contrib; // bottom left
        double corner_1_contrib; // top left
        double corner_2_contrib; // top right
        double corner_3_contrib; // bottom right
        std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
        std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
        std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
        std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
        // each xy pair turns into a distance vector from the respective corner;
        // corner zero is our starting corner (bottom left)
        rng.seed(s1);
        corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
        rng.seed(s2);
        corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
        rng.seed(s3);
        corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
        rng.seed(s4);
        corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
        double u = nonLinearActivationFunction(dist_x);
        double v = nonLinearActivationFunction(dist_y);
        double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
        double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
        double total_xy = interpolate(x_bottom, x_top, v);
        return total_xy;
    }
};
I then generate an OpenGL texture to display it like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width * height * 4];
double octaves[5] = {2, 4, 8, 16, 32};
// temp_1 is a GradientNoise instance constructed elsewhere
for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
        double d_noise = 0;
        d_noise += temp_1.noise(j / octaves[0], i / octaves[0]);
        d_noise += temp_1.noise(j / octaves[1], i / octaves[1]);
        d_noise += temp_1.noise(j / octaves[2], i / octaves[2]);
        d_noise += temp_1.noise(j / octaves[3], i / octaves[3]);
        d_noise += temp_1.noise(j / octaves[4], i / octaves[4]);
        d_noise /= 5;
        uint8_t noise = static_cast<uint8_t>((d_noise * 128.0) + 128.0);
        temp_texture[j * 4 + (i * width * 4) + 0] = noise;
        temp_texture[j * 4 + (i * width * 4) + 1] = noise;
        temp_texture[j * 4 + (i * width * 4) + 2] = noise;
        temp_texture[j * 4 + (i * width * 4) + 3] = 255;
    }
}
This gives good results:
But gprof tells me that the Mersenne Twister is taking up 62.4% of my time, and growing with larger textures. Nothing else individually takes anywhere near as much time. While the Mersenne Twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required to make sure that the same x and y generate the same gradient at each integer point (so you need either a hash function or to seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
LCG (one time, then running 5 times before using)
Xorshiftplus
After one iteration
After 10,000 iterations.
I've tried:
Running the generator several times before using the output; this results in slow execution or simply different artifacts.
Using the output of two consecutive runs after the initial seed to seed the PRNG again and using the value afterwards. No difference in result.
What is happening? What can I do to get faster results that are of the same quality as the Mersenne Twister?
OK BIG UPDATE:
I don't know why this works - I know it has something to do with the prime number used - but after messing around a bit, it appears that the following works:
Step 1: incorporate the x and y values as seeds separately (and incorporate some other offset or additional seed value with them; this number should be a prime/non-trivial factor).
Step 2: use those two seed results to seed the generator again, feeding them back into the function (so, as geza said, the seeds made were bad).
Step 3: when getting the result, instead of taking it modulo the number of items wanted (4), or & 3, take the result modulo a prime first and then apply & 3. I'm not sure whether it matters that the prime is a Mersenne prime.
Here is the result with prime = 257 and xorshiftplus being used! (note I used 2048 by 2048 for this one, the others were 256 by 256)
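To make the recipe concrete, here is a minimal sketch of those three steps (my own illustration: the xorshift64 constants are the standard ones, but the mixing scheme, the helper names, and the use of 257 are just one way to follow the steps above):
#include <cstdint>

static std::uint64_t xorshift64(std::uint64_t s) {
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return s;
}

int gradient_index(int x, int y, std::uint64_t seed) {
    const std::uint64_t prime = 257;
    // Step 1: incorporate x and y as separate seeds, mixed with the prime
    std::uint64_t sx = xorshift64(std::uint64_t(x) * prime + seed);
    std::uint64_t sy = xorshift64(std::uint64_t(y) * prime + seed);
    // Step 2: feed both results back in to seed the generator again
    std::uint64_t r = xorshift64(sx ^ (sy << 1));
    // Step 3: take the result modulo a prime first, then apply & 3
    return int((r % prime) & 3);
}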
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad, because it needs good seeding. And providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's the result of the first hash on that page; it looks OK to me, and it is fast (I think it is much faster than the Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>

unsigned int hash(unsigned int a) {
    a = (a ^ 61) ^ (a >> 16);
    a = a + (a << 3);
    a = a ^ (a >> 4);
    a = a * 0x27d4eb2d;
    a = a ^ (a >> 15);
    return a;
}

unsigned int ivalue(int x, int y) {
    return hash(y << 16 | x) & 0xff;
}

float smooth(float x) {
    return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}

float value(float x, float y) {
    int ix = floor(x);
    int iy = floor(y);
    float fx = smooth(x - ix);
    float fy = smooth(y - iy);

    int v00 = ivalue(iy + 0, ix + 0);
    int v01 = ivalue(iy + 0, ix + 1);
    int v10 = ivalue(iy + 1, ix + 0);
    int v11 = ivalue(iy + 1, ix + 1);

    float v0 = v00*(1 - fx) + v01*fx;
    float v1 = v10*(1 - fx) + v11*fx;
    return v0*(1 - fy) + v1*fy;
}

unsigned char pic[1024*1024];

int main() {
    for (int y = 0; y < 1024; y++) {
        for (int x = 0; x < 1024; x++) {
            float v = 0;
            for (int o = 0; o <= 9; o++) {
                v += value(x/64.0f*(1 << o), y/64.0f*(1 << o))/(1 << o);
            }
            int r = rint(v*0.5f);
            pic[y*1024 + x] = r;
        }
    }
    FILE *f = fopen("x.pnm", "wb");
    fprintf(f, "P5\n1024 1024\n255\n");
    fwrite(pic, 1, 1024*1024, f);
    fclose(f);
}
If you want to understand how a hash function works (or better yet, which properties a good hash has), check out Bob's page, for example this one.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except for the Mersenne Twister, none of the PRNGs you tested seems fit for your purpose. As I have not done further tests myself, I can only suggest trying out and measuring further PRNGs.
The randomness of LCGs is known to be sensitive to the choice of their parameters. In particular, the period of an LCG depends on the m parameter - at most it will be m (your prime factor), and for many values it can be less.
Similarly, careful parameter selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while others do not. In order to isolate the cause, I would factor out the procedural generation logic and examine the PRNG output directly. An easy way to visualize the data is to build a grayscale image where each pixel value is a (possibly scaled) random value. For image-based work, I find this an easy way to spot issues that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your procedural generation output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
Note that your code seeds the PRNG, then generates one pseudorandom number from it. The reason for the non-randomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 2^64 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
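A minimal demonstration of that claim, using one common version of the generator (the state-update constants are Vigna's; the seed values here are arbitrary):
#include <cstdint>
#include <cstdio>

static std::uint64_t s[2]; // the 128-bit state, i.e. the seed before the first call

std::uint64_t xorshift128plus(void) {
    std::uint64_t s1 = s[0];
    const std::uint64_t s0 = s[1];
    const std::uint64_t result = s0 + s1;    // output computed from the raw state...
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // ...before the state is advanced
    return result;
}

int main() {
    s[0] = 0x100; s[1] = 0x200;              // a weak, structured seed
    std::printf("%llu\n", (unsigned long long)xorshift128plus()); // prints 768 = 0x100 + 0x200
}
The first output is just the sum of the two seed halves, so structured seeds produce structured first outputs.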
What you see is a practical demonstration of PRNG quality. The Mersenne Twister is one of the best PRNGs with good performance; it passes the Diehard tests. One should know that generating random numbers is not an easy computational task, so looking for better performance will inevitably cost quality. The LCG is known to be the simplest and worst PRNG ever designed, and it clearly shows the two-dimensional correlation in your picture. The quality of Xorshift generators largely depends on bitness and parameters. They are definitely worse than the Mersenne Twister, but some (xorshift128+) may work well enough to pass the BigCrush battery of TestU01 tests.
In other words, if you are running an important physical-modelling numerical experiment, you had better continue to use the Mersenne Twister, as it is known to be a good trade-off between speed and quality and it comes in many standard libraries. In less critical cases you may try the xorshift128+ generator. For ultimate results you would need a cryptographic-quality PRNG (none of those mentioned here may be used for cryptographic purposes).

How to avoid un-coalesced accesses in matrix multiplication CUDA kernel?

I am learning CUDA with the book 'Programming Massively Parallel Processors'. A practice problem from chapter 5 confuses me:
For tiled matrix multiplication, out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid un-coalesced accesses to global memory? (You only need to consider square blocks.)
In my understanding, BLOCK_SIZE has little to do with memory coalescing. As long as threads within a single warp access consecutive elements, we have coalesced accesses. I could not figure out where this kernel makes un-coalesced accesses to global memory. Any hints?
Here is the kernel's source code:
#define COMMON_WIDTH 512
#define ROW_LEFT 500
#define COL_RIGHT 250
#define K 1000
#define TILE_WIDTH 32
__device__ int D_ROW_LEFT = ROW_LEFT;
__device__ int D_COL_RIGHT = COL_RIGHT;
__device__ int D_K = K;
.....
__global__
void MatrixMatrixMultTiled(float *matrixLeft, float *matrixRight, float *output){
    __shared__ float sMatrixLeft[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sMatrixRight[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int col = bx * TILE_WIDTH + tx;
    int row = by * TILE_WIDTH + ty;
    float value = 0;
    for (int i = 0; i < ceil(D_K/(float)TILE_WIDTH); ++i){
        if (row < D_ROW_LEFT && row * D_K + i * TILE_WIDTH + tx < D_K){
            sMatrixLeft[ty][tx] = matrixLeft[row * D_K + i * TILE_WIDTH + tx];
        }
        if (col < D_COL_RIGHT && (ty + i * TILE_WIDTH) * D_COL_RIGHT + col < D_K){
            sMatrixRight[ty][tx] = matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
        }
        __syncthreads();
        for (int j = 0; j < TILE_WIDTH; j++){
            value += sMatrixLeft[ty][j] * sMatrixRight[j][tx];
        }
        __syncthreads();
    }
    if (row < D_ROW_LEFT && col < D_COL_RIGHT){
        output[row * D_COL_RIGHT + col] = value;
    }
}
Your question is incomplete, since the code you have posted makes no reference to BLOCK_SIZE, and that is certainly very relevant to the question posed in the book. More generally, questions that pose a kernel without the launch configuration are often incomplete, since the launch configuration is often relevant to both the correctness and the behavior of a kernel.
I've not re-read this portion of the book just now. However, I'll assume the kernel launch configuration includes a block dimension like the following (this information is absent from your question but should, in my opinion, have been included for a sensible question):
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(...,...);
And I will assume the kernel launch is given by something like:
MatrixMatrixMultTiled<<<dimGrid, dimBlock>>>(...);
Your statement, "As long as threads within a single warp access consecutive elements, we will have coalesced accesses," is a reasonable working definition. Let's show that it is violated for some choices of BLOCK_SIZE, given the above assumptions to cover the gaps in your incomplete question.
Coalesced access is a term that applies to global memory accesses only. We will therefore ignore accesses to shared memory. We will also, for this discussion, ignore accesses to the __device__ variables such as D_ROW_LEFT. (The access to those variables appears to be uniform. We can quibble about whether that constitutes coalesced access. My claim would be that it does constitute coalesced access, but we need not unpack that here.) Therefore we are left with just 3 "access" points:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
output[row * D_COL_RIGHT + col]
Now, to pick an example, let's suppose BLOCK_SIZE is 16. Will any of the above access points violate your statement "threads within single warp access consecutive elements"?
Let's start with the block (0,0). Therefore row is equal to threadIdx.y and col is equal to threadIdx.x. Let's consider the first warp in that block. Therefore the first 16 threads in that warp will have a threadIdx.y value of 0, and their threadIdx.x values will be increasing from 0..15. Likewise the second 16 threads in that warp will have a threadIdx.y value of 1, and their threadIdx.x values will be increasing from 0..15.
Now let's compute the actual index generated for the first access point above, across the warp. Let's assume we are on the first loop iteration, so i is zero. Therefore this:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
reduces to:
matrixLeft[threadIdx.y * D_K + threadIdx.x];
D_K here is just the device copy of the K variable, which is 1000. Now let's evaluate the reduced index expression above across our selected warp (0) in our selected block (0,0):
warp lane:     0    1    2    3    4    5    6  ..   15    16    17    18  ..   31
threadIdx.x:   0    1    2    3    4    5    6  ..   15     0     1     2  ..   15
threadIdx.y:   0    0    0    0    0    0    0  ..    0     1     1     1  ..    1
index:         0    1    2    3    4    5    6  ..   15  1000  1001  1002  ..  1015
Therefore the generated index pattern here shows a discontinuity between the 16th and 17th threads in the warp, and the access pattern does not fit your previously stated condition:
"threads within a single warp access consecutive elements"
and we do not have coalesced access in this case (at least, for float quantities).
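To make the discontinuity concrete, here is a small host-side sketch (my own illustration, not from the book) that evaluates the reduced index expression across the 32 lanes of warp 0 in block (0,0) for two candidate block widths:
#include <cstdio>

int main() {
    const int D_K = 1000;
    const int widths[2] = {16, 32};            // candidate BLOCK_SIZE values
    for (int w = 0; w < 2; ++w) {
        int bs = widths[w];
        std::printf("BLOCK_SIZE = %d:\n", bs);
        for (int lane = 0; lane < 32; ++lane) {
            int tx = lane % bs;                // threadIdx.x for this lane
            int ty = lane / bs;                // threadIdx.y for this lane
            std::printf("  lane %2d -> index %d\n", lane, ty * D_K + tx);
        }
    }
}
With a width of 32 the warp's indices run 0..31 consecutively; with 16 they jump from 15 to 1000 mid-warp, which is exactly the break shown in the table.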

Which way to order a shared 2D/3D array for parallel reduction over 1 dimension in CUDA/OpenCL?

Overall goal
I have several reductions to make on a bipartite graph, represented by two dense arrays for the vertices and a dense array specifying whether an edge is present between the two. Say the two arrays are a0[] and a1[], and all edges go like e[i0][i1] (that is, from elements in a0 to elements in a1).
There are ~100+100 vertices, and ~100*100 edges, so each thread is responsible for one edge.
Task 1: max reduction
For each vertex in a0 I want to find the maximum over all vertices (in a1) connected to it, and then the same in reverse: having assigned the result to an array b0, for each vertex in a1 I want to find the maximum b0[i0] over the connected vertices.
To do this, I:
1) load into shared memory
#define DC_NUM_FROM_SHARED 16
#define DC_NUM_TO_SHARED 16
__global__ void max_reduce_down(
    Value* value1
    , Value* max_value_in_connected
    , int r0_size, int r1_size
    , bool** connected
    )
{
    int id_from, id_to;
    id_from = blockIdx.x * blockDim.x + threadIdx.x;
    id_to = blockIdx.y * blockDim.y + threadIdx.y;
    bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
    // load into shared memory
    __shared__ Value value[DC_NUM_TO_SHARED][DC_NUM_FROM_SHARED]; // FROM is the inner (consecutive) dimension
    if (within_bounds)
        value[threadIdx.y][threadIdx.x] = connected[id_to][id_from] ? value1[id_to] : 0;
    else
        value[threadIdx.y][threadIdx.x] = 0;
    __syncthreads();
    if (!within_bounds)
        return;
2) reduce
for (int stride = DC_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
    // fold the upper half of the tile onto the lower half
    value[threadIdx.y][threadIdx.x] = max(value[threadIdx.y][threadIdx.x], value[threadIdx.y + stride][threadIdx.x]);
    __syncthreads();
}
3) write back
max_value_in_connected[id_from] = value[0][threadIdx.x];
Task 2: best k
A similar problem, but the reduction is only over vertices in a0: I need to find the k best candidates from those connected in a1 (k is ~5).
1) I initialize the shared array to zero, except for the 1st place
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
__shared__ Value values[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; // champion overlaps
__shared__ int champs[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; // overlap champions
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
int i = threadIdx.y * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
if (within_bounds)
{
    values[i] = connected[id_to][id_from] * values1[id_to];
    champs[i] = connected[id_to][id_from] ? id_to : -1;
}
else
{
    values[i] = 0;
    champs[i] = -1;
}
for (int place = 1; place < CHAMP_COUNT; place++)
{
    i = (place * CHAMPS_NUM_TO_SHARED + threadIdx.y) * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
    values[i] = 0;
    champs[i] = -1;
}
if (!within_bounds)
    return;
__syncthreads();
2) reduce it
for (int stride = CHAMPS_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
    merge_2_champs(values, champs, CHAMP_COUNT, id_from, id_to, id_to + stride);
    __syncthreads();
}
3) write the results back
for (int place = 0; place < LOCAL_DESIRED_ACTIVITY; place++)
    champs0[place][id_from] = champs[place * CHAMPS_NUM_TO_SHARED * CHAMPS_NUM_FROM_SHARED + threadIdx.x];
Issue
How do I order (transpose) the elements in the shared array, so that memory access uses the cache better?
Does it matter at this point, or is there much more to gain from other optimizations?
Would it be better to transpose the edge matrix if I needed to optimize for Task 2? (As far as I understand, there is a symmetry in Task 1, so it doesn't matter there.)
P.S.
I have put off unrolling loops and doing the first reduction iteration while loading, since I thought that too complicated to attempt before exploring simpler routes.
For Task 2, it would be nice not to load zero elements, since the array never needs to grow and only starts shrinking once log k steps have been made. This would make it k times more compact in shared memory! But I dread the resulting index math.
Syntax and Correctness
The unusual types are just typedef'ed ints/chars/etc. - AFAIK, on GPUs it makes sense to compactify those as much as possible. I have not run the code yet; no need to check for indexing errors.
Also, I am using CUDA, but I am interested in an OpenCL perspective as well, since I think the best solution should be the same, and I will be using OpenCL in the future anyway.
OK, I think I figured this out.
The two alternatives I am considering are to have reductions work over the y dimension, independently for each x, or vice versa (the x dimension being the contiguous one). In any case, the scheduler assembles threads into warps along the x dimension, so some coherence is guaranteed. However, having coherence extend beyond a warp would be great. Also, due to the 2D/3D nature of the shared arrays, one has to limit the dimensions to 16 or even 8.
To ensure coalescing within a warp, the scheduler has to assemble warps along the x dimension.
If reducing over the x dimension, after each iteration the number of active threads in a warp halves. However, if reducing over the y dimension, it is the number of active warps that halves.
So I need to reduce over y - unless the transpose (load) is the bottleneck, which would be an abnormal case.
Coalesced buffer reads really matter; kernels can be 32x slower if you don't use them. It can be worth doing a re-arrangement pass if it means being able to do so (of course, the re-arrangement pass needs to be coalesced as well, but you can often leverage shared local memory to do this).
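As a sketch of such a re-arrangement pass (assuming a tile width of 16 and row-major float data; this is the standard shared-memory transpose pattern, not code from the question):
#define TILE 16

__global__ void transpose_coalesced(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TILE][TILE + 1];       // +1 pads away shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;         // swap block indices for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}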

How to make sum calculations without using atomic in CUDA

In the code below, how can I calculate the sum_array values without using atomicAdd?
Kernel method
__global__ void calculate_sum( int width,
                               int height,
                               int *pntrs,
                               int2 *sum_array )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if ( row >= height || col >= width ) return;
    int idx = pntrs[ row * width + col ];
    //atomicAdd( &sum_array[ idx ].x, col );
    //atomicAdd( &sum_array[ idx ].y, row );
    sum_array[ idx ].x += col;
    sum_array[ idx ].y += row;
}
Launch Kernel
dim3 dimBlock( 16, 16 );
dim3 dimGrid( ( width  + ( dimBlock.x - 1 ) ) / dimBlock.x,
              ( height + ( dimBlock.y - 1 ) ) / dimBlock.y );
"Reduction" is the general name for this kind of problem. Look at the parallel reduction presentation for further explanation, or use Google for other examples.
The general way to solve it is to compute partial sums over segments of global memory inside the thread blocks and store the results in global memory. Afterwards, copy the partial results to CPU memory, sum them on the CPU, and copy the result back to GPU memory. You can avoid the copying by running another parallel sum over the partial results instead.
Another approach is to use highly optimized CUDA libraries such as Thrust or CUDPP, which contain functions that do exactly this.
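For instance, with Thrust the whole reduction collapses to one call (a sketch, assuming the data already lives in a device vector of ints):
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int sum_with_thrust(const thrust::device_vector<int>& d_data)
{
    // thrust::reduce runs the entire tree reduction on the GPU
    return thrust::reduce(d_data.begin(), d_data.end(), 0);
}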
My CUDA is very, very rusty, but this is roughly how you do it (courtesy of "CUDA by Example", which I strongly suggest you read):
https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
Do a better partitioning of the array you need to sum: threads in CUDA are lightweight, but not so much that you can spawn one for just two additions and hope to get any performance benefit in return.
At this point, each thread will be tasked with summing over a slice of your data: create a shared int array as big as the number of your threads, where each thread will save the partial sum it computed.
Synchronize the threads and reduce the shared memory array:
(please, take it as pseudocode)
__global__ void sum_kernel(const int *data, int n, int *block_results)
{
    extern __shared__ int partial[];  // one slot per thread, sized at launch
    int tid = threadIdx.x;
    // Code to sum over a slice: loop over this thread's subset
    // and accumulate into "localsum" (a local variable)
    int localsum = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        localsum += data[i];
    // Save the result in the shared memory
    partial[tid] = localsum;
    // Synchronize the threads:
    __syncthreads();
    // From now on partial is filled with the results of all computations: you can reduce partial
    // we'll do it the illiterate way, using a single thread (it can be easily parallelized)
    if (tid == 0) {
        for (int i = 1; i < blockDim.x; ++i)
            partial[0] += partial[i];
        block_results[blockIdx.x] = partial[0];
    }
}
and off you go: partial[0] will hold your sum (or computation).
See the dot product example in "CUDA by example" for a more rigorous discussion of the topic and a reduction algorithm that runs in about O(log(n)).
Hope this helps

How to absolute 2 double or 4 floats using SSE instruction set? (Up to SSE4)

Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long, holding doubles; I may drop down to float if I don't need the precision of doubles.
double sum = 0.0;
for (k = 0; k < 3072; k++) {
    sum += fabs(sima[k] - simb[k]);
}
double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));
Anyway, my current problem is how to do the fabs step in an SSE register, for doubles or floats, so that the whole calculation stays in SSE registers, remains fast, and I can parallelize all of the steps by partly unrolling this loop.
Here are some resources I've found: fabs() asm, or possibly this flipping-the-sign approach on SO; the weakness of the second one, however, is that it would need a conditional check.
I suggest using a bitwise AND with a mask. Positive and negative values have the same representation except for the most significant bit, which is 0 for positive values and 1 for negative values; see the double-precision number format. You can use one of these:
inline __m128 abs_ps(__m128 x) {
    static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31
    return _mm_andnot_ps(sign_mask, x);
}

inline __m128d abs_pd(__m128d x) {
    static const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63
    return _mm_andnot_pd(sign_mask, x); // ~sign_mask & x
}
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:
double norm(const double* sima, const double* simb) {
    __m128d* sima_pd = (__m128d*) sima;
    __m128d* simb_pd = (__m128d*) simb;
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    for (int k = 0; k < 3072/2; k += 2) {
        // _mm_add_pd instead of += so the code also compiles outside GCC
        sum1 = _mm_add_pd(sum1, abs_pd(_mm_sub_pd(sima_pd[k], simb_pd[k])));
        sum2 = _mm_add_pd(sum2, abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1])));
    }
    __m128d sum = _mm_add_pd(sum1, sum2);
    __m128d hsum = _mm_hadd_pd(sum, sum); // SSE3 horizontal add
    return _mm_cvtsd_f64(hsum);           // avoids the strict-aliasing pointer cast
}
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, so the CPU can actually perform one in the same cycle as an addition/subtraction. I suggest Agner Fog's optimization manuals.
Finally, I don't recommend using OpenMP here. The loop is too small, and the overhead of distributing the job among multiple threads might be bigger than any potential benefit.
The maximum of -x and x should be abs(x). Here it is in code:
x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x)
Probably the easiest way is as follows:
__m128d vsum = _mm_set1_pd(0.0);                             // init partial sums
for (k = 0; k < 3072; k += 2)
{
    __m128d va = _mm_load_pd(&sima[k]);                      // load 2 doubles from sima, simb
    __m128d vb = _mm_load_pd(&simb[k]);
    __m128d vdiff = _mm_sub_pd(va, vb);                      // calc diff = sima - simb
    __m128d vnegdiff = _mm_sub_pd(_mm_set1_pd(0.0), vdiff);  // calc neg diff = 0.0 - diff
    __m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff);          // calc abs diff = max(diff, -diff)
    vsum = _mm_add_pd(vsum, vabsdiff);                       // accumulate two partial sums
}
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
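For reference, the final combine could look like this (a sketch using only SSE2 intrinsics; vsum is the accumulator from the loop above):
__m128d vhigh = _mm_unpackhi_pd(vsum, vsum);   // duplicate the upper double into the low lane
__m128d vtot  = _mm_add_sd(vsum, vhigh);       // low lane now holds the total
double sum = _mm_cvtsd_f64(vtot);              // extract to a scalar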
