How to absolute 2 double or 4 floats using SSE instruction set? (Up to SSE4) - gcc

Here's the sample C code that I am trying to accelerate using SSE, the two arrays are 3072 element long with doubles, may drop it down to float if i don't need the precision of doubles.
double sum = 0.0;
for(k = 0; k < 3072; k++) {
sum += fabs(sima[k] - simb[k]);
}
double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));
Anyway my current problem is how to do the fabs step in a SSE register for doubles or float so that I can keep the whole calculation in the SSE registers so that it remains fast and I can parallelize all of the steps by partly unrolling this loop.
Here's some resources I've found fabs() asm or possibly this flipping the sign - SO however the weakness of the second one would need a conditional check.

I suggest using bitwise and with a mask. Positive and negative values have the same representation, only the most significant bit differs, it is 0 for positive values and 1 for negative values, see double precision number format. You can use one of these:
inline __m128 abs_ps(__m128 x) {
static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31
return _mm_andnot_ps(sign_mask, x);
}
inline __m128d abs_pd(__m128d x) {
static const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63
return _mm_andnot_pd(sign_mask, x); // !sign_mask & x
}
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:
double norm(const double* sima, const double* simb) {
__m128d* sima_pd = (__m128d*) sima;
__m128d* simb_pd = (__m128d*) simb;
__m128d sum1 = _mm_setzero_pd();
__m128d sum2 = _mm_setzero_pd();
for(int k = 0; k < 3072/2; k+=2) {
sum1 += abs_pd(_mm_sub_pd(sima_pd[k], simb_pd[k]));
sum2 += abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1]));
}
__m128d sum = _mm_add_pd(sum1, sum2);
__m128d hsum = _mm_hadd_pd(sum, sum);
return *(double*)&hsum;
}
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions our of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, the CPU can actually perform it in the same cycle as addition/subtraction. I suggest Agner Fog's optimization manuals.
Finally, I don't recommend using openMP. The loop is too small and the overhead of distribution the job among multiple threads might be bigger than any potential benefit.

The maximum of -x and x should be abs(x). Here it is in code:
x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x)

Probably the easiest way is as follows:
__m128d vsum = _mm_set1_pd(0.0); // init partial sums
for (k = 0; k < 3072; k += 2)
{
__m128d va = _mm_load_pd(&sima[k]); // load 2 doubles from sima, simb
__m128d vb = _mm_load_pd(&simb[k]);
__m128d vdiff = _mm_sub_pd(va, vb); // calc diff = sima - simb
__m128d vnegdiff = mm_sub_pd(_mm_set1_pd(0.0), vdiff); // calc neg diff = 0.0 - diff
__m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff); // calc abs diff = max(diff, - diff)
vsum = _mm_add_pd(vsum, vabsdiff); // accumulate two partial sums
}
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.

Related

Out of Order execution and loops

I've always had the same question in mind about instruction level parallelism and loops:
How does a cpu parallelize loops? Do they execute multiple successive iterations at once? Do they execute subsequent instructions independent on the loop or both?
Consider the following function for reversing an array as an example:
// pretend that uint64_t is strict-aliasing safe,
// like it was defined with GNU C __attribute__((may_alias, aligned(4)))
void reverse_SSE2(uint32_t* arr, size_t length)
{
__m128i* startPtr = (__m128i*)arr;
__m128i* endPtr = (__m128i*)(arr + (length - sizeof(__m128i) / sizeof(uint32_t)));
while (startPtr < (__m128i*)((uint32_t*)endPtr - (sizeof(__m128i) / sizeof(uint32_t) - 1)))
{
__m128i lo = _mm_loadu_si128(startPtr);
__m128i hi = _mm_loadu_si128(endPtr);
__m128i reverseLo = _mm_shuffle_epi32(lo, _MM_SHUFFLE(0, 1, 2, 3));
__m128i reverseHi = _mm_shuffle_epi32(hi, _MM_SHUFFLE(0, 1, 2, 3));
_mm_storeu_si128(startPtr++, reverseHi);
_mm_storeu_si128(endPtr--, reverseLo);
}
uint64_t* startPtr64 = (uint64_t*)startPtr;
uint64_t* endPtr64 = (uint64_t*)endPtr + 1;
if (startPtr64 < (uint64_t*)((uint32_t*)endPtr64 - (sizeof(uint64_t) / sizeof(uint32_t) - 1)))
{
uint64_t lo = *startPtr64;
uint64_t hi = *endPtr64;
lo = math.rol(lo, 32);
hi = math.ror(hi, 32);
*startPtr64++ = hi;
*endPtr64 = lo;
uint32_t* startPtr32 = (uint32_t*)startPtr64;
uint32_t* endPtr32 = (uint32_t*)math.max((uint64_t)((uint32_t*)endPtr64 - 1), (uint64_t)startPtr32);
uint32_t lo32 = *startPtr32;
uint32_t hi32 = *endPtr32;
*startPtr32 = hi32;
*endPtr32 = lo32;
}
else
{
uint32_t* startPtr32 = (uint32_t*)startPtr64;
uint32_t* endPtr32 = (uint32_t*)endPtr64 + 1;
while (endPtr32 > startPtr32)
{
uint32_t lo = *startPtr32;
uint32_t hi = *endPtr32;
*startPtr32++ = hi;
*endPtr32-- = lo;
}
}
}
This code could be rewritten for maximum performance in a few ways, depending on the answers to my questions (and yes, in this specific example it would be irrelevant but this is just an example; we could continue with vectors that (may) do overlapping stores, as long as they load data that hasn't already been reversed).
The while loop could be unrolled, since The throughput of the instructions used is higher than what one iteration issues. If the hardware "unrolls" the loop, that would not be necessary.
All of the code following the while loop could be rewritten to not depend on the startPtr and endPtr values and thus the while loop, by calculating the number of remainders and pointers from it. If the CPU cannot execute other instructions while looping, that would introduce additional overhead. If it can, the function would end with the while loop simultaneously.
If the code following the while loop does not execute in parallel to it, that code might better be moved to the top of the function, so that one loop iteration can start executing in parallel to that code, since the initial check is cheap to calculate. The potential of introducing another cache miss does not matter in this case.
As an additional question (since the code following the loop has a branch and another loop): How are flags registers handled in superscalar CPUs? Are there multiple physical ones?

Using SIMD to find the biggest difference of two elements

I wrote an algorithm to get the biggest difference between two elements in an std::vector where the bigger of the two values must be at a higher index than the lower value.
unsigned short int min = input.front();
unsigned short res = 0;
for (size_t i = 1; i < input.size(); ++i)
{
if (input[i] <= min)
{
min = input[i];
continue;
}
int dif = input[i] - min;
res = dif > res ? dif : res;
}
return res != 0 ? res : -1;
Is it possible to optimize this algorithm using SIMD? I'm new to SIMD and so far I've been unsuccessful with this one
You didn't specify any particular architecture so I'll keep this mostly architecture neutral with an algorithm described in English. But it requires a SIMD ISA that can efficiently branch on SIMD compare results to check a usually-true condition, like x86 but not really ARM NEON.
This won't work well for NEON because it doesn't have a movemask equivalent, and SIMD -> integer causes stalls on many ARM microarchitectures.
The normal case while looping over the array is that an element, or a whole SIMD vector of elements, is not a new min, and not diff candidate. We can quickly fly through those elements, only slowing down to get the details right when there's a new min. This is like a SIMD strlen or SIMD memcmp, except instead of stopping at the first search hit, we just go scalar for one block and then resume.
For each vector v[0..7] of the input array (assuming 8 int16_t elements per vector (16 bytes), but that's arbitrary):
SIMD compare vmin > v[0..7], and check for any elements being true. (e.g. x86 _mm_cmpgt_epi16 / if(_mm_movemask_epi8(cmp) != 0)) If there's a new min somewhere, we have a special case: the old min applies to some elements, but the new min applies to others. And it's possible there are multiple new-min updates within the vector, and new-diff candidates at any of those points.
So handle this vector with scalar code (updating a scalar diff which doesn't need to be in sync with the vector diffmax because we don't need position).
Broadcast the final min to vmin when you're done. Or do a SIMD horizontal min so out-of-order execution of later SIMD iterations can get started without waiting for a vmin from scalar. Should work well if the scalar code is branchless, so there are no mispredicts in the scalar code that cause later vector work to be thrown out.
As an alternative, a SIMD prefix-sum type of thing (actually prefix-min) could produce a vmin where every element is the min up to that point. (parallel prefix (cumulative) sum with SSE). You could always do this to avoid any branching, but if new-min candidates are rare then it's expensive. Still, it could be viable on ARM NEON where branching is hard.
If there's no new min, SIMD packed max diffmax[0..7] = max(diffmax[0..7], v[0..7]-vmin). (Use saturating subtraction so you don't get wrap-around to a large unsigned difference, if you're using unsigned max to handle the full range.)
At the end of the loop, do a SIMD horizontal max of the diffmax vector. Notice that since we don't need the position of the max-difference, we don't need to update all elements inside the loop when one finds a new candidate. We don't even need to keep the scalar special-case diffmax and SIMD vdiffmax in sync with each other, just check at the end to take the max of the scalar and SIMD max diffs.
SIMD min/max is basically the same as a horizontal sum, except you use packed-max instead of packed-add. For x86, see Fastest way to do horizontal float vector sum on x86.
Or on x86 with SSE4.1 for 16-bit integer elements, phminposuw / _mm_minpos_epu16 can be used for min or max, signed or unsigned, with appropriate tweaks to the input. max = -min(-diffmax). You can treat diffmax as unsigned because it's known to be non-negative, but Horizontal minimum and maximum using SSE shows how to flip the sign bit to range-shift signed to unsigned and back.
We probably get a branch mispredict every time we find a new min candidate, or else we're finding new min candidates too often for this to be efficient.
If new min candidates are expected frequently, using shorter vectors could be good. Or on discovering there's a new-min in a current vector, then use narrower vectors to only go scalar over fewer elements. On x86, you might use bsf (bit-scan forward) to find which element had the first new-min. That gives your scalar code a data dependency on the vector compare-mask, but if the branch to it was mispredicted then the compare-mask will be ready. Otherwise if branch-prediction can somehow find a pattern in which vectors need the scalar fallback, prediction+speculative execution will break that data dependency.
Unfinished / broken (by me) example adapted from #harold's deleted answer of a fully branchless version that constructs a vector of min-up-to-that-element on the fly, for x86 SSE2.
(#harold wrote it with suffix-max instead of min, which is I think why he deleted it. I partially converted it from max to min.)
A branchless intrinsics version for x86 could look something like this. But branchy is probably better unless you expect some kind of slope or trend that makes new min values frequent.
// BROKEN, see FIXME comments.
// converted from #harold's suffix-max version
int broken_unfinished_maxDiffSSE(const std::vector<uint16_t> &input) {
const uint16_t *ptr = input.data();
// construct suffix-min
// find max-diff at the same time
__m128i min = _mm_set_epi32(-1);
__m128i maxdiff = _mm_setzero_si128();
size_t i = input.size();
for (; i >= 8; i -= 8) {
__m128i data = _mm_loadu_si128((const __m128i*)(ptr + i - 8));
// FIXME: need to shift in 0xFFFF, not 0, for min.
// or keep the old data, maybe with _mm_alignr_epi8
__m128i d = data;
// link with suffix
d = _mm_min_epu16(d, _mm_slli_si128(max, 14));
// do suffix-min within block.
d = _mm_min_epu16(d, _mm_srli_si128(d, 2));
d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xFA));
d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xEE));
max = d;
// update max-diff
__m128i diff = _mm_subs_epu16(data, min); // with saturation to 0
maxdiff = _mm_max_epu16(maxdiff, diff);
}
// horizontal max
maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 2));
maxdiff = _mm_max_epu16(maxdiff, _mm_shuffle_epi32(maxdiff, 0xFA));
maxdiff = _mm_max_epu16(maxdiff, _mm_shuffle_epi32(maxdiff, 0xEE));
int res = _mm_cvtsi128_si32(maxdiff) & 0xFFFF;
unsigned scalarmin = _mm_extract_epi16(min, 7); // last element of last vector
for (; i != 0; i--) {
scalarmin = std::min(scalarmin, ptr[i - 1]);
res = std::max(res, ptr[i - 1] - scalarmin);
}
return res != 0 ? res : -1;
}
We could replace the scalar cleanup with a final unaligned vector, if we handle the overlap between the last full vector min.

warp shuffling to reduction of arrays with any length

I am working on a Cuda kernel which performs vector dot product (A x B). I assumed that the length of each vector is multiple of 32 (32,64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A to the corresponding element of B (thread i ==>psum = A[i]xB[i]). After multiplication, I used the following functions which used warp shuffling technique to perform reduction and calculate the sum all multiplications.
__inline__ __device__
float warpReduceSum(float val) {
int warpSize =32;
for (int offset = warpSize/2; offset > 0; offset /= 2)
val += __shfl_down(val, offset);
return val;
}
__inline__ __device__
float blockReduceSum(float val) {
static __shared__ int shared[32]; // Shared mem for 32 partial sums
int lane = threadIdx.x % warpSize;
int wid = threadIdx.x / warpSize;
val = warpReduceSum(val); // Each warp performs partial reduction
if (lane==0)
shared[wid]=val; // Write reduced value to shared memory
__syncthreads(); // Wait for all partial reductions
//read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if (wid==0)
val = warpReduceSum(val); // Final reduce within first warp
return val;
}
I simply call blockReduceSum(psum) which psum is the multiplication of two elements by a thread.
This approach doesn't work when the length of the array is not multiple of 32, so my question is, can we change this code so that it also works for any length? or is it impossible because if the length of the array is not multiple of 32, some warps have elements belonging more than one array?
First of all, depending on the GPU you are using, performing dot product with just 1 block will probably not be very efficient (as long as you are not batching several dot products in 1 kernel, each done by a single block).
To answer your question: you can reuse the code you have written by just calling your kernel with the number of threads being the closest multiple of 32 higher than N (length of the array) and introducing if statement before calling to blockReduceSum that would like this:
__global__ void kernel(float * A, float * B, int N) {
float psum = 0;
if(threadIdx.x < N) //threadIDx.x because your are using single block, you will need to change it to more general id once you move to multiple blocks
psum = A[threadIdx.x] * B[threadIdx.x];
blockReduceSum(psum);
//The rest of computation
}
That way, threads that do not have array element associated with them, but that need to be there due to use of __shfl, will contribute 0 to the sum.

Correct OpenMP pragmas for pi monte carlo in C with not thread-safe random number generator

I need some help to parallelize the pi calculation with the monte carlo method with openmp by a given random number generator, which is not thread safe.
First: This SO thread didn't help me.
My own try is the following #pragma omp statements. I thought the i, x and y vars should be init by each thread and should than be private. z ist the sum of all hits in the circle, so it should be summed after the implied barriere after the for loop.
Think the main problem ist the static state var of the random number generator. I made a critical section where the functions are called, so that only one thread per time could execute it. But the Pi solutions doesn't scale with more higher values.
Note: I should not use another RNG, but its okay to make little changes on it.
int main (int argc, char *argv[]) {
int i, z = 0, threads = 8, iters = 100000;
double x,y, pi;
#pragma omp parallel firstprivate(i,x,y) reduction(+:z) num_threads(threads)
for (i=0; i<iters; ++i) {
#pragma omp critical
{
x = rng_doub(1.0);
y = rng_doub(1.0);
}
if ((x*x+y*y) <= 1.0)
z++;
}
pi = ((double) z / (double) (iters*threads))*4.0;
printf("Pi: %lf\n", pi);;
return 0;
}
This RNG is actually an included file, but as I'm not sure if I create the header file correct, I integrated it in the other program file, so I have only one .c file.
#define RNG_MOD 741025
int rng_int(void) {
static int state = 0;
return (state = (1366 * state + 150889) % RNG_MOD);
}
double rng_doub(double range) {
return ((double) rng_int()) / (double) ((RNG_MOD - 1)/range);
}
I've also tried to make the static int state global, but it doesn't change my result, maybe I done it wrong. So please could you help me make the correct changes? Thank you very much!
Your original linear congruent PRNG has a cycle length of 49400, therefore you are only getting 29700 unique test points. This is a terrible generator to be used for any kind of Monte Carlo simulations. Even if you make 100000000 trials, you won't get any closer to the true value of Pi because you are simply repeating the same points over and over again and as a result both the final value of z and iters are simply multiplied by the same constant, which cancel in the end during the division.
The per-thread seed introduced by Z boson improves the situation a little bit with the number of unique points increasing with the total number of OpenMP threads. The increase is not linear since if the seed of one PRNG falls in the sequence of another PRNG, both PRNGs produce the same sequence shifted with no more than 49400 elements. Given the cycle length, each PRNG covers 49400/RNG_MOD = 6,7% of the total output range and that is the probability of two PRNGs being synchronised. There are a total of RNG_MOD/49400 = 15 unique sequences possible. It basically means that in the best seeding case scenario you won't be able to get past 30 threads as any other thread would simply repeat the result of some of the others. The multiplier 2 comes from the fact that each point uses two elements from the sequence and therefore it is possible to get a different set of points if you shift the sequence by one element.
The ultimate solution is to completely drop your PRNG and stick to something like Mersenne twister MT19937, which has a cycle length of 219937 − 1 and a very strong seeding algorithm. If you are not able to use another PRNG as you state in your question, at least modify the constants of the LCG to match those used in rand():
int rng_int(void) {
static int state = 1;
// & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
Note that rand() is not a good PRNG - it is still bad. It is just a little better than the one used in your code.
Try the code below. It makes a private state for each thread. I did something similar with the at rand_r function Why does calculation with OpenMP take 100x more time than with a single thread?
Edit: I updated my code using some of Hristo's suggestions. I used threadprivate (for the first time). I also used a better rand function which gives a better estimate of pi but it's still not good enough.
One strange things was I had to define the function rng_int after threadprivate otherwise I got an error "error: 'state' declared 'threadprivate' after first use". I should probably ask a question about this.
//gcc -O3 -Wall -pedantic -fopenmp main.c
#include <omp.h>
#include <stdio.h>
#define RNG_MOD 0x80000000
int state;
int rng_int(void);
double rng_doub(double range);
int main() {
int i, numIn, n;
double x, y, pi;
n = 1<<30;
numIn = 0;
#pragma omp threadprivate(state)
#pragma omp parallel private(x, y) reduction(+:numIn)
{
state = 25234 + 17 * omp_get_thread_num();
#pragma omp for
for (i = 0; i <= n; i++) {
x = (double)rng_doub(1.0);
y = (double)rng_doub(1.0);
if (x*x + y*y <= 1) numIn++;
}
}
pi = 4.*numIn / n;
printf("asdf pi %f\n", pi);
return 0;
}
int rng_int(void) {
// & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
double rng_doub(double range) {
return ((double)rng_int()) / (((double)RNG_MOD)/range);
}
You can see the results (and edit and run the code) at http://coliru.stacked-crooked.com/a/23c1753a1b7d1b0d

OpenCL/CUDA: Two-stage reduction Algorithm

Reduction of large arrays can be done by calling __reduce(); multiple times.
The following code however uses only two stages and is documented here:
However I am unable to understand the algorithm for this two stage reduction. can some give a simpler explanation?
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
It can also be well implemented using CUDA.
You create N threads. The first thread looks at values at positions 0, N, 2*N, ... The second thread looks at values 1, N+1, 2*N+1, ... That's the first loop. It reduces length values into N values.
Then each thread saves its smallest value in shared/local memory. Then you have a synchronization instruction (barrier(CLK_LOCAL_MEM_FENCE).) Then you have standard reduction in shared/local memory. When you're done the thread with local id 0 saves its result in the output array.
All in all, you have a reduction from length to N/get_local_size(0) values. You'd need to do one last pass after this code is done executing. However, this gets most of the job done, for example, you might have length ~ 10^8, N = 2^16, get_local_size(0) = 256 = 2^8, and this code reduces 10^8 elements into 256 elements.
Which parts do you not understand?

Resources