Inefficient OpenMP parallel single nowait

I am a beginner in OpenMP and would like to run three independent blocks of work in parallel. I used the "single nowait" construct. The variables shown are independent. The type "SpDoubleVec" and the functions "transpose" and "prune" come from the Eigen library. The code works, but it is not efficient.
Any suggestions to speed it up?
SpDoubleVec x1, x2, x3, x4, x5, x6, b1, b2, b3, g1, g2, g3;
double lambda1, lambda2, lambda3;
double t, t1, t2, t3;
omp_set_num_threads(3);
#pragma omp parallel
{
    #pragma omp single nowait
    {
        SpDoubleVec v1;
        v1 = lambda1 * x1 + (1 - lambda1) * x2 - b1;
        t1 = g1.transpose() * v1;
    }
    #pragma omp single nowait
    {
        SpDoubleVec v2;
        v2 = lambda2 * x3 + (1 - lambda2) * x4 - b2;
        for (SpDoubleVec::InnerIterator it1(v2); it1; ++it1)
            if (it1.value() < 0)
                it1.valueRef() = 0;
        v2.prune(0.0);
        t2 = g2.transpose() * v2;
    }
    #pragma omp single nowait
    {
        SpDoubleVec v3;
        v3 = b3 - lambda3 * x5 - (1 - lambda3) * x6;
        for (SpDoubleVec::InnerIterator it2(v3); it2; ++it2)
            if (it2.value() < 0)
                it2.valueRef() = 0;
        v3.prune(0.0);
        t3 = g3.transpose() * v3;
    }
}
t = t1 + t2 + t3;
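One variant worth trying (a sketch only, reusing the variables above and assuming SpDoubleVec is an Eigen sparse vector type as in the question): express the three blocks as explicit sections instead of single nowait, so the intent of one block per thread is explicit and the only barrier is the implicit one at the end of the region. Whether this actually helps depends on how much work each block does relative to the threading overhead; for small vectors the serial version may still win.
// Sketch: the same three independent blocks, expressed with OpenMP sections.
// Assumes the variables declared above; t1, t2, t3 are each written by one section only.
omp_set_num_threads(3);
#pragma omp parallel sections
{
    #pragma omp section
    {
        SpDoubleVec v1;
        v1 = lambda1 * x1 + (1 - lambda1) * x2 - b1;
        t1 = g1.transpose() * v1;
    }
    #pragma omp section
    {
        SpDoubleVec v2;
        v2 = lambda2 * x3 + (1 - lambda2) * x4 - b2;
        for (SpDoubleVec::InnerIterator it(v2); it; ++it)
            if (it.value() < 0)
                it.valueRef() = 0;
        v2.prune(0.0);
        t2 = g2.transpose() * v2;
    }
    #pragma omp section
    {
        SpDoubleVec v3;
        v3 = b3 - lambda3 * x5 - (1 - lambda3) * x6;
        for (SpDoubleVec::InnerIterator it(v3); it; ++it)
            if (it.value() < 0)
                it.valueRef() = 0;
        v3.prune(0.0);
        t3 = g3.transpose() * v3;
    }
}
t = t1 + t2 + t3;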

Related

Is declaring a static array as "private" before a parallel loop perfectly equivalent to declaring the array inside the loop?

I've encountered a situation where the code generates different results when the arrays are defined inside the loop over the index i (case #1) versus when they are declared outside the loop over i and listed in a private clause (case #2).
Case #2 generates the same results as the code running on the CPU only.
Case #1
#pragma acc parallel loop
for (j = jbeg; j <= jend; j++){
#pragma acc loop
for (i = ibeg; i <= iend; i++){
double Rc[NFLX][NFLX];
double eta[NFLX], um[NFLX], dv[NFLX];
double lambda[NFLX], alambda[NFLX];
double fL[NFLX], fR[NFLX];
.
.
.
}
}}
Case #2
#pragma acc parallel loop
for (j = jbeg; j <= jend; j++){
double Rc[NFLX][NFLX];
double eta[NFLX], um[NFLX], dv[NFLX];
double lambda[NFLX], alambda[NFLX];
double fL[NFLX], fR[NFLX];
#pragma acc loop private(Rc[:NFLX][:NFLX], eta[:NFLX], \
um[:NFLX], lambda[:NFLX], alambda[:NFLX], \
dv[:NFLX], fL[:NFLX], fR[:NFLX])
for (i = ibeg; i <= iend; i++){
.
.
.
}
}}
I have the following values:
NFLX = 8;
jbeg = 3, jend = 258;
ibeg = 3, iend = 1026;
In which cases are the two techniques equivalent, and when is it better to choose one over the other?
This is what I see with -Minfo=accel:
case #1:
71, Local memory used for Rc,dv,fR,um,lambda,alambda,fL,eta
case #2:
71, Local memory used for Rc,dv,fR,lambda,alambda,fL,eta
CUDA shared memory used for Rc,eta
Local memory used for um
CUDA shared memory used for um,lambda,alambda,dv,fL,fR
function:
/* ********************************************************************* */
void Roe_Solver (Data *d, timeStep *Dts, Grid *grid, RBox *box)
/*
* Solve the Riemann problem between L/R states using a
* Rusanov-Lax Friedrichs flux.
*********************************************************************** */
{
int i, j, k;
int ibeg = *(box->nbeg)-1, iend = *(box->nend);
int jbeg = *(box->tbeg), jend = *(box->tend);
int kbeg = *(box->bbeg), kend = *(box->bend);
int VXn = VX1, VXt = VX2, VXb = VX3;
int MXn = MX1, MXt = MX2, MXb = MX3;
int ni, nj;
double gmm = GAMMA_EOS;
double gmm1 = gmm - 1.0;
double gmm1_inv = 1.0/gmm1;
double delta = 1.e-7;
double delta_inv = 1.0/delta;
ARRAY_OFFSET (grid, ni, nj);
INDEX_CYCLE (grid->dir, VXn, VXt, VXb);
INDEX_CYCLE (grid->dir, MXn, MXt, MXb);
#pragma acc parallel loop collapse(2) present(d, Dts, grid)
for (k = kbeg; k <= kend; k++){
for (j = jbeg; j <= jend; j++){
long int offset = ni*(j + nj*k);
double * __restrict__ cmax = &Dts->cmax [offset];
double * __restrict__ SL = &d->sweep.SL [offset];
double * __restrict__ SR = &d->sweep.SR [offset];
double um[NFLX];
double fL[NFLX], fR[NFLX];
#pragma acc loop private(um[:NFLX], fL[:NFLX], fR[:NFLX])
for (i = ibeg; i <= iend; i++){
int nv;
double scrh, vel2;
double a2, a, h;
double alambda, lambda, eta;
double s, c, hl, hr;
double bmin, bmax, scrh1;
double pL, pR;
double * __restrict__ vL = d->sweep.vL [offset + i];
double * __restrict__ vR = d->sweep.vR [offset + i];
double * __restrict__ uL = d->sweep.uL [offset + i];
double * __restrict__ uR = d->sweep.uR [offset + i];
double * __restrict__ flux = d->sweep.flux[offset + i];
double a2L = SoundSpeed2 (vL);
double a2R = SoundSpeed2 (vR);
PrimToCons (vL, uL);
PrimToCons (vR, uR);
Flux (vL, uL, fL, grid->dir);
Flux (vR, uR, fR, grid->dir);
pL = vL[PRS];
pR = vR[PRS];
s = sqrt(vR[RHO]/vL[RHO]);
um[RHO] = vL[RHO]*s;
s = 1.0/(1.0 + s);
c = 1.0 - s;
um[VX1] = s*vL[VX1] + c*vR[VX1];
um[VX2] = s*vL[VX2] + c*vR[VX2];
um[VX3] = s*vL[VX3] + c*vR[VX3];
vel2 = um[VX1]*um[VX1] + um[VX2]*um[VX2] + um[VX3]*um[VX3];
hl = 0.5*(vL[VX1]*vL[VX1] + vL[VX2]*vL[VX2] + vL[VX3]*vL[VX3]);
hl += a2L*gmm1_inv;
hr = 0.5*(vR[VX1]*vR[VX1] + vR[VX2]*vR[VX2] + vR[VX3]*vR[VX3]);
hr += a2R*gmm1_inv;
h = s*hl + c*hr;
/* ----------------------------------------------------
1. the following should be equivalent to
scrh = dv[VX1]*dv[VX1] + dv[VX2]*dv[VX2] + dv[VX3]*dv[VX3];
a2 = s*a2L + c*a2R + 0.5*gmm1*s*c*scrh;
and therefore always positive.
---------------------------------------------------- */
a2 = gmm1*(h - 0.5*vel2);
a = sqrt(a2);
/* ----------------------------------------------------------------
2. define non-zero components of conservative eigenvectors Rc,
eigenvalues (lambda) and wave strength eta = L.du
---------------------------------------------------------------- */
#pragma acc loop seq
NFLX_LOOP(nv) flux[nv] = 0.5*(fL[nv] + fR[nv]);
/* ---- (u - c_s) ---- */
SL[i] = um[VXn] - a;
/* ---- (u + c_s) ---- */
SR[i] = um[VXn] + a;
/* ---- get max eigenvalue ---- */
cmax[i] = fabs(um[VXn]) + a;
NFLX_LOOP(nv) flux[nv] = 0.5*(fL[nv] + fR[nv]) - 0.5*cmax[i]*(uR[nv] - uL[nv]);
#if DIMENSIONS > 1
/* ---------------------------------------------
3. use the HLL flux function if the interface
lies within a strong shock.
The effect of this switch is visible
in the Mach reflection test.
--------------------------------------------- */
scrh = fabs(vL[PRS] - vR[PRS]);
scrh /= MIN(vL[PRS],vR[PRS]);
if (scrh > 0.5 && (vR[VXn] < vL[VXn])){ /* -- tunable parameter -- */
bmin = MIN(0.0, SL[i]);
bmax = MAX(0.0, SR[i]);
scrh1 = 1.0/(bmax - bmin);
#pragma acc loop seq
for (nv = 0; nv < NFLX; nv++){
flux[nv] = bmin*bmax*(uR[nv] - uL[nv])
+ bmax*fL[nv] - bmin*fR[nv];
flux[nv] *= scrh1;
}
}
#endif /* DIMENSIONS > 1 */
} /* End loop on i */
}} /* End loop on j,k */
}
Technically they are equivalent, but in practice different. What's happening is that the compiler will hoist the declaration of these arrays outside of the loops. This is standard practice for the compiler and happens before the OpenACC directives are applied. What should happen is that these arrays are then implicitly privatized within the scoping unit in which they are declared. However, the compiler doesn't currently track this, so the arrays are implicitly copied into the compute region as shared arrays. If you add the flag "-Minfo=accel", you'll see the compiler feedback messages indicating the implicit copies.
I have an open issue report requesting this support, TPR #31360, but it has been a challenge to implement and is not in a released compiler as of yet. Hence, until/unless we can fix the behavior, you'll need to manually hoist the declarations of these arrays and then add them to a "private" clause.

Efficient and accurate computation of the reciprocal of hypot(a,b)

Givens rotations provide a robust and easily parallelizable way to implement QR decomposition. A Givens rotation requires the computation of sine and cosine components of a rotation angle. In the case of real computation, this typically involves the computation of the reciprocal of the hypot() function to normalize a two-vector, as shown for example in Wikipedia.
While this avoids most cases of overflow and underflow in intermediate computation, for very large values of a and b, hypot(a,b) may overflow to infinity, while 1/√(a²+b²) is actually representable as a subnormal floating-point number. Also, the use of a division adds further computational cost that can be significant on platforms with slow floating-point division.
A function rhypot(a,b) that directly computes 1/√(a²+b²) at a cost similar to the standard hypot() function would therefore be desirable. The accuracy should be the same as or better than the naive approach of computing 1.0/hypot(a,b). With a correctly-rounded hypot function, this expression has a maximum error of 1.5 ulps.
How can such a function be implemented efficiently and accurately? The use of IEEE-754 binary floating-point arithmetic and the availability of native hardware support for fused multiply-add (FMA) operations can be assumed. For ease of exposition and testing, we can restrict to single-precision computation, i.e. the IEEE-754 binary32 format.
In the following, I am showing ISO-C99 code that implements rhypot with good accuracy and good performance. The general algorithm is directly derived from the example implementations I showed for hypot in this answer. For hypot, one determines the value of largest magnitude among the arguments, then finds a scale factor (a power of two for reasons of accuracy) that maps this value into the vicinity of unity. The scale factor is applied to both arguments, and the length of this transformed 2-vector is then computed with the sqrt function; finally, the result is scaled back with the "inverse" of the scale factor. The scaling relies on actual multiplication, as the arguments may be subnormals that cannot be scaled correctly by simple exponent manipulation alone.
For rhypot, only two changes are needed: the reciprocal square root function rsqrt must be used instead of sqrt, and input scaling and result scaling use the same scale factor.
Some computing environments provide an rsqrt() function, and this function is scheduled for inclusion in a future version of the ISO C standard (ISO/IEC TS 18661-4:2015). For environments that do not provide a reciprocal square root function, I am showing some portable (within the platform requirements stated in the question) and machine-specific implementations.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
float my_rsqrtf (float);
/* Compute the reciprocal of sqrt (a**2 + b**2), avoiding premature overflow
and underflow in intermediate computation. The accuracy of this function
depends on the accuracy of the reciprocal square root implementation used.
With the rsqrtf() implementations shown below, the following maximum ulp
error was observed for 2**36 random test cases:
CORRECTLY_ROUNDED 1.20736973
SSE_HALLEY 1.33120522
SSE_2NR 1.42086841
SQRT_OOX 1.42906701
BIT_TWIDDLE_3NR 1.43062950
ITO_TAKAGI_YAJIMA_1NR 1.43681737
BIT_TWIDDLE_NR_HALLEY 1.47485797
*/
float my_rhypotf (float a, float b)
{
float fa, fb, mn, mx, scale, s, w, res;
uint32_t expo;
/* sort arguments by magnitude */
fa = fabsf (a);
fb = fabsf (b);
mx = fmaxf (fa, fb);
mn = fminf (fa, fb);
/* compute scale factor */
expo = __float_as_uint32 (mx) & 0xfc000000;
scale = __uint32_as_float (0x7e000000 - expo);
/* scale operand of maximum magnitude towards unity */
mn = mn * scale;
mx = mx * scale;
/* mx in [2**-23, 2**6) */
s = fmaf (mx, mx, mn * mn); // 0.75 ulp
w = my_rsqrtf (s);
/* reverse previous scaling */
res = w * scale;
/* handle special cases */
float t = a + b;
if (!(fabsf (t) <= INFINITY)) res = t; // isnan(t)
if (mx == INFINITY) res = 0.0f; // isinf(mx)
return res;
}
#define CORRECTLY_ROUNDED (1)
#define SSE_HALLEY (2)
#define SSE_2NR (3)
#define ITO_TAKAGI_YAJIMA_1NR (4)
#define SQRT_OOX (5)
#define BIT_TWIDDLE_3NR (6)
#define BIT_TWIDDLE_NR_HALLEY (7)
#define RSQRT_VARIANT (SSE_HALLEY)
#if (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
#include "immintrin.h"
#endif // (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
float my_rsqrtf (float a)
{
#if RSQRT_VARIANT == CORRECTLY_ROUNDED
float r = (float) sqrt (1.0/(double)a);
#elif RSQRT_VARIANT == SQRT_OOX
float r = sqrtf (1.0f / a);
#elif RSQRT_VARIANT == SSE_2NR
float r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using two Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == SSE_HALLEY
float e, r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_3NR
float r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using three Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_NR_HALLEY
float e, r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == ITO_TAKAGI_YAJIMA_1NR
/* Masayuki Ito, Naofumi Takagi, Shuzo Yajima, "Efficient Initial
Approximation for Multiplicative Division and Square Root by a
Multiplication with Operand Modification". IEEE Transactions on
Computers, Vol. 46, No. 4, April 1997, pp. 495-498.
*/
#define TAB_INDEX_BITS (7)
#define TAB_ENTRY_BITS (16)
#define TAB_ENTRIES (1 << TAB_INDEX_BITS)
#define FP32_EXPO_BIAS (127)
#define FP32_MANT_BITS (23)
#define FP32_SIGN_MASK (0x80000000)
#define FP32_EXPO_MASK (0x7f800000)
#define FP32_EXPO_LSB_MASK (1u << FP32_MANT_BITS)
#define FP32_INDEX_MASK (((1u << TAB_INDEX_BITS) - 1) << (FP32_MANT_BITS - TAB_INDEX_BITS))
#define FP32_XHAT_MASK (~(FP32_INDEX_MASK | FP32_SIGN_MASK) | FP32_EXPO_MASK)
#define FP32_FLIP_BIT_MASK (3u << (FP32_MANT_BITS - TAB_INDEX_BITS - 1))
#define FP32_ONE_HALF (0x3f000000)
const uint16_t d1tab [TAB_ENTRIES] = {
0xb2ec, 0xaed7, 0xaae9, 0xa720, 0xa37b, 0x9ff7, 0x9c93, 0x994d,
0x9623, 0x9316, 0x9022, 0x8d47, 0x8a85, 0x87d8, 0x8542, 0x82c0,
0x8053, 0x7bf0, 0x775f, 0x72f1, 0x6ea4, 0x6a77, 0x666a, 0x6279,
0x5ea5, 0x5aed, 0x574e, 0x53c9, 0x505d, 0x4d07, 0x49c8, 0x469e,
0x438a, 0x408a, 0x3d9e, 0x3ac4, 0x37fc, 0x3546, 0x32a0, 0x300b,
0x2d86, 0x2b10, 0x28a8, 0x264f, 0x2404, 0x21c6, 0x1f95, 0x1d70,
0x1b58, 0x194c, 0x174b, 0x1555, 0x136a, 0x1189, 0x0fb2, 0x0de6,
0x0c22, 0x0a68, 0x08b7, 0x070f, 0x056f, 0x03d8, 0x0249, 0x00c1,
0xfd08, 0xf742, 0xf1b4, 0xec5a, 0xe732, 0xe239, 0xdd6d, 0xd8cc,
0xd454, 0xd002, 0xcbd6, 0xc7cd, 0xc3e5, 0xc01d, 0xbc75, 0xb8e9,
0xb57a, 0xb225, 0xaeeb, 0xabc9, 0xa8be, 0xa5cb, 0xa2ed, 0xa024,
0x9d6f, 0x9ace, 0x983e, 0x95c1, 0x9355, 0x90fa, 0x8eae, 0x8c72,
0x8a45, 0x8825, 0x8614, 0x8410, 0x8219, 0x802e, 0x7c9c, 0x78f5,
0x7565, 0x71eb, 0x6e85, 0x6b31, 0x67f3, 0x64c7, 0x61ae, 0x5ea7,
0x5bb0, 0x58cb, 0x55f6, 0x5330, 0x5079, 0x4dd1, 0x4b38, 0x48ad,
0x462f, 0x43be, 0x4159, 0x3f01, 0x3cb5, 0x3a75, 0x3840, 0x3616
};
uint32_t arg, idx, d1, xhat;
float r;
arg = __float_as_uint32 (a);
idx = (arg >> ((FP32_MANT_BITS + 1) - TAB_INDEX_BITS)) & ((1u << TAB_INDEX_BITS) - 1);
d1 = FP32_ONE_HALF | (d1tab[idx] << ((FP32_MANT_BITS + 1) - TAB_ENTRY_BITS));
xhat = ((arg & FP32_INDEX_MASK) | (((((3 * FP32_EXPO_BIAS) << FP32_MANT_BITS) + ~arg) >> 1) & FP32_XHAT_MASK)) ^ FP32_FLIP_BIT_MASK;
/* compute initial approximation, accurate to about 14 bits */
r = __uint32_as_float (d1) * __uint32_as_float (xhat);
/* refine approximation with one Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#else
#error unsupported RSQRT_VARIANT
#endif // RSQRT_VARIANT
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float types's limited
exponent range.
*/
refi = __double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
double rhypot (double a, double b)
{
return 1.0 / hypot (a, b);
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
#define FP32_QNAN_BIT (0x00400000)
int main (void)
{
float af, bf, resf, reff;
uint32_t ai, bi, resi, refi;
double ref, err, maxerr = 0;
uint64_t diff, diffsum = 0, count = 1ULL << 36;
do {
ai = KISS;
bi = KISS;
af = __uint32_as_float (ai);
bf = __uint32_as_float (bi);
resf = my_rhypotf (af, bf);
ref = rhypot ((double)af, (double)bf);
reff = (float)ref;
refi = __float_as_uint32 (reff);
resi = __float_as_uint32 (resf);
diff = llabs ((long long int)resi - (long long int)refi);
/* If both inputs are a NaN, result can be either argument, converted
to QNaN if necessary. If one input is NaN and the other not infinity
the NaN input must be returned, converted to QNaN if necessary. If
one input is infinity, zero must be returned even if the other input
is a NaN. In all other cases allow up to 1 ulp of difference.
*/
if ((isnan (af) && isnan (bf) && (resi != (ai | FP32_QNAN_BIT)) && (resi != (bi | FP32_QNAN_BIT))) ||
(isnan (af) && !isinf (bf) && !isnan (bf) && (resi != (ai | FP32_QNAN_BIT))) ||
(isnan (bf) && !isinf (af) && !isnan (af) && (resi != (bi | FP32_QNAN_BIT))) ||
(isinf (af) && (resi != 0)) ||
(isinf (bf) && (resi != 0)) ||
(diff > 1)) {
printf ("err # (%08x,%08x): res= %08x (%15.8e) ref=%08x (%15.8e)\n",
ai, bi, resi, resf, refi, reff);
return EXIT_FAILURE;
}
diffsum += diff;
err = floatUlpErr (resf, ref);
if (err > maxerr) {
printf ("ulp=%.8f # (% 15.8e, % 15.8e): res=%15.6a ref=%22.13a\n",
err, af, bf, resf, ref);
maxerr = err;
}
count--;
} while (count);
printf ("diffsum = %llu\n", diffsum);
return EXIT_SUCCESS;
}

Dealing with matrices in CUDA: understanding basic concepts

I'm building a CUDA kernel to compute the numerical N*N Jacobian of a function, using finite differences; in the example provided, it is the square function (each entry of the vector is squared). The host code allocates linear memory, while I'm using 2-dimensional indexing in the kernel.
My issue is that I haven't found a way to sum on the diagonal of the cudaMalloc'ed matrices. My attempt has been to use threadIdx.x == blockIdx.x as the condition for the diagonal, but it evaluates to true only when both are 0.
Here is the kernel. EDIT: I posted the whole code as an answer, based on the suggestions in the comments (the main() is basically the same, while the kernel is not).
template <typename T>
__global__ void jacobian_kernel (
T * J,
const T t0,
const T tn,
const T h,
const T * u0,
const T * un,
const T * un_old)
{
T cgamma = 2 - sqrtf(2);
const unsigned int t = threadIdx.x;
const unsigned int b = blockIdx.x;
const unsigned int tid = t + b * blockDim.x;
/*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
/*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
__shared__ T sm_temp_du[BLOCK_SIZE];
T* temp_du = &sm_temp_du[0];
if (tid < N )
{
temp_sx[b][t] = un[t];
temp_dx[b][t] = un[t];
if ( t == b )
{
if ( tn == t0 )
{
temp_du[t] = u0[t]*0.001;
temp_sx[b][t] += temp_du[t]; //(*)
temp_dx[b][t] -= temp_du[t];
temp_sx[b][t] += ( abs( temp_sx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_dx[b][t] += ( abs( temp_dx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );
}
else
{
temp_du[t] = MAX( un[t] - un_old[t], 10e-6 );
temp_sx[b][t] += temp_du[t];
temp_dx[b][t] -= temp_du[t];
}
}
__syncthreads();
//J = f(tn, un + du)
d_func(tn, (temp_sx[b]), (temp_sx[b]), 1.f);
d_func(tn, (temp_dx[b]), (temp_dx[b]), 1.f);
__syncthreads();
J[tid] = (temp_sx[b][t] - temp_dx[b][t]) * powf((2 * temp_du[t]), -1);
//J[tid]*= - h*cgamma/2;
//J[tid]+= ( t == b ? 1 : 0);
//J[tid] = temp_J[tid];
}
}
The general procedure for computing the jacobian is
Copy un into every row of temp_sx and temp_dx
Compute du as a 0.01 magnitude from u0
Sum du to the diagonal of temp_sx, subtract du from the diagonal of temp_dx
Compute the square function on each entry of temp_sx and temp_dx
Subtract them and divide every entry by 2*du
This procedure can be summarized as (f(un + du*e_i) - f(un - du*e_i)) / (2*du).
My problem is how to add du to the diagonals of temp_sx and temp_dx, as I tried in (*). How can I achieve that?
EDIT: Now I launch 1D blocks in a 1D grid; in fact, the .y axis wasn't used at all in the kernel. I'm calling the kernel with a fixed amount of shared memory.
Note that in int main() I'm calling the kernel with
#define REAL sizeof(float)
#define N 32
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
...
dim3 dimGrid(NUM_BLOCKS);
dim3 dimBlock(BLOCK_SIZE);
size_t shm_size = N*N*REAL;
jacobian_kernel <<< dimGrid, dimBlock, shm_size >>> (...);
So that I attempt to deal with splitting the work across blocks. In the kernel, to sum on the diagonal, I used if(threadIdx.x == blockIdx.x){...}. Why isn't this correct? I'm asking because, while debugging and making the code print the statement, it only evaluates to true when both are 0. Thus du[0] is the only numerical value and the matrix becomes nan. Note that this approach worked with the first code I built, where I instead called the kernel with
jacobian_kernel <<< N, N >>> (...)
So that when threadIdx.x == blockIdx.x the element is on the diagonal. This approach doesn't fit anymore though, since now I need to deal with larger N (possibly larger than 1024, which is the maximum number of threads per block).
What statement should I put there that works even if the matrices are split into blocks and threads?
Let me know if I should share some other info.
Here is how I managed to solve my problem, based on the suggestions in the comments on the answer. The example is compilable, provided you put helper_cuda.h and helper_string.h in the same directory, or add a -I directive for the CUDA samples' include path installed along with the CUDA toolkit. The relevant changes are only in the kernel; there's a minor change in main() too, since I was requesting double the resources to execute the kernel, but the .y axis of the grid of thread blocks wasn't used at all, so it didn't generate any error.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"
#include "helper_string.h"
#include <fstream>
#ifndef MAX
#define MAX(a,b) ((a > b) ? a : b)
#endif
#define REAL sizeof(float)
#define N 128
#define BLOCK_SIZE 128
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
template <typename T>
inline void printmatrix( T mat, int rows, int cols);
template <typename T>
__global__ void jacobian_kernel ( const T * A, T * J, const T t0, const T tn, const T h, const T * u0, const T * un, const T * un_old);
template<typename T>
__device__ void d_func(const T t, const T u[], T res[], const T h = 1);
int main ()
{
float t0 = 0.; //float tn = 0.;
float h = 0.1;
float* u0 = (float*)malloc(REAL*N); for(int i = 0; i < N; ++i){u0[i] = i+1;}
float* un = (float*)malloc(REAL*N); memcpy(un, u0, REAL*N);
float* un_old = (float*)malloc(REAL*N); memcpy(un_old, u0, REAL*N);
float* J = (float*)malloc(REAL*N*N);
float* A = (float*)malloc(REAL*N*N); host_heat_matrix(A);
float *d_u0;
float *d_un;
float *d_un_old;
float *d_J;
float *d_A;
checkCudaErrors(cudaMalloc((void**)&d_u0, REAL*N)); //printf("1: %p\n", d_u0);
checkCudaErrors(cudaMalloc((void**)&d_un, REAL*N)); //printf("2: %p\n", d_un);
checkCudaErrors(cudaMalloc((void**)&d_un_old, REAL*N)); //printf("3: %p\n", d_un_old);
checkCudaErrors(cudaMalloc((void**)&d_J, REAL*N*N)); //printf("4: %p\n", d_J);
checkCudaErrors(cudaMalloc((void**)&d_A, REAL*N*N)); //printf("4: %p\n", d_J);
checkCudaErrors(cudaMemcpy(d_u0, u0, REAL*N, cudaMemcpyHostToDevice)); assert(d_u0 != NULL);
checkCudaErrors(cudaMemcpy(d_un, un, REAL*N, cudaMemcpyHostToDevice)); assert(d_un != NULL);
checkCudaErrors(cudaMemcpy(d_un_old, un_old, REAL*N, cudaMemcpyHostToDevice)); assert(d_un_old != NULL);
checkCudaErrors(cudaMemcpy(d_J, J, REAL*N*N, cudaMemcpyHostToDevice)); assert(d_J != NULL);
checkCudaErrors(cudaMemcpy(d_A, A, REAL*N*N, cudaMemcpyHostToDevice)); assert(d_A != NULL);
dim3 dimGrid(NUM_BLOCKS); std::cout << "NUM_BLOCKS \t" << dimGrid.x << "\n";
dim3 dimBlock(BLOCK_SIZE); std::cout << "BLOCK_SIZE \t" << dimBlock.x << "\n";
size_t shm_size = N*REAL; //std::cout << shm_size << "\n";
//HERE IS A RELEVANT CHANGE OF THE MAIN, SINCE I WAS CALLING
//THE KERNEL WITH A 2D GRID BUT WITHOUT USING THE .y AXIS,
//WHILE NOW THE GRID IS 1D
jacobian_kernel <<< dimGrid, dimBlock, shm_size >>> (d_A, d_J, t0, t0, h, d_u0, d_un, d_un_old);
checkCudaErrors(cudaMemcpy(J, d_J, REAL*N*N, cudaMemcpyDeviceToHost)); //printf("4: %p\n", d_J);
printmatrix( J, N, N);
checkCudaErrors(cudaDeviceReset());
free(u0);
free(un);
free(un_old);
free(J);
}
template <typename T>
__global__ void jacobian_kernel (
const T * A,
T * J,
const T t0,
const T tn,
const T h,
const T * u0,
const T * un,
const T * un_old)
{
T cgamma = 2 - sqrtf(2);
const unsigned int t = threadIdx.x;
const unsigned int b = blockIdx.x;
const unsigned int tid = t + b * blockDim.x;
/*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
/*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
__shared__ T sm_temp_du;
T* temp_du = &sm_temp_du;
//HERE IS A RELEVANT CHANGE (*)
if ( t < BLOCK_SIZE && b < NUM_BLOCKS )
{
temp_sx[b][t] = un[t]; //printf("temp_sx[%d] = %f\n", t,(temp_sx[b][t]));
temp_dx[b][t] = un[t];
//printf("t = %d, b = %d, t + b * blockDim.x = %d \n",t, b, tid);
//HERE IS A NOTE (**)
if ( t == b )
{
//printf("t = %d, b = %d \n",t, b);
if ( tn == t0 )
{
*temp_du = u0[t]*0.001;
temp_sx[b][t] += *temp_du;
temp_dx[b][t] -= *temp_du;
temp_sx[b][t] += ( abs( temp_sx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_dx[b][t] += ( abs( temp_dx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );
}
else
{
*temp_du = MAX( un[t] - un_old[t], 10e-6 );
temp_sx[b][t] += *temp_du;
temp_dx[b][t] -= *temp_du;
}
;
}
//printf("du[%d] %f\n", tid, (*temp_du));
__syncthreads();
//printf("temp_sx[%d][%d] = %f\n", b, t, temp_sx[b][t]);
//printf("temp_dx[%d][%d] = %f\n", b, t, temp_dx[b][t]);
//d_func(tn, (temp_sx[b]), (temp_sx[b]), 1.f);
//d_func(tn, (temp_dx[b]), (temp_dx[b]), 1.f);
matvec_dev( tn, A, (temp_sx[b]), (temp_sx[b]), N, N, 1.f );
matvec_dev( tn, A, (temp_dx[b]), (temp_dx[b]), N, N, 1.f );
__syncthreads();
//printf("temp_sx_later[%d][%d] = %f\n", b, t, (temp_sx[b][t]));
//printf("temp_sx_later[%d][%d] - temp_dx_later[%d][%d] = %f\n", b,t,b,t, (temp_sx[b][t] - temp_dx[b][t]) / 2 * *temp_du);
//if (t == b ) printf( "2du[%d]^-1 = %f\n",t, powf((2 * *temp_du), -1));
J[tid] = (temp_sx[b][t] - temp_dx[b][t]) / (2 * *temp_du);
}
}
template<typename T>
__device__ void d_func(const T t, const T u[], T res[], const T h )
{
__shared__ float temp_u;
temp_u = u[threadIdx.x];
res[threadIdx.x] = h*powf( (temp_u), 2);
}
template <typename T>
inline void printmatrix( T mat, int rows, int cols)
{
std::ofstream matrix_out;
matrix_out.open( "heat_matrix.txt", std::ofstream::out);
for( int i = 0; i < rows; i++)
{
for( int j = 0; j <cols; j++)
{
double next = mat[i + N*j];
matrix_out << ( (next >= 0) ? " " : "") << next << " ";
}
matrix_out << "\n";
}
}
The relevant change is at (*). Before, I used if (tid < N), which has two downsides:
First, it is wrong: it should be tid < N*N, since my data is 2D while tid is a global index that runs over all of it.
Second, even if I wrote tid < N*N, since I'm splitting the work into blocks, t < BLOCK_SIZE && b < NUM_BLOCKS seems clearer to me about how the indexing is arranged in the code.
Moreover, the condition t == b in (**) is actually the right one to operate on the diagonal elements of the matrix. The fact that it evaluated to true only at 0 was due to the error described right above.
Thanks for the suggestions!
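For reference, a way to pick out diagonal entries that does not rely on BLOCK_SIZE being equal to N is to derive the row and column from the flattened global index. Below is a minimal sketch (the kernel name and parameters are illustrative only), assuming a row-major N*N matrix and a 1D launch covering all N*N elements:
// Sketch: touch only the diagonal of a row-major n*n matrix, regardless of
// the block size chosen at launch time.
template <typename T>
__global__ void add_to_diagonal(T *J, T du, int n)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < (unsigned int)(n * n))
    {
        const unsigned int row = tid / n;
        const unsigned int col = tid % n;
        if (row == col)            // equivalently: tid % (n + 1) == 0
            J[tid] += du;          // J[tid] == J[row * n + col] on the diagonal
    }
}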

Resize 8-bit image by 2 with ARM NEON

I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:
void reducebytwo(uint8_t *dst, uint8_t *src)
//src is 640x480, dst is 320x240
What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?
As a starting point, I simply would like to do the equivalent of:
for (int h = 0; h < 240; h++)
for (int w = 0; w < 320; w++)
dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4;
This is a one-to-one translation of your code to ARM NEON intrinsics:
#include <arm_neon.h>
#include <stdint.h>
static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
int i;
for (i=0; i<640; i+=16)
{
// load upper line and add neighbor pixels:
uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));
// load lower line and add neighbor pixels:
uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));
// sum of upper and lower line:
uint16x8_t c = vaddq_u16 (a,b);
// divide by 4, convert to char and store:
vst1_u8 (dest, vshrn_n_u16 (c, 2));
// move pointers to next chunk of data
src1+=16;
src2+=16;
dest+=8;
}
}
void resize_image (uint8_t * src, uint8_t * dest)
{
int h;
for (h = 0; h < 240; h++)
{
resize_line (src+640*(h*2+0),
src+640*(h*2+1),
dest+320*h);
}
}
It processes 32 source-pixels and generates 8 output pixels per iteration.
I took a quick look at the assembler output and it looks okay. You can get better performance if you write the resize_line function in assembler, unroll the loop, and eliminate pipeline stalls. That would give you an estimated factor-of-three performance boost.
It should be a lot faster than your implementation without assembler changes though.
Note: I haven't tested the code...
If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:
for (i=0; i<640; i+= 32)
{
uint8x16x2_t a, b;
uint8x16_t c, d;
/* load upper row, splitting even and odd pixels into a.val[0]
* and a.val[1] respectively. */
a = vld2q_u8(src1);
/* as above, but for lower row */
b = vld2q_u8(src2);
/* compute average of even and odd pixel pairs for upper row */
c = vrhaddq_u8(a.val[0], a.val[1]);
/* compute average of even and odd pixel pairs for lower row */
d = vrhaddq_u8(b.val[0], b.val[1]);
/* compute average of upper and lower rows, and store result */
vst1q_u8(dest, vrhaddq_u8(c, d));
src1+=32;
src2+=32;
dest+=16;
}
It works by using the vrhadd (rounding halving add) operation, which produces a result the same size as its inputs. This way you don't have to shift the final sum back down to 8-bit, and all of the arithmetic throughout is eight-bit, which means you can perform twice as many operations per instruction.
However, it is less accurate, because the intermediate sums are quantised. Also, GCC 4.7 does a terrible job of generating code for it; GCC 4.8 does just fine.
The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch() (or PLD) should be used to hoist the incoming data into caches before it's needed.
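As a rough illustration of that suggestion, here is a sketch of the intrinsics loop from the first answer with prefetch hints added (untested; the function name and the 256-byte prefetch distance are placeholders that would need tuning on the target core):
#include <arm_neon.h>
#include <stdint.h>

static void resize_line_pld(uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
    int i;
    for (i = 0; i < 640; i += 16)
    {
        __builtin_prefetch(src1 + 256);             /* hint: upper row, a few cache lines ahead */
        __builtin_prefetch(src2 + 256);             /* hint: lower row, a few cache lines ahead */
        uint16x8_t a = vpaddlq_u8(vld1q_u8(src1));  /* pairwise-add upper row */
        uint16x8_t b = vpaddlq_u8(vld1q_u8(src2));  /* pairwise-add lower row */
        uint16x8_t c = vaddq_u16(a, b);             /* sum of both rows */
        vst1_u8(dest, vshrn_n_u16(c, 2));           /* divide by 4, narrow, store */
        src1 += 16;
        src2 += 16;
        dest += 8;
    }
}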
Here is the asm version of resize_line that @Nils Pipenbrinck suggested:
static void reduce2_neon_line(uint8_t* __restrict src1, uint8_t* __restrict src2, uint8_t* __restrict dest, int width) {
for(int i=0; i<width; i+=16) {
asm (
"pld [%[line1], #0xc00] \n"
"pld [%[line2], #0xc00] \n"
"vldm %[line1]!, {d0,d1} \n"
"vldm %[line2]!, {d2,d3} \n"
"vpaddl.u8 q0, q0 \n"
"vpaddl.u8 q1, q1 \n"
"vadd.u16 q0, q1 \n"
"vshrn.u16 d0, q0, #2 \n"
"vst1.8 {d0}, [%[dst]]! \n"
:
: [line1] "r"(src1), [line2] "r"(src2), [dst] "r"(dest)
: "q0", "q1", "memory"
);
}
}
It is about 4 times faster than the C version (tested on an iPhone 5).

Dependence of speed-up on data size using auto-vectorization and SSE

I'm trying to speed up some code using the Intel compiler's auto-vectorization and SSE.
All the computation transforms one struct node_t into another struct w_t (functions tr() and gen_tr()).
When I try to vectorize the function gen_tr(), it has no effect.
If I change the data storage format, so that each struct component is stored in its own array of floats, then auto-vectorization works well; see the function genv_tr().
The function that uses SSE explicitly is called ssev_tr (N must be evenly divisible by 4).
transform.c:
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <xmmintrin.h>
static __inline__ unsigned long getCC(void)
{
unsigned a, d;
asm volatile("rdtsc" : "=a" (a), "=d" (d));
return ((unsigned long)a) | (((unsigned long)d) << 32);
}
typedef struct {
float x1, x2, x3, x4, x5;
} node_t;
typedef struct {
float w1, w2, w3, w4;
} w_t;
void tr(node_t *n, float c1, float c2, w_t *w)
{
const float nv = n->x1;
const float N00T = n->x3 * c1;
const float n1v = n->x2;
const float N01T = n->x4 * c2;
w->w1 = nv - N00T;
w->w2 = nv + N00T;
w->w3 = n1v - N01T;
w->w4 = n1v + N01T;
}
__attribute__ ((noinline))
void gen_tr(node_t *n, w_t *w, const int N, float c1, float c2)
{
int i;
#pragma vector aligned
#pragma ivdep
for (i = 0; i < N; i++) {
tr(n + i, c1, c2, w + i);
}
}
__attribute__ ((noinline))
void genv_tr(float *x1, float *x2, float *x3, float *x4, float *x5, float *w1, float *w2, float *w3, float *w4, const int N, float c1, float c2)
{
int i;
#pragma vector aligned
#pragma ivdep
for (i = 0; i < N; i++) {
const float N00T = x3[i] * c1;
const float N01T = x4[i] * c2;
w1[i] = x1[i] - N00T;
w2[i] = x1[i] + N00T;
w3[i] = x2[i] - N01T;
w4[i] = x2[i] + N01T;
}
}
__attribute__ ((noinline))
void ssev_tr(float *x1, float *x2, float *x3, float *x4, float *x5, float *w1, float *w2, float *w3, float *w4, const int N, float c1, float c2)
{
__m128 *ws1 = (__m128*)w1;
__m128 *ws2 = (__m128*)w2;
__m128 *ws3 = (__m128*)w3;
__m128 *ws4 = (__m128*)w4;
__m128 *xs1 = (__m128*)x1;
__m128 *xs2 = (__m128*)x2;
__m128 *xs3 = (__m128*)x3;
__m128 *xs4 = (__m128*)x4;
const __m128 cs1 = _mm_set1_ps(c1);
const __m128 cs2 = _mm_set1_ps(c2);
int i;
#pragma vector aligned
#pragma ivdep
for (i = 0; i < N / 4; i++) {
const __m128 N00T = _mm_mul_ps(xs3[i], cs1);
const __m128 N01T = _mm_mul_ps(xs4[i], cs2);
ws1[i] = _mm_sub_ps(xs1[i], N00T);
ws2[i] = _mm_add_ps(xs1[i], N00T);
ws3[i] = _mm_sub_ps(xs2[i], N01T);
ws4[i] = _mm_add_ps(xs2[i], N01T);
}
}
#define test(func) \
for (i = 0; i < n; i++) { \
x[i].x1 = 1.0; \
x[i].x2 = 2.0; \
x[i].x3 = 2.0; \
x[i].x4 = 2.0; \
x[i].x5 = 2.0; \
} \
\
t1 = getCC(); \
for (i = 0; i < rep; i++) { \
func(x, w, n, c1, c2); \
} \
t2 = getCC(); \
printf("\t%f", ((double)(t2 - t1)) / n / rep);
#define test1(func) \
for (i = 0; i < n; i++) { \
x1[i] = 1.0; \
x2[i] = 2.0; \
x3[i] = 2.0; \
x4[i] = 2.0; \
x5[i] = 2.0; \
} \
\
t1 = getCC(); \
for (i = 0; i < rep; i++) { \
func(x1, x2, x3, x4, x5, w1, w2, w3, w4, n, c1, c2); \
} \
t2 = getCC(); \
printf("\t%f", ((double)(t2 - t1)) / n / rep);
int main(int argc, char *argv[])
{
if (argc < 2) {
printf("Usage %s vector_size\n", argv[0]);
return 1;
}
int n = atoi(argv[1]);
printf("%d", n);
int rep = 100000000 / n;
int i;
int inc = 1;
float c1 = 2.0, c2 = 1.0;
unsigned long t1, t2;
node_t *x = (node_t*)malloc(n * sizeof(node_t));
w_t *w = (w_t*)malloc(n * sizeof(w_t));
float *x1 = (float*)malloc(n * sizeof(float));
float *x2 = (float*)malloc(n * sizeof(float));
float *x3 = (float*)malloc(n * sizeof(float));
float *x4 = (float*)malloc(n * sizeof(float));
float *x5 = (float*)malloc(n * sizeof(float));
float *w1 = (float*)malloc(n * sizeof(float));
float *w2 = (float*)malloc(n * sizeof(float));
float *w3 = (float*)malloc(n * sizeof(float));
float *w4 = (float*)malloc(n * sizeof(float));
test(gen_tr);
test1(genv_tr);
test1(ssev_tr);
printf("\n");
return 0;
}
Compile options: icc -O3 -Wall -W -vec-report6 transform.c -o transform
Version of icc - 12.1.2, OS - Fedora 16 x86_64, CPU - Intel Core2 Quad CPU Q8200.
Then I run it with sizes from 16 to 3000 in steps of 64; here is the script:
#!/bin/bash
echo "" > run.log
for ((c=16;c<3000;c+=64))
do
./transform $c | tee -a run.log
done
Here are some results from this script (size, gen_tr, genv_tr, ssev_tr); all times are per array element:
16 7.710743 3.168577 3.253829
272 7.166493 1.983918 2.618569
528 7.121866 1.920195 2.567109
784 7.115007 1.899451 2.549645
1040 8.104026 2.481062 2.944317
1296 8.137537 5.105032 5.104614
1552 8.118534 5.068812 5.064211
1808 8.138309 5.077831 5.085015
2064 8.149699 5.107503 5.069958
2320 8.164556 5.080981 5.099313
2576 8.151524 5.086056 5.089294
2832 8.212946 5.061927 5.072261
Why is there such a significant change around size 1000 when using the vectorized versions of the function? Is it because of cache misses? Is it possible to keep the same speed over the whole range of data sizes?
You have 8 float arrays. When they are of size 1000, your test is manipulating about 32kB of data. Even though your L1 cache is probably a bit larger (64kB), the L1 cache is likely unable to hold all the 32kB data at the same time due to associativity.
Your test iterates, processing the same data over and over again. Consider the two cases:
Size = 528: The 8 arrays conveniently fit into the L1 cache. Each test iteration (except for the first one) has fast access to the data.
Size = 1296: The 8 arrays do not fit into the L1 cache at the same time. Each test iteration keeps evicting the data from L1, so effectively all reads and writes go to L2.
So the jump at input size 1000 is partly an artifact of your test, but not entirely. In the real world, if you already happen to have all the data you need in the L1 cache, genv_tr will be really fast. But on inputs of size >1000, all the inputs simply do not fit into the L1 cache, so some accesses will definitely go to L2.
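To make the working-set arithmetic concrete, here is a small sketch (a standalone calculation, not part of the benchmark) that prints how many bytes the 8 arrays touched per iteration occupy for a few of the measured sizes; comparing these figures against the L1 data cache size shows roughly where the transition has to happen:
#include <stdio.h>

int main(void)
{
    /* x1..x4 are read and w1..w4 are written each iteration: 8 float arrays */
    const int sizes[] = { 528, 784, 1040, 1296 };
    for (int i = 0; i < 4; i++) {
        int n = sizes[i];
        long bytes = 8L * n * (long)sizeof(float);
        printf("n = %4d -> working set %6ld bytes (%.1f kB)\n",
               n, bytes, bytes / 1024.0);
    }
    return 0;
}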
