Back story : uniform PRNG with arbitrary endpoints
I've got a fast uniform pseudorandom number generator that creates uniform float32 numbers in the range [1:2), i.e. u : 1 <= u <= 2-eps. Unfortunately, mapping the endpoints of [1:2) to those of an arbitrary range [a:b) is non-trivial in floating-point math. I'd like to exactly match the endpoints with a simple affine calculation.
Formally stated
I want to make an IEEE-754 32 bit floating point affine function f(x,a,b) for 1<=x<2 and arbitrary a,b that exactly maps
1 -> a and nextlower(2) -> nextlower(b)
where nextlower(q) is the next lower FP representable number (e.g. in C++ std::nextafter(float(q),float(q-1)))
What I've tried
The simple mapping f(x,a,b) = (x-1)*(b-a) + a always achieves the f(1) condition but sometimes fails the f(2) condition due to floating point rounding.
I've tried replacing the 1 with a free design parameter to cancel FP errors in the spirit of Kahan summation.
i.e. with
f(x,c0,c1,c2) = (x-c0)*c1 + c2
one mathematical solution is c0=1,c1=(b-a),c2=a (the simple mapping above),
but the extra parameter lets me play around with constants c0,c1,c2 to match the endpoints. I'm not sure I understand the principles behind Kahan summation well enough to apply them to determine the parameters or even be confident a solution exists. It feels like I'm bumping around in the dark where others might've found the light already.
Aside: I'm fine assuming the following
a < b
both a and b are far from zero, i.e. OK to ignore subnormals
a and b are far enough apart (measured in representable FP values) to mitigate non-uniform quantization and avoid degenerate cases
Update
I'm using a modified form of Chux's answer to avoid the division.
While I'm not 100% certain my refactoring kept all the magic, it does still work in all my test cases.
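#include <math.h> // for nextafterf()
// next lower representable float, as defined above
static float nextlower(float q)
{
return nextafterf(q, q - 1.0f);
}
float lerp12(float x,float a,float b)
{
const float scale = 1.0000001f;
// scale = 1/(nextlower(2) - 1);
const float ascale = a*scale;
const float bscale = nextlower(b)*scale;
return (nextlower(2) - x)*ascale + (x - 1.0f)*bscale;
}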
Note that only the last line (5 FLOPS) depends on x, so the others can be reused if (a,b) remain the same.
OP's goal
I want to make an IEEE-754 32 bit floating point affine function f(x,a,b) for 1<=x<2 and arbitrary a,b that exactly maps 1 -> a and nextlower(2) -> nextlower(b)
This differs slightly from "map range of IEEE 32bit float [1:2) to some arbitrary [a:b)".
General case
Map x0 to y0, x1 to y1 and various x in-between to y :
m = (y1 - y0)/(x1 - x0);
y = m*(x - x0) + y0;
OP's case
// x0 = 1.0f;
// x1 = nextafterf(2.0f, 1.0f);
// y0 = a;
// y1 = nextafterf(b, a);
#include <math.h> // for nextafterf()
float x = random_number_1_to_almost_2();
float m = (nextafterf(b, a) - a)/(nextafterf(2.0f, 1.0f) - 1.0f);
float y = m*(x - 1.0f) + a;
nextafterf(2.0f, 1.0f) - 1.0f, x - 1.0f and nextafterf(b, a) are exact, incurring no calculation error.
nextafterf(2.0f, 1.0f) - 1.0f is a value a little less than 1.0f.
Recommendation
Other re-formations are possible with better symmetry and numerical stability at the end-points.
float x = random_number_1_to_almost_2();
float afactor = nextafterf(2.0f, 1.0f) - x; // exact
float bfactor = x - 1.0f; // exact
float xwidth = nextafterf(2.0f, 1.0f) - 1.0f; // exact
// Do not re-order next line of code, perform 2 divisions
float y = (afactor/xwidth)*a + (bfactor/xwidth)*nextafterf(b, a);
Notice afactor/xwidth and bfactor/xwidth are both exactly 0.0 or 1.0 at the end-points, thus meeting "maps 1 -> a and nextlower(2) -> nextlower(b)". Extended precision not needed.
OP's (x-c0)*c1 + c2 has trouble as it divides (x-c0)*c1 by (2.0 - 1.0) or 1.0 (implied), when it should divide by nextafterf(2.0f, 1.0f) - 1.0f.
Simple lerping based on fused multiply-add can reliably hit the endpoints for interpolation factors 0 and 1. For x in [1, 2) the interpolation factor x - 1 does not reach unity, which can be fixed by stretching it slightly: multiply x - 1 by (2.0f / nextlower(2.0f)). Obviously, the upper endpoint also needs to be adjusted to nextlower(b). For the C code below I have used the definition of nextlower() provided in the question, which may not be what the asker desires, since for floating-point q sufficiently large in magnitude, q == (q - 1).
Asker stated in comments that it is understood that this kind of mapping is not going to result in an exactly uniform distribution of the pseudo-random numbers in the interval [a, b), only approximately so, and that pathological mappings may occur when a and b are extremely close together. I have not mathematically proved that the implementation of map() below guarantees the desired behavior, but it seems to do so for a large number of random test cases.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float nextlowerf (float q)
{
return nextafterf (q, q - 1);
}
float map (float a, float b, float x)
{
float t = (x - 1.0f) * (2.0f / nextlowerf (2.0f));
return fmaf (t, nextlowerf (b), fmaf (-t, a, a));
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
float a, b, x, r;
float FP32_MIN_NORM = 0x1.000000p-126f;
float FP32_MAX_NORM = 0x1.fffffep+127f;
do {
do {
a = uint32_as_float (KISS);
} while ((fabsf (a) < FP32_MIN_NORM) || (fabsf (a) > FP32_MAX_NORM) || isnan (a));
do {
b = uint32_as_float (KISS);
} while ((fabsf (b) < FP32_MIN_NORM) || (fabsf (b) > FP32_MAX_NORM) || isnan (b) || (b < a));
x = 1.0f;
r = map (a, b, x);
if (r != a) {
printf ("lower bound failed: a=%12.6a b=%12.6a map=%12.6a\n", a, b, r);
return EXIT_FAILURE;
}
x = nextlowerf (2.0f);
r = map (a, b, x);
if (r != nextlowerf (b)) {
printf ("upper bound failed: a=%12.6a b=%12.6a map=%12.6a\n", a, b, r);
return EXIT_FAILURE;
}
} while (1);
return EXIT_SUCCESS;
}
Givens rotations provide a robust and easily parallelizable way to implement QR decomposition. A Givens rotation requires the computation of sine and cosine components of a rotation angle. In the case of real computation, this typically involves the computation of the reciprocal of the hypot() function to normalize a two-vector, as shown for example in Wikipedia.
While this avoids most cases of overflow and underflow in intermediate computation, for very large values a, b, hypot(a,b) may overflow to infinity, while 1/√(a²+b²) is actually representable as a subnormal floating-point number. Also, the use of a division adds further computational cost that can be significant on platforms with slow floating-point division.
A function rhypot(a,b) that directly computes 1/√(a²+b²) at a cost similar to the standard hypot() function would therefore be desirable. The accuracy should be the same as or better than the naive approach of computing 1.0/hypot(a,b). With a correctly-rounded hypot function, this expression has a maximum error of 1.5 ulps.
How can such a function be implemented efficiently and accurately? The use of IEEE-754 binary floating-point arithmetic and the availability of native hardware support for fused multiply-add (FMA) operations can be assumed. For ease of exposition and testing, we can restrict to single-precision computation, i.e. the IEEE-754 binary32 format.
In the following, I am showing ISO-C99 code that implements rhypot with good accuracy and good performance. The general algorithm is directly derived from the example implementations I showed for hypot in this answer. For hypot, one determines the value of largest magnitude among the arguments, then finds a scale factor (a power of two for reasons of accuracy) that maps this value into the vicinity of unity. The scale factor is applied to both arguments, and the length of this transformed 2-vector is then computed with the sqrt function. Finally, the result is scaled back with the "inverse" of the scale factor. The scaling relies on actual multiplication as the arguments may be subnormals that cannot be scaled correctly by simple exponent manipulation alone.
For rhypot, only two changes are needed: the reciprocal square root function rsqrt must be used instead of sqrt, and input scaling and result scaling use the same scale factor.
Some computing environments provide an rsqrt() function, and this function is scheduled for inclusion in a future version of the ISO C standard (ISO/IEC TS 18661-4:2015). For environments that do not provide a reciprocal square root function, I am showing some portable (within the platform requirements stated in the question) and machine-specific implementations.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
float my_rsqrtf (float);
/* Compute the reciprocal of sqrt (a**2 + b**2), avoiding premature overflow
and underflow in intermediate computation. The accuracy of this function
depends on the accuracy of the reciprocal square root implementation used.
With the rsqrtf() implementations shown below, the following maximum ulp
error was observed for 2**36 random test cases:
CORRECTLY_ROUNDED 1.20736973
SSE_HALLEY 1.33120522
SSE_2NR 1.42086841
SQRT_OOX 1.42906701
BIT_TWIDDLE_3NR 1.43062950
ITO_TAKAGI_YAJIMA_1NR 1.43681737
BIT_TWIDDLE_NR_HALLEY 1.47485797
*/
float my_rhypotf (float a, float b)
{
float fa, fb, mn, mx, scale, s, w, res;
uint32_t expo;
/* sort arguments by magnitude */
fa = fabsf (a);
fb = fabsf (b);
mx = fmaxf (fa, fb);
mn = fminf (fa, fb);
/* compute scale factor */
expo = __float_as_uint32 (mx) & 0xfc000000;
scale = __uint32_as_float (0x7e000000 - expo);
/* scale operand of maximum magnitude towards unity */
mn = mn * scale;
mx = mx * scale;
/* mx in [2**-23, 2**6) */
s = fmaf (mx, mx, mn * mn); // 0.75 ulp
w = my_rsqrtf (s);
/* reverse previous scaling */
res = w * scale;
/* handle special cases */
float t = a + b;
if (!(fabsf (t) <= INFINITY)) res = t; // isnan(t)
if (mx == INFINITY) res = 0.0f; // isinf(mx)
return res;
}
#define CORRECTLY_ROUNDED (1)
#define SSE_HALLEY (2)
#define SSE_2NR (3)
#define ITO_TAKAGI_YAJIMA_1NR (4)
#define SQRT_OOX (5)
#define BIT_TWIDDLE_3NR (6)
#define BIT_TWIDDLE_NR_HALLEY (7)
#define RSQRT_VARIANT (SSE_HALLEY)
#if (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
#include "immintrin.h"
#endif // (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
float my_rsqrtf (float a)
{
#if RSQRT_VARIANT == CORRECTLY_ROUNDED
float r = (float) sqrt (1.0/(double)a);
#elif RSQRT_VARIANT == SQRT_OOX
float r = sqrtf (1.0f / a);
#elif RSQRT_VARIANT == SSE_2NR
float r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using two Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == SSE_HALLEY
float e, r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_3NR
float r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using three Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_NR_HALLEY
float e, r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == ITO_TAKAGI_YAJIMA_1NR
/* Masayuki Ito, Naofumi Takagi, Shuzo Yajima, "Efficient Initial
Approximation for Multiplicative Division and Square Root by a
Multiplication with Operand Modification". IEEE Transactions on
Computers, Vol. 46, No. 4, April 1997, pp. 495-498.
*/
#define TAB_INDEX_BITS (7)
#define TAB_ENTRY_BITS (16)
#define TAB_ENTRIES (1 << TAB_INDEX_BITS)
#define FP32_EXPO_BIAS (127)
#define FP32_MANT_BITS (23)
#define FP32_SIGN_MASK (0x80000000)
#define FP32_EXPO_MASK (0x7f800000)
#define FP32_EXPO_LSB_MASK (1u << FP32_MANT_BITS)
#define FP32_INDEX_MASK (((1u << TAB_INDEX_BITS) - 1) << (FP32_MANT_BITS - TAB_INDEX_BITS))
#define FP32_XHAT_MASK (~(FP32_INDEX_MASK | FP32_SIGN_MASK) | FP32_EXPO_MASK)
#define FP32_FLIP_BIT_MASK (3u << (FP32_MANT_BITS - TAB_INDEX_BITS - 1))
#define FP32_ONE_HALF (0x3f000000)
const uint16_t d1tab [TAB_ENTRIES] = {
0xb2ec, 0xaed7, 0xaae9, 0xa720, 0xa37b, 0x9ff7, 0x9c93, 0x994d,
0x9623, 0x9316, 0x9022, 0x8d47, 0x8a85, 0x87d8, 0x8542, 0x82c0,
0x8053, 0x7bf0, 0x775f, 0x72f1, 0x6ea4, 0x6a77, 0x666a, 0x6279,
0x5ea5, 0x5aed, 0x574e, 0x53c9, 0x505d, 0x4d07, 0x49c8, 0x469e,
0x438a, 0x408a, 0x3d9e, 0x3ac4, 0x37fc, 0x3546, 0x32a0, 0x300b,
0x2d86, 0x2b10, 0x28a8, 0x264f, 0x2404, 0x21c6, 0x1f95, 0x1d70,
0x1b58, 0x194c, 0x174b, 0x1555, 0x136a, 0x1189, 0x0fb2, 0x0de6,
0x0c22, 0x0a68, 0x08b7, 0x070f, 0x056f, 0x03d8, 0x0249, 0x00c1,
0xfd08, 0xf742, 0xf1b4, 0xec5a, 0xe732, 0xe239, 0xdd6d, 0xd8cc,
0xd454, 0xd002, 0xcbd6, 0xc7cd, 0xc3e5, 0xc01d, 0xbc75, 0xb8e9,
0xb57a, 0xb225, 0xaeeb, 0xabc9, 0xa8be, 0xa5cb, 0xa2ed, 0xa024,
0x9d6f, 0x9ace, 0x983e, 0x95c1, 0x9355, 0x90fa, 0x8eae, 0x8c72,
0x8a45, 0x8825, 0x8614, 0x8410, 0x8219, 0x802e, 0x7c9c, 0x78f5,
0x7565, 0x71eb, 0x6e85, 0x6b31, 0x67f3, 0x64c7, 0x61ae, 0x5ea7,
0x5bb0, 0x58cb, 0x55f6, 0x5330, 0x5079, 0x4dd1, 0x4b38, 0x48ad,
0x462f, 0x43be, 0x4159, 0x3f01, 0x3cb5, 0x3a75, 0x3840, 0x3616
};
uint32_t arg, idx, d1, xhat;
float r;
arg = __float_as_uint32 (a);
idx = (arg >> ((FP32_MANT_BITS + 1) - TAB_INDEX_BITS)) & ((1u << TAB_INDEX_BITS) - 1);
d1 = FP32_ONE_HALF | (d1tab[idx] << ((FP32_MANT_BITS + 1) - TAB_ENTRY_BITS));
xhat = ((arg & FP32_INDEX_MASK) | (((((3 * FP32_EXPO_BIAS) << FP32_MANT_BITS) + ~arg) >> 1) & FP32_XHAT_MASK)) ^ FP32_FLIP_BIT_MASK;
/* compute initial approximation, accurate to about 14 bits */
r = __uint32_as_float (d1) * __uint32_as_float (xhat);
/* refine approximation with one Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#else
#error unsupported RSQRT_VARIANT
#endif // RSQRT_VARIANT
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float type's limited
exponent range.
*/
refi = __double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
double rhypot (double a, double b)
{
return 1.0 / hypot (a, b);
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
#define FP32_QNAN_BIT (0x00400000)
int main (void)
{
float af, bf, resf, reff;
uint32_t ai, bi, resi, refi;
double ref, err, maxerr = 0;
uint64_t diff, diffsum = 0, count = 1ULL << 36;
do {
ai = KISS;
bi = KISS;
af = __uint32_as_float (ai);
bf = __uint32_as_float (bi);
resf = my_rhypotf (af, bf);
ref = rhypot ((double)af, (double)bf);
reff = (float)ref;
refi = __float_as_uint32 (reff);
resi = __float_as_uint32 (resf);
diff = llabs ((long long int)resi - (long long int)refi);
/* If both inputs are a NaN, result can be either argument, converted
to QNaN if necessary. If one input is NaN and the other not infinity
the NaN input must be returned, converted to QNaN if necessary. If
one input is infinity, zero must be returned even if the other input
is a NaN. In all other cases allow up to 1 ulp of difference.
*/
if ((isnan (af) && isnan (bf) && (resi != (ai | FP32_QNAN_BIT)) && (resi != (bi | FP32_QNAN_BIT))) ||
(isnan (af) && !isinf (bf) && !isnan (bf) && (resi != (ai | FP32_QNAN_BIT))) ||
(isnan (bf) && !isinf (af) && !isnan (af) && (resi != (bi | FP32_QNAN_BIT))) ||
(isinf (af) && (resi != 0)) ||
(isinf (bf) && (resi != 0)) ||
(diff > 1)) {
printf ("err # (%08x,%08x): res= %08x (%15.8e) ref=%08x (%15.8e)\n",
ai, bi, resi, resf, refi, reff);
return EXIT_FAILURE;
}
diffsum += diff;
err = floatUlpErr (resf, ref);
if (err > maxerr) {
printf ("ulp=%.8f # (% 15.8e, % 15.8e): res=%15.6a ref=%22.13a\n",
err, af, bf, resf, ref);
maxerr = err;
}
count--;
} while (count);
printf ("diffsum = %llu\n", diffsum);
return EXIT_SUCCESS;
}
I'm interested in a fast method for "expanding bits," which can be defined as the following:
Let B be a binary number with n bits, i.e. B \in {0,1}^n
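Let P be the positions of all 1/true bits in B, in increasing order, i.e. ((B >> p[i]) & 1) == 1 for each i, and |P|=k
For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that bit p[j] of Ap equals bit j of A, i.e. ((Ap >> p[j]) & 1) == A[j], and all other bits of Ap are 0.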
The result of the "bit expansion" is Ap.
A couple examples:
Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
Given B: 1001 1001, A: 1101, then Ap should be 1001 0001
Following is a straightforward algorithm, but I can't shake the feeling that there's a faster/easier way to do this.
unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
int k = popcount(B); // cuda function, but there are good methods for this
unsigned int Ap = 0;
int j = k-1;
// Starting at the most significant bit,
for (int i = n - 1; i >= 0; --i) {
Ap <<= 1;
// if B is 1, add the value at A[j] to Ap, decrement j.
if (B & (1u << i)) {
Ap += (A >> j--) & 1;
}
}
return Ap;
}
The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.
I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.
In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.
The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.
In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0;
int i;
for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
}
return r;
}
The variant without any shift operations alluded to above looks as follows:
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0, m = 1;
while (b) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & m); // spread current bit of 'a' to more significant bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
m = m + m; // mask for next bit of 'a'
}
return r;
}
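For testing on a CPU, the shift-free variant can be ported to plain C and cross-checked against a simple loop. This is a sketch assuming a 32-bit unsigned int; the names pdep_host and expand_bits_ref are mine. The spreading trick is safe because the k-th 1-bit of the mask always sits at position k or higher, so whenever bit k of 'a' is set, the spread value s covers the isolated mask bit l.

```c
/* Host-side C port of the shift-free variant (no __device__ qualifier) */
unsigned int pdep_host (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) {
        l = b & (0 - b);   /* least significant 1-bit of mask */
        b = b ^ l;         /* remove it from the mask */
        s = 0 - (a & m);   /* all bits at or above the current bit of 'a' */
        r = r | (l & s);   /* l survives only if that bit of 'a' is set */
        m = m + m;         /* advance to next bit of 'a' */
    }
    return r;
}

/* Simple reference loop over all 32 mask positions, for cross-checking */
unsigned int expand_bits_ref (unsigned int a, unsigned int b)
{
    unsigned int r = 0;
    int j = 0;             /* index of next bit of 'a' to deposit */
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u) {
            r |= ((a >> j) & 1u) << i;
            j++;
        }
    }
    return r;
}
```

With the two examples from the question, pdep_host(0x6, 0x2E) gives 0x0C and pdep_host(0xD, 0x99) gives 0x91.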
In comments below, @Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.