How to generate uniform single precision floating point random number between 0 and 1 in FPGA? - random

I am trying to generate single-precision floating-point random numbers on an FPGA by generating numbers between 0 and 0x3f800000 (the IEEE-754 encoding of 1.0). But since there are more discrete points near zero than near 1, I am not getting a uniform distribution. Is there any transformation I can apply to get uniform generation? I am using a 32-bit LFSR and Xoshiro random number generation.

A standard way to generate uniformly distributed floats in [0,1) from uniformly distributed 32-bit unsigned integers is to multiply the integers by 2^-32. Obviously we wouldn't instantiate a floating-point multiplier on the FPGA just for this purpose, and we do not have to, since the multiplier is a power of two. In essence what is needed is a conversion of the integer to a floating-point number, then decrementing the exponent of the floating-point number by 32. This does not work for a zero input, which has to be handled as a special case. In the ISO-C99 code below I am assuming that float is mapped to the IEEE-754 binary32 type.
Other than for certain special cases, the significand of an IEEE-754 binary floating-point number is normalized to [1,2). To convert an integer into the significand, we need to normalize it, so the most significant bit is set. We can do this by counting the number of leading zero bits, then left shifting the number by that amount. The count of leading zeros is also needed to adjust the exponent.
The significand of a binary32 number comprises 24 bits, of which only 23 bits are stored; the most significant bit (the integer bit) is always one and therefore implicit. This means not all of the 32 bits of the integer can be incorporated into the binary32, so in converting a 32-bit unsigned integer one usually rounds to 24-bit precision. To simplify the implementation, in the code below I simply truncate by cutting off the least significant eight bits, which should have no noticeable effect on the uniform distribution. For the exponent part, we can combine the adjustments due to the normalization step with the subtraction due to the scale factor of 2^-32.
The code below is written using hardware-centric primitives. Extracting a bit is just a question of grabbing the correct wire, and shifts by fixed amounts are likewise simply wire shifts. The circuit needed to count the number of leading zeros is typically called a priority encoder.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#define USE_FP_MULTIPLY (0)
uint32_t bit (uint32_t, uint32_t);
uint32_t mux (uint32_t, uint32_t, uint32_t);
uint32_t clz (uint32_t);
float uint32_as_float (uint32_t);
/* uniform float in [0, 1) from uniformly distributed random integers */
float uniform_rand_01 (uint32_t i)
{
const uint32_t FP32_EXPO_BIAS = 127;
const uint32_t FP32_MANT_BITS = 24;
const uint32_t FP32_STORED_MANT_BITS = FP32_MANT_BITS - 1;
uint32_t lz, r;
// compute shift amount needed for normalization
lz = clz (i);
// normalize so that msb is set, except when input is zero
i = mux (bit (lz, 4), i << 16, i);
i = mux (bit (lz, 3), i << 8, i);
i = mux (bit (lz, 2), i << 4, i);
i = mux (bit (lz, 1), i << 2, i);
i = mux (bit (lz, 0), i << 1, i);
// build bit pattern for IEEE-754 binary32 floating-point number
r = (((FP32_EXPO_BIAS - 2 - lz) << FP32_STORED_MANT_BITS) +
(i >> (32 - FP32_MANT_BITS)));
// handle special case of zero input
r = mux (i == 0, i, r);
// treat bit-pattern as 'float'
return uint32_as_float (r);
}
// extract bit i from x
uint32_t bit (uint32_t x, uint32_t i)
{
return (x >> i) & 1;
}
// simulate 2-to-1 multiplexer: c ? a : b ; c must be in {0,1}
uint32_t mux (uint32_t c, uint32_t a, uint32_t b)
{
uint32_t m = c * 0xffffffff;
return (a & m) | (b & ~m);
}
// count leading zeros. A priority encoder in hardware.
uint32_t clz (uint32_t x)
{
uint32_t m, c, y, n = 32;
y = x >> 16; m = n - 16; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 8; m = n - 8; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 4; m = n - 4; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 2; m = n - 2; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 1; m = n - 2; c = (y != 0); n = mux (c, m, n - x);
return n;
}
// re-interpret bit pattern of a 32-bit integer as an IEEE-754 binary32
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
#define N 100
uint32_t bucket [N];
int main (void)
{
for (int k = 0; k < 100000; k++) { // loop counter renamed so that 'i' below does not shadow it
uint32_t i = KISS;
#if USE_FP_MULTIPLY
float r = i * 0x1.0p-32f;
#else // USE_FP_MULTIPLY
float r = uniform_rand_01 (i);
#endif // USE_FP_MULTIPLY
bucket [(int)(r * N)]++;
}
for (int i = 0; i < N; i++) {
printf ("bucket [%2d]: [%.5f,%.5f): %u\n",
i, 1.0f*i/N, (i+1.0f)/N, bucket[i]);
}
return EXIT_SUCCESS;
}
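For cross-checking the hardware-centric routine in software, a small reference model can be handy. The sketch below is my own addition, not part of the code above: the name uniform_rand_01_ref is hypothetical, and it assumes float is IEEE-754 binary32. It normalizes the integer, keeps the top 24 bits, and scales by an exactly representable power of two, so it should produce the same truncated value of i * 2^-32 as uniform_rand_01().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Re-interpret a 32-bit pattern as a float (assumes IEEE-754 binary32). */
static float bits_to_float (uint32_t u)
{
    float f;
    assert (sizeof f == sizeof u);
    memcpy (&f, &u, sizeof f);
    return f;
}

/* Hypothetical reference model: i * 2**-32 truncated to 24 significant bits. */
float uniform_rand_01_ref (uint32_t i)
{
    uint32_t lz = 0;
    if (i == 0) return 0.0f;                  /* zero is a special case */
    while (!(i & 0x80000000u)) {              /* normalize: shift until msb is set */
        i <<= 1;
        lz++;
    }
    uint32_t mant = i >> 8;                   /* keep 24 bits, exact in a float */
    /* build the power of two 2**(-24-lz) directly from its exponent field */
    float scale = bits_to_float ((127u - 24u - lz) << 23);
    return (float)mant * scale;               /* exact: only the exponent changes */
}
```

Comparing this against uniform_rand_01() for a few million random inputs is a cheap sanity check before moving the design to HDL.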

Please check the xoshiro128+ reference implementation here: https://prng.di.unimi.it/xoshiro128plus.c
A VHDL implementation can be found here:
https://github.com/jorisvr/vhdl_prng/tree/master/rtl
The seed value there is generated by another random number generation algorithm, so don't get confused by this.
Depending on the seed value used, it should give a uniform distribution.
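To make the idea concrete, here is a minimal C sketch of the xoshiro128+ step as I understand it from the reference file linked above (please verify the constants against that file). The function names and the seeding helper are hypothetical. The float conversion uses only the upper 24 bits of each output, which gives evenly spaced values in [0,1).

```c
#include <assert.h>
#include <stdint.h>

static uint32_t xs[4];   /* generator state; must not be all zero */

/* Hypothetical seeding helper; any not-all-zero state is acceptable. */
void xoshiro128plus_seed (uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    assert ((a | b | c | d) != 0);
    xs[0] = a; xs[1] = b; xs[2] = c; xs[3] = d;
}

static uint32_t rotl32 (uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

/* One xoshiro128+ step: output is the sum of two state words. */
uint32_t xoshiro128plus_next (void)
{
    const uint32_t result = xs[0] + xs[3];
    const uint32_t t = xs[1] << 9;
    xs[2] ^= xs[0];
    xs[3] ^= xs[1];
    xs[1] ^= xs[2];
    xs[0] ^= xs[3];
    xs[2] ^= t;
    xs[3] = rotl32 (xs[3], 11);
    return result;
}

/* Uniform float in [0,1): top 24 bits times 2**-24 (exact in binary32). */
float xoshiro128plus_float (void)
{
    return (float)(xoshiro128plus_next () >> 8) * 0x1.0p-24f;
}
```

In hardware the state update is just XORs, one fixed shift and one fixed rotate, so it maps to wiring plus a handful of LUTs; only the 32-bit adder for the output needs carry logic.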

Related

map range of IEEE 32bit float [1:2) to some arbitrary [a:b)

Back story : uniform PRNG with arbitrary endpoints
I've got a fast uniform pseudo random number generator that creates uniform float32 numbers in range [1:2) i.e. u : 1 <= u <= 2-eps. Unfortunately mapping the endpoints [1:2) to that of an arbitrary range [a:b) is non-trivial in floating point math. I'd like to exactly match the endpoints with a simple affine calculation.
Formally stated
I want to make an IEEE-754 32 bit floating point affine function f(x,a,b) for 1<=x<2 and arbitrary a,b that exactly maps
1 -> a and nextlower(2) -> nextlower(b)
where nextlower(q) is the next lower FP representable number (e.g. in C++ std::nextafter(float(q),float(q-1)))
What I've tried
The simple mapping f(x,a,b) = (x-1)*(b-a) + a always achieves the f(1) condition but sometimes fails the f(2) condition due to floating point rounding.
I've tried replacing the 1 with a free design parameter to cancel FP errors in the spirit of Kahan summation.
i.e. with
f(x,c0,c1,c2) = (x-c0)*c1 + c2
one mathematical solution is c0=1,c1=(b-a),c2=a (the simple mapping above),
but the extra parameter lets me play around with constants c0,c1,c2 to match the endpoints. I'm not sure I understand the principles behind Kahan summation well enough to apply them to determine the parameters or even be confident a solution exists. It feels like I'm bumping around in the dark where others might've found the light already.
Aside: I'm fine assuming the following
a < b
both a and b are far from zero, i.e. OK to ignore subnormals
a and b are far enough apart (measured in representable FP values) to mitigate non-uniform quantization and avoid degenerate cases
Update
I'm using a modified form of Chux's answer to avoid the division.
While I'm not 100% certain my refactoring kept all the magic, it does still work in all my test cases.
float lerp12(float x,float a,float b)
{
const float scale = 1.0000001f;
// scale = 1/(nextlower(2) - 1);
const float ascale = a*scale;
const float bscale = nextlower(b)*scale;
return (nextlower(2) - x)*ascale + (x - 1.0f)*bscale;
}
Note that only the last line (5 FLOPS) depends on x, so the others can be reused if (a,b) remain the same.
OP's goal
I want to make an IEEE-754 32 bit floating point affine function f(x,a,b) for 1<=x<2 and arbitrary a,b that exactly maps 1 -> a and nextlower(2) -> nextlower(b)
This differs slightly from "map range of IEEE 32bit float [1:2) to some arbitrary [a:b)".
General case
Map x0 to y0, x1 to y1 and various x in-between to y :
m = (y1 - y0)/(x1 - x0);
y = m*(x - x0) + y0;
OP's case
// x0 = 1.0f;
// x1 = nextafterf(2.0f, 1.0f);
// y0 = a;
// y1 = nextafterf(b, a);
#include <math.h> // for nextafterf()
float x = random_number_1_to_almost_2();
float m = (nextafterf(b, a) - a)/(nextafterf(2.0f, 1.0f) - 1.0f);
float y = m*(x - 1.0f) + a;
nextafterf(2.0f, 1.0f) - 1.0f, x - 1.0f and nextafterf(b, a) are exact, incurring no calculation error.
nextafterf(2.0f, 1.0f) - 1.0f is a value a little less than 1.0f.
Recommendation
Other reformulations are possible with better symmetry and numerical stability at the end-points.
float x = random_number_1_to_almost_2();
float afactor = nextafterf(2.0f, 1.0f) - x; // exact
float bfactor = x - 1.0f; // exact
float xwidth = nextafterf(2.0f, 1.0f) - 1.0f; // exact
// Do not re-order next line of code, perform 2 divisions
float y = (afactor/xwidth)*a + (bfactor/xwidth)*nextafterf(b, a);
Notice afactor/xwidth and bfactor/xwidth are both exactly 0.0 or 1.0 at the end-points, thus meeting "maps 1 -> a and nextlower(2) -> nextlower(b)". Extended precision not needed.
OP's (x-c0)*c1 + c2 has trouble as it divides (x-c0)*c1 by (2.0 - 1.0) or 1.0 (implied), when it should divide by nextafterf(2.0f, 1.0f) - 1.0f.
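The recommended two-division form can be wrapped in a small function. This is only a sketch: the names map_1_2 and X_HI are hypothetical, and the upper target y1 is passed in by the caller (for OP's case it would be nextafterf(b, a), computed once outside the function).

```c
#include <assert.h>

/* Largest binary32 value below 2.0f, i.e. the value of nextafterf(2.0f, 1.0f). */
#define X_HI 0x1.fffffep+0f

/* Map x in [1, X_HI] affinely onto [y0, y1], hitting both end-points exactly.
   Hypothetical helper; y1 is expected to be nextafterf(b, a). */
float map_1_2 (float x, float y0, float y1)
{
    assert (x >= 1.0f && x <= X_HI);
    const float xwidth = X_HI - 1.0f;          /* exact */
    float afactor = (X_HI - x) / xwidth;       /* exactly 1 at x = 1, 0 at x = X_HI */
    float bfactor = (x - 1.0f) / xwidth;       /* exactly 0 at x = 1, 1 at x = X_HI */
    return afactor * y0 + bfactor * y1;        /* keep both divisions, do not re-order */
}
```

At x = 1.0f the result is 1*y0 + 0*y1, and at x = X_HI it is 0*y0 + 1*y1, so both end-points come out exactly for finite y0 and y1.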
Simple lerping based on fused multiply-add can reliably hit the endpoints for interpolation factors 0 and 1. For x in [1, 2) the interpolation factor x - 1 does not reach unity, which can be fixed by slight stretching by multiplying x-1 with (2.0f / nextlower(2.0f)). Obviously the endpoint needs to also be adjusted to the endpoint nextlower(b). For the C code below I have used the definition of nextlower() provided in the question, which may not be what asker desires, since for floating-point q sufficiently large in magnitude, q == (q - 1).
Asker stated in comments that it is understood that this kind of mapping is not going to result in an exactly uniform distribution of the pseudo-random numbers in the interval [a, b), only approximately so, and that pathological mappings may occur when a and b are extremely close together. I have not mathematically proved that the implementation of map() below guarantees the desired behavior, but it seems to do so for a large number of random test cases.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float nextlowerf (float q)
{
return nextafterf (q, q - 1);
}
float map (float a, float b, float x)
{
float t = (x - 1.0f) * (2.0f / nextlowerf (2.0f));
return fmaf (t, nextlowerf (b), fmaf (-t, a, a));
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
float a, b, x, r;
float FP32_MIN_NORM = 0x1.000000p-126f;
float FP32_MAX_NORM = 0x1.fffffep+127f;
do {
do {
a = uint32_as_float (KISS);
} while ((fabsf (a) < FP32_MIN_NORM) || (fabsf (a) > FP32_MAX_NORM) || isnan (a));
do {
b = uint32_as_float (KISS);
} while ((fabsf (b) < FP32_MIN_NORM) || (fabsf (b) > FP32_MAX_NORM) || isnan (b) || (b < a));
x = 1.0f;
r = map (a, b, x);
if (r != a) {
printf ("lower bound failed: a=%12.6a b=%12.6a map=%12.6a\n", a, b, r);
return EXIT_FAILURE;
}
x = nextlowerf (2.0f);
r = map (a, b, x);
if (r != nextlowerf (b)) {
printf ("upper bound failed: a=%12.6a b=%12.6a map=%12.6a\n", a, b, r);
return EXIT_FAILURE;
}
} while (1);
return EXIT_SUCCESS;
}

Efficient and accurate computation of the reciprocal of hypot(a,b)

Givens rotations provide a robust and easily parallelizable way to implement QR decomposition. A Givens rotation requires the computation of sine and cosine components of a rotation angle. In the case of real computation, this typically involves the computation of the reciprocal of the hypot() function to normalize a two-vector, as shown for example in Wikipedia.
While this avoids most cases of overflow and underflow in intermediate computation, for very large values a, b, hypot(a,b) may overflow to infinity, while 1/√(a²+b²) is actually representable as a subnormal floating-point number. Also, the use of a division adds further computational cost that can be significant on platforms with slow floating-point division.
A function rhypot(a,b) that directly computes 1/√(a²+b²) at a cost similar to the standard hypot() function would therefore be desirable. The accuracy should be the same as or better than the naive approach of computing 1.0/hypot(a,b). With a correctly-rounded hypot function, this expression has a maximum error of 1.5 ulps.
How can such a function be implemented efficiently and accurately? The use of IEEE-754 binary floating-point arithmetic and the availability of native hardware support for fused multiply-add (FMA) operations can be assumed. For ease of exposition and testing, we can restrict to single-precision computation, i.e. the IEEE-754 binary32 format.
In the following, I am showing ISO-C99 code that implements rhypot with good accuracy and good performance. The general algorithm is directly derived from the example implementations I showed for hypot in this answer. For hypot, one determines the value of largest magnitude among the arguments, then finds a scale factor (a power of two for reasons of accuracy) that maps this value into the vicinity of unity. The scale factor is applied to both arguments, and the length of this transformed 2-vector is then computed with the sqrt function; finally the result is scaled back with the "inverse" of the scale factor. The scaling relies on actual multiplication as the arguments may be subnormals that cannot be scaled correctly by simple exponent manipulation alone.
For rhypot, only two changes are needed: the reciprocal square root function rsqrt must be used instead of sqrt, and input scaling and result scaling use the same scale factor.
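The reason the same scale factor appears twice is that 1/sqrt((s*a)^2 + (s*b)^2) = (1/s) * 1/sqrt(a^2 + b^2), so multiplying the rsqrt result by s again undoes the input scaling. The following sketch isolates just the scale-factor computation; the helper name rhypot_scale is hypothetical, and it assumes a non-negative, finite binary32 input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: power-of-two scale factor that moves mx (mx >= 0,
   finite) towards unity. Only the sign bit and the five most significant
   exponent bits of mx are used, so the result is always a power of two. */
float rhypot_scale (float mx)
{
    uint32_t bits;
    float scale;
    assert (sizeof bits == sizeof mx);
    memcpy (&bits, &mx, sizeof bits);
    bits = 0x7e000000u - (bits & 0xfc000000u);  /* negate the coarse exponent */
    memcpy (&scale, &bits, sizeof scale);
    return scale;
}
```

For example, an input of 1.0f yields a scale of 32.0f, and an input near the top of the binary32 range yields 2^-123, which brings the scaled value back to a moderate magnitude where squaring cannot overflow.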
Some computing environments provide an rsqrt() function, and this function is scheduled for inclusion in a future version of the ISO C standard (ISO/IEC TS 18661-4:2015). For environments that do not provide a reciprocal square root function, I am showing some portable (within the platform requirements stated in the question) and machine-specific implementations.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
float my_rsqrtf (float);
/* Compute the reciprocal of sqrt (a**2 + b**2), avoiding premature overflow
and underflow in intermediate computation. The accuracy of this function
depends on the accuracy of the reciprocal square root implementation used.
With the rsqrtf() implementations shown below, the following maximum ulp
error was observed for 2**36 random test cases:
CORRECTLY_ROUNDED 1.20736973
SSE_HALLEY 1.33120522
SSE_2NR 1.42086841
SQRT_OOX 1.42906701
BIT_TWIDDLE_3NR 1.43062950
ITO_TAKAGI_YAJIMA_1NR 1.43681737
BIT_TWIDDLE_NR_HALLEY 1.47485797
*/
float my_rhypotf (float a, float b)
{
float fa, fb, mn, mx, scale, s, w, res;
uint32_t expo;
/* sort arguments by magnitude */
fa = fabsf (a);
fb = fabsf (b);
mx = fmaxf (fa, fb);
mn = fminf (fa, fb);
/* compute scale factor */
expo = __float_as_uint32 (mx) & 0xfc000000;
scale = __uint32_as_float (0x7e000000 - expo);
/* scale operand of maximum magnitude towards unity */
mn = mn * scale;
mx = mx * scale;
/* mx in [2**-23, 2**6) */
s = fmaf (mx, mx, mn * mn); // 0.75 ulp
w = my_rsqrtf (s);
/* reverse previous scaling */
res = w * scale;
/* handle special cases */
float t = a + b;
if (!(fabsf (t) <= INFINITY)) res = t; // isnan(t)
if (mx == INFINITY) res = 0.0f; // isinf(mx)
return res;
}
#define CORRECTLY_ROUNDED (1)
#define SSE_HALLEY (2)
#define SSE_2NR (3)
#define ITO_TAKAGI_YAJIMA_1NR (4)
#define SQRT_OOX (5)
#define BIT_TWIDDLE_3NR (6)
#define BIT_TWIDDLE_NR_HALLEY (7)
#define RSQRT_VARIANT (SSE_HALLEY)
#if (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
#include "immintrin.h"
#endif // (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
float my_rsqrtf (float a)
{
#if RSQRT_VARIANT == CORRECTLY_ROUNDED
float r = (float) sqrt (1.0/(double)a);
#elif RSQRT_VARIANT == SQRT_OOX
float r = sqrtf (1.0f / a);
#elif RSQRT_VARIANT == SSE_2NR
float r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using two Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == SSE_HALLEY
float e, r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_3NR
float r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using three Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_NR_HALLEY
float e, r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == ITO_TAKAGI_YAJIMA_1NR
/* Masayuki Ito, Naofumi Takagi, Shuzo Yajima, "Efficient Initial
Approximation for Multiplicative Division and Square Root by a
Multiplication with Operand Modification". IEEE Transactions on
Computers, Vol. 46, No. 4, April 1997, pp. 495-498.
*/
#define TAB_INDEX_BITS (7)
#define TAB_ENTRY_BITS (16)
#define TAB_ENTRIES (1 << TAB_INDEX_BITS)
#define FP32_EXPO_BIAS (127)
#define FP32_MANT_BITS (23)
#define FP32_SIGN_MASK (0x80000000)
#define FP32_EXPO_MASK (0x7f800000)
#define FP32_EXPO_LSB_MASK (1u << FP32_MANT_BITS)
#define FP32_INDEX_MASK (((1u << TAB_INDEX_BITS) - 1) << (FP32_MANT_BITS - TAB_INDEX_BITS))
#define FP32_XHAT_MASK (~(FP32_INDEX_MASK | FP32_SIGN_MASK) | FP32_EXPO_MASK)
#define FP32_FLIP_BIT_MASK (3u << (FP32_MANT_BITS - TAB_INDEX_BITS - 1))
#define FP32_ONE_HALF (0x3f000000)
const uint16_t d1tab [TAB_ENTRIES] = {
0xb2ec, 0xaed7, 0xaae9, 0xa720, 0xa37b, 0x9ff7, 0x9c93, 0x994d,
0x9623, 0x9316, 0x9022, 0x8d47, 0x8a85, 0x87d8, 0x8542, 0x82c0,
0x8053, 0x7bf0, 0x775f, 0x72f1, 0x6ea4, 0x6a77, 0x666a, 0x6279,
0x5ea5, 0x5aed, 0x574e, 0x53c9, 0x505d, 0x4d07, 0x49c8, 0x469e,
0x438a, 0x408a, 0x3d9e, 0x3ac4, 0x37fc, 0x3546, 0x32a0, 0x300b,
0x2d86, 0x2b10, 0x28a8, 0x264f, 0x2404, 0x21c6, 0x1f95, 0x1d70,
0x1b58, 0x194c, 0x174b, 0x1555, 0x136a, 0x1189, 0x0fb2, 0x0de6,
0x0c22, 0x0a68, 0x08b7, 0x070f, 0x056f, 0x03d8, 0x0249, 0x00c1,
0xfd08, 0xf742, 0xf1b4, 0xec5a, 0xe732, 0xe239, 0xdd6d, 0xd8cc,
0xd454, 0xd002, 0xcbd6, 0xc7cd, 0xc3e5, 0xc01d, 0xbc75, 0xb8e9,
0xb57a, 0xb225, 0xaeeb, 0xabc9, 0xa8be, 0xa5cb, 0xa2ed, 0xa024,
0x9d6f, 0x9ace, 0x983e, 0x95c1, 0x9355, 0x90fa, 0x8eae, 0x8c72,
0x8a45, 0x8825, 0x8614, 0x8410, 0x8219, 0x802e, 0x7c9c, 0x78f5,
0x7565, 0x71eb, 0x6e85, 0x6b31, 0x67f3, 0x64c7, 0x61ae, 0x5ea7,
0x5bb0, 0x58cb, 0x55f6, 0x5330, 0x5079, 0x4dd1, 0x4b38, 0x48ad,
0x462f, 0x43be, 0x4159, 0x3f01, 0x3cb5, 0x3a75, 0x3840, 0x3616
};
uint32_t arg, idx, d1, xhat;
float r;
arg = __float_as_uint32 (a);
idx = (arg >> ((FP32_MANT_BITS + 1) - TAB_INDEX_BITS)) & ((1u << TAB_INDEX_BITS) - 1);
d1 = FP32_ONE_HALF | (d1tab[idx] << ((FP32_MANT_BITS + 1) - TAB_ENTRY_BITS));
xhat = ((arg & FP32_INDEX_MASK) | (((((3 * FP32_EXPO_BIAS) << FP32_MANT_BITS) + ~arg) >> 1) & FP32_XHAT_MASK)) ^ FP32_FLIP_BIT_MASK;
/* compute initial approximation, accurate to about 14 bits */
r = __uint32_as_float (d1) * __uint32_as_float (xhat);
/* refine approximation with one Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#else
#error unsupported RSQRT_VARIANT
#endif // RSQRT_VARIANT
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float types's limited
exponent range.
*/
refi = __double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
double rhypot (double a, double b)
{
return 1.0 / hypot (a, b);
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
#define FP32_QNAN_BIT (0x00400000)
int main (void)
{
float af, bf, resf, reff;
uint32_t ai, bi, resi, refi;
double ref, err, maxerr = 0;
uint64_t diff, diffsum = 0, count = 1ULL << 36;
do {
ai = KISS;
bi = KISS;
af = __uint32_as_float (ai);
bf = __uint32_as_float (bi);
resf = my_rhypotf (af, bf);
ref = rhypot ((double)af, (double)bf);
reff = (float)ref;
refi = __float_as_uint32 (reff);
resi = __float_as_uint32 (resf);
diff = llabs ((long long int)resi - (long long int)refi);
/* If both inputs are a NaN, result can be either argument, converted
to QNaN if necessary. If one input is NaN and the other not infinity
the NaN input must be returned, converted to QNaN if necessary. If
one input is infinity, zero must be returned even if the other input
is a NaN. In all other cases allow up to 1 ulp of difference.
*/
if ((isnan (af) && isnan (bf) && (resi != (ai | FP32_QNAN_BIT)) && (resi != (bi | FP32_QNAN_BIT))) ||
(isnan (af) && !isinf (bf) && !isnan (bf) && (resi != (ai | FP32_QNAN_BIT))) ||
(isnan (bf) && !isinf (af) && !isnan (af) && (resi != (bi | FP32_QNAN_BIT))) ||
(isinf (af) && (resi != 0)) ||
(isinf (bf) && (resi != 0)) ||
(diff > 1)) {
printf ("err # (%08x,%08x): res= %08x (%15.8e) ref=%08x (%15.8e)\n",
ai, bi, resi, resf, refi, reff);
return EXIT_FAILURE;
}
diffsum += diff;
err = floatUlpErr (resf, ref);
if (err > maxerr) {
printf ("ulp=%.8f # (% 15.8e, % 15.8e): res=%15.6a ref=%22.13a\n",
err, af, bf, resf, ref);
maxerr = err;
}
count--;
} while (count);
printf ("diffsum = %llu\n", diffsum);
return EXIT_SUCCESS;
}
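For readers who want to experiment with the refinement step without fmaf(), the following sketch rewrites the BIT_TWIDDLE_3NR idea with plain multiplies. This is a swapped-in formulation, not one of the variants measured above: the function name is hypothetical, the input is assumed to be a positive normal binary32 number, and I have not measured its ulp error, which will generally be worse than the FMA-based versions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical non-FMA variant: bit-twiddled starting guess followed by three
   Newton-Raphson steps r = r * (1.5 - 0.5 * a * r * r).
   Assumes a is a positive, normal IEEE-754 binary32 number. */
float rsqrt_3nr_nofma (float a)
{
    uint32_t bits;
    float r;
    assert (sizeof bits == sizeof a);
    memcpy (&bits, &a, sizeof bits);
    bits = 0x5f375b0d - (bits >> 1);     /* same magic constant as above */
    memcpy (&r, &bits, sizeof r);
    for (int k = 0; k < 3; k++) {
        r = r * (1.5f - 0.5f * a * r * r);   /* each step roughly squares the error */
    }
    return r;
}
```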

Bit twiddle help: Expanding bits to follow a given bitmask

I'm interested in a fast method for "expanding bits," which can be defined as the following:
Let B be a binary number with n bits, i.e. B \in {0,1}^n
Let P be the positions of all 1/true bits in B, i.e. ((B >> p[i]) & 1) == 1 for every p[i] in P, and |P|=k
For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that bit p[j] of Ap equals bit j of A (Ap[p[j]] == A[j]) and all other bits of Ap are zero.
The result of the "bit expansion" is Ap.
A couple examples:
Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
Given B: 1001 1001, A: 1101, then Ap should be 1001 0001
Following is a straightforward algorithm, but I can't shake the feeling that there's a faster/easier way to do this.
unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
int k = popcount(B); // cuda function, but there are good methods for this
unsigned int Ap = 0;
int j = k-1;
// Starting at the most significant bit,
for (int i = n - 1; i >= 0; --i) {
Ap <<= 1;
// if B is 1, add the value at A[j] to Ap, decrement j.
if (B & (1u << i)) { // unsigned constant avoids undefined behavior for i == 31
Ap += (A >> j--) & 1;
}
}
return Ap;
}
The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.
I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.
In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.
The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.
In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0;
int i;
for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
}
return r;
}
The variant without any shift operations alluded to above looks as follows:
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0, m = 1;
while (b) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & m); // spread current bit of 'a' to more significant bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
m = m + m; // mask for next bit of 'a'
}
return r;
}
In comments below, @Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.
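As a quick sanity check, the shift-free variant can be compiled as plain host C (dropping __device__) and run against the two examples from the question. The names pdep_emul and pdep_self_check are hypothetical; the loop body is the same as in the second my_pdep above.

```c
#include <assert.h>

/* Host-side copy of the shift-free PDEP emulation. */
unsigned int pdep_emul (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) {              /* iterate over 1-bits in mask */
        l = b & (0 - b);     /* isolate mask's least significant 1-bit */
        b = b ^ l;           /* clear it from the mask */
        s = 0 - (a & m);     /* all-ones from the current bit of 'a' upward, or zero */
        r = r | (l & s);     /* deposit current bit of 'a' at the mask position */
        m = m + m;           /* select next bit of 'a' */
    }
    return r;
}

/* The two examples given in the question. */
void pdep_self_check (void)
{
    assert (pdep_emul (0x6u, 0x2Eu) == 0x0Cu);   /* B: 0010 1110, A: 0110 */
    assert (pdep_emul (0xDu, 0x99u) == 0x91u);   /* B: 1001 1001, A: 1101 */
}
```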

Manually Converting rgba8 to rgba5551

I need to convert rgba8 to rgba5551 manually. I found some helpful code from another post and want to modify it to convert from rgba8 to rgba5551. I don't really have experience with bitwise stuff and haven't had any luck messing with the code myself.
void* rgba8888_to_rgba4444( void* src, int src_bytes)
{
// compute the actual number of pixel elements in the buffer.
int num_pixels = src_bytes / 4;
unsigned long* psrc = (unsigned long*)src;
unsigned short* pdst = (unsigned short*)src;
// convert every pixel
for(int i = 0; i < num_pixels; i++){
// read a source pixel
unsigned px = psrc[i];
// unpack the source data as 8 bit values
unsigned r = (px << 8) & 0xf000;
unsigned g = (px >> 4) & 0x0f00;
unsigned b = (px >> 16) & 0x00f0;
unsigned a = (px >> 28) & 0x000f;
// and store
pdst[i] = r | g | b | a;
}
return pdst;
}
The value of RGBA5551 is that it has color info condensed into 16 bits - or two bytes, with only one bit for the alpha channel (on or off). RGBA8888, on the other hand, uses a byte for each channel. (If you don't need an alpha channel, I hear RGB565 is better - as humans are more sensitive to green). Now, with 5 bits, you get the numbers 0 through 31, so r, g, and b each need to be converted to some number between 0 and 31, and since they are originally a byte each (0-255), we multiply each by 31/255. Here is a function that takes RGBA bytes as input and outputs RGBA5551 as a short:
short int RGBA8888_to_RGBA5551(unsigned char r, unsigned char g, unsigned char b, unsigned char a){
unsigned char r5 = r*31/255; // All arithmetic is integer arithmetic, so fractional parts are truncated. If you want to round to the nearest integer, adjust this code accordingly.
unsigned char g5 = g*31/255;
unsigned char b5 = b*31/255;
unsigned char a1 = (a > 0) ? 1 : 0; // 1 if a is positive, 0 else. You must decide what is sensible.
// Now that we have our 5 bit r, g, and b and our 1 bit a, we need to shift them into place before combining.
short int rShift = (short int)r5 << 11; // (short int)r5 looks like 00000000000vwxyz - 11 zeroes. I'm not sure if you need (short int), but I've wasted time tracking down bugs where I didn't typecast properly before shifting.
short int gShift = (short int)g5 << 6;
short int bShift = (short int)b5 << 1;
// Combine and return
return rShift | gShift | bShift | a1;
}
You can, of course, condense this code.
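One possible condensed form is sketched below. It differs from the function above in a few ways that are my own assumptions: it uses an unsigned 16-bit result (so the top red bit cannot end up as a sign bit), it rounds to nearest instead of truncating, and it converts a whole buffer whose bytes are laid out in memory as R, G, B, A. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one pixel as RRRRRGGG GGBBBBBA. Channels are rounded to nearest;
   alpha is 1 for any non-zero input, as in the function above. */
static uint16_t pack_rgba5551 (uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    uint16_t r5 = (uint16_t)((r * 31 + 127) / 255);
    uint16_t g5 = (uint16_t)((g * 31 + 127) / 255);
    uint16_t b5 = (uint16_t)((b * 31 + 127) / 255);
    uint16_t a1 = (uint16_t)(a > 0);
    return (uint16_t)((r5 << 11) | (g5 << 6) | (b5 << 1) | a1);
}

/* Convert num_pixels pixels; src holds 4 bytes per pixel in R,G,B,A order
   (assumed layout), dst receives one 16-bit value per pixel. */
void rgba8888_to_rgba5551 (const uint8_t *src, uint16_t *dst, size_t num_pixels)
{
    assert (src != NULL && dst != NULL);
    for (size_t i = 0; i < num_pixels; i++) {
        const uint8_t *p = src + 4 * i;
        dst[i] = pack_rgba5551 (p[0], p[1], p[2], p[3]);
    }
}
```

Reading the source as individual bytes sidesteps questions about endianness and about how wide unsigned long is on a given platform.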

Fast Modulo 511 and 127

Is there a way to compute modulo 511 (and 127) faster than with the "%" operator?
int c = 758 % 511;
int d = 423 % 127;
Here is a way to do fast modulo by 511 assuming that x is non-negative and at most 32767. It's about twice as fast as x%511. It does the modulo in five operations: two multiplications, one addition, one subtraction, and one shift.
static inline int fast_mod_511(int x) {
int y = (513*x+64)>>18;
return x - 511*y;
}
Here is the theory behind this. The code I used for testing is at the end.
Let's consider
y = x/511 = x/(512-1) = x/512 * 1/(1-1/512).
Let's define z = 512, then
y = x/z * 1/(1-1/z).
Expanding 1/(1-1/z) as a geometric series gives
y = x/z * (1 + 1/z + 1/z^2 + 1/z^3 + ...).
Now if we know that x has a limited range we can truncate the expansion. Let's assume x is always less than 2^15=32768. Keeping only the first two terms, y ≈ x/z * (1 + 1/z), so we can write
512*512*y ≈ (1+512)*x = 513*x.
After looking at the digits which are significant we arrive at the following, where adding 64 before the shift compensates for the dropped terms (without it the quotient would be one too small whenever x is an exact multiple of 511):
y = (513*x+64)>>18 //512^2 = 2^18.
We can divide x/511 (assuming x is less than 32768) in three steps:
multiply,
add,
shift.
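Since the valid input range is small, the claim can be checked exhaustively. The sketch below is an addition for illustration (count_mismatches_511 is a hypothetical helper); it compares the multiply/add/shift result against the % operator for every x from 0 to 32767.

```c
#include <assert.h>

/* Fast x % 511 for 0 <= x <= 32767, as derived above. */
static int fast_mod_511 (int x)
{
    assert (x >= 0 && x <= 32767);
    int y = (513 * x + 64) >> 18;   /* y == x / 511 within the stated range */
    return x - 511 * y;
}

/* Count inputs in [0, 32767] for which fast_mod_511 disagrees with %. */
int count_mismatches_511 (void)
{
    int bad = 0;
    for (int x = 0; x <= 32767; x++) {
        if (fast_mod_511 (x) != x % 511) {
            bad++;
        }
    }
    return bad;
}
```

A return value of zero from count_mismatches_511() confirms the formula over the whole stated range.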
Here is the code I used to profile this in MSVC2013 64-bit release mode on an Ivy Bridge core.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
static inline int fast_mod_511(int x) {
int y = (513*x+64)>>18;
return x - 511*y;
}
int main() {
unsigned int i, x;
volatile unsigned int r;
double dtime;
dtime = omp_get_wtime();
for(i=0; i<100000; i++) {
for(int j=0; j<32768; j++) {
r = j%511;
}
}
dtime =omp_get_wtime() - dtime;
printf("time %f\n", dtime);
dtime = omp_get_wtime();
for(i=0; i<100000; i++) {
for(int j=0; j<32768; j++) {
r = fast_mod_511(j);
}
}
dtime =omp_get_wtime() - dtime;
printf("time %f\n", dtime);
}
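The same construction carries over to 127 = 128 - 1. The constants below are not from the answer above; I derived them by analogy (z = 128, so z^2 = 2^14 and z + 1 = 129, with 128 as the rounding constant), and the range limit of 16382 is an assumption that the exhaustive check in the sketch is meant to confirm. Function names are hypothetical.

```c
#include <assert.h>

/* Fast x % 127 for 0 <= x <= 16382 (constants derived by analogy with the
   511 case; outside this range the result is not valid). */
static int fast_mod_127 (int x)
{
    assert (x >= 0 && x <= 16382);
    int y = (129 * x + 128) >> 14;   /* 128^2 = 2^14 */
    return x - 127 * y;
}

/* Count inputs in [0, 16382] for which fast_mod_127 disagrees with %. */
int count_mismatches_127 (void)
{
    int bad = 0;
    for (int x = 0; x <= 16382; x++) {
        if (fast_mod_127 (x) != x % 127) {
            bad++;
        }
    }
    return bad;
}
```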
You can use a lookup table with the solutions pre-stored. If you create an array of a million integers, a lookup is about twice as fast as actually doing the modulo in my C# app.
// fill an array
var mod511 = new int[1000000];
for (int x = 0; x < 1000000; x++) mod511[x] = x % 511;
and instead of using
c = 758 % 511;
you use
c = mod511[758];
This will cost you (possibly a lot of) memory, and will obviously not work if you want to use it for very large numbers also. But it is faster.
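For comparison with the C answers, the same lookup-table idea might look like this in C. This is only a sketch: the table size of 32768 entries and the function names are my own choices, and whether it beats the arithmetic version will depend on cache behavior, which I have not measured.

```c
#include <assert.h>
#include <stdint.h>

#define MOD511_TABLE_SIZE 32768u   /* hypothetical limit on the input range */

static uint16_t mod511_table[MOD511_TABLE_SIZE];

/* Fill the table once, before the first lookup. */
void mod511_table_init (void)
{
    for (uint32_t x = 0; x < MOD511_TABLE_SIZE; x++) {
        mod511_table[x] = (uint16_t)(x % 511u);
    }
}

/* x % 511 by table lookup; valid only for x < MOD511_TABLE_SIZE. */
unsigned int mod511_lookup (uint32_t x)
{
    assert (x < MOD511_TABLE_SIZE);
    return mod511_table[x];
}
```

Storing the remainders as 16-bit values keeps the table at 64 KiB, which has a better chance of staying in cache than a million 32-bit integers.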
If you have to repeat those two modulus operations on a large number of data and your CPU supports SIMD (for example Intel's SSE/AVX/AVX2) then you can vectorize the operations, i.e., do the operations on many data in parallel. You can do this by using intrinsics or inline assembly. Yes the solution will be platform specific but maybe that is fine...
