Efficiently Get Random Numbers in Range on GPU

Given a uniformly distributed random number generator in the range [0, 2^64), is there any efficient way (on a GPU) to build a random number generator for the range [0, k) for some k < 2^64?
Some solutions that don't work:
// not uniformly distributed in [0, k)
myRand(rng, k) = rng() % k;
// way too much branching to run efficiently on a gpu
myRand(rng, k) =
    uint64_t ret;
    while ((ret = rng() & (nextPow2(k) - 1)) >= k);
    return ret;
// only 53 bits of random data, not 64. Also I
// have no idea how to reason about how "uniform"
// this distribution is.
myRand(doubleRng, k) =
    double r = doubleRng(); // generates a random number in [0, 1)
    return (uint64_t)floor(r * k);
I'd be willing to compromise non-uniformity if the difference is sufficiently small (say, within 1/2^64).

There are only two options: do the modulus (or the floating point) and settle for non-uniformity, or do rejection sampling with a loop. There really isn't a third option. Which one is better depends on your application.
If your k is typically very small (say, you're shuffling cards, so k is on the order of 100), then the non-uniformity is so small that it's probably OK, even at 32 bits. At 64 bits, a k on the order of millions will still give you a vanishingly small non-uniformity. No, it won't be on the order of 1/2^64, but I can't imagine a real-world application where a non-uniformity on the order of 1/2^20 is noticeable. When I wrote the test suite for my RNG library, I deliberately ran it against a known-bad mod implementation, and it had a really hard time detecting the error even at 32 bits.
If you really have to be perfectly uniform, then you're just going to have to sample and reject. This can be done pretty fast, and you can even get rid of the division (calculate that nextPow2() outside the rejection loop--that's how I do it in ojrandlib). FYI, the fastest way to do the next-power-of-two mask is this:
mask = k - 1;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;
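Putting it together, a minimal sketch of the whole rejection sampler with the mask hoisted out of the loop (rng() here stands for your uniform 64-bit generator):
extern uint64_t rng(void);  /* your uniform 64-bit generator */

uint64_t randRange(uint64_t k)  /* uniform in [0, k), k >= 1 */
{
    uint64_t mask = k - 1;
    mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
    mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;
    uint64_t r;
    do {
        r = rng() & mask;  /* no division or modulus in the loop */
    } while (r >= k);      /* rejects fewer than half the draws */
    return r;
}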

If you have a function that returns 53 bits of random data, but you need 64, call it twice, use the bottom 32 bits of the first call for the top 32 bits of your result, and the bottom 32 bits of the second call for the bottom 32 bits of your result. If your original function was uniform, this one is too.
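A sketch of that construction, where doubleRng53() is a hypothetical name for a call returning 53 uniform random bits in a uint64_t:
uint64_t rand64(void)
{
    uint64_t hi = doubleRng53() & 0xFFFFFFFFULL;  // bottom 32 bits of the first call
    uint64_t lo = doubleRng53() & 0xFFFFFFFFULL;  // bottom 32 bits of the second call
    return (hi << 32) | lo;                       // 64 uniform bits
}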

Related

Fast bit permutation

I need to store and apply permutations to 16-bit integers. The best solution I came up with is to store the permutation as a 64-bit integer where each 4-bit nibble holds the new position of the i-th bit. The application would look like:
uint16_t permute(uint16_t bits, uint64_t perm)
{
    uint16_t result = 0;
    for (int i = 0; i < 16; ++i)
        result |= ((bits >> i) & 1) << ((perm >> (i * 4)) & 0xF);
    return result;
}
Is there a faster way to do this? Thank you.
There are alternatives.
Any permutation can be handled by a Beneš network, and encoded as the masks that are the inputs to the multiplexers to apply the shuffle. This can be done reasonably efficiently in software too (not great but OK), it's just a bunch of butterfly permutations. The masks are a bit tricky to compute, but probably faster to apply than moving every bit on its own, though that depends on how many bits you're dealing with and 16 is not a lot.
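The building block of such a network is the butterfly stage, often written as a delta swap; a sketch (the seven stage masks for a 16-bit Beneš network have to be precomputed from the permutation):
// Swap the bit pairs selected by mask m, lying d positions apart.
static inline uint16_t delta_swap(uint16_t x, uint16_t m, unsigned d)
{
    uint16_t t = ((x >> d) ^ x) & m;
    return x ^ t ^ (uint16_t)(t << d);
}
// A full 16-bit Beneš network is 7 such stages with d = 8, 4, 2, 1, 2, 4, 8.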
Some smaller categories of shuffles can be handled by simpler (faster) networks, described alongside the Beneš network.
Finally, in practice on modern x86 hardware, there is the highly versatile pshufb instruction, which can apply a byte permutation (possibly including duplicates and zeroes) to 16 bytes in (typically) a single cycle. It is slightly awkward to distribute the bits over the bytes, but once you're there it only takes a pshufb to permute and a pmovmskb to compress the result back down to 16 bits.
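A sketch with SSE intrinsics, assuming SSSE3 is available; note the encoding here is the inverse of the question's (perm[i] holds the source index of result bit i):
#include <immintrin.h>
#include <stdint.h>

uint16_t permute16_pshufb(uint16_t bits, const uint8_t perm[16])
{
    // Distribute: byte i becomes 0xFF if bit i of 'bits' is set, else 0x00.
    __m128i v    = _mm_set1_epi16((short)bits);
    __m128i half = _mm_setr_epi8(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1);
    __m128i bit  = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                 1, 2, 4, 8, 16, 32, 64, -128);
    v = _mm_shuffle_epi8(v, half);                   // byte i = low or high byte
    v = _mm_cmpeq_epi8(_mm_and_si128(v, bit), bit);  // test one bit per byte
    // Permute the bytes, then compress the sign bits back down to 16 bits.
    v = _mm_shuffle_epi8(v, _mm_loadu_si128((const __m128i *)perm));
    return (uint16_t)_mm_movemask_epi8(v);
}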

Generate random numbers without repetition (or vanishing probability of repetition) without storing full list of past generated numbers?

I need to generate random numbers in a very large range (128-bit integers), and I will generate many, many of them. I'll generate so many, in fact, that I cannot fit a list of the generated numbers into memory.
I also have the requirement that the generated numbers do not repeat, or at least that the probability of repetition is vanishingly small.
Is there an algorithm that does this?
Build a 128 bit linear congruential generator or linear feedback shift register generator. With properly chosen coefficients either of those will achieve full cycle, meaning no repeats until you've exhausted all outcomes.
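For instance, a minimal sketch of a full-period 128-bit LCG, assuming a compiler with unsigned __int128 (GCC and Clang have it); the multiplier is borrowed from PCG and, together with an odd increment, satisfies the Hull-Dobell full-period conditions:
#include <stdint.h>

static unsigned __int128 state = 1;

unsigned __int128 lcg128(void)
{
    const unsigned __int128 a =
        ((unsigned __int128)0x2360ed051fc65da4ULL << 64) | 0x4385df649fccf645ULL;
    const unsigned __int128 c = 1442695040888963407ULL;  // any odd constant works
    state = state * a + c;  // wraps mod 2^128: no repeats before 2^128 draws
    return state;
}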
Any full-period PRNG with a 128-bit state will do what you need in principle. Unfortunately many of these generators tend to produce only 32 or 64 bits per iteration while the rest of the state goes through a predictable permutation (LFSRs being the worst case, producing only 1 bit per iteration). Each 128-bit state is unique, but many of its bits would show a trivial relation to the previous state.
This can be overcome with tempering -- taking your questionable-quality PRNG state with a known-good period, and permuting it through a 1:1 transform to hide the not-so-random factors.
For example, borrowing from the example xorshift+ shown on Wikipedia:
static uint64_t s[2] = { 1, 0 };
void random128(uint64_t result[]) {
    uint64_t x = s[0];
    uint64_t y = s[1];
    x ^= x << 23;
    x ^= y ^ (x >> 17) ^ (y >> 26);
    s[0] = y;
    s[1] = x;
At this point we know that s[0] is just the old value of s[1], which would make a terrible PRNG if all 128 bits were exposed (normally only s[1] is exposed). To overcome this we permute the result to disguise that relationship (following the same principle as a Feistel network to ensure that the transform is 1:1):
    y += x * 1630144151483159999ULL;
    x ^= y >> 3;
    result[0] = x;
    result[1] = y;
}
This seems to be sufficient to pass diehard. So long as the original generator has full(ish) period, the whole generator should be full period too.
The logical conclusion to tempering a low-quality generator is to use AES-128 in counter mode. Simply run a counter from 0 to 2**128-1 (an extremely low-quality generator), and encrypt each value using AES-128 and a consistent key (an ideal temper) for your final output.
If you do this, don't get distracted by full cryptographic RNG requirements. Those involve re-seeding, and consequently can produce the same number more than once (which is more random, but is exactly what you want to avoid).
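A sketch using OpenSSL's EVP interface (my choice of library here is an assumption; any AES implementation works). Because AES under a fixed key is a bijection on 128-bit blocks, distinct counter values can never collide:
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

// Setup (once): ctx = EVP_CIPHER_CTX_new();
//               EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, key, NULL);
void next128(EVP_CIPHER_CTX *ctx, uint64_t ctr[2], uint8_t out[16])
{
    uint8_t in[16];
    int len;
    memcpy(in, ctr, 16);
    EVP_EncryptUpdate(ctx, out, &len, in, 16);  // encrypt one counter block
    if (++ctr[0] == 0) ++ctr[1];                // 128-bit increment
}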

What is the fastest way to perform hardware division of an integer by a fixed constant?

I have a 16 bit number which I want to divide by 100. Let's say it's 50000. The goal is to obtain 500. However, I am trying to avoid inferred dividers on my FPGA because they break timing requirements. The result does not have to be accurate; an approximation will do.
I have tried hardware multiplication by 0.01 but real numbers are not supported. I'm looking at pipelined dividers now but I hope it does not come to that.
Conceptually: Multiply by 655 (= 65536/100) and then shift right by 16 bits. Of course, in hardware, the shift right is free.
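In C, the same idea looks like this (a sketch; 655/65536 is slightly below 1/100, so e.g. 50000 comes out as 499 rather than 500):
uint16_t divideBy100approx(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 655u) >> 16);  // about 0.055% low
}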
If you need it to be even faster, you can hardwire the divide as a sum of divisions by powers of two (shifts). E.g.,
1/100 ~= 1/128 = 0.0078125
1/100 ~= 1/128 + 1/256 = 0.01171875
1/100 ~= 1/128 + 1/512 = 0.009765625
1/100 ~= 1/128 + 1/512 + 1/2048 = 0.01025390625
1/100 ~= 1/128 + 1/512 + 1/4096 = 0.010009765625
etc.
In C code the last example above would be:
uint16_t divideBy100 (uint16_t input)
{
    return (input >> 7) + (input >> 9) + (input >> 12);
}
Assuming that (1) the integer division is intended to truncate, not round (e.g. 599 / 100 = 5), and (2) it's OK to have a 16x16 multiplier in the FPGA (with a fixed value on one input),
then you can get exact values by implementing a 16x16 unsigned multiplier where one input is 0xA3D7 and the other input is your 16-bit number. Add 0x8000 to the 32-bit product, and your result is in the upper 10 bits.
In C code, the algorithm looks like this
uint16_t divideBy100( uint16_t input )
{
    uint32_t temp;
    temp = input;
    temp *= 0xA3D7;  // compute the 32-bit product of two 16-bit unsigned numbers
    temp += 0x8000;  // adjust the 32-bit product since 0xA3D7 is actually a little low
    temp >>= 22;     // the upper 10 bits are the answer
    return( (uint16_t)temp );
}
Generally, you can multiply by the inverse and shift. Compilers do this all the time, even for software.
Here is a page that does that for you: http://www.hackersdelight.org/magic.htm
In your case that seems to be multiplication by 0x431BDE83, followed by a right-shift of 17.
And here is an explanation: Computing the Multiplicative Inverse for Optimizing Integer Division
Multiplying by the reciprocal is often a good approach; as you have noted, though, real numbers are not supported, so you need to work with fixed point rather than floating-point reals.
Verilog has no built-in notion of fixed point: you just pick a word length and decide how many bits are integer and how many are fractional.
0.01 (0.0098876953125) in binary is 0_0000001010001. The wider the word, the greater the precision.
// 1 integer bit, 13 fractional bits
wire [13:0] ONE_HUNDREDTH = 14'b0_0000001010001;

input      [15:0] a;          // integer (no fractional bits)
output reg [29:0] result;     // 13 fractional bits inherited from ONE_HUNDREDTH
output reg [15:0] result_int; // integer result

always @* begin
    result     = ONE_HUNDREDTH * a;
    result_int = result >>> 13;
end
Real to binary conversion done using the ruby gem fixed_point.
A ruby irb session (with fixed_point installed via gem install fixed_point):
require 'fixed_point'
#Unsigned, 1 Integer bit, 13 fractional bits
format = FixedPoint::Format.new(0, 1, 13)
fix_num = FixedPoint::Number.new(0.01, format )
=> 0.0098876953125
fix_num.to_b
=> "0.0000001010001"

hashing a small number to a random looking 64 bit integer

I am looking for a hash-function which operates on a small integer (say in the range 0...1000) and outputs a 64 bit int.
The result-set should look like a random distribution of 64 bit ints: a uniform distribution with no linear correlation between the results.
I was hoping for a function that only takes a few CPU-cycles to execute. (the code will be in C++).
I considered multiplying the input by a big prime number and taking the result modulo 2**64 (something like a linear congruential generator), but there are obvious dependencies between the outputs (in the lower bits).
Googling did not show up anything, but I am probably using wrong search terms.
Does such a function exist?
Some Background-info:
I want to avoid using a big persistent table with pseudo random numbers in an algorithm, and calculate random-looking numbers on the fly.
Security is not an issue.
I tested the 64-bit finalizer of MurmurHash3 (suggested by @aix and this SO post). It returns zero when the input is zero, so I increment the input parameter by 1 first:
typedef unsigned long long uint64;

inline uint64 fasthash(uint64 i)
{
    i += 1ULL;
    i ^= i >> 33ULL;
    i *= 0xff51afd7ed558ccdULL;
    i ^= i >> 33ULL;
    i *= 0xc4ceb9fe1a85ec53ULL;
    i ^= i >> 33ULL;
    return i;
}
Here the input argument i is a small integer, for example an element of {0, 1, ..., 1000}. The output looks random:
i    fasthash(i) decimal:      fasthash(i) hex:
0    12994781566227106604      0xB456BCFC34C2CB2C
1     4233148493373801447      0x3ABF2A20650683E7
2      815575690806614222      0x0B5181C509F8D8CE
3     5156626420896634997      0x47900468A8F01875
...  ...                       ...
There is no linear correlation between subsequent elements of the series. (The original post shows a scatter plot of consecutive outputs; the range of both axes is 0..2^64-1.)
Why not use an existing hash function, such as MurmurHash3 with a 64-bit finalizer? According to the author, the function takes tens of CPU cycles per key on current Intel hardware.
Given: input i in the range of 0 to 1,000.
const MaxInt, which is the maximum value that can be contained in a 64 bit int (you did not say if it is signed or unsigned; 2^64 = 18446744073709551616),
and a function rand() that returns a value between 0 and 1 (most languages have such a function)
compute hashvalue = i * rand() * ( MaxInt / 1000 )
1,000 * 1,000 = 1,000,000. That fits well within an Int32.
Subtract the low bound of your range from the number.
Square it, and use it as a direct subscript into some sort of bitmap.

What's the fastest algorithm to divide an integer by 3 without using a division instruction? [duplicate]

int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.
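Expressed back in C, the sequence the compiler generates is roughly this (a sketch; the exact instructions vary by target):
#include <stdint.h>

int div3(int n)
{
    int q = (int)(((int64_t)n * 0x55555556LL) >> 32);  // high half of the product
    return q - (n >> 31);  // (n >> 31) is -1 for negative n: rounds toward zero
}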
This is the fastest approach, as the compiler will optimize it if it can, depending on the target processor:
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the range of the values. For example, if you are dividing a signed integer by 3 and you know the value to be divided is in the range 0 to 768, then you can multiply by a scaled reciprocal and shift right: pick a power of two, divide it by 3 to get your multiplier, multiply, then shift right by that power.
eg.
Range 0 -> 768
You could use a shift of 10 bits, which is multiplying by 1024; you want to divide by 3, so your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10.
(Make sure the shift is a signed shift if using signed integers, and make sure it is an actual shift and not a bit rotate.)
This will effectively divide the value by 3 (approximately: 341/1024 is slightly below 1/3, so the result can come out one less than true truncated division for some inputs, e.g. (765 * 341) >> 10 = 254 rather than 255; for averaging purposes that is usually fine), and it runs at about 1.6 times the speed of a native divide by 3 on a standard x86 / x64 CPU.
Of course, the only reason you can make this optimization when the compiler can't is that the compiler does not know the maximum range of x and therefore cannot make this determination, but you as the programmer can.
Sometimes it may even be beneficial to widen the value first and do the same thing; i.e., if you have an int using its full range, you could promote it to a 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing: I needed to find the average of three color channels, red, green and blue, each with a byte range (0 - 255).
At first i just simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte 0 - 255)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds. It's amazing what can be done with a little ingenuity.
This speed up occurred in C# even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C, int has no known size (it can vary by platform and compiler!), so you are better off using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that your C compiler knows about 64-bit integer numbers (NOTE: even on a 32-bit CPU architecture, most C compilers can handle 64-bit integers just fine):
static inline uint32_t divby3(uint32_t divideMe)
{
    return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, the method above does indeed divide by 3. All it needs to do so is a single 64-bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64-bit application this code will be a lot faster than in a 32-bit application (in a 32-bit application, multiplying two 64-bit numbers takes 3 multiplications and 3 additions on 32-bit values); however, it might still be faster than a division on a 32-bit machine.
On the other hand, if your compiler is a very good one and knows the trick of optimizing integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely on or expect them to use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
The problem is that this only works for constant divisors, not variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases); both are different depending on the number you want to divide by, and both take too much CPU time to calculate on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these at compile time (where a second more or less of compile time hardly plays a role).
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
    return x * 12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example, if you run it on 11, it returns 6148914691236517209. This looks like garbage, but it's in fact the correct answer: multiply it by 3 (mod 2^64) and you get back 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64-bit unsigned arithmetic is arithmetic modulo 2^64.
This means that for each integer which is coprime with the 2^64 modulus (essentially all odd numbers) there exists a multiplicative inverse, which you can multiply with instead of dividing. This magic number can be obtained by solving the equation 3*x + 2^64*y = 1 using the Extended Euclidean Algorithm.
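As a sketch, the inverse can also be computed without the Extended Euclidean Algorithm, by Newton-Raphson iteration; each step doubles the number of correct low bits, and for odd d the seed x = d is already correct to 3 bits (d*d ≡ 1 mod 8):
#include <stdint.h>

uint64_t modinv_pow2_64(uint64_t d)  /* d must be odd */
{
    uint64_t x = d;      /* 3 correct bits */
    for (int i = 0; i < 5; ++i)
        x *= 2 - d * x;  /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits, all mod 2^64 */
    return x;            /* modinv_pow2_64(3) == 12297829382473034411ULL */
}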
What if you really don't want to multiply or divide? Here is an approximation I just invented. It works because x/3 = x/4 + x/12. But since x/12 = (x/4) / 3, we just have to repeat the process until it's good enough.
#include <stdio.h>

int main(void)
{
    int n = 1000;
    int a, b;
    a = n >> 2;
    b = (a >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    printf("a=%d\n", a);
    return 0;
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster, but if you want to use a bitwise operator to perform binary division you can use the shift-and-subtract method described at this page (a C transcription follows the steps below):
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
If that portion of the dividend above the divisor is greater than or equal to the divisor:
Then subtract divisor from that portion of the dividend and
Concatenate 1 to the right hand end of the quotient
Else concatenate 0 to the right hand end of the quotient
Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
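A direct C transcription of those steps (a sketch of restoring division; it works for any divisor, not just 3):
#include <stdint.h>

uint32_t div_shift_sub(uint32_t dividend, uint32_t divisor)
{
    uint32_t quotient = 0;
    int shift = 0;
    // Align the divisor's leftmost digit with the dividend's.
    while ((divisor << shift) < dividend && !((divisor << shift) & 0x80000000u))
        ++shift;
    for (; shift >= 0; --shift) {
        quotient <<= 1;
        if (dividend >= (divisor << shift)) {  // divisor fits: subtract and
            dividend -= divisor << shift;
            quotient |= 1;                     // concatenate 1, else 0
        }
    }
    return quotient;  // 'dividend' now holds the remainder
}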
For really large integer division (e.g. numbers bigger than 64 bits) you can represent your number as an int[] of decimal digits and perform division quite fast by taking two digits at a time (the carried remainder and the next digit) and dividing them by 3. The remainder becomes part of the next two digits, and so forth.
e.g. for 11004 / 3 you compute:
11/3 = 3, remainder = 2 (from 11 - 3*3)
20/3 = 6, remainder = 2 (from 20 - 6*3)
20/3 = 6, remainder = 2 (from 20 - 6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
    int remainder = 0;
    var res = new List<int>();
    for (int i = 0; i < a.Length; i++)
    {
        var val = remainder + a[i];
        var div = val / 3;
        remainder = 10 * (val % 3);
        if (div > 9)
        {
            res.Add(div / 10);
            res.Add(div % 10);
        }
        else
            res.Add(div);
    }
    if (res[0] == 0) res.RemoveAt(0);
    return res;
}
If you really want to dig deeper, see this article on integer division, but it only has academic merit; it would be interesting to see an application that actually benefited from that kind of trick.
Easy computation, using at most n iterations where n is your number of bits: sum the alternating series x/2 - x/4 + x/8 - ..., which converges to x/3 (because of the truncation at each shift, the result can be off by one from exact division):
uint8_t divideby3(uint8_t x)
{
    int v = x;
    int answer = 0;
    int sign = 1;
    while (v)
    {
        v >>= 1;              /* halve: the next term of the series */
        answer += sign * v;   /* + x/2, - x/4, + x/8, ... */
        sign = -sign;
    }
    return (uint8_t)answer;
}
A lookup table approach would also be faster on some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
    static const uint8_t ai8Div3[256] = { 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, /* ... */ };
    return ai8Div3[u8Operand];
}
