Algorithm for bit expansion/duplication? - algorithm

Is there an efficient (fast) algorithm that will perform bit expansion/duplication?
For example, expand each bit in an 8bit value by 3 (creating a 24bit value):
1101 0101 => 11111100 01110001 11000111
The brute force method that has been proposed is to create a lookup table. In the future, the expansion value may need to be variable. That is, in the above example we are expanding by 3 but may need to expand by some other value(s). This would require multiple lookup tables that I'd like to avoid if possible.

There is a chance to make it quicker than lookup table if arithmetic calculations are for some reason faster than memory access. This may be possible if calculations are vectorized (PPC AltiVec or Intel SSE) and/or if other parts of the program need to use every bit of cache memory.
If expansion factor = 3, only 7 instructions are needed:
out = (((in * 0x101 & 0x0F00F) * 0x11 & 0x0C30C3) * 5 & 0x249249) * 7;
Or other alternative, with 10 instructions:
out = (in | in << 8) & 0x0F00F;
out = (out | out << 4) & 0x0C30C3;
out = (out | out << 2) & 0x249249;
out *= 7;
For other expansion factors >= 3:
unsigned mask = 0x0FF;
unsigned out = in;
for (scale = 4; scale != 0; scale /= 2)
{
shift = scale * (N - 1);
mask &= ~(mask << scale);
mask |= mask << (scale * N);
out = out * ((1 << shift) + 1) & mask;
}
out *= (1 << N) - 1;
Or other alternative, for expansion factors >= 2:
unsigned mask = 0x0FF;
unsigned out = in;
for (scale = 4; scale != 0; scale /= 2)
{
shift = scale * (N - 1);
mask &= ~(mask << scale);
mask |= mask << (scale * N);
out = (out | out << shift) & mask;
}
out *= (1 << N) - 1;
shift and mask values are better to be calculated prior to bit stream processing.

You can do it one input bit at at time. Of course, it will be slower than a lookup table, but if you're doing something like writing for a tiny, 8-bit microcontroller without enough room for a table, it should have the smallest possible ROM footprint.

Related

Fast random/mutation algorithms (vector to vector) [duplicate]

I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
class GradientNoise {
std::uint64_t m_seed;
std::uniform_int_distribution<std::uint8_t> distribution;
const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0), glm::vec2(1.0, -1.0),
glm::vec2(-1.0, -1.0)};
public:
GradientNoise(uint64_t seed) {
m_seed = seed;
distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
}
// 0 -> 1
// just passes the value through, origionally was perlin noise activation
double nonLinearActivationFunction(double value) {
//return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
return value;
}
// 0 -> 1
//cosine interpolation
double interpolate(double a, double b, double t) {
double mu2 = (1 - cos(t * M_PI)) / 2;
return (a * (1 - mu2) + b * mu2);
}
double noise(double x, double y) {
std::mt19937_64 rng;
//first get the bottom left corner associated
// with these coordinates
int corner_x = std::floor(x);
int corner_y = std::floor(y);
// then get the respective distance from that corner
double dist_x = x - corner_x;
double dist_y = y - corner_y;
double corner_0_contrib; // bottom left
double corner_1_contrib; // top left
double corner_2_contrib; // top right
double corner_3_contrib; // bottom right
std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
// each xy pair turns into distance vector from respective corner, corner zero is our starting corner (bottom
// left)
rng.seed(s1);
corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
rng.seed(s2);
corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
rng.seed(s3);
corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
rng.seed(s4);
corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
double u = nonLinearActivationFunction(dist_x);
double v = nonLinearActivationFunction(dist_y);
double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
double total_xy = interpolate(x_bottom, x_top, v);
return total_xy;
}
};
I then generate an OpenGL texture to display with like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width*height * 4];
double octaves[5] = {2,4,8,16,32};
for( int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
double d_noise = 0;
d_noise += temp_1.noise(j/octaves[0], i/octaves[0]);
d_noise += temp_1.noise(j/octaves[1], i/octaves[1]);
d_noise += temp_1.noise(j/octaves[2], i/octaves[2]);
d_noise += temp_1.noise(j/octaves[3], i/octaves[3]);
d_noise += temp_1.noise(j/octaves[4], i/octaves[4]);
d_noise/=5;
uint8_t noise = static_cast<uint8_t>(((d_noise * 128.0) + 128.0));
temp_texture[j*4 + (i * width * 4) + 0] = (noise);
temp_texture[j*4 + (i * width * 4) + 1] = (noise);
temp_texture[j*4 + (i * width * 4) + 2] = (noise);
temp_texture[j*4 + (i * width * 4) + 3] = (255);
}
}
Which give good results:
But gprof is telling me that the Mersenne twister is taking up 62.4% of my time and growing with larger textures. Nothing else individual takes any where near as much time. While the Mersenne twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required for this to make sure that the same x and y generates the same gradient at each integer point (so you need either a hash function or seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
LCG (one time, then running 5 times before using)
Xorshiftplus
After one iteration
After 10,000 iterations.
I've tried:
Running the generator several times before utilizing output, this results in slow execution or simply different artifacts.
Using the output of two consecutive runs after initial seed to seed the PRNG again and use the value after wards. No difference in result.
What is happening? What can i do to get faster results that are of the same quality as the mersenne twister?
OK BIG UPDATE:
I don't know why this works, I know it has something to do with the prime number utilized, but after messing around a bit, it appears that the following works:
Step 1, incorporate the x and y values as seeds separately (and incorporate some other offset value or additional seed value with them, this number should be a prime/non trivial factor)
Step 2, Use those two seed results into seeding the generator again back into the function (so like geza said, the seeds made were bad)
Step 3, when getting the result, instead of using modulo number of items (4) trying to get, or & 3, modulo the result by a prime number first then apply & 3. I'm not sure if the prime being a mersenne prime matters or not.
Here is the result with prime = 257 and xorshiftplus being used! (note I used 2048 by 2048 for this one, the others were 256 by 256)
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad, because it needs good seeding. And providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's a result of the first hash of that page, it looks OK to me, and it is fast (I think it is much faster than Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>
unsigned int hash(unsigned int a) {
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
unsigned int ivalue(int x, int y) {
return hash(y<<16|x)&0xff;
}
float smooth(float x) {
return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}
float value(float x, float y) {
int ix = floor(x);
int iy = floor(y);
float fx = smooth(x-ix);
float fy = smooth(y-iy);
int v00 = ivalue(iy+0, ix+0);
int v01 = ivalue(iy+0, ix+1);
int v10 = ivalue(iy+1, ix+0);
int v11 = ivalue(iy+1, ix+1);
float v0 = v00*(1-fx) + v01*fx;
float v1 = v10*(1-fx) + v11*fx;
return v0*(1-fy) + v1*fy;
}
unsigned char pic[1024*1024];
int main() {
for (int y=0; y<1024; y++) {
for (int x=0; x<1024; x++) {
float v = 0;
for (int o=0; o<=9; o++) {
v += value(x/64.0f*(1<<o), y/64.0f*(1<<o))/(1<<o);
}
int r = rint(v*0.5f);
pic[y*1024+x] = r;
}
}
FILE *f = fopen("x.pnm", "wb");
fprintf(f, "P5\n1024 1024\n255\n");
fwrite(pic, 1, 1024*1024, f);
fclose(f);
}
If you want to understand, how a hash function work (or better yet, which properties a good hash have), check out Bob's page, for example this.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except Mersenne Twister, all your tested PRNGs do not seem fit for your purpose. As I have not done further tests myself, I can only suggest to try out and measure further PRNGs.
The randomness of LCGs are known to be sensitive to the choice of their parameters. In particular, the period of a LCG is relative to the m parameter - at most it will be m (your prime factor) & for many values it can be less.
Similarly, the careful parameters selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while other do not. In order to isolate the cause, I would factor out the proc gen stuff & examine the PRNG output directly. An easy way to visualize the data is to build a grey scale image where each pixel value is a (possibly scaled) random value. For image based stuff, I find this to be an easy way to find stuff that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your proc gen output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
Note that your code seeds the PRNG, then generates one pseudorandom number from the PRNG. The reason for the nonrandomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 264 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
What you see is the practical demonstration of quality of PRNG. Mersenne Twister is one of the best PRNGs with good performance, it passes DIEHARD tests. One should know that generating a random numbers is not an easy computational task, so looking for a better performance will inevitably result in poor quality. LCG is known to be simplest and worst PRNG ever designed and it clearly shows two-dimensional correlation as in your picture. The quality of Xorshift generators largely depend on bitness and parameters. They are definitely worse than Mersenne Twister, but some (xorshift128+) may work good enough to pass BigCrush battery of TestU01 tests.
In other words, if you are making an important physical modelling numerical experiment, you better continue to use Mersenne Twister as known to be a good trade-off between speed and quality and it comes in many standard libraries. On a less important case you may try to use xorshift128+ generator. For an ultimate results you need to use cryptographical-quality PRNG (none of mentioned here may be used for cryptographical purposes).

Is this part of a real IFFT process really optimal?

When calculating (I)FFT it is possible to calculate "N*2 real" data points using a ordinary complex (I)FFT of N data points.
Not sure about my terminology here, but this is how I've read it described.
There are several posts about this on stackoverflow already.
This can speed things up a bit when only dealing with such "real" data which is often the case when dealing with for example sound (re-)synthesis.
This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look I'm not even going to try to convince anyone I fully understand this but thanks to previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).
However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.
But the routine itself just looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to plain 2d rotation. I just can't quite wrap my head around it, but it just feels like this could be done with fewer both trigonometric calls and arithmetic operations.
I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.
This particular routine is for use with IFFT, before the bit-reversal stage.
pseudo-version:
INPUT
MAG_A/B = 0 TO 1
PHA_A/B = 0 TO 2PI
INDEX = 0 TO PI/2
r = MAG_A * sin(PHA_A)
i = MAG_B * sin(PHA_B)
rsum = r + i
rdif = r - i
r = MAG_A * cos(PHA_A)
i = MAG_B * cos(PHA_B)
isum = r + i
idif = r - i
r = -cos(INDEX)
i = -sin(INDEX)
rtmp = r * isum + i * rdif
itmp = i * isum - r * rdif
OUTPUT rsum + rtmp
OUTPUT itmp + idif
OUTPUT rsum - rtmp
OUTPUT itmp - idif
original working code, if that's your poison:
void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index, int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
unsigned size = 1 << bits;
unsigned shift = SINE_TABLE_BITS - (bits - 1);
unsigned n = index; // index for mag_lo, pha_lo
unsigned z = size - index; // index for mag_hi, pha_hi
int32_t rsum, rdif, isum, idif, r, i;
r = smmulr(mag_lo, sine(pha_lo)); // mag_lo * sin(pha_lo)
i = smmulr(mag_hi, sine(pha_hi)); // mag_hi * sin(pha_hi)
rsum = r + i; rdif = r - i;
r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
isum = r + i; idif = r - i;
r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
i = -sinetable[index << shift]; // sin(pi_c * (index / size) / 2)
int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
complex[n].r = rsum + rtmp;
complex[n].i = itmp + idif;
complex[z].r = rsum - rtmp;
complex[z].i = itmp - idif;
}
// For reference, this would be used as follows to generate a sawtooth (after IFFT)
void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
unsigned fft_size = 1 << fft_bits;
fft_sym_dc(complex, 0, 0); // sets dc bin [0]
for(unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
// calculation of amplitude/index (sawtooth) for both n and z
fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
}
}

Expand Right Bitwise Algorithm

Originally this post requested an inverse sheep-and-goats operation, but I realized that it was more than I really needed, so I edited the title, because I only need an expand-right algorithm, which is simpler. The example that I described below is still relevant.
Original Post:
I'm trying to figure out how to do either an inverse sheep-and-goats operation or, even better, an expand-right-flip.
According to Hacker's Delight, a sheeps-and-goats operation can be represented by:
SAG(x, m) = compress_left(x, m) | compress(x, ~m)
According to this site, the inverse can be found by:
INV_SAG(x, m, sw) = expand_left(x, ~m, sw) | expand_right(x, m, sw)
However, I can't find any code for the expand_left and expand_right functions. They are, of course, the inverse functions for compress, but compress is kind of hard to understand in itself.
Example:
To better explain what I'm looking for, consider a set of 8 bits like:
0000abcd
The variables a, b, c and d may be either ones or zeros. In addition, there is a mask which repositions the bits. So for example, if the mask were 01100101, the resulting bits would be repositioned as follows:
0ab00c0d
This can be done with an inverse sheeps-and-goats operation. However, according to this section of the site mentioned above, there is a more efficient way which he refers to as the expand-right-flip. Looking at his site, I was unable to figure out how that can be done.
Here's the expand_right from Hacker's Delight, it just says expand but it's the right version.
unsigned expand(unsigned x, unsigned m) {
unsigned m0, mk, mp, mv, t;
unsigned array[5];
int i;
m0 = m; // Save original mask.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel suffix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
array[i] = mv;
m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
mk = mk & ~mp;
}
for (i = 4; i >= 0; i--) {
mv = array[i];
t = x << (1 << i);
x = (x & ~mv) | (t & mv);
}
return x & m0; // Clear out extraneous bits.
}
You can use expand_left(x, m) == expand_right(x >> (32 - popcnt(m)), m) to make the left version, but that's probably not the best way.

How do convolution matrices work?

How do those matrices work? Do I need to multiple every single pixel? How about the upperleft, upperright, bottomleft and bottomleft pixels where there's no surrounding pixel? And does the matrix work from left to right and from up to bottom or from up to bottom first and then left to right?
Why does this kernel (Edge enhance) : http://i.stack.imgur.com/d755G.png
turns into this image: http://i.stack.imgur.com/NRdkK.jpg
The Convolution filter is applied to every single pixel.
On the edges there are a few things you can do (all leave a type of border or shrink the image):
skip the edges and crop 1 pixel from the edge of the image
substitute 0 or 255 for any of the pixels that are out of bounds for the image
use a cubic spline (or other interpolation method) between 0 (or 255) and the value of the images edge pixel to come up with a substitute.
The order you apply the convolution does not matter (upper right to bottom left is most common) you should get the same results no matter the order.
However, a common mistake when applying a convolution matrix is to overwrite the current pixel you are examining with the new value. This will affect the value you come up with for the pixel next to the current one. A better method would be to create a buffer to hold the computed values, so that previous applications of the convolution filter do not affect current application of the matrix.
From your example images it is hard to tell why the filter applied creates the black and white version without seeing the original image.
Below is a step by step example of applying a convolution kernel to an image (1D for simplicity).
As for the edge enhancement kernel in your post, notice the +1 next to the -1. Think about what that will do. If the region is constant the two pixel under the +/-1 will add to zero (black). If the two pixels are different they will have a non-zero value. So what you are seeing is that pixels next to each other that are different get highlighted, while ones that are the same get set to black. The bigger the difference the brighter (more white) the pixel in the filtered image.
Yes, you multiply every pixel, with that matrix. The traditional method is to find the relevant pixels relative to the pixel being convoluted, multiple the factors, and average it out. So a 3x3 blur of:
1, 1, 1,
1, 1, 1,
1, 1, 1
This matrix, means you take the relevant values of the various components and multiply them. Then divide by the number of elements. So you would get that 3 by 3 box, add up all the red values then divide by 9. You'd get the 3 by 3 box, add up all the green values then divide by 9. You'd get the 3 by 3 box, add up all the blue values then divide by 9.
This means a couple things. First, you need a second giant chunk of memory to perform this operation. And you do every pixel you can.
However, that's only for the traditional method and the traditional method is actually needlessly convoluted (get it?). If you return the results in a corner. You never actually need any additional memory and always do the entire operation within the memory footprint you started with.
public static void convolve(int[] pixels, int offset, int stride, int x, int y, int width, int height, int[][] matrix, int parts) {
int index = offset + x + (y*stride);
for (int j = 0; j < height; j++, index += stride) {
for (int k = 0; k < width; k++) {
int pos = index + k;
pixels[pos] = convolve(pixels,stride,pos, matrix, parts);
}
}
}
private static int crimp(int color) {
return (color >= 0xFF) ? 0xFF : (color < 0) ? 0 : color;
}
private static int convolve(int[] pixels, int stride, int index, int[][] matrix, int parts) {
int redSum = 0;
int greenSum = 0;
int blueSum = 0;
int pixel, factor;
for (int j = 0, m = matrix.length; j < m; j++, index+=stride) {
for (int k = 0, n = matrix[j].length; k < n; k++) {
pixel = pixels[index + k];
factor = matrix[j][k];
redSum += factor * ((pixel >> 16) & 0xFF);
greenSum += factor * ((pixel >> 8) & 0xFF);
blueSum += factor * ((pixel) & 0xFF);
}
}
return 0xFF000000 | ((crimp(redSum / parts) << 16) | (crimp(greenSum / parts) << 8) | (crimp(blueSum / parts)));
}
With the kernel traditionally returning the value to the center most pixel. This allows the image to blur around the edges but more or less remain where it started. This seemed like a good idea but it's actually problematic. The correct way to do it is to have the results pixel in the upper-left corner. Then you can simply, and with no extra memory, just iterate the the entire image with a scanline, going one pixel at a time and returning the value, without causing errors. The bulk of the color weight is shifted up and left by one pixel. But, it's one pixel, and you can shift it back down and to the left if you iterate backwards with a result pixel in the bottom-right. Though this might be trouble for the cache hits.
However, a lot of modern architecture have GPUs now, so the entire image can be done simultaneously. Making it a kind of moot point. But, it is strange that one of the most important algorithm in graphics is weird in requiring this, as that makes the easiest way to do the operation impossible, and a memory hog.
So that people like Matt on this question say things like "However, a common mistake when applying a convolution matrix is to overwrite the current pixel you are examining with the new value." -- Really this is the correct way to do it, the error is writing the result pixel to the center rather than the upper left corner. Because unlike the upper-left corner, you will need the center pixel again. You won't ever need the upper-left corner again (assuming you are iterating left->right, top->bottom), and so it's safe to store your value there.
"This will affect the value you come up with for the pixel next to the current one." -- If you wrote it to the upper left corner as you processed it as a scan, you would overwrite data that you do not ever need again. Using a bunch of extra memory isn't a better solution.
As such, here's likely the fastest Java blur you'd ever see.
private static void applyBlur(int[] pixels, int stride) {
int v0, v1, v2, r, g, b;
int pos;
pos = 0;
try {
while (true) {
v0 = pixels[pos];
v1 = pixels[pos+1];
v2 = pixels[pos+2];
r = ((v0 >> 16) & 0xFF) + ((v1 >> 16) & 0xFF) + ((v2 >> 16) & 0xFF);
g = ((v0 >> 8 ) & 0xFF) + ((v1 >> 8) & 0xFF) + ((v2 >> 8) & 0xFF);
b = ((v0 ) & 0xFF) + ((v1 ) & 0xFF) + ((v2 ) & 0xFF);
r/=3;
g/=3;
b/=3;
pixels[pos++] = r << 16 | g << 8 | b;
}
}
catch (ArrayIndexOutOfBoundsException e) { }
pos = 0;
try {
while (true) {
v0 = pixels[pos];
v1 = pixels[pos+stride];
v2 = pixels[pos+stride+stride];
r = ((v0 >> 16) & 0xFF) + ((v1 >> 16) & 0xFF) + ((v2 >> 16) & 0xFF);
g = ((v0 >> 8 ) & 0xFF) + ((v1 >> 8) & 0xFF) + ((v2 >> 8) & 0xFF);
b = ((v0 ) & 0xFF) + ((v1 ) & 0xFF) + ((v2 ) & 0xFF);
r/=3;
g/=3;
b/=3;
pixels[pos++] = r << 16 | g << 8 | b;
}
}
catch (ArrayIndexOutOfBoundsException e) { }
}

More CPU/time-efficient way to get N quasi random numbers than using N rand() calls?

This is a continuation of another question about efficiently getting quasi-random numbers.
I need to get N unique quasi-random numbers, bias/distribution quality is no big deal. Can I get them more CPU/time-efficiently than using N calls to rand() or /dev/random etc? Maybe using some math/bitwise manipulations on few random numbers or something like that. Or using precomputed tables or...
N can be quite big like 10000 or 1000000.
Thanks.
I found simething like this here:
m_w = <choose-initializer>; /* must not be zero */
m_z = <choose-initializer>; /* must not be zero */
uint get_random()
{
m_z = 36969 * (m_z & 65535) + (m_z >> 16);
m_w = 18000 * (m_w & 65535) + (m_w >> 16);
return (m_z << 16) + m_w; /* 32-bit result */
}

Resources