Generating 32-bit random number seed on 16-bit CPU - random

I'm writing a program in assembly for a 16-bit CPU (8086), and I need to generate a 32-bit random number seed. I have about 80 bits of entropy, but many of those bits are not completely uniformly random. How do I combine those 80 bits to a random number seed of 32 bits, so that each bit of the seed (and the entire seed itself) would be much more uniformly distributed random than each of the original 80 bits?
Preferably I need a short and simple algorithm, C code or 8086 assembly code.
I need something better than just xor the entropy bits together, preferably something which was proven to be high quality by randomness probes and/or mathematical theory.
I need something shorter than just computing the MD5 and taking the first 32 bits, because an MD5 implementation is quite long.
I'm aware of MurmurHash3 32-bit (see C implementation), but it's too long, and it uses too many 32-bit operations (e.g. multiplication). I need something shorter and simpler for a 16-bit CPU.

Related

Is it acceptable to use each byte of a PRNG-generated number separately?

Say you have a non-cryptographically secure PRNG that generates 64-bit output.
Assuming that bytes are 8 bits, is it acceptable to use each byte of the 64-bit output as separate 8-bit random numbers or would that possibly break the randomness guarantees of a good PRNG? Or does it depend on the PRNG?
Because the PRNG is not cryptographically secure, the "randomness guarantee" I am worried about is not security, but whether the byte stream has the same guarantee of randomness, using the same definition of "randomness" that PRNG authors use, that the PRNG has with respect to its 64-bit output.
This should be quite safe with a CSPRNG. For comparison it's like reading /dev/random byte by byte. With a good CSPRNG it is also perfectly acceptable to simply generate a 64bit sample 8 times and pick 8 bits per sample as well (throwing away the 56 other bits).
With PRNGs that are not CSPRNG you will have 'security' concerns in terms of the raw output of the PRNG that outweigh whether or not you chop up output into byte sized chunks.
In all cases it is vital to make sure the PRNG is seeded and periodically re-seeded correctly (so as to flush any possibly compromised internal state regularly). Security depends on the unpredictability of your internal state, which is ultimately driven by the quality of your seed input. One thing good CSPRNG implementations will do for you is to pessimistically estimate the amount of captured 'entropy' to safeguard the output from predictable internal state.
Note however that with 8 bits you only have 256 possible outputs in any case, so it becomes more of a question of how you use this. For instance, if you do something like XOR based encryption against the output of a PRNG (i.e. treating it as a one time pad based on some pre shared secret seed), then using a known plain text attack may relatively easily reveal the contents of the internal state of the PRNG. That is another type of attack which good CSPRNG implementations are supposed to guard against by their design (using e.g. a computationally secure hash function).
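To make the known-plaintext point concrete, here is a minimal sketch. It assumes a toy setup that is not from the question: the keystream comes from Marsaglia's xorshift32, whose output is its entire internal state, and the attacker knows the first 32-bit word of the plaintext. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marsaglia's xorshift32. The returned value IS the full internal state. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* XOR every word with the keystream; the same call encrypts and decrypts. */
static void xor_crypt(uint32_t *buf, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    size_t i;

    assert(seed != 0); /* xorshift32 is stuck at zero forever */
    for (i = 0; i < n; i++)
        buf[i] ^= xorshift32(&state);
}

/* Attacker's view: no seed, only the ciphertext and the first plaintext
 * word. cipher[0] ^ known_first is the first keystream word, which is
 * also the generator state, so the rest of the keystream follows. */
static void recover_from_known_word(uint32_t *cipher, size_t n,
                                    uint32_t known_first)
{
    uint32_t state;
    size_t i;

    assert(n > 0);
    state = cipher[0] ^ known_first;
    cipher[0] = known_first;
    for (i = 1; i < n; i++)
        cipher[i] ^= xorshift32(&state);
}
```

A generator that hides its state behind a one-way function does not leak the state this directly, which is the design property the paragraph above refers to.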
EDIT to add: if you don't care about 'security' but only need the output to look random, then this should be quite safe -- in theory a good PRNG is just as likely to yield a 0 as a 1 in every bit position, and that should not vary between octets. So you expect a uniform distribution of possible output values. One thing you can do to verify whether this skews the distribution is to run a Monte Carlo simulation of some reasonably large size (e.g. 1M samples) and compare 256-bin histograms for the raw 64-bit output and for the 8 * 8-bit output. You expect a roughly flat histogram in both cases if the uniform distribution is preserved intact.
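A rough sketch of such a histogram check, under some assumptions: SplitMix64 stands in for "a good non-cryptographic 64-bit PRNG" (it is not named in the question), the function name is invented, and flatness is summarized with a chi-square statistic rather than by eyeballing a plot.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in 64-bit generator (assumption): SplitMix64. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Histogram one byte position (0 = lowest) of `samples` outputs into
 * 256 bins and return the chi-square statistic against a flat
 * histogram. */
static double byte_chi_square(uint64_t seed, unsigned long samples,
                              unsigned byte_index)
{
    unsigned long hist[256] = {0};
    uint64_t state = seed;
    double expected, chi2 = 0.0;
    unsigned long n;
    int i;

    assert(samples > 0 && byte_index < 8);
    for (n = 0; n < samples; n++)
        hist[(splitmix64(&state) >> (8 * byte_index)) & 0xFF]++;

    expected = (double)samples / 256.0;
    for (i = 0; i < 256; i++) {
        double d = (double)hist[i] - expected;
        chi2 += d * d / expected;
    }
    return chi2;
}
```

With 256 bins there are 255 degrees of freedom, so a flat histogram gives a statistic in the neighborhood of 255. A byte position that lands far above that for many seeds is a hint that it should not be used on its own.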
It depends on the generator and its parameterization. Quoting from the Wikipedia page for Linear Congruential Generators: "The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. [...]any full-cycle LCG when m is a power of 2 will produce alternately odd and even results."
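The quoted behavior is easy to reproduce. The sketch below assumes one particular full-cycle LCG with m = 2^32 (the multiplier 1664525 and increment 1013904223 popularized by Numerical Recipes); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a full-cycle LCG with m = 2^32 (the uint32_t wraparound
 * performs the modulo). Constants are an assumed example. */
static uint32_t lcg_next(uint32_t x)
{
    return x * 1664525u + 1013904223u;
}

/* Returns 1 if the lowest bit flips on every one of `steps` steps. */
static int low_bit_alternates(uint32_t seed, unsigned steps)
{
    uint32_t x = seed;
    unsigned i;

    assert(steps > 0);
    for (i = 0; i < steps; i++) {
        uint32_t next = lcg_next(x);
        if (((x ^ next) & 1u) == 0)
            return 0;
        x = next;
    }
    return 1;
}

/* Number of steps until the lowest byte returns to its starting value. */
static unsigned low_byte_period(uint32_t seed)
{
    uint32_t x = lcg_next(seed);
    unsigned n = 1;

    while ((x & 0xFFu) != (seed & 0xFFu)) {
        x = lcg_next(x);
        n++;
    }
    return n;
}
```

Because the multiplier and increment are both odd, the lowest bit flips on every step, and the lowest byte repeats after only 256 steps. Using the low byte of such a generator as an independent 8-bit random number would therefore be a poor choice, while the high byte behaves much better.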

Bitboard algorithms for board sizes greater than 64?

I know the Magic BitBoard technique is useful for modern games that are played on an 8x8 grid because the board aligns perfectly with a single 64-bit integer, but is the idea extensible to board sizes greater than 64 squares?
Some games like Shogi have larger board sizes such as 81 squares, which doesn't cleanly fit into a 64-bit integer.
I assume you'd have to use multiple integers, but would it be better to use 2 64-bit integers or something like 3 32-bit ones?
I know there probably isn't a trivial answer to this, but what kind of knowledge would I need in order to research something like this? I only have some basic/intermediate algorithms and data structures knowledge.
Yes, you could do this with a structure that contains multiple integers of varying lengths. For example, you could use 11 unsigned bytes. Or a 64-bit integer and a 32-bit integer, etc. Anything that will add up to 81 or more bits.
I rather like the idea of three 32-bit integers because you can store three rows per integer. It makes your indexing code simpler than if you used a 64-bit integer and a 32-bit integer. 9 16-bit words would work well, too, but you're wasting almost half your bits.
You could use 11 unsigned bytes, but the indexing is kind of ugly.
All things considered, I'd probably go with the 3 32-bit integers, using the low 27 bits of each.
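A minimal sketch of that layout, assuming squares are numbered 0 to 80 row by row; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BOARD_SQUARES    81
#define SQUARES_PER_WORD 27 /* three 9-square rows per 32-bit word */

/* 81-square bitboard: the low 27 bits of each word hold three rows. */
typedef struct {
    uint32_t w[3];
} Bitboard81;

static void bb_set(Bitboard81 *b, unsigned sq)
{
    assert(sq < BOARD_SQUARES);
    b->w[sq / SQUARES_PER_WORD] |= (uint32_t)1 << (sq % SQUARES_PER_WORD);
}

static int bb_test(const Bitboard81 *b, unsigned sq)
{
    assert(sq < BOARD_SQUARES);
    return (int)((b->w[sq / SQUARES_PER_WORD] >> (sq % SQUARES_PER_WORD)) & 1u);
}

/* Set operations stay simple word-wise loops. */
static Bitboard81 bb_and(Bitboard81 a, Bitboard81 b)
{
    Bitboard81 r;
    int i;

    for (i = 0; i < 3; i++)
        r.w[i] = a.w[i] & b.w[i];
    return r;
}

/* Population count by clearing the lowest set bit repeatedly. */
static unsigned bb_count(const Bitboard81 *b)
{
    unsigned n = 0;
    int i;

    for (i = 0; i < 3; i++) {
        uint32_t v = b->w[i];
        while (v) {
            v &= v - 1;
            n++;
        }
    }
    return n;
}
```

The word index is simply sq / 27 and the bit index sq % 27, and a whole row can be extracted from a word with one shift and a 9-bit mask, which is what keeps the indexing simpler than with a 64 + 32 split.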

PRNG concatenation

I would like to know if there is a difference between these two points:
a PRNG generating 256 bits
a PRNG generating 8 times 32 bits and concatenating them
In theory, I don't think there's a difference, but there might be one with a PRNG that is less than optimal. Which one do you prefer and why?
If you need 256 bit you should go with option one and calculate the random bytes in one go.
A PRNG usually calculates its random data in blocks which are almost always larger than 32 bits. So if you request 32 bits 8 times, the RNG will a) have to do more calculations and b) drop random data which it has calculated but which was not requested by you.
This might turn into a security problem if you do this a lot of times (millions of times and more) and are not able to reseed the PRNG.

How to uniquely represent 99,999 bits as a byte, word, or double word

I have 99,999 bit flags that I need to represent uniquely with 32 bits or less. Any of the bits can be set and I need to know if the set bits differ from a comparable set of bits. I am considering using CRC to store a unique value hash but I am not sure if collisions will be a problem. Ideally, less than 500 of these bits will be set at any given time, but they will not be known ahead of time.
Is there suitable hash or other algorithm to uniquely represent these bits?
NO!
Without some other information about those bit flags to identify that certain combinations are impossible, this cannot be done. If all combinations are possible, then you will need to use 99,999 bits to store your 99,999 bit flags.
Edit:
Based on the background information that this is to reduce network usage and the expectation is that only about 500 of the bits are set, there are techniques that can be used, but none are a simple hash, and none are efficient enough to store in 32 bits. I would start by looking at Arithmetic Coding. This uses a probability distribution of the characters that you want to send (0.5% 1, 99.5% 0) to compress data. By my computations, you can "expect" a compression of about 22 times. But, for signals that are considered rare, you will pay the price by needing to transmit a signal larger than your starting 99,999 bits.
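The factor of about 22 is consistent with a back-of-the-envelope entropy estimate. As a sketch, assume each flag is set independently with probability p = 0.005 (this independence is an assumption, not something stated in the question):

```latex
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

H(0.005) \approx 0.005 \times 7.644 + 0.995 \times 0.00723
         \approx 0.0454 \text{ bits per flag}

99{,}999 \times 0.0454 \approx 4{,}540 \text{ bits} \approx 568 \text{ bytes},
\qquad \frac{1}{0.0454} \approx 22
```

So even an ideal coder needs on the order of 4,500 bits on average for a typical pattern, far more than the 32 bits asked for.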

Can long integer routines benefit from SSE?

I'm still working on routines for arbitrary long integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.
Everything works fine, but I wondered if I can speed it a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use and here is why:
SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g. is there an integer compare for less?)
The SSE idea is SIMD (single instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128 bit integer add (128 bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)
The integer additions and subtractions handle neither input nor output carries. So it's very cumbersome (and thus, slow) to do it by hand.
My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?
In the past, the answer to this question was a solid, "no". But as of 2017, the situation is changing.
But before I continue, time for some background terminology:
Full Word Arithmetic
Partial Word Arithmetic
Full-Word Arithmetic:
This is the standard representation where the number is stored in base 2^32 or 2^64 using an array of 32-bit or 64-bit integers.
Many bignum libraries and applications (including GMP) use this representation.
In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But operations like addition are more difficult because of the need for carry-propagation.
It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.
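A minimal scalar sketch of full-word addition shows where the problem comes from. It assumes 32-bit limbs stored least significant first; the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n full 32-bit limbs, least significant limb first.
 * Returns the carry out of the top limb (0 or 1). */
static uint32_t bignum_add(uint32_t *r, const uint32_t *a,
                           const uint32_t *b, size_t n)
{
    uint64_t carry = 0;
    size_t i;

    assert(r != NULL && a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        /* Limb i cannot be finished until the carry from limb i - 1
         * is known: a serial dependency through the whole loop. */
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)s;
        carry = s >> 32;
    }
    return (uint32_t)carry;
}
```

Every iteration consumes the carry produced by the previous one, so the limbs cannot simply be spread across SIMD lanes and added independently.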
Partial-Word Arithmetic
This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.
The authors of GMP call this, "nails" where the "nail" is the unused portion of the word.
In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.
Problems with Full-Word Arithmetic:
Vectorizing full-word arithmetic has historically been a lost cause:
SSE/AVX2 has no support for carry-propagation.
SSE/AVX2 has no 128-bit add/sub.
SSE/AVX2 has no 64 x 64-bit integer multiply.*
*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.
Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:
Add-with-Carry: adc, adcx, adox.
Double-word Multiply: Single-operand mul and mulx.
In light of this, both bignum-add and bignum-multiply are difficult for SIMD to beat scalar on x64. Definitely not with SSE or AVX.
With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.
AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.
But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".
Vectorization of Partial-Word Arithmetic
With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.
So as long as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.
Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.
Carryout on horizontally-vectorized bignums is still cheap as you merely shift the nails over the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to do a comparison of two numbers that are almost the same.
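A scalar sketch of the delayed-carry idea, assuming 60 value bits per 64-bit word (4 nail bits) and unsigned limbs for simplicity. Names are made up for illustration, and the normalize step shown is the full sequential carryout, not the cheaper lane-shift variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 60
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* acc += x limb by limb with no carry handling. Each iteration is
 * independent of the others, so a compiler or hand-written SIMD can
 * process several limbs per instruction. */
static void pw_add_nocarry(uint64_t *acc, const uint64_t *x, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        acc[i] += x[i];
}

/* Full carryout: move the nail bits of each limb into the next one.
 * Safe as long as acc holds the sum of at most 16 normalized numbers,
 * so no 64-bit word has overflowed. Returns the carry out of the top. */
static uint64_t pw_normalize(uint64_t *acc, size_t n)
{
    uint64_t carry = 0;
    size_t i;

    assert(acc != NULL);
    for (i = 0; i < n; i++) {
        uint64_t v = acc[i] + carry;
        acc[i] = v & LIMB_MASK;
        carry = v >> LIMB_BITS;
    }
    return carry;
}
```

With 4 nail bits, up to 16 normalized numbers can be summed with the vectorizable loop before a single normalization pass is needed, which is the sense in which the carry is delayed.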
Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.
AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.
Large Multiplication using FFTs
For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).
FFTs are completely vectorizable since they work on independent doubles. This is possible because fundamentally, the representation that FFTs use is a partial word representation.
To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.
If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.
But nevertheless, bignum arithmetic is possible to vectorize.
Disclosure:
I'm the author of y-cruncher which does plenty of large number arithmetic.
