Expected collisions for a perfect 32-bit CRC - probability

I'm trying to determine how my CRC compares to an "ideal" 32-bit CRC.
So I ran my CRC over 1 million completely random data samples and counted the collisions. I want to compare that count to the number of collisions I could expect from an "ideal" CRC.
Does anyone know how to calculate the expected number of collisions for an "ideal" 32-bit CRC?

Compare your CRC against one that uses 0x1EDC6F41 as your "ideal" reference.
Having said that, there is no ideal 32-bit CRC. Different polynomials have different collision characteristics depending on the length of the data hashed. However, a 1993 paper by Castagnoli found what is considered the best 32-bit CRC polynomial over the broadest range of data lengths: 0x1EDC6F41, now known as CRC-32C. That polynomial is used by network protocols such as iSCSI and by the x86 CRC32 instruction.
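For reference, a minimal bit-at-a-time sketch of that CRC in Python (the reflected constant 0x82F63B78 is 0x1EDC6F41 with its bit order reversed; the sample count below is my own choice, smaller than the question's 1 million so it finishes quickly):

    import os
    from collections import Counter

    def crc32c(data: bytes) -> int:
        # CRC-32C (Castagnoli), bit-at-a-time reference implementation.
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    # Count duplicate CRCs over n random inputs (the question used 1 million;
    # the expected count grows roughly with n^2, see the birthday formula below).
    n = 100_000
    counts = Counter(crc32c(os.urandom(16)) for _ in range(n))
    print(n - len(counts), "collisions among", n, "samples")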

This explains the "Birthday Problem" and how to predict the collision probability nicely: CRC32 Hash Collision Probability
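In short: for n inputs hashed uniformly over d = 2^32 possible values, the birthday approximation gives about n(n-1)/(2d) expected collisions, roughly 116 for the 1 million samples in the question. A minimal check in Python:

    n = 1_000_000          # number of random samples, as in the question
    d = 2 ** 32            # number of possible 32-bit CRC values

    expected_pairs = n * (n - 1) / 2 / d                  # colliding pairs
    expected_duplicates = n - d * (1 - (1 - 1 / d) ** n)  # inputs repeating an earlier hash

    print(f"expected colliding pairs:  {expected_pairs:.1f}")       # ~116.4
    print(f"expected duplicate hashes: {expected_duplicates:.1f}")  # ~116.4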

Related

Is there any bit-level error detection algorithm that uses minimal extra bits?

I have a 32-bit number that is created by encoding some data. I want to be more confident that the data (at most a 32-bit number) is not changed when decoding it, so I am going to add some error detection bits.
I need to keep the data as short as possible, so I can only add a few bits for error detection, in some cases just 1 bit.
I'm looking for an algorithm that detects more bit changes and needs fewer extra bits.
I was thinking of calculating a checksum or CRC and just dropping the extra bits, or maybe XORing parts of the result together to make it shorter, but I'm not sure whether the error detection remains good enough.
Thanks in advance for any help.
A 1-bit CRC with polynomial x+1 would simply be the parity of your 32 message bits. That will detect any one-bit error (indeed, any odd number of bit errors) in the resulting 33 bits. For a 2-bit CRC, you can use x²+1. You can define a CRC of any length. See Koopman's list for good CRC polynomials of degree 3 and higher.
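As an illustration, a small Python sketch of the idea (the generic crc_n helper and the test value are my own, not from the answer): parity as the 1-bit CRC, plus a bit-at-a-time CRC of any degree so you can try x²+1 or a polynomial from Koopman's list.

    def parity32(x: int) -> int:
        # 1-bit CRC with polynomial x+1: the parity of the 32 message bits.
        x ^= x >> 16
        x ^= x >> 8
        x ^= x >> 4
        x ^= x >> 2
        x ^= x >> 1
        return x & 1

    def crc_n(message: int, width: int, degree: int, poly: int) -> int:
        # Bitwise CRC of a `width`-bit integer message. `poly` holds the low
        # `degree` bits of the generator (the leading x^degree term is implicit);
        # no reflection, zero initial value.
        mask = (1 << degree) - 1
        reg = 0
        for i in reversed(range(width)):              # message bits, MSB first
            top = (reg >> (degree - 1)) & 1
            reg = ((reg << 1) & mask) | ((message >> i) & 1)
            if top:
                reg ^= poly
        for _ in range(degree):                       # flush with zero bits
            top = (reg >> (degree - 1)) & 1
            reg = (reg << 1) & mask
            if top:
                reg ^= poly
        return reg

    data = 0xDEADBEEF                 # arbitrary 32-bit example value
    print(parity32(data))             # 1-bit check value
    print(crc_n(data, 32, 2, 0b01))   # 2-bit CRC with polynomial x^2 + 1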

Generating 32-bit random number seed on 16-bit CPU

I'm writing a program in assembly for a 16-bit CPU (8086), and I need to generate a 32-bit random number seed. I have about 80 bits of entropy, but many of those bits are not completely uniformly random. How do I combine those 80 bits into a 32-bit seed, so that each bit of the seed (and the seed as a whole) is much more uniformly distributed than each of the original 80 bits?
Preferably I need a short and simple algorithm, C code or 8086 assembly code.
I need something better than just xor the entropy bits together, preferably something which was proven to be high quality by randomness probes and/or mathematical theory.
I need something shorter than computing the MD5 and taking the first 32 bits, because the MD5 algorithm implementation is quite long.
I'm aware of MurmurHash3 32-bit (see C implementation), but it's too long, and it uses too many 32-bit operations (e.g. multiplication). I need something shorter and simpler for a 16-bit CPU.

Is it acceptable to use each byte of a PRNG-generated number separately?

Say you have a non-cryptographically secure PRNG that generates 64-bit output.
Assuming that bytes are 8 bits, is it acceptable to use each byte of the 64-bit output as separate 8-bit random numbers or would that possibly break the randomness guarantees of a good PRNG? Or does it depend on the PRNG?
Because the PRNG is not cryptographically secure, the "randomness guarantee" I am worried about is not security, but whether the byte stream has the same guarantee of randomness, using the same definition of "randomness" that PRNG authors use, that the PRNG has with respect to its 64-bit output.
This should be quite safe with a CSPRNG. For comparison, it's like reading /dev/random byte by byte. With a good CSPRNG it is also perfectly acceptable to simply generate a 64-bit sample 8 times and pick 8 bits per sample (throwing away the other 56 bits).
With PRNGs that are not CSPRNGs you will have "security" concerns in terms of the raw output of the PRNG that outweigh whether or not you chop the output into byte-sized chunks.
In all cases it is vital to make sure the PRNG is seeded and periodically re-seeded correctly (so as to flush any possibly compromised internal state regularly). Security depends on the unpredictability of your internal state, which is ultimately driven by the quality of your seed input. One thing good CSPRNG implementations will do for you is to pessimistically estimate the amount of captured 'entropy' to safeguard the output from predictable internal state.
Note however that with 8 bits you only have 256 possible outputs in any case, so it becomes more of a question of how you use this. For instance, if you do something like XOR-based encryption against the output of a PRNG (i.e. treating it as a one-time pad based on some pre-shared secret seed), then a known-plaintext attack may relatively easily reveal the contents of the internal state of the PRNG. That is another type of attack which good CSPRNG implementations are supposed to guard against by design (using e.g. a computationally secure hash function).
EDIT to add: if you don't care about "security" but only need the output to look random, then this should be quite safe -- in theory a good PRNG is just as likely to yield a 0 as a 1 in any bit position, so no octet should behave differently from another. In other words, you expect a uniform distribution over the possible output values. One thing you can do to verify whether splitting the output skews the distribution is to run a Monte Carlo simulation of some reasonably large size (e.g. 1M samples) and compare 256-bin histograms of the raw 64-bit output and of the 8 × 8-bit output. You expect a roughly flat histogram in both cases if the uniform distribution is preserved.
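A sketch of that Monte Carlo check in Python (random.getrandbits stands in for the PRNG under test; this simplified version only compares the byte histogram against the flat expectation):

    import random
    from collections import Counter

    N = 1_000_000
    counts = Counter()
    for _ in range(N):
        x = random.getrandbits(64)            # one 64-bit PRNG output
        for i in range(8):                    # use each byte separately
            counts[(x >> (8 * i)) & 0xFF] += 1

    expected = N * 8 / 256                    # flat-histogram expectation per bin
    worst = max(abs(c - expected) / expected for c in counts.values())
    print(f"largest relative deviation from flat: {worst:.3%}")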
It depends on the generator and its parameterization. Quoting from the Wikipedia page for Linear Congruential Generators: "The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. [...] any full-cycle LCG when m is a power of 2 will produce alternately odd and even results."
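To see that concretely, a small Python sketch using a common power-of-two-modulus parameterization (the Numerical Recipes constants a = 1664525, c = 1013904223, m = 2^32):

    def lcg(seed, a=1664525, c=1013904223, m=2 ** 32):
        x = seed
        while True:
            x = (a * x + c) % m
            yield x

    gen = lcg(seed=42)
    print([next(gen) & 1 for _ in range(16)])   # low bit alternates 1, 0, 1, 0, ...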

Compress Random 32-bit Integers: How close can we get to Shannon Entropy?

I've developed a lossless compression algorithm that compresses 32-bit integers (of unknown frequency/probability) to 31.95824 bits per integer (it works a lot better for smaller values, just as most compression algorithms do). Obviously it isn't possible to compress uniformly-distributed random data to become smaller than its uncompressed size.
Therefore my question is, which lossless compression algorithms get closest to the Shannon Entropy of 32 bits per integer for pseudorandom data, assuming 32-bit integers?
Essentially, I'm looking for a table which includes compression algorithms and their respective bits-per-integer value for positive, compressed, 32-bit integers.
When you say "it works a lot better for smaller values", I presume that you have a transformation from the 32-bit integer to a variable-bit-length representation that is optimized for some non-uniform expected distribution of values. Then that same transformation applied to a uniform distribution of 32-bit values will necessarily take more than 32 bits on average. How much more depends on how non-uniform a distribution you started with.
So the answer is, of course you can get to 32 bits exactly by doing nothing at all to the number. But then you are not optimized for the application implied by the non-uniform distribution you designed to.
The identity function requires precisely 32 bits per 32 bit integer, which is pretty hard to beat. (There are many other length-preserving bijections, if you insist on changing the data stream.)
It's not obvious to me what other criteria you might be employing to recommend an algorithm which does worse than that. Perhaps you believe that the input stream is not truly a uniform sample; rather, it is restricted to (or significantly biased towards) a subset of the universe, but you do not know a priori what the subset is. In that case, the entropy of the stream is less than one bit per input bit (provided there is an upper bound on the size of the subset that is reasonably smaller than the size of the universe) and you might be able to actually compress the input stream.
It's worth noting that unless messages are fixed-length, the length of the message needs to be taken into account in the computation of entropy, both in the numerator and the denominator. For very long messages, that can mostly be ignored but if messages are short, the cost of message delimiters (or explicit length indicators) can be significant. (Otherwise, "compressing" to 103% of original size is a somewhat humptydumptyesque definition of "to compress".)
This is exactly what Quantile Compression (https://github.com/mwlon/quantile-compression/) was built to do: lossless compression of numbers drawn from a numerical distribution. I'm not aware of any other algorithms that do this. You can see its results vs. the theoretical optimum in the readme. It also works on floats and timestamps! I'm not sure what your distribution is, but real-world distributions often only take a few bits per number.
It works by encoding each number in the sequence as a Huffman code for a coarse numeric range and then an offset for the exact position within that range.
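A rough Python sketch of that range-plus-offset idea, not the actual quantile-compression implementation (the power-of-two bucketing is my own simplification, and the cost of transmitting the code table is ignored): bucket each value by bit length, Huffman-code the bucket index, then spend the remaining bits on the offset inside the bucket.

    import heapq, random
    from collections import Counter

    def huffman_lengths(freqs):
        # Return {symbol: code length} for a frequency table.
        heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        lengths = {s: 0 for s in freqs}
        while len(heap) > 1:
            f1, _, s1 = heapq.heappop(heap)
            f2, _, s2 = heapq.heappop(heap)
            for s in s1 + s2:
                lengths[s] += 1
            heapq.heappush(heap, (f1 + f2, len(lengths) + len(heap), s1 + s2))
        return lengths

    def encoded_bits(values):
        buckets = [v.bit_length() for v in values]     # coarse range index
        lengths = huffman_lengths(Counter(buckets))    # bits for the range code
        return sum(lengths[b] + max(b - 1, 0) for b in buckets)  # plus offset bits

    values = [random.getrandbits(32) for _ in range(100_000)]
    print(encoded_bits(values) / len(values), "bits per integer")  # ~32 for uniform input

On uniformly random 32-bit input this lands at essentially 32 bits per integer, consistent with the answers above; the more skewed the input distribution, the shorter the range codes and offsets become.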

How to uniquely represent 99,999 bits as a byte, word, or double word

I have 99,999 bit flags that I need to represent uniquely with 32 bits or less. Any of the bits can be set, and I need to know if the set bits differ from a comparable set of bits. I am considering using a CRC as a hash to store a unique value, but I am not sure if collisions will be a problem. Ideally, fewer than 500 of these bits will be set at any given time, but they will not be known ahead of time.
Is there a suitable hash or other algorithm to uniquely represent these bits?
NO!
Without some other information about those bit flags to identify that certain combinations are impossible, this cannot be done. If all combinations are possible, then you will need to use 99,999 bits to store your 99,999 bit flags.
Edit:
Based on the background information that this is meant to reduce network usage, and that the expectation is that only about 500 of the bits are set, there are techniques that can be used, but none of them is a simple hash, and none is compact enough to fit in 32 bits. I would start by looking at Arithmetic Coding. This uses a probability distribution of the symbols that you want to send (0.5% ones, 99.5% zeros) to compress the data. By my computations, you can "expect" a compression of about 22 times. But for inputs that the model considers rare (for example, many more bits set than expected), you will pay the price of transmitting more than your original 99,999 bits.
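For what it's worth, the "about 22 times" figure matches the binary entropy of the assumed 0.5%/99.5% distribution, which is the rate an ideal arithmetic coder approaches; a quick check in Python:

    from math import log2

    n, k = 99_999, 500
    p = k / n                                   # probability a flag bit is set
    h = -p * log2(p) - (1 - p) * log2(1 - p)    # binary entropy, bits per flag bit

    print(f"entropy per bit:     {h:.4f}")                                     # ~0.0454
    print(f"expected compressed: {n * h:.0f} bits (~{n * h / 8:.0f} bytes)")   # ~4542 bits
    print(f"compression factor:  {1 / h:.1f}x")                                # ~22x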
