Practical Compression of Random Data - algorithm

So yesterday I asked a question on compression of a sequence of integers (link) and most comments had a similar point: if the order is random (or worst, the data is completely random) then one have to settle down with log2(k) bits for a value k. I've also read similar replies in other questions on this site. Now, I hope this isn't a silly question, if I take that sequence and serialize it to a file and then I run gzip on this file then I do achieve compression (and depending on the time I allow gzip to run I might get high compression). Could somebody explain this fact ?
Thanks in advance.

My guess is that you're achieving compression on your random file because you're not using an optimal serialization technique, but without more details it's impossible to answer your question. Is the compressed file with n numbers in the range [0, k) less than n*log2(k) bits? (That is, n*log256(k) bytes). If so, does gzip manage to do that for all the random files you generate, or just occasionally?
Let me note one thing: suppose you say to me, "I've generated a file of random octets by using a uniform_int_distribution(0, 255) with the mt19937 prng [1]. What's the optimal compression of my file?" Now, my answer could reasonably be: "probably about 80 bits". All I need to reproduce your file is
the value you used to seed the prng, quite possibly a 32-bit integer [2]; and
the length of the file, which probably fits in 48 bits.
And if I can reproduce the file given 80 bits of data, that's the optimal compression. Unfortunately, that's not a general purpose compression strategy. It's highly unlikely that gzip will be able to figure out that you used a particular prng to generate the file, much less that it will be able to reverse-engineer the seed (although these things are, at least in theory, achievable; the Mersenne twister is not a cryptographically secure prng.)
For another example, it's generally recommended that text be compressed before being encrypted; the result will be quite a bit shorter than compressing after encryption. But the fact is that encryption adds very little entropy; at most, it adds the number of bits in the encryption key. Nonetheless, the resulting output is difficult to distinguish from random data, and gzip will struggle to compress it (although it often manages to squeeze a few bits out).
Note 1: Note: that's all c++11/boost lingo. mt19937 is an instance of the Mersenne twister pseudo-random number generator (prng), which has a period of 2^19937 - 1.
Note 2: The state of the Mersenne twister is actually 624 words (19968 bits), but most programs use somewhat fewer bits to seed it. Perhaps you used a 64-bit integer instead of a 32-bit integer, but it doesn't change the answer by much.

if I take that sequence and serialize it to a file and then I run gzip
on this file then I do achieve compression
What is "it"? If you take random bytes (each uniformly distributed in 0..255) and feed them to gzip or any compressor, you may on very rare occasions get a small amount of compression, but most of the time you will get a small amount of expansion.

If the data is truly random, on average no compression algorithm can compress it. But if the data has some predictable patterns (for e.g. if the probability of a symbol is dependent on the previous k-symbols occurring in the data), many (prediction-based) compression algorithms will succeed.

Related

Is it acceptable to use each byte of a PRNG-generated number separately?

Say you have a non-cryptographically secure PRNG that generates 64-bit output.
Assuming that bytes are 8 bits, is it acceptable to use each byte of the 64-bit output as separate 8-bit random numbers or would that possibly break the randomness guarantees of a good PRNG? Or does it depend on the PRNG?
Because the PRNG is not cryptographically secure, the "randomness guarantee" I am worried about is not security, but whether the byte stream has the same guarantee of randomness, using the same definition of "randomness" that PRNG authors use, that the PRNG has with respect to its 64-bit output.
This should be quite safe with a CSPRNG. For comparison it's like reading /dev/random byte by byte. With a good CSPRNG it is also perfectly acceptable to simply generate a 64bit sample 8 times and pick 8 bits per sample as well (throwing away the 56 other bits).
With PRNGs that are not CSPRNG you will have 'security' concerns in terms of the raw output of the PRNG that outweigh whether or not you chop up output into byte sized chunks.
In all cases it is vital to make sure the PRNG is seeded and periodically re-seeded correctly (so as to flush any possibly compromised internal state regularly). Security depends on the unpredictability of your internal state, which is ultimately driven by the quality of your seed input. One thing good CSPRNG implementations will do for you is to pessimistically estimate the amount of captured 'entropy' to safeguard the output from predictable internal state.
Note however that with 8 bits you only have 256 possible outputs in any case, so it becomes more of a question of how you use this. For instance, if you do something like XOR based encryption against the output of a PRNG (i.e. treating it as a one time pad based on some pre shared secret seed), then using a known plain text attack may relatively easily reveal the contents of the internal state of the PRNG. That is another type of attack which good CSPRNG implementations are supposed to guard against by their design (using e.g. a computationally secure hash function).
EDIT to add: if you don't care about 'security' but only need the output to look random, then this should be quite safe -- in theory a good PRNG is just as likely to yield a 0 as 1, and that should not vary between any octet. So you expect a linear distribution of possible output values. One thing you can do to verify whether this skews the distribution is to run a Monte Carlo simulation of some reasonably large size (e.g. 1M) and compare the histograms with 256 bins for both the raw 64 bit and the 8 * 8 bit output. You expect a roughly flat diagram for both cases if the linear distribution is preserved intact.
It depends on the generator and its parameterization. Quoting from the Wikipedia page for Linear Congruential Generators: "The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. [...]any full-cycle LCG when m is a power of 2 will produce alternately odd and even results."

Compress Random 32-bit Integers: How close can we get to Shannon Entropy?

I've developed a lossless compression algorithm that compresses 32-bit integers (of unknown frequency/probability) to 31.95824 bits per integer (it works a lot better for smaller values, just as most compression algorithms do). Obviously it isn't possible to compress uniformly-distributed random data to become smaller than its uncompressed size.
Therefore my question is, which lossless compression algorithms get closest to the Shannon Entropy of 32 bits per integer for pseudorandom data, assuming 32-bit integers?
Essentially, I'm looking for a table which includes compression algorithms and their respective bits-per-integer value for positive, compressed, 32-bit integers.
When you say "it works a lot better for smaller values", I presume that you have a transformation from the 32-bit integer to a variable-bit-length representation that is optimized for some non-uniform expected distribution of values. Then that same transformation applied to a uniform distribution of 32-bit values will necessarily take more than 32 bits on average. How much more depends on how non-uniform a distribution you started with.
So the answer is, of course you can get to 32 bits exactly by doing nothing at all to the number. But then you are not optimized for the application implied by the non-uniform distribution you designed to.
The identity function requires precisely 32 bits per 32 bit integer, which is pretty hard to beat. (There are many other length-preserving bijections, if you insist on changing the data stream.)
It's not obvious to me what other criteria you might be employing to recommend an algorithm which does worse than that. Perhaps you believe that the input stream is not truly a uniform sample; rather, it is a restricted to (or significantly biased towards) a subset of the universe, but you do not a priori know what the subset is. In that case, the entropy of the stream is less than one (if there is an upper bound on the size of the subset which is reasonably less than the size of the universe) and you might be able to actually compress the input stream.
It's worth noting that unless messages are fixed-length, the length of the message needs to be taken into account in the computation of entropy, both in the numerator and the denominator. For very long messages, that can mostly be ignored but if messages are short, the cost of message delimiters (or explicit length indicators) can be significant. (Otherwise, "compressing" to 103% of original size is a somewhat humptydumptyesque definition of "to compress".)
This is exactly what Quantile Compression (https://github.com/mwlon/quantile-compression/) was built to do: lossless compression of numbers drawn from a numerical distributuon. I'm not aware of any other algorithms that do this. You can see its results vs theoretical optimum in the readme. It also works on floats and timestamps! I'm not sure what your distribution is, but real-world distributions often only take a few bits per number with
It works by encoding each number in the sequence as a Huffman code for a coarse numeric range and then an offset for the exact position within that range.

How exactly does PC/Mac generates random numbers for either 0 or 1?

This question is NOT about how to use any language to generate a random number between any interval. It is about generating either 0 or 1.
I understand that many random generator algorithm manipulate the very basic random(0 or 1) function and take seed from users and use an algorithm to generate various random numbers as needed.
The question is that how the CPU generate either 0 or 1? If I throw a coin, I can generate head or tailer. That's because I physically throw a coin and let the nature decide. But how does CPU do it? There must be an action that the CPU does (like throwing a coin) to get either 0 or 1 randomly, right?
Could anyone tell me about it?
Thanks
(This has several facets and thus several algorithms. Keep in mind that there are many different forms of randomness used for different purposes, but I understand your question in the way that you are interested in actual randomness used for cryptography.)
The fundamental problem here is that computers are (mostly) deterministic machines. Given the same input in the same state they always yield the same result. However, there are a few ways of actually gathering entropy:
User input. Since users bring outside input into the system you can take that to derive some bits from that. Similar to how you could use radioactive decay or line noise.
Network activity. Again, an outside source of stuff.
Generally interrupts (which kinda include the first two).
As alluded to in the first item, noise from peripherals, such as audio input or a webcam can be used.
There is dedicated hardware that can generate a few hundred MiB of randomness per second. Usually they give you random numbers directly instead of their internal entropy, though.
How exactly you derive bits from that is up to you but you could use time between events, or actual content from the events, etc. – generally eliminating bias from entropy sources isn't easy or trivial and a lot of thought and algorithmic work goes into that (in the case of the aforementioned special hardware this is all done in hardware and the code using it doesn't need to care about it).
Once you have a pool of actually random bits you can just use them as random numbers (/dev/random on Linux does that). But this has downsides, since there is usually little actual entropy and possibly a higher demand for random numbers. So you can invent algorithms to “stretch” that initial randomness in a manner that makes it still impossible or at least very difficult to predict anything about following numbers (/dev/urandom on Linux or both /dev/random and /dev/urandom on FreeBSD do that). Fortuna and Yarrow are so-called cryptographically secure pseudo-random number generators and designed with that in mind. You still have a very good guarantee about the quality of random numbers you generate, but have many more before your entropy pool runs out.
In any case, the CPU itself cannot give you a random 0 or 1. There's a lot more involved and this usually includes the complete computer system or special hardware built for that purpose.
There is also a second class of computational randomness: Plain vanilla pseudo-random number generators (PRNGs). What I said earlier about determinism – this is the embodiment of it. Given the same so-called seed a PRNG will yield the exact same sequence of numbers every time¹. While this sounds idiotic it has practical benefits.
Suppose you run a simulation involving lots of random numbers, maybe to simulate interaction between molecules or atoms that involve certain probabilities and unpredictable behaviour. In science you want results anyone can independently verify, given the same setup and procedure (or, with computing, the same algorithms). If you used actual randomness the only option you have would be to save every single random number used to make sure others can replicate the results independently.
But with a PRNG all you need to save is the seed and remember what algorithm you used. Others can then get the exact same sequence of pseudo-random numbers independently. Very nice property to have :-)
Footnotes
¹ This even includes the CSPRNGs mentioned above, but they are designed to be used in a special way that includes regular re-seeding with entropy to overcome that problem.
A CPU can only generate a uniform random number, U(0,1), which happens to range from 0 to 1. So mathematically, it would be defined as a random variable U in the range [0,1]. Examples of random draws of a U(0,1) random number in the range 0 to 1 would be 0.28100002, 0.34522, 0.7921, etc. The probability of any value between 0 and 1 is equal, i.e., they are equiprobable.
You can generate binary random variates that are either 0 or 1 by setting a random draw of U(0,1) to a 0 if U(0,1)<=0.5 and 1 if U(0,1)>0.5, since in theory there will be an equal number of random draws of U(0,1) below 0.5 and above 0.5.

Does Kernel::srand have a maximum input value?

I'm trying to seed a random number generator with the output of a hash. Currently I'm computing a SHA-1 hash, converting it to a giant integer, and feeding it to srand to initialize the RNG. This is so that I can get a predictable set of random numbers for an set of infinite cartesian coordinates (I'm hashing the coordinates).
I'm wondering whether Kernel::srand actually has a maximum value that it'll take, after which it truncates it in some way. The docs don't really make this obvious - they just say "a number".
I'll try to figure it out myself, but I'm assuming somebody out there has run into this already.
Knowing what programmers are like, it probably just calls libc's srand(). Either way, it's probably limited to 2^32-1, 2^31-1, 2^16-1, or 2^15-1.
There's also a danger that the value is clipped when cast from a biginteger to a C int/long, instead of only taking the low-order bits.
An easy test is to seed with 1 and take the first output. Then, seed with 2i+1 for i in [1..64] or so, take the first output of each, and compare. If you get a match for some i=n and all greater is, then it's probably doing arithmetic modulo 2n.
Note that the random number generator is almost certainly limited to 32 or 48 bits of entropy anyway, so there's little point seeding it with a huge value, and an attacker can reasonably easily predict future outputs given past outputs (and an "attacker" could simply be a player on a public nethack server).
EDIT: So I was wrong.
According to the docs for Kernel::rand(),
Ruby currently uses a modified Mersenne Twister with a period of 2**19937-1.
This means it's not just a call to libc's rand(). The Mersenne Twister is statistically superior (but not cryptographically secure). But anyway.
Testing using Kernel::srand(0); Kernel::sprintf("%x",Kernel::rand(2**32)) for various output sizes (2*16, 2*32, 2*36, 2*60, 2*64, 2*32+1, 2*35, 2*34+1), a few things are evident:
It figures out how many bits it needs (number of bits in max-1).
It generates output in groups of 32 bits, most-significant-bits-first, and drops the top bits (i.e. 0x[r0][r1][r2][r3][r4] with the top bits masked off).
If it's not less than max, it does some sort of retry. It's not obvious what this is from the output.
If it is less than max, it outputs the result.
I'm not sure why 2*32+1 and 2*64+1 are special (they produce the same output from Kernel::rand(2**1024) so probably have the exact same state) — I haven't found another collision.
The good news is that it doesn't simply clip to some arbitrary maximum (i.e. passing in huge numbers isn't equivalent to passing in 2**31-1), which is the most obvious thing that can go wrong. Kernel::srand() also returns the previous seed, which appears to be 128-bit, so it seems likely to be safe to pass in something large.
EDIT 2: Of course, there's no guarantee that the output will be reproducible between different Ruby versions (the docs merely say what it "currently uses"; apparently this was initially committed in 2002). Java has several portable deterministic PRNGs (SecureRandom.getInstance("SHA1PRNG","SUN"), albeit slow); I'm not aware of something similar for Ruby.

Best algorithm for hashing number values?

When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers, or bank account numbers. Preferred output would be a single unsigned integer to assist in matching purposes.
My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters and because of that, the collision rate might be higher than when run against a larger sample.
The target language is Delphi, however answers from other languages are welcome if they can provide a mathmatical basis which can lead to an optimal solution.
The purpose of this routine will be to determine if a previously received card/account was previously processed or not. The input file could have multiple records against a database of multiple records so performance is a factor.
With security questions all the answers lay on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure, and one that is very convenient. Given that and the explanation of each you can choose the best solution for your system.
You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CCV, expiration date, name, etc. would render it useless since it the value could be different with the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.
The convenient solution is to use a FNV (As Zebrabox and Nick suggested). This will produce a 32 bit number that will index quickly for searches. The downside of course is that it only allows for at max 4 billion different numbers, and in practice will produce collisions much quicker then that. Because it has such a high collision rate a brute force attack will probably generate enough invalid results as to make it of little use.
The secure solution is to rely on SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed to not have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.
Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.
If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may effect the duration. You may find that your optimizing the fastest point in the process, which will have little to no actual impact.
Also, I would recommend that you benchmark the look-up on the full hash vs. the 32 bit subset. Most modern database system are fairly fast and contain a number of optimizations and frequently optimize for us doing things the easy way. When we try to get smart we sometimes just slow it down. What is that quote about premature optimization . . . ?
This seems to be a case for key derivation functions. Have a look at PBKDF2.
Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force because this hash algorithms are usually designed to be as fast as possible.
UPDATE
Okay, security is no concern for your task. Because you have already a numerical input, you could just use this (account) number modulo your hash table size. If you process it as string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest to try highly non-linear hash function to spread this clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128 bit hash in four groups of 32 bits, combine them using XOR, and interpret the result as a 32 bit integer.
While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.
If you need security, use a cryptographically secure hash, such as SHA-256.
I needed to look deeply into hash functions a few months ago. Here are some things I found.
You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but could be 16 or 64-bits.) You want every character of the input to have and equally large effect on the output.
ALL the simple hashes (like ELF or PJW) that simply loop through the string and xor in each byte with a shift or a mod will fail that criteria for a simple reason: The last characters added have the most effect.
But there are some really good algorithms available in Delphi and asm. Here are some references:
See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c
SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html
You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his
hashtable. I suggested the old but nicely performing elf-hash, but also noted
a much better hash function I recently found. It was called SuperFastHash (SFH)
and was created by Paul Hsieh to overcome his 'problems' with the hash functions
from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm.
A few people worked on a basm implementation and posted it."
The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies c++ code MurmurrHash2.cpp and aligned-read-only implementation -
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implimentations of SuperFashHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman impliments both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196
Sorry if the above turns out to look like a mess. I had to just cut&paste it.
At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.
You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.
If performance is a factor I suggest to take a look at a CodeCentral entry of Peter Below. It performs very well for large number of items.
By default it uses P.J. Weinberger ELF hashing function. But others are also provided.
By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.
So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.
For a non cryptographic approach you could take a look at the FNV hash it's fast with a low collision rate.
As a very fast alternative, I've also used this algorithm for a few years and had few collision issues however I can't give you a mathematical analysis of it's inherent soundness but for what it's worth here it is
=Edit - My code sample was incorrect - now fixed =
In c/c++
unsigned int Hash(const char *s)
{
int hash = 0;
while (*s != 0)
{
hash *= 37;
hash += *s;
s++;
}
return hash;
}
Note that '37' is a magic number, so chosen because it's prime
Best hash function for the natural numbers let
f(n)=n
No conflicts ;)

Resources