Compress Random 32-bit Integers: How close can we get to Shannon Entropy? - algorithm

I've developed a lossless compression algorithm that compresses 32-bit integers (of unknown frequency/probability) to 31.95824 bits per integer (it works a lot better for smaller values, just as most compression algorithms do). Obviously it isn't possible to compress uniformly-distributed random data to become smaller than its uncompressed size.
Therefore my question is, which lossless compression algorithms get closest to the Shannon Entropy of 32 bits per integer for pseudorandom data, assuming 32-bit integers?
Essentially, I'm looking for a table of compression algorithms and their respective bits-per-integer values when compressing positive 32-bit integers.

When you say "it works a lot better for smaller values", I presume that you have a transformation from the 32-bit integer to a variable-bit-length representation that is optimized for some non-uniform expected distribution of values. Then that same transformation applied to a uniform distribution of 32-bit values will necessarily take more than 32 bits on average. How much more depends on how non-uniform a distribution you started with.
So the answer is, of course you can get to 32 bits exactly by doing nothing at all to the number. But then you are not optimized for the application implied by the non-uniform distribution you designed to.

The identity function requires precisely 32 bits per 32 bit integer, which is pretty hard to beat. (There are many other length-preserving bijections, if you insist on changing the data stream.)
It's not obvious to me what other criteria you might be employing to recommend an algorithm which does worse than that. Perhaps you believe that the input stream is not truly a uniform sample; rather, it is restricted to (or significantly biased towards) a subset of the universe, but you do not know a priori what that subset is. In that case, the entropy of the stream is less than one bit per bit (provided there is an upper bound on the size of the subset which is reasonably less than the size of the universe) and you might be able to actually compress the input stream.
It's worth noting that unless messages are fixed-length, the length of the message needs to be taken into account in the computation of entropy, both in the numerator and the denominator. For very long messages, that can mostly be ignored but if messages are short, the cost of message delimiters (or explicit length indicators) can be significant. (Otherwise, "compressing" to 103% of original size is a somewhat humptydumptyesque definition of "to compress".)

This is exactly what Quantile Compression (https://github.com/mwlon/quantile-compression/) was built to do: lossless compression of numbers drawn from a numerical distribution. I'm not aware of any other algorithms that do this. You can see its results vs the theoretical optimum in the readme. It also works on floats and timestamps! I'm not sure what your distribution is, but real-world distributions often take only a few bits per number.
It works by encoding each number in the sequence as a Huffman code for a coarse numeric range and then an offset for the exact position within that range.
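As a rough illustration of that range-plus-offset split (this is not the actual q-compress code; the power-of-two bin layout below is an assumption chosen purely to show the idea):

#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned bin;         /* coarse range; an entropy coder (e.g. Huffman)
                             can give common bins very short codes         */
    uint32_t offset;      /* exact position inside the range; stored raw   */
    unsigned offset_bits; /* how many raw bits the offset needs            */
} RangeOffset;

static RangeOffset split_value(uint32_t v)
{
    RangeOffset r = {0, 0, 0};
    uint32_t t = v;
    while (t > 0) {       /* bin = bit length of v; bin k >= 1 covers [2^(k-1), 2^k) */
        r.bin++;
        t >>= 1;
    }
    r.offset_bits = r.bin ? r.bin - 1 : 0;
    r.offset = r.bin ? v - (UINT32_C(1) << r.offset_bits) : 0;
    return r;
}

int main(void)
{
    RangeOffset r = split_value(1000);   /* bin 10 = [512, 1024), offset 488 */
    printf("bin %u, offset %u in %u bits\n", r.bin, (unsigned)r.offset, r.offset_bits);
    return 0;
}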

Related

Is it acceptable to use each byte of a PRNG-generated number separately?

Say you have a non-cryptographically secure PRNG that generates 64-bit output.
Assuming that bytes are 8 bits, is it acceptable to use each byte of the 64-bit output as separate 8-bit random numbers or would that possibly break the randomness guarantees of a good PRNG? Or does it depend on the PRNG?
Because the PRNG is not cryptographically secure, the "randomness guarantee" I am worried about is not security, but whether the byte stream has the same guarantee of randomness, using the same definition of "randomness" that PRNG authors use, that the PRNG has with respect to its 64-bit output.
This should be quite safe with a CSPRNG. For comparison, it's like reading /dev/random byte by byte. With a good CSPRNG it is also perfectly acceptable to simply generate a 64-bit sample 8 times and pick 8 bits per sample (throwing away the other 56 bits).
With PRNGs that are not CSPRNG you will have 'security' concerns in terms of the raw output of the PRNG that outweigh whether or not you chop up output into byte sized chunks.
In all cases it is vital to make sure the PRNG is seeded and periodically re-seeded correctly (so as to flush any possibly compromised internal state regularly). Security depends on the unpredictability of your internal state, which is ultimately driven by the quality of your seed input. One thing good CSPRNG implementations will do for you is to pessimistically estimate the amount of captured 'entropy' to safeguard the output from predictable internal state.
Note however that with 8 bits you only have 256 possible outputs in any case, so it becomes more of a question of how you use this. For instance, if you do something like XOR-based encryption against the output of a PRNG (i.e. treating it as a one-time pad based on some pre-shared secret seed), then a known-plaintext attack may relatively easily reveal the contents of the internal state of the PRNG. That is another type of attack which good CSPRNG implementations are supposed to guard against by their design (using e.g. a computationally secure hash function).
EDIT to add: if you don't care about 'security' but only need the output to look random, then this should be quite safe -- in theory a good PRNG is just as likely to yield a 0 as a 1 in any bit position, and that should not vary between octets. So you expect a uniform distribution of possible output values. One thing you can do to verify whether slicing skews the distribution is to run a Monte Carlo simulation of some reasonably large size (e.g. 1M samples) and compare 256-bin histograms for both the raw 64-bit output and the 8 x 8-bit output. You expect a roughly flat histogram in both cases if the uniform distribution is preserved intact.
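Here is a small sketch of that Monte Carlo check (splitmix64 is used here only as a stand-in 64-bit generator, an assumption; substitute whatever PRNG you actually use):

#include <stdint.h>
#include <stdio.h>

static uint64_t state = 0x12345678u;

static uint64_t splitmix64(void)
{
    uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

int main(void)
{
    enum { SAMPLES = 1000000 };
    static uint64_t raw[256], sliced[256];

    for (int i = 0; i < SAMPLES; i++) {
        uint64_t x = splitmix64();
        raw[x & 0xFF]++;                       /* one byte per raw sample */
        for (int b = 0; b < 8; b++)            /* all eight bytes         */
            sliced[(x >> (8 * b)) & 0xFF]++;
    }

    double worst_raw = 0.0, worst_sliced = 0.0;
    for (int i = 0; i < 256; i++) {
        double dr = (double)raw[i] / (SAMPLES / 256.0) - 1.0;
        double ds = (double)sliced[i] / (SAMPLES * 8 / 256.0) - 1.0;
        if (dr < 0) dr = -dr;
        if (ds < 0) ds = -ds;
        if (dr > worst_raw) worst_raw = dr;
        if (ds > worst_sliced) worst_sliced = ds;
    }
    printf("worst bin deviation: raw %.3f%%, sliced %.3f%%\n",
           100.0 * worst_raw, 100.0 * worst_sliced);
    return 0;
}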
It depends on the generator and its parameterization. Quoting from the Wikipedia page for Linear Congruential Generators: "The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. [...]any full-cycle LCG when m is a power of 2 will produce alternately odd and even results."
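As a quick demonstration of the quoted weakness, here is a tiny test using the Numerical Recipes LCG constants (an arbitrary choice of power-of-two-modulus LCG); the least significant bit simply alternates:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t x = 12345;                      /* any seed */
    for (int i = 0; i < 16; i++) {
        x = x * 1664525u + 1013904223u;      /* modulus 2^32 is implicit  */
        printf("%u", x & 1u);                /* low bit: prints 0101...   */
    }
    printf("\n");
    return 0;
}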

Lightweight (de)compression algorithm for embedded use

I have a low-resource embedded system with a graphical user interface. The interface requires font data. To conserve read-only memory (flash), the font data needs to be compressed. I am looking for an algorithm for this purpose.
Properties of the data to be compressed
transparency data for a rectangular pixel map with 8 bits per pixel
there are typically around 200..300 glyphs in a font (a typeface sampled at a certain size)
each glyph is typically from 6x9 to 15x20 pixels in size
there are a lot of zeros ("no ink") and somewhat fewer 255s ("completely inked"); otherwise the distribution of octets is quite even due to the nature of anti-aliasing
Requirements for the compression algorithm
The important metric for the decompression algorithm is the size of the data plus the size of the algorithm (as they will reside in the same limited memory).
There is very little RAM available for the decompression; it is possible to decompress the data for a single glyph into RAM but not much more.
To make things more difficult, the algorithm has to be very fast on a 32-bit microcontroller (ARM Cortex-M core), as the glyphs need to be decompressed while they are being drawn onto the display. Ten or twenty machine cycles per octet is ok, a hundred is certainly too much.
To make things easier, the complete corpus of data is known a priori, and there is a lot of processing power and memory available during the compression phase.
Conclusions and thoughts
The naïve approach of just packing each octet by some variable-length encoding does not give good results due to the relatively high entropy.
Any algorithm taking advantage of data decompressed earlier seems to be out of the question, as it is not possible to store the decompressed data of other glyphs. This makes LZ algorithms less efficient, as they can only reference a small amount of data.
Constraints on the processing power seem to rule out most bitwise operations, i.e. decompression should handle the data octet-by-octet. This makes Huffman coding difficult and arithmetic coding impossible.
The problem seems to be a good candidate for static dictionary coding, as all data is known beforehand, and the data is somewhat repetitive in nature (different glyphs share same shapes).
Questions
How can a good dictionary be constructed? I know finding the optimal dictionary for certain data is an NP-complete problem, but are there any reasonably good approximations? I have tried Zstandard's dictionary builder, but the results were not very good.
Is there something in my conclusions that I've gotten wrong? (Am I on the wrong track and omitting something obvious?)
Best algorithm so far
Just to give some background information, the best useful algorithm I have been able to figure out is as follows:
All samples in the font data for a single glyph are concatenated (flattened) into a one-dimensional array (vector, table).
Each sample has three possible states: 0, 255, and "something else".
This information is packed five consecutive samples at a time into a 5-digit base-three number (0..3^5-1 = 242).
As there are some extra values available in an octet (2^8 = 256, 3^5 = 243), they are used to signify longer strings of 0's and 255's.
For each "something else" value the actual value (1..254) is stored in a separate vector.
This data is fast to decompress, as the base-3 values can be decoded into base-4 values by a smallish (243 x 3 = 729 octets) lookup table. The compression ratios are highly dependent on the font size, but with my typical data I can get around 1:2. As this is significantly worse than LZ variants (which get around 1:3), I would like to try the static dictionary approach.
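A minimal sketch of that packing step (the names and the trit assignment are illustrative; the real decoder uses the 243-entry lookup table described above rather than the divisions shown here):

#include <stdint.h>

static uint8_t pack5(const uint8_t trits[5])
{
    /* result = t0 + 3*t1 + 9*t2 + 27*t3 + 81*t4, always < 243,
       leaving the octet values 243..255 free for run-length escapes */
    uint8_t v = 0;
    for (int i = 4; i >= 0; i--)
        v = (uint8_t)(v * 3 + trits[i]);
    return v;
}

static void unpack5(uint8_t v, uint8_t trits[5])
{
    for (int i = 0; i < 5; i++) {
        trits[i] = v % 3;     /* 0 = "no ink", 1 = intermediate, 2 = 255, say */
        v /= 3;
    }
}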
Of course, the usual LZ variants use Huffman or arithmetic coding, which naturally makes the compressed data smaller. On the other hand, I have all the data available, and the compression speed is not an issue. This should make it possible to find much better dictionaries.
Due to the nature of the data I could be able to use a lossy algorithm, but in that case the most likely lossy algorithm would be reducing the number of quantization levels in the pixel data. That won't change the underlying compression problem much, and I would like to avoid the resulting bit-alignment hassle.
I do admit that this is a borderline case of being a good answer to my question, but as I have researched the problem somewhat, this answer both describes the approach I chose and gives some more information on the nature of the problem should someone bump into it.
"The right answer" a.k.a. final algorithm
What I ended up with is a variant of what I describe in the question. First, each glyph is split into trits 0, 1, and intermediate. This ternary information is then compressed with a 256-slot static dictionary. Each item in the dictionary (or look-up table) is a binary encoded string (0=0, 10=1, 11=intermediate) with a single 1 added to the most significant end.
The grayscale data (for the intermediate trits) is interspersed between the references to the look-up table. So, the data essentially looks like this:
<LUT reference><gray value><gray value><LUT reference>...
The number of gray scale values naturally depends on the number of intermediate trits in the ternary data looked up from the static dictionary.
Decompression code is very short and can easily be written as a state machine with only one pointer and one 32-bit variable giving the state. Something like this:
static uint32_t trits_to_decode;
static uint8_t *next_octet;

/* The 256-slot static dictionary built offline; each entry is the bit-packed
   ternary string described above (declared here only for context). */
extern const uint32_t dictionary[256];

/* This should be called when starting to decode a glyph
   data : pointer to the compressed glyph data */
void start_glyph(uint8_t *data)
{
    next_octet = data;     // set the pointer to the beginning of the glyph
    trits_to_decode = 1;   // this triggers reloading a new dictionary item
}

/* This function returns the next 8-bit pixel value */
uint8_t next_pixel(void)
{
    // end sentinel only? if so, we are out of ternary data
    if (trits_to_decode == 1)
        // get the next ternary dictionary item
        trits_to_decode = dictionary[*next_octet++];

    // get the next pixel from the ternary word
    // check the LSB bit(s)
    if (trits_to_decode & 1)
    {
        trits_to_decode >>= 1;
        // either full value or gray value, check the next bit
        if (trits_to_decode & 1)
        {
            trits_to_decode >>= 1;
            // grayscale value; get next from the buffer
            return *next_octet++;
        }
        // if we are here, it is a full value
        trits_to_decode >>= 1;
        return 255;
    }
    // we have a zero, return it
    trits_to_decode >>= 1;
    return 0;
}
(The code has not been tested in exactly this form, so there may be typos or other stupid little errors.)
There is a lot of repetition with the shift operations. I am not too worried, as the compiler should be able to clean it up. (Actually, left shift could be even better, because then the carry bit could be used after shifting. But as there is no direct way to do that in C, I don't bother.)
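For reference, a hypothetical usage sketch of the decoder above (put_pixel() is assumed to exist elsewhere and is not part of the original code):

/* Draw one glyph by pulling pixels out of the decoder in scan order. */
void draw_glyph(uint8_t *compressed, int width, int height)
{
    start_glyph(compressed);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            put_pixel(x, y, next_pixel());
}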
One more optimization relates to the size of the dictionary (look-up table). There may be short and long items, and hence it can be built to support 32-bit, 16-bit, or 8-bit items. In that case the dictionary has to be ordered so that small numerical values refer to 32-bit items, middle values to 16-bit items and large values to 8-bit items to avoid alignment problems. Then the look-up code looks like this:
static uint32_t dictionary_lookup(uint8_t octet)
{
    if (octet < NUMBER_OF_32_BIT_ITEMS)
        return dictionary32[octet];
    if (octet < NUMBER_OF_32_BIT_ITEMS + NUMBER_OF_16_BIT_ITEMS)
        return dictionary16[octet - NUMBER_OF_32_BIT_ITEMS];
    return dictionary8[octet - NUMBER_OF_16_BIT_ITEMS - NUMBER_OF_32_BIT_ITEMS];
}
Of course, if every font has its own dictionary, the constants will become variables looked up from the font information. Any half-decent compiler will inline that function, as it is called from only one place.
If the number of quantization levels is reduced, it can be handled, as well. The easiest case is with 4-bit gray levels (1..14). This requires one 8-bit state variable to hold the gray levels. Then the gray level branch will become:
// new state value
static uint8_t gray_value;
...
// new variable within the next_pixel() function
uint8_t return_value;
...
// there is no old gray value available?
if (gray_value == 0)
gray_value = *next_octet++;
// extract the low nibble
return_value = gray_value & 0x0f;
// shift the high nibble into low nibble
gray_value >>= 4;
return return_value;
This actually allows using 15 intermediate gray levels (a total of 17 levels), which maps very nicely onto a linear 0..255 scale.
Three- or five-bit data is easiest to pack into a 16-bit halfword with the MSB always set to one. Then the same trick as with the ternary data can be used (shift until you get a 1).
It should be noted that the compression ratio starts to deteriorate at some point. The amount of compression with the ternary data does not depend on the number of gray levels. The gray level data is uncompressed, and the number of octets scales (almost) linearly with the number of bits. For a typical font the gray level data at 8 bits is 1/2 .. 2/3 of the total, but this is highly dependent on the typeface and size.
So, reduction from 8 to 4 bits (which is visually quite imperceptible in most cases) reduces the compressed size typically by 1/4..1/3, whereas the further reduction offered by going down to three bits is significantly less. Two-bit data does not make sense with this compression algorithm.
How to build the dictionary?
If the decompression algorithm is very straightforward and fast, the real challenges are in the dictionary building. It is easy to prove that there is such a thing as an optimal dictionary (the dictionary giving the least number of compressed octets for a given font), but wiser people than me seem to have proven that the problem of finding such a dictionary is NP-complete.
With my arguably rather lacking theoretical knowledge of the field, I thought there would be great tools offering reasonably good approximations. There might be such tools, but I could not find any, so I rolled my own mickeymouse version. EDIT: the earlier algorithm was rather goofy; a simpler and more effective one was found:
start with a static dictionary of '0', 'g', '1' (where 'g' signifies an intermediate value)
split the ternary data for each glyph into a list of trits
find the most common consecutive combination of items (it will most probably be '0', '0' at the first iteration)
replace all occurrences of the combination with a new item that stands for it, and add that item to the dictionary (e.g., the data '0', '1', '0', '0', 'g' will become '0', '1', '00', 'g' if '0', '0' is replaced by the new item '00')
remove any unused items in the dictionary (they may occur at least in theory)
repeat steps 3-5 until the dictionary is full (i.e. at least 253 rounds)
This is still a very simplistic approach and it probably gives a very sub-optimal result. Its only merit is that it works.
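For illustration, the greedy builder can be sketched roughly as follows (essentially byte-pair encoding over the ternary glyph data; the data layout and the limits below are assumptions for the sketch, not necessarily what was actually used):

#include <string.h>

#define MAX_SYMBOLS 256          /* dictionary slots, including the 3 seed items */

typedef struct {
    int *sym;                    /* symbol IDs: 0, 1, 2 = '0', '1', 'g' initially */
    int len;
} Sequence;

/* Replace every non-overlapping occurrence of the pair (a, b) in s with
   the new symbol ID 'merged'; returns the number of replacements made. */
static int merge_pair(Sequence *s, int a, int b, int merged)
{
    int out = 0, i = 0, hits = 0;
    while (i < s->len) {
        if (i + 1 < s->len && s->sym[i] == a && s->sym[i + 1] == b) {
            s->sym[out++] = merged;
            i += 2;
            hits++;
        } else {
            s->sym[out++] = s->sym[i++];
        }
    }
    s->len = out;
    return hits;
}

/* Grow the dictionary until all MAX_SYMBOLS slots are used or no pair occurs
   more than once.  Pair counts live in a dense 256x256 table, which is fine
   on the compression side where memory is plentiful. */
void build_dictionary(Sequence *glyphs, int n_glyphs)
{
    static int count[MAX_SYMBOLS * MAX_SYMBOLS];
    int n_symbols = 3;           /* seed items '0', '1', 'g' */

    while (n_symbols < MAX_SYMBOLS) {
        memset(count, 0, sizeof count);
        for (int g = 0; g < n_glyphs; g++)
            for (int i = 0; i + 1 < glyphs[g].len; i++)
                count[glyphs[g].sym[i] * MAX_SYMBOLS + glyphs[g].sym[i + 1]]++;

        int best = -1, best_count = 1;
        for (int p = 0; p < MAX_SYMBOLS * MAX_SYMBOLS; p++)
            if (count[p] > best_count) { best_count = count[p]; best = p; }
        if (best < 0)
            break;               /* nothing worth merging any more */

        int a = best / MAX_SYMBOLS, b = best % MAX_SYMBOLS;
        for (int g = 0; g < n_glyphs; g++)
            merge_pair(&glyphs[g], a, b, n_symbols);
        /* A real tool would also record that the new symbol expands to the
           concatenation of a and b, and prune dictionary items that end up
           unused after later merges. */
        n_symbols++;
    }
}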
How well does it work?
One answer is: well enough. To elaborate on that a bit, here are some numbers. This is a font with 864 glyphs, a typical glyph size of 14x11 pixels, and 8 bits per pixel.
raw uncompressed size: 127101
number of intermediate values: 46697
Shannon entropies (octet-by-octet):
total: 528914 bits = 66115 octets
ternary data: 176405 bits = 22051 octets
intermediate values: 352509 bits = 44064 octets
simply compressed ternary data (0=0, 10=1, 11=intermediate) (127101 trits): 207505 bits = 25939 octets
dictionary compressed ternary data: 18492 octets
entropy: 136778 bits = 17097 octets
dictionary size: 647 octets
full compressed data: 647 + 18492 + 46697 = 65836 octets
compression: 48.2 %
The comparison with octet-by-octet entropy is quite revealing. The intermediate value data has high entropy, whereas the ternary data can be compressed. This also reflects the high number of 0 and 255 values in the raw data (as compared to any single intermediate value).
We do not do anything to compress the intermediate values, as there do not seem to be any meaningful patterns. However, we beat the octet-by-octet entropy by a clear margin with the ternary data, and even the total amount of data is below the entropy limit. So, we could do worse.
Reducing the number of quantization levels to 17 would reduce the data size to approximately 42920 octets (compression over 66 %). The octet-by-octet entropy is then 41717 octets, so the algorithm falls slightly short of the entropy limit, as expected.
In practice, smaller font sizes are difficult to compress. This should be no surprise, as a larger fraction of the information is in the gray-scale data. Very big font sizes compress efficiently with this algorithm, but for those, run-length compression is a much better candidate.
What would be better?
If I knew, I would use it! But I can still speculate.
Jubatian suggests there would be a lot of repetition in a font. This must be true with the diacritics, as aàäáâå have a lot in common in almost all fonts. However, it does not seem to be true with letters such as p and b in most fonts. While the basic shape is close, it is not enough. (Careful pixel-by-pixel typeface design is then another story.)
Unfortunately, this inevitable repetition is not very easy to exploit in smaller font sizes. I tried creating a dictionary of all possible scan lines and then only referencing those. Unfortunately, the number of different scan lines is high, so the overhead added by the references outweighs the benefits. The situation changes somewhat if the scan lines themselves can be compressed, but there the small number of octets per scan line makes efficient compression difficult. This problem is, of course, dependent on the font size.
My intuition tells me that this would still be the right way to go, if both longer and shorter runs than full scan lines are used. Combined with 4-bit pixels, this would probably give very good results, if only there were a way to create that optimal dictionary.
One hint in this direction is that an LZMA2-compressed file (with xz at the highest compression) of the complete font data (127101 octets) is only 36720 octets. Of course, this format fulfils none of the other requirements (fast to decompress, can be decompressed glyph-by-glyph, low RAM requirements), but it still shows there is more redundancy in the data than what my cheap algorithm has been able to exploit.
Dictionary coding is typically combined with Huffman or arithmetic coding after the dictionary step. We cannot do it here, but if we could, it would save another 4000 octets.
You can consider using something already developed for a scenario similar to yours:
https://github.com/atomicobject/heatshrink
https://spin.atomicobject.com/2013/03/14/heatshrink-embedded-data-compression/
You could try lossy compression using a sparse representation with custom dictionary.
The output of each glyph is a superposition of 1..N blocks from the dictionary (see the decode sketch after this list);
most CPU time is spent in preprocessing
predetermined decoding time (max, average, or constant N additions per pixel)
controllable compressed size (dictionary size + xyn codes per glyph)
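A rough decode sketch of that idea (the 4x4 block size, the (x, y, n) code layout, and the saturating add are all assumptions for illustration):

#include <stdint.h>
#include <string.h>

#define BLOCK_W 4
#define BLOCK_H 4

typedef struct { uint8_t x, y, n; } Code;

/* Build one glyph by adding dictionary block n at offset (x, y), saturating at 255. */
void decode_glyph(uint8_t *glyph, int glyph_w, int glyph_h,
                  const Code *codes, int n_codes,
                  const uint8_t blocks[][BLOCK_H][BLOCK_W])
{
    memset(glyph, 0, (size_t)glyph_w * glyph_h);
    for (int c = 0; c < n_codes; c++)
        for (int dy = 0; dy < BLOCK_H; dy++)
            for (int dx = 0; dx < BLOCK_W; dx++) {
                int x = codes[c].x + dx, y = codes[c].y + dy;
                if (x >= glyph_w || y >= glyph_h)
                    continue;
                int v = glyph[y * glyph_w + x] + blocks[codes[c].n][dy][dx];
                glyph[y * glyph_w + x] = (uint8_t)(v > 255 ? 255 : v);
            }
}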
It seems that the simplest lossy method would be to reduce the number of bits per pixel. With glyphs of that size, 16 levels are likely to be sufficient. That would halve the data immediately; then you might apply your existing algorithm to the values 0, 16 or "something else" to perhaps halve it again.
I would go for Clifford's answer, that is, converting the font to 4 bits per pixel first, which is sufficient for this task.
Then, since this is a font, you have lots of row repetitions, that is, rows defining one character that match those of another character. Take for example the letters 'p' and 'b': the middle parts of these letters should be the same (and you will have even more matches if the target language uses loads of diacritics). Your encoder could then first collect all distinct rows of the font, store these, and then form each character image as a list of pointers into the rows.
The efficiency depends on the font, of course; depending on the source, you might need some preprocessing to get it to compress better with this method.
If you want more, you might rather choose to go for 3 bits per pixel or even 2 bits per pixel, depending on your goals (and some willingness to hand-tune the font images); these might still be satisfactory.
Overall, this method works very well for real-time display (you only need to follow a pointer to get the row data).
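A minimal sketch of the row-sharing idea (the fixed row width, the table capacity, and the names are simplifying assumptions):

#include <stdint.h>
#include <string.h>

#define ROW_BYTES   8        /* one glyph row, e.g. 16 pixels at 4 bpp */
#define MAX_ROWS 4096        /* capacity of the shared row table       */

static uint8_t  row_table[MAX_ROWS][ROW_BYTES];
static uint16_t row_count;

/* Return the index of 'row' in the shared table, adding it if it is new. */
static uint16_t intern_row(const uint8_t row[ROW_BYTES])
{
    for (uint16_t i = 0; i < row_count; i++)
        if (memcmp(row_table[i], row, ROW_BYTES) == 0)
            return i;
    if (row_count == MAX_ROWS)   /* table full; a real tool must handle this */
        return 0;
    memcpy(row_table[row_count], row, ROW_BYTES);
    return row_count++;
}

/* Drawing a glyph then just walks its list of row indices. */
static void draw_glyph_rows(const uint16_t *row_idx, int height,
                            void (*blit_row)(const uint8_t *row, int y))
{
    for (int y = 0; y < height; y++)
        blit_row(row_table[row_idx[y]], y);
}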

Lossless Compression of Random Data

tl;dr
I recently started listening to a security podcast, and heard the following sentence (paraphrasing)
One of the good hallmarks of a cryptographically strong random number is its lack of compressibility
Which immediately got me thinking: can random data be losslessly compressed? I started reading, and found this Wikipedia article. A quoted block is below:
In particular, files of random data cannot be consistently compressed by any conceivable lossless data compression algorithm: indeed, this result is used to define the concept of randomness in algorithmic complexity theory.
I understand the pigeonhole principle, so I'm assuming I'm way wrong here somewhere, but what am I missing?
IDEA:
Assume you have an asymmetric variable-length encryption method by which you could convert any N-bit number into either an (N-16)-bit number or an (N+16)-bit number. Is this possible?
If we had an asymmetric algorithm that could make the data either, say, 16 bits bigger or 16 bits smaller, then I think I can come up with an algorithm for reliably producing lossless compression.
Lossless Compression Algorithm for Arbitrary Data
Break the initial data into chunks of a given size. Then use a "key" and attempt to compress each chunk as follows.
function compress(data)
    compressedData = []
    chunks = data.splitBy(chunkSize);
    foreach chunk in chunks
        encryptedChunk = encrypt(chunk, key)
        if (encryptedChunk.Length <= chunk.Length - 16) // arbitrary amount
            compressedData.append(0) // 1 bit, not an integer
            compressedData.append(encryptedChunk)
        else
            compressedData.append(1) // 1 bit, not an integer
            compressedData.append(chunk)
    end foreach
    return compressedData;
end function
And for de-compression, if you know the chunk size, then for each chunk that begins with a 0, reverse the encryption and append the result to the ongoing array; if the chunk begins with a 1, simply append the data as-is. If the encryption method produces the 16-bit-smaller value even 1/16 as often as the 16-bit-larger value, then this will work, right? Each chunk is either 1 bit bigger or 15 bits smaller.
One other consideration is that the "key" used by the compression algorithm can be either fixed or perhaps appended to the beginning of the compressed data. Same consideration for the chunk size.
There are 2^(N-16) possible (N-16)-bit sequences, and 2^N possible N-bit sequences. Consequently, no more than one in every 2^16 N-bit sequences can be losslessly compressed to N-16 bits. So it will happen a lot less frequently than 1/16 of the time. It will happen at most 1/65536 of the time.
As your reasoning indicates, the remaining N-bit sequences could be expanded to N+1 bits; there is no need to waste an additional 15 bits encoding them. All the same, the probability of a random N-bit sequence being in the set of (N−16)-bit compressible sequences is so small that the average compression (or expected compression) will continue to be 1.0 (at best).
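To make that concrete with a quick back-of-the-envelope calculation (assuming, optimistically, that a full 2^(N-16) of the inputs really are compressible): a fraction 2^(-16) of random chunks store 1 + (N - 16) = N - 15 bits, and the remaining fraction 1 - 2^(-16) store 1 + N = N + 1 bits, so the expected stored length per chunk is at least (1 - 2^(-16)) * (N + 1) + 2^(-16) * (N - 15) = N + 1 - 16/65536, roughly N + 0.99976 bits. On average the scheme expands random input by essentially the full flag bit per chunk.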

Practical Compression of Random Data

So yesterday I asked a question on compression of a sequence of integers (link) and most comments had a similar point: if the order is random (or worse, the data is completely random) then one has to settle for log2(k) bits for a value k. I've also read similar replies in other questions on this site. Now, I hope this isn't a silly question: if I take that sequence and serialize it to a file and then I run gzip on this file then I do achieve compression (and depending on the time I allow gzip to run I might get high compression). Could somebody explain this fact?
Thanks in advance.
My guess is that you're achieving compression on your random file because you're not using an optimal serialization technique, but without more details it's impossible to answer your question. Is the compressed file with n numbers in the range [0, k) less than n*log2(k) bits? (That is, n*log256(k) bytes). If so, does gzip manage to do that for all the random files you generate, or just occasionally?
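One quick sanity check along these lines (a sketch; the n and k values below are placeholders for your actual data) is to compute the information-theoretic floor and compare it with the size gzip actually achieves:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1000000.0;     /* how many numbers in the file        */
    double k = 1000.0;        /* each drawn uniformly from [0, k)    */
    /* No lossless scheme can average fewer than n*log2(k) bits, so a
       gzipped file consistently beating this floor means the data was
       not really uniform or was not serialized compactly to begin with. */
    printf("information-theoretic floor: %.0f bytes\n", n * log2(k) / 8.0);
    return 0;
}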
Let me note one thing: suppose you say to me, "I've generated a file of random octets by using a uniform_int_distribution(0, 255) with the mt19937 prng [1]. What's the optimal compression of my file?" Now, my answer could reasonably be: "probably about 80 bits". All I need to reproduce your file is
the value you used to seed the prng, quite possibly a 32-bit integer [2]; and
the length of the file, which probably fits in 48 bits.
And if I can reproduce the file given 80 bits of data, that's the optimal compression. Unfortunately, that's not a general purpose compression strategy. It's highly unlikely that gzip will be able to figure out that you used a particular prng to generate the file, much less that it will be able to reverse-engineer the seed (although these things are, at least in theory, achievable; the Mersenne twister is not a cryptographically secure prng.)
For another example, it's generally recommended that text be compressed before being encrypted; the result will be quite a bit shorter than compressing after encryption. But the fact is that encryption adds very little entropy; at most, it adds the number of bits in the encryption key. Nonetheless, the resulting output is difficult to distinguish from random data, and gzip will struggle to compress it (although it often manages to squeeze a few bits out).
Note 1: that's all C++11/boost lingo. mt19937 is an instance of the Mersenne twister pseudo-random number generator (prng), which has a period of 2^19937 - 1.
Note 2: The state of the Mersenne twister is actually 624 words (19968 bits), but most programs use somewhat fewer bits to seed it. Perhaps you used a 64-bit integer instead of a 32-bit integer, but it doesn't change the answer by much.
if I take that sequence and serialize it to a file and then I run gzip
on this file then I do achieve compression
What is "it"? If you take random bytes (each uniformly distributed in 0..255) and feed them to gzip or any compressor, you may on very rare occasions get a small amount of compression, but most of the time you will get a small amount of expansion.
If the data is truly random, then on average no compression algorithm can compress it. But if the data has some predictable patterns (e.g., if the probability of a symbol depends on the previous k symbols in the data), many (prediction-based) compression algorithms will succeed.

Does Kernel::srand have a maximum input value?

I'm trying to seed a random number generator with the output of a hash. Currently I'm computing a SHA-1 hash, converting it to a giant integer, and feeding it to srand to initialize the RNG. This is so that I can get a predictable set of random numbers for an infinite set of Cartesian coordinates (I'm hashing the coordinates).
I'm wondering whether Kernel::srand actually has a maximum value that it'll take, after which it truncates it in some way. The docs don't really make this obvious - they just say "a number".
I'll try to figure it out myself, but I'm assuming somebody out there has run into this already.
Knowing what programmers are like, it probably just calls libc's srand(). Either way, it's probably limited to 2^32-1, 2^31-1, 2^16-1, or 2^15-1.
There's also a danger that the value is clipped when cast from a biginteger to a C int/long, instead of only taking the low-order bits.
An easy test is to seed with 1 and take the first output. Then, seed with 2^i + 1 for i in [1..64] or so, take the first output of each, and compare. If you get a match for some i = n and all greater i, then it's probably doing arithmetic modulo 2^n.
Note that the random number generator is almost certainly limited to 32 or 48 bits of entropy anyway, so there's little point seeding it with a huge value, and an attacker can reasonably easily predict future outputs given past outputs (and an "attacker" could simply be a player on a public nethack server).
EDIT: So I was wrong.
According to the docs for Kernel::rand(),
Ruby currently uses a modified Mersenne Twister with a period of 2**19937-1.
This means it's not just a call to libc's rand(). The Mersenne Twister is statistically superior (but not cryptographically secure). But anyway.
Testing using Kernel::srand(0); Kernel::sprintf("%x",Kernel::rand(2**32)) for various output sizes (2**16, 2**32, 2**36, 2**60, 2**64, 2**32+1, 2**35, 2**34+1), a few things are evident:
It figures out how many bits it needs (number of bits in max-1).
It generates output in groups of 32 bits, most-significant-bits-first, and drops the top bits (i.e. 0x[r0][r1][r2][r3][r4] with the top bits masked off).
If it's not less than max, it does some sort of retry. It's not obvious what this is from the output.
If it is less than max, it outputs the result.
I'm not sure why 2**32+1 and 2**64+1 are special (they produce the same output from Kernel::rand(2**1024), so probably have the exact same state); I haven't found another collision.
The good news is that it doesn't simply clip to some arbitrary maximum (i.e. passing in huge numbers isn't equivalent to passing in 2**31-1), which is the most obvious thing that can go wrong. Kernel::srand() also returns the previous seed, which appears to be 128-bit, so it seems likely to be safe to pass in something large.
EDIT 2: Of course, there's no guarantee that the output will be reproducible between different Ruby versions (the docs merely say what it "currently uses"; apparently this was initially committed in 2002). Java has several portable deterministic PRNGs (SecureRandom.getInstance("SHA1PRNG","SUN"), albeit slow); I'm not aware of something similar for Ruby.
