Why is $RANDOM not very random? - bash

I've been using $RANDOM to generate a random number between 1-15, in order to generate a little jitter between two systems. For example:
sleep $(( RANDOM %= 15 ))
If I run echo $(( RANDOM %= 15 )) every few minutes, it seems the random numbers are fairly random. But if I start running a script with this call every minute via cron, or even just echo the random number every few seconds, the randomness is gone—on my Mac, I end up with not-so-random values like 11 and 6, alternating, or 8, 4, and 2, in sequence. Not very random.
On one of my Linux servers (CentOS 6.5 x64), I added the following bash script, which, after the first couple of loops, just output 13 over and over again:
#!/bin/bash
for ((n = 0; n < 100; n++))
do
    echo $(( RANDOM %= 15 ))
done
My questions:
Why is this happening? I know $RANDOM is unsuitable for encryption, but why is it so bad at generating random numbers in general?
Is there any other easy way to get a more random number (even if needed in rapid succession) via bash script?

Assigning to RANDOM Sets the Seed Value
RANDOM is a special variable that provides a pseudo-random number. By using a modulo operator, you are vastly restricting the possible values, and by assigning to RANDOM you are changing the seed value to some member of your restricted set, which eventually seems to settle on 13.
The following gives me a reasonable distribution:
for i in {1..100}; do echo $(( RANDOM % 15 )); done
By using %= instead of just %, you are setting the seed value. Don't do that.

You can generate a high-quality random integer between 1 and 15 using the following command:
echo "$(od -An -N4 -tu4 /dev/urandom) % 15 + 1" | bc
or even better
echo "$(od -An -N4 -tu4 /dev/random) % 15 + 1" | bc
Moreover, you should not assign a new seed using %=, as it destroys the entropy gathered.
Lastly, 100 samples are really not enough to assess the quality of a random number generator; you should generate at least one million values.
A few handy techniques for testing the quality of a random number generator (a small Python sketch of some of these checks follows below):
Optimum compression -- 0 %
Data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through use of specific encoding schemes. If the same data structure is repeated multiple times, a short binary representation can stand in for the long structure and thus reduce the size of the compressed file. If our random data is truly random, then we should NOT see any compression at all.
Chi square distribution -- between 10% and 90%
The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated. We interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 90% or less than 10%, the sequence is almost certainly not random.
Arithmetic mean -- 127.5 for bytes (about 15/2 in your case)
This is simply the result of summing all the bytes in the file and dividing by the file length. If the data is close to random, this should be about 127.5. If the mean departs from this value, the values are consistently high or low.
Monte Carlo value for Pi -- 3.14159265
Each successive sequence of six bytes is used as 24 bit X and Y co-ordinates within a square. If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, the six-byte sequence is considered a hit. The percentage of hits can be used to calculate the value of Pi. For very large streams the value will approach the correct value of Pi if the sequence is close to random.
Serial correlation coefficient -- 0.0
This quantity measures the extent to which each byte in the file depends upon the previous byte. For random sequences, this value (this can be positive or negative) will, of course, be close to zero.
source : https://calomel.org/
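As a rough illustration (not the calomel.org tooling itself), here is a small Python sketch that computes three of the statistics listed above -- the arithmetic mean, the Monte Carlo value for pi, and the serial correlation coefficient -- over a stream of bytes from os.urandom. The ent utility reports the same kind of figures.

import math
import os

N = 1_000_000                     # at least a million samples, as suggested above
data = os.urandom(N)              # high-quality random bytes from the OS

# Arithmetic mean: should be close to 127.5 for uniform bytes.
mean = sum(data) / N

# Monte Carlo value for pi: consecutive 3-byte groups as 24-bit X and Y coordinates.
radius = (1 << 24) - 1
inside = 0
pairs = 0
for i in range(0, N - 6, 6):
    x = int.from_bytes(data[i:i + 3], "big")
    y = int.from_bytes(data[i + 3:i + 6], "big")
    if x * x + y * y <= radius * radius:
        inside += 1
    pairs += 1
pi_estimate = 4.0 * inside / pairs

# Serial correlation coefficient: dependence of each byte on the previous one.
xs, ys = data[:-1], data[1:]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
var_x = sum((a - mx) ** 2 for a in xs)
var_y = sum((b - my) ** 2 for b in ys)
serial_corr = cov / math.sqrt(var_x * var_y)

print(f"mean={mean:.2f}  pi~{pi_estimate:.5f}  serial correlation={serial_corr:.5f}")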

Related

Generating random integer and real numbers in a given range

According to the man page of getNext in the PCGRandom module, we can generate random numbers in a given range, for example:
use Random;
var rng1 = new owned RandomStream( eltType= real, seed= 100 );
var rng2 = new owned RandomStream( eltType= int, seed= 100 );
for i in 1..5 do
  writeln( rng1.getNext( min= 3.0, max= 5.0 ) );
writeln();
for i in 1..5 do
  writeln( rng2.getNext( min= 20, max= 80 ) );
which gives (with chpl-1.20.0):
4.50371
4.85573
4.2246
4.84289
3.63607
36
57
79
39
57
Here, I noticed that the man page gives the following notes for both the integer and real-number cases:
For integers, this class uses a strategy for generating a value in a particular range that has not been subject to rigorous study and may have statistical problems.
For real numbers, this class generates a random value in [min, max] by computing a random value in [0,1] and scaling and shifting that value. Note that not all possible floating point values in the interval [min, max] can be constructed in this way.
(where I used italics for emphasis). For real numbers, is this related to the so-called "density of floating-point numbers" (e.g., as asked on this page)? Also, for integers, is there some case where we need to be careful even for "typical" use?
(here, "typical" means, e.g., a generation of 10**8 random integers distributed approximately flat in a given range.)
FYI, my "use case" is not something like rigorous quality tests for random numbers, but just typical Monte Carlo calculations (e.g., selecting random sites on a cubic lattice).
The notes in the manual page are indicating a difference from the other PCG random number methods that have been studied (by the author of the PCG algorithm at the very least).
The issue with floating-point numbers is indeed related to floating-point number density. See http://www.pcg-random.org/using-pcg-c-basic.html#generating-doubles from the PCG author. It is a potential problem even when generating random numbers in [0.0, 1.0]. This paragraph from the documentation describes the issue:
When generating a real, imaginary, or complex number, this
implementation uses the strategy of generating a 64-bit unsigned
integer and then multiplying it by 2.0**-64 in order to convert it to
a floating point number. While this does construct a uniform
distribution on rounded floating point values, it leaves out many
possible real values (for example, 2**-128). We believe that this
strategy has reasonable statistical properties. One side effect of
this strategy is that the real number 1.0 can be generated because of
rounding. The real number 0.0 can be generated because PCG can produce
the value 0 as a random integer.
Note that a 64-bit real can store numbers as small as 2.0**-1024 but it is quite impossible to get such a number by dividing a positive integer by 2**64. (Here and in the above I am using ** as the exponentiation operator, as that is what it does in Chapel syntax). I recommend reading up on IEEE floating point formats (e.g. https://en.wikipedia.org/wiki/IEEE_754 or https://en.wikipedia.org/wiki/Double-precision_floating-point_format ) for background information in this area. You might care about this if you were using an RNG to generate test inputs to an algorithm operating on real(64) values. In that event you might wish for even the very small values to be generated. Note though that constructing an RNG that can generate all real(64) values in a non-uniform manner is not so hard (e.g. just by copying the bits from a uint into a real).
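To make the quoted strategy concrete, here is a small Python (not Chapel) sketch of the two conversions discussed: multiplying a 64-bit unsigned integer by 2.0**-64, which can never produce values such as 2**-128, versus copying the raw bits into a real, which covers every double but is wildly non-uniform.

import random
import struct

u = random.getrandbits(64)          # stand-in for the generator's 64-bit output
x = u * 2.0 ** -64                  # uniform over rounded values in [0.0, 1.0]

# The smallest nonzero value this scheme can produce is 1 * 2**-64; values such
# as 2**-128 are representable doubles but can never come out of it.
smallest = 1 * 2.0 ** -64

# By contrast, reinterpreting the raw bits as a float reaches every possible
# double, but the resulting distribution is far from uniform.
bit_copied = struct.unpack("<d", struct.pack("<Q", u))[0]

print(x, smallest, bit_copied)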
Regarding the other part of your question:
I did some basic statistical testing with the generation of random integers in a particular range with TestU01 and I'd be confident in its use with Monte Carlo calculations. However I am not an expert in this area and as a result I put that warning in the documentation. The below information from the documentation describes the testing that I did:
We have tested this implementation with TestU01 (available at
http://simul.iro.umontreal.ca/testu01/tu01.html ). We measured our
implementation with TestU01 1.2.3 and the Crush suite, which consists
of 144 statistical tests. The results were:
no failures for generating uniform reals
1 failure for generating 32-bit values (which is also true for the reference version of PCG with the same configuration)
0 failures for generating 64-bit values (which we provided to TestU01 as 2 different 32-bit values, since it only accepts 32 bits at a time)
0 failures for generating bounded integers (which we provided to TestU01 by requesting values in [0, 2**31+2**30+1) until we had two values < 2**31, removing the top 0 bit, and then combining the top 16 bits into the value provided to TestU01).

Are pseudo random number generators less likely to repeat?

So they say if you flip a coin 50 times and get heads all 50 times, you're still 50/50 the next flip and 1/4 for the next two. Do you think/know if this same principle applies to computer pseudo-random number generators? I theorize they're less likely to repeat the same number for long stretches.
I ran this a few times and the results are believable, but I'm wondering how many times I'd have to run it to get an anomalous output.
import random

def genString(iterations):
    # build a string of `iterations` random decimal digits
    mystring = ''
    for _ in range(iterations):
        mystring += str(random.randint(0, 9))
    return mystring

def repeatMax(mystring):
    # longest stretch of repeats; a run of n identical digits scores n - 1
    tempchar = ''
    count = 0
    max_run = 0
    for char in mystring:
        if char == tempchar:
            count += 1
            if count > max_run:
                max_run = count
        else:
            count = 0
        tempchar = char
    return max_run

for _ in range(10):
    # the original call omitted the length argument; 1,000,000 digits is an
    # assumption that reproduces runs of roughly the reported size
    stringer = genString(1000000)
    print(repeatMax(stringer))
I got all 7's and a couple 6's. If I run this 1000 times, will it approximate a normal distribution or should I expect it to stay relatively predictable? I'm trying to understand the predictability of pseudo random number generation.
Failure to produce specific patterns is a typical weakness of PRNGs, but the probability of hitting a substantial run of repeated digits at random is so small it's hard to demonstrate that weakness.
It's perfectly reasonable for a PRNG to use only a 32-bit state, which (traditionally) means producing a sequence of four billion numbers and then repeating from the start again. In that case your sequence of 50 coin-flips coming out the same is probably never going to happen (four billion tries at something that has a one in a quadrillion chance is unlikely to succeed); but if it does, then it's going to appear way too often.
Superficially you're looking for k-dimensional equidistribution as a test for whether or not you can expect to find a prescribed pattern in the output without deeper analysis of the specific generator. If your generator claims at least 50-dimensional equidistribution then you're guaranteed to see the 50-heads state at least once.
However, if your generator emits 32-bit results but you only test whether each result maps to heads or tails, you have some chance at success even if the generator fails the k-dimension test, and that chance depends on the specifics of the generator and the mapping function.
If you adjust the implementation of your generator to return just one bit at a time, then you have an opportunity to try to squeeze 50 heads out of just 50 bits of state (or potentially as few as 18, but that generator would probably be faulty). Provided the generator visits all 2**50 possible states, one of those states will produce 50 heads in a row. You may get a few more heads when adjacent states start or end with more zeroes.
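To see how the maximum-run statistic behaves, here is a hedged Python sketch that repeats an experiment like the one in the question many times and tallies the longest run it finds. The trial length and trial count are arbitrary choices, not values from the question.

import random
from collections import Counter

def longest_run(digits):
    # length of the longest run of identical digits (counts the full run)
    best = run = 1
    for prev, cur in zip(digits, digits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

TRIALS, LENGTH = 200, 100_000
counts = Counter(
    longest_run([random.randint(0, 9) for _ in range(LENGTH)])
    for _ in range(TRIALS)
)
for run, n in sorted(counts.items()):
    print(f"longest run {run}: {n} trials")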

Generate random number in interval in PostScript

I am struggling to find a way to generate a random number within a given interval in PostScript.
Basically PostScript has three functions to help you generate (pseudo-)random numbers. Those are rand, srand and rrand.
The latter two are for passing a seed to the number generator so that specific results can be reproduced. At least that's what I understood they are for. Anyway, they don't seem suitable for my case.
So rand seems to be the only function I can use to generate a random number, but...
rand returns a random integer in the range 0 to 2^31 − 1 (from the PostScript Language Reference, page 637 (651 in the PDF))
This is far beyond the interval I'm looking for. I am more interested in values up to the small thousands, maybe 10,000 or something like that, and small float values, up to 100, all with a lower limit of 0.
I thought I could just narrow my numbers down by simple divisions and extracting roots, but that tends to give me unusably small values in quite a lot of cases. I am wondering if there are robust ways to either shrink a large number down to what I need or, preferably, only generate numbers in the desired interval.
Besides: while-loops are not possible in PostScript, otherwise I'd have written a function that generates numbers until they fit in my interval.
Any hints on what to look for breaking numbers down into my interval?
mod is often good enough and it's fast. But you may get a more uniform distribution by using floating-point ops.
rand 16#7fffffff div 100 mul cvi
This is because mod discards the upper bits of the input. And the PRNG is usually trying to randomize over all the bits. By scaling down then up, they all contribute something in the way of rounding effects.
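For comparison, here is a small Python sketch (not PostScript) of the two reductions being discussed: mod, which keeps only the low bits of rand's output, and the divide-then-scale approach of the one-liner above, which is driven by the high bits.

import random

RAND_MAX = 0x7fffffff                  # rand's range in PostScript: 0 .. 2^31 - 1

r = random.randint(0, RAND_MAX)        # stand-in for the PostScript rand output

by_mod = r % 100                       # equivalent to: rand 100 mod
by_scaling = int(r / RAND_MAX * 100)   # equivalent to: rand 16#7fffffff div 100 mul cvi
# note: r == RAND_MAX yields 100 here, just as the PostScript one-liner can

print(r, by_mod, by_scaling)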
Just use the modulo operator to get it down to the size you want:
GS>rand 100 mod stack
7

Integer Time Series compression

Is there a well-known, documented algorithm for (positive) integer stream / time series compression that would:
have variable bit length
work on deltas
My input data is a stream of temperature measurements from a sensor (more specifically a TMP36 read out by an Arduino). It is physically impossible for big jumps to occur between measurements (time constant of the sensor). I therefore think my compression algorithm should work on deltas (set a base at stream start and then store only the difference to the next value). Because the gaps are limited, I want variable bit lengths: differences lower than 4 fit in 2 bits, lower than 8 in 3 bits, and so on... But there is a dilemma between signalling in the stream the bit size of the next delta, and just working on, say, 3-bit deltas and signalling the size only when it is bigger.
Any idea what algorithm solves that one?
Use variable-length integers to code the deltas between values, and feed that to zlib to do the compression.
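A minimal Python sketch of that suggestion, assuming the zigzag sign mapping described in the next answer and LEB128-style variable-length integers, might look like this:

import zlib

def zigzag(v):
    # fold the sign away: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    return 2 * v if v >= 0 else 2 * (-v) - 1

def varint(u):
    # LEB128-style: 7 payload bits per byte, high bit set while more bytes follow
    out = bytearray()
    while True:
        byte = u & 0x7f
        u >>= 7
        if u:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte of this value
            return bytes(out)

def compress(samples):
    buf = bytearray()
    prev = 0
    for s in samples:
        buf += varint(zigzag(s - prev))   # store the delta to the previous sample
        prev = s
    return zlib.compress(bytes(buf))

readings = [201, 202, 202, 203, 203, 204, 203, 203, 202]   # made-up raw sensor values
print(len(compress(readings)), "bytes")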
First of all, there are different formats in existence. One thing I would do first is get rid of the sign; a sign is usually a distraction when thinking about compression. I usually use the zigzag scheme where every positive value v becomes 2*v and every negative value becomes 2*(-v)-1. So 0 = 0, -1 = 1, 1 = 2, -2 = 3, 2 = 4, ...
Since with that scheme you have nothing like 0b11111111 = -1, the leading bits are gone. Now you can think about how to compress those symbols / numbers. One thing you can do is create a representative sample and use it to train a static Huffman code. This should be possible within your on-chip constraints. Another, simpler approach is to use Huffman codes for the bit lengths and write the raw bits to the stream. So 0 = bit length 0, -1 = bit length 1, 2, 3 = bit length 2, ... By describing these bit lengths with Huffman codes you get quite compact literals.
I usually use a mixture: I code the most frequent symbols / values as raw values, and encode the less frequent numbers as a bit length plus the bit pattern of the actual value. This way you stay compact and do not have to deal with excessive tables (there are only 64 possible bit lengths for 64-bit values).
There are also other schemes, such as the leading-bit one, where for example the first (highest) bit of every byte acts as a continuation flag: as long as the bit is set, another byte of the integer follows; when it is zero, that byte is the last one of the value.
I usually train a static Huffman code for such purposes. It's easy, and you can even turn the encoding and decoding into generated source code (simply create if/switch statements and write your tables as arrays in your code).
You can use integer compression methods with delta or delta-of-delta encoding, as used in TurboPFor Integer Compression. Gamma coding can also be used if the deltas have very small values.
The current state of the art for this problem is Quantile Compression. It compresses numerical sequences such as integers and typically achieves 35% higher compression ratio than other approaches. It has delta encoding as a built-in feature.
CLI example:
cargo run --release compress \
--csv my.csv \
--col-name my_col \
--level 6 \
--delta-order 1 \
out.qco
Rust API example:
let my_nums: Vec<i64> = ...
let compressor = Compressor::<i64>::from_config(CompressorConfig {
compression_level: 6,
delta_encoding_order: 1,
});
let bytes: Vec<u8> = compressor.simple_compress(&my_nums);
println!("compressed down to {} bytes", bytes.len());
It does this by describing each number with a Huffman code for a range (a [lower, upper] bound) followed by an exact offset into that range.
By strategically choosing the ranges based on your data, it comes close to the Shannon entropy of the data distribution.
Since your data comes from a temperature sensor, your data should be very smooth, and you may even consider delta orders higher than 1 (e.g. delta order 2 is "delta-of-deltas").

Data Compression : Arithmetic coding unclear

Can anyone please explain arithmetic encoding for data compression with implementation details? I have surfed through the internet and found Mark Nelson's post, but the implementation technique is still unclear to me after trying for many hours.
Mark Nelson's explanation of arithmetic coding can be found at
http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/
The main idea of arithmetic compression is its capability to code a probability using exactly the amount of data it requires.
This amount of data is known (proven by Shannon) and can be calculated simply with the following formula: -log2(p)
For example, if p=50%, then you need 1 bit.
And if p=25%, you need 2 bits.
That's simple enough for probabilities which are powers of 2 (and in this special case, Huffman coding could be enough). But what if the probability is 63%? Then you need -log2(0.63) = 0.67 bits. Sounds tricky...
This property is especially important if your probability is high. If you can predict something with a 95% accuracy, then you only need 0.074 bits to represent a good guess. Which means you are going to compress a lot.
Now, how to do that ?
Well, it's simpler than it sounds. You will divide your range depending on probabilities. For example, if you have a range of 100, 2 possible events, and a probability of 95% for the 1st one, then the first 95 values will say "Event 1", and the last 5 remaining values will say "Event 2".
OK, but on computers, we are accustomed to use powers of 2. For example, with 16 bits, you have a range of 65536 possible values. Just do the same : take the 1st 95% of the range (which is 62259) to say "Event 1", and the rest to say "Event 2". You obviously have a problem of "rounding" (precision), but as long as you have enough values to distribute, it does not matter too much. Furthermore, you are not constrained to 2 events, you could have a myriad of events. All that matters is that values are allocated depending on the probabilities of each event.
OK, but now I have 62259 possible values to say "Event 1", and 3277 to say "Event 2". Which one should I choose?
Well, any of them will do. Whether it is 1, 30, 5500 or 62256, it still means "Event 1".
In fact, deciding which value to select will not depend on the current guess, but on the next ones.
Suppose I get "Event 1". So now I have to choose any value between 0 and 62256. On the next guess, I have the same distribution (95% Event 1, 5% Event 2). I will simply allocate the distribution map with these probabilities, except that this time it is distributed over 62256 values. And we continue like this, reducing the range of values with each guess.
So in fact, we are defining "ranges", which narrow with each guess. At some point, however, there is a problem of accuracy, because very few values remain.
The idea, is to simply "inflate" the range again. For example, each time the range goes below 32768 (2^15), you output the highest bit, and multiply the rest by 2 (effectively shifting the values by one bit left). By continuously doing like this, you are outputting bits one by one, as they are being settled by the series of guesses.
Now the relation with compression becomes obvious: when the range narrows swiftly (e.g., on a 5% event), you output a lot of bits to get the range back above the limit. On the other hand, when the probability is very high, the range narrows very slowly; you can even have a lot of guesses before outputting your first bits. That's how it is possible to compress an event to "a fraction of a bit".
I've intentionally used the terms "probability", "guess", "events" to keep this article generic. But for data compression, you just replace them with the way you want to model your data. For example, the next event can be the next byte, in which case you have 256 of them.
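To make the mechanics concrete, here is a minimal Python sketch of an integer arithmetic encoder (encoder side only, no decoder). It uses the common low/high formulation with pending-bit handling, which is a close cousin of the "inflate the range and output the settled bit" step described above rather than an exact transcription of it.

def arithmetic_encode(symbols, freqs):
    """Encode `symbols` (a sequence) given `freqs`, a dict of symbol -> count."""
    total = sum(freqs.values())
    cum, c = {}, 0
    for s in sorted(freqs):               # cumulative lower bound of each symbol
        cum[s] = c
        c += freqs[s]

    PREC = 32
    TOP = (1 << PREC) - 1
    HALF, QUARTER = 1 << (PREC - 1), 1 << (PREC - 2)

    low, high, pending, bits = 0, TOP, 0, []

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)  # flush bits held back by underflow
        pending = 0

    for s in symbols:
        span = high - low + 1             # current range, narrowed per symbol
        high = low + span * (cum[s] + freqs[s]) // total - 1
        low = low + span * cum[s] // total
        while True:
            if high < HALF:               # top bit settled to 0
                emit(0)
            elif low >= HALF:             # top bit settled to 1
                emit(1)
                low -= HALF
                high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                pending += 1              # underflow: remember the bit for later
                low -= QUARTER
                high -= QUARTER
            else:
                break
            low, high = 2 * low, 2 * high + 1   # inflate the range again

    pending += 1                          # flush enough bits to pin the interval
    emit(0 if low < QUARTER else 1)
    return bits

# "Event 1" with 95% probability, "Event 2" with 5%: a long run of event 1
# costs far less than one bit per event.
print(arithmetic_encode("1" * 20 + "2", {"1": 95, "2": 5}))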
Maybe this script could be useful for building a better mental model of an arithmetic coder: gen_map.py. Originally it was created to facilitate debugging of an arithmetic coder library and to simplify the generation of unit tests for it. However, it creates nice ASCII visualizations that could also be useful for understanding arithmetic coding.
A small example. Imagine we have an alphabet of 3 symbols: 0, 1 and 2 with probabilities 1/10, 2/10 and 7/10 respectively, and we want to encode the sequence [1, 2]. The script will give the following output (ignore the -b N option for now):
$ ./gen_map.py -b 6 -m "1,2,7" -e "1,2"
000000111111|1111|111222222222222222222222222222222222222222222222
------011222|2222|222000011111111122222222222222222222222222222222
---------011|2222|222-------------00011111122222222222222222222222
------------|----|-------------------------00111122222222222222222
------------|----|-------------------------------01111222222222222
------------|----|------------------------------------011222222222
==================================================================
000000000000|0000|000000000000000011111111111111111111111111111111
000000000000|0000|111111111111111100000000000000001111111111111111
000000001111|1111|000000001111111100000000111111110000000011111111
000011110000|1111|000011110000111100001111000011110000111100001111
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
001100110011|0011|001100110011001100110011001100110011001100110011
010101010101|0101|010101010101010101010101010101010101010101010101
The first 6 lines (before the ==== line) represent the range from 0.0 to 1.0, recursively subdivided into intervals proportional to the symbol probabilities. Annotated first line:
[1/10][ 2/10 ][ 7/10 ]
000000111111|1111|111222222222222222222222222222222222222222222222
Then we subdivide each interval again:
[ 0.1][ 0.2 ][ 0.7 ]
000000111111|1111|111222222222222222222222222222222222222222222222
[ 0.7 ][.1][ 0.2 ][ 0.7 ]
------011222|2222|222000011111111122222222222222222222222222222222
[.1][ .2][ 0.7 ]
---------011|2222|222-------------00011111122222222222222222222222
Note that some intervals are not subdivided. That happens when there is not enough space to represent every subinterval within the given precision (which is specified by the -b option).
Each line corresponds to a symbol from the input (in our case, the sequence [1, 2]). By following the subintervals for each input symbol we get the final interval that we want to encode with a minimal number of bits. In our case it's the '2' subinterval on the second line:
[ This one ]
------011222|2222|222000011111111122222222222222222222222222222222
The following 7 lines (after ====) represent the same interval 0.0 to 1.0, but subdivided according to binary notation. Each line is a bit of output, and by choosing between 0 and 1 you choose the left or right half-subinterval. For example, the bits 01 correspond to the subinterval [0.25, 0.5) on the second line:
[ This one ]
000000000000|0000|111111111111111100000000000000001111111111111111
The idea of the arithmetic coder is to output bits (0 or 1) until the corresponding interval is entirely inside (or equal to) the interval determined by the input sequence. In our case it's 0011. The ~~~~ line shows where we have enough bits to unambiguously identify the interval we want.
Vertical lines formed by | symbol show the range of bit sequences (rows) that could be used to encode the input sequence.
First of all thanks for introducing me to the concept of arithmetic compression!
I can see that this method has the following steps:
Create the mapping: calculate the fraction of occurrences for each letter, which gives a range size for each symbol. Then order them and assign actual ranges from 0 to 1
Given a message, calculate its range (pretty straightforward IMHO)
Find the optimal code
The third part is a bit tricky. Use the following algorithm.
Let b be the optimal representation. Initialize it to empty string (''). Let x be the minimum value and y the maximum value.
1. Double x and y: x = 2*x, y = 2*y.
2. If both of them are greater than 1, append 1 to b, subtract 1 from both (x = x-1, y = y-1), and go to step 1.
3. If both of them are less than 1, append 0 to b and go to step 1.
4. If x < 1 but y > 1, append 1 to b and stop.
b essentially contains the fractional part of the number you are transmitting. Eg. If b=011, then the fraction corresponds to 0.011 in binary.
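A small Python sketch of that bit-finding loop, with the shift back into [0, 1) after appending a 1 made explicit:

def shortest_binary_fraction(x, y):
    # find the bits b of a binary fraction lying inside (x, y), with 0 <= x < y <= 1
    b = ""
    while True:
        x, y = 2 * x, 2 * y               # step 1: double both ends
        if y <= 1:                        # both below 1: the next bit is 0
            b += "0"
        elif x >= 1:                      # both above 1: the next bit is 1,
            b += "1"                      # then shift back into [0, 1)
            x, y = x - 1, y - 1
        else:                             # x < 1 < y: appending 1 lands inside
            return b + "1"

# 0.11 in binary is 0.75, which falls inside (0.6, 0.95)
print(shortest_binary_fraction(0.6, 0.95))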
What part of the implementation do you not understand?
