Computing entropy/disorder - algorithm

Given an ordered sequence of a few thousand 32-bit integers, I would like to know how measures of their disorder or entropy are calculated.
What I would like is to be able to calculate a single value of the entropy for each of two such sequences and be able to compare their entropy values to determine which is more (dis)ordered.
I am asking here, as I think I may not be the first with this problem and would like to know of prior work.
Thanks in advance.
UPDATE #1
I have just found this answer, which looks great but would give the same entropy even if the integers were sorted. It only measures the entropy of the individual ints in the list and disregards their (dis)order.

Entropy calculation generally:
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
Furthermore, you have to count the frequency of each of your integers, for example by sorting the list and iterating over it. Afterwards, you can use the formula.
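The frequency-based calculation from that answer can be sketched in a few lines of Python (my own sketch, not code from the answer):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits per symbol, computed from the
    empirical frequency of each distinct value."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Note: order doesn't matter, so a sorted copy scores the same --
# which is exactly the limitation of this measure for my problem.
print(shannon_entropy([1, 2, 3, 4]))  # 2.0: four equally likely symbols
print(shannon_entropy([7, 7, 7, 7]))  # 0.0: no uncertainty at all
```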

I think I'll have to code a Shannon entropy in 2D: arrange the list of 32-bit ints as a series of 8-bit bytes and compute the Shannon entropy on that; then, to capture how ordered they are, take the bytes eight at a time and form a new list of bytes, where each new byte is composed of bit 0 of the eight bytes, then bit 1 of the eight, ... up to bit 7 of the eight; repeat for the next eight original bytes, and so on.
I'll see how it goes/codes...
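Here's a rough Python sketch of the plan above (function names are my own; the bit regrouping follows the eight-bytes-at-a-time scheme described):

```python
import math
import struct
from collections import Counter

def byte_entropy(data):
    """Shannon entropy (bits per byte) over byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def transpose_bits(data):
    """For each group of 8 bytes, emit 8 new bytes: output byte i
    collects bit i of each of the 8 input bytes."""
    out = bytearray()
    for g in range(0, len(data) - 7, 8):
        block = data[g:g + 8]
        for bit in range(8):
            b = 0
            for j in range(8):
                b |= ((block[j] >> bit) & 1) << j
            out.append(b)
    return bytes(out)

ints = list(range(1000))              # a perfectly ordered sequence
data = struct.pack('<1000I', *ints)   # viewed as 32-bit little-endian
print(byte_entropy(data), byte_entropy(transpose_bits(data)))
```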

Entropy is a function on probabilities, not data (arrays of ints, or files). Entropy is a measure of disorder, but when the function is modified to take data as input it loses this meaning.
The only true way to generate a measure of disorder for data is to use Kolmogorov complexity. This has problems too: in particular, it is uncomputable, and it is not strictly well defined, since one must arbitrarily pick a base language. The well-definedness problem can be resolved if the disorder being measured is relative to whatever is going to process the data. So when considering compression on a particular computer, the base language would be the assembly language of that computer.
So you could define the disorder of an array of integers as follows:
The length of the shortest program written in Assembly that outputs the array.
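Since that shortest program can't actually be computed, a common practical stand-in (my suggestion, not part of the definition above) is the size of the data under a fixed compressor, which upper-bounds the Kolmogorov complexity relative to that compressor:

```python
import random
import struct
import zlib

def disorder_proxy(ints):
    """Length of the zlib-compressed byte representation: an
    upper-bound stand-in for Kolmogorov complexity relative to
    this compressor. Smaller means 'more ordered' to zlib."""
    data = struct.pack('<%dI' % len(ints), *ints)
    return len(zlib.compress(data, 9))

ordered = list(range(2000))
shuffled = ordered[:]
random.shuffle(shuffled)
# The sorted sequence compresses far better than its shuffled copy:
print(disorder_proxy(ordered) < disorder_proxy(shuffled))
```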

Related

Random number from many other random numbers, is it more random?

We want to generate a uniform random number from the interval [0, 1].
Let's first generate k random booleans (for example by rand()<0.5) and decide according to these on what subinterval [m*2^{-k}, (m+1)*2^{-k}] the number will fall. Then we use one rand() to get the final output as m*2^{-k} + rand()*2^{-k}.
Let's assume we have arbitrary precision.
Will a random number generated this way be 'more random' than the usual rand()?
PS. I guess the subinterval picking amounts to just choosing the binary representation of the output 0.b_1 b_2 b_3 ... one digit b_i at a time, and the final step is appending the representation of rand() to the end of the output.
It depends on the definition of "more random". If you use more random generators, you have more random state, which means the cycle length will be greater. But cycle length is just one property of random generators. A cycle length of 2^64 is usually OK for almost any purpose (the only exception I know of is when you need many different, long sequences, as in some kinds of simulation).
However, if you combine two bad random generators, the result doesn't necessarily become better; you have to analyze it. But there are generators that do work this way. KISS is an example: it combines three not-too-good generators, and the result is a good generator.
For card shuffling, you'll need a cryptographic RNG. Even a very good but non-cryptographic RNG is inadequate for this purpose. For example, the Mersenne Twister, which is a good RNG, is not suitable for secure card shuffling, because by observing output numbers it is possible to figure out its internal state, and thus the shuffle result can be predicted.
This can help, but only if you use a different pseudorandom generator for the first and last bits. (It doesn't have to be a different pseudorandom algorithm, just a different seed.)
If you use the same generator, then you will still only be able to construct 2^n different shuffles, where n is the number of bits in the random generator's state.
If you have two generators, each with n bits of state, then you can produce up to a total of 2^(2n) different shuffles.
Tinkering with a random number generator, as you are doing by using only one bit of its output at a time and then calling it iteratively, usually weakens its randomness properties. All RNGs fail some statistical tests for randomness, but you are more likely to find that a noticeable cycle crops up if you start making many calls and combining them.

Trie for numbers

Is it a good idea to use a trie or DAWG for numbers instead of strings? I have many two-number combinations and want to decrease the required memory size.
I would like to be able to store all the number combinations, but if the data structure only supports queries to check whether a given combination exists among the given ones, I will be happy.
I don't think the optimization for two-digit numbers will amount to much, but for longer numbers a trie definitely seems like a good solution.
As for the two-digit combinations, I think it is best to use an array of size 100 that stores a flag (or a count, if repetition is allowed) for each of the 100 two-digit combinations. Of course, if only valid numbers are allowed, you will only need 90 slots, as the 10 combinations starting with 0 are not valid numbers. When inserting a number, simply set the corresponding flag in the array (constant time). When checking whether a number is in the set, simply check the corresponding flag (again constant time). To recover all the numbers you have, iterate over the array and print every number whose flag is set. The idea is somewhat similar to a bit set and, when repetition is allowed, to counting sort. This solution also has the best possible computational complexity: constant for both operations.
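A minimal Python sketch of that flag array (class and method names are mine):

```python
class TwoDigitSet:
    """The flag-array idea: one slot per two-digit combination;
    swap the booleans for counts if repetition must be allowed."""

    def __init__(self):
        self.flags = [False] * 100

    def insert(self, n):
        self.flags[n] = True          # O(1)

    def contains(self, n):
        return self.flags[n]          # O(1)

    def all_numbers(self):
        """Recover the stored set by scanning the array."""
        return [i for i, f in enumerate(self.flags) if f]

s = TwoDigitSet()
s.insert(42)
s.insert(17)
print(s.contains(42), s.contains(99))  # True False
print(s.all_numbers())                 # [17, 42]
```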

Shuffling a huge range of numbers using minimal storage

I've got a very large range/set of numbers, (1..1236401668096), that I would basically like to 'shuffle', i.e. randomly traverse without revisiting the same number. I will be running a Web service, and each time a request comes in it will increment a counter and pull the next 'shuffled' number from the range. The algorithm will have to accommodate the server going offline, restarting traversal using the persisted value of the counter (something like how you can seed a pseudo-random number generator and get the same pseudo-random number given the seed and the iteration you are on).
I'm wondering if such an algorithm exists or is feasible. I've seen the Fisher-Yates shuffle, but its first step is to "write down the numbers from 1 to N", which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree fills up, collisions will become more common and could degrade performance (already a 0.08% chance of a collision after 1 billion hits, according to my calculation). Is there a more ideal solution for my scenario, or is this just a pipe dream?
The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DOS vulnerability in my app, but also because the presentation layer will look much nicer with a wider number distribution (I'd rather not go into details about exactly what the app does). At this point I'm considering just using a PRNG and dealing with collisions, or shuffling range slices (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, etc. as each range's numbers start to run out).
Any mathemagicians out there have any better ideas/suggestions?
Concatenate a PRNG or LFSR sequence with /dev/random bits
There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the LCPRNG (LCG) and the LFSR, but there are more algorithms such as the Mersenne Twister.
The period of these generators can be easily constructed to fit your requirements and then you simply won't have collisions.
You could deal with the predictable behavior of PRNGs and LFSRs by adding 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random. Because the deterministic part of your number is known to be unique, it makes no difference if the truly random part of it ever repeats.
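As a concrete illustration of the full-period idea, here is a Python sketch using an LCG modulo 2^41 (the smallest power of two above the range); the multiplier and increment are illustrative constants satisfying the Hull-Dobell conditions, not values tuned for statistical quality:

```python
# Full-period LCG mod 2**41 (Hull-Dobell: increment odd and
# multiplier % 4 == 1 give period 2**41), with "cycle walking"
# to skip the outputs that fall outside 1..LIMIT.
M = 2 ** 41                      # smallest power of two above the range
A = 6364136223846793005 % M      # A % 4 == 1, so the period is full
C = 1442695040888963407 % M      # odd increment
LIMIT = 1236401668096

def next_number(state):
    """Return the next in-range number and the state to persist.
    Every residue mod 2**41 occurs exactly once per period, so no
    number repeats until all of 1..LIMIT have been visited."""
    while True:
        state = (A * state + C) % M
        if 1 <= state <= LIMIT:
            return state, state

state = 12345                    # persist this across server restarts
n, state = next_number(state)
print(1 <= n <= LIMIT)           # always lands inside the range
```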
Divide and conquer? Break the range down into manageable chunks and shuffle them. You could divide the number range, e.g., by value modulo n; the list of groups can be constructed directly and is quite small, depending on n. Once a group is exhausted, you can use the next one.
For example if you choose an n of 1000, you create 1000 different groups. Pick a random number between 1 and 1000 (let's call this x) and shuffle the numbers whose value modulo 1000 equals x. Once you have exhausted that range, you can choose a new random number between 1 and 1000 (without x obviously) to get the next subset to shuffle. It shouldn't exactly be challenging to keep track of which numbers of the 1..1000 range have already been used, so you'd just need a repeatable shuffle algorithm for the numbers in the subset (e.g. Fisher-Yates on their "indices").
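A scaled-down Python sketch of this scheme (using a toy N of 100,000 so the per-group lists stay small; the real range would need groups small enough to shuffle in memory):

```python
import random

N, GROUPS = 100_000, 100   # toy sizes; the real N is 1236401668096

def group(x):
    """All numbers in 1..N congruent to x modulo GROUPS."""
    start = x if x != 0 else GROUPS
    return range(start, N + 1, GROUPS)

def shuffled_group(x, seed):
    """Repeatable Fisher-Yates shuffle of one group's indices,
    seeded so the traversal can be replayed after a restart."""
    members = group(x)
    idx = list(range(len(members)))
    random.Random(seed).shuffle(idx)
    return [members[i] for i in idx]

out = shuffled_group(7, seed=42)
print(sorted(out) == list(group(7)))  # a permutation of the group
```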
I guess the best option is to use a GUID/UUID. They are made for this type of thing, and it shouldn't be hard to find an existing implementation to suit your needs.
While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs

Mapping function

I have a set of 128-bit numbers, and the size of the set is < 2^32, so in theory I can have a mapping function that maps each 128-bit number to a 32-bit number. How can I construct such a mapping function?
Seems like you are looking for a minimal perfect hash which maps n keys to n consecutive integers.
The wiki page linked in the sentence above mentions two libraries that implement this.
Also see this for more detail: http://burtleburtle.net/bob/hash/perfect.html
Without knowing the nature of the input data, it's impossible to give the optimal hashing algorithm. But if the input is evenly distributed, you could use the lower 32 bits of the input. This allows collisions, so you have to deal with them.
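For instance (a toy sketch; bucket chaining is one way to deal with the collisions):

```python
def low32(v):
    """Map a 128-bit value to 32 bits by keeping the low word."""
    return v & 0xFFFFFFFF

# Collisions must then be handled, e.g. by chaining in buckets:
table = {}
for v in [2**96 + 5, 5]:   # two inputs that share the same low word
    table.setdefault(low32(v), []).append(v)
print(len(table[5]))  # 2: both values landed in bucket 5
```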
The generic construction is to keep all your 128-bit values in a big array, sorted in ascending order. Then, each value is "mapped" to its index in the array. To "compute" the map, you do a binary search in the array to get the precise index of the value. With 2^32 values, the array has size 64 GB, and the binary search entails about 32 lookups in the array.
In all generality you cannot really do better than that. However, if your 128-bit values have a reasonably uniform spread (it depends on where they come from), then the big array structure can be compressed by a large margin, especially if you can guarantee that all inputs to your map will always be part of the set of 128-bit values; my bet is that you can trim it down to a couple of gigabytes, but the lookup will be more expensive.
For a more practical solution, you will have to work with the structure of your 128-bit values: where they come from, what they represent...
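The sorted-array map can be sketched with Python's bisect (the four sample values are made up):

```python
import bisect

# Keep the set sorted; a value's "map" is its index in the array.
values = sorted([999, 12345, 2**100 + 7, 2**127 - 1])

def map_to_index(v):
    """Binary search; only valid for values known to be in the set."""
    i = bisect.bisect_left(values, v)
    if i == len(values) or values[i] != v:
        raise KeyError(v)
    return i

print(map_to_index(12345))  # 1: the second-smallest value in the set
```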
Set the position of your number as its value divided by 2^32.

How to adjust the distribution of values in a random data stream?

Given an infinite stream of random 0's and 1's from a biased (e.g. 1's are more common than 0's by a known factor) but otherwise ideal random number generator, I want to convert it into a (shorter) infinite stream that is just as ideal but also unbiased.
Looking up the definition of entropy finds this graph showing how many bits of output I should, in theory, be able to get from each bit of input.
The question: Is there any practical way to actually implement a converter that is nearly ideally efficient?
There is a well-known device due to Von Neumann for turning an unfair coin into a fair coin. We can use this device to solve our problem here.
Repeatedly draw two bits from your biased source until you obtain a pair in which the bits differ. Return the first bit and discard the second. This produces an unbiased source. The reason this works is that, regardless of the source's bias, the probability of a 01 is the same as the probability of a 10. Therefore the probability of a 0 conditional on seeing 01 or 10 is 1/2, and the probability of a 1 conditional on seeing 01 or 10 is 1/2.
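A short Python sketch of the Von Neumann extractor (the 0.9 bias below is just an example):

```python
import random

def von_neumann(bits):
    """Von Neumann extractor: read input bits in non-overlapping
    pairs, emit the first bit of each unequal pair, drop equal
    pairs. Works for any fixed bias if the bits are independent."""
    it = iter(bits)
    for a, b in zip(it, it):
        if a != b:
            yield a

rng = random.Random(0)
biased = (1 if rng.random() < 0.9 else 0 for _ in range(100_000))
out = list(von_neumann(biased))
print(abs(sum(out) / len(out) - 0.5) < 0.05)  # output is roughly balanced
```

Note the cost: with bias p, only 2p(1-p) of the input pairs produce an output bit, which is well below the theoretical rate from the entropy graph.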
Please see
http://en.wikipedia.org/wiki/Randomness_extractor
http://en.wikipedia.org/wiki/Whitening_transform
http://en.wikipedia.org/wiki/Decorrelation
Huffman-encode the input.
Given that the input has a known bias, you can compute a probability distribution for each n-bit segment. From that, construct a Huffman code and then just encode the sequence.
I'm not sure but one potential problem is that this might introduce some correlation between sequential bits.
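To make the idea concrete, here is a toy Python sketch for an assumed bias of p(1) = 0.7, using the Huffman code for non-overlapping pairs of bits (pair probabilities {11: 0.49, 10: 0.21, 01: 0.21, 00: 0.09} yield codewords 11→0, 10→10, 01→110, 00→111):

```python
# Huffman code for bit pairs under an assumed bias p(1) = 0.7.
# Average output: 1.81 bits per 2 input bits, close to the pair
# entropy 2 * H(0.7) ~= 1.76 bits.
CODE = {(1, 1): [0], (1, 0): [1, 0], (0, 1): [1, 1, 0], (0, 0): [1, 1, 1]}

def huffman_whiten(bits):
    """Encode the biased stream two bits at a time."""
    it = iter(bits)
    for pair in zip(it, it):
        yield from CODE[pair]

print(list(huffman_whiten([1, 1, 1, 0, 0, 0, 1, 1])))  # [0, 1, 0, 1, 1, 1, 0]
```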
