Entropy repacking - algorithm

I have been tossing around a conceptual idea for a machine (as in a Turing machine) and I'm wondering if any work has been done on this or related topics.
The idea is a machine that takes an entropy stream and gives out random symbols in any range without losing any entropy.
I'll grant that's a far from rigorous description, so I'll give an example: say I have a generator of random symbols in the range 1 to n and I want to be able to ask for symbols in any given range, first 1 to 12 and then 1 to 1234. (To keep it practical I'll only consider deterministic machines where, given the same input stream and requests, it will always give the same output.) One necessary constraint is that the output contain at least as much entropy as the input. However, the constraint I'm most interested in is that the machine only reads in as much entropy as it spits out.
E.g. if asked for tokens in the ranges 1 to S1, S2, S3, ..., Sm, it would only consume ceiling(sum(i = 1 to m, log(Si))/log(n)) input tokens; for n = 6 and requested ranges 12 and 1234, that is ceiling(log(12*1234)/log(6)) = 6 input symbols.
This question asks about how to do this conversion while satisfying the first constraint but does very badly on the second.

Okay, I'm still not sure that I'm following what you want. It sounds like you want a function
f: I → O
where the inputs are a strongly random (uniform distribution etc) sequence of symbols on an alphabet I={1..n}. (So a series of random natural numbers ≤ n.) The outputs are another sequence on O={1..m} and you want that sequence to have as much entropy as the inputs.
Okay, if I've got this right, first off, if m < n, you can't. If m < n then lg m < lg n, so the entropy of the set of output symbols is smaller.
If m ≥ n, then you can do it trivially by just selecting the ith element of {1..m}. Entropy will be the same, since the number of possible output symbols is the same. They aren't going to be "random" in the sense of being uniformly distributed over the whole set {1..m}, though, because necessarily (pigeonhole principle) some symbols won't be selected at all.
If, on the other hand, you'd be satisfied with having a random sequence on {1..m}, then you can do it by selecting an appropriate pseudorandom number generator using your input from the random source as a seed.

My current pass at it:
By adding the following restriction, that you know in advance what the sequence of ranges {S1, S2, S3, ..., Sn} is, base translation with a non-constant base might work:
Find Sp = S1 * S2 * S3 * ... * Sn
Extract m = ceiling(log(Sp)/log(n)) terms from the input {R1, R2, R3, ..., Rm}
Find X = R1 + R2*n + R3*n^2 + ... + Rm*n^(m-1)
Reform X as O1 + S1*O2 + S1*S2*O3 + ... + S1*S2*...*S(n-1)*On + S1*S2*...*Sn*x where 1 <= Oi <= Si and x is the leftover
This might be reformable into a solution that works for one value at a time by pushing x back into the input stream. However, I can't convince myself that even the known-ranges form is sound, so...
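Here's a rough Python sketch of that known-ranges attempt; it's only an illustration of the base-translation idea (the function and variable names are mine), it works with zero-based digits internally (Ri - 1 and Oi - 1) so the mixed-radix arithmetic lines up exactly, and it simply returns the leftover x instead of pushing it back into the stream:

import math
import random

def repack(source, n, ranges):
    """Base-translation sketch: read just enough symbols in 1..n to cover
    the requested ranges S1..Sk, then peel off mixed-radix digits."""
    Sp = math.prod(ranges)                      # S1 * S2 * ... * Sk
    m = math.ceil(math.log(Sp) / math.log(n))   # input symbols needed
    X = 0
    for i in range(m):                          # X = (R1-1) + (R2-1)*n + ...
        X += (next(source) - 1) * n**i
    outputs = []
    for S in ranges:                            # peel off one digit in 1..S
        outputs.append(X % S + 1)
        X //= S
    return outputs, X                           # X is the leftover entropy

# Example: a stream of fair die rolls (1..6), asked for 1..12 then 1..1234.
stream = iter(lambda: random.randint(1, 6), None)
print(repack(stream, 6, [12, 1234]))

Note that this makes no claim about the outputs being exactly uniform unless n^m happens to be a multiple of Sp, which is part of what the soundness worry above is about.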

Related

Hashing with the Division Method - Choosing number of slots?

So, in CLRS, there's this quote
A prime not too close to an exact power of 2 is often a good choice for m.
Several Questions...
I understand how a power of 2 will just take the lower-order bits of your key... however, say you have keys from a universe of 1 to 1 million, with each key having an equal probability of being any number from the universe (which I'm guessing is a common assumption about your universe if given no other data?). Wouldn't taking, say, the 4 lower-order bits then result in 2^4 lower-order bit patterns that were pretty much equally likely for the keys from 1 to 1 million? How am I thinking about this incorrectly?
Why a prime number? So, if powers of 2 aren't a good idea, why is a prime number a better choice as opposed to a composite number close to a power of 2? (Also, why should it be close to a power of 2... lol)
You are trying to find a hash table that works well for typical input data, and typical input data does things that you wouldn't expect from good random number generators. Very often you get formatted or semi-formatted strings which, when converted to numbers, end up as K, K+A, K+2A, K+3A,.... for some integers K and A. If K+xA and K+yA hash to the same number mod m, then (x-y)A must be 0 mod m. If m is prime, this can only happen if A = 0 mod m or if x = y mod m, so one time in m. But if m=pq and A happens to be divisible by p, then you get a collision every time x-y is divisible by q, which is more often since q < m.
I guess close to a power of 2 because it might be convenient for the memory management system to have blocks of memory of the resulting size - I really don't know. If you really care, and if you have the time, you could try different primes with some representative data and see which of them are best in practice.
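To see the arithmetic above in action, here is a small Python check; the step A = 12 and the moduli 64 and 67 are arbitrary choices for illustration:

# Keys of the form K, K+A, K+2A, ... hashed mod a composite that shares a
# factor with A land in far fewer slots than mod a nearby prime.
K, A = 1000, 12
keys = [K + i * A for i in range(10000)]

def slots_used(m):
    return len({k % m for k in keys})

print(slots_used(64))   # 16  (gcd(64, 12) = 4, so only m/4 residues occur)
print(slots_used(67))   # 67  (prime: every slot gets used)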

Random String generation from a given string, and inverse transform

I am working on a requirement where a function f will use string s as a seed and generate n strings y0..yn. I can easily do this, but I also want the inverse, i.e., f-1(yi) of any generated string should give me back s.
y0 = f(s) # first time I call f(s) it gives me y0
y1 = f(s) # second time I call f(s) it gives me y1
...
yi = f(s) # ith time I call f(s) it gives me yi
and so on.
The inverse function,
s = f-1(yi)
How can I find the functions f and f-1? The other constraint is that the strings cannot be too long, say at most 20-25 characters.
Any suggestions, please?
Ok, this will get too channel-coding-specific if I treat it in full breadth here, but:
These are mathematical concepts, so let's map strings to numbers and look at them algebraically:
Your 20-character string space, assuming we're just using the 128 common ASCII characters, has 128^20 = 2^(7*20) = 2^140 elements. That's pretty many elements.
However, communication technology has a method called scrambling which is a reversible process of mingling the bits in a sequence in a way that spreads the per-bit energy over the whole sequence. That leads to pretty randomly looking bit streams. It's typically implemented using feedback shift registers.
It's possible to find a 2^140-state LFSR that fulfills your scrambling needs, and you can interpret the output of a multiplicative scrambler as the next element in your sequence.
However, please be aware that your problem is a hard one, which I hope I've illustrated sufficiently -- getting something that has good randomness properties is hard, and I can't recommend implementing something like that yourself -- it's going to cause problems as soon as you need to rely on mathematical properties of your pseudorandom string.
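For a feel of what a multiplicative scrambler looks like at the bit level, here is a small Python sketch; the tap positions (a V.27-style polynomial 1 + x^-6 + x^-7) and the helper names are my own choices for illustration, and a real implementation would work on whole character strings rather than bit lists:

TAPS = (6, 7)   # delays of the feedback taps

def scramble(bits):
    state = [0] * max(TAPS)            # previous *output* bits
    out = []
    for d in bits:
        s = d
        for t in TAPS:
            s ^= state[t - 1]
        out.append(s)
        state = [s] + state[:-1]       # shift the new output bit in
    return out

def descramble(bits):
    state = [0] * max(TAPS)            # previous *scrambled* bits
    out = []
    for s in bits:
        d = s
        for t in TAPS:
            d ^= state[t - 1]
        out.append(d)
        state = [s] + state[:-1]
    return out

msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
assert descramble(scramble(msg)) == msg   # the process is reversible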

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. I repeat this until we find a sum sk+1 greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2^(N/2) lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum problem on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(n/2)) operation) and save re-computing them. The values of all the possible subsets don't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
#include <algorithm>
#include <vector>

std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
// Since we want all sums in a range, rather than just the first, we need to
// check all combinations. Horowitz/Sahni is only designed to find one.
for (auto firstit = firstlist.begin(); firstit != firstlist.end(); ++firstit) {
    // Restart the inner iterator for each element of the first list.
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
        int sum = *firstit + *secondit;
        if (sum > A && sum < B)
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz-Sahni, but we try to do some optimizations to reduce the constants in the big-O.
We do the following
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indexes in S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible; for instance, when the sum from S2 is negative, we don't consider sums < A, etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
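A small Python sketch of Step 1, assuming the half being processed is passed in as a list (the names are mine); each step flips exactly one bit of the Gray code, so a single addition or subtraction updates the running sum:

def half_sums(nums):
    """All 2^len(nums) subset sums of one half, via Gray-code stepping."""
    total = 2 ** len(nums)
    sums = [0] * total
    current, prev_gray = 0, 0
    for i in range(1, total):
        gray = i ^ (i >> 1)            # i-th Gray code
        flipped = gray ^ prev_gray     # exactly one bit differs
        bit = flipped.bit_length() - 1
        if gray & flipped:             # bit switched on: element enters
            current += nums[bit]
        else:                          # bit switched off: element leaves
            current -= nums[bit]
        sums[i] = current
        prev_gray = gray
    return sums

print(sorted(half_sums([3, 5, 9])))   # [0, 3, 5, 8, 9, 12, 14, 17]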
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than the original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is some Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)

# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**len(pairs)):
            yield [x for m, x in pairs if m & w]

# Modified Horowitz-Sahni with toolow and toohigh indices
half = len(terms) // 2
L = sorted([(sum(S), S) for S in subsets(terms[:half])])
R = sorted([(sum(S), S) for S in subsets(terms[half:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    # Advance the window of R entries whose sums land the total in [A, B].
    while toolow < len(R) - 1 and R[toolow+1][0] < A - Lsum:
        toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B - Lsum:
        toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S + R[n][1])), '=', sum(S + R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful, because in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each subset as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
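In code, the word-sized representation described above is just an integer bitmask per subset; a tiny Python illustration:

# Each subset of one half is a single machine word:
# bit i set means terms[i] is in the subset.
subset = 0
subset |= 1 << 0          # include terms[0]
subset |= 1 << 3          # include terms[3]
members = [i for i in range(8) if subset >> i & 1]
print(members)            # [0, 3]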
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.

Is there "good" PRNG generating values without hidden state?

I need some good pseudo random number generator that can be computed like a pure function from its previous output without any state hiding. Under "good" I mean:
I must be able to parametrize the generator in such a way that running it for 2^n iterations with any parameters (or with some large subset of them) should cover all or almost all values between 0 and 2^n - 1, where n is the number of bits in the output value.
The combined generator output of n + p bits must cover all or almost all values between 0 and 2^(n + p) - 1 if I run it for 2^n iterations for every possible combination of its parameters, where p is the number of bits in the parameters.
For example, an LCG can be computed like a pure function and it can meet the first condition, but it cannot meet the second one. Say we have a 32-bit LCG, m = 2^32 and it is constant; our p = 64 (two 32-bit parameters a and c), n + p = 96, so we must take the output three ints at a time to meet the second condition. Unfortunately, the condition cannot be met because of the strictly alternating sequence of odd and even ints in the output (see the quick check after this question). To overcome this, hidden state must be introduced, but that makes the function not pure and breaks the first condition (long hidden period).
EDIT: Strictly speaking, I want a family of functions parametrized by p bits and with a full state of n bits, each generating all possible binary strings of p + n bits in a unique "randomish" way, not just continuously incrementing a (p + n)-bit int. The parametrization is required to select that unique way.
Am I wanting too much?
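A quick numeric check of the parity problem described above (the multiplier, increment, and seed below are arbitrary odd values; a full-period power-of-two LCG needs an odd c):

# With m = 2^32, odd a, and odd c, the low bit of an LCG strictly alternates.
a, c, m = 1664525, 1013904223, 2**32
x = 12345
low_bits = []
for _ in range(8):
    x = (a * x + c) % m
    low_bits.append(x & 1)
print(low_bits)   # [0, 1, 0, 1, 0, 1, 0, 1]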
You can use any block cipher, with a fixed key. To generate the next number, decrypt the current one, increment it, and re-encrypt it. Because block ciphers are 1:1, they'll necessarily iterate through every number in the output domain before repeating.
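A minimal sketch of that decrypt-increment-encrypt cycle, using a toy 4-round Feistel network over 32-bit values in place of a real block cipher (the key, round count, and round function are arbitrary choices, and this is not cryptographically strong):

import hashlib

KEY, ROUNDS = b"fixed key", 4

def _round(half, r):
    # Arbitrary round function: 16 bits of a keyed hash of the half.
    digest = hashlib.sha256(KEY + bytes([r]) + half.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")

def encrypt(x):                      # a bijection on 32-bit values
    left, right = x >> 16, x & 0xFFFF
    for r in range(ROUNDS):
        left, right = right, left ^ _round(right, r)
    return (left << 16) | right

def decrypt(y):                      # exact inverse of encrypt
    left, right = y >> 16, y & 0xFFFF
    for r in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, r), left
    return (left << 16) | right

def next_value(current):
    # Decrypt, increment, re-encrypt: walks all 2^32 values before repeating.
    return encrypt((decrypt(current) + 1) % 2**32)

x = encrypt(0)
print(x, next_value(x) == encrypt(1))   # prints some 32-bit value, then True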
Try LFSR
All you need is a list of primitive polynomials.
Generating a finite field this way gives a period of 2^n - 1, but you can generalise this procedure to generate anything with period k^n - 1.
I have not seen this implemented, but all you have to implement is shifting numbers by a small number s > n where gcd(s, 2^n - 1) == 1 (gcd stands for greatest common divisor).
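For reference, a maximal-length LFSR is only a few lines; this Python sketch uses the classic 16-bit example (feedback mask 0xB400, i.e. the primitive polynomial x^16 + x^14 + x^13 + x^11 + 1), and the seed is arbitrary but must be nonzero:

def lfsr16_step(state):
    """One step of a 16-bit Galois LFSR with a primitive feedback polynomial."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

# The period from any nonzero seed is 2^16 - 1 = 65535.
seed = 0xACE1
state, period = seed, 0
while True:
    state = lfsr16_step(state)
    period += 1
    if state == seed:
        break
print(period)   # 65535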

How do you seed a PRNG with two seeds?

For a game that I'm making, where solar systems have x and y coordinates, I'd like to use the coordinates to randomly generate the features for that solar system. The easiest way to do this seems to be to seed a random number generator with two seeds, the x and y coordinates. Is there any way to get one reliable seed from the two seeds, or is there a good PRNG that takes two seeds and produces long periods?
EDIT: I'm aware of binary operations between the two numbers, but I'm trying to find the method that will lead to the fewest collisions. Addition and multiplication will easily result in collisions, but what about XOR?
Why not just combine the numbers in a meaningful way to generate your seed? For example, you could add them, which could be unique enough, or perhaps stack them using a little multiplication, for example:
seed = (x << 32) + y
seed1 ^ seed2
(where ^ is the bitwise XOR operator)
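A small Python illustration of the difference: packing the coordinates into disjoint bit ranges gives distinct seeds for distinct (x, y) pairs, while addition or XOR alone can collide (the 32-bit coordinate width is an assumption for this sketch):

import random

def make_seed(x, y):
    # Distinct (x, y) pairs always give distinct seeds.
    return ((x & 0xFFFFFFFF) << 32) | (y & 0xFFFFFFFF)

# (3, 5) and (5, 3) collide under + and ^, but not under packing.
print(3 + 5 == 5 + 3, 3 ^ 5 == 5 ^ 3, make_seed(3, 5) == make_seed(5, 3))

rng = random.Random(make_seed(17, 42))   # one deterministic stream per system
print(rng.random())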
A simple Fibonacci PRNG uses 2 seeds, one of which should be odd. This generator uses a modulus which is a power of 10. The period is long and invariable, being 1.5 times the modulus; thus for modulus 1000000 or 10^6 the period is 1,500,000.
The simple pseudocode is:
Input "Enter power for 10^n modulus";m
Mod& = 10 ^ m
Input "Enter # of iterations"; n
Input "Enter seed #1"; a
Input "Enter seed #2"; b
Loop = 1
For loop = 1 to n
C = a + b
If c > m then c = c - m
A = b
B = c
Next
This generator is very fast and gives an excellent uniform distribution. Hope this helps.
Why not use some kind of super simple Fibonacci arithmetic or something like it to produce coordinates directly in base 10? Use the two starting numbers as the seeds. It won't produce random numbers suitable for Monte Carlo or anything like that, but they should be all right for a game. I'm not a programmer or a mathematician and have never tried to code anything, so I couldn't do it for you...
edit - something like f1 = some seed then f2 = some seed and G = (sqrt(5) + 1) / 2....
then some kind of loop: Xn = Xn-1 + Xn-2 mod(G) mod(1) (should produce a decimal between 0 and 1) and then multiply by whatever and take the least significant digits
and perhaps to prevent decay for as long as the numbers need to be produced...
an initial reseeding point at which f1 and f2 will be reseeded based on the generator's own output, which will prevent the sequence of numbers being able to be described by a closed expression so...
if counter = initial reseeding point, then f1 = Xn and f2 = Xn - something, and... the reseeding point is set to ceiling(Xn * some multiplier).
so its period should end when identical values for Xn and Xn - something are re-fed into f1 and f2, which shouldn't happen for at least whatever bit length you are using for the numbers.
.... I mean, that's my best guess...
Is there a reason you want to use the co-ordinates? For example, do you always want a system generated at the same coordinate to always be identical to any other system generated at that particular co-ordinate?
I would suggest using the more classical method of just seeding with the current time and using the results of that to continue generating your pseudo-randomness.
If you're adamant about using the coordinates, I would suggest concatenation (As I believe someone else suggested). At least then you're guaranteed to avoid collisions, assuming that you don't have two systems at the same co-ords.
I use one of George Marsaglia's PRNGs:
http://www.math.uni-bielefeld.de/~sillke/ALGORITHMS/random/marsaglia-c
It explicitly relies on two seeds, so it might be just what you are looking for.
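From memory, the core of that Marsaglia post is a pair of multiply-with-carry generators, which naturally takes two seeds; a Python rendering (the constants 36969 and 18000 are the ones he published, but treat this as a sketch rather than a vetted port):

def mwc(z, w):
    """Marsaglia-style multiply-with-carry generator seeded with two values."""
    while True:
        z = 36969 * (z & 0xFFFF) + (z >> 16)
        w = 18000 * (w & 0xFFFF) + (w >> 16)
        yield ((z << 16) + (w & 0xFFFF)) & 0xFFFFFFFF

gen = mwc(1234, 5678)        # e.g. the system's x and y coordinates as seeds
print(next(gen), next(gen))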

Resources