Finding a number of maximally different binary vectors from a set - algorithm

Consider the set, S, of all binary vectors of length n where each contains exactly m ones; so there are n-m zeros in each vector.
My goal is to construct a number, k, of vectors from S such that these vectors are as different as possible from each other.
As a simple example, take n=4, m=2 and k=2, then a possible solution is: [1,1,0,0] and [0,0,1,1].
It seems that this is an open problem in the coding theory literature (?).
Is there an algorithm that finds a suboptimal yet good solution?
Is Hamming distance the right performance measure to use in this case?
Some thoughts:
In this paper, the authors propose a couple of algorithms to find the subset of vectors such that the pairwise Hamming distance is >= a certain value, d.
I have implemented the Random approach as follows: take a set SS, which is initialized by any vector from S. Then, I consider the remaining vectors
in S. For each of these vectors, I check if this vector has at least a distance d with respect to each vector in SS. If so, then it is added to SS.
By taking the maximal possible d, if the size of SS is >= k, then I consider SS as an optimal solution, and I choose any subset of k vectors from SS.
Using this approach, I think that the resulting SS will depend on the identity of the initial vector in SS; i.e. there are multiple solutions(?).
But how to proceed if the size of SS is < k ?
From the proposed algorithms in the paper, I have only understood the Random one. I am interested in the Binary lexicographic search (section 2.3) but I don't know how to implement it (?).
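For reference, the Random approach described above can be sketched in Python roughly as follows. This is only my reading of the description, not the paper's code; the function names (hamming, constant_weight_vectors, greedy_code, pick_k) are made up for illustration. It also shows one possible way to proceed when SS ends up smaller than k, which is an assumption on my part: lower d by 2 and retry, since the distance between two vectors with the same number of ones is always even and at most 2*min(m, n-m). It assumes 0 < m < n.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def constant_weight_vectors(n, m):
    """Yield every binary vector of length n with exactly m ones."""
    for ones in combinations(range(n), m):
        yield tuple(1 if i in ones else 0 for i in range(n))

def greedy_code(n, m, d):
    """Keep each candidate that is at distance >= d from everything kept so far."""
    chosen = []
    for v in constant_weight_vectors(n, m):
        if all(hamming(v, c) >= d for c in chosen):
            chosen.append(v)
    return chosen

def pick_k(n, m, k):
    """Try the largest possible d first; relax d by 2 until k vectors are found."""
    d = 2 * min(m, n - m)
    while d >= 2:
        code = greedy_code(n, m, d)
        if len(code) >= k or d == 2:
            return code[:k], d
        d -= 2

print(pick_k(4, 2, 2))
```

The result still depends on the order in which the candidates are visited, so different starting vectors can give different (equally valid) sets.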

Maybe you find this paper useful (I wrote it). It contains algorithms that efficiently create permutations of bitstrings.
For example, the inc() algorithm:
long inc(long h_in, long m0, long m1) {
    long h_out = h_in | (~m1);   // pre-mask
    h_out++;                     // increment
    h_out = (h_out & m1) | m0;   // post-mask
    return h_out;
}
It takes an input h_in and returns the next higher value that is at least 1 larger than h_in and 'matches' the boundaries m0 and m1. 'Matching' means: the result has a 1 wherever m0 has a 1, and the result has a 0 wherever m1 has a 0. Note that h_in MUST BE a valid value with regard to m0 and m1! Also, note that m0 has to be bitwise smaller than m1, which means that m0 cannot have a 1 in a position where m1 has a 0.
This could be used to generate permutations with a minimum edit distance to a given input string:
Let's assume you have 0110, you first NEGATE it to 1001 (edit distance = k).
Set 'm0=1001' and 'm1=1001'. Using these would result only in '1001' itself.
Now, to get all values with edit distance k-1, simply flip one of the bits of m0 or m1; inc() will then return an ordered series of all bitstrings that have a difference of k or k-1.
I know, not very interesting yet, but you can modify up to k bits, and inc() will always return all permutations with the maximum allowed edit difference with regard to m0 and m1.
Now, to get all permutations, you would have to re-run the algorithm with all possible combinations of m0 and m1.
Example: To get all possible permutations of 0110 with edit distance 2, you would have to run inc() with the following variations of m0=0110 and m1=0110 (to get permutations, a bit position has to be expanded, meaning that at that position m0 is set to 0 and m1 is set to 1):
Bit 0 and 1 expanded: m0=0010 and m1=1110
Bit 0 and 2 expanded: m0=0100 and m1=1110
Bit 0 and 3 expanded: m0=0110 and m1=1111
Bit 1 and 2 expanded: m0=0000 and m1=0110
Bit 1 and 3 expanded: m0=0010 and m1=0111
Bit 2 and 3 expanded: m0=0100 and m1=0111
As starting value for h_0 I suggest to use simply m0. Iteration can be aborted once inc() returns m1.
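To make the iteration concrete, here is a small Python sketch of the same idea. It is a direct transcription of inc() plus a loop that starts at m0 and stops at m1, as suggested above; the helper name enumerate_masked is made up for illustration. Python integers have arbitrary precision, so ~m1 is a negative number, but the final & m1 brings the result back into range.

```python
def inc(h_in, m0, m1):
    """Next value above h_in that has a 1 wherever m0 has a 1
    and a 0 wherever m1 has a 0 (h_in must already match m0/m1)."""
    h_out = h_in | ~m1          # pre-mask: fill the fixed-0 positions with ones
    h_out += 1                  # increment: the carry ripples over fixed positions
    return (h_out & m1) | m0    # post-mask: restore the fixed bits

def enumerate_masked(m0, m1):
    """All values matching m0/m1 in ascending order, from m0 up to m1."""
    h = m0
    values = [h]
    while h != m1:
        h = inc(h, m0, m1)
        values.append(h)
    return values

# "Bit 0 and 1 expanded" from the list above: m0=0010, m1=1110
print([format(v, "04b") for v in enumerate_masked(0b0010, 0b1110)])
```

With both masks equal (for example m0 = m1 = 1001) the loop returns just that single value.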
Summary
The above algorithm generates in O(x) all x binary vectors that differ in at least y bits (configurable) from a given vector v.
Using your definition of n=number of bits in a vector v, setting y=n generates exactly 1 vector which is the exact opposite of the input vector v. For y=n-1, it will generate n+1 vectors: n vectors which differ in all but one bit and 1 vector that differs in all bits. And so on for different values of y.
EDIT: Added summary and replaced erroneous 'XOR' with 'NEGATE' in the text above.

I don't know if maximizing the sum of the Hamming distances is the best criterion to obtain a set of "maximally different" binary vectors, but I strongly suspect it is. Furthermore I strongly suspect that the algorithm that I'm going to present yields exactly a set of k vectors that maximizes the sum of Hamming distances for vectors of n bits with m ones and n - m zeroes. Unfortunately I don't have the time to prove it (and, of course, I might be wrong – in which case you would be left with a “suboptimal yet good” solution, as per your request.)
Warning: In the following I'm assuming that, as a further condition, the result set may not contain the same vector twice.
The algorithm I propose is the following:
Starting from a result set with just one vector, repeatedly add one of
those remaining vectors that have the maximum sum of Hamming distances
from all the vectors that are already in the result set. Stop when the
result set contains k vectors or all available vectors have been
added.
Please note that the sum of Hamming distances of the result set does not depend on the choice of the first or any subsequent vector.
I found a “brute force” approach to be viable, given the constraints you mentioned in a comment:
n<25, 1<m<10, 10<k<100 (or 10<k<50)
The “brute force” consists in precalculating all vectors in “lexicographical” order in an array, and also keeping up-to-date an array of the same size that contains, for each vector with the same index, the total Hamming distance of that vector to all the vectors that are in the result set. At each iteration the total Hamming distances are updated, and the first (in “lexicographical” order) of all vectors that have the maximum total Hamming distance from the current result set is chosen. The chosen vector is added to the result set, and the arrays are shifted in order to fill in its place, effectively decreasing their size.
Here is my solution in Java. It's meant to be easily translatable to any procedural language, if needed. The part that calculates the combinations of m items out of n can be replaced by a library call, if one is available. The following Java methods have a corresponding C/C++ macro that uses fast specialized processor instructions on modern CPUs:
Long.numberOfTrailingZeros→__builtin_ctzl, Long.bitCount→__builtin_popcountl.
package waltertross.bits;

public class BitsMain {
    private static final String USAGE =
        "USAGE: java -jar <thisJar> n m k (1<n<64, 0<m<n, 0<k)";

    public static void main (String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(USAGE);
        }
        int n = parseIntArg(args[0]); // number of bits
        int m = parseIntArg(args[1]); // number of ones
        int k = parseIntArg(args[2]); // max size of result set
        if (n < 2 || n > 63 || m < 1 || m >= n || k < 1) {
            throw new IllegalArgumentException(USAGE);
        }
        // calculate the total number of available bit vectors
        int c = combinations(n, m);
        // truncate k to the above number
        if (k > c) {
            k = c;
        }
        long[] result   = new long[k];     // the result set (actually an array)
        long[] vectors  = new long[c - 1]; // all remaining candidate vectors
        long[] hammingD = new long[c - 1]; // their total Hamming distance to the result set
        long firstVector = (1L << m) - 1;         // m ones in the least significant bits
        long lastVector  = firstVector << (n - m); // m ones in the most significant bits
        result[0] = firstVector; // initialize the result set
        // generate the remaining candidate vectors in "lexicographical" order
        int size = 0;
        for (long v = firstVector; v != lastVector; ) {
            // See http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
            long t = v | (v - 1); // t gets v's least significant 0 bits set to 1
            // Next set to 1 the most significant bit to change,
            // set to 0 the least significant ones, and add the necessary 1 bits.
            v = (t + 1) | (((~t & -~t) - 1) >>> (Long.numberOfTrailingZeros(v) + 1));
            vectors[size++] = v;
        }
        assert(size == c - 1);
        // chosenVector is always the last vector added to the result set
        long chosenVector = firstVector;
        // do until the result set is filled with k vectors
        for (int r = 1; r < k; r++) {
            // find the index of the new chosen vector starting from the first
            int chosen = 0;
            // add the distance to the old chosenVector to the total distance of the first
            hammingD[0] += Long.bitCount(vectors[0] ^ chosenVector);
            // initialize the maximum total Hamming distance to that of the first
            long maxHammingD = hammingD[0];
            // for all the remaining vectors
            for (int i = 1; i < size; i++) {
                // add the distance to the old chosenVector to their total distance
                hammingD[i] += Long.bitCount(vectors[i] ^ chosenVector);
                // whenever the calculated distance is greater than the max,
                // update the max and the index of the new chosen vector
                if (maxHammingD < hammingD[i]) {
                    maxHammingD = hammingD[i];
                    chosen = i;
                }
            }
            // set the new chosenVector to the one with the maximum total distance
            chosenVector = vectors[chosen];
            // add the chosenVector to the result set
            result[r] = chosenVector;
            // fill in the hole left by the chosenVector by moving all vectors
            // that follow it down by 1 (keeping vectors and total distances in sync)
            System.arraycopy(vectors,  chosen + 1, vectors,  chosen, size - chosen - 1);
            System.arraycopy(hammingD, chosen + 1, hammingD, chosen, size - chosen - 1);
            size--;
        }
        // dump the result set
        for (int r = 0; r < k; r++) {
            dumpBits(result[r], n);
        }
    }

    private static int parseIntArg(String arg) {
        try {
            return Integer.parseInt(arg);
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException(USAGE);
        }
    }

    private static int combinations(int n, int m) {
        // calculate n over m = n! / (m! (n - m)!)
        // without using arbitrary precision numbers
        if (n <= 0 || m <= 0 || m > n) {
            throw new IllegalArgumentException();
        }
        // possibly avoid unnecessary calculations by swapping m and n - m
        if (m * 2 < n) {
            m = n - m;
        }
        if (n == m) {
            return 1;
        }
        // primeFactors[p] contains the power of the prime number p
        // in the prime factorization of the result
        int[] primeFactors = new int[n + 1];
        // collect prime factors of each term of n! / m! with a power of 1
        for (int term = n; term > m; term--) {
            collectPrimeFactors(term, primeFactors, 1);
        }
        // collect prime factors of each term of (n - m)! with a power of -1
        for (int term = n - m; term > 1; term--) {
            collectPrimeFactors(term, primeFactors, -1);
        }
        // multiply the collected prime factors, checking for overflow
        int combinations = 1;
        for (int f = 2; f <= n; f += (f == 2) ? 1 : 2) {
            // multiply as many times as requested by the stored power
            for (int i = primeFactors[f]; i > 0; i--) {
                int before = combinations;
                combinations *= f;
                // check for overflow
                if (combinations / f != before) {
                    String msg = "combinations("+n+", "+m+") > "+Integer.MAX_VALUE;
                    throw new IllegalArgumentException(msg);
                }
            }
        }
        return combinations;
    }

    private static void collectPrimeFactors(int n, int[] primeFactors, int power) {
        // for each candidate prime that fits in the remaining n
        // (note that non-primes will have been preceded by their component primes)
        for (int i = 2; i <= n; i += (i == 2) ? 1 : 2) {
            while (n % i == 0) {
                primeFactors[i] += power;
                n /= i;
            }
        }
    }

    private static void dumpBits(Long bits, int nBits) {
        String binary = Long.toBinaryString(bits);
        System.out.println(String.format("%"+nBits+"s", binary).replace(' ', '0'));
    }
}
The algorithm's data for n=5, m=2, k=4 (left column: result set so far; to its right: the remaining candidate vectors, with their total Hamming distance before→after the update below each vector):
result
00011  00101 00110 01001 01010 01100 10001 10010 10100 11000  vectors
        0→2   0→2   0→2   0→2   0→4   0→2   0→2   0→4   0→4   hammingD
                                 ^ chosen

00011  00101 00110 01001 01010 10001 10010 10100 11000
01100   2→4   2→4   2→4   2→4   2→6   2→6   4→6   4→6
                                 ^

00011  00101 00110 01001 01010 10010 10100 11000
01100   4→6   4→8   4→6   4→8   6→8   6→8   6→8
10001          ^

00011  00101 01001 01010 10010 10100 11000
01100    6     6     8     8     8     8
10001
00110
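For readers who prefer a compact version, the trace above can be reproduced with a short Python sketch of the same greedy procedure. This is a paraphrase of the Java program, not a replacement for it; the function name greedy_max_distance is made up, and vectors are stored as integers in ascending order, which matches the "lexicographical" order used above.

```python
from itertools import combinations

def greedy_max_distance(n, m, k):
    """Greedily pick up to k n-bit vectors with m ones, each time taking the
    first candidate with the largest total Hamming distance to the result set."""
    # all n-bit integers with exactly m ones, in ascending order
    vectors = sorted(sum(1 << p for p in pos) for pos in combinations(range(n), m))
    result = [vectors.pop(0)]        # start with m ones in the lowest bits
    total = [0] * len(vectors)       # total distance of each candidate so far
    while len(result) < k and vectors:
        last = result[-1]
        for i, v in enumerate(vectors):
            total[i] += bin(v ^ last).count("1")
        # max() returns the first maximal index, like the Java loop
        best = max(range(len(vectors)), key=total.__getitem__)
        result.append(vectors.pop(best))
        total.pop(best)
    return result

print([format(v, "05b") for v in greedy_max_distance(5, 2, 4)])
```

For n=5, m=2, k=4 this picks 00011, 01100, 10001 and 00110, in that order, as in the table.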
Sample output (n=24, m=9, k=20):
[wtross ~/Dropbox/bits]$ time java -jar bits-1.0-SNAPSHOT.jar 24 9 20
000000000000000111111111
000000111111111000000000
111111000000000000000111
000000000000111111111000
000111111111000000000000
111000000000000000111111
000000000111111111000000
111111111000000000000000
000000000000001011111111
000000111111110100000000
111111000000000000001011
000000000000111111110100
001011111111000000000000
110100000000000000111111
000000001011111111000000
111111110100000000000000
000000000000001101111111
000000111111110010000000
111111000000000000001101
000000000000111111110010
real 0m0.269s
user 0m0.244s
sys 0m0.046s
The toughest case within your constraints (n=24, m=9, k=99) takes ~550 ms on my Mac.
The algorithm could be made even faster by some optimization, e.g., by shifting shorter array chunks. Remarkably, in Java I found shifting "up" to be considerably slower than shifting "down".

UPDATED ANSWER
Looking at the example output of Walter Tross's code, I think that generating a random solution can be simplified to this:
Take any vector to start with, e.g. for n=8, m=3, k=5:
A: 01001100
After every step, sum the vectors to get the number of times each position has been used:
SUM: 01001100
Then, for the next vector, place the ones at positions that have been used least (in this case zero times), e.g.:
B: 00110001
to get:
A: 01001100
B: 00110001
SUM: 01111101
Then, there are 2 least-used positions left, so for the 3 ones in the next vector, use those 2 positions, and then put the third one anywhere:
C: 10010010
to get:
A: 01001100
B: 00110001
C: 10010010
SUM: 11121111 (or reset to 00010000 at this point)
Then for the next vector, you have 7 least-used positions (the ones in the sum), so choose any 3, e.g.:
D: 10100010
to get:
A: 01001100
B: 00110001
C: 10010010
D: 10100010
SUM: 21221121
And for the final vector, choose any 3 of the 4 least-used positions, e.g.:
E: 01000101
To generate all solutions, simply generate every possible vector in each step:
A: 11100000, 11010000, 11001000, ... 00000111
Then, e.g. when A and SUM are 11100000:
B: 00011100, 00011010, 00011001, ... 00000111
Then, e.g. when B is 00011100 and SUM is 11111100:
C: 10000011, 01000011, 00100011, 00010011, 00001011, 00000111
Then, e.g. when C is 10000011 and SUM is 21111111:
D: 01110000, 01101000, 01100100, ... 00000111
And finally, e.g. when D is 01110000 and SUM is 22221111:
E: 00001110, 00001101, 00001011, 00000111
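This would result in C(8,3) × C(5,3) × C(6,1) × C(7,3) × C(4,3) = 56 × 10 × 6 × 35 × 4 = 470,400 solutions for n=8, m=3, k=5 (for C, two ones are fixed on the least-used positions and the third goes into one of the 6 remaining positions, as in the list above).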
Actually, you need to add a method to avoid repeating the same vector, and avoid painting yourself into a corner; so I don't think this will be simpler than Walter's answer.
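A rough Python sketch of the "least-used positions" idea, for generating one random solution. The names (least_used_vectors, usage) are made up, and the duplicate handling is a simplistic assumption of mine: it just retries with a new random tie-break and gives up after a fixed number of attempts, so it does not solve the "painting yourself into a corner" problem in general.

```python
import random

def least_used_vectors(n, m, k, seed=0, max_tries=1000):
    """Build k distinct n-bit vectors with m ones each, always placing the
    ones on the positions that have been used least so far."""
    rng = random.Random(seed)
    usage = [0] * n                  # how often each position holds a one
    chosen = []                      # list of frozensets of one-positions
    while len(chosen) < k:
        for _ in range(max_tries):
            # least-used positions first, ties broken at random
            order = sorted(range(n), key=lambda p: (usage[p], rng.random()))
            candidate = frozenset(order[:m])
            if candidate not in chosen:
                break
        else:
            raise RuntimeError("could not find an unused vector")
        chosen.append(candidate)
        for p in candidate:
            usage[p] += 1
    return ["".join("1" if p in v else "0" for p in range(n)) for v in chosen]

for vector in least_used_vectors(8, 3, 5):
    print(vector)
```

Because the m least-used positions are taken at every step, the per-position usage counts never differ by more than 1.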
INITIAL ANSWER - HAS MAJOR ISSUES
(I will assume that m is not greater than n/2, i.e. the number of ones is not greater than the number of zeros. Otherwise, use a symmetrical approach.)
When k×m is not greater than n, there obviously are optimal solutions, e.g.:
n=10, m=3, k=3:
A: 1110000000
B: 0001110000
C: 0000001110
where the Hamming distances are all 2×m:
|AB|=6, |AC|=6, |BC|=6, total=18
When k×m is greater than n, solutions where the differences between the Hamming distances of consecutive vectors are minimized offer the greatest total distance:
n=8, m=3, k=4:
A: 11100000
B: 00111000
C: 00001110
D: 10000011
|AB|=4, |AC|=6, |AD|=4, |BC|=4, |BD|=6, |CD|=4, total=28
n=8, m=3, k=4:
A: 11100000
B: 00011100
C: 00001110
D: 00000111
|AB|=6, |AC|=6, |AD|=6, |BC|=2, |BD|=4, |CD|=2, total=26
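The two totals above are easy to check mechanically. A small Python sketch (the helper name total_distance is made up) that sums the Hamming distance over all pairs of vectors given as bit strings:

```python
from itertools import combinations

def total_distance(vectors):
    """Sum of the pairwise Hamming distances of equal-length bit strings."""
    return sum(
        sum(a != b for a, b in zip(u, v))
        for u, v in combinations(vectors, 2)
    )

evenly_spread = ["11100000", "00111000", "00001110", "10000011"]
bunched_up    = ["11100000", "00011100", "00001110", "00000111"]
print(total_distance(evenly_spread), total_distance(bunched_up))
```

This gives 28 for the first set and 26 for the second, matching the hand calculation.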
So, practically, you take m×k and see how much greater it is than n, let's call it x = m×k−n, and this x is the number of overlaps, i.e. how often a vector will have a one in the same position as the previous vector. You then spread out the overlap over the different vectors as evenly as possible to maximize the total distance.
In the example above, x = 3×4−8 = 4 and we have 4 vectors, so we can spread out the overlap evenly and every vector has 1 one in the same position as the previous vector.
To generate all unique solutions, you could:
Calculate x = m×k−n and generate all partitions of x into k parts, with the lowest possible maximum value:
n=8, m=3, k=5 -> x=7
22111, 21211, 21121, 21112, 12211, 12121, 12112, 11221, 11212, 11122
(discard partitions with value 3)
Generate all vectors to be used as vector A, e.g.:
A: 11100000, 11010000, 11001000, 11000100, ... 00000111
For each of these, generate all vectors B, which are lexicographically smaller than vector A, and have the correct number of overlapping ones with vector A (in the example that is 1 and 2), e.g.:
A: 10100100
overlap=1:
B: 10011000, 10010010, 10010001, 10001010, 10001001, 10000011, 01110000, ... 00000111
overlap=2:
B: 10100010, 10100001, 10010100, 10001100, 10000110, 10000101, 01100100, ... 00100101
For each of these, generate all vectors C, and so on, until you have sets of k vectors. When generating the last vector, you have to take into account the overlapping with the previous as well as the next (i.e. first) vector.
I assume it's best to treat the partitions of x into k as a binary tree:
1 2
11 12 21 22
111 112 121 122 211 212 221
1112 1121 1122 1211 1212 1221 2111 2112 2121 2211
11122 11212 11221 12112 12121 12211 21112 21121 21211 22111
and traverse this tree while creating solutions, so that each vector only needs to be generated once.
I think this method only works for some values of n, m and k; I'm not sure it can be made to work for the general case.


Why is the complexity of the binary search Log in base 2?

Considering the implementation of the iterative binary search code:
// Java implementation of iterative Binary Search
class BinarySearch {
    // Returns index of x if it is present in arr[],
    // else return -1
    int binarySearch(int arr[], int x)
    {
        int l = 0, r = arr.length - 1;
        while (l <= r) {
            int m = l + (r - l) / 2;
            // Check if x is present at mid
            if (arr[m] == x)
                return m;
            // If x greater, ignore left half
            if (arr[m] < x)
                l = m + 1;
            // If x is smaller, ignore right half
            else
                r = m - 1;
        }
        // if we reach here, then element was
        // not present
        return -1;
    }

    // Driver method to test above
    public static void main(String args[])
    {
        BinarySearch ob = new BinarySearch();
        int arr[] = { 2, 3, 4, 10, 40 };
        int n = arr.length;
        int x = 10;
        int result = ob.binarySearch(arr, x);
        if (result == -1)
            System.out.println("Element not present");
        else
            System.out.println("Element found at "
                               + "index " + result);
    }
}
The GeeksforGeeks website says the following:
"For example Binary Search (iterative implementation) has O(Logn) time complexity."
My question is what does the division by 2 have to do with logarithm in base 2? What is the relationship between each other? I will use the analogy of 1 pizza (array) to facilitate the understanding of my question:
1 pizza - divided into 2 parts = 2 pieces of pizza
2 pieces of pizza - divide each piece in half = 4 pieces of pizza
4 pieces of pizza - divide each piece in half = 8 pieces of pizza
8 pieces of pizza - divide each piece in half = 16 pieces of pizza
Logₐb = x
b = argument (the number whose logarithm is taken)
a = base
x = result of the logarithm
aˣ = b
The numbers of pizza pieces (1, 2, 4, 8 and 16) look related to logarithms, but I still can't see what the relationship is. How do the argument (b), the base (a) and the result of the logarithm (x) relate to dividing an array (pizza) by 2? Is x the final number of pieces into which I can divide my array (pizza)? Or is x the number of divisions of my array (pizza)?
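Contrary to your belief, O(log(n)) is independent of any base: logarithms in different bases differ only by a constant factor, and constant factors are ignored in big-O notation.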
If you have a pizza consisting of 16 slices of unit size, how often can you halve it (and throw away one of the halves) until you get a single slice of unit size?
Answer: log2(16) = 4 times
If you have an array of length n, how often can you halve it (and throw away one of the halves) until you get an array slice of length 1?
Answer: log2(n)
More generally, how does an n-ary search algorithm relate to logarithms?
Logₐb = x
b = the size of the array to search
a = the number of slices you get after one cut (all but one are thrown away)
x = the number of cuts you need to make until you get a slice of size 1
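If it helps, the "how often can you cut" question can be checked with a few lines of Python. This is just an illustration with a made-up helper name (cuts_until_one); it keeps one slice out of `a` after every cut and counts the cuts, then compares the count with the logarithm.

```python
import math

def cuts_until_one(n, a=2):
    """Number of times a range of size n can be cut into `a` parts
    (keeping one part each time) until a single element is left."""
    cuts = 0
    while n > 1:
        n //= a      # keep one slice, throw the others away
        cuts += 1
    return cuts

print(cuts_until_one(16), math.log2(16))        # binary search on 16 elements
print(cuts_until_one(81, 3), math.log(81, 3))   # ternary search on 81 elements
```

For sizes that are not an exact power of the base, the count is the logarithm rounded down, e.g. 3 cuts for n = 10.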
Let's use the same pizza analogy you have, and assume we have 1 whole pizza and we want 8 slices. Every time we cut, we divide by 2 as well.
The first cut means we will have 2 slices. The second cut gives us 4 slices. The third cut results in 8 slices. We made 3 cuts to get to 8 slices. Mathematically, it turns out that there is a relationship with the numbers 2, 3, and 8. The log function connects those numbers accordingly. When we are limited to how much we can divide, that is our base (base = 2). We have a quantity which is 8. The number of operations was lg(8) = 3 (using lg as log of base 2).
The same idea applies to binary search. We divide each section of the array we search by 2, the quantity is whatever our size of the array is, and the number of operations we perform is asymptotically lg(n).
Considering the answers, comments and the following video:
StackOverflow response 1
StackOverflow response 2
Binary Search Video
#Mo B. comment:
The question is not: how many cuts are necessary to get 16 slices. But rather: how many cuts are necessary to get a slice of size 1? And that's 4. In other words, like in the algorithm, you cut in half and throw away one of the halves at each step. How often can you do that with an array of size 16?
#Yves Daoust comment:
The logarithm of a number is roughly the number of times you can halve it until you reach 1.
My conclusions are:
The logarithm of an array of size n is approximately the number of times we can divide it in half (considering the base = 2) until it reaches the smallest unit of size 1.
If x = Log₂n then 1*2ˣ = n
So x = # times you can multiply 1 by 2 until you get to n
Reversing Logic: x = # of times you can divide n by 2 until you get to 1
The example in the figure would be Log₂10 = x, where x is a result that is not exact. However, if I had drawn the array with 16 positions, this would imply Log₂16 = 4, the result 4 is the number of levels or divisions.

How to turn integers into Fibonacci coding efficiently?

The Fibonacci sequence is obtained by starting with 0 and 1 and then adding the two last numbers to get the next one.
All positive integers can be represented as a sum of a set of Fibonacci numbers without repetition. For example: 13 can be the sum of the sets {13}, {5,8} or {2,3,8}. But, as we have seen, some numbers have more than one set whose sum is the number. If we add the constraint that the sets cannot have two consecutive Fibonacci numbers, then we have a unique representation for each number.
We will use a binary sequence (just zeros and ones) to do that. For example, 17 = 1 + 3 + 13. Then, 17 = 100101. See figure 2 for a detailed explanation.
I want to turn some integers into this representation, but the integers may be very big. How do I do this efficiently?
The problem itself is simple. You always pick the largest Fibonacci number that is not greater than the remainder. You can ignore the constraint with the consecutive numbers (since if you needed both, the next one is the sum of both, so you should have picked that one instead of the initial two).
So the problem remains how to quickly find the largest Fibonacci number that is not greater than some number X.
There's a known trick that starting with the matrix (call it M)
1 1
1 0
You can compute Fibonacci numbers by matrix multiplication (the xth number is an entry of M^x). More details here: https://www.nayuki.io/page/fast-fibonacci-algorithms . The end result is that you can compute the number you're looking for in O(logN) matrix multiplications.
You'll need large number computations (multiplications and additions) if the values don't fit into existing types.
Also store the matrices corresponding to the powers of two you compute the first time, since you'll need them again for the results.
Overall this should be O((logN)^2 * large_number_multiplications/additions).
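As an illustration of the O(log n) computation, here is a Python sketch using the fast doubling identities F(2j) = F(j)·(2·F(j+1) − F(j)) and F(2j+1) = F(j)² + F(j+1)², which are what you get when you square the matrix M. It is not the exact matrix code from the linked page, and the function name fib is mine. Python integers are arbitrary precision, so no separate big-number library is needed here.

```python
def fib(n):
    """Return the n-th Fibonacci number (F(0)=0, F(1)=1) in O(log n) steps."""
    def pair(k):
        # returns the tuple (F(k), F(k+1))
        if k == 0:
            return (0, 1)
        a, b = pair(k // 2)          # a = F(j), b = F(j+1) with j = k // 2
        c = a * (2 * b - a)          # F(2j)
        d = a * a + b * b            # F(2j+1)
        return (d, c + d) if k % 2 else (c, d)
    return pair(n)[0]

print([fib(i) for i in range(11)])
print(fib(100))
```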
First, I want to say that I really liked this question. I didn't know that all positive integers can be represented as a sum of a set of Fibonacci numbers without repetition; I saw the proof by induction and it was awesome.
To answer your question, I think we have to figure out how the representation is created. I think the easy way is to find, for the given number, the largest Fibonacci number that does not exceed it.
For example, if we want to represent 40:
We have Fib(9)=34 and Fib(10)=55, so the first element in the representation is Fib(9).
Since 40 - Fib(9) = 6, and Fib(5) = 5 and Fib(6) = 8, the next element is Fib(5). The remainder is 1 = Fib(2), so we have 40 = Fib(9) + Fib(5) + Fib(2).
Allow me to write this in C#:
using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        // indices k of the Fibonacci numbers Fib(k) used in the representation
        List<int> fibPresentation = new List<int>();
        int numberToPresent = Convert.ToInt32(Console.ReadLine());
        while (numberToPresent > 0)
        {
            // find the first k with Fib(k) > numberToPresent
            int k = 1;
            while (CalculateFib(k) <= numberToPresent)
            {
                k++;
            }
            // Fib(k-1) is the largest Fibonacci number that still fits
            numberToPresent = numberToPresent - CalculateFib(k - 1);
            fibPresentation.Add(k - 1);
        }
    }

    static int CalculateFib(int n)
    {
        if (n == 1)
            return 1;
        int a = 0;
        int b = 1;
        // In N steps compute Fibonacci sequence iteratively.
        for (int i = 0; i < n; i++)
        {
            int temp = a;
            a = b;
            b = temp + b;
        }
        return a;
    }
}
Your result will be in fibPresentation
This encoding is more accurately called the "Zeckendorf representation": see https://en.wikipedia.org/wiki/Fibonacci_coding
A greedy approach works (see https://en.wikipedia.org/wiki/Zeckendorf%27s_theorem) and here's some Python code that converts a number to this representation. It uses the first 100 Fibonacci numbers and works correctly for all inputs up to 927372692193078999175 (and incorrectly for any larger inputs).
fibs = [0, 1]
for _ in range(100):
    fibs.append(fibs[-2] + fibs[-1])

def zeck(n):
    i = len(fibs) - 1
    r = 0
    while n:
        if fibs[i] <= n:
            r |= 1 << (i - 2)
            n -= fibs[i]
        i -= 1
    return r

print(bin(zeck(17)))
The output is:
0b100101
As the greedy approach seems to work, it suffices to be able to invert the relation N=Fn.
By the Binet formula, Fn = [φ^n/√5], where the brackets denote the nearest integer. Then with n = floor(log_φ(√5·N)) you are very close to the solution.
17 => n = floor(7.5599...) => F7 = 13
4 => n = floor(4.5531) => F4 = 3
1 => n = floor(1.6722) => F1 = 1
(I do not exclude that some n values can be off by one.)
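The three estimates above can be reproduced with a short Python sketch (the function name fib_index_estimate is made up). Note that it uses floating point, so for very large N the logarithm loses precision and the off-by-one caveat applies even more; treat the result as a starting guess to be checked against the actual Fibonacci numbers.

```python
import math

PHI = (1 + math.sqrt(5)) / 2   # golden ratio

def fib_index_estimate(N):
    """Estimate n such that F(n) is the largest Fibonacci number <= N,
    using n = floor(log_phi(sqrt(5) * N))."""
    return math.floor(math.log(math.sqrt(5) * N) / math.log(PHI))

for N in (17, 4, 1):
    print(N, fib_index_estimate(N))
```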
I'm not sure if this is efficient enough for you, but you could simply use backtracking to find a (the) valid representation.
I would try to start the backtracking steps by taking the biggest possible fib number and only switch to smaller ones if the consecutive or the only once constraint is violated.

Generating a non-repeating set from a random seed, and extract result by index

p.s. I have referred to this as Random, but this is a Seed Based Random Shuffle, where the Seed will be generated by a PRNG, but with the same Seed, the same "random" distribution will be observed.
I am currently trying to find a method to assist in doing 2 things:
1) Generate Non-Repeating Sequence
This will take 2 arguments: Seed; and N. It will generate a sequence, of size N, populated with numbers between 1 and N, with no repetitions.
I have found a few good methods to do this, but most of them get stumped by feasibility with the second thing.
2) Extract an entry from the Sequence
This will take 3 arguments: Seed; N; and I. This is for determining what value would appear at position I in a Sequence that would be generated with Seed and N. However, in order to work with what I have in mind, it absolutely cannot use a generated sequence, and pick out an element.
I initially worked with pre-calculating the sequence, then querying it, but this only really works in test cases, as the number of Seeds, and the value of N that will be used would create a database into the Petabytes.
From what I can tell, having a method that implements requirement 1 by using requirement 2 would be the most ideal method.
i.e. a sequence is generated by:
function Generate_Sequence(int S, int N) {
    int[] sequence = new int[N];
    for (int i = 0; i < N; i++) {
        sequence[i] = Extract_From_Sequence(S, N, i);
    }
    return sequence;
}
For Example
GS = Generate Sequence
ES = Extract from Sequence
for:
S = 1
N = 5
I = 4
GS(S, N) = { 4, 2, 5, 1, 3 }
ES(S, N, I) = 1
let S = 2
GS(S, N) = { 3, 5, 2, 4, 1 }
ES(S, N, I) = 4
One way to do this is to make a permutation over the bit positions of the number. Assume that N is a power of two (I will discuss the general case later!).
Use the seed S to generate a permutation σ over the set {1,2,...,log2(N)}. Then permute the bits of I according to σ to obtain I'. In other words, the bit of I' at position σ(x) is obtained from the bit of I at position x.
One problem with this method is its linearity (it is closed under the XOR operation). To overcome this, you can find a number p with gcd(p,N)=1 (this can be done easily even for very large N) and generate a random number q < N using the seed S. The output of Extract_From_Sequence(S, N, I) would then be (p*I'+q) mod N.
Now the case where N is not an exact power of two. The problem arises when I' falls outside the range [1,N]. In that case, we return the most significant bits of I to their initial position until the resulting value falls into the desired range. This is done by swapping the σ(log2(N)) bit of I' with the log2(N) bit, and so on.
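A minimal Python sketch of the power-of-two case only, to show the shape of the idea. Several things here are my assumptions rather than part of the description above: the function name extract_from_sequence, the use of Python's random.Random as the seeded generator, picking p as a random odd number (which guarantees gcd(p, N) = 1 when N is a power of two), and using 0-based values 0..N-1 instead of 1..N. The non-power-of-two fix-up is not implemented.

```python
import random

def extract_from_sequence(seed, n_items, index):
    """Value at position `index` of a seed-dependent permutation of
    range(n_items); n_items must be a power of two and at least 2."""
    bits = n_items.bit_length() - 1
    if n_items < 2 or (1 << bits) != n_items:
        raise ValueError("this sketch only handles powers of two >= 2")
    rng = random.Random(seed)            # same seed, same sigma, p and q
    sigma = list(range(bits))
    rng.shuffle(sigma)                   # permutation of the bit positions
    p = rng.randrange(1, n_items, 2)     # odd, hence coprime to n_items
    q = rng.randrange(n_items)
    permuted = 0
    for x in range(bits):
        if (index >> x) & 1:             # bit x of index moves to sigma[x]
            permuted |= 1 << sigma[x]
    return (p * permuted + q) % n_items  # affine step to break XOR-linearity

print([extract_from_sequence(1, 16, i) for i in range(16)])
```

Both steps are bijections on 0..N-1, so every value appears exactly once, and no sequence has to be generated to look up a single position.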

Keep uniform distribution after remapping to a new range

Since this is about remapping a uniform distribution to another with a different range, this is not a PHP question specifically although I am using PHP.
I have a cryptographically secure random number generator that gives me evenly distributed integers (uniform discrete distribution) between 0 and PHP_INT_MAX.
How do I remap these results to fit into a different range in an efficient manner?
Currently I am using $mappedRandomNumber = $randomNumber % ($range + 1) + $min where $range = $max - $min, but that obviously doesn't work since the first PHP_INT_MAX%$range integers from the range have a higher chance to be picked, breaking the uniformity of the distribution.
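The bias is easy to see on a tiny example. The following Python sketch (helper name modulo_histogram is made up) counts how often each result occurs when every value of a small source range is reduced with the modulo operator:

```python
from collections import Counter

def modulo_histogram(source_size, target_size):
    """How often each value r % target_size occurs for r in 0..source_size-1."""
    return Counter(r % target_size for r in range(source_size))

# 8 equally likely source values mapped onto 3 target values
print(sorted(modulo_histogram(8, 3).items()))
```

With 8 source values and 3 targets, the results 0 and 1 occur three times each but 2 occurs only twice; the same effect happens at the top of the 0..PHP_INT_MAX range, just with a much smaller relative difference.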
Well, having zero knowledge of PHP definitely qualifies me as an expert, so
mentally converting to a float U[0,1)
f = r / PHP_INT_MAX
then doing
mapped = min + f*(max - min)
going back to integers
mapped = min + (r * max - r * min)/PHP_INT_MAX
if the computation is done in 64-bit math and PHP_INT_MAX is 2^31 - 1, it should work
This is what I ended up doing. PRNG 101 (if it does not fit, ignore and generate again). Not very sophisticated, but simple:
public function rand($min = 0, $max = null){
    // number of values in the inclusive target range {min .. max}
    $rangeSize = $max - $min + 1;
    // pow(2,$numBits-1)-1 calculated as (pow(2,$numBits-2)-1) + pow(2,$numBits-2)
    // to avoid overflow when $numBits is the number of bits of PHP_INT_MAX
    $maxSafe = (int) floor(
        ((pow(2,8*$this->intByteCount-2)-1) + pow(2,8*$this->intByteCount-2))
        /
        $rangeSize
    ) * $rangeSize;
    // discards anything at or above the last whole interval N * {0 .. max - min}
    // that fits in {0 .. 2^(intBitCount-1)-1}
    do {
        $chars = $this->getRandomBytesString($this->intByteCount);
        $n = 0;
        for ($i=0;$i<$this->intByteCount;$i++) {
            $n |= (ord($chars[$i])<<(8*($this->intByteCount-$i-1)));
        }
    } while (abs($n) >= $maxSafe);
    return (abs($n) % $rangeSize) + $min;
}
Any improvements are welcomed.
(Full code on https://github.com/elcodedocle/cryptosecureprng/blob/master/CryptoSecurePRNG.php)
Here is the sketch how I would do it:
Suppose you have a uniform random integer distribution in the range [A, B); that's what your random number generator provides.
Let L = B - A.
Let P be the highest power of 2 such that P <= L.
Let X be a sample from this range.
First calculate Y = X - A.
If Y >= P, discard it and start with new X until you get an Y that fits.
Now Y contains log2(P) uniformly random bits (zero-extend it to log2(P) bits).
Now we have uniform random bit generator that can be used to provide arbitrary number of random bits as needed.
To generate a number in the target range, let [A_t, B_t) be the target range. Let L_t = B_t - A_t.
Let P_t be the smallest power of 2 such that P_t >= L_t.
Read log2(P_t) random bits and make an integer from it, let's call it X_t.
If X_t >= L_t, discard it and try again until you get a number that fits.
Your random number in the desired range will be X_t + A_t.
Implementation considerations: if your L_t and L are powers of 2, you never have to discard anything. If not, each rejection step still accepts with probability greater than 1/2, so even in the worst case it needs fewer than 2 trials on average.
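The two stages above could be sketched in Python roughly as follows. This is only an illustration under assumptions: random.randrange stands in for whatever generator you actually have, and make_bit_source and uniform_int are names made up for the example.

```python
import random

def make_bit_source(sample, a, b):
    """Turn a uniform sampler over [a, b) into a generator of uniform bits."""
    span = b - a
    if span < 2:
        raise ValueError("the source range must contain at least 2 values")
    nbits = span.bit_length() - 1   # log2(P), P = highest power of 2 <= span
    p = 1 << nbits

    def bits():
        while True:
            y = sample() - a
            if y < p:               # keep only samples that fall in [0, P)
                for i in range(nbits):
                    yield (y >> i) & 1
    return bits()

def uniform_int(bit_source, a_t, b_t):
    """Draw a uniform integer from [a_t, b_t) using bits from bit_source."""
    span_t = b_t - a_t
    nbits = (span_t - 1).bit_length()  # log2(P_t), P_t = smallest power of 2 >= span_t
    while True:
        x = 0
        for _ in range(nbits):
            x = (x << 1) | next(bit_source)
        if x < span_t:              # reject values outside the target range
            return a_t + x

# Hypothetical source: uniform integers in [0, 1000)
bits = make_bit_source(lambda: random.randrange(0, 1000), 0, 1000)
value = uniform_int(bits, 10, 16)   # uniform over 10 .. 15
```

Leftover bits from one source sample stay in the generator and are reused by the next draw, so nothing that passed the first rejection step is wasted.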

number to unique permutation mapping of a sequence containing duplicates

I am looking for an algorithm that can map a number to a unique permutation of a sequence. I have found out about Lehmer codes and the factorial number system thanks to a similar question, Fast permutation -> number -> permutation mapping algorithms, but that question doesn't deal with the case where there are duplicate elements in the sequence.
For example, take the sequence 'AAABBC'. There are 6! = 720 ways that could be arranged, but I believe there are only 6! / (3! * 2! * 1!) = 60 unique permutations of this sequence. How can I map a number to a permutation in these cases?
Edit: changed the term 'set' to 'sequence'.
From Permutation to Number:
Let K be the number of character classes (example: AAABBC has three character classes)
Let N[K] be the number of elements in each character class (example: for AAABBC, we have N[K]=[3,2,1]), and let N = sum(N[K]).
Every legal permutation of the sequence then uniquely corresponds to a path in an incomplete K-way tree.
The unique number of the permutation then corresponds to the index of the tree-node in a post-order traversal of the K-ary tree terminal nodes.
Luckily, we don't actually have to perform the tree traversal -- we just need to know how many terminal nodes in the tree are lexicographically less than our node. This is very easy to compute, as at any node in the tree, the number of terminal nodes below the current node is equal to the number of permutations using the unused elements in the sequence, which has a closed form solution that is a simple multiplication of factorials.
So given our 6 original letters, if the first element of our permutation is a 'B', we determine that there are 5!/(2!2!1!) = 30 permutations that start with 'A', so our permutation number has to be at least 30. Had our first letter been a 'C', we could have calculated it as 5!/(2!2!1!) (starting with A) + 5!/(3!1!1!) (starting with B) = 30 + 20, or alternatively as
60 (total) - 5!/(3!2!0!) (starting with C) = 50
Using this, we can take a permutation (e.g. 'BAABCA') and perform the following computations:
Permutation # = 5!/(2!2!1!) ('B') + 0 ('A') + 0 ('A') + 2!/(0!1!1!) ('B') + 1!/(0!0!1!) ('C') + 0 ('A')
= 30 + 2 + 1 = 33
Checking that this works: CBBAAA corresponds to
(5!/(2!2!1!) (starting with A) + 5!/(3!1!1!) (starting with B)) ('C') + 4!/(2!2!0!) (starting with A) ('B') + 3!/(2!1!0!) (starting with A) ('B') = (30 + 20) + 6 + 3 = 59
Likewise, AAABBC =
0 ('A') + 0 ('A') + 0 ('A') + 0 ('B') + 0 ('B') + 0 ('C') = 0
Sample implementation:
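import math
from functools import reduce
from operator import mul
def computePermutationNumber(inPerm, inCharClasses):
    # inPerm: sequence of class indices, e.g. [1, 0, 0, 1, 2, 0] for 'BAABCA'
    # inCharClasses: number of elements in each class, e.g. [3, 2, 1]
    charClasses = list(inCharClasses)  # work on a copy
    n = len(inPerm)
    permNumber = 0
    for i, x in enumerate(inPerm):
        # count the permutations that put a smaller class at position i
        for j in range(x):
            if charClasses[j] > 0:
                charClasses[j] -= 1
                permNumber += multiFactorial(n - i - 1, charClasses)
                charClasses[j] += 1
        # consume the element actually used at position i
        if charClasses[x] > 0:
            charClasses[x] -= 1
    return permNumber
def multiFactorial(n, charClasses):
    # n! divided by the product of the factorials of the remaining class counts
    return math.factorial(n) // reduce(mul, map(math.factorial, charClasses))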
From Number to Permutation:
This process can be done in reverse, though I'm not sure how efficiently:
Given a permutation number, and the alphabet that it was generated from, recursively subtract the largest number of nodes less than or equal to the remaining permutation number.
E.g. Given a permutation number of 59, we first can subtract 30 + 20 = 50 ('C') leaving 9. Then we can subtract 'B' (6) and a second 'B'(3), re-generating our original permutation.
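A minimal Python sketch of that reverse direction, assuming the same lexicographic numbering (0 for AAABBC up to 59 for CBBAAA). The names count_arrangements and unrank and the alphabet parameter are made up for this illustration.

```python
from math import factorial

def count_arrangements(counts):
    """Number of distinct orderings of a multiset with the given class counts."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

def unrank(number, counts, alphabet="ABC"):
    """Map a permutation number back to the permutation it stands for."""
    counts = list(counts)
    if not 0 <= number < count_arrangements(counts):
        raise ValueError("permutation number out of range")
    out = []
    for _ in range(sum(counts)):
        for j, c in enumerate(counts):
            if c == 0:
                continue
            counts[j] -= 1                  # tentatively place class j here
            block = count_arrangements(counts)
            if number < block:              # the target lies inside this block
                out.append(alphabet[j])
                break
            number -= block                 # skip the whole block
            counts[j] += 1                  # undo and try the next class
    return "".join(out)

print(unrank(59, [3, 2, 1]))  # CBBAAA
print(unrank(0, [3, 2, 1]))   # AAABBC
```

Each position costs at most K block computations, so the whole unranking takes on the order of N * K factorial evaluations.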
Here is an algorithm in Java that enumerates the possible sequences by mapping an integer to the sequence.
public class Main {
private int[] counts = { 3, 2, 1 }; // 3 Symbols A, 2 Symbols B, 1 Symbol C
private int n = sum(counts);
public static void main(String[] args) {
new Main().enumerate();
}
private void enumerate() {
int s = size(counts);
for (int i = 0; i < s; ++i) {
String p = perm(i);
System.out.printf("%4d -> %s\n", i, p);
}
}
// calculates the total number of symbols still to be placed
private int sum(int[] counts) {
int n = 0;
for (int i = 0; i < counts.length; i++) {
n += counts[i];
}
return n;
}
// calculates the number of different sequences with the symbol configuration in counts
private int size(int[] counts) {
int res = 1;
int num = 0;
for (int pos = 0; pos < counts.length; pos++) {
for (int den = 1; den <= counts[pos]; den++) {
res *= ++num;
res /= den;
}
}
return res;
}
// maps the sequence number to a sequence
private String perm(int num) {
int[] counts = this.counts.clone();
StringBuilder sb = new StringBuilder(n);
for (int i = 0; i < n; ++i) {
int p = 0;
for (;;) {
while (counts[p] == 0) {
p++;
}
counts[p]--;
int c = size(counts);
if (c > num) {
sb.append((char) ('A' + p));
break;
}
counts[p]++;
num -= c;
p++;
}
}
return sb.toString();
}
}
The mapping used by the algorithm is as follows. I use the example given in the question (3 x A, 2 x B, 1 x C) to illustrate it.
There are 60 (=6!/3!/2!/1!) possible sequences in total, 30 (=5!/2!/2!/1!) of them have an A at the first place, 20 (=5!/3!/1!/1!) have a B at the first place, and 10 (=5!/3!/2!/0!) have a C at the first place.
The numbers 0..29 are mapped to all sequences starting with an A, 30..49 are mapped to the sequences starting with B, and 50..59 are mapped to the sequences starting with C.
The same process is repeated for the next place in the sequence, for example if we take the sequences starting with B we have now to map numbers 0 (=30-30) .. 19 (=49-30) to the sequences with configuration (3 x A, 1 x B, 1 x C)
A very simple way to map a permutation consisting of n digits to a number is
number <- digit[0]*10^(n-1) + digit[1]*10^(n-2) + ... + digit[n-1]*10^0
You can find plenty of resources for algorithms that generate permutations. I guess you want to use this algorithm in bioinformatics. For example, you can use itertools.permutations from Python.
Assuming the resulting number fits inside a word (e.g. 32 or 64 bit integer) relatively easily, then much of the linked article still applies. Encoding and decoding from a variable base remains the same. What changes is how the base varies.
If you're creating a permutation of a sequence, you pick an item out of your bucket of symbols (from the original sequence) and put it at the start. Then you pick out another item from your bucket of symbols and put it on the end of that. You'll keep picking and placing symbols at the end until you've run out of symbols in your bucket.
What's significant is which item you picked out of the bucket of the remaining symbols each time. The number of remaining symbols is something you don't have to record because you can compute that as you build the permutation -- that's a result of your choices, not the choices themselves.
The strategy here is to record what you chose, and then present an array of what's left to be chosen. Then choose, record which index you chose (packing it via the variable base method), and repeat until there's nothing left to choose. (Just as above when you were building a permuted sequence.)
In the case of duplicate symbols it doesn't matter which one you picked, so you can treat them as the same symbol. The difference is that when you pick a symbol which still has a duplicate left, you didn't reduce the number of symbols in the bucket to pick from next time.
Let's adopt a notation that makes this clear:
Instead of listing duplicate symbols left in our bucket to choose from like c a b c a a we'll list them along with how many are still in the bucket: c-2 a-3 b-1.
Note that if you pick c from the list, the bucket has c-1 a-3 b-1 left in it. That means next time we pick something, we have three choices.
But on the other hand, if I picked b from the list, the bucket has c-2 a-3 left in it. That means next time we pick something, we only have two choices.
When reconstructing the permuted sequence we just maintain the bucket the same way as when we were computing the permutation number.
The implementation details aren't trivial, but they're straightforward with standard algorithms. The only thing that might heckle you is what to do when a symbol in your bucket is no longer available.
Suppose your bucket was represented by a list of pairs (like above): c-1 a-3 b-1 and you choose c. Your resulting bucket is c-0 a-3 b-1. But c-0 is no longer a choice, so your list should only have two entries, not three. You could move the entire list down by 1 resulting in a-3 b-1, but if your list is long this is expensive. A fast and easy solution: move the last element of the bucket into the removed location and decrease your bucket size: c-0 a-3 b-1 becomes b-1 a-3 <empty> or just b-1 a-3.
Note that we can do the above because it doesn't matter what order the symbols in the bucket are listed in, as long as it's the same way when we encode or decode the number.
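As a rough Python sketch of the bucket idea (my own illustration, with made-up names encode, decode and _take): the base at each step is the number of distinct symbols still in the bucket, and an exhausted symbol is removed by moving the last entry into its slot.

```python
def _take(bucket, idx):
    """Remove one copy of bucket[idx]; swap-remove the entry when it runs out."""
    bucket[idx][1] -= 1
    if bucket[idx][1] == 0:
        bucket[idx] = bucket[-1]
        bucket.pop()

def _make_bucket(symbols):
    # The initial order is arbitrary but must be the same for encode and decode.
    return [[s, symbols.count(s)] for s in sorted(set(symbols))]

def encode(perm, symbols):
    """Pack a permutation of `symbols` into one integer (variable base)."""
    bucket = _make_bucket(symbols)
    digits = []                         # (index chosen, number of choices)
    for ch in perm:
        idx = next(i for i, (s, _) in enumerate(bucket) if s == ch)
        digits.append((idx, len(bucket)))
        _take(bucket, idx)
    number = 0
    for idx, base in reversed(digits):  # the first choice ends up least significant
        number = number * base + idx
    return number

def decode(number, symbols):
    """Rebuild the permutation by replaying the recorded choices."""
    bucket = _make_bucket(symbols)
    out = []
    while bucket:
        number, idx = divmod(number, len(bucket))
        out.append(bucket[idx][0])
        _take(bucket, idx)
    return "".join(out)

code = encode("BAABCA", "AAABBC")
assert decode(code, "AAABBC") == "BAABCA"
```

Every permutation gets its own code with this packing, but the codes need not form a contiguous range, because the base at each step depends on the earlier choices.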
As I was unsure of the code in gbronner's answer (or of my understanding), I recoded it in R as follows
ritpermz=function(n, parclass){
return(factorial(n) / prod(factorial(parclass)))}
rankum <- function(confg, parclass){
n=length(confg)
permdex=1
for (i in 1:(n-1)){
x=confg[i]
if (x > 1){
for (j in 1:(x-1)){
if(parclass[j] > 0){
parclass[j]=parclass[j]-1
permdex=permdex + ritpermz(n-i, parclass)
parclass[j]=parclass[j]+1}}}
parclass[x]=parclass[x]-1
}#}
return(permdex)
}
which does produce a ranking with the right range of integers

Resources