Quickly generating the "triangle sequence": avoiding mispredictions - performance

I'm interested in calculating the triangle sequence[1], which is the sequence of pairs (i, j): (0, 0), (1, 0), (1, 1), (2, 0), (2, 1), ...
which iterates through all pairs (i, j) with the restriction that i >= j. The same sequence but with the restriction i > j is also interesting.
These sequences represent, among other things, all the ways to choose 2 (possibly identical) elements from an n-element set (for the sequence up to (n, n)[2]), or the indices of the lower triangular elements of a matrix[3]. The sequence of values for i alone is A003056 in OEIS, while j alone is A002262. These sequences frequently arise in combinatorial algorithms, where their generation may be performance-critical.
A simple but branchy way to generate the next value in the sequence is:
if (i == j) {
    j = 0;
    i++;
} else {
    j++;
}
However, this suffers from many mispredicts while calculating the initial elements of the sequence, when checking the condition (i == j) -
generally one mispredict each time i is incremented. As the sequence grows, the mispredicts become rarer since i is incremented
less and less frequently, so the j++ branch dominates and is well predicted. Still, some types of combinatorial search repeatedly iterate over the
small terms in the sequence, so I'm looking for a branch-free approach or some other approach that suffers fewer mispredicts.
For many uses, the order of the sequence isn't as important, so generating the values in a different order than above is allowable if it leads to
a better solution. For example, j could count down rather than up: (0, 0), (1, 1), (1, 0), (2, 2), (2, 1), ....
[1] I'm also interested in knowing what the right name for this sequence is (perhaps so I can make a better title for this question). I just kind of made up "triangle sequence".
[2] Here, the i >= j version represents sub-multisets (repetition allowed), while the i > j variant represents normal subsets (no repetition).
[3] Here, the i >= j version includes the main diagonal, while the i > j variant excludes it.
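For what it's worth, (i, j) can also be computed directly from the linear index k of the pair via the integer triangular root - branch-free, at the cost of an integer square root. A quick sketch in Python (math.isqrt needs Python 3.8+):
from math import isqrt

def pair_from_index(k):
    i = (isqrt(8 * k + 1) - 1) // 2   # largest i with i*(i+1)/2 <= k
    j = k - i * (i + 1) // 2
    return (i, j)

assert [pair_from_index(k) for k in range(5)] == [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1)]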

Here are two branch-free approaches that do not use any expensive calculations. First one uses comparison and logical AND:
const bool eq = i == j;
i += eq;
j = (j + 1) & (eq - 1);
Second one uses comparison and multiplication:
const bool eq = i == j;
i += eq;
j = (j + 1) * (1 - eq);
In theory "multiplication" variant should be slower than "logical" one, but measurements show very little difference.
Both approaches result in branchless code only on processors that allow branchless comparisons (for example x86). These approaches also assume a language where the results of conditional expressions are easily converted to integers (for example C/C++, where "false" comparisons convert to the integer 0 and "true" ones to 1).
The only problem with these approaches is performance. They could in theory outperform branchy code, but only when mispredicts are really frequent. A simple test with no other work besides generating the "triangle sequence" (see it on ideone) shows a miserable mispredict rate, and therefore both branchless methods are about 3 times slower than the branchy one. The explanation is simple: there are not many mispredicts for longer sequences, and as for shorter ones, modern processors have very good branch predictors that almost never fail on short branch patterns. So we have few mispredicts, and while the branchy code almost always executes only 2 instructions (compare, increment), the branchless code executes both the active and inactive "branches" plus some instructions specific to the branchless approach.
In case you want to repeatedly iterate over the small terms in the sequence, a different approach is probably preferable: calculate the sequence only once, then repeatedly read it from memory.
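For example, a minimal sketch of that idea (the bound N = 1000 is arbitrary):
N = 1000
table = [(i, j) for i in range(N) for j in range(i + 1)]  # precompute once

total = 0
for i, j in table:   # afterwards: pure sequential reads, no data-dependent branch
    total += i * j   # stand-in for the real per-pair work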

In Python we can express this as:
i, j = i + (i == j), (j + 1) * (i != j)
but it turns out, at around a million iterations or so on my machine, the following, more long-winded, lazy-evaluation code is about 20% faster:
from itertools import count, repeat
def gen_i():
    """ A003056 """
    for x in count(0):  # infinitely counts up
        yield from repeat(x, x + 1)  # replication

def gen_j():
    """ A002262 """
    for x in count(0):  # infinitely counts up
        yield from range(x + 1)  # count up to (including) x

sequence = zip(gen_i(), gen_j())
for _ in range(1000000):
    i, j = next(sequence)
In the above code, gen_i(), gen_j(), count(), repeat(), and zip() are all generators (and range() is an iterator), so sequence continues to call into the code on demand as new (i, j) pairs are required. I assume the implementations of both range() and repeat() terminate with a misprediction.
Simple isn't necessarily also quick (e.g. consider all the unnecessary additions of zero and multiplications by one in the compact form).
So which is more important, quickly generating the sequence or avoiding mispredictions?

You can derive j from i:
...set val...
old_j = j;
j = (j + 1) % (i + 1);
if (i == old_j) {
    i++;
}
...loop if more...
And further derive i increment from j and current i:
...set val...
old_j = j;
j = (j + 1) % (i + 1);
i = i + (old_j + 1) / (i + 1);  // integer division: adds 1 exactly when old_j == i
...loop if more...
(Can't test it at the moment... Please review)
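As a quick review, the corrected update transcribed to Python does produce the sequence:
i = j = 0
seq = []
for _ in range(6):
    seq.append((i, j))                # ...set val...
    old_j = j
    j = (j + 1) % (i + 1)
    i = i + (old_j + 1) // (i + 1)    # integer division: adds 1 exactly when old_j == i
assert seq == [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]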

Is there a sparse edit distance algorithm?

Say you have two strings of length 100,000 containing zeros and ones. You can compute their edit distance in roughly 10^10 operations.
If each string only has 100 ones and the rest are zeros then I can represent each string using 100 integers saying where the ones are.
Is there a much faster algorithm to compute the edit distance using
this sparse representation? Even better would be an algorithm that also uses 100^2 space instead of 10^10 space.
To give something to test on, consider these two strings with 10 ones each. The integers say where the ones are in each string.
[9959, 10271, 12571, 21699, 29220, 39972, 70600, 72783, 81449, 83262]
[9958, 10270, 12570, 29221, 34480, 37952, 39973, 83263, 88129, 94336]
In algorithmic terms, if we have two sparse binary strings of length n each represented by k integers each, does there exist an O(k^2) time edit distance algorithm?
Of course! There are so few possible operations with so many 0s. I mean, the answer is at most 200.
Take a look at
10001010000000001
         ||||||
10111010100000010
Look at all the zeroes with pipes. Does it matter which one out of those you end up deleting? Nope. That's the key.
Solution 1
Let's consider the normal n*m solution:
dp(int i, int j) {
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}
If almost every single character was a 0, what would hog the most amount of time?
if( str1[i-1] == str2[j-1] ) { // They will be equal so many times, (99900)^2 times!
    return dp(i-1, j-1);
}
I could imagine that trickling down for tens of thousands of recursions. All you actually need logic for are the ~200 critical points. You can ignore the rest. A simple modification would be
if( str1[i-1] == str2[j-1] ) {
    if( str1[i-1] == 1 )
        return dp(i-1, j-1); // Already hit a critical point
    // rightmost location of a 1 in str1 or str2 that is <= i-1
    best = binarySearch(CriticalPoints, i-1);
    return dp(best + 1, best + 1); // Use that critical point
    // Important! best+1 because we still want to compute the answer at best.
    // Without it, we would skip over a case where str1[best] is 1 and str2[best] is 0.
}
CriticalPoints would be the array containing the index of every 1 in either str1 or str2. Make sure that it's sorted before you binary search. Keep in mind those gotchas. My logic was: okay, I need to make sure to calculate the answer at the index best itself, so let's go with best + 1 as the parameter. But if best == i - 1, we get stuck in a loop. I'll handle that with a quick str1[i-1] == 1 check. Done, phew.
You can do a quick check for correctness by noting that in the worst case you will hit all 200*100000 combinations of i and j that make critical points, and when those critical points call min(a, b, c), only three recursive function calls are made. If any of those calls hit a critical point, it's part of the 200*100000 we already counted and we can ignore it. If not, then in O(log(200)) it falls into a single call on another critical point (now it's something we know is part of the 200*100000 we already counted). Thus, each critical point takes at worst 3*log(200) time, excluding calls to other critical points. Similarly, the very first function call falls into a critical point in log(200) time. Thus, we have an upper bound of O(200*100000*3*log(200) + log(200)).
Also, make sure your memo table is a hashmap, not an array. 10^10 memory will not fit on your computer.
Solution 2
You know the answer is at most 200, so just prevent yourself from computing more than that many operations deep.
dp(int i, int j) { // O(100000 * 205), sounds good to me.
    if( abs(i - j) > 205 )
        return 205; // The answer in this case is at least 205, so it's irrelevant to the final answer: when min is called, it won't be the smallest.
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}
This one is very simple, but I leave it as solution two because it seems to come out of thin air, as opposed to analyzing the problem, figuring out where the slow part is, and optimizing it. Keep this in your toolbox though, since this is the solution you should be coding.
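For reference, here is a minimal Python rendering of Solution 2's banded idea (iterative rather than memoized recursion, to avoid deep call stacks; the names and the way the band is passed in are mine):
def banded_edit_distance(s1, s2, band):
    # Standard edit-distance DP, but only cells with |i - j| <= band are
    # computed; everything outside the band counts as "too expensive".
    n, m = len(s1), len(s2)
    INF = band + 1
    if abs(n - m) > band:
        return INF                  # already needs more than `band` indels
    prev = {j: j for j in range(min(m, band) + 1)}   # row i = 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = prev.get(j - 1, INF) + (s1[i-1] != s2[j-1])   # match/substitute
            best = min(best, prev.get(j, INF) + 1)               # delete
            best = min(best, cur.get(j - 1, INF) + 1)            # insert
            cur[j] = best
        prev = cur
    return prev[m]
With the sparse representation, band can be set to the total number of 1s in both strings (the answer never exceeds that), giving O(n * band) time - the same O(100000 * 205) bound as above.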
Solution 3
Just like Solution 2, we could have done it like this:
dp(int i, int j, int threshold = 205) {
    if( threshold == 0 )
        return 205;
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j, threshold - 1), dp(i-1, j-1, threshold - 1), dp(i, j-1, threshold - 1) );
}
You might be worried about dp(i-1, j-1) trickling down, but the threshold keeps i and j close together so it calculates a subset of Solution 2. This is because the threshold gets decremented every time i and j get farther apart. dp(i-1, j-1, threshold) would make it identical to Solution 2 (Thus, this one is slightly faster).
Space
These solutions will give you the answer very quickly. If you want a space-optimized solution as well, it is easy to replace str1[i] with (i in Str1CriticalPoints) ? 1 : 0, using a hashmap. This gives a final solution that is still very fast (though about 10x slower), and also avoids keeping the long strings in memory (to the point where it could run on an Arduino). I don't think this is necessary though.
Note that the original solution does not use 10^10 space. You mention "even better, less than 10^10 space", with an implication that 10^10 space would be acceptable. Unfortunately, even with enough RAM, iterating through that space takes 10^10 time, which is definitely not acceptable. None of my solutions use 10^10 space: only 2 * 10^5 to hold the strings - which can be avoided as discussed above. For reference, 10^10 bytes is 10 GB.
EDIT: As maniek notes, you only need to check abs(i - j) > 105, as the remaining 100 insertions needed to equate i and j will pull the number of operations above 200.

Proving that there are no overlapping sub-problems?

I just got the following interview question:
Given a list of float numbers, insert “+”, “-”, “*” or “/” between each consecutive pair of numbers to find the maximum value you can get. For simplicity, assume that all operators are of equal precedence order and evaluation happens from left to right.
Example:
(1, 12, 3) -> 1 + 12 * 3 = 39
If we built a recursive solution, we would get an O(4^N) solution. I tried to find overlapping sub-problems (to increase the efficiency of this algorithm) and wasn't able to find any. The interviewer then told me that there weren't any overlapping sub-solutions.
How can we detect when there are overlapping sub-problems and when there aren't? I spent a lot of time trying to "force" sub-solutions to appear, and eventually the interviewer told me that there weren't any.
My current solution looks as follows:
def maximumNumber(array, current_value=None):
    if current_value is None:
        current_value = array[0]
        array = array[1:]
    if len(array) == 0:
        return current_value
    return max(
        maximumNumber(array[1:], current_value * array[0]),
        maximumNumber(array[1:], current_value - array[0]),
        maximumNumber(array[1:], current_value / array[0]),
        maximumNumber(array[1:], current_value + array[0])
    )
Looking for "overlapping subproblems" sounds like you're trying to do bottom up dynamic programming. Don't bother with that in an interview. Write the obvious recursive solution. Then memoize. That's the top down approach. It is a lot easier to get working.
You may get challenged on that. Here was my response the last time that I was asked about that.
There are two approaches to dynamic programming, top down and bottom up. The bottom up approach usually uses less memory but is harder to write. Therefore I do the top down recursive/memoize and only go for the bottom up approach if I need the last ounce of performance.
It is a perfectly true answer, and I got hired.
Now you may notice that tutorials about dynamic programming spend more time on bottom up. They often even skip the top down approach. They do that because bottom up is harder. You have to think differently. It does provide more efficient algorithms because you can throw away parts of that data structure that you know you won't use again.
Coming up with a working solution in an interview is hard enough already. Don't make it harder on yourself than you need to.
EDIT Here is the DP solution that the interviewer thought didn't exist.
def find_best (floats):
    current_answers = {floats[0]: ()}
    floats = floats[1:]
    for f in floats:
        next_answers = {}
        for v, path in current_answers.iteritems():
            next_answers[v + f] = (path, '+')
            next_answers[v * f] = (path, '*')
            next_answers[v - f] = (path, '-')
            if 0 != f:
                next_answers[v / f] = (path, '/')
        current_answers = next_answers
    best_val = max(current_answers.keys())
    return (best_val, current_answers[best_val])
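Hypothetical usage (Python 2, to match the iteritems above):
print find_best([1, 12, 3])
# -> (39, (((), '+'), '*')), i.e. 1 + 12, then * 3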
Generally the overlapping-sub-problem approach is one where the problem is broken down into smaller sub-problems whose solutions, when combined, solve the big problem. When these sub-problems exhibit optimal substructure, DP is a good way to solve it.
The decision about what you do with a new number that you encounter has little to do with the numbers you have already processed. Other than accounting for signs, of course.
So I would say this has overlapping sub-problems but is not a dynamic programming problem. You could use divide and conquer or even more straightforward recursive methods.
Initially let's forget about negative floats.
process each new float according to the following rules
If the new float is less than 1, insert a / before it
If the new float is more than 1 insert a * before it
If it is 1 then insert a +.
If you see a zero just don't divide or multiply
This would solve it for all positive floats.
Now let's handle the case of negative numbers thrown into the mix.
Scan the input once to figure out how many negative numbers you have.
Isolate all the negative numbers in a list, converting every number whose absolute value is less than 1 to its multiplicative inverse. Then sort them by magnitude. If you have an even number of elements, we are all good. If you have an odd number of elements, store the head of this list in a special var, say k, associate a processed flag with it, and set the flag to False.
Proceed as before with some updated rules
If you see a negative number less than 0 but more than -1, insert a / before it
If you see a negative number less than -1, insert a * before it
If you see the special var and the processed flag is False, insert a - before it. Set processed to True.
There is one more optimization you can perform, which is removing pairs of negative ones as candidates for blanket subtraction from our initial negative-numbers list, but this is just an edge case and I'm pretty sure your interviewer won't care.
Now the sum is only a function of the number you are adding and not the sum you are adding to :)
Computing max/min results for each operation from previous step. Not sure about overall correctness.
Time complexity O(n), space complexity O(n)
const max_value = (nums) => {
    const ops = [(a, b) => a+b, (a, b) => a-b, (a, b) => a*b, (a, b) => a/b]
    // dp[i][j] = [max, min] value achievable for the prefix ending at nums[i],
    // where ops[j] was the operator applied last
    const dp = Array.from({length: nums.length}, _ => [])
    dp[0] = Array.from({length: ops.length}, _ => [nums[0], nums[0]])
    for (let i = 1; i < nums.length; i++) {
        for (let j = 0; j < ops.length; j++) {
            let mx = -Infinity
            let mn = Infinity
            for (let k = 0; k < ops.length; k++) {
                const opMax = ops[j](dp[i-1][k][0], nums[i])
                const opMin = ops[j](dp[i-1][k][1], nums[i])
                // Skip non-finite results (division by zero yields +/-Infinity in JS)
                if (!Number.isFinite(opMax) || !Number.isFinite(opMin)) continue
                mx = Math.max(opMax, opMin, mx)
                mn = Math.min(opMax, opMin, mn)
            }
            dp[i].push([mx, mn])
        }
    }
    // Filter out sentinel entries left where every candidate was skipped
    return Math.max(...dp[nums.length-1].map(v => Math.max(...v)).filter(Number.isFinite))
}
// Tests
console.log(max_value([1, 12, 3]))
console.log(max_value([1, 0, 3]))
console.log(max_value([17,-34,2,-1,3,-4,5,6,7,1,2,3,-5,-7]))
console.log(max_value([59, 60, -0.000001]))
console.log(max_value([0, 1, -0.0001, -1.00000001]))

Fortran multidimensional sub-array performance

While manipulating and assigning sub-arrays within multidimensional arrays in Fortran90, I stumbled across an interesting performance quirk.
Fortran90 introduced the ability to manipulate sub-sections of arrays, and I have seen a few places that recommend performing array operations using this "slicing" method instead of loops. For instance, if I have to add two arrays, a and b, of size 10, it is better to write:
c(1:10) = a(1:10) + b(1:10)
or
c = a + b
Instead of
do i = 1, 10
    c(i) = a(i) + b(i)
end do
I tried this method for simple one dimensional and two dimensional arrays and found it to be faster with the "slicing" notation. However, things began to get a little interesting when assigning such results within multidimensional arrays.
First of all, I must apologize for my rather crude performance measuring exercise. I am not even sure if the method I have adopted is the right way to time and test codes, but I am fairly confident about the qualitative results of the test.
program main
    implicit none

    integer, parameter :: mSize = 10000
    integer :: i, j
    integer :: pCnt, nCnt, cntRt, cntMx
    integer, dimension(mSize, mSize) :: a, b
    integer, dimension(mSize, mSize, 3) :: c

    pCnt = 0

    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "First call: ", nCnt-pCnt
    pCnt = nCnt

    do j = 1, mSize
        do i = 1, mSize
            a(i, j) = i*j
            b(i, j) = i+j
        end do
    end do

    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "Created Matrices: ", nCnt-pCnt
    pCnt = nCnt

    ! OPERATIONS BY SLICING NOTATION
    !c(1:mSize, 1:mSize, 1) = a + b
    !c(1:mSize, 1:mSize, 2) = a - b
    !c(1:mSize, 1:mSize, 3) = a * b

    ! OPERATIONS WITH LOOP
    do j = 1, mSize
        do i = 1, mSize
            c(i, j, 1) = a(i, j) + b(i, j)
            c(i, j, 2) = a(i, j) - b(i, j)
            c(i, j, 3) = a(i, j) * b(i, j)
        end do
    end do

    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "Added Matrices: ", nCnt-pCnt
    pCnt = nCnt
end program main
As can be seen, I have two methods of operating upon and assigning two large 2D arrays into a 3D array. I was heavily in favour of using the slicing notation as it helped me write shorter and more elegant looking code. But upon observing how severely sluggish my code was, I was forced to recheck the capacity of slicing notation over calculating within loops.
I ran the above code with and without -O3 flag using GNU Fortran 4.8.4 for Ubuntu 14.04
Without -O3 flag
a. Slicing notation
5 Runs - 843, 842, 842, 841, 859
Average - 845.4
b. Looped calculation
5 Runs - 1713, 1713, 1723, 1711, 1713
Average - 1714.6
With -O3 flag
a. Slicing notation
5 Runs - 545, 545, 544, 544, 548
Average - 545.2
b. Looped calculation
5 Runs - 479, 477, 475, 472, 472
Average - 475
I found it very interesting that without the -O3 flag, the slicing notation continued to perform much better than loops. However, using the -O3 flag causes this advantage to vanish completely; on the contrary, using array slicing notation becomes detrimental in this case.
In fact, with my rather large 3D parallel computation code, this is turning out to be a significant bottleneck. I strongly suspect that the formation of array temporaries during the assignment of a lower-dimensional array to a higher-dimensional array is the culprit here. But why did the optimization flag fail to optimize the assignment in this case?
Moreover, I feel that blaming the -O3 flag is not a respectable thing to do. So are array temporaries really the culprit? Is there something else I may be missing? Any insight will be extremely helpful in speeding up my code. Thanks!
When doing any performance comparison, you have to compare apples with apples and oranges with oranges. What I mean is that you are not really comparing the same thing. They are totally different, even if they produce the same result.
What comes into play here is memory management; think of cache misses during the operation. If you turn the loop version into 3 different loops, as suggested by haraldkl, you will certainly get similar performance.
What happens is that when you combine the 3 assignments in the same loop, there is a lot of cache reuse on the right-hand side, since all 3 share the same variables there. Each element of a or b is loaded into the cache and into registers only once for the loop version, while for the array-operation version, each element of a or b gets loaded 3 times. That is what makes the difference. The larger the array, the larger the difference, because you get more cache misses and more reloading of elements into the registers.
I don't know what the compiler really does, so this is not really an answer, but it is too much text for a comment...
I'd have the suspicion that the compiler expands the array notation into something like this:
do j = 1, mSize
    do i = 1, mSize
        c(i, j, 1) = a(i, j) + b(i, j)
    end do
end do

do j = 1, mSize
    do i = 1, mSize
        c(i, j, 2) = a(i, j) - b(i, j)
    end do
end do

do j = 1, mSize
    do i = 1, mSize
        c(i, j, 3) = a(i, j) * b(i, j)
    end do
end do
Of course, the compiler might still collapse these loops if written like that, so you might need to confuse it a little more, for example by writing something of c to the screen between the loops.

Generate all binary strings of length n with k bits set

What's the best algorithm to find all binary strings of length n that contain k bits set? For example, if n=4 and k=3, there are...
0111
1011
1101
1110
I need a good way to generate these given any n and any k so I'd prefer it to be done with strings.
This method will generate all integers with exactly N '1' bits.
From https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want
the next permutation of N 1 bits in a lexicographical sense. For
example, if N is 3 and the bit pattern is 00010011, the next patterns
would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011,
and so forth. The following is a fast way to compute the next
permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for
x86, the intrinsic is _BitScanForward. These both emit a bsf
instruction, but equivalents may be available for other architectures.
If not, then consider using one of the methods for counting the
consecutive zero bits mentioned earlier. Here is another version that
tends to be slower because of its division operator, but it does not
require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
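For convenience, here is the same trick transcribed to Python (a sketch; (v & -v).bit_length() - 1 stands in for __builtin_ctz):
def k_bit_numbers(n, k):
    if k == 0:
        yield 0
        return
    v = (1 << k) - 1                       # smallest integer with k bits set
    while v < (1 << n):
        yield v
        t = v | (v - 1)                    # set v's trailing 0s to 1
        ctz = (v & -v).bit_length() - 1    # count trailing zeros of v
        v = (t + 1) | (((~t & -~t) - 1) >> (ctz + 1))

print([format(x, '04b') for x in k_bit_numbers(4, 3)])
# -> ['0111', '1011', '1101', '1110']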
Python
import itertools
def kbits(n, k):
    result = []
    for bits in itertools.combinations(range(n), k):
        s = ['0'] * n
        for bit in bits:
            s[bit] = '1'
        result.append(''.join(s))
    return result
print kbits(4, 3)
Output: ['1110', '1101', '1011', '0111']
Explanation:
Essentially we need to choose the positions of the 1-bits. There are n choose k ways of choosing k bits among n total bits. itertools is a nice module that does this for us. itertools.combinations(range(n), k) will choose k bits from [0, 1, 2 ... n-1] and then it's just a matter of building the string given those bit indexes.
Since you aren't using Python, look at the pseudo-code for itertools.combinations here:
http://docs.python.org/library/itertools.html#itertools.combinations
Should be easy to implement in any language.
Forget about implementation ("to be done with strings" is obviously an implementation issue!) -- think about the algorithm, for Pete's sake... just as in your very first TAG, man!
What you're looking for is all combinations of K items out of a set of N (the indices, 0 to N-1 , of the set bits). That's obviously simplest to express recursively, e.g., pseudocode:
combinations(k, setN):
    if k > length(setN): return "no combinations possible"
    if k == 0: return "empty combination"
    # combinations including the first item:
    return ((first-item-of setN) combined combinations(k-1, all-but-first-of setN))
           union combinations(k, all-but-first-of setN)
i.e., the first item is either present or absent: if present, you have k-1 left to go (from the tail, aka all-but-first); if absent, you still have k left to go.
Pattern-matching functional languages like SML or Haskell may be best to express this pseudocode (procedural ones, like my big love Python, may actually mask the problem too deeply by including too-rich functionality, such as itertools.combinations, which does all the hard work for you and therefore HIDES it from you!).
What are you most familiar with, for this purpose -- Scheme, SML, Haskell, ...? I'll be happy to translate the above pseudocode for you. I can do it in languages such as Python too, of course -- but since the point is getting you to understand the mechanics for this homework assignment, I won't use too-rich functionality such as itertools.combinations, but rather recursion (and recursion-elimination, if needed) on more obvious primitives (such as head, tail, and concatenation). But please DO let us know what pseudocode-like language you're most familiar with! (You DO understand that the problem you state is identically equipotent to "get all combinations of K items out of range(N)", right?).
This C# method returns an enumerator that creates all combinations. As it creates the combinations as you enumerate them it only uses stack space, so it's not limited by memory space in the number of combinations that it can create.
This is the first version that I came up with. It's limited by the stack space to a length of about 2700:
static IEnumerable<string> BinStrings(int length, int bits) {
    if (length == 1) {
        yield return bits.ToString();
    } else {
        if (length > bits) {
            foreach (string s in BinStrings(length - 1, bits)) {
                yield return "0" + s;
            }
        }
        if (bits > 0) {
            foreach (string s in BinStrings(length - 1, bits - 1)) {
                yield return "1" + s;
            }
        }
    }
}
This is the second version, that uses a binary split rather than splitting off the first character, so it uses the stack much more efficiently. It's only limited by the memory space for the string that it creates in each iteration, and I have tested it up to a length of 10000000:
static IEnumerable<string> BinStrings(int length, int bits) {
    if (length == 1) {
        yield return bits.ToString();
    } else {
        int first = length / 2;
        int last = length - first;
        int low = Math.Max(0, bits - last);
        int high = Math.Min(bits, first);
        for (int i = low; i <= high; i++) {
            foreach (string f in BinStrings(first, i)) {
                foreach (string l in BinStrings(last, bits - i)) {
                    yield return f + l;
                }
            }
        }
    }
}
One problem with many of the standard solutions to this problem is that the entire set of strings is generated and then iterated through, which may exhaust memory. It quickly becomes unwieldy for any but the smallest sets. In addition, in many instances only a partial sampling is needed, but the standard (recursive) solutions generally chop the problem into pieces that are heavily biased to one direction (e.g. consider all the solutions with a zero starting bit, and then all the solutions with a one starting bit).
In many cases, it would be more desirable to be able to pass a bit string (specifying element selection) to a function and have it return the next bit string in such a way as to have a minimal change (this is known as a Gray code) and to have a representation of all the elements.
Donald Knuth covers a whole host of algorithms for this in his Fascicle 3A, section 7.2.1.3: Generating all Combinations.
There is an approach for tackling the iterative Gray Code algorithm for all ways of choosing k elements from n at http://answers.yahoo.com/question/index?qid=20081208224633AA0gdMl
with a link to final PHP code listed in the comment (click to expand it) at the bottom of the page.
One possible 1.5-liner:
$ python -c 'import itertools; \
print set([ n for n in itertools.permutations("0111", 4)])'
set([('1', '1', '1', '0'), ('0', '1', '1', '1'), ..., ('1', '0', '1', '1')])
.. where k is the number of 1s in "0111".
The itertools module explains equivalents for its methods; see the equivalent for the permutation method.
One algorithm that should work:
generate-strings(prefix, len, numBits):
    if (numBits == 0):
        print prefix + (len x "0")
        return
    if (len == numBits):
        print prefix + (len x "1")
        return
    generate-strings(prefix + "0", len-1, numBits)
    generate-strings(prefix + "1", len-1, numBits-1)
Good luck!
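A direct Python transcription of that pseudocode, in case you want to test it:
def generate_strings(prefix, length, num_bits):
    if num_bits == 0:
        print(prefix + '0' * length)
        return
    if length == num_bits:
        print(prefix + '1' * length)
        return
    generate_strings(prefix + '0', length - 1, num_bits)
    generate_strings(prefix + '1', length - 1, num_bits - 1)

generate_strings('', 4, 3)   # prints 0111, 1011, 1101, 1110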
In a more generic way, the below function will give you all possible index combinations for an N choose K problem which you can then apply to a string or whatever else:
def generate_index_combinations(n, k):
    possible_combinations = []

    def walk(current_index, indexes_so_far=None):
        indexes_so_far = indexes_so_far or []
        if len(indexes_so_far) == k:
            indexes_so_far = tuple(indexes_so_far)
            possible_combinations.append(indexes_so_far)
            return
        if current_index == n:
            return
        walk(current_index + 1, indexes_so_far + [current_index])
        walk(current_index + 1, indexes_so_far)

    if k == 0:
        return []
    walk(0)
    return possible_combinations
I would try recursion.
There are n digits with k of them 1s. Another way to view this is as a sequence of k+1 slots with the n-k 0s distributed among them. That is, (a run of 0s followed by a 1) k times, then followed by one more run of 0s. Any of these runs can be of length zero, but the total length of the runs needs to be n-k.
Represent this as an array of k+1 integers. Convert to a string at the bottom of the recursion.
Recursively call to depth n-k, a method that increments one element of the array before a recursive call and then decrements it, k+1 times.
At the depth of n-k, output the string.
int[] run = new int[k+1];

void recur(int depth, int first) {
    if (depth == 0) {
        output();
        return;
    }
    // Start at `first` so slot choices are non-decreasing; otherwise the same
    // distribution of zeros would be generated many times over.
    for (int i = first; i < k + 1; ++i) {
        ++run[i];
        recur(depth - 1, i);
        --run[i];
    }
}

public static void main(String[] arrrgghhs) {
    recur(n - k, 0);
}
It's been a while since I have done Java, so there are probably some errors in this code, but the idea should work.
Are strings faster than an array of ints? All the solutions prepending to strings probably result in a copy of the string at each iteration.
So probably the most efficient way would be an array of int or char that you append to. Java has efficient growable containers, right? Use that, if it's faster than string. Or if BigInteger is efficient, it's certainly compact, since each bit only takes a bit, not a whole byte or int. But then to iterate over the bits you need to & mask a bit, and bitshift the mask to the next bit position. So probably slower, unless JIT compilers are good at that these days.
I would post this as a comment on the original question, but my karma isn't high enough. Sorry.
Python (functional style)
Using Python's itertools.combinations you can generate all choices of k out of n and map those choices to a binary array with reduce:
from itertools import combinations
from functools import reduce # not necessary in python 2.x
def k_bits_on(k, n):
    one_at = lambda v, i: v[:i] + [1] + v[i+1:]
    return [tuple(reduce(one_at, c, [0]*n)) for c in combinations(range(n), k)]
Example usage:
In [4]: k_bits_on(2,5)
Out[4]:
[(0, 0, 0, 1, 1),
(0, 0, 1, 0, 1),
(0, 0, 1, 1, 0),
(0, 1, 0, 0, 1),
(0, 1, 0, 1, 0),
(0, 1, 1, 0, 0),
(1, 0, 0, 0, 1),
(1, 0, 0, 1, 0),
(1, 0, 1, 0, 0),
(1, 1, 0, 0, 0)]
This also answers the related question (marked as a duplicate of this one) of iterating over all the submasks of a mask in increasing order of their number of set bits.
We can simply iterate over all the submasks add them to a vector and sort it according to the number of set bits.
vector<long long> v;
for (long long i = mask; i > 0; i = (i - 1) & mask)
    v.push_back(i);
v.push_back(0);  // don't forget the empty submask
auto cmp = [](const auto &a, const auto &b) {
    return __builtin_popcountll(a) < __builtin_popcountll(b);
};
sort(v.begin(), v.end(), cmp);
Another way would be to iterate over all the submasks N times, adding a number to the vector if its number of set bits equals i in the ith iteration, as sketched below.
Both ways have complexity of O(n*2^n)
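A quick Python sketch of the second way (sweep the submasks n+1 times, keeping those whose popcount matches the current target):
def submasks_by_popcount(mask, n):
    out = []
    for target in range(n + 1):
        sub = mask
        while True:
            if bin(sub).count('1') == target:
                out.append(sub)
            if sub == 0:
                break
            sub = (sub - 1) & mask
    return out

print(submasks_by_popcount(0b101, 3))   # -> [0, 4, 1, 5]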
Best and Easy Solution
This is an easy problem. We just need to use Dynamic Programming.
I can give my solution, which stores integers. After that you can convert the integers to bit strings.
List<Long> dp[] = new List[m+1];
for (int i = 0; i <= m; i++) dp[i] = new ArrayList<>();
// dp[i] stores all possible bit masks of n length and i bits set
dp[0].add(0l);
for (int i = 1; i <= m; i++) {
    // transitions
    for (int j = 0; j < dp[i-1].size(); j++) {
        long num = dp[i-1].get(j);
        for (int p = 0; p < n; p++) {
            if ((num & (1l << p)) == 0) dp[i].add(num | (1l << p));
        }
    }
}
// dp[m] contains all possible numbers having m bits set of len n
But dp[m] contains duplicates, because adding 1 to 10 or 01 gives 11 two times. To handle that, we can use a HashSet:
Set<Long> set = new HashSet<>();
for (int i = 0; i < dp[m].size(); i++) set.add(dp[m].get(i));
If you want to solve this problem recursively, you can do it with a D&C algorithm:
def binlist(n, k, s):
    if n == 0:
        if s.count('1') == k:
            print(s)
    else:
        binlist(n-1, k, s+'1')
        binlist(n-1, k, s+'0')

binlist(5, 3, '')
The output will be:
11100
11010
11001
10110
10101
10011
01110
01101
01011
00111

Can diff be beaten at its own game?

I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.
What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.
I could use diff to compare these files, but I don't want to because:
diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.
Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.
Here are the constraints on the problem:
The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.
Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.
I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between them. The hitch is that the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files present in both lists will be in the same order.
For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.
Answer: This problem is called Longest Common Subsequence.
Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.
Answer: Sort them alphabetically and then compare.
Response: That would be O(m log m+n log n), which is worse than O(m+n).
This isn't quite O(1) memory; the memory requirement is on the order of the number of changes, but it's O(m+n) runtime.
It's essentially a buffered streaming algorithm that at any given line knows the difference of all previous lines.
// Pseudo-code:
initialize HashMap<Line, SourceFile> changes = new empty HashMap
while (lines left in A and B) {
    read in lineA from file A
    read in lineB from file B
    if (lineA.equals(lineB)) continue
    if (changes.contains(lineA) && changes.get(lineA).SourceFile != A) {
        changes.remove(lineA)
    } else {
        changes.add(lineA, A)
    }
    if (changes.contains(lineB) && changes.get(lineB).SourceFile != B) {
        changes.remove(lineB)
    } else {
        changes.add(lineB, B)
    }
}
for each (line in longerFile) {
    if (changes.contains(line) && changes.get(line).SourceFile != longerFile) {
        changes.remove(line)
    } else {
        changes.add(line, longerFile)
    }
}
Lines in the HashMap from SourceFile == A have been removed
Lines in the HashMap from SourceFile == B have been added
This heavily relies on the fact that the files are listed in the same relative order. Otherwise, the memory requirement would be much larger than the number of changes. However, due to that ordering this algorithm shouldn't use much more memory than 2 * numChanges.
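For reference, a runnable Python rendering of the pseudocode above (the names are mine):
def snapshot_diff(lines_a, lines_b):
    changes = {}                    # line -> the file it is currently unmatched in
    def toggle(line, source):
        # a line seen once in each file cancels out; left over, it's a change
        if line in changes and changes[line] != source:
            del changes[line]
        else:
            changes[line] = source
    for line_a, line_b in zip(lines_a, lines_b):
        if line_a == line_b:
            continue
        toggle(line_a, 'A')
        toggle(line_b, 'B')
    shorter = min(len(lines_a), len(lines_b))
    longer, name = (lines_a, 'A') if len(lines_a) > len(lines_b) else (lines_b, 'B')
    for line in longer[shorter:]:
        toggle(line, name)
    removed = [l for l, s in changes.items() if s == 'A']
    added = [l for l, s in changes.items() if s == 'B']
    return removed, added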
Read one file, placing each file-name into a HashSet-like data structure with O(1) add and O(1) contains implementations.
Then read the second file, checking each file-name against the HashSet.
Total algorithm if file one has length m and the second file has length n is O(m+n) as required.
Note: This algorithm assumes the data-set fits comfortably in physical memory to be fast.
If the data set cannot easily fit in memory, the lookup could be implemented using some variation of a B-Tree with disk paging. The complexity would then be O(m log m) to initially set up and O(n log m) for each other file compare.
From a theoretical point of view, comparing the edit distance between two strings (because here you have strings in a funny language where a 'character' is a file name) cannot be made O(m+n). But here we have simplifications.
An implementation of an algorithm for your case (it may contain mistakes):
# i[0], i[1] are undoable iterables; at the end they both return Null
while (a = i[0].next()) && (b = i[1].next()):     # read one item from each stream
    if a != b:                                    # skip if they are identical
        c = [[a],[b]]                             # otherwise, prepare two fast arrays to store the difference
        for (w = 1; ; w = 1-w):                   # and read from one stream at a time
            nxi = Null
            if (nx = i[1-w].next()) in c[w]:      # if we read a new character that matches
                nxi = c[w].index(nx)
            if nx is Null: nxi = -1               # or if we read end of stream
            if nxi is not Null:                   # then output that we found some diff
                for cc in c[1-w]: yield cc        # the ones stored
                for cc in c[w][0:nxi-1]: yield cc # and the ones stored before nx
                for cc in c[w][nxi+1:]: i[w].undo(cc)  # as for the remainder - put it back
                break                             # and return to the normal cycle
# one of them finished
if a: yield a
if b: yield b
for ci in i:
    while (cc = ci.next()): yield cc
There are data structures that I call fast arrays -- they are probably HashSet things, but ones that remember ordering. Addition and lookup in them should be O(log N), with memory use O(N).
This doesn't use any memory or cycles beyond O(m+n) outside of finding differences. For every 'difference block' -- the operation that can be described as taking away M consecutive items and adding N ones -- this takes O(M+N) memory and O(M log N + N log M) instructions. The memory is released after a block is done, so this isn't much of a thing if you indeed only have small changes. Of course, the worst-case performance is as bad as with the generic method.
In practice, a log factor difference in sorting times is probably insignificant -- sort can sort hundreds of thousands of lines in a few seconds. So you don't actually need to write any code:
sort filelist1 > filelist1.sorted
sort filelist2 > filelist2.sorted
comm -3 filelist1.sorted filelist2.sorted > changes
I'm not claiming that this is necessarily the fastest solution -- I think Ben S's accepted answer will be, at least above some value of N. But it's definitely the simplest, it will scale to any number of files, and (unless you are the guy in charge of Google's backup operation) it will be more than fast enough for the number of files you have.
If you accept that dictionaries (hash maps) are O(n) space and O(1) insert/lookup, this solution ought to be O(m+n) in both time and space.
from collections import defaultdict
def diff(left, right):
    left_map, right_map = defaultdict(list), defaultdict(list)
    for index, object in enumerate(left): left_map[object] += [index]
    for index, object in enumerate(right): right_map[object] += [index]
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left_map[right[j]]:
            i2 = left_map[right[j]].pop(0)
            if i2 < i: continue
            del right_map[right[j]][0]
            for i in range(i, i2): print '<', left[i]
            print '=', left[i2], right[j]
            i, j = i2 + 1, j + 1
        elif right_map[left[i]]:
            j2 = right_map[left[i]].pop(0)
            if j2 < j: continue
            del left_map[left[i]][0]
            for j in range(j, j2): print '>', right[j]
            print '=', left[i], right[j2]
            i, j = i + 1, j2 + 1
        else:
            print '<', left[i]
            i = i + 1
    for j in range(j, len(right)): print '>', right[j]
>>> diff([1, 2, 1, 1, 3, 5, 2, 9],
... [ 2, 1, 3, 6, 5, 2, 8, 9])
< 1
= 2 2
= 1 1
< 1
= 3 3
> 6
= 5 5
= 2 2
> 8
= 9 9
Okay, slight cheating as list.append and list.__delitem__ are only O(1) if they're linked lists, which isn't really true... but that's the idea, anyhow.
A refinement of ephemient's answer, this only uses extra memory when there are changes.
def diff(left, right):
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            print '=', left[i], right[j]
            i, j = i+1, j+1
            continue
        old_i, old_j = i, j
        left_set, right_set = set(), set()
        while i < len(left) or j < len(right):
            if i < len(left) and left[i] in right_set:
                for i2 in range(old_i, i): print '<', left[i2]
                j = old_j
                break
            elif j < len(right) and right[j] in left_set:
                for j2 in range(old_j, j): print '>', right[j2]
                i = old_i
                break
            else:
                if i < len(left): left_set.add(left[i])
                if j < len(right): right_set.add(right[j])
                i, j = i+1, j+1
    while i < len(left):
        print '<', left[i]
        i = i+1
    while j < len(right):
        print '>', right[j]
        j = j+1
Comments? Improvements?
I've been after a program to diff large files without running out of memory, but found nothing to fit my purposes. I'm not interested in using the diffs for patching (then I'd probably use rdiff from librdiff), but for visually inspecting the diffs, maybe turning them into word-diffs with dwdiff --diff-input (which reads the unified diff format) and perhaps collecting the word-diffs somehow.
(My typical use case: I have some NLP tool that I use to process a large text corpus. I run it once, get a file that's 122760246 lines long, I make a change to my tool, run it again, get a file that differs like every million lines, maybe two insertions and a deletion, or just one line differs, that kind of thing.)
Since I couldn't find anything, I just made a little script https://github.com/unhammer/diff-large-files – it works (dwdiff accepts it as input), it's fast enough (faster than the xz process that often runs after it in the pipeline), and most importantly it doesn't run out of memory.
I would read the lists of files into two sets and find those file names that are unique to either list.
In Python, something like:
files1 = set(line.strip() for line in open('list1.txt'))
files2 = set(line.strip() for line in open('list2.txt'))
print('\n'.join(files1.symmetric_difference(files2)))
