Can diff be beaten at its own game?

I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.
What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.
I could use diff to compare these files, but I don't want to because:
diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.
Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.
Here are the constraints on the problem:
The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.
Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.
I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files that are present in both lists will be in the same order.
For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.
Answer: This problem is called Longest Common Subsequence.
Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.
Answer: Sort them alphabetically and then compare.
Response: That would be O(m log m+n log n), which is worse than O(m+n).

This isn't quite O(1) memory; the memory requirement is on the order of the number of changes, but it's O(m+n) in runtime.
It's essentially a buffered streaming algorithm that, at any given line, knows the net difference of all previous lines.
// Pseudo-code:
initialize HashMap<Line, SourceFile> changes = new empty HashMap

while (lines left in A and B) {
    read in lineA from file A
    read in lineB from file B
    if (lineA.equals(lineB)) continue

    if (changes.contains(lineA) && changes.get(lineA).SourceFile != A) {
        changes.remove(lineA)
    } else {
        changes.add(lineA, A)
    }

    if (changes.contains(lineB) && changes.get(lineB).SourceFile != B) {
        changes.remove(lineB)
    } else {
        changes.add(lineB, B)
    }
}

for each (line in longerFile) {
    if (changes.contains(line) && changes.get(line).SourceFile != longerFile) {
        changes.remove(line)
    } else {
        changes.add(line, longerFile)
    }
}
Lines in the HashMap from SourceFile == A have been removed
Lines in the HashMap from SourceFile == B have been added
This heavily relies on the fact that the files are listed in the same relative order. Otherwise, the memory requirement would be much larger than the number of changes. However, due to that ordering this algorithm shouldn't use much more memory than 2 * numChanges.
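Expressed as a minimal Python sketch (the names snapshot_diff and note are illustrative, not from the pseudocode above; it assumes the listings are compared line by line exactly as described):

from itertools import zip_longest

def snapshot_diff(path_a, path_b):
    """Streaming diff of two file listings that share their relative order.
    Returns (removed, added): lines only in A and lines only in B."""
    changes = {}  # line -> 'A' or 'B', i.e. which file it is currently unmatched in

    def note(line, source):
        other = 'B' if source == 'A' else 'A'
        if changes.get(line) == other:
            del changes[line]        # seen in both files: not a real change
        else:
            changes[line] = source

    with open(path_a) as fa, open(path_b) as fb:
        for line_a, line_b in zip_longest(fa, fb):
            if line_a == line_b:
                continue
            if line_a is not None:
                note(line_a, 'A')
            if line_b is not None:
                note(line_b, 'B')

    removed = [line for line, src in changes.items() if src == 'A']
    added = [line for line, src in changes.items() if src == 'B']
    return removed, added

As in the pseudocode, the dictionary only ever holds lines that are currently unmatched, so its size stays on the order of the number of changes.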

Read one file, placing each file-name into a HashSet-like data structure with O(1) add and O(1) contains implementations.
Then read the second file, checking each file-name against the HashSet.
Total algorithm if file one has length m and the second file has length n is O(m+n) as required.
Note: This algorithm assumes the data-set fits comfortably in physical memory to be fast.
If the data set cannot easily fit in memory, the lookup could be implemented using some variation of a B-Tree with disk paging. The complexity would then be O(m log m) to set up initially and O(n log m) for each subsequent file comparison.
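A short Python sketch of the in-memory variant, assuming one file name per line (added_files is an illustrative name; swap the arguments to get the deleted names instead):

def added_files(old_listing, new_listing):
    seen = set()
    with open(old_listing) as f:
        for line in f:
            seen.add(line.rstrip('\n'))      # O(1) expected insert
    added = []
    with open(new_listing) as f:
        for line in f:
            name = line.rstrip('\n')
            if name not in seen:             # O(1) expected lookup
                added.append(name)
    return added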

From a theoretical point of view, computing the edit distance between two strings (because here you have strings in a funny language where a 'character' is a file name) cannot be done in O(m+n). But here we have simplifications.
An implementation of an algorithm in your case (should contain mistakes):
# i[0], i[1] are undoable iterables; at the end they both return Null
while (a = i[0].next()) && (b = i[1].next()):   # read one item from each stream
    if a != b:                                  # skip if they are identical
        c = [[a],[b]]                           # otherwise, prepare two fast arrays to store difference
        for (w = 1; ; w = 1-w)                  # and read from one stream at a time
            nxi = Null
            if (nx = i[1-w].next()) in c[w]:    # if we read a new character that matches
                nxi = c[w].index(nx)
            if nx is Null: nxi = -1             # or if we read end of stream
            if nxi is not Null:                 # then output that we found some diff
                for cc in c[1-w]: yield cc              # the ones stored
                for cc in c[w][0:nxi-1]: yield cc       # and the ones stored before nx
                for cc in c[w][nxi+1:]: i[w].undo(cc)   # about the remainder - put it back
                break                                   # and return back to normal cycle

# one of them finished
if a: yield a
if b: yield b
for ci in i:
    while (cc = ci.next()): yield cc
There are data structures that I call fast arrays -- they are probably HashSet-like things, but ones that remember ordering. Addition and lookup in them should be O(log N), and the memory use O(N).
This doesn't use any memory or cycles beyond O(m+n) outside of finding differences. For every 'difference block' -- an operation that can be described as taking away M consecutive items and adding N new ones -- this takes O(M+N) memory and O(M log N + N log M) instructions. The memory is released after a block is done, so this isn't much of an issue if you indeed only have small changes. Of course, the worst-case performance is as bad as with the generic method.

In practice, a log factor difference in sorting times is probably insignificant -- sort can sort hundreds of thousands of lines in a few seconds. So you don't actually need to write any code:
sort filelist1 > filelist1.sorted
sort filelist2 > filelist2.sorted
comm -3 filelist1.sorted filelist2.sorted > changes
I'm not claiming that this is necessarily the fastest solution -- I think Ben S's accepted answer will be, at least above some value of N. But it's definitely the simplest, it will scale to any number of files, and (unless you are the guy in charge of Google's backup operation) it will be more than fast enough for the number of files you have.

If you accept that dictionaries (hash maps) are O(n) space and O(1) insert/lookup, this solution ought to be O(m+n) in both time and space.
from collections import defaultdict
def diff(left, right):
    left_map, right_map = defaultdict(list), defaultdict(list)
    for index, object in enumerate(left): left_map[object] += [index]
    for index, object in enumerate(right): right_map[object] += [index]
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left_map[right[j]]:
            i2 = left_map[right[j]].pop(0)
            if i2 < i: continue
            del right_map[right[j]][0]
            for i in range(i, i2): print '<', left[i]
            print '=', left[i2], right[j]
            i, j = i2 + 1, j + 1
        elif right_map[left[i]]:
            j2 = right_map[left[i]].pop(0)
            if j2 < j: continue
            del left_map[left[i]][0]
            for j in range(j, j2): print '>', right[j]
            print '=', left[i], right[j2]
            i, j = i + 1, j2 + 1
        else:
            print '<', left[i]
            i = i + 1
    for i in range(i, len(left)): print '<', left[i]
    for j in range(j, len(right)): print '>', right[j]
>>> diff([1, 2, 1, 1, 3, 5, 2, 9],
... [ 2, 1, 3, 6, 5, 2, 8, 9])
< 1
= 2 2
= 1 1
< 1
= 3 3
> 6
= 5 5
= 2 2
> 8
= 9 9
Okay, slight cheating as list.append and list.__delitem__ are only O(1) if they're linked lists, which isn't really true... but that's the idea, anyhow.

A refinement of ephemient's answer, this only uses extra memory when there are changes.
def diff(left, right):
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            print '=', left[i], right[j]
            i, j = i+1, j+1
            continue
        old_i, old_j = i, j
        left_set, right_set = set(), set()
        while i < len(left) or j < len(right):
            if i < len(left) and left[i] in right_set:
                for i2 in range(old_i, i): print '<', left[i2]
                j = old_j
                break
            elif j < len(right) and right[j] in left_set:
                for j2 in range(old_j, j): print '>', right[j2]
                i = old_i
                break
            else:
                if i < len(left): left_set.add(left[i])
                if j < len(right): right_set.add(right[j])
                i, j = i+1, j+1
        else:
            # ran off the end of both lists without re-synchronising:
            # everything from the start of this block is a difference
            for i2 in range(old_i, len(left)): print '<', left[i2]
            for j2 in range(old_j, len(right)): print '>', right[j2]
    while i < len(left):
        print '<', left[i]
        i = i+1
    while j < len(right):
        print '>', right[j]
        j = j+1
Comments? Improvements?

I've been after a program to diff large files without running out of memory, but found nothing to fit my purposes. I'm not interested in using the diffs for patching (then I'd probably use rdiff from librdiff), but for visually inspecting the diffs, maybe turning them into word-diffs with dwdiff --diff-input (which reads the unified diff format) and perhaps collecting the word-diffs somehow.
(My typical use case: I have some NLP tool that I use to process a large text corpus. I run it once, get a file that's 122760246 lines long, I make a change to my tool, run it again, get a file that differs like every million lines, maybe two insertions and a deletion, or just one line differs, that kind of thing.)
Since I couldn't find anything, I just made a little script https://github.com/unhammer/diff-large-files – it works (dwdiff accepts it as input), it's fast enough (faster than the xz process that often runs after it in the pipeline), and most importantly it doesn't run out of memory.

I would read the lists of files into two sets and find those file names that are unique to either list.
In Python, something like:
files1 = set(line.strip() for line in open('list1.txt'))
files2 = set(line.strip() for line in open('list2.txt'))
print('\n'.join(files1.symmetric_difference(files2)))

Related

Find the index of an element in a fixed amount of time O(1)

I've been trying to solve this question for an hour and just can't find a way to do it.
The question is as follows:
A sorted list of length N. There might be duplicates in the list.
Given an element x, you need to find the last index of x in the list.
If x does not exist, return a relevant message.
Note: The model is CREW (Concurrent Read Exclusive Write) - meaning concurrent read is allowed, but write is exclusive meaning concurrent write is not allowed.
1) Describe a parallel algorithm that uses N CPUs and solves the problem in a fixed amount of time (I guess they mean O(1)).
2) Explain why the algorithm described is correct.
I assume the input is a 0-indexed, sorted (increasing) array A[] of length N.
Initialise a shared result variable with the value UNSET:
RESULT := "UNSET"
Start N CPUs with the following program, parameterized by i (from 0 to N-1):
CPU(i):
    if i == 0 and A[0] > x {
        RESULT = "NO SOLUTION"
    } else if A[i] == x and (i + 1 == N or A[i+1] > x) {
        RESULT = i
    } else if A[i] < x and (i + 1 == N or A[i+1] > x) {
        RESULT = "NO SOLUTION"
    }
The program has terminated when RESULT is updated.
Note that exactly one CPU writes to RESULT (because the input is sorted), so there's never a concurrent write, but each array location except the first is read by two CPUs. Each CPU does a fixed amount of work, so the program terminates in a fixed amount of time.
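A small sequential Python simulation of that program, in which each loop iteration plays the role of one CPU (last_index_crew is an illustrative name; this only checks the logic and obviously does not demonstrate the O(1) parallel running time):

def last_index_crew(A, x):
    N = len(A)
    result = "UNSET"
    for i in range(N):                       # conceptually, these all run concurrently
        if i == 0 and A[0] > x:
            result = "NO SOLUTION"
        elif A[i] == x and (i + 1 == N or A[i + 1] > x):
            result = i
        elif A[i] < x and (i + 1 == N or A[i + 1] > x):
            result = "NO SOLUTION"
    return result

# last_index_crew([1, 2, 2, 2, 5], 2) -> 3
# last_index_crew([1, 2, 2, 2, 5], 4) -> "NO SOLUTION"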

Quickly generating the "triangle sequence": avoiding mispredictions

I'm interested in calculating the triangle sequence1, which is the sequence of pairs (i, j): (0, 0), (1, 0), (1, 1), (2, 0), (2, 1) ...
which iterates through all pairs (i, j) with the restriction that i >= j. The same sequence but with the restriction i > j is also interesting.
These sequences represent, among other things, all the ways to choose 2 (possibly identical) elements from an n-element set (for the sequence up to (n, n)2), or the indices of the lower triangular elements of a matrix3. The sequence of values for i alone is A003056 in OEIS, while j alone is A002262. The sequence frequently arises in combinatorial algorithms, where performance may be critical.
A simple but branchy way to generate the next value in the sequence is:
if (i == j) {
    j = 0;
    i++;
} else {
    j++;
}
However, this suffers from many mispredicts while calculating the initial elements of the sequence, when checking the condition (i == j) - generally one mispredict each time i is incremented. As the sequence grows, the number of mispredicts becomes lower since i is incremented with reduced frequency, so the j++ branch dominates and is well predicted. Still, some types of combinatorial search repeatedly iterate over the small terms in the sequence, so I'm looking for a branch-free approach or some other approach that suffers fewer mispredicts.
For many uses, the order of the sequences isn't as important, so generating the values in a different order than above is allowable if it leads to a better solution. For example, j could count down rather than up: (0, 0), (1, 1), (1, 0), (2, 2), (2, 1), ....
1 I'm also interested in knowing what the right name for this sequence is (perhaps so I make a better title for this question). I just kind of made up "triangle sequence".
2 Here, the i >= j version represents sub-multisets (repetition allowed), while the i > j variant represents normal subsets (no repetition).
3 Here, the i >= j version includes the main diagonal, while the i > j variant excludes it.
Here are two branch-free approaches that do not use any expensive calculations. First one uses comparison and logical AND:
const bool eq = i == j;
i += eq;
j = (j + 1) & (eq - 1);
Second one uses comparison and multiplication:
const bool eq = i == j;
i += eq;
j = (j + 1) * (1 - eq);
In theory the "multiplication" variant should be slower than the "logical" one, but measurements show very little difference.
Both approaches result in branchless code only on processors that allow branchless comparisons (for example x86). Also, these approaches assume a language where the result of a conditional expression can easily be converted to an integer (for example C/C++, where "false" comparisons convert to zero and "true" ones to 1).
The only problem with these approaches is performance. In theory they could outperform branchy code, but only when mispredicts are really frequent. A simple test where there is no other work besides generating the "triangle sequence" (see it on ideone) shows a miserable mispredict rate, and therefore both branchless methods are about 3 times slower than the branchy one. The explanation is simple: there should not be many mispredicts for longer sequences, and as for shorter ones, modern processors have very good branch predictors that almost never fail on short branch patterns. So we have few mispredicts; the branchy code almost always executes only 2 instructions (compare, increment), while the branchless code executes both the active and inactive "branches" plus some instructions specific to the branchless approach.
In case you want to repeatedly iterate over the small terms in the sequence, a different approach would probably be preferable: calculate the sequence only once, then repeatedly read it from memory.
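A minimal sketch of that precompute-and-reuse idea (triangle_pairs and PAIRS are illustrative names, not from the answer above):

def triangle_pairs(n):
    # all (i, j) with 0 <= j <= i < n, in the order given in the question
    return [(i, j) for i in range(n) for j in range(i + 1)]

PAIRS = triangle_pairs(1000)     # build once...
for i, j in PAIRS:               # ...then iterate over the stored pairs without the i == j test
    pass                         # per-pair combinatorial work goes here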
In Python we can express this as:
i, j = i + (i == j), (j + 1) * (i != j)
but it turns out that, at around a million iterations or so on my machine, the following, more long-winded, lazy-evaluation code is about 20% faster:
from itertools import count, repeat

def gen_i():
    """ A003056 """
    for x in count(0):                  # infinitely counts up
        yield from repeat(x, x + 1)     # replication

def gen_j():
    """ A002262 """
    for x in count(0):                  # infinitely counts up
        yield from range(x + 1)         # count up to (including) x

sequence = zip(gen_i(), gen_j())

for _ in range(1000000):
    i, j = next(sequence)
In the above code, gen_i(), gen_j(), count(), repeat(), and zip() are all generators (and range() is an iterator) so sequence continues to call into the code on demand as new (i, j) pairs are required. I assume both the implementation of range() and repeat() terminate with a misprediction.
Simple isn't necessarily also quick (e.g. consider all the unnecessary additions of zero and multiplications by one in the compact form).
So which is more important, quickly generating the sequence or avoiding mispredictions?
You can derive j from i:
...set val...
old_j = j;
j = (j + 1) % (i + 1);
if (i == old_j) {
i++;
}
...loop if more...
And further derive i increment from j and current i:
...set val...
old_j = j;
j = (j + 1) % (i + 1);
i = i + (old_j + 1) / (i + 1);   // integer division: adds 1 exactly when old_j == i
...loop if more...
(Can't test it at the moment... Please review)
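One quick way to review it is to compare a few hundred terms of the recurrences against the branchy version, for example with a throwaway Python check (not part of the original answer; // is integer division, matching the C code above):

def branchy(n):
    i = j = 0
    pairs = []
    for _ in range(n):
        pairs.append((i, j))
        if i == j:
            j = 0
            i += 1
        else:
            j += 1
    return pairs

def derived(n):
    i = j = 0
    pairs = []
    for _ in range(n):
        pairs.append((i, j))
        old_j = j
        j = (j + 1) % (i + 1)
        i += (old_j + 1) // (i + 1)      # adds 1 exactly when old_j == i
    return pairs

assert branchy(300) == derived(300)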

Number of partitions with a given constraint

Consider a set of 13 Danish, 11 Japanese and 8 Polish people. It is well known that the number of different ways of dividing this set of people into groups is the 13+11+8 = 32nd Bell number (the number of set partitions). However, we are asked to find the number of possible set partitions under a given constraint. The question is as follows:
A set partition is said to be good if it has no group of at least two people that includes only a single nationality. How many good partitions are there for this set? (A group may include only one person.)
The brute-force approach requires going through about 10^26 partitions and checking which ones are good. This seems pretty infeasible, especially if the groups are larger or other nationalities are introduced. Is there a smarter way instead?
EDIT: As a side note, there probably is no hope for a really nice solution. A highly esteemed expert in combinatorics answered a related question which, I think, basically says that the related problem, and thus this problem too, is very difficult to solve exactly.
Here's a solution using dynamic programming.
It starts from an empty set, then adds one element at a time and calculates all the valid partitions.
The state space is huge, but notice that to calculate the next step we only need to know the following things about a partition:
For each nationality, how many sets the partition contains that consist of only a single member of that nationality. (e.g.: {a})
How many sets it contains with mixed elements. (e.g.: {a, b, c})
For each of these configurations I only store the total count. Example:
[0, 1, 2, 2] -> 3
{a}{b}{c}{mixed}
e.g.: 3 partitions that look like: {b}, {c}, {c}, {a,c}, {b,c}
Here's the code in python:
import collections
from operator import mul
from fractions import Fraction

def nCk(n, k):
    return int( reduce(mul, (Fraction(n-i, i+1) for i in range(k)), 1) )

def good_partitions(l):
    n = len(l)
    i = 0
    prev = collections.defaultdict(int)
    while l:
        # any more from this kind?
        if l[0] == 0:
            l.pop(0)
            i += 1
            continue
        l[0] -= 1
        curr = collections.defaultdict(int)
        for solution, total in prev.iteritems():
            for idx, item in enumerate(solution):
                my_solution = list(solution)
                if idx == i:
                    # add element as a new set
                    my_solution[i] += 1
                    curr[tuple(my_solution)] += total
                elif my_solution[idx]:
                    if idx != n:
                        # add to a set consisting of one element
                        # or merge into multiple sets that consist of one element
                        cnt = my_solution[idx]
                        c = cnt
                        while c > 0:
                            my_solution = list(solution)
                            my_solution[n] += 1
                            my_solution[idx] -= c
                            curr[tuple(my_solution)] += total * nCk(cnt, c)
                            c -= 1
                    else:
                        # add to a mixed set
                        cnt = my_solution[idx]
                        curr[tuple(my_solution)] += total * cnt
        if not prev:
            # one set with one element
            lone = [0] * (n+1)
            lone[i] = 1
            curr[tuple(lone)] = 1
        prev = curr
    return sum(prev.values())
print good_partitions([1, 1, 1, 1]) # 15
print good_partitions([1, 1, 1, 1, 1]) # 52
print good_partitions([2, 1]) # 4
print good_partitions([13, 11, 8]) # 29811734589499214658370837
It produces correct values for the test cases. I also tested it against a brute-force solution (for small values), and it produces the same results.
An exact analytic solution is hard, but a polynomial time+space dynamic programming solution is straightforward.
First of all, we need an absolute order on the size of groups. We do that by comparing how many Danes, Japanese, and Poles we have.
Next, the function to write is this one.
m is the maximum group size we can emit
p is the number of people of each nationality that we have left to split
max_good_partitions_of_maximum_size(m, p) is the number of "good partitions"
we can form from p people, with no group being larger than m
Clearly you can write this as a somewhat complicated recursive function that always selects the next group to use, then calls itself with that group's size as the new maximum and subtracts the group from p. If you had this function, then your answer would simply be max_good_partitions_of_maximum_size(p, p) with p = [13, 11, 8]. But that is going to be a brute-force search that won't run in reasonable time.
Finally apply https://en.wikipedia.org/wiki/Memoization by caching every call to this function, and it will run in polynomial time. However you will also have to cache a polynomial number of calls to it.

abstract inplace mergesort for effective merge sort

I am reading about merge sort in Algorithms in C++ by Robert Sedgewick and have the following questions.
static void mergeAB(ITEM[] c, int cl, ITEM[] a, int al, int ar, ITEM[] b, int bl, int br)
{
    int i = al, j = bl;
    for (int k = cl; k < cl+ar-al+br-bl+1; k++)
    {
        if (i > ar) { c[k] = b[j++]; continue; }
        if (j > br) { c[k] = a[i++]; continue; }
        c[k] = less(a[i], b[j]) ? a[i++] : b[j++];
    }
}
The characteristic of the basic merge that is worthy of note is that the inner loop includes two tests to determine whether the ends of the two input arrays have been reached. Of course, these two tests usually fail, and the situation thus cries out for the use of sentinel keys to allow the tests to be removed. That is, if elements with a key value larger than those of all the other keys are added to the ends of the a and aux arrays, the tests can be removed, because when the a (b) array is exhausted, the sentinel causes the next elements for the c array to be taken from the b (a) array until the merge is complete.
However, it is not always easy to use sentinels, either because it might not be easy to know the largest key value or because space might not be available conveniently.
For merging, there is a simple remedy. The method is based on the following idea: Given that we are resigned to copying the arrays to implement the in-place abstraction, we simply put the second array in reverse order when it is copied (at no extra cost), so that its associated index moves from right to left. This arrangement leads to the largest element—in whichever array it is—serving as sentinel for the other array.
My questions on above text
What does the statement "when the a (b) array is exhausted" mean? What is 'a (b)' here?
Why does the author mention that it is not always easy to know the largest key value, and how is space related to determining the largest key?
What does the author mean by "Given that we are resigned to copying the arrays"? What does "resigned" mean in this context?
Could you give a simple example to help in understanding the idea mentioned as the simple remedy?
"When the a (b) array is exhausted" is a shorthand for "When either the a array or the b array is exhausted".
The interface is dealing with sub-arrays of a bigger array, so you can't simply go writing beyond the ends of the arrays.
The code copies the data from two arrays into one other array. Since this copy is inevitable, we are 'resigned to copying the arrays' means we reluctantly accept that it is inevitable that the arrays must be copied.
Tricky...that's going to take some time to work out what is meant.
Tangentially: That's probably not the way I'd write the loop. I'd be inclined to use:
int i = al, j = bl, k = cl;
for (; i <= ar && j <= br; k++)
{
    if (a[i] < b[j])
        c[k] = a[i++];
    else
        c[k] = b[j++];
}
while (i <= ar)
    c[k++] = a[i++];
while (j <= br)
    c[k++] = b[j++];
One of the two trailing loops does nothing. The revised main merge loop has 3 tests per iteration versus 4 tests per iteration for the original algorithm. I've not formally measured it, but the simpler merge loop is likely to be quicker than the original single-loop algorithm.
The first three questions are almost best suited for English Language Learners.
a(b) and b(a)
Sometimes parentheses are used to state two or more similar phrases at once:
when a (b) is exhausted we copy elements from b (a)
means:
when a is exhausted we copy elements from b,
when b is exhausted we copy elements from a
What is difficult about sentinels
Two annoying things about sentinels are
sometimes your array data may potentially contain every possible value, so there is no value you can use as a sentinel that is guaranteed to be bigger than all the values in the array
to use a sentinel instead of checking the index to see if you are done with an array requires that you have room for one extra space in the array to store the sentinel
Resigning
We programmers are never happy to copy (or move) things around; leaving them where they already are is better, if possible (because we are lazy).
In this version of the merge sort we have already given up on trying not to copy things around... we are resigned to it.
Given that we must copy, we can copy things in the opposite order if we like (and of course use the copy in opposite order) because that is free(*).
(*) It is free at this level of abstraction; the cost on some real CPU may be high. As almost always in the performance area, YMMV.
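To make the remedy concrete, here is a small Python sketch of a merge that copies the second half in reverse (an illustration of the idea only, not Sedgewick's actual in-place code; the name is made up):

def merge_with_reverse_copy(a, b):
    # aux holds a followed by reversed(b); the largest element -- in whichever
    # input it is -- ends up between the two indices and serves as a sentinel
    # for the other side, so neither index can run off the end of its own half.
    aux = list(a) + list(reversed(b))
    i, j = 0, len(aux) - 1
    merged = []
    for _ in range(len(aux)):
        if aux[i] <= aux[j]:
            merged.append(aux[i]); i += 1
        else:
            merged.append(aux[j]); j -= 1
    return merged

# merge_with_reverse_copy([1, 3, 5], [2, 4, 6]) -> [1, 2, 3, 4, 5, 6]

The loop has no end-of-array tests at all; the iteration count is fixed and the comparison alone decides which side to take next.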

Algorithm for linear pattern matching?

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.
Is there a faster method than doing
foreach (pattern in patterns) {
    for (i = 0; i < listLength; i++) {
        for (t = 0; t < patternlength; t++) {
            if (list[i+t] != pattern[t]) {
                break;
            }
        }
        if (t == patternlength) {
            return i; // pattern found!
        }
    }
}
Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.
Edit: If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.
Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says
The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).
That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?
The key for Googling is "multi-pattern" string matching.
Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.
An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.
I suggest starting by reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then deciding whether that would make sense in your case. If so, maybe an implementation is already available for your language/runtime.
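If no ready-made implementation is available, a bare-bones Aho-Corasick is small enough to sketch directly. The Python below is an illustrative sketch (not tuned for 8-million-bit inputs); it returns the earliest-ending match, which may differ from the earliest-starting match when patterns nest:

from collections import deque

def aho_corasick_first(text, patterns):
    # Build the trie: goto[state][ch] -> next state, out[state] = patterns ending there
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    # Build failure links breadth-first
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]
    # Scan the text once
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            p = max(out[state], key=len)     # longest pattern ending at i
            return i - len(p) + 1, p
    return None

# aho_corasick_first('000' + '0001101101' + '111', ['0001101101', '01010100100'])
# -> (3, '0001101101')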
Yes.
The Boyer–Moore string search algorithm
See also: Knuth–Morris–Pratt algorithm
You could build a suffix array; the search is then extremely fast: O(length(pattern)).
But you have to build that array first.
It's only worthwhile when the text is static and the patterns are dynamic.
A solution that could be efficient:
store the patterns in a trie data structure
start searching the list
check if the next pattern_length chars are in the trie, stop on success ( O(1) operation )
step one char and repeat #3
If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.
If your strings need to be flexible, I would also recommend a modified Boyer–Moore string search algorithm, as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The Boyer–Moore model is incredibly efficient for searching a large amount of data for one of multiple strings to match against.
Jacob
If it's a bit array, I suppose doing a rolling sum would be an improvement. If a pattern is of length n, sum the first n bits and see if the sum matches the pattern's sum. Always keep track of the first bit included in the current sum. Then, for every next bit, subtract the first bit from the sum and add the next bit, and see if the sum matches the pattern's sum. That would save the linear loop over the pattern.
It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.
Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.
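For reference, the rolling-sum filter looks roughly like this in Python (a sketch over a list of 0/1 integers; the real implementation was C++ over packed bits, and the popcount only filters candidates, so a full compare still confirms each match):

def find_first(bits, pattern):
    n, m = len(bits), len(pattern)
    if m == 0 or m > n:
        return -1
    target = sum(pattern)            # popcount of the pattern
    window = sum(bits[:m])           # popcount of the current window
    for i in range(n - m + 1):
        if window == target and bits[i:i+m] == pattern:
            return i
        if i + m < n:                # slide the window by one bit
            window += bits[i + m] - bits[i]
    return -1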
If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.
The code looks like this:
try:
    import psyco
    psyco.full()
except ImportError:
    pass

def encode(s):
    def calc_count(count, c):
        return count * (-1 if c == '0' else 1)
    result = []
    c = s[0]
    count = 1
    for i in range(1, len(s)):
        d = s[i]
        if d == c:
            count += 1
        else:
            result.append(calc_count(count, c))
            count = 1
            c = d
    result.append(calc_count(count, c))
    return result

def search(encoded_source, targets):
    def match(encoded_source, t, max_search_len, len_source):
        x = len(t)-1
        # Get the indexes of the longest segments and search them first
        most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
        # Align the signs of the source and target
        index = (0 if encoded_source[0] * t[0] > 0 else 1)
        unencoded_pos = sum(abs(c) for c in encoded_source[:index])
        start_t, end_t = abs(t[0]), abs(t[x])
        for i in range(index, len(encoded_source)-x, 2):
            if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                if start_t <= encoded_start and end_t <= encoded_end:
                    return unencoded_pos + (abs(encoded_source[i]) - start_t)
            unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
            if unencoded_pos > max_search_len:
                return len_source
        return len_source

    len_source = sum(abs(c) for c in encoded_source)
    i, found, target_index = len_source, None, -1
    for j, t in enumerate(targets):
        x = match(encoded_source, t, i, len_source)
        print "Match at: ", x
        if x < i:
            i, found, target_index = x, t, j
    return (i, found, target_index)

if __name__ == "__main__":
    import datetime
    def make_source_text(len):
        from random import randint
        item_len = 8
        item_count = 2**item_len
        table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
        return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))
    targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
    encoded_targets = [encode(t) for t in targets]
    data_len = 10*1000*1000
    s = datetime.datetime.now()
    source_text = make_source_text(data_len)
    e = datetime.datetime.now()
    print "Make source text(length %d): " % data_len, (e - s)
    s = datetime.datetime.now()
    encoded_source = encode(source_text)
    e = datetime.datetime.now()
    print "Encode source text: ", (e - s)
    s = datetime.datetime.now()
    (i, found, target_index) = search(encoded_source, encoded_targets)
    print (i, found, target_index)
    print "Target was: ", targets[target_index]
    print "Source matched here: ", source_text[i:i+len(targets[target_index])]
    e = datetime.datetime.now()
    print "Search time: ", (e - s)
On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.
psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:
Make source text(length 10000000): 0:00:02.277000
Encode source text: 0:00:00.329000
Match at: 2517905
Match at: 494990
Match at: 450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was: 1010010001010100100010
Source matched here: 1010010001010100100010
Search time: 0:00:04.325000
It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.
