Adjacency of intersected sets - algorithm

Given a known number N of sets of items. Each set containing exactly D - 1 (N >= D) unique (within the set) items. But each item shared between D - 1 sets. Therefore every set have two "neighbouring" sets: both neighbours differs by exactly one element. Also every set have (if D is big enough) two more distant neighbouring sets, that differs by exactly two elements, etc. All sets together forms a closed chain.
E.g. there are ten elements a x b n q p j t r c. D = 4. And sets are (in parentheses are hints for possible ordering of neighbouring sets):
c x j (1)
p j x (2)
x a p (3)
p a n (4)
n q b (6)
a n q (5)
b r t (8)
b q t (7)
j c r (0)
c t r (9)
=> respective chain of items is: r c j x p a n q b t. The example generated as the result of backward substitution. But how to perform the restoration of the neighbourhood algoritmically ?
One way is obvious: simply enumerate all possible pairs of sets and compare sets from each pair whether they are differs exactly by one element or not (also there possible small optimizations, but they not matters much).
Another way to solve the problem is to generate (for each set from input) hashes for all possible D - 2-tuples of ordered sets of elements, then find pairs of collisions. There is a knowledge domain called Locality-Sensitive Hashing.
Both approaches seems to me as a full opposites. Hashing is faster, but implies the adjustment (of buckets sizes, choosing the way of hash combining for vector elements etc.) and most of its operations have amortized constant time. So, there involved some probabilistic actions. I can conclude, that for some D and N there is possibility to encounter performance degradation.
I suspect that there is a deterministic (in above sense) way to find all the neighbouring (adjacent) sets.

Here is an O(D*N) solution.
Two sets are defined to be neighbors if they differ in exactly 1 element. E.g. xap and pan.
Define N buckets, each labeled with a single element.
Place a copy of each set in each associated bucket. E.g. Bucket a holds xap, pan, and anq.
Start the chain by selecting any bucket, say a. Find 3 sets in the bucket that form a chain. E.g. xap, pan, anq. These are the first 3 sets in the chain.
Based on last two sets in the current chain, find the element in the last set that is not in the previous. For pan and anq, this is q. Go to this bucket.
In the present bucket, find the set which is a neighbor of the last element in the chain. E.g. in bucket q, the set neighboring anq is nqb. Add this set to the chain. Go to previous step until you circle back to the first set of the chain, e.g. xap.
One slight optimization is to remove sets from buckets once they are put into the chain.

Related

How to find a minimal set of keys?

I have a set of keys K and a finite set S &subset; K n of n-tuples of keys. Is there an efficient algorithm to find a bijective mapping f : S &mapsto; S' where S' &subset; K k with k < n minimal that strips some of the keys, leaving the others untouched?
I'm afraid this is NP-complete.
It is equivalent to set cover.
Each of your keys allows you to distinguish certain pairs of elements (i.e. a set of edges). Your task is to select the smallest number of keys that allows you to distinguish every element - i.e. the smallest number of sets of edges that allows you to cover every edge.
However, the wiki page shows an approximate solution based on integer programming that may give a useful solution in practice.
Sketch of Proof
Suppose we have a generic set cover problem:
A,B,C
C,D
A,B,D
where we need to find the smallest number of these sets to cover every element A,B,C,D.
We construct a tuple for each letter A,B,C,D.
The tuple has a unique number in position i if and only if set i contains the letter. Otherwise, they contain 0.
There is also a zero tuple.
This means that the tuples would look like:
(0,0,0) The zero tuple
(1,0,2) The tuple for A (in sets 1 and 3)
(3,0,4) The tuple for B (in sets 1 and 3)
(5,6,0) The tuple for C (in sets 1 and 2)
(0,7,8) The tuple for D (in sets 2 and 3)
If you could solve your problem efficiently, you would then be able to use this mapping to solve set cover efficiently.

Match every point in two different sized sets with minimum total line length

I have two sets of points plotted in a coordinate system. Each point in a set must be matched to at least one point at the other set, in a way that the sum of the length of the lines drawn by joining those points should be as low as possible. To make it clear, line drawing is just an abstraction, the actual output is just the pairs of points that must be matched.
I've seen this question about a similar problem, except that in my case there's no single-link restriction since the sets may have different sizes. Is there any kind of problem that describes this situation? More specifically, what algorithm could I use to solve this, assuming each set may have a maximum of 10 points?
Algorithm
You can model this as a network flow problem.
By having a source of 1 at each point in the first set, and a sink of 1 at each point in the second set, plus an extra node 'dest' for any left over capacity, any valid flow will always connect every point.
Make edges between the points with cost according to the distance between the points.
So far we have a network whose solution will be the lowest cost matching of set 1 to set 2 (i.e. each point will have a single link).
To allow multiple links you can simply make the following additions:
add 0 weight edges between each point in set2 and 'dest' (this allows points in set 2 to be multiply connected)
add 0 weight edges between 'dest' and each point in set2 (this allows points in set 1 to be multiply connected)
Example Python code using Networkx
import networkx as nx
import random
G=nx.DiGraph()
set1=['A','B','C','D','E','F','G','H','I']
set2=['a','b','c']
# Assume set1 > set2 (or swap sets)
assert len(set1)>=len(set2)
G.add_node('dest',demand=len(set1)-len(set2))
A=[]
for person in set1:
G.add_node(person,demand=-1)
G.add_edge('dest',person,weight=0)
for project in set2:
cost = random.randint(1,10) # Assign appropriate costs here
G.add_edge(person,project,weight=cost) # Edge taken if person does this project
for project in set2:
G.add_node(project,demand=1)
G.add_edge(project,'dest',weight=0)
flowdict = nx.min_cost_flow(G)
for person in set1:
for project,flow in flowdict[person].items():
if flow:
print person,'->',project
You can use a discrete optimization approach (Integer Programming).
We have two sets A, of size X, and B, of size Y. This means a maximum of X*Y links, each described by a boolean variable: L(i,j) = L(Y*i+j) is 1 if nodes A(i) and B(j) are linked, 0 if not. If X = Y = 10, we can write link L(7,3) as L73.
We can rewrite the problem like this:
Node A(i) has at least one link: X (say, ten) criteria with i from 0 to X-1, each of them comprised of Y components:
L(i,0)+L(i,1)+L(i,2)+...+L(i,Y-1) >= 1
Node B(j) has at least one link, and there are Y criteria made up of X components:
L(0,j)+L(1,j)+L(2,j)+...+L(X-1,j) >= 1
The minimal cost requirement becomes:
cost = SUM(C(0,0)*L(0,0)+C(0,1)*L(0,1)+...+C(9,9)*L(9,9)
With these conventions, we can easily build the matrices for an ILP problem, that can be passed to our favorite ILP solving package or library (C, Java, Python, even PHP).
====
A self-contained "greedy" algorithm which is not guaranteed to find a minimum, but is reasonably quick and should give reasonable results unless you feed it a pathological data set, is:
- connect all points in the smaller set, each to its nearest point in the other set.
- connect all unconnected points remaining in the larger set, each to its
nearest point in the first set, whether it's already connected or not.
As an optimization, you can then enumerate the points in the larger data set; if one of them (say A) is singly connected to a point in the first data set (say B) which is multiply connected, and is not its nearest neighbour C, you can switch the link from A-B to A-C. This takes care of one of the simplest problems that may arise from the "greediness" of the algorithm.

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2N/2), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it to for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. Repeat this until we find a sum sk+1 greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2N/2 lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(n/2) operation) and save re-computing them. The value of all the possible subsets is doesn't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
auto firstit = firstlist.begin();
auto secondit = secondlist.begin();
// Since we want all in a range, rather than just the first, we need to check all combinations. Horowitz/Sahni is only designed to find one.
for(; firstit != firstlist.end(); firstit++) {
for(; secondit = secondlist.end(); secondit++) {
int sum = *firstit + *secondit;
if (sum > A && sum < B)
sums.push_back(sum);
}
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz Sahni, but we try and do some optimizations to reduce the constants in the BigOh.
We do the following
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indexes in S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible, for instance when sum from S2, is negative, we don't consider sums < A etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than original Horowitz-Sahni. Recall that Horowitz-Sahni works two lists of subset sums: Sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is a Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A,B) = (500000,600000)
# Subset iterator stolen from Sage
def subsets(X):
yield []; pairs = []
for x in X:
pairs.append((2**len(pairs),x))
for w in xrange(2**(len(pairs)-1), 2**(len(pairs))):
yield [x for m, x in pairs if m & w]
# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S),S) for S in subsets(terms[:len(terms)/2])])
R = sorted([(sum(S),S) for S in subsets(terms[len(terms)/2:])])
(toolow,toohigh) = (-1,0)
for (Lsum,S) in reversed(L):
while R[toolow+1][0] < A-Lsum and toolow < len(R)-1: toolow += 1
while R[toohigh][0] <= B-Lsum and toohigh < len(R): toohigh += 1
for n in xrange(toolow+1,toohigh):
print '+'.join(map(str,S+R[n][1])),'=',sum(S+R[n][1])
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.

Finding Common Sets within noisy data

Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.
Suppose I have a giant list of giant sets G and an unknown to me list of sets H. Each individual set I in G was created by taking the union of some unknown number of sets from list H, then adding and removing an unknown number of elements.
Now, I could use other data to construct a few of the sets in list H. However, I feel like there might be some sort of technique involving Bayesian probability to do this. E.g. something like, "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.
Any thoughts?
Example usage:
Compress G by replacing chunks of it with pieces of H, e.g.
G[1] = {1,2,3,5,6,7,9,10,11}
H[5] = {1,2,3}
H[6] = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}
Define the distance d(i,j) = 1/(number of sets in G which contain both i and j) and then run a cluster analysis.(http://en.wikipedia.org/wiki/Cluster_analysis) The resulting clusters are your candidates for the elements in H.
There are tons of non-brainy ad hoc ways to attack this. Here's one.
Start by taking a random sample from G, say 64 sets.
For each file in these sets, construct a 64-bit integer telling which sets it appears in.
Group the files by this 64-bit value; so all the files that always appear together end up in the same group. Find the group with maximum ((number of files in group - 1) × (number of bits set in the bit-vector - 1)) and call that H[0].
Now throw that sample back and take a new random sample. Reduce it as much as you can using the H[0] you've already defined. Then apply the same algorithm to find H[1]. Rinse. Repeat.
Stop when additional H's are no longer helping you compress the sets.
To improve on this algorithm:
You can easily choose a slightly different measure of the goodness of groups that promotes groups with lots of nearby neighbors--files that appear in nearly the same set of sets.
You can also pretty easily test your existing H's against random samples from G to see if there are files you should consider adding or removing.
Well, the current ad-hoc way, which seems to be good enough, is as follows:
Remove all elements from all G_x that are in under 25 sets.
Create a mapping from element to set and from set to element.
For each element E in the element map, pick 3 sets and take their intersection. Make two copies of this, A and B.
For each set S in the set map that does not contain E, remove all elements of S from A or B (alternate between them)
Add Union(A,B) to H
Remove all elements of `Union(A,B) from the element to set map (i.e. do not find overlapping sets).
How about a deterministic way (if you do not wish sets to intersect at all):
A) Turn sets in H into vertices labeled 1, 2, 3, ... size(H). Create a complete [un] directed graph between them all. Each vertex gets a value - equal to the cardinality / size of the set.
B) Go through all elements x in sets in H, create a mapping x -> [x1, x2, ... xm] if and only if x is in H[xi]. An array of sets will do. This helps you find overlapping sets.
C) Go through through all sets in this array, for every pair of x1, x2 that are within the same set - eliminate two edges between x1 and x2.
D) In the remaining graph only non-overlapping sets (well, their indices in H).
E) Now find the non-intersecting path within this graph with the highest total value. From this you can reconstruct the list of non-intersecting sets with highest coverage. It is trivial to compute the missing elements.
F) If you want to minimize the cardinality of the remaining set, then subtract 0.5 from the value of each vertex. We know that 1 + 1 = 2, but 0.5 + 0.5 < 1.5 - so the algorithm will prefer a set {a,b} over {a} and {b}. This may not be exactly what you want, but it might expire you.

Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is O(n) where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.
Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is O(a log a + ab) where a is the size of the search set and b is the number of indexed sets.
What's your solution?
The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).
If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
First I would construct 2 data structures, S and E.
S is an array of sets (set S has the N subsets).
S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N] = set(element1, element2, ...)
E is a map (element hash for index) of lists. Each list contains S-indices, where the element appears.
// O( S_total_elements ) = O(n) operation
E[element1] = list(S1, S6, ...)
E[element2] = list(S3, S4, S8, ...)
...
Now, 2 new structures, set L and array C.
I store all the elements of D, that exist in E, in the L. (O(n) operation)
C is an array (S-indices) of counters.
// count subset's elements that are in E
foreach e in L:
foreach idx in E[e]:
C[idx] = C[idx] + 1
Finally,
for i in C:
if C[i] == S[i].Count()
// S[i] subset exists in D
Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.
Here's Wiki on full text search.
EDIT: Ok, I got that backwards.
You could convert your document to a set (if your language has a set datatype), do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.
Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.

Resources