Finding Common Sets within noisy data - algorithm

Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.
Suppose I have a giant list G of giant sets, and a list of sets H that is unknown to me. Each individual set I in G was created by taking the union of some unknown number of sets from H, then adding and removing an unknown number of elements.
Now, I could use other data to construct a few of the sets in list H. However, I feel like there might be some sort of technique involving Bayesian probability to do this. E.g. something like, "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.
Any thoughts?
Example usage:
Compress G by replacing chunks of it with pieces of H, e.g.
G[1] = {1,2,3,5,6,7,9,10,11}
H[5] = {1,2,3}
H[6] = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}

Define the distance d(i,j) = 1/(number of sets in G which contain both i and j) and then run a cluster analysis (http://en.wikipedia.org/wiki/Cluster_analysis). The resulting clusters are your candidates for the elements in H.
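A tiny sketch of that idea in Python, assuming G is a list of sets of comparable elements (e.g. MD5 strings); the co-occurrence threshold min_cooccurrence and the use of connected components are my own crude stand-in for a proper cluster analysis:
from collections import Counter
from itertools import combinations
import networkx as nx

def candidate_H(G, min_cooccurrence=3):
    # Count, for every pair of elements, how many sets in G contain both.
    cooccur = Counter()
    for s in G:
        for pair in combinations(sorted(s), 2):
            cooccur[pair] += 1
    # d(i, j) = 1 / cooccur[(i, j)], so a small distance means i and j often
    # appear together. Link pairs whose distance is at most 1/min_cooccurrence
    # and take connected components as the candidate clusters.
    graph = nx.Graph()
    for (i, j), count in cooccur.items():
        if count >= min_cooccurrence:
            graph.add_edge(i, j)
    return [set(component) for component in nx.connected_components(graph)]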

There are tons of non-brainy ad hoc ways to attack this. Here's one.
Start by taking a random sample from G, say 64 sets.
For each file in these sets, construct a 64-bit integer telling which sets it appears in.
Group the files by this 64-bit value; so all the files that always appear together end up in the same group. Find the group with maximum ((number of files in group - 1) × (number of bits set in the bit-vector - 1)) and call that H[0].
Now throw that sample back and take a new random sample. Reduce it as much as you can using the H[0] you've already defined. Then apply the same algorithm to find H[1]. Rinse. Repeat.
Stop when additional H's are no longer helping you compress the sets.
To improve on this algorithm:
You can easily choose a slightly different measure of the goodness of groups that promotes groups with lots of nearby neighbors--files that appear in nearly the same set of sets.
You can also pretty easily test your existing H's against random samples from G to see if there are files you should consider adding or removing.
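A rough sketch of one round of the sampling scheme above: the sample size of 64, the bit-vector signatures and the scoring formula follow the description, while the subset-based reduction by the already known H's and the helper name find_next_H are my own assumptions:
import random
from collections import defaultdict

def find_next_H(G, known_H, sample_size=64):
    sample = random.sample(G, min(sample_size, len(G)))
    # Reduce the sampled sets using the H's we already have.
    reduced = []
    for s in sample:
        s = set(s)
        for h in known_H:
            if h <= s:
                s -= h
        reduced.append(s)
    # For each file, a bit vector telling which sampled sets it appears in.
    signature = defaultdict(int)
    for bit, s in enumerate(reduced):
        for f in s:
            signature[f] |= 1 << bit
    # Group files that always appear together (identical bit vectors).
    groups = defaultdict(set)
    for f, sig in signature.items():
        groups[sig].add(f)
    # Score: (number of files in group - 1) * (number of bits set - 1).
    def score(item):
        sig, files = item
        return (len(files) - 1) * (bin(sig).count("1") - 1)
    best_sig, best_files = max(groups.items(), key=score)
    return best_files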

Well, the current ad-hoc way, which seems to be good enough, is as follows:
Remove all elements from all G_x that are in under 25 sets.
Create a mapping from element to set and from set to element.
For each element E in the element map, pick 3 sets and take their intersection. Make two copies of this, A and B.
For each set S in the set map that does not contain E, remove all elements of S from A or B (alternate between them).
Add Union(A,B) to H.
Remove all elements of Union(A,B) from the element-to-set map (i.e. do not find overlapping sets).
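For concreteness, here is roughly how those steps could look in Python; the threshold of 25, picking the 3 sets at random, and the helper name build_H are assumptions on my part:
import random
from collections import Counter, defaultdict

def build_H(G, min_occurrences=25):
    # 1. Remove elements that appear in fewer than min_occurrences sets.
    counts = Counter(e for s in G for e in s)
    G = [{e for e in s if counts[e] >= min_occurrences} for s in G]
    # 2. Element -> indices of sets containing it (the set-to-element map is G itself).
    elem_to_sets = defaultdict(set)
    for idx, s in enumerate(G):
        for e in s:
            elem_to_sets[e].add(idx)
    H = []
    for e in list(elem_to_sets):
        if e not in elem_to_sets:        # already consumed by an earlier H set
            continue
        idxs = elem_to_sets[e]
        if len(idxs) < 3:
            continue
        # 3. Pick 3 sets containing e, intersect them, keep two copies A and B.
        picked = random.sample(sorted(idxs), 3)
        core = set.intersection(*(G[i] for i in picked))
        A, B = set(core), set(core)
        # 4. Subtract every set that does not contain e, alternating between A and B.
        use_A = True
        for s in G:
            if e in s:
                continue
            (A if use_A else B).difference_update(s)
            use_A = not use_A
        candidate = A | B
        if candidate:
            H.append(candidate)
            # 5. Do not look for overlapping sets again.
            for x in candidate:
                elem_to_sets.pop(x, None)
    return H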

How about a deterministic way (if you do not wish the sets to intersect at all):
A) Turn the sets in H into vertices labeled 1, 2, 3, ..., size(H). Create a complete undirected graph between them all. Each vertex gets a value equal to the cardinality (size) of its set.
B) Go through all elements x of the sets in H and create a mapping x -> [x1, x2, ..., xm] if and only if x is in H[xi]. An array of sets will do. This helps you find overlapping sets.
C) Go through all entries of this array; for every pair x1, x2 that appear within the same entry, eliminate the edge between vertices x1 and x2.
D) In the remaining graph, only non-overlapping sets (well, their indices in H) stay connected.
E) Now find the non-intersecting path within this graph with the highest total value. From this you can reconstruct the list of non-intersecting sets with the highest coverage. It is trivial to compute the missing elements.
F) If you want to minimize the cardinality of the remaining set, then subtract 0.5 from the value of each vertex. We know that 1 + 1 = 2, but 0.5 + 0.5 < 1.5, so the algorithm will prefer one set {a,b} over the two sets {a} and {b}. This may not be exactly what you want, but it might inspire you.
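A sketch of steps A-D with NetworkX is below; for step E it falls back to a simple greedy pick by descending set value, purely as a placeholder, since choosing the best collection of pairwise non-intersecting sets is the genuinely hard part:
from collections import defaultdict
from itertools import combinations
import networkx as nx

def non_overlap_graph(H):
    # A) One vertex per set in H, valued by the set's cardinality.
    g = nx.Graph()
    for i, s in enumerate(H):
        g.add_node(i, value=len(s))
    g.add_edges_from(combinations(range(len(H)), 2))   # complete graph
    # B) Map each element x to the indices of the H sets containing it.
    containing = defaultdict(list)
    for i, s in enumerate(H):
        for x in s:
            containing[x].append(i)
    # C) Remove the edge between any two sets that share an element.
    for indices in containing.values():
        for i, j in combinations(indices, 2):
            if g.has_edge(i, j):
                g.remove_edge(i, j)
    # D) Edges now connect only non-overlapping sets.
    return g

def pick_non_intersecting(H):
    # E) Placeholder only: greedily take sets by descending value, skipping
    #    anything that overlaps (has no edge to) something already taken.
    g = non_overlap_graph(H)
    chosen = []
    for i in sorted(g.nodes, key=lambda n: -g.nodes[n]["value"]):
        if all(g.has_edge(i, j) for j in chosen):
            chosen.append(i)
    return [H[i] for i in chosen]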

Related

How to find a minimal set of keys?

I have a set of keys K and a finite set S ⊂ K^n of n-tuples of keys. Is there an efficient algorithm to find a bijective mapping f : S → S', where S' ⊂ K^k with k < n minimal, that strips some of the keys, leaving the others untouched?
I'm afraid this is NP-complete.
It is equivalent to set cover.
Each of your keys allows you to distinguish certain pairs of elements (i.e. a set of edges). Your task is to select the smallest number of keys that allows you to distinguish every element - i.e. the smallest number of sets of edges that allows you to cover every edge.
However, the wiki page shows an approximate solution based on integer programming that may give a useful solution in practice.
Sketch of Proof
Suppose we have a generic set cover problem:
A,B,C
C,D
A,B,D
where we need to find the smallest number of these sets to cover every element A,B,C,D.
We construct a tuple for each letter A,B,C,D.
The tuple has a unique number in position i if and only if set i contains the letter. Otherwise, position i contains 0.
There is also a zero tuple.
This means that the tuples would look like:
(0,0,0) The zero tuple
(1,0,2) The tuple for A (in sets 1 and 3)
(3,0,4) The tuple for B (in sets 1 and 3)
(5,6,0) The tuple for C (in sets 1 and 2)
(0,7,8) The tuple for D (in sets 2 and 3)
If you could solve your problem efficiently, you would then be able to use this mapping to solve set cover efficiently.
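Given the NP-completeness, a greedy approximation may still be good enough in practice: keep adding the key position that yields the most distinct projections until all tuples are distinguishable. A small sketch (not guaranteed to be minimal; the function name and the toy data are mine):
def greedy_key_positions(tuples):
    # Pick a small set of positions whose projection keeps all tuples distinct.
    # Greedy, so not guaranteed to be minimal.
    n = len(tuples[0])
    def distinct(positions):
        return len({tuple(t[p] for p in positions) for t in tuples})
    chosen = []
    while distinct(chosen) < len(set(tuples)):
        # Add the position that yields the most distinct projections.
        best = max((p for p in range(n) if p not in chosen),
                   key=lambda p: distinct(chosen + [p]))
        chosen.append(best)
    return chosen

# Toy example: position 1 alone already separates all four tuples.
print(greedy_key_positions([(1, 'a', 0), (1, 'b', 0), (2, 'c', 0), (2, 'd', 1)]))  # [1]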

Adjacency of intersected sets

Given a known number N of sets of items (N >= D). Each set contains exactly D - 1 items, unique within the set, but each item is shared between D - 1 sets. Therefore every set has two "neighbouring" sets: each neighbour differs from it by exactly one element. Also, every set has (if D is big enough) two more distant neighbouring sets that differ from it by exactly two elements, and so on. All the sets together form a closed chain.
E.g. there are ten elements a x b n q p j t r c, and D = 4. The sets are (in parentheses are hints for a possible ordering of neighbouring sets):
c x j (1)
p j x (2)
x a p (3)
p a n (4)
n q b (6)
a n q (5)
b r t (8)
b q t (7)
j c r (0)
c t r (9)
=> the respective chain of items is: r c j x p a n q b t. The example was generated as the result of backward substitution. But how can the restoration of the neighbourhood be performed algorithmically?
One way is obvious: simply enumerate all possible pairs of sets and check, for each pair, whether the two sets differ by exactly one element or not (there are possible small optimizations, but they do not matter much).
Another way to solve the problem is to generate, for each input set, hashes for all possible ordered (D - 2)-tuples of its elements, and then find pairs of collisions. There is a knowledge domain for this called Locality-Sensitive Hashing.
Both approaches seem to me to be complete opposites. Hashing is faster, but it requires tuning (bucket sizes, the way hashes of vector elements are combined, etc.), and most of its operations take amortized constant time only, so some probabilistic behaviour is involved. I conclude that for some D and N there is a possibility of encountering performance degradation.
I suspect that there is a deterministic (in the above sense) way to find all the neighbouring (adjacent) sets.
Here is an O(D*N) solution.
Two sets are defined to be neighbors if they differ in exactly 1 element. E.g. xap and pan.
Define N buckets, each labeled with a single element.
Place a copy of each set in each associated bucket. E.g. Bucket a holds xap, pan, and anq.
Start the chain by selecting any bucket, say a. Find 3 sets in the bucket that form a chain. E.g. xap, pan, anq. These are the first 3 sets in the chain.
Based on the last two sets in the current chain, find the element of the last set that is not in the previous one. For pan and anq, this is q. Go to that element's bucket.
In that bucket, find the set which is a neighbor of the last set in the chain. E.g. in bucket q, the set neighboring anq is nqb. Add this set to the chain. Go to the previous step until you circle back to the first set of the chain, e.g. xap.
One slight optimization is to remove sets from buckets once they are put into the chain.
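A rough Python rendering of this bucket walk on the example data from the question; the seeding by brute-force permutations and the stopping test are my own reading of the steps above:
from collections import defaultdict
from itertools import permutations

raw = ["cxj", "pjx", "xap", "pan", "nqb", "anq", "brt", "bqt", "jcr", "ctr"]
sets = [frozenset(s) for s in raw]

def neighbours(a, b):
    return len(a ^ b) == 2               # the two sets differ by exactly one element

# One bucket per element, holding every set that contains it.
buckets = defaultdict(list)
for s in sets:
    for e in s:
        buckets[e].append(s)

# Seed: find three sets in some bucket that form a chain.
chain = None
for bucket in buckets.values():
    for a, b, c in permutations(bucket, 3):
        if neighbours(a, b) and neighbours(b, c):
            chain = [a, b, c]
            break
    if chain:
        break

# Walk: the element of the last set that is missing from the previous one names
# the next bucket; in that bucket, find the neighbour of the last set.
while True:
    prev, last = chain[-2], chain[-1]
    pivot = next(iter(last - prev))
    nxt = next(s for s in buckets[pivot]
               if s not in (prev, last) and neighbours(s, last))
    if nxt == chain[0]:                   # circled back to the start
        break
    chain.append(nxt)

print([''.join(sorted(s)) for s in chain])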

Match every point in two different sized sets with minimum total line length

I have two sets of points plotted in a coordinate system. Each point in one set must be matched to at least one point in the other set, in such a way that the sum of the lengths of the lines drawn by joining the matched points is as low as possible. To make it clear, line drawing is just an abstraction; the actual output is just the pairs of points that must be matched.
I've seen this question about a similar problem, except that in my case there's no single-link restriction since the sets may have different sizes. Is there any kind of problem that describes this situation? More specifically, what algorithm could I use to solve this, assuming each set may have a maximum of 10 points?
Algorithm
You can model this as a network flow problem.
By having a source of 1 at each point in the first set, and a sink of 1 at each point in the second set, plus an extra node 'dest' for any left over capacity, any valid flow will always connect every point.
Make edges between the points with cost according to the distance between the points.
So far we have a network whose solution will be the lowest cost matching of set 1 to set 2 (i.e. each point will have a single link).
To allow multiple links you can simply make the following additions:
add 0 weight edges between each point in set2 and 'dest' (this allows points in set 2 to be multiply connected)
add 0 weight edges between 'dest' and each point in set1 (this allows points in set 1 to be multiply connected)
Example Python code using NetworkX:
import networkx as nx
import random

G = nx.DiGraph()
set1 = ['A','B','C','D','E','F','G','H','I']
set2 = ['a','b','c']

# Assume set1 >= set2 (or swap sets)
assert len(set1) >= len(set2)

G.add_node('dest', demand=len(set1)-len(set2))

for person in set1:
    G.add_node(person, demand=-1)
    G.add_edge('dest', person, weight=0)
    for project in set2:
        cost = random.randint(1,10)              # Assign appropriate costs here
        G.add_edge(person, project, weight=cost) # Edge taken if person does this project

for project in set2:
    G.add_node(project, demand=1)
    G.add_edge(project, 'dest', weight=0)

flowdict = nx.min_cost_flow(G)

for person in set1:
    for project, flow in flowdict[person].items():
        if flow:
            print(person, '->', project)
You can use a discrete optimization approach (Integer Programming).
We have two sets A, of size X, and B, of size Y. This means a maximum of X*Y links, each described by a boolean variable: L(i,j) = L(Y*i+j) is 1 if nodes A(i) and B(j) are linked, 0 if not. If X = Y = 10, we can write link L(7,3) as L73.
We can rewrite the problem like this:
Node A(i) has at least one link: X (say, ten) criteria with i from 0 to X-1, each of them comprised of Y components:
L(i,0)+L(i,1)+L(i,2)+...+L(i,Y-1) >= 1
Node B(j) has at least one link, and there are Y criteria made up of X components:
L(0,j)+L(1,j)+L(2,j)+...+L(X-1,j) >= 1
The minimal cost requirement becomes:
cost = SUM(C(0,0)*L(0,0) + C(0,1)*L(0,1) + ... + C(9,9)*L(9,9))
With these conventions, we can easily build the matrices for an ILP problem, that can be passed to our favorite ILP solving package or library (C, Java, Python, even PHP).
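For example, with a modelling library such as PuLP (one possible choice, assuming it and its bundled solver are installed; the random costs stand in for the real distances C(i,j)):
import random
import pulp

X, Y = 4, 3                                   # sizes of sets A and B
cost = {(i, j): random.randint(1, 10) for i in range(X) for j in range(Y)}

prob = pulp.LpProblem("min_total_link_cost", pulp.LpMinimize)
L = pulp.LpVariable.dicts("L", list(cost), cat=pulp.LpBinary)  # L[i,j] = 1 if A(i)-B(j) linked

prob += pulp.lpSum(cost[k] * L[k] for k in cost)               # total cost to minimise
for i in range(X):                                             # A(i) has at least one link
    prob += pulp.lpSum(L[i, j] for j in range(Y)) >= 1
for j in range(Y):                                             # B(j) has at least one link
    prob += pulp.lpSum(L[i, j] for i in range(X)) >= 1

prob.solve()
print([k for k in cost if L[k].value() == 1])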
====
A self-contained "greedy" algorithm which is not guaranteed to find a minimum, but is reasonably quick and should give reasonable results unless you feed it a pathological data set, is:
- connect all points in the smaller set, each to its nearest point in the other set.
- connect all unconnected points remaining in the larger set, each to its nearest point in the first set, whether it's already connected or not.
As an optimization, you can then enumerate the points in the larger data set; if one of them (say A) is singly connected to a multiply connected point B in the first data set, and B is not A's nearest neighbour (say that is C), you can switch the link from A-B to A-C. This takes care of one of the simplest problems that may arise from the "greediness" of the algorithm.
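A sketch of the two basic greedy steps (the post-hoc link-switching optimization is left out; the function name and sample points are made up):
import math

def greedy_links(set_a, set_b):
    # Greedy matching of two point lists; quick but not guaranteed to be minimal.
    small, large = sorted((set_a, set_b), key=len)
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    # 1. Connect every point of the smaller set to its nearest point in the larger set.
    links = [(p, min(large, key=lambda q: dist(p, q))) for p in small]
    # 2. Connect every still-unconnected point of the larger set to its nearest
    #    point in the smaller set, whether that one is already connected or not.
    used = {q for _, q in links}
    links += [(min(small, key=lambda p: dist(p, q)), q) for q in large if q not in used]
    return links

print(greedy_links([(0, 0), (5, 5)], [(1, 0), (4, 4), (6, 6)]))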

Algorithm: Removing as few elements as possible from a set in order to enforce no subsets

I have a problem which I do not know how to solve:
I have a set of sets A = {A_1, A_2, ..., A_n} and I have a set B.
The goal is to remove as few elements as possible from B (creating B'), such that after the removal, for all 1 <= i <= n, A_i is not a subset of B'.
For example, if we have A_1 = {1,2}, A_2 = {1,3,4}, A_3 = {2,5}, and B = {1,2,3,4,5}, we could e.g. remove 1 and 2 from B (that would yield B' = {3,4,5}, which is not a superset of any of the A_i).
Is there an algorithm for determining the (minimal number of) elements to be removed?
It sounds like you want to remove the minimal hitting set of A from B (this is closely related to the vertex cover problem).
A hitting set for some set-of-sets A is itself a set such that it contains at least one element from each set in A (it "hits" each set). The minimal hitting set is the smallest such hitting set. So, if you have an MHS for your set-of-sets A, you have an element from each set in A. Removing this from B means no set in A can be a subset of B.
All you need to do is calculate the MHS for (A1, A2, ... An), then remove that from B. Unfortunately, finding the MHS is an NP-complete problem. Knowing that though, you have a few options:
If your data set is small, do the obvious brute-force solution
Use a probabilistic algorithm to get a fast, approximate answer (see this PDF)
Run far, far away in the opposite direction
If you just need some approximation, start with the smallest set in A, and remove one of its elements from B. (You could just grab one at random, or check which element is in the most sets in A, depending on how accurate or how fast you need to be.)
Now the smallest set in A isn't a subset of B. Move on from there, but check first to see whether or not the sets you're examining are subsets at this point or not.
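A quick sketch of that greedy idea, removing at each step the element of B that hits the most remaining sets; it is only an approximation, and the function name is mine:
def approx_removals(A, B):
    # Greedy approximation of the minimal hitting set to remove from B.
    remaining = [set(a) for a in A if set(a) <= B]   # only the A_i that are subsets of B matter
    removed = set()
    while remaining:
        # Take the element of B that appears in the most still-unhit sets.
        e = max(B - removed, key=lambda x: sum(x in a for a in remaining))
        removed.add(e)
        remaining = [a for a in remaining if e not in a]
    return removed

print(approx_removals([{1, 2}, {1, 3, 4}, {2, 5}], {1, 2, 3, 4, 5}))   # e.g. {1, 2}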
This reminds me of the vertex covering problem, and I remember some approximation algorithm for that that is similar to this one.
I think you should find the smallest of these sets and then delete from B the elements which are in it.

Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is O(n) where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.
Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is O(a log a + ab) where a is the size of the search set and b is the number of indexed sets.
What's your solution?
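For concreteness, here is roughly how I picture the first (bit-vector) approach; assigning word ids on the fly is just a simplification of the "total order with no gaps" requirement, and the class name is arbitrary:
class SubsetIndex:
    # Bit-vector index: find all indexed sets that are subsets of a query set.
    def __init__(self):
        self.word_ids = {}     # word -> bit position, assigned on first sight
        self.entries = []      # (original set, bit vector) pairs

    def _vector(self, words):
        v = 0
        for w in words:
            bit = self.word_ids.setdefault(w, len(self.word_ids))
            v |= 1 << bit
        return v

    def add(self, words):
        self.entries.append((words, self._vector(words)))

    def subsets_of(self, document_words):
        q = self._vector(document_words)
        # v is a subset of q exactly when OR-ing it into q changes nothing.
        return [s for s, v in self.entries if q | v == q]

idx = SubsetIndex()
idx.add({'stack', 'overflow'})
idx.add({'foo', 'bar'})
print(idx.subsets_of({'stack', 'overflow', 'foo', 'baz'}))   # -> [{'stack', 'overflow'}]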
The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).
If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
First I would construct 2 data structures, S and E.
S is an array of sets (it holds the N indexed subsets):
S[0] = {element1, element2, ...}
S[1] = {element1, element2, ...}
...
S[N] = {element1, element2, ...}
E is a map keyed by element (hashed for indexing); each entry is a list of the S-indices where that element appears:
// O( S_total_elements ) = O(n) operation
E[element1] = list(S1, S6, ...)
E[element2] = list(S3, S4, S8, ...)
...
Now, 2 new structures: a set L and an array C.
Store in L all the elements of D that exist in E (an O(n) operation).
C is an array of counters indexed by S-index.
// count how many of each candidate subset's elements appear in D
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1
Finally,
foreach index i of C:
    if C[i] == S[i].Count():
        // the subset S[i] exists in D
Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.
Here's Wiki on full text search.
EDIT: Ok, I got that backwards.
You could convert your document to a set (if your language has a set datatype), do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.
Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.
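A minimal sketch of that inverted-index idea, with a made-up toy corpus:
from collections import defaultdict

documents = {
    'doc1': {'stack', 'overflow', 'foo'},
    'doc2': {'foo', 'bar'},
}

# Inverted index: word -> ids of the documents containing that word.
index = defaultdict(set)
for doc_id, words in documents.items():
    for w in words:
        index[w].add(doc_id)

def matching_docs(search_term):
    # Documents containing every word of the search term.
    words = search_term.split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

print(matching_docs('foo bar'))   # {'doc2'}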
