I'm working on a problem which involves going through a lot of data. To reduce the work (because current calculations take about two weeks of compute time, and I'd like to reduce that dramatically) I came up with an algorithm which would be much faster if it was able to avoid a certain type of duplication. (The current algorithm avoids storing this information because it is too large, unreduced, to fit in memory.)
I have a collection of sets, and I don't want to insert a set A if there is already a set B which is a subset of A. At the moment the sets are represented by integers where individual binary digits represent a particular element being present or absent. In that interpretation the set/integer A should not be inserted if there is already a set/integer B such that (~A) & B is 0, where ~ is bitwise negation and & is bitwise AND.
For example, if my collection has the following sets
[ {a,b}, {b,c}, {b,d,e} ]
and I asked to add {b,c,e} it should not be added (since {b,c} is already there) and similarly with {a,b} (since {a,b} is there) but {a,e} should be added.
The numeric equivalent would be starting with
[ `0b11`, `0b110`, `0b11010` ]
where 0b10110 is not added since (~0b10110) & 0b110 == 0, 0b11 is not added since (~0b11) ^ 0b11 == 0, but 0b10001 can be added.
Ideally the structure would prune itself as new sets are added, so if {c} were added all existing sets containing c would be removed. But it's acceptable if it doesn't update in that way as long as I can normalize it to that form in some not-too-expensive way every so often.
This is a well-known problem known as "finding extremal sets"; unfortunately, there is nothing fundamentally faster known than the obvious approach of testing a newly inserted set against all existing sets, but good heuristic improvements exist. Here is a recent paper discussing this problem: https://arxiv.org/abs/1508.01753
An open-source implementation of a related algorithm:
https://code.google.com/archive/p/google-extremal-sets/
Related
I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just contained a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects present in most of the sets, and I want to utilize this.
My idea is to represent sets as stacks (i.e. single-directional linked lists), whereas their bottom parts can be shared across different sets. It can also be defined as a tree, whereas each node/leaf has a pointer to its parent, but not children.
Such a data structure will allow to use the most common subsets of objects as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as a recursive pseudo-code.
BuildAllChains()
{
BuildSubChains(allSets, NULL);
}
BuildSubChains(sets, pParent)
{
if (sets is empty)
return;
trgObj = the most frequent object from sets;
pNode = new Node;
pNode->Object = trgObj;
pNode->pParent = pParent;
newSets = empty;
for (each set in sets that contains the trgObj)
{
remove trgObj from set;
remove set from sets;
if (set is empty)
set->pHead = pNode;
else
newSets.Insert(set);
}
BuildSubChains(sets, pParent);
BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
Practically I use a combination of loop + recursion, whereas recursion always invoked on a smaller part.
So, the idea is to select each time the most common object, create a "subset" which inherits its parent subset, and all the sets that include it, as well as all the predecessors selected so far - should be based on this subset.
Now, I'm trying to figure-out an effective way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects, and sort it once. Then, during the recursion, whenever we remove an object and select only sets that contain/don't contain it - deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select each time the most frequent object directly, i.e. O(N*M). But it also looks inferior, in a degenerate case, where an object exists in either almost all or almost none sets we may need to repeat this O(N) times. OTOH for those specific cases in-place adjustment of the sorted histogram may be preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
#Ivan: first thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than the count only. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, corss-linked pointers and etc. I planned this from the beginning, because than it seemed to me that the histogram adjustment after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure-out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven creating a new histogram for the smaller part is too expensive. But restricting it to only existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort is broken - swap the histogram element with its neighbor until the sort is restored.
This seems good if we remove small number of objects, we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However when removing large number of objects it seems better to just remove all the objects and then re-sort the array via quick-sort.
So, do you have a better idea regarding this?
Update2:
I think about the following: The histogram should be a data structure which is a binary search tree (auto-balanced of course), whereas each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criteria is the size of this list.
Each set should contain the list of objects it contains now, whereas the "object" has the direct pointer to the element histogram. In addition each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node also should contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), whereas M is the number of elements that are currently affected, which is smaller than N if only a little number is affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For the clarity I denote with set the initial sets and with family the set of these sets.
Indeed, let's store a histogram. In BuildSubChains function we maintain a histogram of all elements which are presented in the sets at the moment, sorted by frequency. It may be something like std::set of pairs (frequency, value), maybe with cross-references so you could find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let there total sizes be T' and T''. Take the family with the smallest total size and remove all elements from its sets from the histogram, making the new histogram on the run. Now you have a histogram for both families, and it is built in time O(min(T', T'') log n), where log n comes from operations with std::set.
At the first glance it seems that it works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram the size of its family at least halves, so each element will directly participate in no more than log T removals. So there will be O(T log T) operations with histograms in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element) you'll be able to efficiently select all sets that contain the most frequent element.
The source data for the subject is an m-by-n binary matrix (only 0s and 1s are allowed).
m Rows represent observations, n columns - features. Some observations are marked as targets which need to be separated from the rest.
While it looks like a typical NN, SVM, etc problem, I don't need generalization. What I need is an efficient algorithm to find as many as possible combinations of columns (features) that completely separate targets from other observations, classify, that is.
For example:
f1 f2 f3
o1 1 1 0
t1 1 0 1
o2 0 1 1
Here {f1, f3} is an acceptable combo which separates target t1 from the rest (o1, o2) (btw, {f2} is NOT as by task definition a feature MUST be present in a target). In other words,
t1(f1) & t1(f3) = 1 and o1(f1) & o1(f3) = 0, o2(f1) & o2(f3) = 0
where '&' represents logical conjunction (AND).
The m is about 100,000, n is 1,000. Currently the data is packed into 128bit words along m and the search is optimized with sse4 and whatnot. Yet it takes way too long to obtain those feature combos.
After 2 billion calls to the tree descent routine it has covered about 15% of root nodes. And found about 8,000 combos which is a decent result for my particular application.
I use some empirical criteria to cut off less probable descent paths, not without limited success, but is there something radically better? Im pretty sure there gotta be?.. Any help, in whatever form, reference or suggestion, would be appreciated.
I believe the problem you describe is NP-Hard so you shouldn't expect to find the optimum solution in a reasonable time. I do not understand your current algorithm, but here are some suggestions on the top of my head:
1) Construct a decision tree. Label targets as A and non-targets as B and let the decision tree learn the categorization. At each node select the feature such that a function of P(target | feature) and P(target' | feature') is maximum. (i.e. as many targets as possible fall to positive side and as many non-targets as possible fall to negative side)
2) Use a greedy algorithm. Start from the empty set and at each time step add the feauture that kills the most non-target rows.
3) Use a randomized algorithm. Start from a small subset of positive features of some target, use the set as the seed for the greedy algorithm. Repeat many times. Pick the best solution. Greedy algorithm will be fast so it will be ok.
4) Use a genetic algorithm. Generate random seeds for the greedy algorithm as in 3 to generate good solutions and cross-product them (bitwise-and probably) to generate new candidates seeds. Remember the best solution. Keep good solutions as the current population. Repeat for many generations.
You will need to find the answer "how many of the given rows have the given feature f" fast so probably you'll need specialized data structures, perhaps using a BitArray for each feature.
I have a problem which I am trying to solve with genetic algorithms. The problem is selecting some subset (say 4) of 100 integers (these integers are just ids that represent something else). Order does not matter, the solution to the problem is a SET of integers not an ordered list. I have a good fitness function but am having trouble with the crossover function.
I want to be able to mate the following two chromosomes:
[1 2 3 4] and
[3 4 5 6] into something useful. Clearly I cannot use the typical crossover function because I could end up with duplicates in my children which would represent invalid solutions. What is the best crossover method in this case.
Just ignore any element that occurs in both of the sets (i.e. in their intersection.), that is leave such elements unchanged in both sets.
The rest of the elements form two disjoint sets, to which you can apply pretty much any random transformation (e.g. swapping some pairs randomly) without getting duplicates.
This can be thought of as ordering and aligning both sets so that matching elements face each other and applying one of the standard crossover algorithms.
Sometimes it is beneficial to let your solution go "out of bounds" so that your search will converge more quickly. Rather than making a set of 4 unique integers a requirement for your chromosome, make the number of integers (and their uniqueness) part of the fitness function.
Since order doesn't matter, just collect all the numbers into an array, sort the array, throw out the duplicates (by disconnecting them from a linked list, or setting them to a negative number, or whatever). Shuffle the array and take the first 4 numbers.
I don't really know what you mean on "typical crossover", but I think you could use a crossover similar to what is often used for permutations:
take m ints from the first parent (m < n, where n is the number of ints in your sets)
scan the second and fill your subset from it with (n-m) ints that are free (not in the subset already).
This way you will have n ints from the first and n-m ints from the second parent, without duplications.
Sounds like a valid crossover for me :-).
I guess it might be beneficial not to do either steps on ordered sets (or using an iterator where the order of returned elements correlates somehow with the natural ordering of ints), otherwise either smaller or higher numbers will get a higher chance to be in the child making your search biased.
If it is the best method depends on the problem you want to solve...
In order to combine sets A and B, you could choose the resulting set S probabilistically so that the probability that x is in S is (number of sets out of A, B, which contain x) / 2. This will be guaranteed to contain the intersection and be contained in the union, and will have expected cardinality 4.
I want to find which elements of two arrays make the two arrays different.
For example, if I start off with
known_unacceptable_array = [bad, bad, good, good, good, bad, good]
known_acceptable_array = []
and an array is only unacceptable if there's three bads (but I don't know that at the time), but I'm able to evaluate whether an array is acceptable or unacceptable, I would like to find the smallest array that makes the array unacceptable
possibly_minimal_unacceptable = [bad, bad, bad]
maximal_acceptable = [bad, bad] # Third bad required to make the array unacceptable
What is this problem called, and what algorithms are there for this?
Edit: The elements can't be changed in order, and adding an element can only either change the list from being acceptable to unacceptable or have no effect - it can't change it from being unacceptable to acceptable.
Background: I've randomly generated thousands of instructions that make a ruby interpreter crash, and I want to isolate the specific instructions that cause it to crash, and at the time I thought that multiple bad instructions were required to make it crash. A very naive attempt to determine what the bad instructions is at this link
What is determining the elements that make the difference
between two arrays called?
Differencing is often called
subtraction.
I want to determine which elements of two arrays make the
two arrays different.
Again, that's subtraction(at least
some form of it):
Given A ={ x , y , z } B = { x , y a },
A - B = { z , -a }
or "only A has z and only B has a", or "z and a" make them
different.
For example, if I start off with
known_bad = [bad, bad, good, good, good, bad, good] >
known_good = []
Why start with a full array and an empty one? Isn't this an
extreme case, or are these "two arrays" not two of which you
are trying to determine the "difference."
possibly_minimal_bad = [bad, bad, bad]
maximal_good = [bad, bad] # Third bad required to make the list bad
Is this just a set of rules? Or is this the result of
finding the difference between the two arrays of the previous
(known_good,bad) set?
What is this problem called, and what algorithms are there
for this?
If it isn't called "difference" or "subtraction" then why
introduce it that way?
Is the problem: a. going from
the first two arrays (known_xx) to the second two (min,max);
or is it: b. classifying finite sequences of the words "good"
and "bad."
a) I can't see a relation between the first two
arrays and the second two. How did you get from the first two
to the second?
b) Classifying a sequence of words could be
"parsing a language", or decoding a message, recognizing a
pattern, etc.
Is it "Pattern Recognition"?
It appears that you are looking for a pattern in test input(or test point) data and it's relationship to product failure,
and want to represent the relationship in some codical
form for further analysis. Or searching for a correlation between certain test points and product failure. That makes this question rather
interesting. However, the presentation of the question
is quite confusing. Maybe those groups of
equations could be explained a little more, clarifying if they are related,and if so, then: In what way?
I'm not entirely sure if I understand the question. If my answer is unsatisfactory, please rephrase your question to be more clear. I'll base my answer on this.
I want to determine which elements of two arrays make the two arrays different.
This is a combination of the three set operations union, intersection and difference. Different combinations can achieve the same result.
Complement is the the subset of A which is not in B.
Intersection is the set of elements which is both in A and B, but not just A or B.
Union is the subset which is either in A or B (no duplicates).
It sounds like you want the union of both complements, which is:
A\B ∪ B\A
Or the complement between the intersection and the union:
A∩B \ A∪B
See http://en.wikipedia.org/wiki/Set_operations_(Boolean) for more information.
So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is:
0110, 0111, 1001, 1011, 1110, 1111
So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2(i.e. 3rd bit) was 0. In this case, that would be 1001, only. Remove(1,1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which somehow are O(l), with l being the number of items being removed, are unexciting.
Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at the height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to removing all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?
Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of it's possible application is thus: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune subtree at a time (since the parent of that subtree is different from the parent of other subtrees and thus the decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node but also pruning below other nodes in the tree.
Anyhow, this is somewhat of a question based on curiousity; I'm not sure it's practical to use any results, but am interested in what people have to say.
Edit:
Thinking about it further, I think the tree method is actually O(n / logn), assuming it's reasonably dense. Proof:
Suppose you have a binary tree with n items. It's height is log(n). Removing half the bottom will require n/2 removals. Removing the half the row above will require n/4. The sum of operations for each row is n-1. So the average number of removals is n-1 / log(n).
Provided the length of your bitfields is limited, the following may work:
First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included
Then operations become:
Remove(0, 0) = and with bitmask 1010101010101010
Remove(1, 0) = and with bitmask 0101010101010101
Remove(0, 2) = and with bitmask 1111000011110000
Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.
The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
Addendum:
Additional down-sides:
Once the size of an integer is exceeded, every added bit to the original bit-fields will double the storage space. However, this is not much worse than a typical scenario using another collection where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far less remaining values).
Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.
So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding, beyond that you're probably better off with the original strategy.
Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.
As with all performance optimizations, this one'd need some measurement to determine if it is worthwile.
If each decision bit and position are listed as objects, {bit value, k-th position}, you would end up with an array of length 2*k. If you link to each of these array positions from your item, represented as a linked list (which are of length k), using a pointer to the {bit, position} object as the node value, you can "invalidate" a bunch of items by simply deleting the {bit, position} object. This would require you, upon searching the list of items, to find "complete" items (it makes search REALLY slow?).
So something like:
[{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]
and linked from "0100", represented as: {0->3->4->6}
You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).
Oh well, I tried.
Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:
struct Remove {
bool set;
int index;
}
The remove function just pushes an object on the stack. Viola, O(1).
If you wanted to get fancy, your stack couldn't exceed (number of bits) without containing duplicate or impossible scenarios.
The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.
Two ways to do insert into the collection:
Apply the Remove rules upon insert, to clear out the stack, making in O(n). Gotta pay somewhere.
Each bitfield has to store it's index in the remove stack, to know what rules apply to it. Then, the stack size limit above wouldn't matter
If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at index (n+1)*2 and (n+1)*2-1. All the nodes at a given level are stored sequentially. The first node at at level x is 2^x-1 and there are 2^x elements at that level.
Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^n) memory which is much worse than the linked list and not practical at all.
I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.