Efficient algorithm to find a maximum common subset of two sets? - algorithm

Each set contains bunch of checksums. For example:
Set A:
{
4445968d0e100ad08323df8c895cea15
a67f8052594d6ba3f75502c0b91b868f
07736dde2f8484a4a3af463e05f039e3
5b1e374ff2ba949ab49870ca24d3163a
}
Set B:
{
6639e1da308fd7b04b7635a17450df7c
4445968d0e100ad08323df8c895cea15
a67f8052594d6ba3f75502c0b91b868f
}
The maximum common subset of A and B is:
{
4445968d0e100ad08323df8c895cea15
a67f8052594d6ba3f75502c0b91b868f
}
A lot of these operations will be performed, so I'm looking for an efficient algorithm to do so.
Thanks for your help.

Put one of the sets in a hash table and iterate through the other, discarding elements that aren't in the hash. Alternatively, sort both and iterate through them simultaneously, as in merge sort.
EDIT: The latter method creates a sorted result. I should add that if the sets are of widely disparate sizes and they're presorted (say because you're doing a bunch of intersections), then you can realize a large performance improvement by using "unbounded" binary search to skip ahead in the large list.

Stick them in a hashtable and note the exact collisions.

Add Set A to a structure where you can find if a checksum exists.
Loop Set B, check if element exists in Set A, if it exists, add to Set C
Set C is your common subset.

Make ordered vector/list A from Set A
Make ordered vector/list B from Set B
Iterate over ordered A,B making new step on smaller element - if identical, add to restult and move both.
When underlying set structure is ordered - common case is a kind of Tree (BST,AVL etc.), - then you need only last step to perform.
To make last step clear, here is it's pseudocode:
a = A.begin(); b = B.begin();
while(a!=A.end() && b!=B.end()){
if(*a==*b){
results.add(a);
++a; ++b;
} else if(*a < *b) {
++a;
} else {
++b;
}
}

Related

Data Structure with fast access to nth of elements satisfying condition

I'm filling a stack/vector (a dynamically sized container with fast random access by index with insertion only at the end) with composite data (a struct, class, tuple…). For a specific attribute with a small set of possible values, I will want to access the nth of all elements in the stack where this attribute satisfies a condition. To achieve this, additional information can be stored along each composite or in a separate data structure.
Note that the vector is large and that the compared attribute has a small value range but is compared to a set of allowed values. Also the attributes aren't distributed evenly throughout composites in the vector.
Pseudocode of a O(n) naïve approach. How can I improve this:
enum Fruit { apple, orange, banana, potato };
struct c {
Fruit a;
Data d;
}
// Let's assume v has a length of many thousand and that the distribution of fruits is *not* completely random e.g. that maybe potato only rarely occurs or that bananas tend to come in packs
c getFruit(vector<c> v, set<Fruit> s, int n) {
int counter=0;
// iterate over all of v's indices
for(int i=0 ; i<v.length; i+=1) {
if(v[i].a in s) {
if(n==counter) {
return v[i];
}
counter+=1;
}
}
}
// note: The attribute is compared to a set (arbitrary combination of fruits)!
getFruit(largeVector, set{apple, orange, potato}, 15234)
Another approach would be to create a vector for each possible set of fruits which would be super fast O(1) but not so memory efficient.
(Although I do have to implement this now, I'm really just asking out of curiousity because my data is small enough to just go with the naïve approach.)
Any argument why there doesn't seem to a more efficient way is very much approved as well.
Edit: It should be noted that new elements may be appended between queries for indices using the algorithm in question so any caches have to grow with the vector and both growing the vector and this filtered access should be fast.
For each index of the vector, store the preceding number of each fruit.
Then you can do a binary search to find the first index where the sum of the desired fruit counts is sufficient.
If you don't want to use that much memory, then store the counts in a separate arrays, and only store them for every 16th index or so in the main array. Your binary search will then get you an index within 16 positions of the desired answer, and you can do a linear scan from there.

Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.
I know that the obvious way to do this is to just to loop over every set in C and compute the intersection and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).
EDIT: approximate answers would be alright too
I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set as a bitmap is only 256 bytes. 4 million is only a gigabyte. Even a modestly spec'ed Nvidia has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.
If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.
I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:
s[i] // your query, an array of size 2 thousand, true/false
sign[i] // whether the ith element is positive/negative lookup. +/- 1
sets[i] // a list of all the sets that the ith element belongs/(doesn't) to
query(s):
overlaps[i] // an array of size C, initialized to 0's
for i in len(s):
if s[i]:
for j in sets[i]:
overlaps[j] += sign[i]
return max_index(overlaps)
Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.
To further optimize: you can sort the structure so that the elements that are most common/most rare are dealt with first. After you have done the first e.g. 3/4, you can do a quick pass to see if the closest matching set is so far ahead of the next set that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.
Yet another refinement: make sets[i] one of two possible structures: if the element is very rare or common, sets[i] is just a list of the sets that the ith element is in/not in. However, suppose the ith element is in half the sets. Then sets[i] is just a list of indices half as long as the number of sets, looping through it and incrementing overlaps is wasteful. Have a third value for sign[i]: if sign[i] == 0, then the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of sets in which it appears, it will simply be an array of 1's and 0's with length equal to C. Then you would just add the array in its entirety to overlaps which would be faster.
Put all of your elements, from the million sets into a Hashtable. The key will be the element, the value will be a set of indexes that point to a containing set.
HashSet<Element>[] AllSets = ...
// preprocess
Hashtable AllElements = new Hashtable(2000);
for(var index = 0; index < AllSets.Count; index++) {
foreach(var elm in AllSets[index]) {
if(!AllElements.ContainsKey(elm)) {
AllElements.Add(elm, new HashSet<int>() { index });
} else {
((HashSet<int>)AllElements[elm]).Add(index);
}
}
}
public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
// <index, count>
Dictionar<int, int> counts = new Dictionary<int, int>();
foreach(var elm in set) {
var setIndices = AllElements[elm] As HashSet<int>;
if(setIndices != null) {
foreach(var index in setIndices) {
if(!counts.ContainsKey(index)) {
counts.Add(index, 1);
} else {
counts[index]++;
}
}
}
}
return counts.OrderByDescending(kv => kv.Value)
.Take(top)
.Select(kv => AllSets[kv.Key]).ToList();
}

Divide a group of people into two disjoint subgroups (of arbitrary size) and find some values

As we know from programming, sometimes a slight change in a problem can
significantly alter the form of its solution.
Firstly, I want to create a simple algorithm for solving
the following problem and classify it using bigtheta
notation:
Divide a group of people into two disjoint subgroups
(of arbitrary size) such that the
difference in the total ages of the members of
the two subgroups is as large as possible.
Now I need to change the problem so that the desired
difference is as small as possible and classify
my approach to the problem.
Well,first of all I need to create the initial algorithm.
For that, should I make some kind of sorting in order to separate the teams, and how am I suppose to continue?
EDIT: for the first problem,we have ruled out the possibility of a set being an empty set. So all we have to do is just a linear search to find the min age and then put it in a set B. SetA now has all the other ages except the age of setB, which is the min age. So here is the max difference of the total ages of the two sets, as high as possible
The way you described the first problem, it is trivial in the way that it requires you to find only the minimum element (in case the subgroups should contain at least 1 member), otherwise it is already solved.
The second problem can be solved recursively the pseudo code would be:
// compute sum of all elem of array and store them in sum
min = sum;
globalVec = baseVec;
fun generate(baseVec, generatedVec, position, total)
if (abs(sum - 2*total) < min){ // check if the distribution is better
min = abs(sum - 2*total);
globalVec = generatedVec;
}
if (position >= baseVec.length()) return;
else{
// either consider elem at position in first group:
generate(baseVec,generatedVec.pushback(baseVec[position]), position + 1, total+baseVec[position]);
// or consider elem at position is second group:
generate(baseVec,generatedVec, position + 1, total);
}
And now just start the function with generate(baseVec,"",0,0) where "" stand for an empty vector.
The algo can be drastically improved by applying it to a sorted array, hence adding a test condition to stop branching, but the idea stays the same.

How to "sort" elements of 2 possible values in place in linear time? [duplicate]

This question already has answers here:
Stable separation for two classes of elements in an array
(3 answers)
Closed 9 years ago.
Suppose I have a function f and array of elements.
The function returns A or B for any element; you could visualize the elements this way ABBAABABAA.
I need to sort the elements according to the function, so the result is: AAAAAABBBB
The number of A values doesn't have to equal the number of B values. The total number of elements can be arbitrary (not fixed). Note that you don't sort chars, you sort objects that have a single char representation.
Few more things:
the sort should take linear time - O(n),
it should be performed in place,
it should be a stable sort.
Any ideas?
Note: if the above is not possible, do you have ideas for algorithms sacrificing one of the above requirements?
If it has to be linear and in-place, you could do a semi-stable version. By semi-stable I mean that A or B could be stable, but not both. Similar to Dukeling's answer, but you move both iterators from the same side:
a = first A
b = first B
loop while next A exists
if b < a
swap a,b elements
b = next B
a = next A
else
a = next A
With the sample string ABBAABABAA, you get:
ABBAABABAA
AABBABABAA
AAABBBABAA
AAAABBBBAA
AAAAABBBBA
AAAAAABBBB
on each turn, if you make a swap you move both, if not you just move a. This will keep A stable, but B will lose its ordering. To keep B stable instead, start from the end and work your way left.
It may be possible to do it with full stability, but I don't see how.
A stable sort might not be possible with the other given constraints, so here's an unstable sort that's similar to the partition step of quick-sort.
Have 2 iterators, one starting on the left, one starting on the right.
While there's a B at the right iterator, decrement the iterator.
While there's an A at the left iterator, increment the iterator.
If the iterators haven't crossed each other, swap their elements and repeat from 2.
Lets say,
Object_Array[1...N]
Type_A objs are A1,A2,...Ai
Type_B objs are B1,B2,...Bj
i+j = N
FOR i=1 :N
if Object_Array[i] is of Type_A
obj_A_count=obj_A_count+1
else
obj_B_count=obj_B_count+1
LOOP
Fill the resultant array with obj_A and obj_B with their respective counts depending on obj_A > obj_B
The following should work in linear time for a doubly-linked list. Because up to N insertion/deletions are involved that may cause quadratic time for arrays though.
Find the location where the first B should be after "sorting". This can be done in linear time by counting As.
Start with 3 iterators: iterA starts from the beginning of the container, and iterB starts from the above location where As and Bs should meet, and iterMiddle starts one element prior to iterB.
With iterA skip over As, find the 1st B, and move the object from iterA to iterB->previous position. Now iterA points to the next element after where the moved element used to be, and the moved element is now just before iterB.
Continue with step 3 until you reach iterMiddle. After that all elements between first() and iterB-1 are As.
Now set iterA to iterB-1.
Skip over Bs with iterB. When A is found move it to just after iterA and increment iterA.
Continue step 6 until iterB reaches end().
This would work as a stable sort for any container. The algorithm includes O(N) insertion/deletion, which is linear time for containers with O(1) insertions/deletions, but, alas, O(N^2) for arrays. Applicability in you case depends on whether the container is an array rather than a list.
If your data structure is a linked list instead of an array, you should be able to meet all three of your constraints. You just skim through the list and accumulating and moving the "B"s will be trivial pointer changes. Pseudo code below:
sort(list) {
node = list.head, blast = null, bhead = null
while(node != null) {
nextnode = node.next
if(node.val == "a") {
if(blast != null){
//move the 'a' to the front of the 'B' list
bhead.prev.next = node, node.prev = bhead.prev
blast.next = node.next, node.next.prev = blast
node.next = bhead, bhead.prev = node
}
}
else if(node.val == "b") {
if(blast == null)
bhead = blast = node
else //accumulate the "b"s..
blast = node
}
3
node = nextnode
}
}
So, you can do this in an array, but the memcopies, that emulate the list swap, will make it quiet slow for large arrays.
Firstly, assuming the array of A's and B's is either generated or read-in, I wonder why not avoid this question entirely by simply applying f as the list is being accumulated into memory into two lists that would subsequently be merged.
Otherwise, we can posit an alternative solution in O(n) time and O(1) space that may be sufficient depending on Sir Bohumil's ultimate needs:
Traverse the list and sort each segment of 1,000,000 elements in-place using the permutation cycles of the segment (once this step is done, the list could technically be sorted in-place by recursively swapping the inner-blocks, e.g., ABB AAB -> AAABBB, but that may be too time-consuming without extra space). Traverse the list again and use the same constant space to store, in two interval trees, the pointers to each block of A's and B's. For example, segments of 4,
ABBAABABAA => AABB AABB AA + pointers to blocks of A's and B's
Sequential access to A's or B's would be immediately available, and random access would come from using the interval tree to locate a specific A or B. One option could be to have the intervals number the A's and B's; e.g., to find the 4th A, look for the interval containing 4.
For sorting, an array of 1,000,000 four-byte elements (3.8MB) would suffice to store the indexes, using one bit in each element for recording visited indexes during the swaps; and two temporary variables the size of the largest A or B. For a list of one billion elements, the maximum combined interval trees would number 4000 intervals. Using 128 bits per interval, we can easily store numbered intervals for the A's and B's, and we can use the unused bits as pointers to the block index (10 bits) and offset in the case of B (20 bits). 4000*16 bytes = 62.5KB. We can store an additional array with only the B blocks' offsets in 4KB. Total space under 5MB for a list of one billion elements. (Space is in fact dependent on n but because it is extremely small in relation to n, for all practical purposes, we may consider it O(1).)
Time for sorting the million-element segments would be - one pass to count and index (here we can also accumulate the intervals and B offsets) and one pass to sort. Constructing the interval tree is O(nlogn) but n here is only 4000 (0.00005 of the one-billion list count). Total time O(2n) = O(n)
This should be possible with a bit of dynamic programming.
It works a bit like counting sort, but with a key difference. Make arrays of size n for both a and b count_a[n] and count_b[n]. Fill these arrays with how many As or Bs there has been before index i.
After just one loop, we can use these arrays to look up the correct index for any element in O(1). Like this:
int final_index(char id, int pos){
if(id == 'A')
return count_a[pos];
else
return count_a[n-1] + count_b[pos];
}
Finally, to meet the total O(n) requirement, the swapping needs to be done in a smart order. One simple option is to have recursive swapping procedure that doesn't actually perform any swapping until both elements would be placed in correct final positions. EDIT: This is actually not true. Even naive swapping will have O(n) swaps. But doing this recursive strategy will give you absolute minimum required swaps.
Note that in general case this would be very bad sorting algorithm since it has memory requirement of O(n * element value range).

Permutations with extra restrictions

I have a set of items, for example: {1,1,1,2,2,3,3,3}, and a restricting set of sets, for example {{3},{1,2},{1,2,3},{1,2,3},{1,2,3},{1,2,3},{2,3},{2,3}. I am looking for permutations of items, but the first element must be 3, and the second must be 1 or 2, etc.
One such permutation that fits is:
{3,1,1,1,2,2,3}
Is there an algorithm to count all permutations for this problem in general? Is there a name for this type of problem?
For illustration, I know how to solve this problem for certain types of "restricting sets".
Set of items: {1,1,2,2,3}, Restrictions {{1,2},{1,2,3},{1,2,3},{1,2},{1,2}}. This is equal to 2!/(2-1)!/1! * 4!/2!/2!. Effectively permuting the 3 first, since it is the most restrictive and then permuting the remaining items where there is room.
Also... polynomial time. Is that possible?
UPDATE: This is discussed further at below links. The problem above is called "counting perfect matchings" and each permutation restriction above is represented by a {0,1} on a matrix of slots to occupants.
https://math.stackexchange.com/questions/519056/does-a-matrix-represent-a-bijection
https://math.stackexchange.com/questions/509563/counting-permutations-with-additional-restrictions
https://math.stackexchange.com/questions/800977/parking-cars-and-vans-into-car-van-and-car-van-parking-spots
All of the other solutions here are exponential time--even for cases that they don't need to be. This problem exhibits similar substructure, and so it should be solved with dynamic programming.
What you want to do is write a class that memoizes solutions to subproblems:
class Counter {
struct Problem {
unordered_multiset<int> s;
vector<unordered_set<int>> v;
};
int Count(Problem const& p) {
if (m.v.size() == 0)
return 1;
if (m.find(p) != m.end())
return m[p];
// otherwise, attack the problem choosing either choosing an index 'i' (notes below)
// or a number 'n'. This code only illustrates choosing an index 'i'.
Problem smaller_p = p;
smaller_p.v.erase(v.begin() + i);
int retval = 0;
for (auto it = p.s.begin(); it != p.s.end(); ++it) {
if (smaller_p.s.find(*it) == smaller_p.s.end())
continue;
smaller_p.s.erase(*it);
retval += Count(smaller_p);
smaller_p.s.insert(*it);
}
m[p] = retval;
return retval;
}
unordered_map<Problem, int> m;
};
The code illustrates choosing an index i, which should be chosen at a place where there are v[i].size() is small. The other option is to choose a number n, which should be one for which there are few locations v that it can be placed in. I'd say the minimum of the two deciding factors should win.
Also, you'll have to define a hash function for Problem -- that shouldn't be too hard using boost's hash stuff.
This solution can be improved by replacing the vector with a set<>, and defining a < operator for unordered_set. This will collapse many more identical subproblems into a single map element, and further reduce mitigate exponential blow-up.
This solution can be further improved by making Problem instances that are the same except that the numbers are rearranged hash to the same value and compare to be the same.
You might consider a recursive solution that uses a pool of digits (in the example you provide, it would be initialized to {1,1,1,2,2,3,3,3}), and decides, at the index given as a parameter, which digit to place at this index (using, of course, the restrictions that you supply).
If you like, I can supply pseudo-code.
You could build a tree.
Level 0: Create a root node.
Level 1: Append each item from the first "restricting set" as children of the root.
Level 2: Append each item from the second restricting set as children of each of the Level 1 nodes.
Level 3: Append each item from the third restricting set as children of each of the Level 2 nodes.
...
The permutation count is then the number of leaf nodes of the final tree.
Edit
It's unclear what is meant by the "set of items" {1,1,1,2,2,3,3,3}. If that is meant to constrain how many times each value can be used ("1" can be used 3 times, "2" twice, etc.) then we need one more step:
Before appending a node to the tree, remove the values used on the current path from the set of items. If the value you want to append is still available (e.g. you want to append a "1", and "1" has only been used twice so far) then append it to the tree.
To save space, you could build a directed graph instead of a tree.
Create a root node.
Create a node for each item in the
first set, and link from the root to
the new nodes.
Create a node for each item in the
second set, and link from each first
set item to each second set item.
...
The number of permutations is then the number of paths from the root node to the nodes of the final set.

Resources