Grouping sets algorithm - algorithm

Need to develop an algorithm to solve the following task
Given:
The N sets with a different number of elements
Expected Result:
The new M sets containing ≥X common elements of the N sets
Example:
N1=[1,2,3,4,5]
N2=[2,3,5]
N3=[1,3,5]
N4=[1,2]
if X=3:
M1=[1] (from N1,3,4)
M2=[2] (from N1,2,4)
M3=[3,5] (from N1,2,3)

Given N sets (noted Ni) of sorted integers, initialize N variablesHiwhich will hold the heads of each set.
While there still exist indexesHithat haven't reached the end of their respectiveNi, iterate over the valuesVi=Ni[Hi]and find the minimum valueVmin, count the number of occurrencesnand store the corresponding indexesj(which you can all do in one loop).
Increment theHj.
Ifn>X, this gives you a new setM = [Vmin] (from Nj).
Up to you to model the data representation accordingly so as to use(from Nj)as a map key.

Related

Groupping a list of tuples into sets with coordinate wise matching

Suppose that there is a list of n l-tuples. One is interested in grouping this list into sets where each set containing m tuples such that there is a maximal match coordinate wise. For example:
input: {(1,2,3), (1,2,4), (2,3,1), (2,3,1), (4,3,1), (2,1,4)}, m = 3
output: {(1,2,3), (1,2,4), (2,1,4)}, {(2,3,1), (2,3,1), (4,3,1)}
It is important to note the possibility of those cases(with certain values for n and m) that result in having a set with fewer elements than others or a set with more elements than m elements(tuples).
Questions: What is the name of this problem in the literature? Is there exists an algorithm that performing this task? What about leaving the number of tuples in each partition, m not fixed, and determining the optimal such m with the given restriction, if it makes sense.
Thank you.

How to assign many subsets to their largest supersets?

My data has large number of sets (few millions). Each of those set size is between few members to several tens of thousands integers. Many of those sets are subsets of larger sets (there are many of those super-sets). I'm trying to assign each subset to it's largest superset.
Please can anyone recommend algorithm for this type of task?
There are many algorithms for generating all possible sub-sets of a set, but this type of approach is time-prohibitive given my data size (e.g. this paper or SO question).
Example of my data-set:
A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}
Expected answer: B and A are subset of F (it's not important B is also subset of A); C is a subset of D; E remains unassigned.
Here's an idea that might work:
Build a table that maps number to a sorted list of sets, sorted first by size with largest first, and then, by size, arbitrarily but with some canonical order. (Say, alphabetically by set name.) So in your example, you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E] and so on. This can be implemented to take O(n log n) time where n is the total size of the input.
For each set in the input:
fetch the lists associated with each entry in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of selects you'll issue in the runtime of the whole algorithm is O(n), so runtime so far is O(n log n + n) which is still O(n log n).
Now walk down each list simultaneously. If a set is the first entry in all three lists, then it's the largest set that contains the input set. Output that association and continue with the next input list. If not, then discard the smallest item among all the items in the input lists and try again. Implementing this last bit is tricky, but you can store the heads of all lists in a heap and get (IIRC) something like O(n log k) overall runtime where k is the maximum size of any individual set, so you can bound that at O(n log n) in the worst case.
So if I got everything straight, the runtime of the algorithm is overall O(n log n), which seems like probably as good as you're going to get for this problem.
Here is a python implementation of the algorithm:
from collections import defaultdict, deque
import heapq
def LargestSupersets(setlists):
'''Computes, for each item in the input, the largest superset in the same input.
setlists: A list of lists, each of which represents a set of items. Items must be hashable.
'''
# First, build a table that maps each element in any input setlist to a list of records
# of the form (-size of setlist, index of setlist), one for each setlist that contains
# the corresponding element
element_to_entries = defaultdict(list)
for idx, setlist in enumerate(setlists):
entry = (-len(setlist), idx) # cheesy way to make an entry that sorts properly -- largest first
for element in setlist:
element_to_entries[element].append(entry)
# Within each entry, sort so that larger items come first, with ties broken arbitrarily by
# the set's index
for entries in element_to_entries.values():
entries.sort()
# Now build up the output by going over each setlist and walking over the entries list for
# each element in the setlist. Since the entries list for each element is sorted largest to
# smallest, the first entry we find that is in every entry set we pulled will be the largest
# element of the input that contains each item in this setlist. We are guaranteed to eventually
# find such an element because, at the very least, the item we're iterating on itself is in
# each entries list.
output = []
for idx, setlist in enumerate(setlists):
num_elements = len(setlist)
buckets = [element_to_entries[element] for element in setlist]
# We implement the search for an item that appears in every list by maintaining a heap and
# a queue. We have the invariants that:
# 1. The queue contains the n smallest items across all the buckets, in order
# 2. The heap contains the smallest item from each bucket that has not already passed through
# the queue.
smallest_entries_heap = []
smallest_entries_deque = deque([], num_elements)
for bucket_idx, bucket in enumerate(buckets):
smallest_entries_heap.append((bucket[0], bucket_idx, 0))
heapq.heapify(smallest_entries_heap)
while (len(smallest_entries_deque) < num_elements or
smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
# First extract the next smallest entry in the queue ...
(smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
smallest_entries_deque.append(smallest_entry)
# ... then add the next-smallest item from the bucket that we just removed an element from
if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))
output.append((idx, smallest_entries_deque[0][1]))
return output
Note: don't trust my writeup too much here. I just thought of this algorithm right now, I haven't proved it correct or anything.
So you have millions of sets, with thousands of elements each. Just representing that dataset takes billions of integers. In your comparisons you'll quickly get to trillions of operations without even breaking a sweat.
Therefore I'll assume that you need a solution which will distribute across a lot of machines. Which means that I'll think in terms of https://en.wikipedia.org/wiki/MapReduce. A series of them.
Read the sets in, mapping them to k:v pairs of i: s where i is an element of the set s.
Receive a key of an integers, along with a list of sets. Map them off to pairs (s1, s2): i where s1 <= s2 are both sets that included to i. Do not omit to map each set to be paired with itself!
For each pair (s1, s2) count the size k of the intersection, and send off pairs s1: k, s2: k. (Only send the second if s1 and s2 are different.
For each set s receive the set of supersets. If it is maximal, send off s: s. Otherwise send off t: s for every t that is a strict superset of s.
For each set s, receive the set of subsets, with s in the list only if it is maximal. If s is maximal, send off t: s for every t that is a subset of s.
For each set we receive the set of maximal sets that it is a subset of. (There may be many.)
There are a lot of steps for this, but at its heart it requires repeated comparisons between pairs of sets with a common element for each common element. Potentially that is O(n * n * m) where n is the number of sets and m is the number of distinct elements that are in many sets.
Here is a simple suggestion for an algorithm that might give better results based on your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members, a lot of super/subsets). Of course it depends a lot on your data. Generally speaking complexity is much worse than for the other proposed algorithms. Maybe you could only process the sets with less than X, e.g. 1000 members that way and for the rest use the other proposed methods.
Sort the sets by their size.
Remove the first (smallest) set and start comparing it against the others from behind (largest set first).
Stop as soon as you found a superset and create a relation. Just remove if no superset was found.
Repeat 2. and 3. for all but the last set.
If you're using Excel, you could structure it as follows:
1) Create a cartesian plot as a two-way table that has all your data sets as titles on both the side and the top
2) In a seperate tab, create a row for each data set in the first column, along with a second column that will count the number of entries (ex: F has 4) and then just stack FIND(",") and MID formulas across the sheet to split out all the entries within each data set. Use the counter in the second column to do COUNTIF(">0"). Each variable you find can be your starting point in a subsequent FIND until it runs out of variables and just returns a blank.
3) Go back to your cartesian plot, and bring over the separate entries you just generated for your column titles (ex: F is 1,2,3,7). Use an AND statement to then check that each entry in your left hand column is in your top row data set using an OFFSET to your seperate area and utilizing your counter as the width for the OFFSET

Check if a collection of sets is pairwise disjoint

What is the most efficient way to determine whether a collection of sets is pairwise disjoint? -- i.e. verifying that the intersection between all pairs of sets is empty. How efficiently can this be done?
The sets from a collection are pairwise disjoint if, and only if, the size of their union equals the sum of their sizes (this statement applies to finite sets):
def pairwise_disjoint(sets) -> bool:
union = set().union(*sets)
return len(union) == sum(map(len, sets))
This could be a one-liner, but readability counts.
Expected linear time O(total number of elements):
def all_disjoint(sets):
union = set()
for s in sets:
for x in s:
if x in union:
return False
union.add(x)
return True
This is optimal under the assumption that your input is a collection of sets represented as some kind of unordered data structure (hash table?), because than you need to look at every element at least once.
You can do much better by using a different representation for your sets. For example, by maintaining a global hash table that stores for each element the number of sets it is stored in, you can do all the set operations optimally and also check for disjointness in O(1).
Using Python as psudo-code. The following tests for the intersection of each pair of sets only once.
def all_disjoint(sets):
S = list(sets)
while S:
s = S.pop() # remove an element
# loop over the remaining ones
for t in S:
# test for intersection
if not s.isdisjoint(t):
return False
return True
The number of intersection tests is the same as the number of edges in a fully connected graph with the same number of vertexes as there are sets. It also exits early if any pair is found not to be disjoint.

Group numbers into closest groups

For example, I have the numbers 46,47,54,58,60, and 66. I want to make group them in such a way as to make the largest possible group sizes. Numbers get grouped if their values are within plus or minus 10 (inclusive). So, depending on which number you start with, for this example there can be three possible outcomes (shown in the image).
What I want is the second possible outcome, which would occur if you started with 54, as the numbers within 44 to 64 would be grouped, leaving 66 by itself, and creating the largest group (5 items).
I realize I could easily brute force this example, but I really have a long list of numbers and it needs to do this across thousands of numbers.. Can anyone tell me about algorithms I should be reading about or give me suggestions?
You can simply sort the array first. Then for every i th number you can do a binary search to find the right most number that's within ith number + 20 range, let the position of such right most index is X. You have to find the largest (X-i+1) for all ith numbers and we are done :)
Runtime analysis: Runtime for this algorithm will be O(NlgN), where N is the number of items in the original array.
A better solution: Let's assume we have the array ar[] and ar[] has N items.
sort ar[] in non decreasing order
set max_result = 0, set cur_index = 0, i=0
increase i while i
set max_result to max(max_result,i-cur_index+1)
set cur_index=cur_index+1
if cur_index
Runtime Analysis: O(N), where N is the number of items in the array ar[] as cur_index will iterate through the array exactly once and i will iterate just once too.
Correctness: as the array is sorted in non decreasing order, if i < j and j < k and ar[i]+20 > ar[k] then ar[j]+20 > ar[k] too. So we don't need to check for these items those are already checked for previous item.
This is what I wanted to do. Sorry I didn't explain myself very well. Each iteration finds the largest possible group, using the numbers that are left after removing the previous largest group. Matlab code:
function out=groupNums(y)
d=10;
out=[];
if length(y)==1
out=y;
return
end
group=[];
for i=1:length(y)
group{i}=find(y<=y(i)+d & y>=y(i)-d);
end
[~,idx]=max(cellfun(#length,group));
out=[out,{y(group{idx})}];
y(group{idx})=[];
out=[out,groupNums(y)];

Generate a number is range (1,n) but not in a list (i,j)

How can I generate a random number that is in the range (1,n) but not in a certain list (i,j)?
Example: range is (1,500), list is [1,3,4,45,199,212,344].
Note: The list may not be sorted
Rejection Sampling
One method is rejection sampling:
Generate a number x in the range (1, 500)
Is x in your list of disallowed values? (Can use a hash-set for this check.)
If yes, return to step 1
If no, x is your random value, done
This will work fine if your set of allowed values is significantly larger than your set of disallowed values:if there are G possible good values and B possible bad values, then the expected number of times you'll have to sample x from the G + B values until you get a good value is (G + B) / G (the expectation of the associated geometric distribution). (You can sense check this. As G goes to infinity, the expectation goes to 1. As B goes to infinity, the expectation goes to infinity.)
Sampling a List
Another method is to make a list L of all of your allowed values, then sample L[rand(L.count)].
The technique I usually use when the list is length 1 is to generate a random
integer r in [1,n-1], and if r is greater or equal to that single illegal
value then increment r.
This can be generalised for a list of length k for small k but requires
sorting that list (you can't do your compare-and-increment in random order). If the list is moderately long, then after the sort you can start with a bsearch, and add the number of values skipped to r, and then recurse into the remainder of the list.
For a list of length k, containing no value greater or equal to n-k, you
can do a more direct substitution: generate random r in [1,n-k], and
then iterate through the list testing if r is equal to list[i]. If it is
then set r to n-k+i (this assumes list is zero-based) and quit.
That second approach fails if some of the list elements are in [n-k,n].
I could try to invest something clever at this point, but what I have so far
seems sufficient for uniform distributions with values of k much less than
n...
Create two lists -- one of illegal values below n-k, and the other the rest (this can be done in place).
Generate random r in [1,n-k]
Apply the direct substitution approach for the first list (if r is list[i] then set r to n-k+i and go to step 5).
If r was not altered in step 3 then we're finished.
Sort the list of larger values and use the compare-and-increment method.
Observations:
If all values are in the lower list, there will be no sort because there is nothing to sort.
If all values are in the upper list, there will be no sort because there is no occasion on which r is moved into the hazardous area.
As k approaches n, the maximum size of the upper (sorted) list grows.
For a given k, if more value appear in the upper list (the bigger the sort), the chance of getting a hit in the lower list shrinks, reducing the likelihood of needing to do the sort.
Refinement:
Obviously things get very sorty for large k, but in such cases the list has comparatively few holes into which r is allowed to settle. This could surely be exploited.
I might suggest something different if many random values with the same
list and limits were needed. I hope that the list of illegal values is not the
list of results of previous calls to this function, because if it is then you
wouldn't want any of this -- instead you would want a Fisher-Yates shuffle.
Rejection sampling would be the simplest if possible as described already. However, if you didn't want use that, you could convert the range and disallowed values to sets and find the difference. Then, you could choose a random value out of there.
Assuming you wanted the range to be in [1,n] but not in [i,j] and that you wanted them uniformly distributed.
In Python
total = range(1,n+1)
disallowed = range(i,j+1)
allowed = list( set(total) - set(disallowed) )
return allowed[random.randrange(len(allowed))]
(Note that this is not EXACTLY uniform since in all likeliness, max_rand%len(allowed) != 0 but this will in most practical applications be very close)
I assume that you know how to generate a random number in [1, n) and also your list is ordered like in the example above.
Let's say that you have a list with k elements. Make a map(O(logn)) structure, which will ensure speed if k goes higher. Put all elements from list in map, where element value will be the key and "good" value will be the value. Later on I'll explain about "good" value. So when we have the map then just find a random number in [1, n - k - p)(Later on I'll explain what is p) and if this number is in map then replace it with "good" value.
"GOOD" value -> Let's start from k-th element. It's good value is its own value + 1, because the very next element is "good" for us. Now let's look at (k-1)th element. We assume that its good value is again its own value + 1. If this value is equal to k-th element then the "good" value for (k-1)th element is k-th "good" value + 1. Also you will have to store the largest "good" value. If the largest value exceed n then p(from above) will be p = largest - n.
Of course I recommend you this only if k is big number otherwise #Timothy Shields' method is perfect.

Resources