Way to calculate the number of distinct partitions of a set containing only one kind of element

We know 3 things:
n (the number of elements in the set)
k (the number of parts)
the set S = {x, x, x, ..., x (n times)} (here x can have any integral value)
We have to find, as a number, how many distinct partitions of the set S into k parts are possible.
Is there any way (a formula or a procedure) to find the result from the given values?
EXAMPLES:
Input: n = 3, k = 2
Output: 4
Explanation: Let the set be {0,0,0} (assuming x = 0). We can partition
it into 2 subsets in the following ways:
{{0,0}, {0}}, {{0}, {0,0}}, {{0,0,0}, {}},
{{}, {0,0,0}}.
For instance, {{0,0}, {0}} is made up of 2 subsets, namely {0,0}
and {0}, and it uses x (= 0) exactly n (= 3) times.
Input: n = 3, k = 1
Output: 1
Explanation: There is only one way: {{1, 1, 1}} (assuming x = 1)
Note:
I know I used the word "set" in the problem, but a set is defined as a collection of distinct elements. So you can either consider it a multiset or an array, or you can assume that a set can hold identical elements for this particular problem.
I am just trying to use the same terminology as in the problem.
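For what it's worth, here is a small brute-force sketch (my own illustration, not part of the question) that reproduces the example counts: since all n elements are identical, an ordered partition into k subsets is just a way of writing n as an ordered sum of k non-negative part sizes.
def count_partitions(n, k):
    # Number of ways to split n identical elements into k ordered, possibly empty parts.
    if k == 0:
        return 1 if n == 0 else 0
    return sum(count_partitions(n - first, k - 1) for first in range(n + 1))

print(count_partitions(3, 2))  # 4, matching the first example
print(count_partitions(3, 1))  # 1, matching the second example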

Related

Select one element from each set but the selected element should not be repeated

I have a few sets, say 5 of them
{1,2,3}
{2,4}
{1,2}
{4,5}
{2,3,5}
Here, I need to choose at least 3 elements from any three sets (one element per set), given that once an element is selected, it cannot be selected again.
I also need to check whether any solution exists.
E.g.
set {1,2,3} -> choose 1
set {2,4} -> choose 2
set {1,2} -> cannot choose from it, since both 1 and 2 are already chosen.
set {2,5} -> can only choose 5
Is there a way to achieve this? Simple explanation would be appreciated.
If you only need 3 elements, then the algorithm is quite simple. Just repeat the following procedure:
1. Select the set with the lowest heuristic. The heuristic is the length of the set divided by the total number of occurrences of that set. If the set has zero elements, remove it and go to step 4. If two or more sets are tied for the lowest heuristic, you can choose any one of them.
2. Pick an element from that set. This is the element you'll choose.
3. Remove this element from every set.
4. If you have picked 3 elements or there are no more sets remaining, exit. Otherwise go to step 1.
This algorithm gives at least 3 elements whenever it's possible, even in the presence of duplicates. Here's the proof.
If the heuristic for a set is <= 1, picking an element from that set is basically free. It doesn't hurt the ability to use other sets at all.
If we are in a situation with 2 or more sets with heuristic > 1, and we have to pick at least two elements, this is easy. Just pick one from the first, and the second one will still have an element left, because its length is > 1, since its heuristic is > 1.
If we are in a situation with 3 or more sets with heuristic > 1, we can pick from the first set. After this we are left with at least two sets, where at least one of them has more than one element. We can't be left with two size-one sets, because that would imply that the 3 sets we started with contained a duplicated length-2 set, which has heuristic 1. Thus we can pick all 3 elements.
Here is Python code for this algorithm. The generator returns as many elements as it can manage. If it's possible to return at least 3 elements, it will; beyond that, however, it doesn't always return the optimal solution.
def choose(sets):
    # Copy the input, to avoid modification of the input
    s = [{*e} for e in sets]
    while True:
        # If there are no more sets remaining
        if not s: return
        # Remove based on length and number of duplicates
        m = min(s, key=lambda x: (len(x) / s.count(x)))
        s.remove(m)
        # Ignore empty sets
        if m:
            # Remove a random element
            e = m.pop()
            # Yield it
            yield e
            # Remove the chosen element e from other sets
            for i in range(len(s)): s[i].discard(e)

print([*choose([{1,2,3}, {2,4}, {1,2}, {4,5}, {2,3,5}])])
print([*choose([{1}, {2,3}, {2,4}, {1,2,4}])])
print([*choose([{1,2}, {2}, {2,3,4}])])
print([*choose([{1,2}, {2}, {2,1}])])
print([*choose([{1,2}, {1,3}, {1,3}])])
print([*choose([{1}, {1,2,3}, {1,2,3}])])
print([*choose([{1,3}, {2,3,4}, {2,3,4}, {2,3,4}, {2,3,4}])])
print([*choose([{1,5}, {2,3}, {1,3}, {1,2,3}])])
Something like this. Given your sets:
0: {1,2,3}
1: {2,4}
2: {1,2}
3: {4,5}
4: {2,3,5}
Build an array of sets:
set A[1] = { 0, 2} // all sets containing 1
A[2] = { 0, 1, 2, 4} // all sets containing 2
A[3] = { 0, 4 } // all sets containing 3
A[4] = { 1, 3 } // all sets containing 4
A[5] = { 3, 4 } // all sets containing 5
set<int> result;
for (i = 0; i < 3; i++) {
    find k such that A[k] is not empty
    if no such k exists then return "no solution"
    result.add(k)
    A[k] = empty
}
return result
I think my idea is a bit of an overkill, but it would work on any collection of sets, with any number of sets of any size.
The idea is to transform the sets into a bipartite graph. On one side you have the sets, and on the other side you have the numbers they contain,
and if a set contains a number, there is an edge between those two vertices.
Eventually you're trying to find a maximum matching in the graph (maximum cardinality matching).
This can be done with the Hopcroft–Karp algorithm in O(E√V) time, or with the Ford–Fulkerson algorithm.
Here are some links with more material on maximum matching and the algorithms:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
https://en.wikipedia.org/wiki/Maximum_cardinality_matching
https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm
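For illustration, here is a minimal sketch of the matching idea in Python (my own code, using a simple augmenting-path search in the style of Kuhn's algorithm rather than Hopcroft–Karp): sets on one side, values on the other, an edge wherever a set contains a value, and we check whether a matching of size at least 3 exists.
def max_matching(sets):
    match = {}  # value -> index of the set it is currently matched to

    def try_assign(i, visited):
        # Try to give set i one of its values, re-assigning other sets if necessary.
        for v in sets[i]:
            if v in visited:
                continue
            visited.add(v)
            if v not in match or try_assign(match[v], visited):
                match[v] = i
                return True
        return False

    matched = sum(try_assign(i, set()) for i in range(len(sets)))
    return matched, {i: v for v, i in match.items()}

size, assignment = max_matching([{1,2,3}, {2,4}, {1,2}, {4,5}, {2,3,5}])
print(size >= 3, assignment)  # a solution exists iff at least 3 sets get distinct values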

How to assign many subsets to their largest supersets?

My data has a large number of sets (a few million). Each set's size ranges from a few members to several tens of thousands of integers. Many of those sets are subsets of larger sets (there are many such supersets). I'm trying to assign each subset to its largest superset.
Can anyone please recommend an algorithm for this type of task?
There are many algorithms for generating all possible subsets of a set, but this type of approach is time-prohibitive given my data size (e.g. this paper or this SO question).
Example of my data-set:
A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}
Expected answer: B and A are subsets of F (it's not important that B is also a subset of A); C is a subset of D; E remains unassigned.
Here's an idea that might work:
Build a table that maps each number to a sorted list of sets, sorted first by size with the largest first, and then, within each size, arbitrarily but in some canonical order. (Say, alphabetically by set name.) So in your example, you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E] and so on. This can be implemented to take O(n log n) time where n is the total size of the input.
For each set in the input:
fetch the lists associated with each entry in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of lookups you'll issue over the runtime of the whole algorithm is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
Now walk down all of those lists simultaneously. If a set is the first entry in every list, then it's the largest set that contains the input set. Output that association and continue with the next input set. If not, then discard the smallest item among the current heads of the lists and try again. Implementing this last bit is tricky, but you can store the heads of all lists in a heap and get (IIRC) something like O(n log k) overall runtime, where k is the maximum size of any individual set, so you can bound that by O(n log n) in the worst case.
So if I got everything straight, the runtime of the algorithm is overall O(n log n), which is probably as good as you're going to get for this problem.
Here is a Python implementation of the algorithm:
from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
    '''Computes, for each item in the input, the largest superset in the same input.

    setlists: A list of lists, each of which represents a set of items. Items must be hashable.
    '''
    # First, build a table that maps each element in any input setlist to a list of records
    # of the form (-size of setlist, index of setlist), one for each setlist that contains
    # the corresponding element
    element_to_entries = defaultdict(list)
    for idx, setlist in enumerate(setlists):
        entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
        for element in setlist:
            element_to_entries[element].append(entry)

    # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
    # the set's index
    for entries in element_to_entries.values():
        entries.sort()

    # Now build up the output by going over each setlist and walking over the entries list for
    # each element in the setlist. Since the entries list for each element is sorted largest to
    # smallest, the first entry we find that is in every entry set we pulled will be the largest
    # element of the input that contains each item in this setlist. We are guaranteed to eventually
    # find such an element because, at the very least, the item we're iterating on itself is in
    # each entries list.
    output = []
    for idx, setlist in enumerate(setlists):
        num_elements = len(setlist)
        buckets = [element_to_entries[element] for element in setlist]

        # We implement the search for an item that appears in every list by maintaining a heap and
        # a queue. We have the invariants that:
        #   1. The queue contains the n smallest items across all the buckets, in order
        #   2. The heap contains the smallest item from each bucket that has not already passed
        #      through the queue.
        smallest_entries_heap = []
        smallest_entries_deque = deque([], num_elements)
        for bucket_idx, bucket in enumerate(buckets):
            smallest_entries_heap.append((bucket[0], bucket_idx, 0))
        heapq.heapify(smallest_entries_heap)

        while (len(smallest_entries_deque) < num_elements or
               smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
            # First extract the next smallest entry in the queue ...
            (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
            smallest_entries_deque.append(smallest_entry)
            # ... then add the next-smallest item from the bucket that we just removed an element from
            if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
                new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
                heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

        output.append((idx, smallest_entries_deque[0][1]))

    return output
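As a quick check (my addition, not from the original answer), running it on the question's example with A..F mapped to indices 0..5 should give:
sets = [[1, 2, 3], [1, 3], [2, 4], [2, 4, 9], [3, 5], [1, 2, 3, 7]]
print(LargestSupersets(sets))
# [(0, 5), (1, 5), (2, 3), (3, 3), (4, 4), (5, 5)]
# i.e. A and B map to F, C maps to D, and D, E, F map to themselves.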
Note: don't trust my writeup too much here. I just thought of this algorithm right now, I haven't proved it correct or anything.
So you have millions of sets, with thousands of elements each. Just representing that dataset takes billions of integers. In your comparisons you'll quickly get to trillions of operations without even breaking a sweat.
Therefore I'll assume that you need a solution that can be distributed across a lot of machines, which means that I'll think in terms of MapReduce (https://en.wikipedia.org/wiki/MapReduce), or rather a series of MapReduce jobs.
1. Read the sets in, mapping them to k:v pairs of the form i: s, where i is an element of the set s.
2. Receive a key i (an integer) along with the list of sets that contain it. Map them off to pairs (s1, s2): i, where s1 <= s2 are both sets that include i. Do not forget to pair each set with itself!
3. For each pair (s1, s2), count the size k of the intersection and send off the pairs s1: k and s2: k. (Only send the second if s1 and s2 are different.)
4. For each set s, receive the set of its supersets. If s is maximal, send off s: s. Otherwise send off t: s for every t that is a strict superset of s.
5. For each set s, receive the set of its subsets, with s itself in the list only if it is maximal. If s is maximal, send off t: s for every t that is a subset of s.
6. For each set, we receive the set of maximal sets that it is a subset of. (There may be many.)
There are a lot of steps for this, but at its heart it requires repeated comparisons between pairs of sets with a common element for each common element. Potentially that is O(n * n * m) where n is the number of sets and m is the number of distinct elements that are in many sets.
Here is a simple suggestion for an algorithm that might give better results for your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members, and a lot of super-/subsets). Of course it depends a lot on your data, and generally speaking its complexity is much worse than that of the other proposed algorithms. Maybe you could process only the sets with fewer than X (e.g. 1000) members this way and use the other proposed methods for the rest.
1. Sort the sets by their size.
2. Remove the first (smallest) set and start comparing it against the others from behind (largest set first).
3. Stop as soon as you find a superset and record the relation. Just remove the set if no superset was found.
4. Repeat steps 2 and 3 for all but the last set.
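A minimal Python sketch of this size-sorted scan (my own illustration, assuming the input is a dict of named sets); it is quadratic in the number of sets in the worst case:
def assign_supersets(named_sets):
    # named_sets: dict mapping a name to a set of integers
    order = sorted(named_sets, key=lambda name: len(named_sets[name]))
    relations = {}
    for i, small in enumerate(order[:-1]):
        # Compare against the remaining (larger or equal) sets, largest first.
        for big in reversed(order[i + 1:]):
            if named_sets[small] <= named_sets[big]:
                relations[small] = big   # largest superset found, stop
                break
    return relations

data = {'A': {1, 2, 3}, 'B': {1, 3}, 'C': {2, 4}, 'D': {2, 4, 9},
        'E': {3, 5}, 'F': {1, 2, 3, 7}}
print(assign_supersets(data))  # {'B': 'F', 'C': 'D', 'A': 'F'}; E stays unassigned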
If you're using Excel, you could structure it as follows:
1) Create a Cartesian plot as a two-way table that has all your data sets as titles on both the side and the top.
2) In a separate tab, create a row for each data set in the first column, along with a second column that will count the number of entries (e.g. F has 4), and then just stack FIND(",") and MID formulas across the sheet to split out all the entries within each data set. Use the counter in the second column to do COUNTIF(">0"). Each value you find can be your starting point in a subsequent FIND until it runs out of values and just returns a blank.
3) Go back to your Cartesian plot, and bring over the separate entries you just generated for your column titles (e.g. F is 1,2,3,7). Use an AND statement to then check that each entry in your left-hand column is in your top-row data set, using an OFFSET to your separate area and utilizing your counter as the width for the OFFSET.

Dynamic Programming Implementation of Unique subset generation

I came across this problem where we need to find the total number of ways in which you can form a subset of numbers that sums to n (where n is the user's input). The subset conditions are: the numbers in the subset should be distinct, and the numbers in the subset should be in decreasing order.
Example: Given n = 7, the output is 4 because the possible combinations are (6,1), (5,2), (4,3), (4,2,1). Note that although (4,1,1,1) also adds up to 7, it has repeating numbers, hence it is not a valid subset.
I have solved this problem using the backtracking method, which has exponential complexity. However, I am trying to figure out how we can solve this problem using Dynamic Programming. The nearest I could come up with is the series for n = 0,1,2,3,4,5,6,7,8,9,10, for which the output is 0,0,0,1,1,2,3,4,5,7,9 respectively. However, I am not able to come up with a general formula that I can use to calculate f(n). While searching the internet I came across this kind of integer sequence at https://oeis.org/search?q=0%2C0%2C0%2C1%2C1%2C2%2C3%2C4%2C5%2C7%2C9%2C11%2C14&sort=&language=&go=Search, but I am unable to understand the formula provided there. Any help would be appreciated. Any other insight that improves on the exponential complexity is also appreciated.
What you call "unique subset generation" is better known as integer partitions with distinct parts. MathWorld has an entry on the Q function, which counts the number of partitions with distinct parts.
However, your function does not count trivial partitions (i.e. n -> (n)), so what you're looking for is actually Q(n) - 1, which generates the sequence that you linked in your question, namely the number of partitions into at least 2 distinct parts. This answer to a similar question on Mathematics contains an efficient algorithm in Java for computing the sequence up to n = 200, which can easily be adapted for larger values.
Here's a combinatorial explanation of the referenced algorithm:
Let's start with a table of all the subsets of {1, 2, 3}, grouped by their sums:
index (sum)    partitions    count
0              ()            1
1              (1)           1
2              (2)           1
3              (1+2), (3)    2
4              (1+3)         1
5              (2+3)         1
6              (1+2+3)       1
Suppose we want to construct a new table of all the subsets of {1, 2, 3, 4}. Notice that every subset of {1, 2, 3} is also a subset of {1, 2, 3, 4}, so each subset above will appear in our new table. In fact, we can divide our new table into two equally sized categories: subsets which do not include 4, and those that do. What we can do is start from the table above, copy it over, and then "extend" it with 4. Here's the table for {1, 2, 3, 4}:
index (sum)    partitions         count
0              ()                 1
1              (1)                1
2              (2)                1
3              (1+2), (3)         2
4              (1+3), [4]         2
5              (2+3), [1+4]       2
6              (1+2+3), [2+4]     2
7              [1+2+4], [3+4]     2
8              [1+3+4]            1
9              [2+3+4]            1
10             [1+2+3+4]          1
All the subsets that include 4 are surrounded by square brackets, and they are formed by adding 4 to exactly one of the old subsets which are surrounded by parentheses. We can repeat this process and build the table for {1,2,..,5}, {1,2,..,6}, etc.
But we don't actually need to store the actual subsets/partitions, we just need the counts for each index/sum. For example, if we have the table for {1, 2, 3} with only counts, we can build the count table for {1, 2, 3, 4} by taking each (index, count) pair from the old table and adding count to the current count for index + 4. The idea is that, say, if there are two subsets of {1, 2, 3} that sum to 3, then adding 4 to each of those two subsets will make two new subsets for 7.
With this in mind, here's a Python implementation of this process:
def c(n):
    counts = [1]
    for k in range(1, n + 1):
        new_counts = counts[:] + [0]*k
        for index, count in enumerate(counts):
            new_counts[index + k] += count
        counts = new_counts
    return counts
We start with the table for {}, which has just one subset--the empty set-- which sums to zero. Then we build the table for {1}, then for {1, 2}, ..., all the way up to {1,2,..,n}. Once we're done, we'll have counted every partition for n, since no partition of n can include integers larger than n.
Now, we can make 2 major optimizations to the code:
Limit the table to only include entries up to n, since that's all we're interested in. If index + k exceeds n, then we just ignore it. We can even preallocate space for the final table up front, rather than growing bigger tables on each iteration.
Instead of building a new table from scratch on each iteration, if we carefully iterate over the old table backwards, we can actually update it in-place without messing up any of the new values.
With these optimizations, you effectively have the same algorithm from the Mathematics answer that was referenced earlier, which was motivated by generating functions. It runs in O(n^2) time and takes only O(n) space. Here's what it looks like in Python:
def c(n):
    counts = [1] + [0]*n
    for k in range(1, n + 1):
        for i in reversed(range(k, n + 1)):
            counts[i] += counts[i - k]
    return counts

def Q(n):
    "-> the number of subsets of {1,..,n} that sum to n."
    return c(n)[n]

def f(n):
    "-> the number of subsets of {1,..,n} of size >= 2 that sum to n."
    return Q(n) - 1
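As a quick sanity check (my addition, not part of the original answer), these functions reproduce the sequence from the question:
print(f(7))                        # 4 -> (6,1), (5,2), (4,3), (4,2,1)
print([f(n) for n in range(11)])   # [0, 0, 0, 1, 1, 2, 3, 4, 5, 7, 9]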

Given 2 arrays, return the elements that are not included in both arrays

I had an interview, and one of the questions is described below:
Given two arrays, calculate the result of taking the union and then removing the intersection from it, e.g.
int a[] = {1, 3, 4, 5, 7};
int b[] = {5, 3, 8, 10}; // it wasn't mentioned whether the arrays can contain duplicate values.
result = {1,4,7,8,10}
This is my idea:
Sort a, b.
Check each item of b against a using binary ('dichotomy') search. If it is not found, pass. Otherwise, remove the item from both a and b.
result = elements left in a + elements left in b
I know it is a lousy algorithm, but nonetheless it's better than nothing. Is there a better approach than this one?
There are many approaches to this problem. One approach is:
1. Construct a hash map using the distinct elements of array a, with the elements as keys and 1 as the value.
2. For every element e in array b:
       if e is in the hash map:
           set the value of that key to 0
       else:
           add e to the result array.
3. Add all keys from the hash map whose value is 1 to the result array.
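Here is a small sketch of that hash-map approach in Python (my own illustration of the steps above):
def exclusive_union(a, b):
    seen = {x: 1 for x in a}                 # step 1: keys from a, value 1
    result = []
    for e in b:                              # step 2
        if e in seen:
            seen[e] = 0                      # common element, exclude it later
        else:
            result.append(e)
    result.extend(k for k, v in seen.items() if v == 1)   # step 3
    return result

print(exclusive_union([1, 3, 4, 5, 7], [5, 3, 8, 10]))    # [8, 10, 1, 4, 7]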
Another approach may be:
join both lists
sort the joined list
walk through the joined list and completely remove any elements that occur multiple times
This has one drawback: it does not work if the input lists already contain duplicates. But since we are talking about sets and set theory, I would also expect the inputs to be sets in the mathematical sense.
Another (in my opinion the best) approach:
You do not need to search through both lists; you can just iterate through them sequentially:
sort a and b
declare an empty result set
take iterators to both lists and repeat the following steps:
    if the iterators' values are unequal: add the smaller value to the result set and advance that iterator
    if the iterators' values are equal: advance both iterators without adding anything to the result set
    if one iterator reaches the end: add all remaining elements of the other list to the result
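A short Python sketch of this merge-style walk (my own illustration; it assumes each input array has no duplicates of its own):
def exclusive_elements(a, b):
    a, b = sorted(a), sorted(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:                  # unequal: keep the smaller value
            result.append(a[i]); i += 1
        elif a[i] > b[j]:
            result.append(b[j]); j += 1
        else:                            # equal: skip it in both arrays
            i += 1; j += 1
    result.extend(a[i:])                 # one array is exhausted,
    result.extend(b[j:])                 # keep whatever is left of the other
    return result

print(exclusive_elements([1, 3, 4, 5, 7], [5, 3, 8, 10]))  # [1, 4, 7, 8, 10]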

Disperse Duplicates in an Array

Source: Google Interview Question
Write a routine to ensure that identical elements in the input are maximally spread in the output.
Basically, we need to place identical elements in such a way that the TOTAL spread is as large as possible.
Example:
Input: {1,1,2,3,2,3}
Possible Output: {1,2,3,1,2,3}
Total dispersion = difference between the positions of the 1's + the 2's + the 3's = (4-1) + (5-2) + (6-3) = 9.
I am NOT AT ALL sure whether there is an optimal polynomial-time algorithm available for this. Also, no other detail is provided for the question other than this.
What I thought of is: calculate the frequency of each element in the input, then arrange them in the output, one distinct element at a time, until all the frequencies are exhausted.
I am not sure about my approach.
Any approaches/ideas, people?
I believe this simple algorithm would work:
count the number of occurrences of each distinct element.
make a new list
add one instance of all elements that occur more than once to the list (order within each group does not matter)
add one instance of all unique elements to the list
add one instance of all elements that occur more than once to the list
add one instance of all elements that occur more than twice to the list
add one instance of all elements that occur more than thrice to the list
...
Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
However, I think this is the best spread you can get given the scoring function provided.
Since the dispersion score counts the sum of the distances instead of the sum of the squared distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.
For a sum-of-squared-distances score, the problem becomes harder.
Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?
In Perl:
@a = (9,9,9,2,2,2,1,1,1);
then make a hash table of the counts of different numbers in the list, like a frequency table
map { $x{$_}++ } @a;
then repeatedly walk through all the keys found, with the keys in a known order and add the appropriate number of individual numbers to an output list until all the keys are exhausted
@r = ();
$g = 1;
while ( $g == 1 ) {
    $g = 0;
    for my $n (sort keys %x) {
        if ( $x{$n} > 0 ) {   # still copies of $n left to place
            push @r, $n;
            $x{$n}--;
            $g = 1;
        }
    }
}
I'm sure that this could be adapted to any programming language that supports hash tables
Python code for the algorithm suggested by Vorsprung and HugoRune:
from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter()
    for i in data:
        cnt[i] += 1
    res, num = [], list(cnt)
    while len(cnt) > 0:
        for i in num:
            if cnt[i] > 0:
                res.append(i)
                cnt[i] -= 1
                if cnt[i] == 0:
                    del cnt[i]
    return res

def calc_spread(data):
    d = defaultdict(list)
    for i, v in enumerate(data):
        d[v].append(i)
    return sum(max(x) - min(x) for x in d.values())
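For example (my addition), on the input from the question:
data = [1, 1, 2, 3, 2, 3]
out = max_spread(data)
print(out, calc_spread(out))   # [1, 2, 3, 1, 2, 3] 9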
HugoRune's answer takes some advantage of the unusual scoring function, but we can actually do even better: suppose there are d distinct non-unique values; then the only thing required for a solution to be optimal is that the first d values in the output consist of these values in any order, and likewise the last d values in the output consist of these values in any (i.e. possibly a different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)
The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.
Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged! This holds for the d elements at both ends of the output array.
When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).
This leads to a simple O(n log n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X:
Sort the input array X.
Use a single pass through X to count the number of distinct non-unique elements. Call this d.
Set i, j and k to 0.
While i < n:
    If X[i+1] == X[i], we have a non-unique element:
        Set Y[j] = Y[n-j-1] = X[i].
        Increment i twice, and increment j once.
        While X[i] == X[i-1]:
            Set Y[d+k] = X[i].
            Increment i and k.
    Otherwise we have a unique element:
        Set Y[d+k] = X[i].
        Increment i and k.
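Here is a Python sketch of those steps (my own transcription of the pseudocode above, with bounds checks added):
from collections import Counter

def disperse(X):
    X = sorted(X)
    n = len(X)
    d = sum(1 for v in Counter(X).values() if v > 1)   # distinct non-unique values
    Y = [None] * n
    i = j = k = 0
    while i < n:
        if i + 1 < n and X[i + 1] == X[i]:
            # Non-unique element: first copy goes to the front block,
            # last copy to the mirrored position in the back block.
            Y[j] = Y[n - j - 1] = X[i]
            i += 2
            j += 1
            while i < n and X[i] == X[i - 1]:
                Y[d + k] = X[i]            # middle copies go to the central part
                i += 1
                k += 1
        else:
            Y[d + k] = X[i]                # unique element goes to the central part
            i += 1
            k += 1
    return Y

print(disperse([1, 1, 2, 3, 2, 3]))        # [1, 2, 3, 3, 2, 1], dispersion 9
print(disperse([1, 1, 1, 1, 2, 3, 4]))     # [1, 1, 1, 2, 3, 4, 1], dispersion 6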

Resources