Partitioning a superset and getting the list of original sets for each partition - algorithm

Introduction
While trying to do some categorization of nodes in a graph (which will be rendered differently), I find myself confronted with the following problem:
The Problem
Given a superset of elements S = {0, 1, ... M} and a number n of non-disjoint subsets T_i thereof, with 0 <= i < n, what is the best algorithm to find out the partition of the set S called P?
P = S is the union of all disjoint partitions P_j of the original superset S, with 0 <= j < M, such that for all elements x in P_j, every x has the same list of "parents" among the "original" sets T_i.
Example
S = [1, 2, 3, 4, 5, 6, 8, 9]
T_1 = [1, 4]
T_2 = [2, 3]
T_3 = [1, 3, 4]
So all P_js would be:
P_1 = [1, 4] # all elements x have the same list of "parents": T_1, T_3
P_2 = [2] # all elements x have the same list of "parents": T_2
P_3 = [3] # all elements x have the same list of "parents": T_2, T_3
P_4 = [5, 6, 8, 9] # all elements x have the same list of "parents": S (so they're not in any of the T_i)
Questions
What are good functions/classes in the python packages to compute all P_js and the list of their "parents", ideally restricted to numpy and scipy? Perhaps there's already a function which does just that
What is the best algorithm to find those partitions P_js and for each one, the list of "parents"? Let's note T_0 = S
I think the brute force approach would be to generate all 2-combinations of T sets and split them in at most 3 disjoint sets, which would be added back to the pool of T sets and then repeat the process until all resulting Ts are disjoint, and thus we've arrived at our answer - the set of P sets. A little problematic could be caching all the "parents" on the way there.
I suspect a dynamic programming approach could be used to optimize the algorithm.
Note: I would have loved to write the math parts in latex (via MathJax), but unfortunately this is not activated :-(

The following should be linear time (in the number of the elements in the Ts).
from collections import defaultdict

S = [1, 2, 3, 4, 5, 6, 8, 9]
T_1 = [1, 4]
T_2 = [2, 3]
T_3 = [1, 3, 4]
Ts = [S, T_1, T_2, T_3]

# Encode the parents of each element as a bit mask: bit i is set if the element is in Ts[i].
parents = defaultdict(int)
for i, T in enumerate(Ts):
    for elem in T:
        parents[elem] += 2 ** i

# Elements with the same bit mask belong to the same partition.
children = defaultdict(list)
for elem, p in parents.items():
    children[p].append(elem)

print(list(children.values()))
Result (the order of the groups may vary with the Python version):
[[5, 6, 8, 9], [1, 4], [2], [3]]

The way I'd do this is to construct an M × n boolean array In where In(i, j) = (S_i ∈ T_j). You can construct that in O(Σj|Tj|), provided you can map an element of S onto its integer index in O(1), by scanning all of the sets T and marking the corresponding bit in In.
You can then read the "signature" of each element i directly from In by concatenating row i into a binary number of n bits. The signature is precisely the equivalence relationship of the partition you are seeking.
By the way, I'm in total agreement with you about Math markup. Perhaps it's time to mount a new campaign.
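The signature idea above could be sketched as follows in plain Python. This is only an illustration under stated assumptions: the function name `partition_by_signature` is hypothetical, the boolean array is a list of lists rather than a numpy array, and each partition is keyed by the 0-based indices of its parent sets in `Ts`.

```python
def partition_by_signature(S, Ts):
    """Group the elements of S by which of the sets in Ts contain them."""
    # Map each element to its row index so a membership bit can be set in O(1).
    index = {elem: row for row, elem in enumerate(S)}
    # member[i][j] is True when S[i] is in Ts[j].
    member = [[False] * len(Ts) for _ in S]
    for j, T in enumerate(Ts):
        for elem in T:
            member[index[elem]][j] = True
    # Elements whose rows are identical share a signature, hence a partition.
    groups = {}
    for elem in S:
        signature = tuple(member[index[elem]])
        groups.setdefault(signature, []).append(elem)
    # Report each partition under the indices of its parent sets.
    return {
        tuple(j for j, flag in enumerate(signature) if flag): elems
        for signature, elems in groups.items()
    }


S = [1, 2, 3, 4, 5, 6, 8, 9]
Ts = [[1, 4], [2, 3], [1, 3, 4]]
for parent_ids, elems in partition_by_signature(S, Ts).items():
    print(parent_ids, elems)
```

An empty key means the elements belong to no T_i, i.e. their only parent is S itself.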

Related

Ruby: Divide array into 2 arrays with the closest possible average

Background: I'm working on a "matchmaking system" for a small multiplayer video game side project. Every player has a rank from 0-10, every team has 4 players. I'm trying to find a good way to balance out the teams so that the average rank of both of them is as close as possible and the match is as fair as possible.
My current, flawed approach looks like this:
def create_teams(players)
  teams = Hash.new{|hash, team| hash[team] = []}
  players.sort_by(&:rank).each_slice(2) do |slice|
    teams[:team1] << slice[0]
    teams[:team2] << slice[1]
  end
  teams
end
This works decently well if the ranks are already pretty similar but it's not a proper solution to this problem.
For example, it fails in a situation like this:
require "ostruct"

class Array
  def avg
    sum.fdiv(size)
  end
end

dummy_players = [9, 5, 5, 3, 3, 3, 2, 0].map{|rank| OpenStruct.new(rank: rank)}
teams = create_teams(dummy_players)
teams.each do |team, players|
  ranks = players.map(&:rank)
  puts "#{team} - ranks: #{ranks.inspect}, avg: #{ranks.avg}"
end
This results in pretty unfair teams:
team1 - ranks: [0, 3, 3, 5], avg: 2.75
team2 - ranks: [2, 3, 5, 9], avg: 4.75
Instead, I'd like the teams in this situation to be like this:
team1 - ranks: [0, 3, 3, 9], avg: 3.75
team2 - ranks: [2, 3, 5, 5], avg: 3.75
If there are n players, where n is an even number, there are
C(n) = n!/((n/2)!(n/2)!)
ways to partition the n players into two teams of n/2 players, where n! equals n-factorial. This is often expressed as the number of ways of choosing n/2 items from a collection of n items.
To obtain the partition that has a minimum absolute difference in total ranks (and hence, in mean ranks), one would have to enumerate all C(n) partitions. If n = 8, as in this example, C(8) = 70 (see, for example, this online calculator). If, however, n = 16, then C(16) = 12,870 and C(32) = 601,080,390. This gives you an idea of how small n must be in order to perform a complete enumeration.
If n is too large to enumerate all combinations you must resort to using a heuristic, or a subjective rule for partitioning the array of ranks. Here are two possibilities:
assign the highest rank element ("rank 1") to team A, assign elements with ranks 2 and 3 to team B, assign elements with ranks 4 and 5 to team A, and so on.
assign elements with ranks 1 and n to team A, elements with ranks 2 and n-1 to team B, and so on.
The trouble with heuristics is evaluating their effectiveness. For this problem, for every heuristic you devise there is an array of ranks for which the heuristic's performance is abysmal. If you know the universe of possible arrays of ranks and have a way of drawing unbiased samples you can evaluate the heuristic statistically. That generally is not possible, however.
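The two heuristics listed above could look roughly like this in Python. This is a sketch, not a recommendation: the function names `snake_split` and `fold_split` are hypothetical, players are represented only by their ranks, and `fold_split` assumes the number of players is a multiple of 4 so that both teams end up the same size.

```python
def snake_split(ranks):
    """First heuristic: assign in the pattern A, B, B, A, A, B, B, A, ..."""
    ordered = sorted(ranks, reverse=True)
    team_a, team_b = [], []
    for pos, rank in enumerate(ordered):
        # Positions 0, 3, 4, 7, 8, ... go to team A, the rest to team B.
        (team_a if pos % 4 in (0, 3) else team_b).append(rank)
    return team_a, team_b


def fold_split(ranks):
    """Second heuristic: pair best with worst, alternating pairs between teams."""
    ordered = sorted(ranks, reverse=True)
    n = len(ordered)
    team_a, team_b = [], []
    for k in range(n // 2):
        pair = [ordered[k], ordered[n - 1 - k]]
        (team_a if k % 2 == 0 else team_b).extend(pair)
    return team_a, team_b


ranks = [9, 5, 5, 3, 3, 3, 2, 0]
print(snake_split(ranks))  # ([9, 3, 3, 0], [5, 5, 3, 2]) -> sums 15 and 15
print(fold_split(ranks))   # ([9, 0, 5, 3], [5, 2, 3, 3]) -> sums 17 and 13
```

On this particular input the first heuristic happens to find a perfectly balanced split while the second does not, which illustrates why neither can be trusted in general.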
Here is how you could examine all partitions. Suppose:
ranks = [3, 3, 0, 2, 5, 9, 3, 5]
Then we may perform the following calculations.
indices = ranks.size.times.to_a
#=> [0, 1, 2, 3, 4, 5, 6, 7]
team_a = indices.combination(ranks.size/2).min_by do |combo|
  team_b = indices - combo
  (combo.sum { |i| ranks[i] } - team_b.sum { |i| ranks[i] }).abs
end
#=> [0, 1, 2, 5]
team_b = indices - team_a
#=> [3, 4, 6, 7]
See Array#combination and Enumerable#min_by.
We see that team A players have ranks:
arr = ranks.values_at(*team_a)
#=> [3, 3, 0, 9]
and the sum of those ranks is:
arr.sum
#=> 15
Similarly, for team B:
arr = ranks.values_at(*team_b)
#=> [2, 5, 3, 5]
arr.sum
#=> 15
See Array#values_at.
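For comparison, here is a Python sketch of the same exhaustive search. It adds one refinement that is not in the Ruby code above: player 0 is always placed on team A, which halves the number of combinations examined (35 instead of 70 for n = 8) because swapping the two teams gives the same partition. The function name `best_split` is an assumption made for this example.

```python
from itertools import combinations


def best_split(ranks):
    """Return index tuples (team_a, team_b) minimising the rank-sum difference."""
    n = len(ranks)
    total = sum(ranks)
    # Fix index 0 on team A and choose the other n/2 - 1 members.
    candidates = ((0,) + rest for rest in combinations(range(1, n), n // 2 - 1))
    # |sum(A) - sum(B)| equals |total - 2 * sum(A)|.
    team_a = min(
        candidates,
        key=lambda team: abs(total - 2 * sum(ranks[i] for i in team)),
    )
    team_b = tuple(i for i in range(n) if i not in team_a)
    return team_a, team_b


ranks = [3, 3, 0, 2, 5, 9, 3, 5]
team_a, team_b = best_split(ranks)
print(team_a, [ranks[i] for i in team_a])  # (0, 1, 2, 5) [3, 3, 0, 9]
print(team_b, [ranks[i] for i in team_b])  # (3, 4, 6, 7) [2, 5, 3, 5]
```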

Algorithm to generate Diagonal Latin Square matrix

For a given N, I need to create an N*N matrix that has no repetitions in its rows, columns, or minor and major diagonals, with values 1, 2, 3, ..., N.
For N = 4 one of matrices is the following:
1 2 3 4
3 4 1 2
4 3 2 1
2 1 4 3
Problem overview
The mathematical structure you described is a Diagonal Latin Square. Constructing one is more of a mathematical problem than an algorithmic or programming one.
To understand what it is and how to create one, you should read the following articles:
Latin squares definition
Magic squares definition
Diagonal Latin square construction <-- p.2 is answer to your question with proof and with other interesting properties
Short answer
One of the possible ways to construct Diagonal Latin Square:
Let N be the order of the required matrix L.
If there exist numbers A and B in the range [0; N-1] which satisfy these properties:
A is relatively prime to N
B is relatively prime to N
(A + B) is relatively prime to N
(A - B) is relatively prime to N
Then you can create required matrix with the following rule:
L[i][j] = (A * i + B * j) mod N
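A minimal Python sketch of this rule follows. Two caveats apply. The formula yields values 0..N-1, so the sketch adds 1 to each entry to get 1..N. Also, suitable A and B exist only when N is divisible by neither 2 nor 3: if N is even, A and B must both be odd and A + B is even, and if 3 divides N, one of A + B or A - B is a multiple of 3. The rule therefore cannot produce the N = 4 example. The function names are hypothetical.

```python
from math import gcd


def linear_diagonal_latin_square(n):
    """Build L[i][j] = (A*i + B*j) mod n + 1, or return None if no A, B qualify."""
    for a in range(n):
        for b in range(n):
            # A, B, A + B and A - B must all be relatively prime to n.
            if all(gcd(x % n, n) == 1 for x in (a, b, a + b, a - b)):
                return [[(a * i + b * j) % n + 1 for j in range(n)] for i in range(n)]
    return None


def is_diagonal_latin(square):
    """Check that rows, columns and both diagonals each contain 1..n exactly once."""
    n = len(square)
    full = set(range(1, n + 1))
    rows_ok = all(set(row) == full for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == full for j in range(n))
    main_ok = {square[i][i] for i in range(n)} == full
    anti_ok = {square[i][n - 1 - i] for i in range(n)} == full
    return rows_ok and cols_ok and main_ok and anti_ok


square = linear_diagonal_latin_square(5)  # uses A = 1, B = 2
for row in square:
    print(row)
print(is_diagonal_latin(square))
print(linear_diagonal_latin_square(4))  # None: no suitable A and B exist
```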
It would be nice to do this mathematically, but I'll propose the simplest algorithm that I can think of - brute force.
At a high level
we can represent a matrix as an array of arrays
for a given N, construct S, a set of arrays which contains every permutation of [1..N]. There will be N! of these.
using a recursive and iterative selection process (e.g. a search tree), search through all orders of these arrays until one of the 'uniqueness' rules is broken
For example, in your N = 4 problem, I'd construct
S = [
  [1,2,3,4], [1,2,4,3],
  [1,3,2,4], [1,3,4,2],
  [1,4,2,3], [1,4,3,2],
  [2,1,3,4], [2,1,4,3],
  [2,3,1,4], [2,3,4,1],
  [2,4,1,3], [2,4,3,1],
  [3,1,2,4], [3,1,4,2],
  // etc
]
R = new int[4][4]
Then the algorithm is something like
1. If R is 'full', you're done.
2. Evaluate whether the next row from S fits into R:
   if yes, insert it into R, reset the iterator on S, and go to 1;
   if no, increment the iterator on S.
3. If there are more rows to check in S, go to 2.
4. Else you've iterated across S and none of the rows fit, so remove the most recent row added to R and go to 1. In other words, explore another branch.
To improve the efficiency of this algorithm, implement a better data structure. Rather than a flat array of all combinations, use a prefix tree / Trie of some sort to both reduce the storage size of the 'options' and reduce the search area within each iteration.
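The row-by-row search described above might be sketched like this in Python. It uses the flat list of permutations rather than a trie, and recursion in place of the explicit "go to" steps. The names `row_fits` and `solve_by_rows` are hypothetical. With N! candidate rows per level, this is practical only for small N.

```python
from itertools import permutations


def row_fits(rows, cand):
    """Return True if cand can be appended below rows without breaking a rule."""
    i = len(rows)  # index of the row being added
    n = len(cand)
    for k, prev in enumerate(rows):
        # Column rule: no value may repeat within a column.
        if any(prev[j] == cand[j] for j in range(n)):
            return False
        # Major diagonal rule: cell (i, i) against cell (k, k).
        if prev[k] == cand[i]:
            return False
        # Minor diagonal rule: cell (i, n-1-i) against cell (k, n-1-k).
        if prev[n - 1 - k] == cand[n - 1 - i]:
            return False
    return True


def solve_by_rows(n):
    """Depth-first search over permutations of 1..n, one row at a time."""
    perms = list(permutations(range(1, n + 1)))
    rows = []

    def extend():
        if len(rows) == n:
            return True
        for cand in perms:
            if row_fits(rows, cand):
                rows.append(cand)
                if extend():
                    return True
                rows.pop()  # dead end: explore another branch
        return False

    return [list(r) for r in rows] if extend() else None


for row in solve_by_rows(4):
    print(row)
print(solve_by_rows(3))  # None: no diagonal Latin square of order 3 exists
```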
Here's a method which is fast for N <= 9 : (python)
import random

def generate(n):
    a = [[0] * n for _ in range(n)]

    def rec(i, j):
        if i == n - 1 and j == n:
            return True
        if j == n:
            return rec(i + 1, 0)
        candidate = set(range(1, n + 1))
        for k in range(i):
            candidate.discard(a[k][j])
        for k in range(j):
            candidate.discard(a[i][k])
        if i == j:
            for k in range(i):
                candidate.discard(a[k][k])
        if i + j == n - 1:
            for k in range(i):
                candidate.discard(a[k][n - 1 - k])
        candidate_list = list(candidate)
        random.shuffle(candidate_list)
        for e in candidate_list:
            a[i][j] = e
            if rec(i, j + 1):
                return True
            a[i][j] = 0
        return False

    rec(0, 0)
    return a

for row in generate(9):
    print(row)
Output:
[8, 5, 4, 7, 1, 6, 2, 9, 3]
[2, 7, 5, 8, 4, 1, 3, 6, 9]
[9, 1, 2, 3, 6, 4, 8, 7, 5]
[3, 9, 7, 6, 2, 5, 1, 4, 8]
[5, 8, 3, 1, 9, 7, 6, 2, 4]
[4, 6, 9, 2, 8, 3, 5, 1, 7]
[6, 3, 1, 5, 7, 9, 4, 8, 2]
[1, 4, 8, 9, 3, 2, 7, 5, 6]
[7, 2, 6, 4, 5, 8, 9, 3, 1]

N-fold partition of an array with equal sum in each partition

Given an array of integers a and two numbers N and M, return N groups of integers from a such that each group sums to M.
For example, say:
a = [1,2,3,4,5]
N = 2
M = 5
Then the algorithm could return [2, 3], [1, 4] or [5], [2, 3] or possibly others.
What algorithms could I use here?
Edit:
I wasn't aware that this problem is NP complete. So maybe it would help if I provided more details on my specific scenario:
So I'm trying to create a "match-up" application. Given the number of teams N and the number of players per team M, the application listens for client requests. Each client request will give a number of players that the client represents. So if I need 2 teams of 5 players, then if 5 clients send requests, each representing 1, 2, 3, 4, 5 players respectively, then my application should generate a match-up between clients [1, 4] and clients [2, 3]. It could also generate a match-up between [1, 4] and [5]; I don't really care.
One implication is that any client representing more than M or less than 0 players is invalid. Hope this could simplify the problem.
This appears to be a variation of the subset sum problem. As this problem is NP-complete, no efficient algorithm is known without further constraints.
Note that it is already hard to find a single subset of the original set whose elements sum to M.
People give up too easily on NP-complete problems. Just because a problem is NP complete doesn't mean that there aren't more and less efficient algorithms in the general case. That is you can't guarantee that for all inputs there is an answer that can be computed faster than a brute force search, but for many problems you can certainly have methods that are faster than the full search for most inputs.
For this problem there are certainly 'perverse' sets of numbers that will result in worst case search times, because there may be say a large vector of integers, but only one solution and you have to end up trying a very large number of combinations.
But for non-perverse sets, there are probably many solutions, and an efficient way of 'tripping over' a good partitioning will run much faster than the worst case.
How you solve this will depend a lot on what you expect to be the more common parameters. It also makes a difference if the integers are all positive, or if negatives are allowed.
In this case I'll assume that:
N is small relative to the length of the vector
All integers are positive.
Integers cannot be re-used.
Algorithm:
Sort the vector, v.
Eliminate elements bigger than M. They can't be part of any solution.
Add up all remaining numbers in v, divide by N. If the result is smaller than M, there is no solution.
Create a new array w, same size as v. For each w[i], sum all the numbers in v[i+1 .. end].
So if v was 5 4 3 2 1, w would be 10, 6, 3, 1, 0.
While you have not found enough sets:
Choose the largest number, x. If it is equal to M, emit a solution set with just x, remove it from the vector, and remove the first element from w.
Still not enough sets? (Likely.) Then again, while you have not found enough sets:
A solution theory is ([a,b,c], R), where [a,b,c] is a partial set of elements of v and R is the remainder, R = M - sum[a,b,c]. Extending a theory means adding a number to the partial set and subtracting that number from R. As you extend the theories, if R == 0, that is a possible solution.
Recursively create theories like so: loop over the elements of v, creating a theory ([v[i]], R) for each v[i], and then recursively extend each theory using only part of v. Binary search into v to find the first element equal to or smaller than R, v[j]. Start with v[j] and extend each theory with the elements of v from j onwards until R > w[k].
The numbers from v[j] to v[k] are the only numbers that can be used to extend a theory and still get R to 0. Numbers larger than v[j] would make R negative. Beyond v[k], there are not enough numbers left in the array to get R to 0, even if you used them all.
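A simplified Python sketch of this search is shown below. It is not a literal transcription of the steps above: it keeps the descending sort, the removal of oversized elements, the total-sum check, and the suffix-sum pruning, but it finds the first usable element with a linear scan instead of a binary search and tracks used elements with flags. The function name `find_groups` is hypothetical, and all integers are assumed to be positive.

```python
def find_groups(values, n_groups, target):
    """Return n_groups disjoint groups from values, each summing to target, or None."""
    # Keep only usable elements, largest first.
    v = sorted((x for x in values if 0 < x <= target), reverse=True)
    if sum(v) < n_groups * target:
        return None  # not enough material for n_groups groups
    # suffix[i] is the sum of v[i:], an upper bound on what is still reachable.
    suffix = [0] * (len(v) + 1)
    for i in range(len(v) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + v[i]
    used = [False] * len(v)
    groups = []

    def fill(start, remaining, current):
        if remaining == 0:
            # The current group is complete; try to build the next one.
            groups.append(list(current))
            if len(groups) == n_groups or fill(0, target, []):
                return True
            groups.pop()
            return False
        for i in range(start, len(v)):
            if suffix[i] < remaining:
                break  # even taking everything left cannot reach the target
            if used[i] or v[i] > remaining:
                continue
            used[i] = True
            current.append(v[i])
            if fill(i + 1, remaining - v[i], current):
                return True
            current.pop()
            used[i] = False
        return False

    return groups if fill(0, target, []) else None


print(find_groups([1, 2, 3, 4, 5], 2, 5))  # [[5], [4, 1]]
```

The worst case is still exponential, as expected for an NP-complete problem.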
Here is my own Python solution that uses dynamic programming. The algorithm is given here.
def get_subset(lst, s):
    '''Given a list of integers `lst` and an integer s, returns
    a subset of lst that sums to s, as well as lst minus that subset
    '''
    q = {}
    for i in range(len(lst)):
        for j in range(1, s+1):
            if lst[i] == j:
                q[(i, j)] = (True, [j])
            elif i >= 1 and q[(i-1, j)][0]:
                q[(i, j)] = (True, q[(i-1, j)][1])
            elif i >= 1 and j >= lst[i] and q[(i-1, j-lst[i])][0]:
                q[(i, j)] = (True, q[(i-1, j-lst[i])][1] + [lst[i]])
            else:
                q[(i, j)] = (False, [])
        if q[(i, s)][0]:
            for k in q[(i, s)][1]:
                lst.remove(k)
            return q[(i, s)][1], lst
    return None, lst

def get_n_subset(n, lst, s):
    ''' Returns n subsets of lst, each of which sums to s'''
    solutions = []
    for i in range(n):
        sol, lst = get_subset(lst, s)
        solutions.append(sol)
    return solutions, lst

# print(get_n_subset(7, [1, 2, 3, 4, 5, 7, 8, 4, 1, 2, 3, 1, 1, 1, 2], 5))
# [stdout]: ([[2, 3], [1, 4], [5], [4, 1], [2, 3], [1, 1, 1, 2], None], [7, 8])

Algorithm to find a sequence of sub-sequences

Consider the sequence [1,2,3]. It has the following 6 different sub-sequences: [1], [2], [3], [1,2], [2,3] and [1,2,3].
Note! The length of the initial sequence may be up to 100 digits.
Please help me. How can I generate these sub-sequences?
I would love to research this kind of algorithm further. Please tell me the name of this type of algorithm.
Here is a c code to print all sub sequences. Algorithm uses nested loops.
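#include <stdio.h>

/* Print every contiguous sub-sequence A[j..i] of the first n elements. */
void seq_print(int A[], int n)
{
    int k;
    for (int i = 0; i <= n - 1; i++)
    {
        for (int j = 0; j <= i; j++)
        {
            k = j;
            while (k <= i)
            {
                printf("%d", A[k]);
                k++;
            }
            printf("\n");
        }
    }
}

int main(void)
{
    int A[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
    int n = 10;
    seq_print(A, n);
    return 0;
}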
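The same nested-loop idea can be sketched as a Python generator, which yields the contiguous sub-sequences one at a time instead of printing them. The function name `contiguous_subsequences` is an assumption made for this example. A sequence of length n has n(n+1)/2 such sub-sequences, so a length of 100 gives 5050 of them.

```python
def contiguous_subsequences(seq):
    """Yield every contiguous slice seq[start:end+1], grouped by end index."""
    n = len(seq)
    for end in range(n):
        for start in range(end + 1):
            yield seq[start:end + 1]


print(list(contiguous_subsequences([1, 2, 3])))
# [[1], [1, 2], [2], [1, 2, 3], [2, 3], [3]]
print(sum(1 for _ in contiguous_subsequences(list(range(100)))))  # 5050
```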
Your problem can be reduced to the combination problem. There are already many solutions on Stack Overflow. You can check this; it may be useful for you.
It is called a power set (in your case the empty set is excluded).
To build a power set, start with a set with an empty set in it; then
for each item in the input set extend the power set with all its subsets accumulated so far
with the current item included (in Python):
def powerset(lst):
    S = [[]]
    for item in lst:
        S += [subset + [item] for subset in S]
    return S
Example:
print(powerset([1, 2, 3]))
# -> [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
To avoid producing all subsets at once, a recursive definition could be used:
a power set of an empty set is a set with an empty set in it
a power set of a set with n items contains all subsets from a power set
of a set with n - 1 items plus all these subsets with the n-th item included.
def ipowerset(lst):
    if not lst: # empty list
        yield []
    else:
        item, *rest = lst
        for subset in ipowerset(rest):
            yield subset
            yield [item] + subset
Example:
print(list(ipowerset([1, 2, 3])))
# -> [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
Yet another way to generate a power set is to generate r-length subsequences (combinations) for all r from zero to the size of the input set (itertools recipe):
from itertools import chain, combinations
def powerset_comb(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
Example:
print(list(powerset_comb([1, 2, 3])))
# -> [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
See also what's a good way to combinate through a set?.

Find the middle element in merged arrays in O(logn)

We have two sorted arrays of the same size n. Let's call the array a and b.
How can we find the middle element of the sorted array obtained by merging a and b?
Example:
n = 4
a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
merged = [1, 2, 3, 3, 4, 4, 5, 6]
mid_element = merged[(0 + merged.length - 1) / 2] = merged[3] = 3
More complicated cases:
Case 1:
a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
Case 2:
a = [1, 2, 3, 4, 8]
b = [3, 4, 5, 6, 7]
Case 3:
a = [1, 2, 3, 4, 8]
b = [0, 4, 5, 6, 7]
Case 4:
a = [1, 3, 5, 7]
b = [2, 4, 6, 8]
Time required: O(log n). Any ideas?
Look at the middle of both the arrays. Let's say one value is smaller and the other is bigger.
Discard the lower half of the array with the smaller value. Discard the upper half of the array with the higher value. Now we are left with half of what we started with.
Rinse and repeat until only one element is left in each array. Return the smaller of those two.
If the two middle values are the same, then pick arbitrarily.
Credits: Bill Li's blog
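One way to make the halving idea precise is sketched below in Python. It is a swapped-in formulation rather than a literal transcription of the steps above: it binary-searches for how many elements of a belong to the lower half of the merged array, which avoids the off-by-one pitfalls of discarding halves by hand. The function name `lower_median` is hypothetical, and "middle element" is taken to mean merged[n - 1], as in the question's example.

```python
def lower_median(a, b):
    """Return merged[n - 1] for two sorted lists of equal length n in O(log n)."""
    n = len(a)
    if n == 0 or len(b) != n:
        raise ValueError("expected two non-empty lists of equal length")
    inf = float("inf")
    lo, hi = 0, n
    while lo <= hi:
        i = (lo + hi) // 2  # elements taken from a for the lower half
        j = n - i           # elements taken from b for the lower half
        a_left = a[i - 1] if i > 0 else -inf
        a_right = a[i] if i < n else inf
        b_left = b[j - 1] if j > 0 else -inf
        b_right = b[j] if j < n else inf
        if a_left > b_right:
            hi = i - 1  # took too many elements from a
        elif b_left > a_right:
            lo = i + 1  # took too few elements from a
        else:
            # Valid split: the largest element of the lower half is the answer.
            return max(a_left, b_left)
    raise AssertionError("unreachable for sorted inputs")


print(lower_median([1, 2, 3, 4], [3, 4, 5, 6]))  # 3
```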
Quite an interesting task. I'm not sure about O(log n), but an O((log n)^2) solution is obvious to me.
If you know the position of some element in the first array, then you can find how many elements in both arrays are smaller than this value: you already know how many smaller elements are in the first array, and you can find the count of smaller elements in the second array using binary search, so just sum up these two numbers. If the number of smaller elements in both arrays is less than N, you should look in the upper half of the first array; otherwise you should move to the lower half. So you get a general binary search with an internal binary search. Overall complexity will be O((log n)^2).
Note: if you do not find the median in the first array, then start the same search in the second array. This has no impact on the complexity.
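The nested binary search could be sketched as follows in Python, using the standard `bisect` module for the inner search. The helper names are hypothetical. To cope with duplicate values, the sketch computes the lowest and highest merged index that a candidate element could occupy and accepts it when the target index n - 1 falls in that range.

```python
from bisect import bisect_left, bisect_right


def _search(x, y, k):
    """Look for an element of x that can sit at index k of the merged array."""
    lo, hi = 0, len(x) - 1
    while lo <= hi:
        p = (lo + hi) // 2
        # x[p] is preceded by p elements of x plus the smaller elements of y.
        low_rank = p + bisect_left(y, x[p])
        high_rank = p + bisect_right(y, x[p])
        if high_rank < k:
            lo = p + 1  # too few smaller elements: look in the upper half
        elif low_rank > k:
            hi = p - 1  # too many smaller elements: look in the lower half
        else:
            return x[p]
    return None


def lower_median_nested(a, b):
    """Return merged[n - 1] for two sorted lists of equal length n."""
    k = len(a) - 1
    found = _search(a, b, k)
    # If the element is not in the first array, it must be in the second.
    return found if found is not None else _search(b, a, k)


print(lower_median_nested([1, 2, 3, 4], [3, 4, 5, 6]))  # 3
```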
So, having
n = 4 and a = [1, 2, 3, 4] and b = [3, 4, 5, 6]
You know the position k of the wanted element in the result array in advance: it depends only on n and is equal to n.
The n-th element of the result could be in the first array or in the second.
Let's first assume that the element is in the first array. Then
do a binary search, taking the middle element of [l, r]; at the beginning l = 0, r = 3.
Taking the middle element, you know how many elements of the same array are smaller, which is middle - 1.
Knowing that middle - 1 elements are smaller and that you need the n-th element, look at the [n - (middle - 1)]-th element of the second array, which may be smaller or greater. If it is greater and the element before it is smaller, the middle element is what you need; if it is greater and the element before it is also greater, set l = middle; if it is smaller, set r = middle.
Then do the same for the second array in case you did not find a solution in the first.
In total, log(n) + log(n).

Resources