How can I write this combinatorics algorithm more efficiently?

A group contains a set of entities and each entity has a value.
Each entity can be a part of more than one group.
Problem: Find the N groups with the largest combined total value, where each entity appears no more than once in the result. An entity can be excluded from a group if necessary.
Example:
Entities with values:
A = 2
B = 2
C = 2
D = 3
E = 3
Groups
1: (A,B,C) total value: 2+2+2 = 6
2: (B,D) total value: 2 + 3 = 5
3: (C,E) total value: 2 + 3 = 5
4: (D) total value: 3
5: (E) total value: 3
**Answers**:
Largest 1 group is obviously (A,B,C) with total value 6
Largest 2 groups are (B,D), (C,E) with total value 10
Largest 3 groups are either {(A,B,C),(D),(E)}, {(A,B),(C,E),(D)} or {(A,C), (B,D), (E)} with total value 12
The input data to the algorithm should be:
A set of entities with values
Groups containing one or more of the entities
The number of groups (N) to include in the result
If there are multiple answers then finding one of them is sufficient.
I included the example to try to make the problem clear. The number of entities in practice should be less than about 50, and the number of groups should be less than the number of entities. The number N of groups to find will be between 1 and 10.
I am currently solving this problem by generating all possible combinations of N groups, excluding the results that contain duplicate entities, and then picking the combination with the largest total value. This is of course extremely inefficient, but I can't get my head around how to obtain a general result in a more efficient way.
My question is if it's possible to solve this in a more efficient way, and if so, how? Any hints or answers are greatly appreciated.
edit
To be clear, in my solution I generate "fake" groups where duplicate entities are excluded from "real" groups. In the example, entities (B, C, D, E) are duplicates (they exist in more than one group). Then for group 1 (A,B,C) I add the fake groups (A,B), (A,C), (A) to the list of groups that I generate combinations for.

This problem can be formulated as a linear integer program. Although integer programming is not efficient in terms of worst-case complexity, it works very quickly with this number of variables.
Here is how we turn this problem into an integer program.
Let v be a vector of size K representing the entity values.
Let G be a K x M binary matrix that defines the groups: G(i,j)=1 means that the entity i belongs to the group j and G(i,j)=0 otherwise.
Let x be a binary vector of size M, which represents the choice of groups: x[j]=1 indicates we pick the group j.
Let y be a binary vector of size K, which represents the inclusion of entities: y[i]=1 means that the entity i is included in the outcome.
Our goal is to choose x and y so as to maximize sum(v*y) under the following conditions:
G x >= y ... every included entity must belong to at least one of the chosen groups
sum(x) = N ... we choose exactly N groups
Below is an implementation in R. It uses the lpSolve package, an interface to lp_solve.
library(lpSolve)

solver <- function(values, groups, N)
{
  n_group  <- ncol(groups)
  n_entity <- length(values)

  # objective: maximize sum(v * y); the group variables x get weight 0
  object <- c(rep(0, n_group), values)

  # constraint 1: G x - y >= 0 (each included entity is covered by a chosen group)
  lhs1 <- cbind(groups, -diag(n_entity))
  rhs1 <- rep(0, n_entity)
  dir1 <- rep(">=", n_entity)

  # constraint 2: sum(x) = N (choose exactly N groups)
  lhs2 <- matrix(c(rep(1, n_group), rep(0, n_entity)), nrow=1)
  rhs2 <- N
  dir2 <- "="

  lhs   <- rbind(lhs1, lhs2)
  rhs   <- c(rhs1, rhs2)
  direc <- c(dir1, dir2)

  lp("max", object, lhs, direc, rhs, all.bin=TRUE)
}
values <- c(A=2, B=2, C=2, D=3, E=3)
groups <- matrix(c(1,1,1,0,0,
                   0,1,0,1,0,
                   0,0,1,0,1,
                   0,0,0,1,0,
                   0,0,0,0,1),
                 nrow=5, ncol=5)
rownames(groups) <- c("A", "B", "C", "D", "E")

ans <- solver(values, groups, 1)
print(ans)
names(values)[tail(ans$solution, length(values))==1]
# Success: the objective function is 6
# [1] "A" "B" "C"
ans <- solver(values, groups, 2)
print(ans)
names(values)[tail(ans$solution, length(values))==1]
# Success: the objective function is 10
# [1] "B" "C" "D" "E"
ans <- solver(values, groups, 3)
print(ans)
names(values)[tail(ans$solution, length(values))==1]
# Success: the objective function is 12
# [1] "A" "B" "C" "D" "E"
Below we test how this scales to a larger problem. It finishes in about one second.
# how does it scale?
n_entity <- 50
n_group <- 50
N <- 10
entity_names <- paste("X", 1:n_entity, sep="")
values <- sample(1:10, n_entity, replace=TRUE)
names(values) <- entity_names
groups <- matrix(sample(c(0,1), n_entity*n_group,
                        replace=TRUE, prob=c(0.99, 0.01)),
                 nrow=n_entity, ncol=n_group)
rownames(groups) <- entity_names
ans <- solver(values, groups, N)
print(ans)
names(values)[tail(ans$solution, length(values))==1]

If the entity values are always positive, I think you can get a solution without generating all combinations:
Sort the groups by their largest element, then by their 2nd largest element, ..., then by their nth largest element. In this case you would have 3 copies of the sort, since the largest group has 3 elements.
For each copy, make one pass from the largest to the smallest, adding a group to the solution only if it doesn't contain an element you've already added. This yields 3 results; take the largest. There shouldn't be a larger possible solution unless weights could be negative.
here's an implementation in C#
var entities = new Dictionary<char, int>() { { 'A', 2 }, { 'B', 2 }, { 'C', 2 }, { 'D', 3 }, { 'E', 3 } };
var groups = new List<string>() { "ABC", "BD", "CE", "D", "E" };
var solutions = new List<Tuple<List<string>, int>>();

for (int i = 0; i < groups.Max(x => x.Length); i++)
{
    var solution = new List<string>();
    foreach (var group in groups.OrderByDescending(x => x.Length > i ? entities[x[i]] : -1))
        if (!group.ToCharArray().Any(c => solution.Any(g => g.Contains(c))))
            solution.Add(group);
    solutions.Add(new Tuple<List<string>, int>(solution, solution.Sum(g => g.ToCharArray().Sum(c => entities[c]))));
}

solutions.Dump();
solutions.OrderByDescending(x => x.Item2).First().Dump();
output:


Arranging the number 1 in a 2d matrix

Given the number of rows and columns of a 2d matrix
Initially all elements of matrix are 0
Given the number of 1's that should be present in each row
Given the number of 1's that should be present in each column
Determine if it is possible to form such matrix.
Example:
Input: r=3 c=2 (no. of rows and columns)
2 1 0 (number of 1's that should be present in each row respectively)
1 2 (number of 1's that should be present in each column respectively)
Output: Possible
Explanation:
1 1
0 1
0 0
I tried solving this problem for about 12 hours by checking whether the summation of Ri equals the summation of Ci.
But then I wondered about cases like the following, where the sums match but no matrix is possible:
3 3
1 3 0
0 2 2
r and c can be up to 10^5.
Any ideas how I should move forward?
Edit: Constraints added, and the output should only be "possible" or "impossible". The possible matrix need not be displayed.
Can anyone help me now?
Hint: one possible solution utilizes the Maximum Flow Problem, by creating a special graph and running a standard maximum flow algorithm on it.
If you're not familiar with the above problem, you may start reading about it here: https://en.wikipedia.org/wiki/Maximum_flow_problem
If you're interested in the full solution, please comment and I'll update the answer. But it requires understanding the above algorithm.
Solution as requested:
Create a graph of r+c+2 nodes.
Node 0 is the source, node r+c+1 is the sink. Nodes 1..r represent the rows, while r+1..r+c the columns.
Create following edges:
from source to nodes i=1..r of capacity r_i
from nodes i=r+1..r+c to sink of capacity c_i
between all the nodes i=1..r and j=r+1..r+c of capacity 1
Run a maximum flow algorithm; the saturated edges between row nodes and column nodes define where you should put the 1s.
If it's not possible, then the maximum flow value will be less than the number of expected 1s in the matrix.
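For illustration, here is a minimal Python sketch of that construction using the networkx library (an assumption on my part; any max-flow implementation works, and the function name is mine):

import networkx as nx

def possible(r_counts, c_counts):
    # nodes: 0 = source, 1..r = rows, r+1..r+c = columns, r+c+1 = sink
    r, c = len(r_counts), len(c_counts)
    G = nx.DiGraph()
    for i, ri in enumerate(r_counts, start=1):
        G.add_edge(0, i, capacity=ri)              # source -> row i
        for j in range(r + 1, r + c + 1):
            G.add_edge(i, j, capacity=1)           # row i -> column j
    for j, cj in enumerate(c_counts, start=r + 1):
        G.add_edge(j, r + c + 1, capacity=cj)      # column j -> sink
    flow, _ = nx.maximum_flow(G, 0, r + c + 1)
    return flow == sum(r_counts) == sum(c_counts)

# possible([2, 1, 0], [1, 2])  ->  True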
I will illustrate the algorithm with an example.
Assume we have m rows and n columns. Let rows[i] be the number of 1s in row i, for 0 <= i < m,
and cols[j] be the number of 1s in column j, for 0 <= j < n.
For example, for m = 3, and n = 4, we could have: rows = {4 2 3}, cols = {1 3 2 3}, and
the solution array would be:
    1 3 2 3
  +--------
4 | 1 1 1 1
2 | 0 1 0 1
3 | 0 1 1 1
Because we only want to know whether a solution exists, the values in rows and cols may be permuted in any order. The solution of each permutation is just a permutation of the rows and columns of the above solution.
So, given rows and cols, sort cols in decreasing order, and rows in increasing order. For our example, we have cols = {3 3 2 1} and rows = {2 3 4}, and the equivalent problem.
    3 3 2 1
  +--------
2 | 1 1 0 0
3 | 1 1 1 0
4 | 1 1 1 1
We transform cols into a form that is better suited for the algorithm. What cols tells us is that we have two series of 1s of length 3, one series of 1s of length 2, and one series of 1s of length 1, that are to be distributed among the rows of the array. We rewrite cols to capture just that, that is COLS = {2/3 1/2 1/1}, 2 series of length 3, 1 series of length 2, and 1 series of length 1.
Because we have 2 series of length 3, a solution exists only if we can put two 1s in the first row. This is possible because rows[0] = 2. We do not actually put any 1 in the first row, but record the fact that 1s have been placed there by decrementing the length of the series of length 3. So COLS becomes:
COLS = {2/2 1/2 1/1}
and we combine our two counts for series of length 2, yielding:
COLS = {3/2 1/1}
We now have the reduced problem:
3 | 1 1 1 0
4 | 1 1 1 1
Again we need to place 1s from our series of length 2 to have a solution. Fortunately, rows[1] = 3 and we can do this. We decrement the length of 3/2 and get:
COLS = {3/1 1/1} = {4/1}
We have the reduced problem:
4 | 1 1 1 1
Which is solved by 4 series of length 1, just what we have left. If at any step, the series in COLS cannot be used to satisfy a row count, then no solution is possible.
The general processing for each row may be stated as follows. For each row r, starting from the first element in COLS, decrement the lengths of as many elements count[k]/length[k] of COLS as needed, so that the sum of the count[k]'s equals rows[r]. Eliminate series of length 0 in COLS and combine series of same length.
Note that because elements of COLS are in decreasing order of lengths, the length of the last element decremented is always less than or equal to the next element in COLS (if there is a next element).
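Here is a minimal Python sketch of this reduction (my own illustration of the description above, not the author's code; the actual C# implementation appears later in this answer). COLS is a list of [count, length] pairs in decreasing order of length:

def feasible(rows, COLS):
    # e.g. cols = {3 3 2 1} becomes COLS = [[2, 3], [1, 2], [1, 1]]
    for r in rows:
        need = r
        updated = []
        for count, length in COLS:
            take = min(count, need)
            need -= take
            if count - take > 0:
                updated.append([count - take, length])    # untouched columns
            if take > 0 and length - 1 > 0:
                updated.append([take, length - 1])        # decremented columns
        if need > 0:
            return False              # this row cannot be satisfied
        COLS = []                     # merge adjacent series of equal length
        for count, length in updated:
            if COLS and COLS[-1][1] == length:
                COLS[-1][0] += count
            else:
                COLS.append([count, length])
    return not COLS                   # every column demand must be consumed

# feasible([2, 3, 4], [[2, 3], [1, 2], [1, 1]])  ->  True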
EXAMPLE 2 : Solution exists.
rows = {1 3 3}, cols = {2 2 2 1} => COLS = {3/2 1/1}
1 series of length 2 is decremented to satisfy rows[0] = 1, and the 2 other series of length 2 remain at length 2.
rows[0] = 1
COLS = {2/2 1/1 1/1} = {2/2 2/1}
The 2 series of length 2 are decremented, and 1 of the series of length 1.
The series whose length has become 0 is deleted, and the series of length 1 are combined.
rows[1] = 3
COLS = {2/1 1/0 1/1} = {2/1 1/1} = {3/1}
A solution exists, since rows[2] can be satisfied.
rows[2] = 3
COLS = {3/0} = {}
EXAMPLE 3: Solution does not exist.
rows = {0 2 3}, cols = {3 2 0 0} => COLS = {1/3 1/2}
rows[0] = 0
COLS = {1/3 1/2}
rows[1] = 2
COLS = {1/2 1/1}
rows[2] = 3 => impossible to satisfy; no solution.
SPACE COMPLEXITY
It is easy to see that it is O(m + n).
TIME COMPLEXITY
We iterate over each row only once. For each row i, we need to iterate over at most rows[i] <= n elements of COLS. The time complexity is therefore O(m x n).
After finding this algorithm, I found the following theorem:
The Havel-Hakimi theorem (Havel 1955, Hakimi 1962) states that there exists a matrix X_{n,m} of 0's and 1's with row totals a_0 = (a_1, a_2, ..., a_n) and column totals b_0 = (b_1, b_2, ..., b_m) such that b_i >= b_{i+1} for every 0 < i < m, if and only if another matrix X_{n-1,m} of 0's and 1's with row totals a_1 = (a_2, a_3, ..., a_n) and column totals b_1 = (b_1 - 1, b_2 - 1, ..., b_{a_1} - 1, b_{a_1+1}, ..., b_m) also exists.
from the post Finding if binary matrix exists given the row and column sums.
This is basically what my algorithm does, while trying to optimize the decrementing part, i.e., all the -1's in the above theorem. Now that I see the above theorem, I know my algorithm is correct. Nevertheless, I checked the correctness of my algorithm by comparing its results with those of a brute-force algorithm for arrays of up to 50 cells.
Here is the C# implementation.
public class Pair
{
    public int Count;
    public int Length;
}

public class PairsList
{
    public LinkedList<Pair> Pairs;
    public int TotalCount;
}

class Program
{
    static void Main(string[] args)
    {
        int[] rows = new int[] { 0, 0, 1, 1, 2, 2 };
        int[] cols = new int[] { 2, 2, 0 };
        bool success = Solve(cols, rows);
    }

    static bool Solve(int[] cols, int[] rows)
    {
        PairsList pairs = new PairsList() { Pairs = new LinkedList<Pair>(), TotalCount = 0 };

        FillAllPairs(pairs, cols);

        for (int r = 0; r < rows.Length; r++)
        {
            if (rows[r] > 0)
            {
                if (pairs.TotalCount < rows[r])
                    return false;

                if (pairs.Pairs.First != null && pairs.Pairs.First.Value.Length > rows.Length - r)
                    return false;

                DecrementPairs(pairs, rows[r]);
            }
        }

        return pairs.Pairs.Count == 0 || pairs.Pairs.Count == 1 && pairs.Pairs.First.Value.Length == 0;
    }

    static void DecrementPairs(PairsList pairs, int count)
    {
        LinkedListNode<Pair> pair = pairs.Pairs.First;

        while (count > 0 && pair != null)
        {
            LinkedListNode<Pair> next = pair.Next;

            if (pair.Value.Count == count)
            {
                pair.Value.Length--;
                if (pair.Value.Length == 0)
                {
                    pairs.Pairs.Remove(pair);
                    pairs.TotalCount -= count;
                }
                else if (pair.Next != null && pair.Next.Value.Length == pair.Value.Length)
                {
                    pair.Value.Count += pair.Next.Value.Count;
                    pairs.Pairs.Remove(pair.Next);
                    next = pair;
                }
                count = 0;
            }
            else if (pair.Value.Count < count)
            {
                count -= pair.Value.Count;
                pair.Value.Length--;
                if (pair.Value.Length == 0)
                {
                    pairs.Pairs.Remove(pair);
                    pairs.TotalCount -= pair.Value.Count;
                }
                else if (pair.Next != null && pair.Next.Value.Length == pair.Value.Length)
                {
                    pair.Value.Count += pair.Next.Value.Count;
                    pairs.Pairs.Remove(pair.Next);
                    next = pair;
                }
            }
            else // pair.Value.Count > count
            {
                Pair p = new Pair() { Count = count, Length = pair.Value.Length - 1 };
                pair.Value.Count -= count;
                if (p.Length > 0)
                {
                    if (pair.Next != null && pair.Next.Value.Length == p.Length)
                        pair.Next.Value.Count += p.Count;
                    else
                        pairs.Pairs.AddAfter(pair, p);
                }
                else
                    pairs.TotalCount -= count;
                count = 0;
            }

            pair = next;
        }
    }

    static int FillAllPairs(PairsList pairs, int[] cols)
    {
        List<Pair> newPairs = new List<Pair>();
        int c = 0;
        while (c < cols.Length && cols[c] > 0)
        {
            int k = c++;
            if (cols[k] > 0)
                pairs.TotalCount++;
            while (c < cols.Length && cols[c] == cols[k])
            {
                if (cols[k] > 0) pairs.TotalCount++;
                c++;
            }
            newPairs.Add(new Pair() { Count = c - k, Length = cols[k] });
        }

        LinkedListNode<Pair> pair = pairs.Pairs.First;
        foreach (Pair p in newPairs)
        {
            while (pair != null && p.Length < pair.Value.Length)
                pair = pair.Next;

            if (pair == null)
            {
                pairs.Pairs.AddLast(p);
            }
            else if (p.Length == pair.Value.Length)
            {
                pair.Value.Count += p.Count;
                pair = pair.Next;
            }
            else // p.Length > pair.Value.Length
            {
                pairs.Pairs.AddBefore(pair, p);
            }
        }

        return c;
    }
}
(Note: to avoid confusion between when I'm talking about the actual numbers in the problem vs. when I'm talking about the zeros and ones in the matrix, I'm going to instead fill the matrix with spaces and X's. This obviously doesn't change the problem.)
Some observations:
If you're filling in a row, and there's (for example) one column needing 10 more X's and another column needing 5 more X's, then you're sometimes better off putting the X in the "10" column and saving the "5" column for later (because you might later run into 5 rows that each need 2 X's), but you're never better off putting the X in the "5" column and saving the "10" column for later (because even if you later run into 10 rows that all need an X, they won't mind if they don't all go in the same column). So we can use a somewhat "greedy" algorithm: always put an X in the column still needing the most X's. (Of course, we'll need to make sure that we don't greedily put an X in the same column multiple times for the same row!)
Since you don't need to actually output a possible matrix, the rows are all interchangeable and the columns are all interchangeable; all that matters is how many rows still need 1 X, how many still need 2 X's, etc., and likewise for columns.
With that in mind, here's one fairly simple approach:
(Optimization.) Add up the counts for all the rows, add up the counts for all the columns, and return "impossible" if the sums don't match.
Create an array of length r+1 and populate it with how many columns need 1 X, how many need 2 X's, etc. (You can ignore any columns needing 0 X's.)
(Optimization.) To help access the array efficiently, build a stack/linked-list/etc. of the indices of nonzero array elements, in decreasing order (e.g., starting at index r if it's nonzero, then index r−1 if it's nonzero, etc.), so that you can easily find the elements representing columns to put X's in.
(Optimization.) To help determine when there's a row that can't be satisfied, also make note of the total number of columns needing any X's, and make note of the largest number of X's needed by any row. If the former is less than the latter, return "impossible".
(Optimization.) Sort the rows by the number of X's they need.
Iterate over the rows, starting with the one needing the fewest X's and ending with the one needing the most X's, and for each one:
Update the array accordingly. For example, if a row needs 12 X's, and the array looks like [..., 3, 8, 5], then you'll update the array to look like [..., 3+7 = 10, 8+5−7 = 6, 5−5 = 0]. If it's not possible to update the array because you run out of columns to put X's in, return "impossible". (Note: this part should never actually return "impossible", because we're keeping count of the number of columns left and the max number of columns we'll need, so we should have already returned "impossible" if this was going to happen. I mention this check only for clarity.)
Update the stack/linked-list of indices of nonzero array elements.
Update the total number of columns needing any X's. If it's now less than the greatest number of X's needed by any row, return "impossible".
(Optimization.) If the first nonzero array element has an index greater than the number of rows left, return "impossible".
If we complete our iteration without having returned "impossible", return "possible".
(Note: the reason I say to start with the row needing the fewest X's, and work your way to the row with the most X's, is that a row needing more X's may involve examining and updating more elements of the array and of the stack, so the rows needing fewer X's are cheaper. This isn't just a matter of postponing the work: the rows needing fewer X's can help "consolidate" the array, so that there will be fewer distinct column-counts, making the later rows cheaper than they would otherwise be. In a very-bad-case scenario, such as the case of a square matrix where every single row needs a distinct positive number of X's and every single column needs a distinct positive number of X's, the fewest-to-most order means you can handle each row in O(1) time, for linear time overall, whereas the most-to-fewest order would mean that each row would take time proportional to the number of X's it needs, for quadratic time overall.)
Overall, this takes no worse than O(r+c+n) time (where n is the number of X's); I think that the optimizations I've listed are enough to ensure that it's closer to O(r+c) time, but it's hard to be 100% sure. I recommend trying it to see if it's fast enough for your purposes.
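For concreteness, here is a minimal Python sketch of the greedy idea, without the bucket-array and ordering optimizations described above (so it runs in O(m*n log n) rather than near-linear time; the function name is mine):

def possible(rows, cols):
    # Greedily satisfy each row from the columns still needing the most X's.
    if sum(rows) != sum(cols):
        return False
    cols = sorted((c for c in cols if c > 0), reverse=True)
    for r in sorted(rows, reverse=True):
        if r > len(cols):
            return False                   # not enough usable columns left
        for i in range(r):
            cols[i] -= 1                   # put an X in the r neediest columns
        cols = sorted((c for c in cols if c > 0), reverse=True)
    return not cols

# possible([2, 1, 0], [1, 2])  ->  True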
You can use brute force (iterating through all 2^(r*c) possibilities) to solve it, but that will take a long time. If r*c is under 64, you can accelerate it to a certain extent using bit-wise operations on 64-bit integers; even then, however, iterating through all 2^64 possibilities would take, at one try per millisecond, over 500M years.
A wiser choice is to add bits one by one, and only continue placing bits if no constraints are broken. This will eliminate the vast majority of possibilities, greatly speeding up the process. Look up backtracking for the general idea. It is not unlike solving sudokus through guesswork: once it becomes obvious that your guess was wrong, you erase it and try guessing a different digit.
As with sudokus, there are certain strategies that can be written into code and will result in speedups when they apply. For example, if the sum of 1s in rows is different from the sum of 1s in columns, then there are no solutions.
If over 50% of the bits will be on, you can instead work on the complementary problem (transform all ones to zeroes and vice-versa, while updating row and column counts). Both problems are equivalent, because any answer for one is also valid for the complementary.
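As an illustration, here is a minimal Python sketch of that backtracking idea (names are mine; it is only practical for small matrices, since the worst case is still exponential):

from itertools import combinations

def exists_matrix(rows, cols):
    # Fill the matrix row by row; prune any branch that would overfill a column.
    if sum(rows) != sum(cols):
        return False
    remaining = list(cols)
    def place(i):
        if i == len(rows):
            return all(c == 0 for c in remaining)
        usable = [j for j in range(len(cols)) if remaining[j] > 0]
        for chosen in combinations(usable, rows[i]):
            for j in chosen:
                remaining[j] -= 1
            if place(i + 1):
                return True
            for j in chosen:
                remaining[j] += 1          # undo the guess and try another
        return False
    return place(0)

# exists_matrix([2, 1, 0], [1, 2])  ->  True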
This problem can be solved in O(n log n) using the Gale-Ryser theorem (where n is the maximum of the lengths of the two degree sequences).
First, make both sequences of equal length by adding 0's to the smaller sequence, and let this length be n.
Let the sequences be A and B, both sorted in non-increasing order. Create a prefix sum array P for B such that the i-th element of P equals the sum of the first i elements of B.
Now, check that sum(A) = sum(B), and iterate over k from 1 to n, checking the Gale-Ryser condition:
A_1 + A_2 + ... + A_k <= sum over all j of min(B_j, k)
The second sum can be calculated in O(log n): binary search in B for the last number that is >= k (each such element contributes k), and use the precalculated P for the sum of the remaining, smaller elements.
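A small Python sketch of this check (the names are mine; for the binary search it uses the reversed, non-decreasing copy of B):

from bisect import bisect_left

def gale_ryser(A, B):
    # pad to equal length n and sort both non-increasing
    n = max(len(A), len(B))
    A = sorted(A + [0] * (n - len(A)), reverse=True)
    B = sorted(B + [0] * (n - len(B)), reverse=True)
    if sum(A) != sum(B):
        return False
    P = [0]                                    # P[i] = B[0] + ... + B[i-1]
    for b in B:
        P.append(P[-1] + b)
    B_rev = B[::-1]                            # non-decreasing, so bisect works
    prefA = 0
    for k in range(1, n + 1):
        prefA += A[k - 1]
        ge = n - bisect_left(B_rev, k)         # how many B_j >= k
        if prefA > k * ge + (P[n] - P[ge]):    # sum of min(B_j, k)
            return False
    return True

# gale_ryser([4, 2, 3], [1, 3, 2, 3])  ->  True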
Inspired by the solution given by RobertBaron, I have tried to build a new algorithm:
rows = [int(x) for x in input().split()]
cols = [int(ss) for ss in input().split()]
rows.sort()
cols.sort(reverse=True)
for i in range(len(rows)):
    for j in range(len(cols)):
        if rows[i] != 0 and cols[j] != 0:
            rows[i] = rows[i] - 1
            cols[j] = cols[j] - 1
print("rows: ", rows)
print("cols: ", cols)
# if there is any non-zero value left, print NO, else print YES
flag = True
for i in range(len(rows)):
    if rows[i] != 0:
        flag = False
        break
for j in range(len(cols)):
    if cols[j] != 0:
        flag = False
if flag:
    print("YES")
else:
    print("NO")
Here, I have sorted the rows in ascending order and the cols in descending order, later decrementing a particular row and column whenever a 1 needs to be placed.
It is working for all the test cases posted here! The rest, GOD knows.

Optimization of a function which looks for combinations - out-of-memory trouble + speed

Below is a function that creates all possible combinations of splitting the elements of x into n groups (all groups having the same number of elements).
Function:
perm.groups <- function(x, n){
  nx <- length(x)
  ning <- nx/n

  group1 <-
    rbind(
      matrix(rep(x[1], choose(nx-1, ning-1)), nrow=1),
      combn(x[-1], ning-1)
    )
  ng <- ncol(group1)

  if(n > 2){
    out <- vector('list', ng)
    for(i in seq_len(ng)){
      other <- perm.groups(setdiff(x, group1[,i]), n=n-1)
      out[[i]] <- lapply(seq_along(other),
                         function(j) cbind(group1[,i], other[[j]])
      )
    }
    out <- unlist(out, recursive=FALSE)
  } else {
    other <- lapply(seq_len(ng), function(i)
      matrix(setdiff(x, group1[,i]), ncol=1)
    )
    out <- lapply(seq_len(ng),
                  function(i) cbind(group1[,i], other[[i]])
    )
  }
  out
}
Pseudo-code (explanation):

nb = number of groups
ning = number of elements in every group

if(nb == 2)
    1. take the first element, and add it to every possible
       combination of ning-1 elements of x[-1]
    2. take the difference between each group defined in step 1 and x
       to get the related second group
    3. combine the groups from step 2 with the related groups from step 1

if(nb > 2)
    1. take the first element, and add it to every possible
       combination of ning-1 elements of x[-1]
    2. to define the other groups belonging to the first groups obtained like this,
       apply the algorithm to the other elements of x, but for nb-1 groups
    3. combine all possible other groups from step 2
       with the related first groups from step 1
This function (and pseudo-code) was first created by Joris Meys in this previous post:
Find all possible ways to split a list of elements into a given number of groups of the same size
Is there a way to create a function that returns a given number of randomly chosen possible combinations?
Such a function would take a third argument, either percentage.possibilities or number.possiblities, which fixes the number of random different combinations the function returns.
Something like:
new.perm.groups(x=1:12, n=3, number.possiblities=50)
Building on @JackManey's suggestion, you can sample one permutation group in an equiprobable fashion using
sample.perm.group <- function(ning, ngrp)
{
  if( ngrp==1 ) return(seq_len(ning))

  g1 <- 1 + sample(ning*ngrp-1, size=ning-1)
  g1 <- c(1, g1[order(g1)])
  remaining <- seq_len(ning*ngrp)[-g1]

  cbind(g1, matrix(remaining[sample.perm.group(ning, ngrp-1)], nrow=ning), deparse.level=0)
}
where ning is the number of elements per group and ngrp is the number of groups.
It returns indices, so if you have an arbitrary vector you can use it as a permutation:
> ind <- sample.perm.group(3,3)
> ind
[,1] [,2] [,3]
[1,] 1 2 5
[2,] 3 7 6
[3,] 4 8 9
> LETTERS[1:9][ind]
[1] "A" "C" "D" "B" "G" "H" "E" "F" "I"
To generate a sample of permutations of size N, you have two options: If you allow repetitions, i.e., a sample with replacement, all you have to do is run the preceding function N times. OTOH, if your sample is to be taken without replacement, then you can use a rejection mechanism:
sample.perm.groups <- function(ning, ngrp, N)
{
  result <- list(sample.perm.group(ning, ngrp))
  for( i in seq_len(N-1) )
  {
    repeat
    {
      y <- sample.perm.group(ning, ngrp)
      if( all(vapply(result, function(x)any(x!=y), logical(1))) ) break
    }
    result[[i+1]] <- y
  }
  result
}
This is clearly an equiprobable sampling design, and the rejection step is unlikely to make it inefficient, since the number of possible combinations is usually much larger than N.
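For comparison, the same equiprobable draw can be sketched in Python by shuffling the indices and cutting the shuffled sequence into groups (each unordered grouping corresponds to the same number of shuffles, so the draw is uniform; the function name is mine):

import random

def sample_perm_group(ning, ngrp):
    # one uniform random split of 1..ning*ngrp into ngrp groups of size ning
    idx = list(range(1, ning * ngrp + 1))
    random.shuffle(idx)
    groups = [sorted(idx[k*ning:(k+1)*ning]) for k in range(ngrp)]
    return sorted(groups)   # canonical order, like fixing element 1 in group 1

# sample_perm_group(3, 3)  ->  e.g. [[1, 3, 4], [2, 7, 8], [5, 6, 9]]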

scala version of swap algorithm for null models

The problem I am having is finding an efficient way to locate swappable elements in a matrix, in order to implement a swap algorithm for null model creation.
The matrix consists of 0's and 1's, and the idea is that elements can be switched between columns so that the row and column totals of the matrix remain the same.
For example, given the following matrix:
   c1 c2 c3 c4
r1  0  1  0  0 = 1
r2  1  0  0  1 = 2
r3  0  0  0  0 = 0
r4  1  1  1  1 = 4
   ------------
    2  2  1  2
the entries in columns c2 and c4 of rows r1 and r2 can be swapped in such a way that the totals are not altered, i.e.:
   c1 c2 c3 c4
r1  0  0  0  1 = 1
r2  1  1  0  0 = 2
r3  0  0  0  0 = 0
r4  1  1  1  1 = 4
   ------------
    2  2  1  2
This all needs to be done randomly so as not to introduce any bias.
I have one solution that works: I randomly select a row and two columns. If they yield a 1,0 or 0,1 pattern, then I randomly select another row and check the same columns to see if they yield the opposite pattern. If either check fails, I start over and select new elements.
This method works, but I only "hit" the correct patterns about 10% of the time. In a large matrix, or in one with few 1's in the rows, I waste a lot of time "missing". I figured there had to be a more intelligent way of choosing elements in the matrix while still choosing them randomly.
The code for the working method is:
def isSwappable(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
  val indices = getRowAndColIndices(matrix)

  (matrix(indices._1._1)(indices._2._1), matrix(indices._1._1)(indices._2._2)) match {
    case (1, 0) => {
      if (matrix(indices._1._2)(indices._2._1) == 0 & matrix(indices._1._2)(indices._2._2) == 1) {
        indices
      }
      else {
        isSwappable(matrix)
      }
    }
    case (0, 1) => {
      if (matrix(indices._1._2)(indices._2._1) == 1 & matrix(indices._1._2)(indices._2._2) == 0) {
        indices
      }
      else {
        isSwappable(matrix)
      }
    }
    case _ => {
      isSwappable(matrix)
    }
  }
}

def getRowAndColIndices(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
  (getNextIndex(rnd.nextInt(matrix.size), matrix.size), getNextIndex(rnd.nextInt(matrix(0).size), matrix(0).size))
}

def getNextIndex(i: Int, constraint: Int): Tuple2[Int, Int] = {
  val newIndex = rnd.nextInt(constraint)
  newIndex match {
    case `i` => getNextIndex(i, constraint)
    case _ => (i, newIndex)
  }
}
I figured a more efficient way to handle this was to remove any rows that could not be used (all 1's or all 0's) and then choose an element randomly. From there I could filter out any columns in the row that had the same value and then choose from the remaining columns.
Once the first row and column are chosen, I then filter out the rows that cannot provide the required pattern and choose from the remaining rows.
This works for the most part, but the problem I can't figure out how to deal with is what happens when there are no columns or rows to choose from. I don't want to loop infinitely trying to find the pattern I need, and I need a way of starting over if I do get an empty list of rows or columns to choose from.
The code that I have so far that sort of works (until I get an empty list) is:
def getInformativeRowIndices(matrix: Matrix) = (
  matrix
    .zipWithIndex
    .filter(_._1.distinct.size > 1)
    .map(_._2)
    .toList
)

def getRowsWithOppositeValueInColumn(col: Int, value: Int, matrix: Matrix) = (
  matrix
    .zipWithIndex
    .filter(_._1(col) != value)
    .map(_._2)
    .toList
)

def getColsWithOppositeValueInSameRow(row: Int, value: Int, matrix: Matrix) = (
  matrix(row)
    .zipWithIndex
    .filter(_._1 != value)
    .map(_._2)
    .toList
)

def process(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
  val row1Indices = getInformativeRowIndices(matrix)
  if (row1Indices.isEmpty) sys.error("No informative rows")

  val row1 = row1Indices(rnd.nextInt(row1Indices.size))
  val col1 = rnd.nextInt(matrix(0).size)

  val colIndices = getColsWithOppositeValueInSameRow(row1, matrix(row1)(col1), matrix)
  if (colIndices.isEmpty) process(matrix)  // note: the result of this recursive call is discarded

  val col2 = colIndices(rnd.nextInt(colIndices.size))

  val row2Indices = getRowsWithOppositeValueInColumn(col1, matrix(row1)(col1), matrix)
    .intersect(getRowsWithOppositeValueInColumn(col2, matrix(row1)(col2), matrix))
  println(row2Indices)
  if (row2Indices.isEmpty) process(matrix)  // note: same problem here

  val row2 = row2Indices(rnd.nextInt(row2Indices.size))

  ((row1, row2), (col1, col2))
}
I think the recursive calls are wrong and don't really work here (their results are discarded). Also, I am really just trying to improve the speed of cell selection, so any ideas or suggestions would be greatly appreciated.
EDIT:
I have had a chance to play with this a little more and have come up with another solution, but it does not seem to be much faster than just randomly choosing cells in the matrix. Also, I should add that the matrix needs to be swapped about 30000 times in succession in order for it to be considered randomised, and I need to generate 5000 random matrices for each test, of which I have at least another 5000 to do, so performance is kind of important.
The current solution (besides random cell selection) is:
Randomly select 2 rows from the matrix
subtract one row from the other and put it in an Array
if the new Array contains both a 1 and -1 then we can swap
The logic of the subtraction looks like this:
    0  1  0  0
  - 1  0  0  1
  -------------
   -1  1  0 -1
The method that does this looks like this:
def findSwaps(matrix: Matrix, iterations: Int): Boolean = {
  var result = false

  val mtxLength = matrix.length
  val row1 = rnd.nextInt(mtxLength)
  val row2 = getNextIndex(row1, mtxLength)

  val difference = subRows(matrix(row1), matrix(row2))

  if (difference.min == -1 & difference.max == 1) {
    val zeroOne = difference.zipWithIndex.filter(_._1 == -1).map(_._2)
    val oneZero = difference.zipWithIndex.filter(_._1 == 1).map(_._2)
    val col1 = zeroOne(rnd.nextInt(zeroOne.length))
    val col2 = oneZero(rnd.nextInt(oneZero.length))
    swap(matrix, row1, row2, col1, col2)
    result = true
  }
  result
}
The matrix row subtraction looks like this:
def subRows(a: Array[Int], b: Array[Int]): Array[Int] = (a, b).zipped.map(_ - _)
And the actual swap looks like this:
def swap(matrix: Matrix, row1: Int, row2: Int, col1: Int, col2: Int) = {
  val temp = (matrix(row1)(col1), matrix(row1)(col2))
  matrix(row1)(col1) = matrix(row2)(col1)
  matrix(row1)(col2) = matrix(row2)(col2)
  matrix(row2)(col1) = temp._1
  matrix(row2)(col2) = temp._2
  matrix
}
This works much better than before in that I have between 80% and 90% success for an attempted swap (it was only about 10% with random cell selection), however... it is still taking about 2.5 minutes to generate 1000 randomised matrices.
Any ideas on how to improve the speed?
I'm going to assume the matrices are big so that storage of the order of (matrix size squared) is not viable (for reasons of either speed or memory).
If you have a sparse matrix, you can enter the index of each 1 in each column in a set (here I show the compact way to do things, but you may wish to iterate with while loops for speed):
val mtx = Array(Array(0,1,0,0),Array(1,0,0,1),Array(0,0,0,0),Array(1,1,1,1))
val cols = mtx.transpose.map(x => x.zipWithIndex.filter(_._1==1).map(_._2).toSet)
Now for each column, a later column contains compatible pairs (at least one) if and only if the following two sets are both nonempty:
def xorish(a: Set[Int], b: Set[Int]) = (a--b, b--a)
So the answer will involve computing these sets and testing whether they're both nonempty.
Now the question is what you mean by "sample randomly". Randomly sampling single 1,0 pairs is not the same as randomly sampling possible swaps. To see this, consider the following:
1 0 1 0
1 0 1 0
1 0 1 0
0 1 1 0
0 1 1 0
0 1 0 1
The two columns on the left have nine possible swaps. The two on the right have only five possible swaps. But if you are looking for (1,0) patterns, you will sample only three times on the left vs. five on the right; if you are looking for either (1,0) or (0,1), you will sample six and six, which again distorts the probabilities. The only way to fix this is either to not be clever, and randomly sample a second time (which in the first case will work out with a usable swap 3/5 of the time, while only 1/5 in the second), or to basically compute every possible pair for swapping (or at least how many pairs there are) and select from that predefined set.
If we want to do the latter, we note that for each pair of nonidentical columns, we can compute the two sets to swap among, and we know the sizes and the product is the total number of possibilities. In order to avoid instantiating all the possibilities, we can create
val poss = {
  for (i <- cols.indices; j <- (i+1) until cols.length) yield
    (i, j, (cols(i)--cols(j)).toArray, (cols(j)--cols(i)).toArray)
}.filter{ case (_,_,a,b) => a.length>0 && b.length>0 }
and then count how many there are:
val cuml = poss.map{ case (_,_,a,b) => a.size*b.size }.scanLeft(0)(_ + _).toArray
Now to pick a number at random, we pick a number between 0 and cuml.last and pick out which bucket this is and which item within the bucket:
def pickItem(cuml: Array[Int], poss: Seq[(Int,Int,Array[Int],Array[Int])]) = {
  val n = util.Random.nextInt(cuml.last)
  val k = {
    val i = java.util.Arrays.binarySearch(cuml, n)
    if (i < 0) -i-2 else i
  }
  val j = n - cuml(k)
  val bucket = poss(k)
  (
    bucket._1, bucket._2,
    bucket._3(j % bucket._3.size), bucket._4(j / bucket._3.size)
  )
}
This ends up returning (c1,c2,r1,r2) selected randomly.
Now that you have the coordinates, you can create the new matrix however you wish. (Most efficient is probably to do an in-place swap of the entries, and then swap back when you want to try again.)
Note that this is only sensible for a large number of independent swaps from the same starting matrix. If you instead want to do this iteratively and maintain independence, you are probably best off doing it randomly after all, unless the matrices are extremely sparse, at which point it's worth simply storing the matrices in some standard sparse matrix format (i.e., by index of nonzero entries) and doing your manipulation on those (probably with mutable sets and an update strategy, since the consequences of a single swap are confined to about n of the entries in an n*n matrix).

Algorithm for combining different age groups together based on their values

Let's say we have an array of age groups and an array of the number of people in each age group
For example:
Ages = ("1-13", "14-20", "21-30", "31-40", "41-50", "51+")
People = (1, 10, 21, 3, 2, 1)
I want an algorithm that combines age groups whenever a group has fewer than 5 people, with the following logic. The algorithm that I have so far does the following:
Start from the last element (here "51+"): can you combine it with the next group ("41-50")? If yes, add the numbers (1+2) and combine their labels. So we get the following:
Ages = ("1-13", "14-20", "21-30", "31-40", "41+")
People = (1, 10, 21, 3, 3)
Take the last one again (here "41+"). Can you combine it with the next group ("31-40")? The answer is yes, so we get:
Ages = ("1-13", "14-20", "21-30", "31+")
People = (1, 10, 21, 6)
Since the group "31+" now has 6 members, we cannot collapse it into the next group.
We cannot collapse "21-30" into the next one ("14-20") either.
"14-20" also has 10 people (>5), so we don't do anything to it either.
For the first one ("1-13"), since it has only one person and it is the last group on the left, we combine it with the next group "14-20" and get the following:
Ages = ("1-20", "21-30", "31+")
People = (11, 21, 6)
I have an implementation of this algorithm that uses many flags to keep track of whether any data has changed, and it makes a number of passes over the two arrays to finish this task.
My question is whether you know any efficient way of doing the same thing. Any data structure or algorithm that would let me do this without too much bookkeeping would be great.
Update:
A radical example would be (5,1,5):
In the first pass it becomes (5,6) [collapsing the one on the right into the one in the middle].
Then we have (5,6). We cannot touch 6 since it is larger than our threshold of 5, so we go to the next one (the 5 on the very left). Since it is less than or equal to 5, and since it is the last one on the left, we group it with the one on its right. So we finally get (11).
Here is an OCaml solution of a left-to-right merge algorithm:
let close_group acc cur_count cur_names =
(List.rev cur_names, cur_count) :: acc
let merge_small_groups mini l =
let acc, cur_count, cur_names =
List.fold_left (
fun (acc, cur_count, cur_names) (name, count) ->
if cur_count <= mini || count <= mini then
(acc, cur_count + count, name :: cur_names)
else
(close_group acc cur_count cur_names, count, [name])
) ([], 0, []) l
in
List.rev (close_group acc cur_count cur_names)
let input = [
"1-13", 1;
"14-20", 10;
"21-30", 21;
"31-40", 3;
"41-50", 2;
"51+", 1
]
let output = merge_small_groups 5 input
(* output = [(["1-13"; "14-20"], 11); (["21-30"; "31-40"; "41-50"; "51+"], 27)] *)
As you can see, the result of merging from left to right may not be what you want.
Depending on the goal, it may make more sense to merge the pair of consecutive elements whose sum is smallest and iterate until all counts are above the minimum of 5.
Here is my scala approach.
We start with two lists:
val people = List (1, 10, 21, 3, 2, 1)
val ages = List ("1-13", "14-20", "21-30", "31-40", "41-50", "51+")
and combine them to a kind of mapping:
val agegroup = ages.zip (people)
Define a method to merge two Strings describing (possibly open-ended) intervals. The first parameter is the right-hand interval, the one which may contain the + as in "51+".
/**
 * combine age-strings
 * a+ b-c => b+
 * a-b c-d => c-b
 */
def merge (xs: String, ys: String) = {
  val xab = xs.split ("[+-]")
  val yab = ys.split ("-")
  if (xs.contains ("+")) yab(0) + "+" else
    yab (0) + "-" + xab (1)
}
Here is the real work:
/**
 * reverse the list, combine groups < threshold.
 */
def remap (map: List [(String, Int)], threshold: Int) = {
  def remap (mappings: List [(String, Int)]): List [(String, Int)] = mappings match {
    case Nil      => Nil
    case x :: Nil => x :: Nil
    case x :: y :: xs =>
      if (x._2 > threshold) x :: remap (y :: xs)
      else remap ((merge (x._1, y._1), x._2 + y._2) :: xs)
  }

  val nearly = (remap (map.reverse)).reverse
  // check for first element
  if (! nearly.isEmpty && nearly.length > 1 && nearly (0)._2 < threshold) {
    val a = nearly (0)
    val b = nearly (1)
    val rest = nearly.tail.tail
    (merge (b._1, a._1), a._2 + b._2) :: rest
  } else nearly
}
and invocation
println (remap (agegroup, 5))
with result:
scala> println (remap (agegroup, 5))
List((1-20,11), (21-30,21), (31+,6))
The result is a list of pairs, age-group and membercount.
I guess the main part is easy to understand: there are 3 basic cases: an empty list, which can't be grouped; a list of one group, which is the solution itself; and more than one element.
If the first element (I reverse the list in the beginning, to start from the end) is bigger than 5 (6, whatever), yield it and proceed with the rest; if not, combine it with the second element and call the method recursively with the combined element and the rest.
If 2 elements get combined, the merge method for the strings is called.
The list is remapped after being reversed, and the result is reversed again. Then the first element has to be inspected and possibly combined.
We're done.
I think a good data structure would be a linked list of pairs, where each pair contains the age span and the count. Using that, you can easily walk the list, and join two pairs in O(1).
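To make the pass concrete, here is a minimal Python sketch of the right-to-left merge described in the question (names are mine; it uses plain lists for brevity, so deletions are O(n) where a linked list would give O(1) joins; the threshold is taken as 5, merging groups with <= 5 members as in the update):

def merge_label(left, right):
    # "41-50" + "51+" -> "41+";  "1-13" + "14-20" -> "1-20"
    start = left.split("-")[0]
    return start + "+" if right.endswith("+") else start + "-" + right.split("-")[1]

def merge_groups(ages, people, threshold=5):
    labels, counts = list(ages), list(people)
    # right-to-left pass: fold any group of <= threshold members into its left neighbour
    i = len(counts) - 1
    while i > 0:
        if counts[i] <= threshold:
            counts[i - 1] += counts[i]
            labels[i - 1] = merge_label(labels[i - 1], labels[i])
            del counts[i], labels[i]
        i -= 1
    # boundary: if the leftmost group is still too small, fold it into its right neighbour
    if len(counts) > 1 and counts[0] <= threshold:
        counts[1] += counts[0]
        labels[1] = merge_label(labels[0], labels[1])
        del counts[0], labels[0]
    return labels, counts

print(merge_groups(["1-13", "14-20", "21-30", "31-40", "41-50", "51+"],
                   [1, 10, 21, 3, 2, 1]))
# (['1-20', '21-30', '31+'], [11, 21, 6])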

How can you compare to what extent two lists are in the same order?

I have two arrays containing the same elements, but in different orders, and I want to know the extent to which their orders differ.
The method I tried didn't work. It was as follows:
For each list I built a matrix which recorded, for each pair of elements, whether they were above or below each other in the list. I then calculated a Pearson correlation coefficient of these two matrices. This worked extremely badly. Here's a trivial example:
list 1:
1
2
3
4
list 2:
1
3
2
4
The method I described above produced matrices like this (where 1 means the row element appears above the column element in the list, and 0 vice versa):
list 1:
  1 2 3 4
1   1 1 1
2     1 1
3       1
4
list 2:
  1 2 3 4
1   1 1 1
2     0 1
3       1
4
Since the only difference is the order of elements 2 and 3, these two lists should be deemed very similar. The Pearson correlation coefficient for those two matrices is 0, suggesting they are not correlated at all. I guess the problem is that what I'm looking for is not really a correlation coefficient, but some other kind of similarity measure. Edit distance, perhaps?
Can anyone suggest anything better?
Mean square of differences of indices of each element.
List 1: A B C D E
List 2: A D C B E
Indices of each element of List 1 in List 2 (zero based)
A B C D E
0 3 2 1 4
Indices of each element of List 1 in List 1 (zero based)
A B C D E
0 1 2 3 4
Differences:
A  B  C  D  E
0 -2  0  2  0
Square of differences:
A  B  C  D  E
0  4  0  4  0
Average differentness = 8 / 5 = 1.6.
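A quick Python sketch of this measure (the function name is mine):

def avg_sq_displacement(list1, list2):
    # mean squared difference between each element's positions in the two lists
    pos2 = {v: i for i, v in enumerate(list2)}
    return sum((i - pos2[v]) ** 2 for i, v in enumerate(list1)) / len(list1)

print(avg_sq_displacement("ABCDE", "ADCBE"))   # 8 / 5 = 1.6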
Just an idea, but is there any mileage in adapting a standard sort algorithm to count the number of swap operations needed to transform list1 into list2?
I think defining the compare function may be difficult, though (perhaps even just as difficult as the original problem!), and this may be inefficient.
edit: thinking about this a bit more, the compare function would essentially be defined by the target list itself. So for example if list 2 is:
1 4 6 5 3
...then the compare function should result in 1 < 4 < 6 < 5 < 3 (and return equality where entries are equal).
Then the swap function just needs to be extended to count the swap operations.
A bit late to the party here, but just for the record, I think Ben almost had it... if you'd looked further into correlation coefficients, you'd have found that Spearman's rank correlation coefficient might have been the way to go.
Interestingly, jamesh seems to have derived a similar measure, but not normalized.
See this recent SO answer.
You might consider how many changes it takes to transform one string into another (which I guess is what you were getting at when you mentioned edit distance).
See: http://en.wikipedia.org/wiki/Levenshtein_distance
Although I don't think Levenshtein distance takes rotation into account. If you allow rotation as an operation, then:
1, 2, 3, 4
and
2, 3, 4, 1
Are pretty similar.
There is a branch-and-bound algorithm that should work for any set of operators you like. It may not be very fast. The pseudocode goes something like this:
bool bounded_recursive_compare_routine(int* a, int* b, int level, int bound){
    if (level > bound) return false;
    // if at end of a and b, return true
    // apply rule 0, like no-change
    if (*a == *b){
        if (bounded_recursive_compare_routine(a+1, b+1, level+0, bound))
            return true;
    }
    // if we can apply rule 1, like rotation, to b, try that and recur
    if (bounded_recursive_compare_routine(a+1, b+1, level+cost_of_rotation, bound))
        return true;
    ...
    return false;
}

int get_minimum_cost(int* a, int* b){
    int bound;
    for (bound=0; ; bound++){
        if (bounded_recursive_compare_routine(a, b, 0, bound)) break;
    }
    return bound;
}
The time it takes is roughly exponential in the answer, because it is dominated by the last bound that works.
Added: This can be extended to find the nearest-matching string stored in a trie. I did that years ago in a spelling-correction algorithm.
I'm not sure exactly what formula it uses under the hood, but difflib.SequenceMatcher.ratio() does exactly this:
ratio(self) method of difflib.SequenceMatcher instance:
Return a measure of the sequences' similarity (float in [0,1]).
Code example:
>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(None, '1234', '1324')
>>> sm.ratio()
0.75
Another approach, based on a little bit of mathematics, is to count the number of inversions needed to convert one of the arrays into the other. An inversion here is the exchange of two neighboring array elements. In Ruby it is done like this:
# extend class Array by a new method
class Array
  def dist(other)
    raise 'can calculate distance only to array with same length' if length != other.length
    # initialize count of inversions to 0
    count = 0
    # loop over all pairs of indices i, j with i < j
    length.times do |i|
      (i+1).upto(length-1) do |j|
        # increase count if the i-th and j-th elements have different order
        count += 1 if (self[i] <=> self[j]) != (other[i] <=> other[j])
      end
    end
    return count
  end
end

l1 = [1, 2, 3, 4]
l2 = [1, 3, 2, 4]

# try an example (prints 1)
puts l1.dist(l2)
The distance between two arrays of length n can be between 0 (the arrays are the same) and n*(n-1)/2 (reversing the first array gives the second). If you prefer distances always between 0 and 1, to be able to compare distances of pairs of arrays of different lengths, just divide by n*(n-1)/2.
A disadvantage of this algorithm is its O(n^2) running time. It also assumes that the arrays don't have duplicate entries, but it could be adapted.
A remark about the code line "count += 1 if ...": the count is increased only if the i-th element of the first list is smaller than its j-th element while the i-th element of the second list is bigger than its j-th element, or vice versa. In short: (l1[i] < l1[j] and l2[i] > l2[j]) or (l1[i] > l1[j] and l2[i] < l2[j]).
If one has two orderings, one should look at two important rank correlation coefficients:
Spearman's rank correlation coefficient: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
This is almost the same as jamesh's answer, but scaled to the range -1 to 1.
It is defined as:
1 - (6 * sum_of_squared_distances) / (n_samples * (n_samples**2 - 1))
Kendall's tau: https://nl.wikipedia.org/wiki/Kendalls_tau
When using Python one could use:
from scipy import stats

order1 = [1, 2, 3, 4]
order2 = [1, 3, 2, 4]

print stats.spearmanr(order1, order2)[0]
# 0.8000
print stats.kendalltau(order1, order2)[0]
# 0.6667
If anyone is using the R language, I've implemented a function that computes the Spearman rank correlation coefficient using the method described above by @bubake:
get_spearman_coef <- function(objectA, objectB) {
  # getting the spearman rho rank test
  spearman_data <- data.frame(listA = objectA, listB = objectB)
  spearman_data$rankA <- 1:nrow(spearman_data)

  rankB <- c()
  for (index_valueA in 1:nrow(spearman_data)) {
    for (index_valueB in 1:nrow(spearman_data)) {
      if (spearman_data$listA[index_valueA] == spearman_data$listB[index_valueB]) {
        rankB <- append(rankB, index_valueB)
      }
    }
  }
  spearman_data$rankB <- rankB

  spearman_data$distance <- (spearman_data$rankA - spearman_data$rankB)**2
  spearman <- 1 - ((6 * sum(spearman_data$distance)) / (nrow(spearman_data) * (nrow(spearman_data)**2 - 1)))

  print(paste("spearman's rank correlation coefficient"))
  return(spearman)
}
Results:
get_spearman_coef(c("a","b","c","d","e"), c("a","b","c","d","e"))
spearman's rank correlation coefficient: 1
get_spearman_coef(c("a","b","c","d","e"), c("b","a","d","c","e"))
spearman's rank correlation coefficient: 0.8
