Merging database tables using rank and path compression heuristics - algorithm

I am having trouble with the following problem in my data structures course. The errors provided by the course are rather ambiguous and I am not able to discern where the bug lies.
NOTE: The error message only says "Wrong answer." and the test cases are not provided.
Input Format: The first line of the input contains two integers n and m — the number of tables in the
database and the number of merge queries to perform, respectively.
The second line of the input contains n integers r[i] — the number of rows in the i-th table.
Then the following m lines describe the merge queries. Each of them contains two integers destination[i] and
source[i] — the numbers of the tables to merge.
Output Format: For each query print a line containing a single integer — the maximum of the sizes of all
tables (in terms of the number of rows) after the corresponding operation.
Sample Input:
5 5
1 1 1 1 1
3 5
2 4
1 4
5 4
5 3
Sample Output:
2
2
3
5
5
This is my current code. It works for most cases, but there seem to be some edge cases that I have not accounted for.
class DataBases:
    def __init__(self, row_counts):
        self.max_row_count = max(row_counts)
        self.row_counts = row_counts
        n_tables = len(row_counts)
        self.parent = list(range(n_tables))
        self.rank = [1] * n_tables

    def get_parent(self, table):
        # walk up to the root, remembering the nodes passed on the way
        update_root = []
        root = table
        while root != self.parent[root]:
            update_root.append(self.parent[root])
            root = self.parent[root]
        # path compression: point the visited nodes directly at the root
        for i in update_root:
            self.parent[i] = root
        return root

    def merge_tables(self, dst, src):
        src_parent = self.get_parent(src)
        dst_parent = self.get_parent(dst)
        if src_parent == dst_parent:
            return
        # union by rank: attach the lower-rank root under the higher-rank one
        if self.rank[src_parent] > self.rank[dst_parent]:
            self.parent[dst_parent] = src_parent
            self.update_row_counts(src_parent, dst_parent)
        else:
            self.parent[src_parent] = dst_parent
            self.update_row_counts(dst_parent, src_parent)
            if self.rank[src_parent] == self.rank[dst_parent]:
                self.rank[dst_parent] += 1

    def update_row_counts(self, root, child):
        self.row_counts[root] += self.row_counts[child]
        self.row_counts[child] = 0
        self.max_row_count = max(self.max_row_count, self.row_counts[root])

def main():
    n_tables, n_queries = map(int, input().split())
    counts = list(map(int, input().split()))
    assert(n_tables == len(counts))
    db = DataBases(counts)
    for i in range(n_queries):
        dst, src = map(int, input().split())
        db.merge_tables(dst - 1, src - 1)
        print(db.max_row_count)

if __name__ == "__main__":
    main()

The issue was in the get_parent (path compression) implementation.
Correct Solution:
def get_parent(self, table):
    if table != self.parent[table]:
        self.parent[table] = self.get_parent(self.parent[table])
    return self.parent[table]
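As a quick sanity check (my own snippet, not part of the assignment), running the fixed class against the sample input above reproduces the sample output; note that the queries are 1-based in the input:

db = DataBases([1, 1, 1, 1, 1])
for dst, src in [(3, 5), (2, 4), (1, 4), (5, 4), (5, 3)]:
    db.merge_tables(dst - 1, src - 1)
    print(db.max_row_count)   # prints 2, 2, 3, 5, 5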

Related

Reduce binary string to an empty string by removing subsequences with alternative characters

This was a question asked in the coding round for a NASDAQ internship.
Program description:
The program takes a binary string as input. We have to successively remove subsequences in which the characters alternate, until the string is empty. The task is to find the minimum number of steps required to do so.
Example1:
let the string be : 0111001
Removed-0101, Remaining-110
Removed-10 , Remaining-1
Removed-1
No of steps = 3
Example2:
let the string be : 111000111
Removed-101, Remaining-110011
Removed-101, Remaining-101
Removed-101
No of steps = 3
Example3:
let the string be : 11011
Removed-101, Remaining-11
Removed-1 , Remaining-1
Removed-1
No of steps = 3
Example4:
let the string be : 10101
Removed-10101
No of steps = 1
The solution I tried considered the first character of the binary string as the first character of my subsequence. Then I created a new string, to which the next character would be appended if it wasn't part of the alternating sequence. The new string becomes our binary string, and the loop continues until the new string is empty (roughly an O(n^2) algorithm). As expected, it gave me a timeout error. I am adding C++ code similar to the one I had tried, which was originally in Java.
#include <bits/stdc++.h>
using namespace std;

int main() {
    string str, newStr;
    int len;
    char c;
    int count = 0;
    getline(cin, str);
    len = str.length();
    // continue till the string is empty
    while (len > 0) {
        len = 0;
        c = str[0];
        for (int i = 1; str[i] != '\0'; i++) {
            // if alternating characters are found, set c and skip that character
            if (c != str[i])
                c = str[i];
            // if the next character does not alternate, add it to newStr
            else {
                newStr.push_back(str[i]);
                len++;
            }
        }
        str = newStr;
        newStr = "";
        count++;
    }
    cout << count << endl;
    return 0;
}
I also tried approaches like finding the length of the largest run of the same consecutive character, which obviously doesn't satisfy every case, e.g. example 3.
I hope somebody can help me with an optimized solution for this question, preferably code in C, C++ or Python; even the algorithm alone would do.
I found a more optimal O(N log N) solution by maintaining a min-heap and a lookup hash map.
We start with the initial array of alternating counts of 0s and 1s.
That is, for string = 0111001, the input array is S = [1, 3, 2, 1].
Basic idea:
1. Heapify the count-array
2. Extract the minimum-count node => add it to num_steps
3. Now extract both its neighbours (maintained in the Node class) from the heap using the lookup-map
4. Merge both these neighbours and insert the result into the heap
5. Repeat steps 2-4 until no entries remain in the heap
Code implementation in Python
class Node:
    def __init__(self, node_type: int, count: int):
        self.prev = None
        self.next = None
        self.node_type = node_type
        self.node_count = count

    @staticmethod
    def compare(node1, node2) -> bool:
        return node1.node_count < node2.node_count

def get_num_steps(S: list):  ## Example: S = [2, 1, 2, 3]
    heap = []
    node_heap_position_map = {}  ## Map[Node] -> Heap-index
    prev = None
    type = 0
    for s in S:
        node: Node = Node(type, s)
        node.prev = prev
        if prev is not None:
            prev.next = node
        prev = node
        type = 1 - type
        # Add element to the heap and also maintain the updated positions of the elements for easy lookup
        addElementToHeap(heap, node_heap_position_map, node)
    num_steps = 0
    last_val = 0
    while len(heap) > 0:
        # Extract the top element and also update the positions in the lookup-map
        top_heap_val: Node = extractMinFromHeap(heap, node_heap_position_map)
        num_steps += top_heap_val.node_count - last_val
        last_val = top_heap_val.node_count
        # If it's a corner element, no merging is required
        if top_heap_val.prev is None or top_heap_val.next is None:
            continue
        # Merge the nodes adjacent to the extracted min-node:
        prev_node = top_heap_val.prev
        next_node = top_heap_val.next
        removeNodeFromHeap(prev_node, node_heap_position_map)
        removeNodeFromHeap(next_node, node_heap_position_map)
        del node_heap_position_map[prev_node]
        del node_heap_position_map[next_node]
        # Create the merged node for the neighbours and add it to the heap; update the lookup-map
        merged_node = Node(prev_node.node_type, prev_node.node_count + next_node.node_count)
        merged_node.prev = prev_node.prev
        merged_node.next = next_node.next
        # Bug fix: relink the surviving neighbours to the merged node, so later
        # extractions don't follow stale prev/next pointers
        if merged_node.prev is not None:
            merged_node.prev.next = merged_node
        if merged_node.next is not None:
            merged_node.next.prev = merged_node
        addElementToHeap(heap, node_heap_position_map, merged_node)
    return num_steps
PS: I haven't implemented the min-heap operations above, but the function names should make clear what they do.
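For completeness, here is one possible shape for those helpers — a sketch of my own, not the answer's original code. It assumes the heap list is passed in explicitly, so the two removeNodeFromHeap calls above would become removeNodeFromHeap(heap, node_heap_position_map, node). The position map is updated on every swap so index lookups stay O(1), and (matching the code above) map entries for removed nodes are deleted by the caller:

def _swap(heap, pos_map, i, j):
    # swap two heap slots and keep the position map in sync
    heap[i], heap[j] = heap[j], heap[i]
    pos_map[heap[i]] = i
    pos_map[heap[j]] = j

def _sift_up(heap, pos_map, i):
    while i > 0 and Node.compare(heap[i], heap[(i - 1) // 2]):
        _swap(heap, pos_map, i, (i - 1) // 2)
        i = (i - 1) // 2

def _sift_down(heap, pos_map, i):
    while True:
        smallest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap) and Node.compare(heap[child], heap[smallest]):
                smallest = child
        if smallest == i:
            return
        _swap(heap, pos_map, i, smallest)
        i = smallest

def addElementToHeap(heap, pos_map, node):
    heap.append(node)
    pos_map[node] = len(heap) - 1
    _sift_up(heap, pos_map, len(heap) - 1)

def extractMinFromHeap(heap, pos_map):
    top = heap[0]
    _swap(heap, pos_map, 0, len(heap) - 1)
    heap.pop()
    del pos_map[top]                  # extractMin cleans up its own map entry
    if heap:
        _sift_down(heap, pos_map, 0)
    return top

def removeNodeFromHeap(heap, pos_map, node):
    i = pos_map[node]
    _swap(heap, pos_map, i, len(heap) - 1)
    heap.pop()
    if i < len(heap):                 # restore heap order at the vacated slot
        _sift_down(heap, pos_map, i)
        _sift_up(heap, pos_map, i)
    # note: the caller deletes node from pos_map afterwards, as in the code above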
We can solve this in O(n) time and O(1) space.
This isn't about order at all. The actual task, when you think about it, is to divide the string into the least number of subsequences that consist of alternating characters (where a single character is allowed). Just maintain two queues or stacks, one for 1s and the other for 0s, where each character pops its immediate alternate predecessor. Keep a record of the longest either queue gets at any point during the iteration (not counting the replacement moves).
Examples:
(1)
0111001
queues
1 1 -
0 - 0
0 - 00
1 1 0
1 11 -
1 111 - <- max 3
0 11 0
For O(1) space, the queues can just be two numbers representing the current counts.
(2)
111000111
queues (count of 1s and count of 0s)
1 1 0
1 2 0
1 3 0 <- max 3
0 2 1
0 1 2
0 0 3 <- max 3
1 1 2
1 2 1
1 3 0 <- max 3
(3)
11011
queues
1 1 0
1 2 0
0 1 1
1 2 0
1 3 0 <- max 3
(4)
10101
queues
1 1 0 <- max 1
0 0 1 <- max 1
1 1 0 <- max 1
0 0 1 <- max 1
1 1 0 <- max 1
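Here is a small Python sketch of the counter version just described (my own illustration): two counters stand in for the queues, and we track the largest either one gets. On the four examples above it returns 3, 3, 3 and 1.

def min_steps(s):
    ones = zeros = best = 0
    for ch in s:
        if ch == '1':
            if zeros > 0:
                zeros -= 1          # a subsequence ending in 0 absorbs this 1
            ones += 1               # it now ends in 1
        else:
            if ones > 0:
                ones -= 1           # a subsequence ending in 1 absorbs this 0
            zeros += 1
        best = max(best, ones, zeros)
    return best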
I won't write the full code. But I have an idea of an approach that will probably be fast enough (certainly faster than building all of the intermediate strings).
Read the input and change it to a representation that consists of the lengths of sequences of the same character. So 11011 is represented with a structure that specifies it something like [{length: 2, value: 1}, {length: 1, value: 0}, {length: 2, value: 1}]. With some cleverness you can drop the values entirely and represent it as [2, 1, 2] - I'll leave that as an exercise for the reader.
With that representation you know that you can remove one value from each of the identified sequences of the same character in each "step". You can do this a number of times equal to the smallest length of any of those sequences.
So you identify the minimum sequence length, add that to a total number of operations that you're tracking, then subtract that from every sequence's length.
After doing that, you need to deal with sequences of length 0: remove them, and if that leaves any adjacent sequences of the same value, merge those (add the lengths together, remove one of them). This merging step is the one that requires some care if you're going for the representation that forgets the values.
Keep repeating this until there's nothing left. It should run somewhat faster than dealing with string manipulations.
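A short Python sketch of that outline (my own illustration; it keeps the values alongside the lengths rather than using the cleverer values-free representation). On the four examples above it returns 3, 3, 3 and 1:

def min_steps_rle(s):
    # run-length encode the string: [[char, length], ...]
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    steps = 0
    while runs:
        m = min(length for _, length in runs)   # remove one char per run, m times
        steps += m
        merged = []
        for ch, length in runs:
            length -= m
            if length == 0:
                continue                        # drop exhausted runs
            if merged and merged[-1][0] == ch:
                merged[-1][1] += length         # merge now-adjacent equal runs
            else:
                merged.append([ch, length])
        runs = merged
    return steps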
There's probably an even better approach that doesn't iterate through the steps at all after building this representation, just examining the lengths of sequences in one pass from start to end. I haven't worked out what that approach is exactly, but I'm reasonably confident it exists; after trying what I've outlined above, working it out is a good idea. I have a feeling it's something like this: start a total at 0 and keep track of the minimum and maximum the total reaches. Scan each value from the start of the string, adding 1 to the total for each 1 encountered and subtracting 1 for each 0. The answer is the greater of the absolute values of the minimum and maximum reached by the total. I haven't verified that; it's just a hunch. Comments have led to further speculation that adding together the maximum and the absolute value of the minimum may be more realistic.
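A quick script (mine) to test both readings of that hunch against the four worked examples, whose expected answers are 3, 3, 3 and 1:

def hunch(s):
    total = lo = hi = 0
    for ch in s:
        total += 1 if ch == '1' else -1
        lo = min(lo, total)
        hi = max(hi, total)
    return max(abs(lo), hi), hi + abs(lo)   # the two speculated formulas

for s, expected in [('0111001', 3), ('111000111', 3), ('11011', 3), ('10101', 1)]:
    print(s, expected, hunch(s))

On these strings the max(|min|, max) version gives 2 for 0111001, while the max + |min| version matches all four, which is consistent with the speculation in the last sentence above.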
Time complexity - O(n)

#include <iostream>
#include <string>
using namespace std;

void solve(string s) {
    int n = s.size();
    int zero = 0, one = 0, res = 0;   // open subsequences ending in 0 / ending in 1
    for (int i = 0; i < n; i++) {
        if (s[i] == '1') {
            if (zero > 0)
                zero--;               // extend a subsequence that ends in 0
            else
                res++;                // otherwise a new subsequence must be started
            one++;                    // it now ends in 1
        }
        else {
            if (one > 0)
                one--;                // extend a subsequence that ends in 1
            else
                res++;
            zero++;                   // it now ends in 0
        }
    }
    cout << res << endl;
}

Algorithms class question: for n numbers, show the comparisons that lead to each particular sorted sequence

Whenever you compare 3 numbers, there are 6 possible results; similarly, 4 numbers give 24 — the number of permutations of the inputs.
The task is to sort n numbers with comparisons, showing the chain of comparisons that leads to each particular output sequence.
For example, if your input is a, b, c:

if a < b:
    if b < c:
        abc
    else:
        if a < c:
            acb
        else:        # a > c
            cab
else:
    if a < c:
        bac
    else:            # a > c
        if b < c:
            bca
        else:        # b > c
            cba

The task is to print, for n numbers, all the comparisons that take place to lead to each sequence, and to confirm that there is no duplication.
Here is Python code that outputs valid Python code to assign the sorted values to answer.
The sorting algorithm here is mergesort, which is not going to give the smallest possible decision tree, but it will be pretty good.
#! /usr/bin/env python

import sys

class Tree:
    def __init__(self, node_type, value1=None, value2=None, value3=None):
        self.node_type = node_type
        self.value1 = value1
        self.value2 = value2
        self.value3 = value3

    def output(self, indent='', is_continue=False):
        rows = []
        if self.node_type == 'answer':
            rows.append("{}answer = [{}]".format(indent, ', '.join(self.value1)))
        elif self.node_type == 'comparison':
            if is_continue:
                rows.append('{}elif {} < {}:'.format(indent, self.value1[0], self.value1[1]))
            else:
                rows.append('{}if {} < {}:'.format(indent, self.value1[0], self.value1[1]))
            rows = rows + self.value2.output(indent + '    ')
            if self.value3.node_type == 'answer':
                rows.append('{}else:'.format(indent))
                rows = rows + self.value3.output(indent + '    ')
            else:
                rows = rows + self.value3.output(indent, True)
        return rows

# This call captures a state in the merging.
def _merge_tree(chains, first=None, second=None, output=None):
    if first is None and second is None and output is None:
        if len(chains) < 2:
            return Tree('answer', chains[0])
        else:
            return _merge_tree(chains[2:], chains[0], chains[1], [])
    elif first is None:
        return _merge_tree(chains + [output])
    elif len(first) == 0:
        return _merge_tree(chains, second, None, output)
    elif second is None:
        return _merge_tree(chains + [output + first])
    elif len(second) < len(first):
        return _merge_tree(chains, second, first, output)
    else:
        subtree1 = _merge_tree(chains, first[1:], second, output + [first[0]])
        subtree2 = _merge_tree(chains, first, second[1:], output + [second[0]])
        return Tree('comparison', [first[0], second[0]], subtree1, subtree2)

def merge_tree(variables):
    # Turn the list into a list of 1 element merges.
    return _merge_tree([[x] for x in variables])

# This captures the moment when you're about to compare the next
# variable with the already sorted variable at position 'position'.
def insertion_tree(variables, prev_sorted=None, current_variable=None, position=None):
    if prev_sorted is None:
        prev_sorted = []
    if current_variable is None:
        if len(variables) == 0:
            return Tree('answer', prev_sorted)
        else:
            return insertion_tree(variables[1:], prev_sorted, variables[0], len(prev_sorted))
    elif position < 1:
        return insertion_tree(variables, [current_variable] + prev_sorted)
    else:
        position = position - 1
        subtree1 = insertion_tree(variables, prev_sorted, current_variable, position)
        subtree2 = insertion_tree(variables, prev_sorted[0:position] + [current_variable] + prev_sorted[position:])
        return Tree('comparison', [current_variable, prev_sorted[position]], subtree1, subtree2)

args = ['a', 'b', 'c']
if 1 < len(sys.argv):
    args = sys.argv[1:]

for line in merge_tree(args).output():
    print(line)
For giggles and grins, you can get insertion sort by switching the final call to merge_tree to insertion_tree.
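That is, keeping everything else the same:

for line in insertion_tree(args).output():
    print(line)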
In principle you could repeat the exercise for any sort algorithm, but it gets really tricky, really fast. (For quicksort you have to do continuation passing. For heapsort and bubble sort you have to insert fancy logic to only consider parts of the decision tree that you could actually arrive at. It is a fun exercise if you want to engage in it.)

Why my Binary Search implementation in Scala is so slow?

Recently, I implemented this binary search, which is supposed to run in under 6 seconds for Scala, yet it runs for 12-13 seconds on the machine that checks the assignments.
Note before you read the code: the input consists of two lines. The first is the list of numbers to search in, and the second is the list of "search terms" to look up in that list. The expected output lists the index of each term in the list of numbers. Each line can contain at most 10^5 numbers, and each number is at most 10^9.
For example:
Input:
5 1 5 8 12 13   // the first number (5) indicates the length of the following sequence
5 8 1 23 1 11   // the first number (5) indicates the length of the following sequence
Output:
2 0 -1 0 -1 // index of each term in the input array
My solution:
import scala.annotation.tailrec

object BinarySearch extends App {
  val n_items = readLine().split(" ").map(BigInt(_))
  val n = n_items(0)
  val items = n_items.drop(1)
  val k :: terms = readLine().split(" ").map(BigInt(_)).toList

  println(search(terms, items).mkString(" "))

  def search(terms: List[BigInt], items: Array[BigInt]): Array[BigInt] = {
    @tailrec
    def go(terms: List[BigInt], results: Array[BigInt]): Array[BigInt] = terms match {
      case List() => results
      case head :: tail => go(tail, results :+ find(head))
    }

    def find(term: BigInt): BigInt = {
      @tailrec
      def go(left: BigInt, right: BigInt): BigInt = {
        if (left > right) { -1 }
        else {
          val middle = left + (right - left) / 2
          val middle_val = items(middle.toInt)
          middle_val match {
            case m if m == term => middle
            case m if m <= term => go(middle + 1, right)
            case m if m > term  => go(left, middle - 1)
          }
        }
      }
      go(0, n - 1)
    }

    go(terms, Array())
  }
}
What makes this code so slow? Thank you
I am worried about the complexity of
results :+ find(head)
Appending an item to an immutable Array of length L copies the whole array, which is O(L), so if you have n results to compute, the overall complexity will be O(n*n).
Try using a mutable ArrayBuffer instead of an Array to accumulate the results, or simply map the input terms through the find function.
In other words replace
go(terms, Array())
with
terms.map( x => find(x) ).toArray
By the way, the limits on the problem are small enough that using BigInt is overkill and probably making the code significantly slower. Normal ints should be large enough for this problem.

Python: break up dataframe (one row per entry in column, instead of multiple entries in column)

I have a solution to a problem, that to my despair is somewhat slow, and I am seeking advice on how to speed up my solution (by adding vectorization or other clever methods). I have a dataframe that looks like this:
toy = pd.DataFrame([[1,'cv','c,d,e'], [2,'search','a,b,c,d,e'], [3,'cv','d']],
                   columns=['id','ch','kw'])
Output is:
   id      ch         kw
0   1      cv      c,d,e
1   2  search  a,b,c,d,e
2   3      cv          d
The task is to break up the kw column into one (replicated) row per comma-separated entry in each string. Thus, what I wish to achieve is:
   id      ch kw
0   1      cv  c
1   1      cv  d
2   1      cv  e
3   2  search  a
4   2  search  b
5   2  search  c
6   2  search  d
7   2  search  e
8   3      cv  d
My initial solution is the following:
data = pd.DataFrame()
for x in toy.itertuples():
    id = x.id; ch = x.ch; keys = x.kw.split(",")
    data = data.append([[id, ch, x] for x in keys], ignore_index=True)
data.columns = ['id','ch','kw']
Problem is: it is slow for larger dataframes. My hope is that someone has encountered a similar problem before, and knows how to optimize my solution. I'm using python 3.4.x and pandas 0.19+ if that is of importance.
Thank you!
You can use str.split to split kw into lists, then str.len to get the length of each list.
Then create a new DataFrame with the constructor, using numpy.repeat and numpy.concatenate:

cols = toy.columns
splitted = toy['kw'].str.split(',')
l = splitted.str.len()
toy = pd.DataFrame({'id': np.repeat(toy['id'], l),
                    'ch': np.repeat(toy['ch'], l),
                    'kw': np.concatenate(splitted)})
toy = toy.reindex_axis(cols, axis=1)
print (toy)
   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
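One version caveat (my note, not part of the original answer): reindex_axis was deprecated and later removed from pandas, so on current versions the equivalent line would be:

toy = toy.reindex(columns=cols)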

scala version of swap algorithm for null models

The problem I am having is with trying to find an efficient way to find swappable elements in a matrix in order to implement a swap algorithm for null model creation.
The matrix consists of 0's and 1's and the idea is that elements can be switched between columns so that the row and column totals of the matrix remain the same.
For example, given the following matrix:
c1 c2 c3 c4
r1 0 1 0 0 = 1
r2 1 0 0 1 = 2
r3 0 0 0 0 = 0
r4 1 1 1 1 = 4
------------
2 2 1 2
columns c2 and c4 in r1 and r2 can each be swapped in such a way that totals are not altered i.e.:
c1 c2 c3 c4
r1 0 0 0 1 = 1
r2 1 1 0 0 = 2
r3 0 0 0 0 = 0
r4 1 1 1 1 = 4
------------
2 2 1 2
This all needs to be done randomly so as not to introduce any bias.
I have one solution that works. I randomly select a row and two columns. If they yield a 10 or 01 pattern then I randomly select another row and check the same columns to see if they yield the opposite pattern. If either of them fails, I start over and select new elements.
This method works, but I only "hit" the correct patterns about 10% of the time. In a large matrix, or in one with few 1s in the rows, I waste a lot of time "missing". I figured that there had to be a more intelligent way of choosing elements in the matrix while still doing it randomly.
The code for the working method is:
def isSwappable(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
  val indices = getRowAndColIndices(matrix)
  (matrix(indices._1._1)(indices._2._1), matrix(indices._1._1)(indices._2._2)) match {
    case (1, 0) => {
      if (matrix(indices._1._2)(indices._2._1) == 0 & matrix(indices._1._2)(indices._2._2) == 1) {
        indices
      }
      else {
        isSwappable(matrix)
      }
    }
    case (0, 1) => {
      if (matrix(indices._1._2)(indices._2._1) == 1 & matrix(indices._1._2)(indices._2._2) == 0) {
        indices
      }
      else {
        isSwappable(matrix)
      }
    }
    case _ => {
      isSwappable(matrix)
    }
  }
}

def getRowAndColIndices(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
  (getNextIndex(rnd.nextInt(matrix.size), matrix.size),
   getNextIndex(rnd.nextInt(matrix(0).size), matrix(0).size))
}

def getNextIndex(i: Int, constraint: Int): Tuple2[Int, Int] = {
  val newIndex = rnd.nextInt(constraint)
  newIndex match {
    case `i` => getNextIndex(i, constraint)
    case _ => (i, newIndex)
  }
}
I figured a more efficient way to handle this was to remove any rows that could not be used (all 1s or all 0s) and then choose an element randomly. From there I could filter out any columns in the row that had the same value and then choose from the remaining columns.
Once the first row and column are chosen I then filter out the rows that cannot provide the required pattern and then choose from the remaining rows.
This works for the most part, but what I can't figure out is what happens when there are no columns or rows to choose from. I don't want to loop infinitely trying to find the pattern I need, and I need a way of starting over if I do get an empty list of rows or columns to choose from.
The code that I have so far that sort of works (until I get an empty list) is:
def getInformativeRowIndices(matrix: Matrix) = (
  matrix
    .zipWithIndex
    .filter(_._1.distinct.size > 1)
    .map(_._2)
    .toList
)

def getRowsWithOppositeValueInColumn(col: Int, value: Int, matrix: Matrix) = (
  matrix
    .zipWithIndex
    .filter(_._1(col) != value)
    .map(_._2)
    .toList
)

def getColsWithOppositeValueInSameRow(row: Int, value: Int, matrix: Matrix) = (
  matrix(row)
    .zipWithIndex
    .filter(_._1 != value)
    .map(_._2)
    .toList
)

def process(matrix: Matrix): Tuple2[Tuple2[Int, Int], Tuple2[Int, Int]] = {
  val row1Indices = getInformativeRowIndices(matrix)
  if (row1Indices.isEmpty) sys.error("No informative rows")
  val row1 = row1Indices(rnd.nextInt(row1Indices.size))
  val col1 = rnd.nextInt(matrix(0).size)
  val colIndices = getColsWithOppositeValueInSameRow(row1, matrix(row1)(col1), matrix)
  if (colIndices.isEmpty) process(matrix)
  val col2 = colIndices(rnd.nextInt(colIndices.size))
  val row2Indices = getRowsWithOppositeValueInColumn(col1, matrix(row1)(col1), matrix)
    .intersect(getRowsWithOppositeValueInColumn(col2, matrix(row1)(col2), matrix))
  println(row2Indices)
  if (row2Indices.isEmpty) process(matrix)
  val row2 = row2Indices(rnd.nextInt(row2Indices.size))
  ((row1, row2), (col1, col2))
}
I think the recursive methods are wrong and don't really work here. Also, I am really just trying to improve the speed of cell selection so any ideas or suggestions would be greatly appreciated.
EDIT:
I have had a chance to play with this a little more and have come up with another solution, but it does not seem to be much faster than just randomly choosing cells in the matrix. Also, I should add that the matrix needs to be swapped about 30000 times in succession in order for it to be considered randomised, and I need to generate 5000 random matrices for each test, of which I have at least another 5000 to do, so performance is kind of important.
The current solution (besides random cell selection) is:
1. Randomly select 2 rows from the matrix
2. Subtract one row from the other and put the result in an Array
3. If the new Array contains both a 1 and a -1 then we can swap
The logic of the subtraction looks like this:
    0  1  0  0
 -  1  0  0  1
 --------------
   -1  1  0 -1
The method that does this looks like this:
def findSwaps(matrix: Matrix, iterations: Int): Boolean = {
  var result = false
  val mtxLength = matrix.length
  val row1 = rnd.nextInt(mtxLength)
  val row2 = getNextIndex(row1, mtxLength)
  val difference = subRows(matrix(row1), matrix(row2))
  if (difference.min == -1 & difference.max == 1) {
    val zeroOne = difference.zipWithIndex.filter(_._1 == -1).map(_._2)
    val oneZero = difference.zipWithIndex.filter(_._1 == 1).map(_._2)
    val col1 = zeroOne(rnd.nextInt(zeroOne.length))
    val col2 = oneZero(rnd.nextInt(oneZero.length))
    swap(matrix, row1, row2, col1, col2)
    result = true
  }
  result
}
The matrix row subtraction looks like this:
def subRows(a: Array[Int], b: Array[Int]): Array[Int] = (a, b).zipped.map(_ - _)
And the actual swap looks like this:
def swap(matrix: Matrix, row1: Int, row2: Int, col1: Int, col2: Int) = {
  val temp = (matrix(row1)(col1), matrix(row1)(col2))
  matrix(row1)(col1) = matrix(row2)(col1)
  matrix(row1)(col2) = matrix(row2)(col2)
  matrix(row2)(col1) = temp._1
  matrix(row2)(col2) = temp._2
  matrix
}
This works much better than before: between 80% and 90% of attempted swaps now succeed (it was only about 10% with random cell selection). However... it is still taking about 2.5 minutes to generate 1000 randomised matrices.
Any ideas on how to improve the speed?
I'm going to assume the matrices are big so that storage of the order of (matrix size squared) is not viable (for reasons of either speed or memory).
If you have a sparse matrix, you can enter the index of each 1 in each column in a set (here I show the compact way to do things, but you may wish to iterate with while loops for speed):
val mtx = Array(Array(0,1,0,0), Array(1,0,0,1), Array(0,0,0,0), Array(1,1,1,1))
val cols = mtx.transpose.map(x => x.zipWithIndex.filter(_._1 == 1).map(_._2).toSet)
Now for each column, a later column contains compatible pairs (at least one) if and only if the following two sets are both nonempty:
def xorish(a: Set[Int], b: Set[Int]) = (a--b, b--a)
So the answer will involve computing these sets and testing whether they're both nonempty.
Now the question is what you mean by "sample randomly". Randomly sampling single 1,0 pairs is not the same as randomly sampling possible swaps. To see this, consider the following:
1 0 1 0
1 0 1 0
1 0 1 0
0 1 1 0
0 1 1 0
0 1 0 1
The two columns on the left have nine possible swaps. The two on the right have only five possible swaps. But if you are looking for (1,0) patterns, you will sample only three times on the left vs. five on the right; if you are looking for either (1,0) or (0,1), you will sample six and six, which again distorts the probabilities. The only way to fix this is either to not be clever, and randomly sample a second time (which in the first case will work out with a usable swap 3/5 of the time, while only 1/5 in the second), or to basically compute every possible pair for swapping (or at least how many pairs there are) and select from that predefined set.
If we want to do the latter, we note that for each pair of nonidentical columns we can compute the two sets to swap among; we know their sizes, and their product is the total number of possibilities. In order to avoid instantiating all the possibilities, we can create
val poss = {
  for (i <- cols.indices; j <- (i+1) until cols.length) yield
    (i, j, (cols(i) -- cols(j)).toArray, (cols(j) -- cols(i)).toArray)
}.filter{ case (_, _, a, b) => a.length > 0 && b.length > 0 }
and then count how many there are:
val cuml = poss.map{ case (_,_,a,b) => a.size*b.size }.scanLeft(0)(_ + _).toArray
Now to pick a number at random, we pick a number between 0 and cuml.last and pick out which bucket this is and which item within the bucket:
def pickItem(cuml: Array[Int], poss: Seq[(Int, Int, Array[Int], Array[Int])]) = {
  val n = util.Random.nextInt(cuml.last)
  val k = {
    val i = java.util.Arrays.binarySearch(cuml, n)
    if (i < 0) -i - 2 else i
  }
  val j = n - cuml(k)
  val bucket = poss(k)
  (
    bucket._1, bucket._2,
    bucket._3(j % bucket._3.size), bucket._4(j / bucket._3.size)
  )
}
This ends up returning (c1,c2,r1,r2) selected randomly.
Now that you have the coordinates, you can create the new matrix however you wish. (Most efficient is probably to do an in-place swap of the entries, and then swap back when you want to try again.)
Note that this is only sensible for a large number of independent swaps from the same starting matrix. If you instead want to do this iteratively and maintain independence, you are probably best off doing this randomly after all unless the matrices are extremely sparse, at which point it's worth simply storing the matrices in some standard sparse matrix format (i.e. by index of nonzero entries) and doing your manipulation on those (probably with mutable sets and an update strategy, since the consequences of a single swap are confined to about n of the entries in an n*n matrix).
