Coalesce intersecting sets into disjoint sets

Coalesce intersecting sets into disjoint sets - algorithm

I am looking for an algorithm to coalesce a list of sets, that may intersect, into a list of sets with no intersection.
For instance:
my_sets = set(1, 2, 3), set(5, 6), set(4, 5, 6), set(4, 7), set(3, 8), set(9)
Should yield:
my_coalesced_sets = set(1, 2, 3, 8), set(4, 5, 6, 7), set(9)
Ideally an algorithm O(n)...
At the request of Ruben, here is one of the many algorithms I tried that does not yield correct results:
fun main(){
val l = mutableListOf(setOf(1, 2, 3), setOf(5, 6), setOf(4, 5, 6), setOf(4, 7), setOf(3, 8), setOf(9))
while (true){
val removes = mutableListOf<Set<Int>>()
var current = l.removeFirst()
l.filter { current.intersect(it).isNotEmpty() }.forEach {
current = current union it
removes += it
}
l += current
if (removes.isEmpty()){
break
}
l.removeAll(removes)
}
print(l)
}

This is the purpose of the disjoint-set data structure. Simply add an edge ("Union") for each consecutive pair in each input set, and the result will be the coalesced sets. The running time is essentially linear (very slightly super-linear but that's only a theoretical difference).

Related

Data Structure to convert a stream of integers into a list of range

A function is given with a method to get the next integer from a stream of integers. The numbers are fetched sequentially from the stream. How will we go about producing a summary of integers encountered till now?
Given a list of numbers, the summary will consist of the ranges of numbers. Example: The list till now = [1,5,4,2,7] then summary = [[1-2],[4-5],7]
Put the number in ranges if they are continuous.
My Thoughts:
Approach 1:
Maintain the sorted numbers. So when we fetch a new number from a stream, we can use binary search to find the location of the number in the list and insert the element so that the resulting list is sorted. But since this is a list, I think inserting the element will be an O(N) operation.
Approach 2:
Use Balanced binary search trees like Red, Black, or AVL. Each insertion will be O(log N)
and in order will yield the sorted array from which one can compute the range in O(N)
Approach 2 looks like a better approach if I am not making any mistakes. I am unsure if there is a better way to solve this issue.

I'd not keep the original numbers, but aggregate them to ranges on the fly. This has the potential to reduce the number of elements by quite some factor (depending on the ordering and distribution of the incoming values). The task itself seems to imply that you expect contiguous ranges of integers to appear quite frequently in the input.
Then a newly incoming number can fall into one of a few cases:
It is already contained in some range: then simply ignore the number (this is only relevant if duplicate inputs can happen).
It is adjacent to none of the ranges so far: create a new single-element range.
It is adjacent to exactly one range: extend that range by 1, downward or upward.
It is adjacent to two ranges (i.e. fills the gap): merge the two ranges.
For the data structure holding the ranges, you want a good performance for the following operations:
Find the place (position) for a given number.
Insert a new element (range) at a given place.
Merge two (neighbor) elements. This can be broken down into:
Remove an element at a given place.
Modify an element at a given place.
Depending on the expected number und sparsity of ranges, a sorted list of ranges might do. Otherwise, some kind of search tree might turn out helpful.
Anyway, start with the most readable approach, measure performance for typical cases, and decide whether some optimization is necessary.

I suggest maintaining a hashmap that maps each integer seen so far to the interval it belongs to.
Make sure that two numbers that are part of the same interval will point to the same interval object, not to copies; so that if you update an interval to extend it, all numbers can see it.
All operations are O(1), except the operation "merge two intervals" that happens if the stream produces integer x when we have two intervals [a, x - 1] and [x + 1, b]. The merge operation is proportional to the length of the shortest of these two intervals.
As a result, for a stream of n integers, the algorithm's complexity is O(n) in the best-case (where at most a few big merges happen) and O(n log n) in the worst-case (when we keep merging lots of intervals).
In python:
def add_element(intervals, x):
if x in intervals: # do not do anything
pass
elif x + 1 in intervals and x - 1 in intervals: # merge two intervals
i = intervals[x - 1]
j = intervals[x + 1]
if i[1]-i[0] > j[1]-j[0]: # j is shorter: update i, and make everything in j point to i
i[1] = j[1]
for y in range(j[0] - 1, j[1]+1):
intervals[y] = i
else: # i is shorter: update j, and make everything in i point to j
j[0] = i[0]
for y in range(i[0], i[1] + 2):
intervals[y] = j
elif x + 1 in intervals: # extend one interval to the left
i = intervals[x + 1]
i[0] = x
intervals[x] = i
elif x - 1 in intervals: # extend one interval to the right
i = intervals[x - 1]
i[1] = x
intervals[x] = i
else: # add a singleton
intervals[x] = [x,x]
return intervals
from random import shuffle
def main():
stream = list(range(10)) * 2
shuffle(stream)
print(stream)
intervals = {}
for x in stream:
intervals = add_element(intervals, x)
print(x)
print(set(map(tuple, intervals.values()))) # this line terribly inefficient because I'm lazy
if __name__=='__main__':
main()
Output:
[1, 5, 8, 3, 9, 6, 7, 9, 3, 0, 6, 5, 8, 1, 4, 7, 2, 2, 0, 4]
1
{(1, 1)}
5
{(1, 1), (5, 5)}
8
{(8, 8), (1, 1), (5, 5)}
3
{(8, 8), (1, 1), (5, 5), (3, 3)}
9
{(8, 9), (1, 1), (5, 5), (3, 3)}
6
{(3, 3), (1, 1), (8, 9), (5, 6)}
7
{(5, 9), (1, 1), (3, 3)}
9
{(5, 9), (1, 1), (3, 3)}
3
{(5, 9), (1, 1), (3, 3)}
0
{(0, 1), (5, 9), (3, 3)}
6
{(0, 1), (5, 9), (3, 3)}
5
{(0, 1), (5, 9), (3, 3)}
8
{(0, 1), (5, 9), (3, 3)}
1
{(0, 1), (5, 9), (3, 3)}
4
{(0, 1), (3, 9)}
7
{(0, 1), (3, 9)}
2
{(0, 9)}
2
{(0, 9)}
0
{(0, 9)}
4
{(0, 9)}

You could use a Disjoint Set Forest implementation for this. If well-implemented, it gives a near linear time complexity for inserting 𝑛 elements into it. The amortized running time of each insert operation is Θ(α(𝑛)) where α(𝑛) is the inverse Ackermann function. For all practical purposes we can not distinguish this from O(1).
The extraction of the ranges can have a time complexity of O(𝑘), where 𝑘 is the number of ranges, provided that the disjoint set maintains the set of root nodes. If the ranges need to be sorted, then this extraction will have a time complexity of O(𝑘log𝑘), as it will then just perform the sort-operation on it.
Here is an implementation in Python:
class Node:
def __init__(self, value):
self.low = value
self.parent = self
self.size = 1
def find(self): # Union-Find: Path splitting
node = self
while node.parent is not node:
node, node.parent = node.parent, node.parent.parent
return node
class Ranges:
def __init__(self):
self.nums = dict()
self.roots = set()
def union(self, a, b): # Union-Find: Size-based merge
a = a.find()
b = b.find()
if a is not b:
if a.size > b.size:
a, b = b, a
self.roots.remove(a) # Keep track of roots
a.parent = b
b.low = min(a.low, b.low)
b.size = a.size + b.size
def add(self, n):
if n not in self.nums:
self.nums[n] = node = Node(n)
self.roots.add(node)
if (n+1) in self.nums:
self.union(node, self.nums[n+1])
if (n-1) in self.nums:
self.union(node, self.nums[n-1])
def get(self):
return sorted((node.low, node.low + node.size - 1) for node in self.roots)
# example run
ranges = Ranges()
for n in 4, 7, 1, 6, 2, 9, 5:
ranges.add(n)
print(ranges.get()) # [(1, 2), (4, 7), (9, 9)]

Cartesian product but remove duplicates up to cyclic permutations

Given two integers n and r, I want to generate all possible combinations with the following rules:
There are n distinct numbers to choose from, 1, 2, ..., n;
Each combination should have r elements;
A combination may contain more than one of an element, for instance (1,2,2) is valid;
Order matters, i.e. (1,2,3) and (1,3,2) are considered distinct;
However, two combinations are considered equivalent if one is a cyclic permutation of the other; for instance, (1,2,3) and (2,3,1) are considered duplicates.
Examples:
n=3, r=2
11 distinct combinations
(1,1,1), (1,1,2), (1,1,3), (1,2,2), (1,2,3), (1,3,2), (1,3,3), (2,2,2), (2,2,3), (2,3,3) and (3,3,3)
n=2, r=4
6 distinct combinations
(1,1,1,1), (1,1,1,2), (1,1,2,2), (1,2,1,2), (1,2,2,2), (2,2,2,2)
What is the algorithm for it? And how to implement it in c++?
Thank you in advance for advice.

Here is a naive solution in python:
Generate all combinations from the Cartesian product of {1, 2, ...,n} with itself r times;
Only keep one representative combination for each equivalency class; drop all other combinations that are equivalent to this representative combination.
This means we must have some way to compare combinations, and for instance, only keep the smallest combination of every equivalency class.
from itertools import product
def is_representative(comb):
return all(comb[i:] + comb[:i] >= comb
for i in range(1, len(comb)))
def cartesian_product_up_to_cyclic_permutations(n, r):
return filter(is_representative,
product(range(n), repeat=r))
print(list(cartesian_product_up_to_cyclic_permutations(3, 3)))
# [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 1), (0, 1, 2), (0, 2, 1), (0, 2, 2), (1, 1, 1), (1, 1, 2), (1, 2, 2), (2, 2, 2)]
print(list(cartesian_product_up_to_cyclic_permutations(2, 4)))
# [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 1), (1, 1, 1, 1)]
You mentioned that you wanted to implement the algorithm in C++. The product function in the python code behaves just like a big for-loop that generates all the combinations in the Cartesian product. See this related question to implement Cartesian product in C++: Is it possible to execute n number of nested "loops(any)" where n is given?.

Merge two sorted lists of intervals

Given A and B, which are two interval lists. A has no overlap inside A and B has no overlap inside B. In A, the intervals are sorted by their starting points. In B, the intervals are sorted by their starting points. How do you merge the two interval lists and output the result with no overlap?
One method is to concatenate the two lists, sort by the starting point, and apply merge intervals as discussed at https://www.geeksforgeeks.org/merging-intervals/. Is there a more efficient method?
Here is an example:
A: [1,5], [10,14], [16,18]
B: [2,6], [8,10], [11,20]
The output:
[1,6], [8, 20]

So you have two sorted lists with events - entering interval and leaving interval.
Merge these lists keeping current state as integer 0, 1, 2 (active interval count)
Get the next coordinate from both lists
If it is entering event
Increment state
If state becomes 1, start new output interval
If it is closing event
Decrement state
If state becomes 0, close current output interval
Note that this algo is similar to intersection finding there

Here is a different approach, in the spirit of the answer to the question of overlaps.
<!--code lang=scala-->
def findUnite (l1: List[Interval], l2: List[Interval]): List[Interval] = (l1, l2) match {
case (Nil, Nil) => Nil
case (as, Nil) => as
case (Nil, bs) => bs
case (a :: as, b :: bs) => {
if (a.lower > b.upper) b :: findUnite (l1, bs)
else if (a.upper < b.lower) a :: findUnite (as, l2)
else if (a.upper > b.upper) findUnite (a.union (b).get :: as, bs)
else findUnite (as, a.union (b).get :: bs)
}
}
If both lists are empty - return the empty list.
If only one is empty, return the other.
If the upper bound of one list is below the lower bound of the other, there is no unification possible, so return the other and proceed with the rest.
If they overlap, don't return, but call the method recursively, the unification on the side of the more far reaching interval and without the consumed less far reaching interval.
The union method looks similar to the one which does the overlap:
<!--code scala-->
case class Interval (lower: Int, upper: Int) {
// from former question, to compare
def overlap (other: Interval) : Option [Interval] = {
if (lower > other.upper || upper < other.lower) None else
Some (Interval (Math.max (lower, other.lower), Math.min (upper, other.upper)))
}
def union (other: Interval) : Option [Interval] = {
if (lower > other.upper || upper < other.lower) None else
Some (Interval (Math.min (lower, other.lower), Math.max (upper, other.upper)))
}
}
The test for non overlap is the same. But min and max have changed places.
So for (2, 4) (3, 5) the overlap is (3, 4), the union is (2, 5).
lower upper
_____________
2 4
3 5
_____________
min 2 4
max 3 5
Table of min/max lower/upper.
<!--code lang='scala'-->
val e = List (Interval (0, 4), Interval (7, 12))
val f = List (Interval (1, 3), Interval (6, 8), Interval (9, 11))
findUnite (e, f)
// res3: List[Interval] = List(Interval(0,4), Interval(6,12))
Now for the tricky or unclear case from above:
val e = List (Interval (0, 4), Interval (7, 12))
val f = List (Interval (1, 3), Interval (5, 8), Interval (9, 11))
findUnite (e, f)
// res6: List[Interval] = List(Interval(0,4), Interval(5,12))
0-4 and 5-8 don't overlap, so they form two different results which don't get merged.

A simple solution could be, to deflate all elements, put them into a set, sort it, then iterate to transform adjectant elements to Intervals.
A similar approach could be chosen for your other question, just eliminating all distinct values to get the overlaps.
But - there is a problem with that approach.
Lets define a class Interval:
case class Interval (lower: Int, upper: Int) {
def deflate () : List [Int] = {(lower to upper).toList}
}
and use it:
val e = List (Interval (0, 4), Interval (7, 12))
val f = List (Interval (1, 3), Interval (6, 8), Interval (9, 11))
deflating:
e.map (_.deflate)
// res26: List[List[Int]] = List(List(0, 1, 2, 3, 4), List(7, 8, 9, 10, 11, 12))
f.map (_.deflate)
// res27: List[List[Int]] = List(List(1, 2, 3), List(6, 7, 8), List(9, 10, 11))
The ::: combines two Lists, here two Lists of Lists, which is why we have to flatten the result, to make one big List:
(res26 ::: res27).flatten
// res28: List[Int] = List(0, 1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 1, 2, 3, 6, 7, 8, 9, 10, 11)
With distinct, we remove duplicates:
(res26 ::: res27).flatten.distinct
// res29: List[Int] = List(0, 1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 6)
And then we sort it:
(res26 ::: res27).flatten.distinct.sorted
// res30: List[Int] = List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)
All in one command chain:
val united = ((e.map (_.deflate) ::: f.map (_.deflate)).flatten.distinct).sorted
// united: List[Int] = List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)
// ^ (Gap)
Now we have to find the gaps like the one between 4 and 6 and return two distinct Lists.
We go recursively through the input list l, and if the element is from the sofar collected elements 1 bigger than the last, we collect that element into this sofar-list. Else we return the sofar collected list as partial result, followed by splitting of the rest with a List of just the current element as new sofar-collection. In the beginning, sofar is empty, so we can start right with adding the first element into that list and splitting the tail with that.
def split (l: List [Int], sofar: List[Int]): List[List[Int]] = l match {
case Nil => List (sofar)
case h :: t => if (sofar.isEmpty) split (t, List (h)) else
if (h == sofar.head + 1) split (t, h :: sofar)
else sofar :: split (t, List (h))
}
// Nil is the empty list, we hand in for initialization
split (united, Nil)
// List(List(4, 3, 2, 1, 0), List(12, 11, 10, 9, 8, 7, 6))
Converting the Lists into intervals would be a trivial task - take the first and last element, and voila!
But there is a problem with that approach. Maybe you recognized, that I redefined your A: and B: (from the former question). In B, I redefined the second element from 5-8 to 6-8. Because else, it would merge with the 0-4 from A because 4 and 5 are direct neighbors, so why not combine them to a big interval?
But maybe it is supposed to work this way? For the above data:
split (united, Nil)
// List(List(6, 5, 4, 3, 2, 1), List(20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8))

Dynamic programming - board with multiplier

I got quite standard DP problem - board nxn with integers, all positive. I want to start somewhere in the first row, end somewhere in the last row and accumulate as much sum as possible. From field (i,j) I can go to fields (i+1, j-1), (i+1, j), (i+1, j+1).
That's quite standard DP problem. But we add one thing - there can be an asterisk on the field, instead of the number. If we meet the asterisk, then we got 0 points from it, but we increase multiplier by 1. All numbers we collect later during our traversal are multiplied by multiplier.
I can't find out how to solve this problem with that multiplier thing. I assume that's still a DP problem - but how to get the equations right for it?
Thanks for any help.

You can still use DP, but you have to keep track of two values: The "base" value, i.e. without any multipliers applied to it, and the "effective" value, i.e. with multipliers. You work your way backwards through the grid, starting in the previous-to-last row, get the three "adjacent" cells in the row after that (the possible "next" cells on the path), and just pick the one with the highest value.
If the current cell is a *, you get the cell where base + effective is maximal, otherwise you just get the one where the effective score is highest.
Here's an example implementation in Python. Note that instead of * I'm just using 0 for multipliers, and I'm looping the grid in order instead of in reverse, just because it's more convenient.
import random
size = 5
grid = [[random.randint(0, 5) for _ in range(size)] for _ in range(size)]
print(*grid, sep="\n")
# first value is base score, second is effective score (with multiplier)
solution = [[(x, x) for x in row] for row in grid]
for i in range(1, size):
for k in range(size):
# the 2 or 3 values in the previous line
prev_values = solution[i-1][max(0, k-1):k+2]
val = grid[i][k]
if val == 0:
# multiply
base, mult = max(prev_values, key=lambda t: t[0] + t[1])
solution[i][k] = (base, base + mult)
else:
# add
base, mult = max(prev_values, key=lambda t: t[1])
solution[i][k] = (val + base, val + mult)
print(*solution, sep="\n")
print(max(solution[-1], key=lambda t: t[1]))
Example: The random 5x5 grid, with 0 corresponding to *:
[4, 4, 1, 2, 1]
[2, 0, 3, 2, 0]
[5, 1, 3, 4, 5]
[0, 0, 2, 4, 1]
[1, 0, 5, 2, 0]
The final solution grid with base values and effective values:
[( 4, 4), ( 4, 4), ( 1, 1), ( 2, 2), ( 1, 1)]
[( 6, 6), ( 4, 8), ( 7, 7), ( 4, 4), ( 2, 4)]
[( 9, 13), ( 5, 9), ( 7, 11), (11, 11), ( 9, 9)]
[( 9, 22), ( 9, 22), ( 9, 13), (11, 15), (12, 12)]
[(10, 23), ( 9, 31), (14, 27), (13, 17), (11, 26)]
Thus, the best solution for this grid is 31 from (9, 31). Working backwards through the grid solution grid, this corresponds to the path 0-0-5-0-4, i.e. 3*5 + 4*4 = 31, as there are 2 * before the 5, and 3 * before the 4.

Counting number of days, given a collection of day ranges?

Say I have the following ranges, in some list:
{ (1, 4), (6, 8), (2, 5), (1, 3) }
(1, 4) represents days 1, 2, 3, 4. (6, 8) represents days 6, 7, 8, and so on.
The goal is to find the total number of days that are listed in the collection of ranges -- for instance, in the above example, the answer would be 8, because days 1, 2, 3, 4, 6, 7, 8, and 5 are contained within the ranges.
This problem can be solved trivially by iterating through the days in each range and putting them in a HashSet, then returning the size of the HashSet. But is there any way to do it in O(n) time with respect to the number of range pairs? How about in O(n) time and with constant space? Thanks.

Sort the ranges in ascending order by their lower limits. You can probably do this in linear time since you're dealing with integers.
The rest is easy. Loop through the ranges once keeping track of numDays (initialized to zero) and largestDay (initialized to -INF). On reaching each interval (a, b):
if b > largestDay then
numDays <- numDays + b-max(a - 1, largestDay)
largestDay <- max(largestDay, b)
else nothing.
So, after sorting we have (1,4), (1,3), (2,5), (6,8)
(1,4): numDays <- 0 + (4 - max(1 - 1, -INF)) = 4, largestDay <- max(-INF, 4) = 4
(1,3): b < largestDay, so no change.
(2,5): numDays <- 4 + (5 - max(2 - 1, 4)) = 5, largestDay <- 5
(6,8): numDays <- 5 + (8 - max(6-1, 5)) = 8, largestDay <- 8

The complexity of the following algorithm is O(n log n) where n is the number of ranges.
Sort the ranges (a, b) lexicographically by increasing a then by decreasing b.
Before: { (1, 4), (6, 8), (2, 5), (1, 3) }
After: { (1, 4), (1, 3), (2, 5), (6, 8) }
Collapse the sorted sequence of ranges into a potentially-shorter sequence of ranges, repeatedly merging consecutive (a, b) and (c, d) into (a, max(b, d)) if b >= c.
Before: { (1, 4), (1, 3), (2, 5), (6, 8) }
{ (1, 4), (2, 5), (6, 8) }
After: { (1, 5), (6, 8) }
Map the sequence of ranges to their sizes.
Before: { (1, 5), (6, 8) }
After: { 5, 3 }
Sum the sizes to arrive at the total number of days.
8

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Coalesce intersecting sets into disjoint sets - algorithm

This is the purpose of the disjoint-set data structure. Simply add an edge ("Union") for each consecutive pair in each input set, and the result will be the coalesced sets. The running time is essentially linear (very slightly super-linear but that's only a theoretical difference).

Related

Data Structure to convert a stream of integers into a list of range

Cartesian product but remove duplicates up to cyclic permutations

Merge two sorted lists of intervals

Dynamic programming - board with multiplier

Counting number of days, given a collection of day ranges?

Categories

Resources