Algorithm for merging sets that share at least 2 elements - algorithm

Given a list of sets:
S_1 : [ 1, 2, 3, 4 ]
S_2 : [ 3, 4, 5, 6, 7 ]
S_3 : [ 8, 9, 10, 11 ]
S_4 : [ 1, 8, 12, 13 ]
S_5 : [ 6, 7, 14, 15, 16, 17 ]
What is the most efficient way to merge all sets that share at least 2 elements? I suppose this is similar to a connected components problem. So the result would be:
[ 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17] (S_1 UNION S_2 UNION S_5)
[ 8, 9, 10, 11 ]
[ 1, 8, 12, 13 ] (S_4 shares 1 with S_1, and 8 with S_3, but not merged because they only share one element in each)
The naive implementation is O(N^2), where N is the number of sets, which is unworkable for us. This would need to be efficient for millions of sets.

Let there be a list of many Sets, named (S).
Perform a pass through all elements of S to determine the range (LOW .. HIGH).
Create a two-dimensional array of pointers to Set, of dimensions (LOW .. HIGH) x (LOW .. HIGH), named (M).
do
    Init all elements of M to NULL.
    Iterate through S, processing one Set at a time, named (Si).
        Enumerate all ordered pairs (P1, P2) in Si, where P1 < P2.
        For each pair, examine M(P1, P2).
            if M(P1, P2) is NULL
                Continue with the next pair.
            otherwise
                Merge Si into the Set pointed to by M(P1, P2).
                Remove Si from S, as it has been merged.
                Move on to processing Set S(i + 1).
        If Si was not merged,
            Enumerate the pairs of Si again.
            For each pair, make M(P1, P2) point to Si.
while at least one set was merged during the pass.
My head is saying this is about Order (2N ln N).
Take that with a grain of salt.
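The pair-index idea above can be sketched in Python roughly as follows. This is only a sketch under some assumptions: a dictionary stands in for the two-dimensional array M, the function name merge_sets_sharing_two is made up for illustration, and the cost of enumerating pairs grows quadratically with the size of each individual set, so it only pays off when sets are small compared to the number of sets.

```python
from itertools import combinations

def merge_sets_sharing_two(sets):
    """Repeatedly merge sets that share at least two elements (hypothetical helper)."""
    groups = [set(s) for s in sets]
    merged_any = True
    while merged_any:
        merged_any = False
        pair_owner = {}  # (p1, p2) with p1 < p2 -> group that registered the pair
        kept = []
        for group in groups:
            target = None
            for pair in combinations(sorted(group), 2):
                if pair in pair_owner:
                    target = pair_owner[pair]
                    break
            if target is None:
                # No shared pair found: register every pair of this group.
                for pair in combinations(sorted(group), 2):
                    pair_owner[pair] = group
                kept.append(group)
            else:
                # Shared pair found: merge in place; new pairs are picked up next pass.
                target |= group
                merged_any = True
        groups = kept
    return groups
```

On the five sets from the question this should give three groups: the union of S_1, S_2 and S_5, plus S_3 and S_4 unchanged. Each pass that merges something shrinks the list, so the loop terminates.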

If you can order the elements in the set, you can look into using mergesort on the sets. The only modification needed is to check for duplicates during the merge phase: if one is found, just discard it. Since mergesort is O(n*log(n)), this will offer improved speed compared to the naive O(n^2) algorithm.
However, to really be effective, you should maintain a sorted set and keep it sorted, so that you can skip the sort phase and go straight to the merge phase.

I don't see how this can be done in less than O(n^2).
Every set needs to be compared to every other one to see if they contain 2 or more shared elements. That's n*(n-1)/2 comparisons, therefore O(n^2), even if the check for shared elements takes constant time.
In sorting, the naive implementation is O(n^2), but you can take advantage of the transitive nature of ordered comparison (so, for example, you know nothing in the lower partition of quicksort needs to be compared to anything in the upper partition, as it's already been compared to the pivot). This is what results in sorting being O(n * log n).
This doesn't apply here. So unless there's something special about the sets that allows us to skip comparisons based on the results of previous comparisons, it's going to be O(n^2) in general.
Paul.

One side note: It depends on how often this occurs. If most pairs of sets do share at least two elements, it might be most efficient to build the new set at the same time as you are stepping through the comparison, and throw it away if they don't match the condition. If most pairs do not share at least two elements, then deferring the building of the new set until confirmation of the condition might be more efficient.

If your elements are numerical in nature, or can be naturally ordered (i.e. you can assign a value such as 1, 2, 42, etc.), I would suggest using a radix sort on the merged sets, and making a second pass to pick up the unique elements.
This algorithm should be of O(n), and you can optimize the radix sort quite a bit using bitwise shift operators and bit masks. I have done something similar for a project I was working on, and it works like a charm.

Related

Sorting a list that has n fixed segments already sorted in ascending order

The question goes as follows:
Lists which consist of a small fixed number, n, of segments connected
end-to-end, each segment is already in ascending order.
I thought about using mergesort with the base case being that once there are n pieces, stop and merge them, since we already know they are sorted; but with 3 segments it won't work, since I'm dividing by two and you can't divide 3 segments equally into two parts.
The other approach is similar to merge sort: I use one stack per segment (n stacks in total); segment boundaries can be identified wherever L[i] > L[i+1], since each segment is in ascending order. But I need n comparisons to figure out which element comes next, and I don't know an efficient way of comparing n elements dynamically without using another data structure to compare the elements at the tops of the stacks.
Also, you are supposed to use the feature of the problem (the segments are already ordered) to get better results than conventional algorithms, i.e. complexity less than O(n log n).
Pseudocode would be nice if you have an idea.
Edit:
An example would be [(14,20,22),(7,8,9),(1,2,3)]: here we have 3 segments of 3 elements; even though the segments are sorted, the whole list isn't.
P.S. The () are there only to point out the segments.
I think maybe you've misunderstood mergesort. While usually you would split in half and sort each half before merging, it's really the merging part which makes the algorithm. You just need to merge on runs instead.
With your example of [(14,20,22),(7,8,9),(1,2,3)]
After first merge you have [(7, 8, 9, 14, 20, 22),(1, 2, 3)]
After second merge you have [(1, 2, 3, 7, 8, 9, 14, 20, 22)]
l = [14, 20, 22, 7, 8, 9, 1, 2, 3]
rl = [] # run list
sl = [l[0]] # temporary sublist
# split list into list of sorted sublists
for item in l[1:]:
    if item > sl[-1]:
        sl.append(item)
    else:
        rl.append(sl)
        sl = [item]
rl.append(sl)
print(rl)
# function for merging two sorted lists
def merge(l1, l2):
    l = [] # list we add into
    while True:
        if not l1:
            # first list is empty, add second list onto new list
            return l + l2
        if not l2:
            # second list is empty, add first list onto new list
            return l + l1
        if l1[0] < l2[0]:
            # rather than deleting, you could increment an index
            # which is likely to be faster, or reverse the list
            # and pop off the end, or use a data structure which
            # allows you to pop off the front
            l.append(l1[0])
            del l1[0]
        else:
            l.append(l2[0])
            del l2[0]
# keep merging sublists until only one remains
while len(rl) > 1:
    rl.append(merge(rl.pop(), rl.pop()))
print(rl)
It's worth noting that unless this is simply an exercise, you are probably better off using whatever built-in sorting function your language of choice provides.
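On the question's worry about needing n comparisons to pick the next element from n segments: a heap over the segment heads brings that down to about log n per element. A minimal sketch using Python's standard heapq.merge is below; the helper names split_runs and sort_segmented are made up for illustration, and runs are detected the same way as in the answer above.

```python
import heapq

def split_runs(values):
    """Split a list into its maximal ascending runs (hypothetical helper)."""
    runs = []
    for v in values:
        if runs and v >= runs[-1][-1]:
            runs[-1].append(v)   # still ascending: extend the current run
        else:
            runs.append([v])     # descent found: start a new run
    return runs

def sort_segmented(values):
    """Merge all runs at once; heapq.merge keeps a heap of the run heads."""
    return list(heapq.merge(*split_runs(values)))

print(sort_segmented([14, 20, 22, 7, 8, 9, 1, 2, 3]))
```

With N elements in n segments this does roughly N log n comparisons, which beats a general sort when n is a small fixed number.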

Get kth group of unsorted result list with arbitrary number of results per group

Okay so I have a huge array of unsorted elements of an unknown data type (all elements are of the same type, obviously; I just can't make assumptions, as they could be numbers, strings, or any type of object that overloads the < and > operators). The only assumption I can make about those objects is that no two of them are the same, and comparing them (A < B) should give me which one should show up first if it was sorted. The "smallest" should be first.
I receive this unsorted array (type std::vector, but honestly it's more of an algorithm question so no language in particular is expected), a number of objects per "group" (groupSize), and the group number that the sender wants (groupNumber).
I'm supposed to return an array containing groupSize elements, or less if the group requested is the last one. (Examples: 17 results with groupSize of 5 would only return two of them if you ask for the fourth group. Also, the fourth group is group number 3 because it's a zero-indexed array)
Example:
Received Array: {1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20}
Received groupSize: 3
Received groupNumber: 2
If the array was sorted, it would be: {-14, -1, 1, 2, 5, 6, 6.5, 8, 19, 20}
If it was split in groups of size 3: {{-14, -1, 1}, {2, 5, 6}, {6.5, 8, 19}, {20}}
I have to return the third group (groupNumber 2 in a 0-indexed array): {6.5, 8, 19}
The biggest problem is the fact that it needs to be lightning fast. I can't sort the array because it has to be faster than O(n log n).
I've tried several methods, but can never get under O(n log n).
I'm aware that I should be looking for a solution that doesn't fill up all the other groups, and skips a pretty big part of the steps shown in the example above, to create only the requested group before returning it, but I can't figure out a way to do that.
You can find the value of the smallest element s in the group in linear time (on average) using the standard C++ std::nth_element function (because you know its index in the sorted array). You can find the largest element S in the group in the same way. After that, you need a linear pass to find all elements x such that s <= x <= S and return them. The total time complexity is O(n).
Note: this answer is not C++ specific. You just need an implementation of the k-th order statistics in linear time.
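To illustrate the language-independent version, here is a rough Python sketch using a randomized quickselect as the k-th order statistic. It runs in expected (not worst-case) linear time; the names quickselect and get_group are made up for illustration, and it relies on the question's assumption that all elements are distinct.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed), expected O(n) time."""
    items = list(items)
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        upper = [x for x in items if x > pivot]
        equal = len(items) - len(lower) - len(upper)
        if k < len(lower):
            items = lower                # answer lies among the smaller items
        elif k < len(lower) + equal:
            return pivot                 # the pivot itself is the k-th item
        else:
            k -= len(lower) + equal      # skip everything up to the pivot
            items = upper

def get_group(values, group_size, group_number):
    """Return the requested group without sorting the whole array."""
    start = group_size * group_number
    if start >= len(values):
        return []
    end = min(start + group_size, len(values)) - 1
    low = quickselect(values, start)     # smallest element of the group
    high = quickselect(values, end)      # largest element of the group
    # Only the (small) group itself is sorted at the end.
    return sorted(x for x in values if low <= x <= high)

print(get_group([1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20], 3, 2))
```

The final sort only touches groupSize elements, so the overall cost stays dominated by the linear passes.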

Compare rotated lists, containing duplicates [duplicate]

I'm looking for an efficient way to compare lists of numbers to see if they match at any rotation (comparing 2 circular lists).
When the lists don't have duplicates, picking smallest/largest value and rotating both lists before comparisons works.
But when there may be many duplicate large values, this isn't so simple.
For example, lists [9, 2, 0, 0, 9] and [0, 0, 9, 9, 2] match, whereas [9, 0, 2, 0, 9] does not match them (since the order is different).
Here's an example of an inefficient function which works.
def min_list_rotation(ls):
    return min((ls[i:] + ls[:i] for i in range(len(ls))))
# example use
ls_a = [9, 2, 0, 0, 9]
ls_b = [0, 0, 9, 9, 2]
print(min_list_rotation(ls_a) == min_list_rotation(ls_b))
This can be improved on for efficiency:
check that the sorted lists match before running exhaustive tests.
only test rotations that start with the minimum value (skipping matching values after that), effectively finding the minimum value with the furthest and smallest number after it (continuing in the case where there are multiple matching next-biggest values).
compare rotations without creating the new lists each time.
However, it's still not a very efficient method, since it relies on checking many possibilities.
Is there a more efficient way to perform this comparison?
Related question:
Compare rotated lists in python
If you are looking for duplicates in a large number of lists, you could rotate each list to its lexicographically minimal string representation, then sort the list of lists or use a hash table to find duplicates. This canonicalisation step means that you don't need to compare every list with every other list. There are clever O(n) algorithms for finding the minimal rotation described at https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation.
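As a rough illustration of the canonicalisation step, here is a sketch of one well-known linear-time two-pointer technique for finding the start of the minimal rotation. It is not necessarily the algorithm described in the linked article, and the function names are made up for illustration.

```python
def least_rotation_index(s):
    """Index at which the lexicographically minimal rotation of s starts."""
    n = len(s)
    i, j, k = 0, 1, 0            # two candidate starts and the matched length
    while i < n and j < n and k < n:
        a = s[(i + k) % n]
        b = s[(j + k) % n]
        if a == b:
            k += 1               # candidates still tie, compare further
            continue
        if a > b:
            i += k + 1           # no start in i..i+k can be minimal
        else:
            j += k + 1           # no start in j..j+k can be minimal
        if i == j:
            j += 1
        k = 0
    return min(i, j)

def canonical_rotation(s):
    """Rotate s to its lexicographically minimal form."""
    start = least_rotation_index(s)
    return s[start:] + s[:start]

def same_circular_list(a, b):
    return len(a) == len(b) and canonical_rotation(a) == canonical_rotation(b)

print(same_circular_list([9, 2, 0, 0, 9], [0, 0, 9, 9, 2]))
print(same_circular_list([9, 2, 0, 0, 9], [9, 0, 2, 0, 9]))
```

Canonical forms can also be turned into tuples and used as dictionary keys, which is what makes the hash-table approach for many lists work.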
You almost have it.
You can do some kind of "normalization" or "canonicalisation" of a list independently of the others; then you only need to compare item by item (or, if you want, put them in a map, or in a set to eliminate duplicates, ...).
1 Take the minimum item which is not preceded by itself (in a circular way).
In your example 92009, you should take the first 0 (not the second one).
2 If you always have the same item (say 00000), you just keep that: 00000.
3 If you have the same item several times, take the next item, which is minimal, and keep going until you find one unique path with minimums.
Example: 90148301562 => you have 0148.. and 0156.. => you take 0148
4 If you cannot separate the different paths (i.e. they stay equal all the way around), you have a repeating pattern; then it doesn't matter: you take any of them.
Example: 014376501437650143765: you have the same pattern 0143765...
It is like AAA, where A = 0143765
5 When you have your list in this form, it is easy to compare two of them.
How to do that efficiently:
Iterate on your list to get the minimums Mx (not preceded by itself). If you find several, keep all of them.
Then, iterate from each minimum Mx, take the next item, and keep the minimums. If you do an entire cycle, you have a repeating pattern.
Except in the case of a repeating pattern, this should be the minimal way.
Hope it helps.
I would do this in expected O(N) time using a polynomial hash function to compute the hash of list A, and every cyclic shift of list B. Where a shift of list B has the same hash as list A, I'd compare the actual elements to see if they are equal.
The reason this is fast is that with polynomial hash functions (which are extremely common!), you can calculate the hash of each cyclic shift from the previous one in constant time, so you can calculate hashes for all of the cyclic shifts in O(N) time.
It works like this:
Let's say B has N elements; then the hash of B using prime P is:
Hb=0;
for (i=0; i<N ; i++)
{
    Hb = Hb*P + B[i];
}
This is an optimized way to evaluate a polynomial in P, and is equivalent to:
Hb=0;
for (i=0; i<N ; i++)
{
    Hb += B[i] * P^(N-1-i); //^ is exponentiation, not XOR
}
Notice how every B[i] is multiplied by P^(N-1-i). If we shift B to the left by 1, then every B[i] will be multiplied by an extra P, except the first one. Since multiplication distributes over addition, we can multiply all the components at once just by multiplying the whole hash, and then fix up the factor for the first element.
The hash of the left shift of B is just
Hb1 = Hb*P + B[0]*(1-(P^N))
The second left shift:
Hb2 = Hb1*P + B[1]*(1-(P^N))
and so on...
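Putting the update rule to work, a minimal Python sketch could look like the following. Some things here are assumptions that the answer does not specify: the hash is reduced modulo a large prime so the numbers stay bounded, the base and modulus values are arbitrary picks, the elements are assumed to be integers, and the function name is_rotation is made up for illustration.

```python
def is_rotation(a, b, base=1_000_003, mod=(1 << 61) - 1):
    """Check whether list b is a cyclic shift of list a using a rolling hash."""
    n = len(a)
    if n != len(b):
        return False
    if n == 0:
        return True

    def poly_hash(seq):
        h = 0
        for x in seq:
            h = (h * base + x) % mod      # Horner evaluation, as in Hb above
        return h

    target = poly_hash(a)
    h = poly_hash(b)
    fix = (1 - pow(base, n, mod)) % mod   # the (1 - P^N) factor
    for shift in range(n):
        # Only compare actual elements when the hashes agree.
        if h == target and b[shift:] + b[:shift] == a:
            return True
        # Hash of the next left shift: H*P + B[shift]*(1 - P^N)
        h = (h * base + b[shift] * fix) % mod
    return False

print(is_rotation([9, 2, 0, 0, 9], [0, 0, 9, 9, 2]))
print(is_rotation([9, 2, 0, 0, 9], [9, 0, 2, 0, 9]))
```

The element-by-element check on a hash match guards against collisions, so a wrong base or modulus choice can only cost time, not correctness.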

Maximum sum of n intervals in a sequence

I'm doing some programming "kata" which are skill building exercises for programming (and martial arts). I want to learn how to solve for algorithms like these in shorter amounts of time, so I need to develop my knowledge of the patterns. Eventually I want to solve in increasingly efficient time complexities (O(n), O(n^2), etc), but for now I'm fine with figuring out the solution with any efficiency to start.
The problem:
Given arr[10] = [4, 5, 0, 2, 5, 6, 4, 0, 3, 5]
Given various segment lengths, for example one 3-length segment, and two 2-length segments, find the optimal position of (or maximum sum contained by) the segments without overlapping the segments.
For example, solution to this array and these segments is 2, because:
{4 5} 0 2 {5 6 4} 0 {3 5}
What I have tried before posting on stackoverflow.com:
I've read through:
Algorithm to find maximum coverage of non-overlapping sequences. (I.e., the Weighted Interval Scheduling Prob.)
algorithm to find longest non-overlapping sequences
and I've watched MIT opencourseware and read about general steps for solving complex problems with dynamic programming, and completed a dynamic programming tutorial for finding Fibonacci numbers with memoization. I thought I could apply memoization to this problem, but I haven't found a way yet.
The theme of dynamic programming is to break the problem down into sub-problems which can be iterated to find the optimal solution.
What I have come up with (in an OO way) is
foreach (segment) {
    - find the greatest sum interval with length of this segment
}
This produces incorrect results, because the segments will not always fit with this approach. For example:
Given arr[7] = [0, 3, 5, 5, 5, 1, 0] and two 3-length segments,
The first segment will take 5, 5, 5, leaving no room for the second segment. Ideally I should memoize this scenario and try the algorithm again, this time avoiding 5, 5, 5, as a first pick. Is this the right path?
How can I approach this in a "dynamic programming" way?
If you place the first segment, you get two smaller sub-arrays: placing one or both of the two remaining segments into one of these sub-arrays is a sub-problem of just the same form as the original one.
So this suggests a recursion: you place the first segment, then try out the various combinations of assigning remaining segments to sub-arrays, and maximize over those combinations. Then you memoize: the sub-problems all take an array and a list of segment sizes, just like the original problem.
I'm not sure this is the best algorithm but it is the one suggested by a "direct" dynamic programming approach.
EDIT: In more detail:
The arguments to the valuation function should have two parts: one is a pair of numbers which represent the sub-array being analysed (initially [0,6] in this example) and the second is a multi-set of numbers representing the lengths of the segments to be allocated ({3,3} in this example). Then in pseudo-code you do something like this:
valuation( array_ends, the_segments):
    if the_segments is empty:
        return 0
    if sum of the_segments > array_ends[1] - array_ends[0]:
        return -infinity
    segment_length = length of chosen segment from the_segments
    remaining_segments = the_segments with chosen segment removed
    best_option = -infinity
    for segment_placement = array_ends[0] to array_ends[1] - segment_length:
        value1 = value of placing the chosen segment at segment_placement
        new_array1 = [array_ends[0],segment_placement]
        new_array2 = [segment_placement + segment_length,array_ends[1]]
        for each partition of remaining segments into seg1 and seg2:
            sub_value1 = valuation( new_array1, seg1)
            sub_value2 = valuation( new_array2, seg2)
            if value1 + sub_value1 + sub_value2 > best_option:
                best_option = value1 + sub_value1 + sub_value2
    return best_option
This code (modulo off by one errors and typos) calculates the valuation but it calls the valuation function more than once with the same arguments. So the idea of the memoization is to cache those results and avoid re-traversing equivalent parts of the tree. So we can do this just by wrapping the valuation function:
memoized_valuation(args):
    if args in memo_dictionary:
        return memo_dictionary[args]
    else:
        result = valuation(args)
        memo_dictionary[args] = result
        return result
Of course, you need to change the recursive call now to call memoized_valuation.
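For reference, a runnable Python sketch of this recursion might look like the following. It makes a few choices of its own: sub-arrays are half-open ranges [lo, hi) to sidestep the off-by-one issues, functools.lru_cache does the memoization, prefix sums give each placement's value, and the name max_segment_sum is made up for illustration. It enumerates every split of the remaining segments, so it is only practical for a handful of segments.

```python
from functools import lru_cache
from itertools import combinations

def max_segment_sum(arr, segments):
    """Best total of non-overlapping segments with the given lengths."""
    prefix = [0]
    for x in arr:
        prefix.append(prefix[-1] + x)   # prefix[i] = sum(arr[:i])

    @lru_cache(maxsize=None)
    def valuation(lo, hi, segs):
        # segs: sorted tuple of lengths still to be placed inside arr[lo:hi]
        if not segs:
            return 0
        if sum(segs) > hi - lo:
            return float("-inf")        # the segments cannot fit
        length, rest = segs[0], segs[1:]
        best = float("-inf")
        for pos in range(lo, hi - length + 1):
            value = prefix[pos + length] - prefix[pos]
            # Try every split of the remaining segments into left and right.
            for r in range(len(rest) + 1):
                for left_idx in combinations(range(len(rest)), r):
                    left = tuple(rest[i] for i in left_idx)
                    right = tuple(rest[i] for i in range(len(rest))
                                  if i not in left_idx)
                    total = (value
                             + valuation(lo, pos, left)
                             + valuation(pos + length, hi, right))
                    best = max(best, total)
        return best

    return valuation(0, len(arr), tuple(sorted(segments)))

print(max_segment_sum([0, 3, 5, 5, 5, 1, 0], [3, 3]))
```

Keeping the segment lengths as a sorted tuple makes equivalent sub-problems hit the same cache entry.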

Importance of order of the operation in backtracking algorithms

How important is the order of operations in each recursive step of a backtracking algorithm, in terms of the efficiency of that algorithm?
For example, take the Knight’s Tour problem:
The knight is placed on the first block of an empty board and, moving
according to the rules of chess, must visit each square exactly once.
In each step there are 8 possible (in general) ways to move.
int xMove[8] = { 2, 1, -1, -2, -2, -1, 1, 2 };
int yMove[8] = { 1, 2, 2, 1, -1, -2, -2, -1 };
If I change this order like...
int xmove[8] = { -2, -2, 2, 2, -1, -1, 1, 1};
int ymove[8] = { -1, 1,-1, 1, -2, 2, -2, 2};
Now, for an n*n board:
up to n = 6, the order of the moves makes no visible difference to the execution time,
but for n >= 7, the execution time with the first move order is much less than with the second one.
In such cases, it is not feasible to generate all O(m!) move orders and test the algorithm on each. So how do I determine the performance of such an algorithm for a specific move order, or rather, how can I arrive at one (or a set) of move orders for which the algorithm is more efficient in terms of execution time?
This is an interesting problem from a Math/CS perspective. There definitely exists a permutation (or set of permutations) that would be most efficient for a given n . I don't know if there is a permutation that is most efficient among all n. I would guess not. There could be a permutation that is better 'on average' (however you define that) across all n.
If I was tasked to find an efficient permutation I might try doing the following: I would generate a fixed number x of random move orders and measure their efficiency. For every one of those move orders, randomly create a fixed number of permutations that are near the original, and compute their efficiencies. Now you have many more permutations than you started with. Take the top x performing ones and repeat. This will provide some locally optimal move orders, but I don't know if it leads to the globally optimal one(s).
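A rough Python sketch of that search loop is below. Several things are my own assumptions rather than part of the answer: "efficiency" is measured as the number of squares the naive backtracking tries before finding a tour (capped so a bad order cannot run forever), a "nearby" permutation is one with two moves swapped, and the names tour_cost, neighbour and search_orders are made up for illustration.

```python
import random

def tour_cost(order, n=5, limit=50_000):
    """Squares tried by naive backtracking from (0, 0), capped at limit."""
    board = [[False] * n for _ in range(n)]
    board[0][0] = True
    tried = 0

    def solve(x, y, visited):
        nonlocal tried
        if visited == n * n:
            return True
        for dx, dy in order:
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and not board[nx][ny]:
                tried += 1
                if tried >= limit:
                    return True          # cap reached: unwind and report limit
                board[nx][ny] = True
                if solve(nx, ny, visited + 1):
                    return True
                board[nx][ny] = False    # backtrack
        return False

    solve(0, 0, 1)
    return tried

def neighbour(order, rng):
    """Swap two random moves to get a nearby permutation."""
    order = list(order)
    i, j = rng.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]
    return tuple(order)

def search_orders(cost, moves, population=4, children=3, rounds=3, seed=0):
    """Keep the best `population` orders, perturb them, and repeat."""
    rng = random.Random(seed)
    pool = [tuple(rng.sample(moves, len(moves))) for _ in range(population)]
    for _ in range(rounds):
        candidates = set(pool)
        for order in pool:
            for _ in range(children):
                candidates.add(neighbour(order, rng))
        pool = sorted(candidates, key=cost)[:population]
    return pool[0]
```

It could be called as search_orders(tour_cost, moves), where moves is the list of eight (dx, dy) pairs. As the answer says, this kind of search can get stuck in a local optimum, so the result depends on the seed and the population size.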

Resources