Maximum sum of n intervals in a sequence - algorithm

I'm doing some programming "kata", which are skill-building exercises for programming (the term is borrowed from martial arts). I want to learn how to solve problems like this one in shorter amounts of time, so I need to develop my knowledge of the patterns. Eventually I want to reach increasingly efficient time complexities (O(n), O(n^2), etc.), but for now I'm fine with finding a solution of any efficiency to start.
The problem:
Given arr[10] = [4, 5, 0, 2, 5, 6, 4, 0, 3, 5]
Given various segment lengths, for example one 3-length segment, and two 2-length segments, find the optimal position of (or maximum sum contained by) the segments without overlapping the segments.
For example, the solution for this array and these segments is 32 (9 + 15 + 8), because:
{4 5} 0 2 {5 6 4} 0 {3 5}
What I have tried before posting on stackoverflow.com:
I've read through:
Algorithm to find maximum coverage of non-overlapping sequences. (I.e., the Weighted Interval Scheduling Prob.)
algorithm to find longest non-overlapping sequences
and I've watched MIT opencourseware and read about general steps for solving complex problems with dynamic programming, and completed a dynamic programming tutorial for finding Fibonacci numbers with memoization. I thought I could apply memoization to this problem, but I haven't found a way yet.
The theme of dynamic programming is to break the problem down into sub-problems which can be iterated to find the optimal solution.
What I have come up with (in an OO way) is
foreach (segment) {
    find the greatest-sum interval with the length of this segment
}
This produces incorrect results, because the segments will not always fit with this approach. For example:
Given arr[7] = [0, 3, 5, 5, 5, 1, 0] and two 3-length segments,
The first segment will take 5, 5, 5, leaving no room for the second segment. Ideally I should memoize this scenario and try the algorithm again, this time avoiding 5, 5, 5, as a first pick. Is this the right path?
How can I approach this in a "dynamic programming" way?

If you place the first segment, you get two smaller sub-arrays: placing one or both of the two remaining segments into one of these sub-arrays is a sub-problem of just the same form as the original one.
So this suggests a recursion: you place the first segment, then try out the various combinations of assigning remaining segments to sub-arrays, and maximize over those combinations. Then you memoize: the sub-problems all take an array and a list of segment sizes, just like the original problem.
I'm not sure this is the best algorithm but it is the one suggested by a "direct" dynamic programming approach.
EDIT: In more detail:
The arguments to the valuation function should have two parts: one is a pair of numbers which represent the sub-array being analysed (initially [0,6] in this example) and the second is a multi-set of numbers representing the lengths of the segments to be allocated ({3,3} in this example). Then in pseudo-code you do something like this:
valuation(array_ends, the_segments):
    if the_segments is empty:
        return 0
    if sum of the_segments > array_ends[1] - array_ends[0]:
        return -infinity
    segment_length = length of chosen segment from the_segments
    remaining_segments = the_segments with chosen segment removed
    best_option = 0
    for segment_placement = array_ends[0] to array_ends[1] - segment_length:
        value1 = value of placing the chosen segment at segment_placement
        new_array1 = [array_ends[0], segment_placement]
        new_array2 = [segment_placement + segment_length, array_ends[1]]
        for each partition of remaining_segments into seg1 and seg2:
            sub_value1 = valuation(new_array1, seg1)
            sub_value2 = valuation(new_array2, seg2)
            if value1 + sub_value1 + sub_value2 > best_option:
                best_option = value1 + sub_value1 + sub_value2
    return best_option
This code (modulo off by one errors and typos) calculates the valuation but it calls the valuation function more than once with the same arguments. So the idea of the memoization is to cache those results and avoid re-traversing equivalent parts of the tree. So we can do this just by wrapping the valuation function:
memoized_valuation(args):
    if args in memo_dictionary:
        return memo_dictionary[args]
    else:
        result = valuation(args)
        memo_dictionary[args] = result
        return result
Of course, you need to change the recursive call now to call memoized_valuation.
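For concreteness, here is a minimal runnable sketch of that memoized recursion in Python. The function names, the prefix-sum helper and the use of lru_cache are my own additions rather than part of the pseudocode above, and it is meant as an illustration of the approach, not a polished solution.

from functools import lru_cache
from itertools import combinations

def max_segments_sum(arr, segment_lengths):
    # Prefix sums so the value of any interval can be read off in O(1).
    prefix = [0]
    for x in arr:
        prefix.append(prefix[-1] + x)

    def interval_sum(start, length):
        return prefix[start + length] - prefix[start]

    @lru_cache(maxsize=None)
    def best(lo, hi, segments):
        # segments: sorted tuple of segment lengths still to place inside arr[lo:hi]
        if not segments:
            return 0
        if sum(segments) > hi - lo:
            return float("-inf")  # they cannot all fit in this window
        seg, rest = segments[0], segments[1:]
        best_value = float("-inf")
        for start in range(lo, hi - seg + 1):
            placed = interval_sum(start, seg)
            # try every split of the remaining segments between the
            # left window [lo, start) and the right window [start + seg, hi)
            for k in range(len(rest) + 1):
                for left_idx in combinations(range(len(rest)), k):
                    left = tuple(sorted(rest[i] for i in left_idx))
                    right = tuple(sorted(rest[i] for i in range(len(rest))
                                         if i not in left_idx))
                    value = placed + best(lo, start, left) + best(start + seg, hi, right)
                    best_value = max(best_value, value)
        return best_value

    return best(0, len(arr), tuple(sorted(segment_lengths)))

# Example from the question: one 3-length and two 2-length segments.
print(max_segments_sum([4, 5, 0, 2, 5, 6, 4, 0, 3, 5], [3, 2, 2]))  # 32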

Compare rotated lists, containing duplicates [duplicate]

I'm looking for an efficient way to compare lists of numbers to see if they match at any rotation (comparing 2 circular lists).
When the lists don't have duplicates, picking smallest/largest value and rotating both lists before comparisons works.
But when there may be many duplicate large values, this isn't so simple.
For example, the lists [9, 2, 0, 0, 9] and [0, 0, 9, 9, 2] match, whereas [9, 0, 2, 0, 9] doesn't (since the order is different).
Here's an example of an inefficient function which works.
def min_list_rotation(ls):
    return min(ls[i:] + ls[:i] for i in range(len(ls)))
# example use
ls_a = [9, 2, 0, 0, 9]
ls_b = [0, 0, 9, 9, 2]
print(min_list_rotation(ls_a) == min_list_rotation(ls_b))
This can be improved on for efficiency:
check that the sorted lists match before running exhaustive tests.
only test rotations that start with the minimum value (skipping matching values after that), effectively finding the minimum value with the furthest and smallest number after it (continuing in the case where there are multiple matching next-biggest values).
compare rotations without creating new lists each time.
However, it's still not a very efficient method since it relies on checking many possibilities.
Is there a more efficient way to perform this comparison?
Related question:
Compare rotated lists in python
If you are looking for duplicates in a large number of lists, you could rotate each list to its lexicographically minimal string representation, then sort the list of lists or use a hash table to find duplicates. This canonicalisation step means that you don't need to compare every list with every other list. There are clever O(n) algorithms for finding the minimal rotation described at https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation.
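To illustrate, here is a sketch of Booth's algorithm (following the pseudocode on that Wikipedia page) adapted to lists, plus a small canonicalisation helper; the function names are mine.

def least_rotation(seq):
    # Booth's algorithm: index of the lexicographically minimal rotation in O(n).
    n = len(seq)
    f = [-1] * (2 * n)        # failure function over the doubled sequence
    k = 0                     # start index of the least rotation found so far
    for j in range(1, 2 * n):
        sj = seq[j % n]
        i = f[j - k - 1]
        while i != -1 and sj != seq[(k + i + 1) % n]:
            if sj < seq[(k + i + 1) % n]:
                k = j - i - 1
            i = f[i]
        if sj != seq[(k + i + 1) % n]:   # here i == -1
            if sj < seq[k % n]:
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k

def canonical(ls):
    k = least_rotation(ls)
    return tuple(ls[k:] + ls[:k])

# The canonical forms can be hashed or sorted to find duplicates among many lists.
print(canonical([9, 2, 0, 0, 9]) == canonical([0, 0, 9, 9, 2]))   # True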
You almost have it.
You can do some kind of "normalization" or "canonicalisation" of each list independently of the others; then you only need to compare item by item (or, if you want, put them in a map or a set to eliminate duplicates, ...).
1. Take the minimum item which is not preceded by itself (in a circular way).
In your example 9 2 0 0 9, you should take the first 0 (not the second one).
2. If you always have the same item (say 0 0 0 0 0), you just keep that: 0 0 0 0 0.
3. If you have the same item several times, take the next item which is minimal, and keep going until you find one unique path of minimums.
Example: 90148301562 => you have 0148... and 0156... => you take 0148.
4. If you cannot separate the different paths (i.e. they stay equal forever), you have a repeating pattern: then it doesn't matter, you take any of them.
Example: 014376501437650143765: you have the same pattern 0143765...
It is like AAA, where A = 0143765.
5. When you have your list in this form, it is easy to compare two of them.
How to do that efficiently:
Iterate over your list to find the minimums Mx (those not preceded by themselves). If you find several, keep all of them.
Then iterate from each minimum Mx, take the next item, and keep only the minimal candidates. If you complete an entire cycle, you have a repeating pattern.
Except in the case of a repeating pattern, this yields the unique minimal rotation.
Hope it helps.
I would do this in expected O(N) time using a polynomial hash function to compute the hash of list A, and every cyclic shift of list B. Where a shift of list B has the same hash as list A, I'd compare the actual elements to see if they are equal.
The reason this is fast is that with polynomial hash functions (which are extremely common!), you can calculate the hash of each cyclic shift from the previous one in constant time, so you can calculate hashes for all of the cyclic shifts in O(N) time.
It works like this:
Let's say B has N elements; then the hash of B using prime P is:
Hb = 0;
for (i = 0; i < N; i++)
{
    Hb = Hb*P + B[i];
}
This is an optimized way to evaluate a polynomial in P, and is equivalent to:
Hb = 0;
for (i = 0; i < N; i++)
{
    Hb += B[i] * P^(N-1-i); //^ is exponentiation, not XOR
}
Notice how every B[i] is multiplied by P^(N-1-i). If we shift B to the left by 1, then every B[i] will be multiplied by an extra P, except the first one. Since multiplication distributes over addition, we can multiply all the components at once just by multiplying the whole hash by P, and then fix up the factor for the first element.
The hash of the left shift of B is just
Hb1 = Hb*P + B[0]*(1-(P^N))
The second left shift:
Hb2 = Hb1*P + B[1]*(1-(P^N))
and so on...
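To make that concrete, here is a rough Python sketch of the rolling-hash comparison. The modular arithmetic, the particular prime P and the modulus are my own additions to keep the numbers bounded, and a hash match is confirmed by comparing the actual rotations to rule out collisions.

def cyclic_equal(A, B, P=1_000_003, MOD=(1 << 61) - 1):
    if len(A) != len(B):
        return False
    n = len(A)

    def poly_hash(xs):
        h = 0
        for x in xs:
            h = (h * P + x) % MOD
        return h

    ha = poly_hash(A)
    hb = poly_hash(B)                       # hash of B shifted left by 0
    shift_fix = (1 - pow(P, n, MOD)) % MOD  # the (1 - P^N) factor from above
    for i in range(n):
        # hb is currently the hash of B[i:] + B[:i]
        if hb == ha and B[i:] + B[:i] == A:
            return True
        hb = (hb * P + B[i] * shift_fix) % MOD   # advance to the next left shift
    return False

print(cyclic_equal([9, 2, 0, 0, 9], [0, 0, 9, 9, 2]))   # True
print(cyclic_equal([9, 2, 0, 0, 9], [9, 0, 2, 0, 9]))   # False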

Importance of order of the operation in backtracking algorithms

How important is the order of operations in each recursive step of a backtracking algorithm to the efficiency of that particular algorithm?
For example, in the Knight's Tour problem:
The knight is placed on the first block of an empty board and, moving
according to the rules of chess, must visit each square exactly once.
In each step there are 8 possible (in general) ways to move.
int xMove[8] = { 2, 1, -1, -2, -2, -1, 1, 2 };
int yMove[8] = { 1, 2, 2, 1, -1, -2, -2, -1 };
If I change this order like...
int xmove[8] = { -2, -2, 2, 2, -1, -1, 1, 1};
int ymove[8] = { -1, 1,-1, 1, -2, 2, -2, 2};
Now, for an n*n board with n <= 6, the two operation orders show no visible difference in execution time, but for n >= 7 the first (movement) order's execution time is much less than the second one's.
In such cases it is not feasible to generate all O(m!) operation orders and test the algorithm against each. So how do I determine the performance of such an algorithm for a specific movement order, or rather, how can I arrive at one (or a set) of operation orders for which the algorithm is more efficient in terms of execution time?
This is an interesting problem from a Math/CS perspective. There definitely exists a permutation (or set of permutations) that would be most efficient for a given n. I don't know if there is a permutation that is most efficient among all n; I would guess not. There could be a permutation that is better 'on average' (however you define that) across all n.
If I were tasked to find an efficient permutation I might try the following: generate a fixed number x of random move orders and measure their efficiency. For each of those randomly generated move orders, randomly create a fixed number of permutations that are near the original, and compute their efficiencies. Now you have many more permutations than you started with. Take the top x performing ones and repeat. This will produce some locally optimal orderings, but I don't know whether it leads to the globally optimal one(s).
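For what it's worth, here is a rough Python sketch of that kind of local search, assuming the cost of a move order is measured by the number of recursive calls a plain backtracking knight's tour makes with it; the cost function, board size, population size, call cap and mutation scheme are all arbitrary choices of mine.

import random

def tour_cost(order, n, cap=200_000):
    # Number of recursive calls a plain backtracking knight's tour makes on an
    # n x n board, starting from (0, 0), with the given move order (capped).
    board = [[False] * n for _ in range(n)]
    board[0][0] = True
    calls = 0

    def solve(x, y, visited):
        nonlocal calls
        calls += 1
        if calls >= cap or visited == n * n:
            return True
        for dx, dy in order:
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and not board[nx][ny]:
                board[nx][ny] = True
                if solve(nx, ny, visited + 1):
                    return True
                board[nx][ny] = False
        return False

    solve(0, 0, 1)
    return calls

def search_orders(n=7, x=6, rounds=4, seed=1):
    # Keep the best x orders, mutate each by swapping two moves, and repeat.
    random.seed(seed)
    moves = [(2, 1), (1, 2), (-1, 2), (-2, 1), (-2, -1), (-1, -2), (1, -2), (2, -1)]
    pool = [random.sample(moves, len(moves)) for _ in range(x)]
    for _ in range(rounds):
        pool.sort(key=lambda order: tour_cost(order, n))
        best = pool[:x]
        children = []
        for order in best:
            child = order[:]
            i, j = random.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]
            children.append(child)
        pool = best + children
    return min(pool, key=lambda order: tour_cost(order, n))

best_order = search_orders()
print(best_order)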

Dividing an array into equally weighted subarrays

Algorithm question here.
I have an unordered array containing product weights, e.g. [3, 2, 5, 5, 8] which need to be divided up into smaller arrays.
Rules:
REQUIRED: Should return 1 or more arrays.
REQUIRED: No array should sum to more than 12.
REQUIRED: Return the minimum possible number of arrays, e.g. the total weight of the example above is 23, which can fit into two arrays.
IDEALLY: Arrays should be weighted as evenly as possible.
In the example above, the ideal return would be [ [3, 8], [2, 5, 5] ]
My current thoughts:
Number of arrays to return will be (sum(input_array) / 12).ceil
A greedy algorithm could work well enough?
This is a combination of the bin packing problem and multiprocessor scheduling problem. Both are NP-hard.
Your three requirements constitute the bin packing problem: find the minimal number of bins of a fixed size (12) that fit all the numbers.
Once you solve that, you have the multiprocessor scheduling problem: given a fixed number of bins, what is the most even way to distribute the numbers among them.
There are a number of well-known approximation algorithms for both problems.
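As a rough illustration of such approximations (not an exact solver), here is a first-fit-decreasing sketch in Python for the bin-packing half and a longest-processing-time sketch for the balancing half; the function names and the capacity default of 12 are my own.

def first_fit_decreasing(weights, capacity=12):
    # Place each weight (largest first) into the first bin it fits in.
    bins = []
    for w in sorted(weights, reverse=True):
        for b in bins:
            if sum(b) + w <= capacity:
                b.append(w)
                break
        else:
            bins.append([w])
    return bins

def balance(weights, num_bins):
    # Longest-processing-time: always drop the next weight into the lightest bin.
    # Note: this evens out the sums but does not by itself enforce the capacity.
    bins = [[] for _ in range(num_bins)]
    for w in sorted(weights, reverse=True):
        min(bins, key=sum).append(w)
    return bins

weights = [3, 2, 5, 5, 8]
packed = first_fit_decreasing(weights)
print(packed)                          # [[8, 3], [5, 5, 2]]
print(balance(weights, len(packed)))   # [[8, 3], [5, 5, 2]]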
How about something with an entirely different take on it. Something really simple. Like this, which is based on common horse sense:
module Splitter
  def self.split(values, max_length = 12)
    # if the sum of all values is lower than the max_length there
    # is no point in continuing
    return values unless self.sum(values) > max_length

    optimized = []
    current = []

    # start off by ordering the values. perhaps it's a good idea
    # to start off with the smallest values first; this will result
    # in gathering as much values as possible in the first array. This
    # in order to conform to the rule "Should return minimum possible
    # number of arrays"
    ordered_values = values.sort

    ordered_values.each do |v|
      if self.sum(current) + v > max_length
        # finish up the current iteration if we've got an optimized pair
        optimized.push(current)
        # reset for the next iteration
        current = []
      end
      current.push(v)
    end

    # push the last iteration
    optimized.push(current)

    return optimized
  end

  # calculates the sum of a collection of numbers
  def self.sum(numbers)
    if numbers.empty?
      return 0
    else
      return numbers.inject { |sum, x| sum + x }
    end
  end
end
Which can be used like so:
product_weights = [3, 2, 5, 5, 8]
p Splitter.split(product_weights)
The output will be:
[[2, 3, 5], [5], [8]]
Now, as said before, this is a really simple sample. And I've excluded the validations for empty or non-numeric values in the array for brevity. But it does seem to conform to your primary requirements:
Splitting the (expected: all numeric) values into arrays
With a ceiling per array, defaulting in the sample to 12
Returning the minimum number of arrays by collecting the smallest numbers first. EDIT: after the edit from the comments, this indeed doesn't work
I do have some doubts regarding the comment on "returning minimum possible number of arrays, and balance the weights throughout those as evenly as possible". I'm sure someone else will come up with an implementation of a better and math-proven algorithm that conforms to that requirement, but perhaps this is at least a suitable example for the discussion?

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Most sort algorithms rely on a pairwise comparison that determines whether A < B, A = B or A > B.
I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).
For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with fewer comparisons.
The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparison function. For example, a much-less-than b and c little-less-than b implies a < c < b.
The deductions can be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with an n-way partition. Here's an implementation in Python:
from collections import defaultdict
from random import choice

def quicksort(seq, compare):
    'Stable in-place sort using a 3-or-more-way comparison function'
    # Make an n-way partition on a random pivot value
    segments = defaultdict(list)
    pivot = choice(seq)
    for x in seq:
        ranking = 0 if x is pivot else compare(x, pivot)
        segments[ranking].append(x)
    seq.clear()
    # Recursively sort each segment and store it in the sequence
    for ranking, segment in sorted(segments.items()):
        if ranking and len(segment) > 1:
            quicksort(segment, compare)
        seq += segment

if __name__ == '__main__':
    from random import randrange
    from math import log10

    def super_compare(a, b):
        'Compare with extra logarithmic near/far information'
        c = -1 if a < b else 1 if a > b else 0
        return c * (int(log10(max(abs(a - b), 1.0))) + 1)

    n = 10000
    data = [randrange(4*n) for i in range(n)]
    goal = sorted(data)
    quicksort(data, super_compare)
    print(data == goal)
By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.
The code also makes it easy to experiment with a variety of comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences of four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the coarse-grained partitions used in the beginning only have a handful of "near matches" and everything else falls in "far matches".
An improvement to the super comparison is to cover logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/-3 if within a thousand, and so on).
An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.
The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.
You can use a modified quicksort. Let me explain with an example where your comparison function returns [-2, -1, 0, 1, 2]. Say you have an array A to sort.
Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.
Pick an arbitrary element of A, X.
For each element of the array, compare it with X.
Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.
Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all the elements there are equal to X).
Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.
It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.
Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.
This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.
I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.
Consider this possibility (all strings use the 27 characters a-z and _):
            11111111112
   12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___
Obviously strings 1 and 2 are more similar than 1 and 3 and much more similar than 1 and 4.
One approach is to scale the difference value for each identical character position and use the first different character to set the last position.
Putting aside signs for the moment, comparing string 1 with 2, they differ in position 8: 'n' vs 't'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:
digit = ceiling(9 * abs(diff) / 26)
since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.
And because the difference is in position 8, our comparison function will return 3x10^-8 (actually it will return the negative of that, since string 1 comes after string 2).
Using a similar process for strings 1 and 4, the comparison function returns -5x10^-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '_' - 'a' (26), which generates the digit 9 and hence gives us 9x10^-1.
Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.
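Here is a small Python sketch of that scheme, assuming equal-length strings and an alphabet ordered 'a'..'z' then '_' (inferred from the worked example above). Unlike the description above, the sign follows the usual convention of being negative when the first string sorts before the second.

import math

def fuzzy_compare(s1, s2):
    # Sketch of the position-weighted digit scheme described above.
    def rank(c):
        # assumed ordering: 'a'..'z' then '_', so '_' - 'a' is 26 as in the example
        return 26 if c == "_" else ord(c) - ord("a")

    for pos, (c1, c2) in enumerate(zip(s1, s2), start=1):
        if c1 != c2:
            diff = rank(c1) - rank(c2)
            digit = math.ceil(9 * abs(diff) / 26)    # scale the difference to 1-9
            return math.copysign(digit * 10.0 ** -pos, diff)
    return 0.0

print(fuzzy_compare("now_is_the_time", "now_is_never"))   # ~3e-08
print(fuzzy_compare("now_is_the_time", "aaa"))            # ~0.5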
Considering you are looking to order a number of items based on human comparison, you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the loser's by 3, +2 and -2, +1 and -1, or just 0 and 0 for a draw.
Then you just do a regular sort based on the scores.
Another alternative would be a single or double elimination tournament structure.
You can use two comparisons to achieve this. Multiply the more important comparison by 2, and add them together.
Here is an example of what I mean in Perl.
It compares two array references by the first element, then by the second element.
use strict;
use warnings;
use 5.010;
my @array = (
    [a => 2],
    [b => 1],
    [a => 1],
    [c => 0]
);
say "$_->[0] => $_->[1]" for sort {
    ($a->[0] cmp $b->[0]) * 2 +
    ($a->[1] <=> $b->[1])
} @array;
a => 1
a => 2
b => 1
c => 0
You could extend this to any number of comparisons very easily.
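If Perl isn't your thing, the same weighting trick looks roughly like this in Python, using cmp_to_key; the cmp helper and variable names are mine.

from functools import cmp_to_key

def cmp(a, b):
    return (a > b) - (a < b)

data = [("a", 2), ("b", 1), ("a", 1), ("c", 0)]
# weight the more important comparison so it always dominates the second one
weighted = cmp_to_key(lambda x, y: cmp(x[0], y[0]) * 2 + cmp(x[1], y[1]))
print(sorted(data, key=weighted))   # [('a', 1), ('a', 2), ('b', 1), ('c', 0)]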
Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.
Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}
Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}
But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.
For instance, once I needed to sort historical data. The data was kept sorted. When new data arrived it was appended and then the sort was run again. I did not have the information of where the new data was appended. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already-sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.
The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.

Algorithm for merging sets that share at least 2 elements

Given a list of sets:
S_1 : [ 1, 2, 3, 4 ]
S_2 : [ 3, 4, 5, 6, 7 ]
S_3 : [ 8, 9, 10, 11 ]
S_4 : [ 1, 8, 12, 13 ]
S_5 : [ 6, 7, 14, 15, 16, 17 ]
What is the most efficient way to merge all sets that share at least 2 elements? I suppose this is similar to a connected components problem. So the result would be:
[ 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17] (S_1 UNION S_2 UNION S_5)
[ 8, 9, 10, 11 ]
[ 1, 8, 12, 13 ] (S_4 shares 1 with S_1, and 8 with S_3, but not merged because they only share one element in each)
The naive implementation is O(N^2), where N is the number of sets, which is unworkable for us. This would need to be efficient for millions of sets.
Let there be a list of many Sets named (S).
Perform a pass through all elements of S, to determine the range (LOW .. HIGH).
Create an array of pointers to Set, of dimensions (LOW .. HIGH, LOW .. HIGH), named (M).
do
    Init all elements of M to NULL.
    Iterate through S, processing them one Set at a time, named (Si).
        Enumerate all ordered pairs (P1, P2) in Si, where P1 <= P2.
        For each pair examine M(P1, P2):
            if M(P1, P2) is NULL
                Continue with the next pair.
            otherwise
                Merge Si into the Set pointed to by M(P1, P2).
                Remove Si from S, as it has been merged.
                Move on to processing Set S(i + 1).
        If Si was not merged,
            Enumerate again through the pairs of Si and
                for each pair, make M(P1, P2) point to Si.
while At least one set was merged during the pass.
My head is saying this is about Order (2N ln N).
Take that with a grain of salt.
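For comparison, here is a small Python sketch of the same pair-indexing idea, using a dictionary instead of a dense (LOW .. HIGH) matrix and union-find to handle chains of merges (treating the problem as connected components, as the question suggests); all names are mine.

from itertools import combinations

def merge_sets(sets):
    parent = list(range(len(sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    seen = {}   # unordered element pair -> index of the first set containing it
    for idx, s in enumerate(sets):
        for pair in combinations(sorted(s), 2):
            if pair in seen:
                union(idx, seen[pair])      # the two sets share at least this pair
            else:
                seen[pair] = idx

    groups = {}
    for idx, s in enumerate(sets):
        groups.setdefault(find(idx), set()).update(s)
    return list(groups.values())

sets = [{1, 2, 3, 4}, {3, 4, 5, 6, 7}, {8, 9, 10, 11},
        {1, 8, 12, 13}, {6, 7, 14, 15, 16, 17}]
print(merge_sets(sets))   # S_1+S_2+S_5 merged; S_3 and S_4 stay separate

Note that enumerating all pairs of a set is quadratic in the set's size, so this suits many small-to-medium sets rather than a few enormous ones.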
If you can order the elements in the set, you can look into using mergesort on the sets. The only modification needed is to check for duplicates during the merge phase. If one is found, just discard the duplicate. Since mergesort is O(n*log(n)), this will offer improved speed when compared to the naive O(n^2) algorithm.
However, to really be effective, you should maintain a sorted set and keep it sorted, so that you can skip the sort phase and go straight to the merge phase.
I don't see how this can be done in less than O(n^2).
Every set needs to be compared to every other one to see if they contain 2 or more shared elements. That's n*(n-1)/2 comparisons, therefore O(n^2), even if the check for shared elements takes constant time.
In sorting, the naive implementation is O(n^2), but you can take advantage of the transitive nature of ordered comparison (so, for example, you know nothing in the lower partition of quicksort needs to be compared to anything in the upper partition, as it's already been compared to the pivot). This is what results in sorting being O(n * log n).
This doesn't apply here. So unless there's something special about the sets that allows us to skip comparisons based on the results of previous comparisons, it's going to be O(n^2) in general.
One side note: It depends on how often this occurs. If most pairs of sets do share at least two elements, it might be most efficient to build the new set at the same time as you are stepping through the comparison, and throw it away if they don't match the condition. If most pairs do not share at least two elements, then deferring the building of the new set until confirmation of the condition might be more efficient.
If your elements are numerical in nature, or can be naturally ordered (i.e. you can assign a value such as 1, 2, 42, etc.), I would suggest using a radix sort on the merged sets, and making a second pass to pick up the unique elements.
This algorithm should be O(n), and you can optimize the radix sort quite a bit using bitwise shift operators and bit masks. I have done something similar for a project I was working on, and it works like a charm.
