Find the largest sorted subset [duplicate] - algorithm

This question already has an answer here:
finding largest increasing subset of an array (non-contiguous)
(1 answer)
Closed 4 years ago.
I need to create an algorithm that extracts one of the largest possible subsets from a list, in which all elements are ordered. This subset can be non-consecutive, but must preserve the order from the original list. For example:
Input: Possible Output:
[1,2,8,3,6,4,7,9,5] -> [1,2,3,6,7,9]
One might rephrase the question as "which elements do I at least have to remove so that the remaining list is sorted".
I'm not looking for an implementation, but just for ideas for a simple algorithm.
My best approach so far would be to build a tree with nodes for each number, and their children being all larger numbers following in the list. Then the longest path down the tree should equal the sorted subset. However, that seems overly complicated.
Context: this is to check student's answers on a test where they have to order items by size. I want to find out for how many they got right, relative to each other.

A recursive implementation in Scala with standard list functions (size, map, filter, dropWhile, reduce):
def longestClimb (nx: List[Int]) : List[Int] = nx match {
case Nil => Nil
case _ => {
val lix = nx.map (n => n :: longestClimb (nx.dropWhile (_ != n).filter (_ > n)))
lix.reduce ((a, b) => if (a.size > b.size) a else b)
}
}
Invocation:
scala> val nx = List (1, 2, 8, 3, 6, 4, 7, 9, 5)
scala> longestClimb (nx)
res7: List[Int] = List(1, 2, 3, 4, 7, 9)
Prosa: For an empty list, the result is the empty list and the end of the recursive process.
For the whole list, every point is tried as starting point. Let's look for example at the 6 value. For the 6 is evaluated longestClimb of nx.dropWhile (_ != 6) (which is 6, 4, 7, 9, 5) filtered for (_ > 6) which reduces the former sample to (7, 9) which results in the List (6, 7, 9).
That's hardly the longest List over all, but a candidate for a longest sublist, but since only one largest List is searched, the bias of lix.reduce ((a, b) => if (a.size > b.size) a else b) yields another list of equal length, while lix.reduce ((a, b) => if (a.size >= b.size) a else b) would have resulted in the equal length list (1, 2, 3, 6, 7, 9).
To measure, how far we get with this approach, I use a timing and an iterating function:
def timed (name: String) (f: => Any) = {
val a = System.currentTimeMillis
val res = f
val z = System.currentTimeMillis
val delta = z-a
println (name + ": " + (delta / 1000.0))
res
}
val r = util.Random
def testRandomIncreasing (max: Int) : Unit = {
(2 to max).map { i =>
val cnt = Math.pow (2, i).toInt
val l = (1 to cnt).toList
val lr = r.shuffle (l)
val s = f"2^${i}=${cnt}\t${lr}%s"
val res = timed (s) (longestClimb (lr))
println (res)
}
}
and the results are pretty interesting:
testRandomIncreasing (7)
2^2=4 List(2, 4, 3, 1): 0.0
List(2, 3)
2^3=8 List(2, 7, 5, 6, 8, 1, 4, 3): 0.0
List(2, 5, 6, 8)
2^4=16 List(15, 6, 10, 7, 1, 16, 9, 4, 13, 14, 5, 2, 8, 11, 3, 12): 0.0
List(1, 4, 5, 8, 11, 12)
2^5=32 List(1, 5, 30, 26, 27, 7, 20, 6, 29, 23, 31, 21, 22, ...10): 0.002
List(1, 5, 6, 13, 14, 15, 16, 18)
2^6=64 List(2, 57, 7, 45, 51, 49, 4, 16, 23, 21, 5, 3, 62, ... 55): 0.899
List(2, 4, 5, 9, 12, 14, 18, 19, 26, 31, 36, 41, 42, 54, 55)
2^7=128 List(16, 106, 65, 94, 84, 13, 57, 52, 117, 48, 38, ... 110): 42.195
List(13, 14, 33, 35, 37, 40, 53, 55, 58, 74, 75, 78, 97, 114, 116, 121, 123, 128)
Interesting, in that the last step from 64 to 128 values, the time increased by a factor of about 40. A former test with another random seed, lead to a factor of about 2000 and it took about 8 minutes for 2^7 values in the REPL. The test for 2^8 elements had to be interrupted, because I didn't want to wait 11 days for the result in the worst case.

Related

Get the first possible combination of numbers in a list that adds up to a certain sum

Given a list of numbers, find the first combination of numbers that adds up to a certain sum.
Example:
Given list: [1, 2, 3, 4, 5]
Given sum: 5
Response: [1, 4]
the response can also be [2, 3], it doesn't matter. What matters is that we get a combination of numbers from the given list that adds up to the given sum as fast as possible.
I tried doing this with itertools.combinations in python, but it takes way too long:
from typing import List
import itertools
def test(target_sum, numbers):
for i in range(len(numbers), 0, -1):
for seq in itertools.combinations(numbers, i):
if(sum(seq) == target_sum):
return seq
if __name__ == "__main__":
target_sum: int = 616
numbers: List[int] = [16, 96, 16, 32, 16, 4, 4, 32, 32, 10, 16, 8, 32, 8, 4, 16, 8, 8, 8, 16, 8, 8, 8, 16, 8, 16, 16, 4, 8, 8, 16, 12, 16, 16, 8, 16, 8, 8, 8, 8, 4, 32, 16, 8, 32, 16, 8, 8, 8, 8, 16, 32, 8, 32, 8, 8, 16, 24, 32, 8]
print(test(target_sum, numbers))
def subsum(tsum, numbers):
a = [0]*(tsum+1)
a[0] = -1
for x in numbers:
for i in range(tsum, x-1,-1): #reverse to avoid reusing the same item
if a[i-x] and a[i] == 0: #we can form sum i with item x and existing sum i-x
a[i] = x #remember the last item for given sum
res = []
idx = tsum
while idx:
res.append(a[idx])
idx -= a[idx]
return res
print(subsum(21, [2,3,5,7,11]))
>>>[11, 7, 3]
When the last cell is nonzero, combination does exist, and we can retrieve items.
Complexity is O(target_sum*len(numbers))

Find Top N Most Frequent Sequence of Numbers in List of a Billion Sequences

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in the i=M position(ori=M+1position, ori=M+2position, ...,i=Lposition) then we count the subsequence of lengthMwheretarget` is in the final position in the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
Important
This is super simplified but, in reality, my actual list-of-sequences
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there known algorithms for performing this kind of analysis for various combinations of N and M? I've looked at suffix trees but I'd have to roll my own custom version to even get close to what I need.
For the same dataset, I need to repeatedly query the dataset for various values or different combinations of target, N, and M (where target <= 10,000, N <= 100 and `M <= 100). How can I do this efficiently?
Extending on my comment. Here is a sketch how you could approach this using an out-of-the-box suffix array:
1) reverse and concatenate your lists with a stop symbol (I used 0 here).
[7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 5, 6, 0, 5, 4, 3, 2, 8, 9, 0, 5, 6, 12, 12, 0, 2, 4, 3, 8, 5, 0, 5, 1, 0, 6, 5, 12, 4, 1, 9, 5, 3, 8, 8, 2, 0, 2, 1, 4, 3, 7, 1, 7, 0, 1, 5, 6, 12, 12, 4, 9]
2) Build a suffix array
[53, 45, 24, 30, 12, 19, 33, 7, 32, 6, 47, 54, 51, 38, 44, 5, 46, 25, 16, 4, 15, 49, 27, 41, 37, 3, 14, 48, 26, 59, 29, 31, 40, 2, 13, 10, 20, 55, 35, 11, 1, 34, 21, 56, 52, 50, 0, 43, 28, 42, 17, 18, 39, 60, 9, 8, 23, 36, 58, 22, 57]
3) Build the LCP array. The LCP array will tell you how many numbers a suffix has in common with its neighbour in the suffix array. However, you need to stop counting when you encounter a stop symbol
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 3, 2, 2, 1, 0, 1, 1, 1, 4, 1, 2, 4, 1, 0, 1, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0]
4) When a query comes in (target = 5, M= 4) you search for the first occurence of your target in the suffix array and scan the corresponding LCP-array until the starting number of suffixes changes. Below is the part of the LCP array that corresponds to all suffixes starting with 5.
[..., 1, 1, 1, 4, 1, 2, 4, 1, 0, ...]
This tells you that there are two sequences of length 4 that occur two times. Brushing over some details using the indexes you can find the sequences and revert them back to get your final results.
Complexity
Building up the suffix array is O(n) where n is the total number of elements in all lists and O(n) space
Building the LCP array is also O(n) in both time and space
Searching a target number in the suffix is O(log n) in average
The cost of scanning through the relevant subsequences is linear in the number of times the target occurs. Which should be 1/10000 on average according to your given parameters.
The first two steps happen offline. Querying is technically O(n) (due to step 4) but with a small constant (0.0001).

Is there a way to specify a multi-step in Ruby?

I'm working with some lazy iteration, and would like to be able to specify a multiple step for this iteration. This means that I want the step to alternate between a and b. So, if I had this as a range (not lazy just for simplification)
(1..20).step(2, 4)
I would want my resulting range to be
1 # + 2 =
3 # + 4 =
7 # + 2 =
9 # + 4 =
13 # + 2 =
15 # + 4 =
19 # + 2 = 21 (out of range, STOP ITERATION)
However, I cannot figure out a way to do this. Is this at all possible in Ruby?
You could use a combination of cycle and Enumerator :
class Range
def multi_step(*steps)
a = min
Enumerator.new do |yielder|
steps.cycle do |step|
yielder << a
a += step
break if a > max
end
end
end
end
p (1..20).multi_step(2, 4).to_a
#=> [1, 3, 7, 9, 13, 15, 19]
Note that the first element is 1, because the first element of (1..20).step(2) is also 1.
It takes exclude_end? into account :
p (1...19).multi_step(2, 4).to_a
#=> [1, 3, 7, 9, 13, 15]
And can be lazy :
p (0..2).multi_step(1,-1).first(20)
#=> [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
p (0..Float::INFINITY).multi_step(*(1..100).to_a).lazy.map{|x| x*2}.first(20)
#=> [0, 2, 6, 12, 20, 30, 42, 56, 72, 90, 110, 132, 156, 182, 210, 240, 272, 306, 342, 380]
Here's a variant of FizzBuzz, which generates all the multiples of 3 or 5 but not 15 :
p (3..50).multi_step(2,1,3,1,2,6).to_a
#=> [3, 5, 6, 9, 10, 12, 18, 20, 21, 24, 25, 27, 33, 35, 36, 39, 40, 42, 48, 50]
Ruby doesn't have a built-in method for stepping with multiple values. However, if you don't actually need a lazy method, you can use Enumerable#cycle with an accumulator. For example:
range = 1..20
accum = range.min
[2, 4].cycle(range.max) { |step| accum += step; puts accum }
Alternatively, you could construct your own lazy enumerator with Enumerator::Lazy. That seems like overkill for the given example, but may be useful if you have an extremely large Range object.

Make a square multiplication table in Ruby

I got this question in an interview and got almost all the way to the answer but got stuck on the last part. If I want to get the multiplication table for 5, for instance, I want to get the output to be formatted like so:
1, 2, 3, 4, 5
2, 4, 6, 8, 10
3, 6, 9, 12, 15
4, 8, 12, 16, 20
5, 10, 15, 20, 25
My answer to this is:
def make_table(n)
s = ""
1.upto(n).each do |i|
1.upto(n).each do |j|
s += (i*j).to_s
end
s += "\n"
end
p s
end
But the output for make_table(5) is:
"12345\n246810\n3691215\n48121620\n510152025\n"
I've tried variations with array but I'm getting similar output.
What am I missing or how should I think about the last part of the problem?
You can use map and join to get a String in one line :
n = 5
puts (1..n).map { |x| (1..n).map { |y| x * y }.join(', ') }.join("\n")
It iterates over rows (x=1, x=2, ...). For each row, it iterates over cells (y=1, y=2, ...) and calculates x*y. It joins every cells in a row with ,, and joins every rows in the table with a newline :
1, 2, 3, 4, 5
2, 4, 6, 8, 10
3, 6, 9, 12, 15
4, 8, 12, 16, 20
5, 10, 15, 20, 25
If you want to keep the commas aligned, you can use rjust :
puts (1..n).map { |x| (1..n).map { |y| (x * y).to_s.rjust(3) }.join(',') }.join("\n")
It outputs :
1, 2, 3, 4, 5
2, 4, 6, 8, 10
3, 6, 9, 12, 15
4, 8, 12, 16, 20
5, 10, 15, 20, 25
You could even go fancy and calculate the width of n**2 before aligning commas :
n = 11
width = Math.log10(n**2).ceil + 1
puts (1..n).map { |x| (1..n).map { |y| (x * y).to_s.rjust(width) }.join(',') }.join("\n")
It outputs :
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22
3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33
4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44
5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55
6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66
7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77
8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88
9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99
10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110
11, 22, 33, 44, 55, 66, 77, 88, 99, 110, 121
Without spaces between the figures, the result is indeed unreadable. Have a look at the % operator, which formats strings and numbers. Instead of
s += (i*j).to_s
you could write
s += '%3d' % (i*j)
If you really want to get the output formatted in the way you explained in your posting (which I don't find that much readable), you could do a
s += "#{i*j}, "
This leaves you with two extra characters at the end of the line, which you have to remove. An alternative would be to use an array. Instead of the inner loop, you would have then something like
s += 1.upto(n).to_a.map {|j| i*j}.join(', ') + "\n"
You don't need to construct a string if you're only interested in printing the table and not returning the table(as a string).
(1..n).each do |a|
(1..n-1).each { |b| print "#{a * b}, " }
puts a * n
end
This is how I'd do it.
require 'matrix'
n = 5
puts Matrix.build(n) { |i,j| (i+1)*(j+1) }.to_a.map { |row| row.join(', ') }
1, 2, 3, 4, 5
2, 4, 6, 8, 10
3, 6, 9, 12, 15
4, 8, 12, 16, 20
5, 10, 15, 20, 25
See Matrix::build.
You can make it much shorter but here's my version.
range = Array(1..12)
range.each do |element|
range.map { |item| print "#{element * item} " } && puts
end

Spread objects evenly over multiple collections

The scenario is that there are n objects, of different sizes, unevenly spread over m buckets. The size of a bucket is the sum of all of the object sizes that it contains. It now happens that the sizes of the buckets are varying wildly.
What would be a good algorithm if I want to spread those objects evenly over those buckets so that the total size of each bucket would be about the same? It would be nice if the algorithm leaned towards less move size over a perfectly even spread.
I have this naïve, ineffective, and buggy solution in Ruby.
buckets = [ [10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2] ]
avg_size = buckets.flatten.reduce(:+) / buckets.count + 1
large_buckets = buckets.take_while {|arr| arr.reduce(:+) >= avg_size}.to_a
large_buckets.each do |large|
smallest = buckets.last
until ((small_sum = smallest.reduce(:+)) >= avg_size)
break if small_sum + large.last >= avg_size
smallest << large.pop
end
buckets.insert(0, buckets.pop)
end
=> [[3, 1, 1, 1, 2, 3], [2, 1, 2, 3, 3], [10, 4], [5, 5]]
I believe this is a variant of the bin packing problem, and as such it is NP-hard. Your answer is essentially a variant of the first fit decreasing heuristic, which is a pretty good heuristic. That said, I believe that the following will give better results.
Sort each individual bucket in descending size order, using a balanced binary tree.
Calculate average size.
Sort the buckets with size less than average (the "too-small buckets") in descending size order, using a balanced binary tree.
Sort the buckets with size greater than average (the "too-large buckets") in order of the size of their greatest elements, using a balanced binary tree (so the bucket with {9, 1} would come first and the bucket with {8, 5} would come second).
Pass1: Remove the largest element from the bucket with the largest element; if this reduces its size below the average, then replace the removed element and remove the bucket from the balanced binary tree of "too-large buckets"; else place the element in the smallest bucket, and re-index the two modified buckets to reflect the new smallest bucket and the new "too-large bucket" with the largest element. Continue iterating until you've removed all of the "too-large buckets."
Pass2: Iterate through the "too-small buckets" from smallest to largest, and select the best-fitting elements from the largest "too-large bucket" without causing it to become a "too-small bucket;" iterate through the remaining "too-large buckets" from largest to smallest, removing the best-fitting elements from them without causing them to become "too-small buckets." Do the same for the remaining "too-small buckets." The results of this variant won't be as good as they are for the more complex variant because it won't shift buckets from the "too-large" to the "too-small" category or vice versa (hence the search space will be smaller), but this also means that it has much simpler halting conditions (simply iterate through all of the "too-small" buckets and then halt), whereas the complex variant might cause an infinite loop if you're not careful.
The idea is that by moving the largest elements in Pass1 you make it easier to more precisely match up the buckets' sizes in Pass2. You use balanced binary trees so that you can quickly re-index the buckets or the trees of buckets after removing or adding an element, but you could use linked lists instead (the balanced binary trees would have better worst-case performance but the linked lists might have better average-case performance). By performing a best-fit instead of a first-fit in Pass2 you're less likely to perform useless moves (e.g. moving a size-10 object from a bucket that's 5 greater than average into a bucket that's 5 less than average - first fit would blindly perform the movie, best-fit would either query the next "too-large bucket" for a better-sized object or else would remove the "too-small bucket" from the bucket tree).
I ended up with something like this.
Sort the buckets in descending size order.
Sort each individual bucket in descending size order.
Calculate average size.
Iterate over each bucket with a size larger than average size.
Move objects in size order from those buckets to the smallest bucket until either the large bucket is smaller than average size or the target bucket reaches average size.
Ruby code example
require 'pp'
def average_size(buckets)
(buckets.flatten.reduce(:+).to_f / buckets.count + 0.5).to_i
end
def spread_evenly(buckets)
average = average_size(buckets)
large_buckets = buckets.take_while {|arr| arr.reduce(:+) >= average}.to_a
large_buckets.each do |large_bucket|
smallest_bucket = buckets.last
smallest_size = smallest_bucket.reduce(:+)
large_size = large_bucket.reduce(:+)
until (smallest_size >= average)
break if large_size <= average
if smallest_size + large_bucket.last > average and large_size > average
buckets.unshift buckets.pop
smallest_bucket = buckets.last
smallest_size = smallest_bucket.reduce(:+)
end
smallest_size += smallest_object = large_bucket.pop
large_size -= smallest_object
smallest_bucket << smallest_object
end
buckets.unshift buckets.pop if smallest_size >= average
end
buckets
end
test_buckets = [
[ [10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2] ],
[ [4, 3, 3, 2, 2, 2, 2, 1, 1], [10, 5, 3, 2, 1], [3, 3, 3], [6] ],
[ [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1] ],
[ [10, 9, 8, 7], [6, 5, 4], [3, 2], [1] ],
]
test_buckets.each do |buckets|
puts "Before spread with average of #{average_size(buckets)}:"
pp buckets
result = spread_evenly(buckets)
puts "Result and sum of each bucket:"
pp result
sizes = result.map {|bucket| bucket.reduce :+}
pp sizes
puts
end
Output:
Before spread with average of 12:
[[10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2]]
Result and sum of each bucket:
[[3, 1, 1, 4, 1, 2], [2, 1, 2, 3, 3], [10], [5, 5, 3]]
[12, 11, 10, 13]
Before spread with average of 14:
[[4, 3, 3, 2, 2, 2, 2, 1, 1], [10, 5, 3, 2, 1], [3, 3, 3], [6]]
Result and sum of each bucket:
[[3, 3, 3, 2, 3], [6, 1, 1, 2, 2, 1], [4, 3, 3, 2, 2], [10, 5]]
[14, 13, 14, 15]
Before spread with average of 4:
[[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1]]
Result and sum of each bucket:
[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
[4, 4, 4, 4, 4]
Before spread with average of 14:
[[10, 9, 8, 7], [6, 5, 4], [3, 2], [1]]
Result and sum of each bucket:
[[1, 7, 9], [10], [6, 5, 4], [3, 2, 8]]
[17, 10, 15, 13]
This isn't bin packing as others have suggested. There the size of bins is fixed and you are trying to minimize the number. Here you are trying to minimize the variance among a fixed number of bins.
It turns out this is equivalent to Multiprocessor Scheduling, and - according to the reference - the algorithm below (known as "Longest Job First" or "Longest Processing Time First") is certain to produce a largest sum no more than 4/3 - 1/(3m) times optimal, where m is the number of buckets. In the test cases shonw, we'd have 4/3-1/12 = 5/4 or no more than 25% above optimal.
We just start with all bins empty, and put each item in decreasing order of size into the currently least full bin. We can track the least full bin efficiently with a min heap. With a heap having O(log n) insert and deletemin, the algorithm has O(n log m) time (n and m defined as #Jonas Elfström says). Ruby is very expressive here: only 9 sloc for the algorithm itself.
Here is code. I am not a Ruby expert, so please feel free to suggest better ways. I am using #Jonas Elfström's test cases.
require 'algorithms'
require 'pp'
test_buckets = [
[ [10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2] ],
[ [4, 3, 3, 2, 2, 2, 2, 1, 1], [10, 5, 3, 2, 1], [3, 3, 3], [6] ],
[ [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1] ],
[ [10, 9, 8, 7], [6, 5, 4], [3, 2], [1] ],
]
def relevel(buckets)
q = Containers::PriorityQueue.new { |x, y| x < y }
# Initially all buckets to be returned are empty and so have zero sums.
rtn = Array.new(buckets.length) { [] }
buckets.each_index {|i| q.push(i, 0) }
sums = Array.new(buckets.length, 0)
# Add to emptiest bucket in descending order.
# Bang! ops would generate less garbage.
buckets.flatten.sort.reverse.each do |val|
i = q.pop # Get index of emptiest bucket
rtn[i] << val # Append current value to it
q.push(i, sums[i] += val) # Update sums and min heap
end
rtn
end
test_buckets.each {|b| pp relevel(b).map {|a| a.inject(:+) }}
Results:
[12, 11, 11, 12]
[14, 14, 14, 14]
[4, 4, 4, 4, 4]
[13, 13, 15, 14]
You could use my answer to fitting n variable height images into 3 (similar length) column layout.
Mentally map:
Object size to picture height, and
bucket count to bincount
Then the rest of that solution should apply...
The following uses the first_fit algorithm mentioned by Robin Green earlier but then improves on this by greedy swapping.
The swapping routine finds the column that is furthest away from the average column height then systematically looks for a swap between one of its pictures and the first picture in another column that minimizes the maximum deviation from the average.
I used a random sample of 30 pictures with heights in the range five to 50 'units'. The convergenge was swift in my case and improved significantly on the first_fit algorithm.
The code (Python 3.2:
def first_fit(items, bincount=3):
items = sorted(items, reverse=1) # New - improves first fit.
bins = [[] for c in range(bincount)]
binsizes = [0] * bincount
for item in items:
minbinindex = binsizes.index(min(binsizes))
bins[minbinindex].append(item)
binsizes[minbinindex] += item
average = sum(binsizes) / float(bincount)
maxdeviation = max(abs(average - bs) for bs in binsizes)
return bins, binsizes, average, maxdeviation
def swap1(columns, colsize, average, margin=0):
'See if you can do a swap to smooth the heights'
colcount = len(columns)
maxdeviation, i_a = max((abs(average - cs), i)
for i,cs in enumerate(colsize))
col_a = columns[i_a]
for pic_a in set(col_a): # use set as if same height then only do once
for i_b, col_b in enumerate(columns):
if i_a != i_b: # Not same column
for pic_b in set(col_b):
if (abs(pic_a - pic_b) > margin): # Not same heights
# new heights if swapped
new_a = colsize[i_a] - pic_a + pic_b
new_b = colsize[i_b] - pic_b + pic_a
if all(abs(average - new) < maxdeviation
for new in (new_a, new_b)):
# Better to swap (in-place)
colsize[i_a] = new_a
colsize[i_b] = new_b
columns[i_a].remove(pic_a)
columns[i_a].append(pic_b)
columns[i_b].remove(pic_b)
columns[i_b].append(pic_a)
maxdeviation = max(abs(average - cs)
for cs in colsize)
return True, maxdeviation
return False, maxdeviation
def printit(columns, colsize, average, maxdeviation):
print('columns')
pp(columns)
print('colsize:', colsize)
print('average, maxdeviation:', average, maxdeviation)
print('deviations:', [abs(average - cs) for cs in colsize])
print()
if __name__ == '__main__':
## Some data
#import random
#heights = [random.randint(5, 50) for i in range(30)]
## Here's some from the above, but 'fixed'.
from pprint import pprint as pp
heights = [45, 7, 46, 34, 12, 12, 34, 19, 17, 41,
28, 9, 37, 32, 30, 44, 17, 16, 44, 7,
23, 30, 36, 5, 40, 20, 28, 42, 8, 38]
columns, colsize, average, maxdeviation = first_fit(heights)
printit(columns, colsize, average, maxdeviation)
while 1:
swapped, maxdeviation = swap1(columns, colsize, average, maxdeviation)
printit(columns, colsize, average, maxdeviation)
if not swapped:
break
#input('Paused: ')
The output:
columns
[[45, 12, 17, 28, 32, 17, 44, 5, 40, 8, 38],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 34, 9, 37, 44, 30, 20, 28]]
colsize: [286, 267, 248]
average, maxdeviation: 267.0 19.0
deviations: [19.0, 0.0, 19.0]
columns
[[45, 12, 17, 28, 17, 44, 5, 40, 8, 38, 9],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 34, 37, 44, 30, 20, 28, 32]]
colsize: [263, 267, 271]
average, maxdeviation: 267.0 4.0
deviations: [4.0, 0.0, 4.0]
columns
[[45, 12, 17, 17, 44, 5, 40, 8, 38, 9, 34],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 37, 44, 30, 20, 28, 32, 28]]
colsize: [269, 267, 265]
average, maxdeviation: 267.0 2.0
deviations: [2.0, 0.0, 2.0]
columns
[[45, 12, 17, 17, 44, 5, 8, 38, 9, 34, 37],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 44, 30, 20, 28, 32, 28, 40]]
colsize: [266, 267, 268]
average, maxdeviation: 267.0 1.0
deviations: [1.0, 0.0, 1.0]
columns
[[45, 12, 17, 17, 44, 5, 8, 38, 9, 34, 37],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 44, 30, 20, 28, 32, 28, 40]]
colsize: [266, 267, 268]
average, maxdeviation: 267.0 1.0
deviations: [1.0, 0.0, 1.0]
Nice problem.
Heres the info on reverse-sorting mentioned in my separate comment below.
>>> h = sorted(heights, reverse=1)
>>> h
[46, 45, 44, 44, 42, 41, 40, 38, 37, 36, 34, 34, 32, 30, 30, 28, 28, 23, 20, 19, 17, 17, 16, 12, 12, 9, 8, 7, 7, 5]
>>> columns, colsize, average, maxdeviation = first_fit(h)
>>> printit(columns, colsize, average, maxdeviation)
columns
[[46, 41, 40, 34, 30, 28, 19, 12, 12, 5],
[45, 42, 38, 36, 30, 28, 17, 16, 8, 7],
[44, 44, 37, 34, 32, 23, 20, 17, 9, 7]]
colsize: [267, 267, 267]
average, maxdeviation: 267.0 0.0
deviations: [0.0, 0.0, 0.0]
If you have the reverse-sorting, this extra code appended to the bottom of the above code (in the 'if name == ...), will do extra trials on random data:
for trial in range(2,11):
print('\n## Trial %i' % trial)
heights = [random.randint(5, 50) for i in range(random.randint(5, 50))]
print('Pictures:',len(heights))
columns, colsize, average, maxdeviation = first_fit(heights)
print('average %7.3f' % average, '\nmaxdeviation:')
print('%5.2f%% = %6.3f' % ((maxdeviation * 100. / average), maxdeviation))
swapcount = 0
while maxdeviation:
swapped, maxdeviation = swap1(columns, colsize, average, maxdeviation)
if not swapped:
break
print('%5.2f%% = %6.3f' % ((maxdeviation * 100. / average), maxdeviation))
swapcount += 1
print('swaps:', swapcount)
The extra output shows the effect of the swaps:
## Trial 2
Pictures: 11
average 72.000
maxdeviation:
9.72% = 7.000
swaps: 0
## Trial 3
Pictures: 14
average 118.667
maxdeviation:
6.46% = 7.667
4.78% = 5.667
3.09% = 3.667
0.56% = 0.667
swaps: 3
## Trial 4
Pictures: 46
average 470.333
maxdeviation:
0.57% = 2.667
0.35% = 1.667
0.14% = 0.667
swaps: 2
## Trial 5
Pictures: 40
average 388.667
maxdeviation:
0.43% = 1.667
0.17% = 0.667
swaps: 1
## Trial 6
Pictures: 5
average 44.000
maxdeviation:
4.55% = 2.000
swaps: 0
## Trial 7
Pictures: 30
average 295.000
maxdeviation:
0.34% = 1.000
swaps: 0
## Trial 8
Pictures: 43
average 413.000
maxdeviation:
0.97% = 4.000
0.73% = 3.000
0.48% = 2.000
swaps: 2
## Trial 9
Pictures: 33
average 342.000
maxdeviation:
0.29% = 1.000
swaps: 0
## Trial 10
Pictures: 26
average 233.333
maxdeviation:
2.29% = 5.333
1.86% = 4.333
1.43% = 3.333
1.00% = 2.333
0.57% = 1.333
swaps: 4
Adapt the Knapsack Problem solving algorithms' by, for example, specify the "weight" of every buckets to be roughly equals to the mean of the n objects' sizes (try a gaussian distri around the mean value).
http://en.wikipedia.org/wiki/Knapsack_problem#Solving
Sort buckets in size order.
Move an object from the largest bucket into the smallest bucket, re-sorting the array (which is almost-sorted, so we can use "limited insertion sort" in both directions; you can also speed things up by noting where you placed the last two buckets to be sorted. If you have 6-6-6-6-6-6-5... and get one object from the first bucket, you will move it to the sixth position. Then on the next iteration you can start comparing from the fifth. The same goes, right-to-left, for the smallest buckets).
When the difference of the two buckets is one, you can stop.
This moves the minimum number of buckets, but is of order n^2 log n for comparisons (the simplest version is n^3 log n). If object moving is expensive while bucket size checking is not, for reasonable n it might still do:
12 7 5 2
11 7 5 3
10 7 5 4
9 7 5 5
8 7 6 5
7 7 6 6
12 7 3 1
11 7 3 2
10 7 3 3
9 7 4 3
8 7 4 4
7 7 5 4
7 6 5 5
6 6 6 5
Another possibility would be to calculate the expected average size for every bucket, and "move along" a bag (or a further bucket) with the excess from the larger buckets to the smaller ones.
Otherwise, strange things may happen:
12 7 3 1, the average is a bit less than 6, so we take 5 as the average.
5 7 3 1 bag = 7 from 1st bucket
5 5 3 1 bag = 9
5 5 5 1 bag = 7
5 5 5 8 which is a bit unbalanced.
By taking 6 (i.e. rounding) it goes better, but again sometimes it won't work:
12 5 3 1
6 5 3 1 bag = 6 from 1st bucket
6 6 3 1 bag = 5
6 6 6 1 bag = 2
6 6 6 3 which again is unbalanced.
You can run two passes, the first with the rounded mean left-to-right, the other with the truncated mean right-to-left:
12 5 3 1 we want to get no more than 6 in each bucket
6 11 3 1
6 6 8 1
6 6 6 3
6 6 6 3 and now we want to get at least 5 in each bucket
6 6 4 5 (we have taken 2 from bucket #3 into bucket #5)
6 5 5 5 (when the difference is 1 we stop).
This will require "n log n" size checks, and no more than 2n object moves.
Another possibility which is interesting is to reason thus: you have m objects into n buckets. So you need to do an integer mapping of m onto n, and this is Bresenham's linearization algorithm. Run a (n,m) Bresenham on the sorted array, and at step i (i.e. against bucket i-th) the algorithm will tell you whether to use round(m/n) or floor(m/n) size. Then move objects from or to the "moving bag" according to bucket i-th size.
This requires n log n comparisons.
You can further reduce the number of object moves by initially removing all buckets that are either round(m/n) or floor(m/n) in size to two pools of buckets sized R or F. When, running the algorithm, you need the i-th bucket to hold R objects, if the pool of R objects is not empty, swap the i-th bucket with one of the R-sized ones. This way, only buckets that are hopelessly under- or over-sized get balanced; (most of) the others are simply ignored, except for their references being shuffled.
If object access time is huge in proportion to computation time (e.g. some kind of automatic loader magazine), this will yield a magazine that is as balanced as possible, with the absolute minimum of overall object moves.
You could use an Integer Programming Package if it's fast enough.
It may be tricky getting your constraints right. Something like the following may do the trick:
let variable Oij denote Object i being in Bucket j. Let Wi represent the weight or size of Oi
Constraints:
sum(Oij for all j) == 1 #each object is in only one bucket
Oij = 1 or 0. #object is either in bucket j or not in bucket j
sum(Oij * Wi for all i) <= X + R #restrict weight on buckets.
Objective:
minimize X
Note R is the relaxation constant that you can play with depending on how much movement is required and how much performance is needed.
Now the maximum bucket size is X + R
The next step is to figure out the minimum amount movement possible whilst keeping the bucket size less than X + R
Define a Stay variable Si that controls if Oi stays in bucket j
If Si is 0 it indicates that Oi stays where it was.
Constraints:
Si = 1 or 0.
Oij = 1 or 0.
Oij <= Si where j != original bucket of Object i
Oij != Si where j == original bucket of Object i
Sum(Oij for all j) == 1
Sum(Oij for all i) <= X + R
Objective:
minimize Sum(Si for all i)
Here Sum(Si for all i) represents the number of objects that have moved.

Resources