Pandas distance matrix performance with vector data

Even though I found a few threads dealing with distance matrix efficiency, they all use either an int or a float matrix. In my case I have to deal with vectors (an OrderedDict of frequencies), and I only end up with a very slow method that is not viable for a large DataFrame (300,000 x 300,000).
How can I optimize the process?
I would be very thankful for any help, this problem has been killing me :)
Considering DataFrame df such as:
>>> df
vectors
id
1 {dict1}
2 {dict2}
3 {dict3}
4 {dict4}
where each {dict#} is an OrderedDict like:
orderedDict{event1: 1,
event2: 5,
event3: 0,
...}
A function to return the distance between two vectors:
def vectorDistance(a, b, df_vector):
    # Calculate distance between a & b
    # based on the vector from df_vector.
    return distance
[in]: vectorDistance({dict1}, {dict2})
[out]: distance
A desired Output:
1 2 3 4
id
1 0 1<->2 1<->3 1<->4
2 1<->2 0 ... ...
3 1<->3 ... 0 ...
4 1<->4 ... ... 0
(where 1<->2 is a float distance between vector 1 & 2)
Method used:
import pandas as pd
matrix = pd.concat([df, df.T], axis=1)
for index in matrix.index:
    for col in matrix.columns:
        matrix.ix[col, index] = vectorDistance(col, index, df)
>>> matrix
5072142538 5072134420 4716823618 ...
udid
5072142538 0.00000 0.01501 0.06002 ...
5072134420 0.01501 0.00000 0.09037 ...
4716823618 0.06002 0.09037 0.00000 ...
... ... ... ...
EDIT:
Minimal example
Note: The events can differ from one {dict} to another, but that's fine when they are passed to the function. My issue is more about how to pass the right a & b to fill each cell in a fast way.
I am working with cosine distance as it's rather good with vectors such as mine.
from collections import Counter
import numpy as np
import pandas as pd
from math import sqrt
raw_data = {'counters_': {4716823618: Counter({51811: 1, 51820: 1, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 8, 51853: 5, 51854: 4, 51856: 24, 51903: 11, 51904: 12, 51905: 3, 51906: 19, 51908: 230, 51922: 24, 51927: 19, 51931: 2, 106282: 9, 112830: 1, 119453: 1, 165062: 80, 168904: 3, 180354: 19, 180437: 33, 185824: 117, 186171: 14, 187101: 1, 190827: 7, 201629: 1, 209318: 37}), 5072134420: Counter({51811: 1, 51812: 1, 51820: 1, 51833: 56, 51835: 9, 51843: 49, 51848: 2, 51852: 11, 51853: 4, 51854: 4, 51856: 28, 51885: 1, 51903: 17, 51904: 17, 51905: 9, 51906: 14, 51908: 225, 51927: 29, 51931: 2, 106282: 19, 112830: 2, 168904: 9, 180354: 14, 185824: 219, 186171: 7, 187101: 1, 190827: 6, 201629: 2, 209318: 41}), 5072142538: Counter({51811: 4, 51812: 4, 51820: 4, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 6, 51853: 3, 51854: 3, 51856: 18, 51885: 1, 51903: 17, 51904: 16, 51905: 3, 51906: 24, 51908: 258, 51927: 20, 51931: 8, 106282: 16, 112830: 2, 168904: 3, 180354: 24, 185824: 180, 186171: 10, 187101: 1, 190827: 7, 201629: 2, 209318: 52})}}
def vectorDistance(index, col):
    a = dict(df[df.index == index]["counters_"].values[0])
    b = dict(df[df.index == col]["counters_"].values[0])
    return abs(np.round(1-(similarity(a,b)),5))

def scalar(collection):
    total = 0
    for coin, count in collection.items():
        total += count * count
    return sqrt(total)

def similarity(A,B):
    total = 0
    for kind in A:
        if kind in B:
            total += A[kind] * B[kind]
    return float(total) / (scalar(A) * scalar(B))

df = pd.DataFrame(raw_data)
matrix = pd.concat([df, df.T], axis=1)
matrix.drop("counters_",0,inplace=True)
matrix.drop("counters_",1,inplace=True)

for index in matrix.index:
    for col in matrix.columns:
        matrix.ix[col,index] = vectorDistance(col,index)
matrix

This is certainly more efficient and easier to read than using for loops.
df = pd.DataFrame([v for v in raw_data['counters_'].values()],
index=raw_data['counters_'].keys()).T
>>> df.head()
4716823618 5072134420 5072142538
51811 1 1 4
51812 NaN 1 4
51820 1 1 4
51833 56 56 56
51835 8 9 8
# raw_data no longer needed. Delete to reduce memory footprint.
del raw_data
# Create scalars.
scalars = ((df ** 2).sum()) ** .5
>>> scalars
4716823618 289.679133
5072134420 330.548030
5072142538 331.957829
dtype: float64
def v_dist(col_1, col_2):
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() /
                (scalars.iloc[col_1] * scalars.iloc[col_2]))
>>> v_dist(0, 1)
0.09036665882900885
>>> v_dist(0, 2)
0.060016436804916085
>>> v_dist(1, 2)
0.015009898476505357
m = pd.DataFrame(np.nan * len(df.columns), index=df.columns, columns=df.columns)
>>> m
4716823618 5072134420 5072142538
4716823618 NaN NaN NaN
5072134420 NaN NaN NaN
5072142538 NaN NaN NaN
for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1]
        if row == col:
            # No need to calculate value for diagonal.
            m.iat[row, col] = 0
        else:
            # Do two calculations in one due to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)
>>> m
4716823618 5072134420 5072142538
4716823618 0.000000 0.090367 0.060016
5072134420 0.090367 0.000000 0.015010
5072142538 0.060016 0.015010 0.000000
Wrapping all of this into a function:
def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan * len(df.columns), index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m

You don't want to store dicts inside your dataframe. Read in your dataframe using the from_dict method:
df = pd.DataFrame.from_dict(raw_data['counters_'],orient='index')
Then you can apply the numpy/scipy vectorised methods for computing cosine similarity as in What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
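For illustration, here is a minimal sketch of that vectorised route (my own example, not the answerer's code), assuming raw_data as defined in the question; scipy's pdist/squareform do all the pairwise work in compiled code:

import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Rows = ids, columns = event ids; missing events become 0.
df = pd.DataFrame.from_dict(raw_data['counters_'], orient='index').fillna(0)

# Condensed pairwise cosine distances, expanded to a full symmetric id-by-id matrix.
dist = pd.DataFrame(squareform(pdist(df.values, metric='cosine')),
                    index=df.index, columns=df.index)
print(dist.round(5))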

Related

algorithm to map weighted objects to their amount

I need an algorithm to map weighted objects to their amounts. The amount for each object should be as small as possible while keeping the ratio between the weights.
example 1:
input: object1: 40, object2: 60, object3: 80
output: object1: 2, object2: 3, object3: 4
this can be solved by dividing each object's weight by the GCD of the weights of all objects
example 2:
input: object1: 3, object2: 15
output: object1: 1, object2: 5
example 3:
input: object1: 13, object2: 97, object3: 20
output: object1: 1, object2: 7, object3: 2
example 4:
input: object1: 1, object2: 17, object3: 97
output: object1: 0, object2: 1, object3: 5
GCD is not applicable for examples 3 and 4. What algorithm can be used here? Any ideas?
limitations: range of weights 0-99, maximum sum of all amounts is 32
As I mentioned in comments, dividing by the GCD is the best you can do if you need integers that exactly preserve the ratio.
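A minimal Python sketch of that exact-ratio case (my own illustration, not part of the answer): divide every weight by the GCD of all the weights.

from functools import reduce
from math import gcd

def reduce_by_gcd(weights):
    # GCD of all weights, then scale everything down by it.
    g = reduce(gcd, weights)
    return [w // g for w in weights]

print(reduce_by_gcd([40, 60, 80]))  # [2, 3, 4]
print(reduce_by_gcd([3, 15]))       # [1, 5]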
For floats that are very close to the ratio, divide everything by the min.
Ruby example:
def f(weights)
  min_wt = weights.min
  ans = []
  weights.each do |wt|
    ans.append(wt.to_f/min_wt)
  end
  return ans
end
> f([40, 60, 80])
=> [1.0, 1.5, 2.0]
> f([13, 97, 20])
=> [1.0, 7.461538461538462, 1.5384615384615385]
Alternate approach to get integers: Check every scaling factor in your range (final sum 1-32). I'm taking 1 as the floor for each integer since dividing by 0 is undefined.
Ruby code (not beautifully written):
def f(unsorted_weights)
  weights = unsorted_weights.sort!
  orig_sum_of_wts = weights.sum
  best_error = Float::INFINITY
  best_sum_of_wts = 0
  1.upto(32) do |new_sum_of_wts|
    error = 0.0
    new_wts = []
    0.upto(weights.length - 1) do |i|
      new_wts[i] = [1, weights[i] * new_sum_of_wts / orig_sum_of_wts].max
    end
    0.upto(weights.length - 2) do |i|
      new_wt_i = weights[i] * new_sum_of_wts / orig_sum_of_wts
      (i+1).upto(weights.length - 1) do |j|
        new_wt_j = weights[j] * new_sum_of_wts / orig_sum_of_wts
        error += (new_wts[j].to_f / [new_wts[i], 1.0].max - weights[j].to_f / [weights[i], 1.0].max).abs
      end
      if error < best_error
        best_sum_of_wts = new_sum_of_wts
        best_error = error
      end
    end
  end
  ans = []
  0.upto(weights.length - 1) do |i|
    ans.append([1, weights[i] * best_sum_of_wts / orig_sum_of_wts].max)
  end
  puts "#{ans.to_s}"
end
Results:
> f([40, 60, 80])
[2, 3, 4]
> f([40, 60])
[2, 3]
> f([13, 97, 20])
[2, 3, 15]
> f([1, 17, 97])
[1, 4, 26]
For 13, 20, 97, I get 2,3,15 vs your 1,2,7.
Ratios: 20/13 = 1.538, 3/2 = 1.500, 2/1 = 2.000
97/13 = 7.462, 15/2 = 7.500, 7/1 = 7.000
97/20 = 4.850, 15/3 = 5.000, 7/2 = 3.500
Cumulative error for 2,3,15: 0.038 + 0.038 + 0.150 = 0.226
Cumulative error for 1,2,7: 0.462 + 0.462 + 1.350 = 2.274

How to get diagonal values from specific point?

Suppose I have 10x10 matrix with the following data:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 _ 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
My position is in [4][4]. How can I list the diagonal values from this position?
For example, the expected outcome would be:
[56, 67, 78, 89, 100, 1, 12, 23, 34]
[54, 63, 72, 81, 9, 18, 27, 36]
My current solution
def next?(index, row, size)
  (((row + index) % size) + 1 ) % size
end

(1...chess_size).each do |l|
  next_el, curr_el = next?(l, row, chess_size), (row + l) % chess_size
  # this gets me the first diagonal. Note that it prints out the wrong value
  tmp[0] << chess[curr_el][curr_el]
  # this gets me the values from current row below to up
  tmp[1] << chess[(row + l) % chess_size][row]
  tmp[2] << chess[-l][l]
  tmp[3] << chess[row][(row + l) % chess_size]
end
Our matrix will always have the same number of rows and columns.
Generally, to get the diagonal values from i and j, you can step i and j together until one of them goes out of bounds. Hence, the main diagonal is (i-1, j-1), (i-2, j-2), ... while i, j >= 0, and (i+1, j+1), (i+2, j+2), ... while i, j <= n. For the antidiagonal it is (i-1, j+1), (i-2, j+2), ... while i >= 0 and j <= n, and (i+1, j-1), (i+2, j-2), ... while i <= n and j >= 0.
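As a rough sketch of that index walk (a hypothetical helper, not taken from any answer below; it assumes a square matrix and does not reproduce the wrap-around ordering the question asks for):

def diagonal_values(matrix, i, j):
    n = len(matrix)
    diag, anti = [], []
    for step in range(1, n):
        # main diagonal: both indices move together; antidiagonal: they move in opposite directions
        for di, dj, acc in ((-1, -1, diag), (1, 1, diag),
                            (-1, 1, anti), (1, -1, anti)):
            r, c = i + di * step, j + dj * step
            if 0 <= r < n and 0 <= c < n:
                acc.append(matrix[r][c])
    return diag, anti

grid = [[10 * r + c + 1 for c in range(10)] for r in range(10)]
print(diagonal_values(grid, 4, 4))  # same values as the expected output, different order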
This is a solution to the Hackerrank Queen's attack problem.
Code
def count_moves(n, obs, qrow, qcol)
  qdiff = qrow-qcol
  qsum = qrow+qcol
  l = u = -1
  r = d = n
  ul = qdiff >= 0 ? qrow-qcol-1 : -1
  dr = qdiff >= 0 ? n : qrow+n-qcol
  ur = qsum < n ? -1 : qrow-n+qcol
  dl = qsum < n ? qrow+qcol+1 : n
  obs.uniq.each do |i,j|
    case i <=> qrow
    when -1 # up
      case j <=> qcol
      when -1 # up-left
        ul = [ul,i].max
      when 0 # up same col
        u = [u,i].max
      when 1 # up-right
        ur = [ur,i].max
      end
    when 0 # same row
      j < qcol ? (l = [l,j].max) : r = [r,j].min
    else # down
      case j <=> qcol
      when -1 # down-left
        dl = [dl,i].min
      when 0 # down same col
        d = [d,i].min
      when 1 # down-right
        dr = [dr,i].min
      end
    end
  end
  r + dl + d + dr - l - ul - u - ur - 8
end
Example
Suppose the chess board has 9 rows and columns, with the queen's location shown by the character q and each obstruction shown with the letter o. All other locations are represented by the letter x. We see that the queen has 16 possible moves (7 up and down, 6 left and right, 1 on the up-left to down-right diagonal and 2 on the up-right to down-left diagonal).
arr = [
%w| x x x x x x x x x |, # 0
%w| o x x x x x x x x |, # 1
%w| x o x x x x x x x |, # 2
%w| x x o x x x x x o |, # 3
%w| x x x o x x x x x |, # 4
%w| x x x x x x o x x |, # 5
%w| o o x x x q x x x |, # 6
%w| x x x x x x o x x |, # 7
%w| x x x x x o x x x | # 8
# 0 1 2 3 4 5 6 7 8
]
qrow = qcol = nil
obs = []
n = arr.size
arr.each_with_index do |a,i|
  a.each_with_index do |c,j|
    case c
    when 'o'
      obs << [i,j]
    when 'q'
      qrow = i
      qcol = j
    end
  end
end
qrow
#=> 6
qcol
#=> 5
obs
#=> [[1, 0], [2, 1], [3, 2], [3, 8], [4, 3], [5, 6], [6, 0], [6, 1], [7, 6], [8, 5]]
count_moves(n, obs, qrow, qcol)
#=> 16
Explanation
l is the largest column index of an obstruction in the queen's row that is less than the queen's column index;
r is the smallest column index of an obstruction in the queen's row that is greater than the queen's column index;
u is the largest row index of an obstruction in the queen's column that is less than the queen's row index;
d is the smallest row index of an obstruction in the queen's column that is greater than the queen's row index;
ul is the greatest row index of an obstruction on the queen's top-left to bottom-right diagonal that is less than the queen's row index;
dr is the smallest row index of an obstruction on the queen's top-left to bottom-right diagonal that is greater than the queen's row index;
ur is the greatest row index of an obstruction on the queen's top-right to bottom-left diagonal that is less than the queen's row index; and
dl is the smallest row index of an obstruction on the queen's top-right to bottom-left diagonal that is greater than the queen's row index.
For the example above, before obstructions are taken into account, these variables are set to the following values.
l = 0
r = 9
ul = 0
u = -1
ur = 2
dl = 9
d = 9
dr = 9
Note that if the queen has row and column indices qrow and qcol,
i - j = qrow - qcol for all locations [i, j] on the queen's top-left to bottom-right diagonal; and
i + j = qrow + qcol for all locations [i, j] on the queen's top-right to bottom-left diagonal
We then loop through all (unique) obstructions, determining, for each, whether it is in the queen's row, the queen's column, or one of the queen's diagonals, and then replace the value of the applicable variable with its row or column index if it is "closer" to the queen than the previously-closest location.
If, for example, the obstruction is in the queen's row and its column index j is less than the queen's column index, the following calculation is made:
l = [l, j].max
Similarly, if the obstruction is on the queen's top-left to bottom-right diagonal and its row index i is less than the queen's row index, the calculation would be:
ul = [ul, i].max
After all obstructions from the above example have been considered the variables have the following values.
l #=> 1
r #=> 9
ul #=> 4
u #=> -1
ur #=> 5
dl #=> 9
d #=> 8
dr #=> 7
Lastly, we compute the total number of squares to which the queen may move.
qcol - l - 1 + # left
r - qcol - 1 + # right
qrow - u - 1 + # up
d - qrow - 1 + # down
qrow - ul - 1 + # up-left
qrow - ur - 1 + # up-right
dl - qrow - 1 + # down-left
dr - qrow - 1 # down-right
which simplifies to
r + dl + d + dr - l - ul - u - ur - 8
#=> 9 + 9 + 8 + 7 - 1 - 4 + 1 - 5 - 8 => 16
I've applied the logic that #OmG provided. Not sure how efficient it would be.
def stackOverflow(matrixSize, *args)
  pos, obstacles = *args
  chess = (1..(matrixSize * matrixSize)).each_slice(matrixSize).to_a
  obstacles.each do |l| chess[l[0]][l[1]] = '_' end
  row, col = pos[:row] - 1, pos[:col] - 1
  chess[row][col] = '♙'
  directions = [[],[],[],[],[],[],[],[]]
  (1...matrixSize).each do |l|
    directions[0] << chess[row + l][col + l] if (row + l) < matrixSize && (col + l) < matrixSize
    directions[1] << chess[row - l][col - l] if (row - l) >= 0 && (col - l) >= 0
    directions[2] << chess[row + l][col - l] if (row + l) < matrixSize && (col - l) >= 0
    directions[3] << chess[row - l][col + l] if (row - l) >= 0 && (col + l) < matrixSize
    directions[4] << chess[row + l][col] if row + l < matrixSize
    directions[5] << chess[row - l][col] if row - l >= 0
    directions[6] << chess[row][col + l] if col + l < matrixSize
    directions[7] << chess[row][col - l] if col - l >= 0
  end
end
stackOverflow(5, 3, {row: 4, col: 3}, [[4,4],[3,1],[1,2]] )
#CarySwoveland It seems #Jamy is working on another problem from hackerrank queens-attack.
The problem is quite hard because the idea is to never create a matrix in the first place. That is, the test cases become very large, and thus the space complexity will be an issue.
I've changed my implementation, yet it still fails because of a timeout (the test cases are very large). I'm not sure how to make it performant.
Before I show the code. Let me explain what I'm trying to do using illustration:
This is our chess:
---------------------------
| 1 2 3 4 5 |
| 6 7 8 9 10 |
| 11 12 13 14 15 |
| 16 17 18 19 20 |
| 21 22 23 24 25 |
---------------------------
And this is where our queen is located: queen[2][3]
---------------------------
| 1 2 3 4 5 |
| 6 7 8 9 10 |
| 11 12 13 ♙ 15 |
| 16 17 18 19 20 |
| 21 22 23 24 25 |
---------------------------
The queen can attack all 8 directions. I.e:
horizontal(x2):
1. from queen position to left : [13, 12, 11]
2. from queen position to right : [15]
vertical(x2):
1. from queen position to top : [9, 4]
2. from queen position to bottom : [19, 24]
diagonal(x2):
1. from queen position to bottom-right : [20]
2. from queen position to top-left : [8, 2]
diagonal(x2):
1. from queen position to bottom-left : [18, 22]
2. from queen position to top-right : [10]
Because there are no obstacles within those 8 paths, the queen can make a total of 14 attacks.
Say we have some obstacles:
---------------------------
| 1 2 3 4 5 |
| 6 7 x 9 10 |
| 11 x 13 ♙ 15 |
| 16 17 18 19 x |
| 21 x 23 x 25 |
---------------------------
Now the queen can make a total of 7 attacks: [13, 18, 19, 15, 10, 9, 4]
Code
MAXI = 10 ** 5

def queens_attack(size, number_of_obstacles, queen_pos, obstacles)
  # exit the function if...
  # size is negative or more than MAXI. Note MAXI has constraint shown in hackerrank
  return if size < 0 || size > MAXI
  # the obstacles is negative or more than the MAXI
  return if number_of_obstacles < 0 || number_of_obstacles > MAXI
  # the queen's position is outside of our chess dimension
  return if queen_pos[:row] < 1 || queen_pos[:row] > size
  return if queen_pos[:col] < 1 || queen_pos[:col] > size
  # the queen's pos is the same as one of the obstacles
  return if [[queen_pos[:row], queen_pos[:col]]] - obstacles == []
  row, col = queen_pos[:row], queen_pos[:col]
  # variable to increment how many places the queen can attack
  attacks = 0
  # the queen can attack on all directions:
  # horizontals, verticals and both diagonals. So let us create pointers
  # for each direction. Once the obstacle exists in the path, make the
  # pointer[i] set to true
  pointers = Array.new(8, false)
  (1..size).lazy.each do |i|
    # this is the diagonal from queen's pos to bottom-right
    if row + i <= size && col + i <= size && !pointers[0]
      # set it to true if there is no obstacle in the current [row + i, col + i]
      pointers[0] = true unless [[row + i, col + i]] - obstacles != []
      # now we know the queen can attack this pos
      attacks += 1 unless pointers[0]
    end
    # this is diagonal from queen's pos to top-left
    if row - i > 0 && col - i > 0 && !pointers[1]
      # set it to true if there is no obstacle in the current [row - i, col - i]
      pointers[1] = true unless [[row - i, col - i]] - obstacles != []
      # now we know the queen can attack this pos
      attacks += 1 unless pointers[1]
    end
    # this is diagonal from queen's pos to bottom-left
    if row + i <= size && col - i > 0 && !pointers[2]
      pointers[2] = true unless [[row + i, col - i]] - obstacles != []
      attacks += 1 unless pointers[2]
    end
    # this is diagonal from queen's pos to top-right
    if row - i > 0 && col + i <= size && !pointers[3]
      pointers[3] = true unless [[row - i, col + i]] - obstacles != []
      attacks += 1 unless pointers[3]
    end
    # this is vertical from queen's pos to bottom
    if row + i <= size && !pointers[4]
      pointers[4] = true unless [[row + i, col]] - obstacles != []
      attacks += 1 unless pointers[4]
    end
    # this is vertical from queen's pos to top
    if row - i > 0 && !pointers[5]
      pointers[5] = true unless [[row - i, col]] - obstacles != []
      attacks += 1 unless pointers[5]
    end
    # this is horizontal from queen's pos to right
    if col + i <= size && !pointers[6]
      pointers[6] = true unless [[row, col + i]] - obstacles != []
      attacks += 1 unless pointers[6]
    end
    # this is horizontal from queen's pos to left
    if col - i > 0 && !pointers[7]
      pointers[7] = true unless [[row, col - i]] - obstacles != []
      attacks += 1 unless pointers[7]
    end
  end
  p attacks
end
Problem
Now the problem is that I don't know why my code gets a timeout error on HackerRank. I know it's because of the test cases, where the dimension of the chess board can be 10,000 x 10,000, but I don't know what constraint I'm missing.
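For reference, here is a hedged Python paraphrase (my own sketch, not the OP's code nor any answer verbatim) of the obstruction-based counting used in count_moves above. It never builds the board and only loops over the obstacle list once, which is what avoids the timeout on huge boards; coordinates are 1-based as on HackerRank.

def queens_attack(n, obstacles, qrow, qcol):
    # Free squares to the board edge in each of the 8 directions.
    reach = {
        (0, 1): n - qcol,                 (0, -1): qcol - 1,
        (1, 0): n - qrow,                 (-1, 0): qrow - 1,
        (1, 1): min(n - qrow, n - qcol),  (1, -1): min(n - qrow, qcol - 1),
        (-1, 1): min(qrow - 1, n - qcol), (-1, -1): min(qrow - 1, qcol - 1),
    }
    # Each obstacle on a queen line shrinks the reach of its direction.
    for orow, ocol in obstacles:
        dr, dc = orow - qrow, ocol - qcol
        if (dr, dc) == (0, 0):
            continue
        if dr == 0 or dc == 0 or abs(dr) == abs(dc):
            key = (0 if dr == 0 else dr // abs(dr),
                   0 if dc == 0 else dc // abs(dc))
            reach[key] = min(reach[key], max(abs(dr), abs(dc)) - 1)
    return sum(reach.values())

# The 5x5 illustration above: queen at row 3, col 4, obstacles as drawn.
print(queens_attack(5, [(2, 3), (3, 2), (4, 5), (5, 2), (5, 4)], 3, 4))  # 7
print(queens_attack(5, [], 3, 4))                                        # 14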
I've just learned from a comment posted by the OP that I've solved the wrong problem, despite the fact that the OP's question seems quite clear, especially the example, and is consistent with my interpretation. I will leave this solution to the following problem: given an array arr such that Matrix(*arr) is an NxM matrix, and a matrix location i,j, return an array [d,a], where d and a are the elements on the diagonal and antidiagonal that pass through [i,j] but do not include [i,j], each rotated so that the row index of its first element is i+1 if i < arr.size-1 and 0 otherwise.
Code
def diagonals(arr, row_idx, col_idx)
  ncols = arr.first.size
  sum_idx = row_idx+col_idx
  diff_idx = row_idx-col_idx
  a = Array.new(arr.size * arr.first.size) { |i| i.divmod(ncols) } - [[row_idx, col_idx]]
  [a.select { |r,c| r-c == diff_idx }, a.select { |r,c| r+c == sum_idx }].
    map do |b|
      b.sort_by { |r,_| [r > row_idx ? 0 : 1, r] }.
        map { |r,c| arr[r][c] }
    end
end
All elements of the array arr must be the same size, but there is no requirement that arr.size == arr.first.size.
Example
arr = [
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
[71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
[81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
]
diagonals(arr, 4, 4)
#=> [[56, 67, 78, 89, 100, 1, 12, 23, 34],
# [54, 63, 72, 81, 9, 18, 27, 36]]
Explanation
Suppose
arr = (1..16).each_slice(4).to_a
#=> [[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [13, 14, 15, 16]]
row_idx = 2
col_idx = 1
The steps are as follows.
a = Array.new(arr.size) { |i| Array.new(arr.first.size) { |j| [i,j] } }
#=> [[[0, 0], [0, 1], [0, 2], [0, 3]],
# [[1, 0], [1, 1], [1, 2], [1, 3]],
# [[2, 0], [2, 1], [2, 2], [2, 3]],
# [[3, 0], [3, 1], [3, 2], [3, 3]]]
ncols = arr.first.size
#=> 4
sum_idx = row_idx+col_idx
#=> 3
diff_idx = row_idx-col_idx
#=> 1
a = Array.new(arr.size * arr.first.size) { |i| i.divmod(ncols) } - [[row_idx, col_idx]]
#=> [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0], [1, 1], [1, 2], [1, 3],
# [2, 0], [2, 2], [2, 3], [3, 0], [3, 1], [3, 2], [3, 3]]
Select and sort the locations [r, c] on the top-left to bottom-right diagonal that passes through [row_idx, col_idx].
b = a.select { |r,c| r-c == diff_idx }
#=> [[1, 0], [3, 2]]
c = b.sort_by { |r,_| [r > row_idx ? 0:1 , r] }
#=> [[3, 2], [1, 0]]
Select and sort the locations [r, c] on the top-right to bottom-left diagonal that passes through [row_idx, col_idx].
d = a.select { |r,c| r+c == sum_idx }
#=> [[0, 3], [1, 2], [3, 0]]
e = d.sort_by { |r,c| [r > row_idx ? 0:1 , r] }
#=> [[3, 0], [0, 3], [1, 2]]
[c, e].map { |f| f.map { |r,c| arr[r][c] } }
#=> [[15, 5], [13, 4, 7]]
The following approach uses methods from the Matrix class.
Code
require 'matrix'
def diagonals(arr, row_idx, col_idx)
  [diag(arr, row_idx, col_idx),
   diag(arr.map(&:reverse).transpose, arr.first.size-1-col_idx, row_idx)]
end

def diag(arr, row_idx, col_idx)
  nrows, ncols = arr.size, arr.first.size
  lr = [ncols-col_idx, nrows-row_idx].min - 1
  ul = [col_idx, row_idx].min
  m = Matrix[*arr]
  [*m.minor(row_idx+1, lr, col_idx+1, lr).each(:diagonal).to_a,
   *m.minor(row_idx-ul, ul, col_idx-ul, ul).each(:diagonal).to_a]
end
end
Example
arr = [
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
[71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
[81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
]
diagonals arr, 4, 4
#=> [[56, 67, 78, 89, 100, 1, 12, 23, 34], [54, 63, 72, 81, 9, 18, 27, 36]]
diagonals arr, 4, 5
#=> [[57, 68, 79, 90, 2, 13, 24, 35], [55, 64, 73, 82, 91, 10, 19, 28, 37]]
diagonals arr, 0, 9
#=> [[], [19, 28, 37, 46, 55, 64, 73, 82, 91]]
Explanation
Suppose the array and target location were as follows.
arr = (1..30).each_slice(6).to_a
#=> [[ 1, 2, 3, 4, 5, 6],
# [ 7, 8, 9, 10, 11, 12],
# [13, 14, 15, 16, 17, 18],
# [19, 20, 21, 22, 23, 24],
# [25, 26, 27, 28, 29, 30]]
row_idx = 2
col_idx = 3
Note arr[2][3] #=> 16. We obtain the diagonal with negative slope by computing the diagonals of two matrix minors:
[[23, 24],
[29, 30]]
and
[[2, 3],
[8, 9]]
giving us
[*[23, 30], *[2, 9]]
#=> [23, 30, 2, 9]
To obtain the other diagonal we rotate the array anti-clockwise 90 degrees, adjust row_idx and col_idx and repeat the above procedure.
arr.map(&:reverse).transpose
#=> [[6, 12, 18, 24, 30],
# [5, 11, 17, 23, 29],
# [4, 10, 16, 22, 28],
# [3, 9, 15, 21, 27],
# [2, 8, 14, 20, 26],
# [1, 7, 13, 19, 25]]
ncols = arr.first.size
#=> 6
row_idx, col_idx = ncols-1-col_idx, row_idx
#=> [2, 2]
We now extract the diagonals from the matrix minors
[[21, 27],
[20, 26]]
and
[[6, 12],
[5, 11]]
to obtain the second diagonal:
[21, 26, 6, 11]
def possible_moves(val):
    # val is a value between 0 and n*n-1
    for i in range(n*n):
        if i == val:
            board[i // n][i % n] = 'Q'
            continue
        # mark row and column with a dot
        if i % n == val % n or i // n == val // n:
            board[i//n][i%n] = '.'
        # mark diagonals with a dot
        if i % (n + 1) == val % (n + 1) and abs(i % n - val % n) == abs(i // n - val // n):
            board[i//n][i%n] = '.'
        if i % (n - 1) == val % (n - 1) and abs(i % n - val % n) == abs(i // n - val // n):
            board[i//n][i%n] = '.'

n = 10  # board size = n x n
board = [['0' for x in range(n)] for y in range(n)]  # initialize board with '0' in every row and col
possible_moves(40)
At the end you will have a 'Q' where the queen is positioned, '0' where the queen cannot move, and '.' where she can move.

Parallel Computing - Shuffle

I am looking to shuffle an array in parallel. I have found that doing an algorithm similar to bitonic sort, but with a random (50/50) re-order, results in an equal distribution, but only if the array length is a power of 2. I've considered the Fisher-Yates shuffle, but I can't see how I could parallelize it in order to avoid O(N) computations.
Any advice?
Thanks!
There's a good, clear, recent paper on this here, and the references, especially Shun et al. 2015, are worth a read.
But basically you can do this using the same sort of approach that's used in sort -R: shuffle by giving each row a random key value and sorting on that key. And there are lots of ways to do good parallel distributed sort.
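As a tiny sequential illustration of that idea (an assumed example, not part of the answer): tag each item with a random key and sort on the key.

import random

def shuffle_by_key(items):
    # Pair every item with a random key, sort by the key, drop the key.
    return [item for _, item in sorted((random.random(), item) for item in items)]

print(shuffle_by_key(list(range(10))))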
Here's a basic version in python + MPI using an odd-even sort; it goes through P communication steps if P is the number of processors. You can do better than that, but this is pretty simple to understand; it's discussed in this question.
from __future__ import print_function
import sys
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD

def exchange(localdata, sendrank, recvrank):
    """
    Perform a merge-exchange with a neighbour;
    sendrank sends local data to recvrank,
    which merge-sorts it, and then sends lower
    data back to the lower-ranked process and
    keeps upper data
    """
    rank = comm.Get_rank()
    assert rank == sendrank or rank == recvrank
    assert sendrank < recvrank

    if rank == sendrank:
        comm.send(localdata, dest=recvrank)
        newdata = comm.recv(source=recvrank)
    else:
        bothdata = list(localdata)
        otherdata = comm.recv(source=sendrank)
        bothdata = bothdata + otherdata
        bothdata.sort()
        comm.send(bothdata[:len(otherdata)], dest=sendrank)
        newdata = bothdata[len(otherdata):]
    return newdata

def print_by_rank(data, rank, nprocs):
    """ crudely attempt to print data coherently """
    for proc in range(nprocs):
        if proc == rank:
            print(str(rank)+": "+str(data))
        comm.barrier()
    return

def odd_even_sort(data):
    rank = comm.Get_rank()
    nprocs = comm.Get_size()
    data.sort()
    for step in range(1, nprocs+1):
        if ((rank + step) % 2) == 0:
            if rank < nprocs - 1:
                data = exchange(data, rank, rank+1)
        elif rank > 0:
            data = exchange(data, rank-1, rank)
    return data

def main():
    # everyone get their data
    rank = comm.Get_rank()
    nprocs = comm.Get_size()
    n_per_proc = 5
    data = list(range(n_per_proc*rank, n_per_proc*(rank+1)))

    if rank == 0:
        print("Original:")
    print_by_rank(data, rank, nprocs)

    # tag your data with random values
    data = [(random.random(), item) for item in data]

    # now sort it by these random tags
    data = odd_even_sort(data)

    if rank == 0:
        print("Shuffled:")
    print_by_rank([x for _, x in data], rank, nprocs)

    return 0

if __name__ == "__main__":
    sys.exit(main())
Running gives:
$ mpirun -np 5 python mergesort_shuffle.py
Original:
0: [0, 1, 2, 3, 4]
1: [5, 6, 7, 8, 9]
2: [10, 11, 12, 13, 14]
3: [15, 16, 17, 18, 19]
4: [20, 21, 22, 23, 24]
Shuffled:
0: [19, 17, 4, 20, 9]
1: [23, 12, 3, 2, 8]
2: [14, 6, 13, 15, 1]
3: [11, 0, 22, 16, 18]
4: [5, 10, 21, 7, 24]

Ruby, Linearize an array with a sorted number series by adding items in order to keep the differentials lower or equal to X

I need a function that transforms an array of integers sorted in descending order, not allowing any integer in position i to be more than X times greater than the one in position i+1, by adding one or more elements in between, and keeping the original numbers intact.
The resulting sorted array must meet the criterion:
array[i] <= array[i+1] * x
for every i.
Examples:
x = 1.5
Transformation over a
a = [5, 3]
func(a, x) = [5,4,3]
a[0] > a[1]*1.5, so func adds 4 = (a[0].to_f/1.5).ceil and sorts a
a is now [5,4,3]
Transformation over b
b = [50, 4]
func(b, x) = [50, 34, 23, 16, 11, 8, 6, 4]
b[0] > b[1]*1.5, so func adds 34 = (b[0].to_f/1.5).ceil and sorts b
b is now [50,34,4]
b[1] > b[2]*1.5, so func adds 23 = (b[1].to_f/1.5).ceil and sorts b
b is now [50,34,23,4]
b[2] > b[3]*1.5, so func adds 16 = (b[2].to_f/1.5).ceil and sorts b
b is now [50,34,23,16,4]
b[3] > b[4]*1.5, so func adds 11 = (b[3].to_f/1.5).ceil and sorts b
b is now [50,34,23,16,11,4]
b[4] > b[5]*1.5, so func adds 8 = (b[4].to_f/1.5).ceil and sorts b
b is now [50,34,23,16,11,8,4]
b[5] > b[6]*1.5, so func adds 6 = (b[5].to_f/1.5).ceil and sorts b
b is now [50,34,23,16,11,8,6,4]
func returns [50, 34, 23, 16, 11, 8, 6, 4]
Transformation over c
c = [50, 20, 10, 4, 3, 2]
func(c, x) = [50, 34, 23, 20, 14, 10, 7, 5, 4, 3, 2]
c[0] > c[1]*1.5, so func adds 34 = (c[0].to_f/1.5).ceil and sorts c
c is now [50,34,20,10,4,3,2]
c[1] > c[2]*1.5, so func adds 23 = (c[1].to_f/1.5).ceil and sorts c
c is now [50,34,23,20,10,4,3,2]
c[3] > c[4]*1.5, so func adds 14 = (c[3].to_f/1.5).ceil and sorts c
c is now [50,34,23,20,14,10,4,3,2]
c[5] > c[6]*1.5, so func adds 7 = (c[5].to_f/1.5).ceil and sorts c
c is now [50,34,23,20,14,10,7,4,3,2]
c[6] > c[7]*1.5, so func adds 5 = (c[6].to_f/1.5).ceil and sorts c
c is now [50,34,23,20,14,10,7,5,4,3,2]
func returns [50, 34, 23, 20, 14, 10, 7, 5, 4, 3, 2]
How can this be done in a functional and clean way?
A pure functional way:
def func(a, x, i = 0)
  if i == a.size - 1
    a
  else
    if a[i] <= a[i + 1] * x
      func a, x, i + 1
    else
      func a.take(i + 1) + [(a[i].to_f / x).ceil] + a.drop(i + 1), x, i + 1
    end
  end
end
I'm getting the exact same output as your first and third examples, but not for the second -- your sample output seems to be incorrect.
Test:
p func [5, 3], 1.5
p func [50, 4], 1.5
p func [50, 20, 10, 4, 3, 2], 1.5
Output:
[5, 4, 3]
[50, 34, 23, 16, 11, 8, 6, 4]
[50, 34, 23, 20, 14, 10, 7, 5, 4, 3, 2]
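For comparison, a hedged Python sketch of the same idea (my own paraphrase, iterative rather than recursive): walk the array and insert ceil(a[i] / x) whenever a[i] > a[i+1] * x.

from math import ceil

def linearize(a, x):
    out = list(a)
    i = 0
    while i < len(out) - 1:
        # Insert the interpolated value right after position i when the gap is too big.
        if out[i] > out[i + 1] * x:
            out.insert(i + 1, ceil(out[i] / x))
        i += 1
    return out

print(linearize([5, 3], 1.5))                 # [5, 4, 3]
print(linearize([50, 4], 1.5))                # [50, 34, 23, 16, 11, 8, 6, 4]
print(linearize([50, 20, 10, 4, 3, 2], 1.5))  # [50, 34, 23, 20, 14, 10, 7, 5, 4, 3, 2]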
This may be useful. It divides intervals geometrically so that each subdivision has (as near as is possible) the same multiplier as the others, instead of using 1.5 for all but the last and then whatever's left over.
include Math
def geometric_interpolation(arr, ratio)
  log_ratio = log(ratio)
  result = []
  arr.each_cons(2) do |pair|
    logs = pair.map { |x| log(x) }
    log_interval = logs[0] - logs[1]
    num = (log_interval / log_ratio).round(12).ceil
    result += [ pair[0] ] + (1...num).map { |n| exp(logs[0] - log_interval * n / num).round }
  end
  result + [ arr[-1] ]
end
a = [5, 3]
b = [50, 4]
c = [50, 20, 10, 4, 3, 2]
p geometric_interpolation(a, 1.5)
p geometric_interpolation(b, 1.5)
p geometric_interpolation(c, 1.5)
output
[5, 4, 3]
[50, 35, 24, 17, 12, 8, 6, 4]
[50, 37, 27, 20, 14, 10, 7, 5, 4, 3, 2]

Spread objects evenly over multiple collections

The scenario is that there are n objects, of different sizes, unevenly spread over m buckets. The size of a bucket is the sum of all of the object sizes that it contains. It now happens that the sizes of the buckets are varying wildly.
What would be a good algorithm if I want to spread those objects evenly over those buckets so that the total size of each bucket would be about the same? It would be nice if the algorithm leaned towards less move size over a perfectly even spread.
I have this naïve, ineffective, and buggy solution in Ruby.
buckets = [ [10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2] ]
avg_size = buckets.flatten.reduce(:+) / buckets.count + 1
large_buckets = buckets.take_while {|arr| arr.reduce(:+) >= avg_size}.to_a
large_buckets.each do |large|
  smallest = buckets.last
  until ((small_sum = smallest.reduce(:+)) >= avg_size)
    break if small_sum + large.last >= avg_size
    smallest << large.pop
  end
  buckets.insert(0, buckets.pop)
end
=> [[3, 1, 1, 1, 2, 3], [2, 1, 2, 3, 3], [10, 4], [5, 5]]
I believe this is a variant of the bin packing problem, and as such it is NP-hard. Your answer is essentially a variant of the first fit decreasing heuristic, which is a pretty good heuristic. That said, I believe that the following will give better results.
Sort each individual bucket in descending size order, using a balanced binary tree.
Calculate average size.
Sort the buckets with size less than average (the "too-small buckets") in descending size order, using a balanced binary tree.
Sort the buckets with size greater than average (the "too-large buckets") in order of the size of their greatest elements, using a balanced binary tree (so the bucket with {9, 1} would come first and the bucket with {8, 5} would come second).
Pass1: Remove the largest element from the bucket with the largest element; if this reduces its size below the average, then replace the removed element and remove the bucket from the balanced binary tree of "too-large buckets"; else place the element in the smallest bucket, and re-index the two modified buckets to reflect the new smallest bucket and the new "too-large bucket" with the largest element. Continue iterating until you've removed all of the "too-large buckets."
Pass2: Iterate through the "too-small buckets" from smallest to largest, and select the best-fitting elements from the largest "too-large bucket" without causing it to become a "too-small bucket;" iterate through the remaining "too-large buckets" from largest to smallest, removing the best-fitting elements from them without causing them to become "too-small buckets." Do the same for the remaining "too-small buckets." The results of this variant won't be as good as they are for the more complex variant because it won't shift buckets from the "too-large" to the "too-small" category or vice versa (hence the search space will be smaller), but this also means that it has much simpler halting conditions (simply iterate through all of the "too-small" buckets and then halt), whereas the complex variant might cause an infinite loop if you're not careful.
The idea is that by moving the largest elements in Pass1 you make it easier to more precisely match up the buckets' sizes in Pass2. You use balanced binary trees so that you can quickly re-index the buckets or the trees of buckets after removing or adding an element, but you could use linked lists instead (the balanced binary trees would have better worst-case performance but the linked lists might have better average-case performance). By performing a best-fit instead of a first-fit in Pass2 you're less likely to perform useless moves (e.g. moving a size-10 object from a bucket that's 5 greater than average into a bucket that's 5 less than average - first fit would blindly perform the move, best-fit would either query the next "too-large bucket" for a better-sized object or else would remove the "too-small bucket" from the bucket tree).
I ended up with something like this.
Sort the buckets in descending size order.
Sort each individual bucket in descending size order.
Calculate average size.
Iterate over each bucket with a size larger than average size.
Move objects in size order from those buckets to the smallest bucket until either the large bucket is smaller than average size or the target bucket reaches average size.
Ruby code example
require 'pp'
def average_size(buckets)
  (buckets.flatten.reduce(:+).to_f / buckets.count + 0.5).to_i
end

def spread_evenly(buckets)
  average = average_size(buckets)
  large_buckets = buckets.take_while {|arr| arr.reduce(:+) >= average}.to_a
  large_buckets.each do |large_bucket|
    smallest_bucket = buckets.last
    smallest_size = smallest_bucket.reduce(:+)
    large_size = large_bucket.reduce(:+)
    until (smallest_size >= average)
      break if large_size <= average
      if smallest_size + large_bucket.last > average and large_size > average
        buckets.unshift buckets.pop
        smallest_bucket = buckets.last
        smallest_size = smallest_bucket.reduce(:+)
      end
      smallest_size += smallest_object = large_bucket.pop
      large_size -= smallest_object
      smallest_bucket << smallest_object
    end
    buckets.unshift buckets.pop if smallest_size >= average
  end
  buckets
end

test_buckets = [
  [ [10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2] ],
  [ [4, 3, 3, 2, 2, 2, 2, 1, 1], [10, 5, 3, 2, 1], [3, 3, 3], [6] ],
  [ [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1] ],
  [ [10, 9, 8, 7], [6, 5, 4], [3, 2], [1] ],
]

test_buckets.each do |buckets|
  puts "Before spread with average of #{average_size(buckets)}:"
  pp buckets
  result = spread_evenly(buckets)
  puts "Result and sum of each bucket:"
  pp result
  sizes = result.map {|bucket| bucket.reduce :+}
  pp sizes
  puts
end
Output:
Before spread with average of 12:
[[10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2]]
Result and sum of each bucket:
[[3, 1, 1, 4, 1, 2], [2, 1, 2, 3, 3], [10], [5, 5, 3]]
[12, 11, 10, 13]
Before spread with average of 14:
[[4, 3, 3, 2, 2, 2, 2, 1, 1], [10, 5, 3, 2, 1], [3, 3, 3], [6]]
Result and sum of each bucket:
[[3, 3, 3, 2, 3], [6, 1, 1, 2, 2, 1], [4, 3, 3, 2, 2], [10, 5]]
[14, 13, 14, 15]
Before spread with average of 4:
[[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1]]
Result and sum of each bucket:
[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
[4, 4, 4, 4, 4]
Before spread with average of 14:
[[10, 9, 8, 7], [6, 5, 4], [3, 2], [1]]
Result and sum of each bucket:
[[1, 7, 9], [10], [6, 5, 4], [3, 2, 8]]
[17, 10, 15, 13]
This isn't bin packing as others have suggested. There the size of bins is fixed and you are trying to minimize the number. Here you are trying to minimize the variance among a fixed number of bins.
It turns out this is equivalent to Multiprocessor Scheduling, and - according to the reference - the algorithm below (known as "Longest Job First" or "Longest Processing Time First") is certain to produce a largest sum no more than 4/3 - 1/(3m) times optimal, where m is the number of buckets. In the test cases shown, we'd have 4/3 - 1/12 = 5/4, or no more than 25% above optimal.
We just start with all bins empty, and put each item in decreasing order of size into the currently least full bin. We can track the least full bin efficiently with a min heap. With a heap having O(log n) insert and deletemin, the algorithm has O(n log m) time (n and m defined as #Jonas Elfström says). Ruby is very expressive here: only 9 sloc for the algorithm itself.
Here is code. I am not a Ruby expert, so please feel free to suggest better ways. I am using #Jonas Elfström's test cases.
require 'algorithms'
require 'pp'
test_buckets = [
[ [10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2] ],
[ [4, 3, 3, 2, 2, 2, 2, 1, 1], [10, 5, 3, 2, 1], [3, 3, 3], [6] ],
[ [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1] ],
[ [10, 9, 8, 7], [6, 5, 4], [3, 2], [1] ],
]
def relevel(buckets)
  q = Containers::PriorityQueue.new { |x, y| x < y }

  # Initially all buckets to be returned are empty and so have zero sums.
  rtn = Array.new(buckets.length) { [] }
  buckets.each_index {|i| q.push(i, 0) }
  sums = Array.new(buckets.length, 0)

  # Add to emptiest bucket in descending order.
  # Bang! ops would generate less garbage.
  buckets.flatten.sort.reverse.each do |val|
    i = q.pop                  # Get index of emptiest bucket
    rtn[i] << val              # Append current value to it
    q.push(i, sums[i] += val)  # Update sums and min heap
  end
  rtn
end
test_buckets.each {|b| pp relevel(b).map {|a| a.inject(:+) }}
Results:
[12, 11, 11, 12]
[14, 14, 14, 14]
[4, 4, 4, 4, 4]
[13, 13, 15, 14]
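For comparison, a hedged Python sketch of the same Longest-Processing-Time idea, using heapq as the min heap over (current sum, bucket index); the resulting sums have the same balance as the Ruby version, possibly in a different order.

import heapq

def relevel(buckets):
    heap = [(0, i) for i in range(len(buckets))]   # (current sum, bucket index)
    heapq.heapify(heap)
    out = [[] for _ in buckets]
    for val in sorted((v for b in buckets for v in b), reverse=True):
        total, i = heapq.heappop(heap)             # emptiest bucket so far
        out[i].append(val)
        heapq.heappush(heap, (total + val, i))
    return out

buckets = [[10, 4, 3, 3, 2, 1], [5, 5, 3, 2, 1], [3, 1, 1], [2]]
print([sum(b) for b in relevel(buckets)])          # [12, 12, 11, 11]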
You could use my answer to fitting n variable height images into 3 (similar length) column layout.
Mentally map:
Object size to picture height, and
bucket count to bincount
Then the rest of that solution should apply...
The following uses the first_fit algorithm mentioned by Robin Green earlier but then improves on this by greedy swapping.
The swapping routine finds the column that is furthest away from the average column height then systematically looks for a swap between one of its pictures and the first picture in another column that minimizes the maximum deviation from the average.
I used a random sample of 30 pictures with heights in the range five to 50 'units'. The convergence was swift in my case and improved significantly on the first_fit algorithm.
The code (Python 3.2):
def first_fit(items, bincount=3):
    items = sorted(items, reverse=1)  # New - improves first fit.
    bins = [[] for c in range(bincount)]
    binsizes = [0] * bincount
    for item in items:
        minbinindex = binsizes.index(min(binsizes))
        bins[minbinindex].append(item)
        binsizes[minbinindex] += item
    average = sum(binsizes) / float(bincount)
    maxdeviation = max(abs(average - bs) for bs in binsizes)
    return bins, binsizes, average, maxdeviation

def swap1(columns, colsize, average, margin=0):
    'See if you can do a swap to smooth the heights'
    colcount = len(columns)
    maxdeviation, i_a = max((abs(average - cs), i)
                            for i,cs in enumerate(colsize))
    col_a = columns[i_a]
    for pic_a in set(col_a):  # use set as if same height then only do once
        for i_b, col_b in enumerate(columns):
            if i_a != i_b:  # Not same column
                for pic_b in set(col_b):
                    if (abs(pic_a - pic_b) > margin):  # Not same heights
                        # new heights if swapped
                        new_a = colsize[i_a] - pic_a + pic_b
                        new_b = colsize[i_b] - pic_b + pic_a
                        if all(abs(average - new) < maxdeviation
                               for new in (new_a, new_b)):
                            # Better to swap (in-place)
                            colsize[i_a] = new_a
                            colsize[i_b] = new_b
                            columns[i_a].remove(pic_a)
                            columns[i_a].append(pic_b)
                            columns[i_b].remove(pic_b)
                            columns[i_b].append(pic_a)
                            maxdeviation = max(abs(average - cs)
                                               for cs in colsize)
                            return True, maxdeviation
    return False, maxdeviation

def printit(columns, colsize, average, maxdeviation):
    print('columns')
    pp(columns)
    print('colsize:', colsize)
    print('average, maxdeviation:', average, maxdeviation)
    print('deviations:', [abs(average - cs) for cs in colsize])
    print()

if __name__ == '__main__':
    ## Some data
    #import random
    #heights = [random.randint(5, 50) for i in range(30)]
    ## Here's some from the above, but 'fixed'.
    from pprint import pprint as pp
    heights = [45, 7, 46, 34, 12, 12, 34, 19, 17, 41,
               28, 9, 37, 32, 30, 44, 17, 16, 44, 7,
               23, 30, 36, 5, 40, 20, 28, 42, 8, 38]

    columns, colsize, average, maxdeviation = first_fit(heights)
    printit(columns, colsize, average, maxdeviation)
    while 1:
        swapped, maxdeviation = swap1(columns, colsize, average, maxdeviation)
        printit(columns, colsize, average, maxdeviation)
        if not swapped:
            break
        #input('Paused: ')
The output:
columns
[[45, 12, 17, 28, 32, 17, 44, 5, 40, 8, 38],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 34, 9, 37, 44, 30, 20, 28]]
colsize: [286, 267, 248]
average, maxdeviation: 267.0 19.0
deviations: [19.0, 0.0, 19.0]
columns
[[45, 12, 17, 28, 17, 44, 5, 40, 8, 38, 9],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 34, 37, 44, 30, 20, 28, 32]]
colsize: [263, 267, 271]
average, maxdeviation: 267.0 4.0
deviations: [4.0, 0.0, 4.0]
columns
[[45, 12, 17, 17, 44, 5, 40, 8, 38, 9, 34],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 37, 44, 30, 20, 28, 32, 28]]
colsize: [269, 267, 265]
average, maxdeviation: 267.0 2.0
deviations: [2.0, 0.0, 2.0]
columns
[[45, 12, 17, 17, 44, 5, 8, 38, 9, 34, 37],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 44, 30, 20, 28, 32, 28, 40]]
colsize: [266, 267, 268]
average, maxdeviation: 267.0 1.0
deviations: [1.0, 0.0, 1.0]
columns
[[45, 12, 17, 17, 44, 5, 8, 38, 9, 34, 37],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 44, 30, 20, 28, 32, 28, 40]]
colsize: [266, 267, 268]
average, maxdeviation: 267.0 1.0
deviations: [1.0, 0.0, 1.0]
Nice problem.
Here's the info on the reverse-sorting mentioned in my separate comment below.
>>> h = sorted(heights, reverse=1)
>>> h
[46, 45, 44, 44, 42, 41, 40, 38, 37, 36, 34, 34, 32, 30, 30, 28, 28, 23, 20, 19, 17, 17, 16, 12, 12, 9, 8, 7, 7, 5]
>>> columns, colsize, average, maxdeviation = first_fit(h)
>>> printit(columns, colsize, average, maxdeviation)
columns
[[46, 41, 40, 34, 30, 28, 19, 12, 12, 5],
[45, 42, 38, 36, 30, 28, 17, 16, 8, 7],
[44, 44, 37, 34, 32, 23, 20, 17, 9, 7]]
colsize: [267, 267, 267]
average, maxdeviation: 267.0 0.0
deviations: [0.0, 0.0, 0.0]
If you have the reverse-sorting, this extra code appended to the bottom of the above code (inside the 'if __name__ == ...' block) will do extra trials on random data:
for trial in range(2,11):
    print('\n## Trial %i' % trial)
    heights = [random.randint(5, 50) for i in range(random.randint(5, 50))]
    print('Pictures:',len(heights))
    columns, colsize, average, maxdeviation = first_fit(heights)
    print('average %7.3f' % average, '\nmaxdeviation:')
    print('%5.2f%% = %6.3f' % ((maxdeviation * 100. / average), maxdeviation))
    swapcount = 0
    while maxdeviation:
        swapped, maxdeviation = swap1(columns, colsize, average, maxdeviation)
        if not swapped:
            break
        print('%5.2f%% = %6.3f' % ((maxdeviation * 100. / average), maxdeviation))
        swapcount += 1
    print('swaps:', swapcount)
The extra output shows the effect of the swaps:
## Trial 2
Pictures: 11
average 72.000
maxdeviation:
9.72% = 7.000
swaps: 0
## Trial 3
Pictures: 14
average 118.667
maxdeviation:
6.46% = 7.667
4.78% = 5.667
3.09% = 3.667
0.56% = 0.667
swaps: 3
## Trial 4
Pictures: 46
average 470.333
maxdeviation:
0.57% = 2.667
0.35% = 1.667
0.14% = 0.667
swaps: 2
## Trial 5
Pictures: 40
average 388.667
maxdeviation:
0.43% = 1.667
0.17% = 0.667
swaps: 1
## Trial 6
Pictures: 5
average 44.000
maxdeviation:
4.55% = 2.000
swaps: 0
## Trial 7
Pictures: 30
average 295.000
maxdeviation:
0.34% = 1.000
swaps: 0
## Trial 8
Pictures: 43
average 413.000
maxdeviation:
0.97% = 4.000
0.73% = 3.000
0.48% = 2.000
swaps: 2
## Trial 9
Pictures: 33
average 342.000
maxdeviation:
0.29% = 1.000
swaps: 0
## Trial 10
Pictures: 26
average 233.333
maxdeviation:
2.29% = 5.333
1.86% = 4.333
1.43% = 3.333
1.00% = 2.333
0.57% = 1.333
swaps: 4
Adapt the Knapsack Problem solving algorithms by, for example, specifying the "weight" of every bucket to be roughly equal to the mean of the n objects' sizes (try a Gaussian distribution around the mean value).
http://en.wikipedia.org/wiki/Knapsack_problem#Solving
Sort buckets in size order.
Move an object from the largest bucket into the smallest bucket, re-sorting the array (which is almost-sorted, so we can use "limited insertion sort" in both directions; you can also speed things up by noting where you placed the last two buckets to be sorted. If you have 6-6-6-6-6-6-5... and get one object from the first bucket, you will move it to the sixth position. Then on the next iteration you can start comparing from the fifth. The same goes, right-to-left, for the smallest buckets).
When the difference of the two buckets is one, you can stop.
This moves the minimum number of objects, but is of order n^2 log n for comparisons (the simplest version is n^3 log n). If object moving is expensive while bucket size checking is not, for reasonable n it might still do:
12 7 5 2
11 7 5 3
10 7 5 4
9 7 5 5
8 7 6 5
7 7 6 6
12 7 3 1
11 7 3 2
10 7 3 3
9 7 4 3
8 7 4 4
7 7 5 4
7 6 5 5
6 6 6 5
Another possibility would be to calculate the expected average size for every bucket, and "move along" a bag (or a further bucket) with the excess from the larger buckets to the smaller ones.
Otherwise, strange things may happen:
12 7 3 1, the average is a bit less than 6, so we take 5 as the average.
5 7 3 1 bag = 7 from 1st bucket
5 5 3 1 bag = 9
5 5 5 1 bag = 7
5 5 5 8 which is a bit unbalanced.
By taking 6 (i.e. rounding) it goes better, but again sometimes it won't work:
12 5 3 1
6 5 3 1 bag = 6 from 1st bucket
6 6 3 1 bag = 5
6 6 6 1 bag = 2
6 6 6 3 which again is unbalanced.
You can run two passes, the first with the rounded mean left-to-right, the other with the truncated mean right-to-left:
12 5 3 1 we want to get no more than 6 in each bucket
6 11 3 1
6 6 8 1
6 6 6 3
6 6 6 3 and now we want to get at least 5 in each bucket
6 6 4 5 (we have taken 2 from bucket #3 into bucket #4)
6 5 5 5 (when the difference is 1 we stop).
This will require "n log n" size checks, and no more than 2n object moves.
Another possibility which is interesting is to reason thus: you have m objects into n buckets. So you need to do an integer mapping of m onto n, and this is Bresenham's linearization algorithm. Run a (n,m) Bresenham on the sorted array, and at step i (i.e. against bucket i-th) the algorithm will tell you whether to use round(m/n) or floor(m/n) size. Then move objects from or to the "moving bag" according to bucket i-th size.
This requires n log n comparisons.
You can further reduce the number of object moves by initially removing all buckets that are either round(m/n) or floor(m/n) in size to two pools of buckets sized R or F. When, running the algorithm, you need the i-th bucket to hold R objects, if the pool of R objects is not empty, swap the i-th bucket with one of the R-sized ones. This way, only buckets that are hopelessly under- or over-sized get balanced; (most of) the others are simply ignored, except for their references being shuffled.
If object access time is huge in proportion to computation time (e.g. some kind of automatic loader magazine), this will yield a magazine that is as balanced as possible, with the absolute minimum of overall object moves.
You could use an Integer Programming Package if it's fast enough.
It may be tricky getting your constraints right. Something like the following may do the trick:
let variable Oij denote Object i being in Bucket j. Let Wi represent the weight or size of Oi
Constraints:
sum(Oij for all j) == 1 #each object is in only one bucket
Oij = 1 or 0. #object is either in bucket j or not in bucket j
sum(Oij * Wi for all i) <= X + R #restrict weight on buckets.
Objective:
minimize X
Note R is the relaxation constant that you can play with depending on how much movement is required and how much performance is needed.
Now the maximum bucket size is X + R
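As a sketch, the first-stage model could be written with PuLP (an assumed choice of solver library; the answer does not name one). Variable names follow the answer: O[i][j] means object i is in bucket j, W[i] is its size; the object sizes below are the ones from the question's first example.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

W = [10, 4, 3, 3, 2, 1, 5, 5, 3, 2, 1, 3, 1, 1, 2]  # object sizes
m = 4                                                # number of buckets
R = 1                                                # relaxation constant

prob = LpProblem("balance_buckets", LpMinimize)
O = [[LpVariable(f"O_{i}_{j}", cat=LpBinary) for j in range(m)]
     for i in range(len(W))]
X = LpVariable("X", lowBound=0)

prob += X                                            # objective: minimize X
for i in range(len(W)):
    prob += lpSum(O[i][j] for j in range(m)) == 1    # each object in exactly one bucket
for j in range(m):
    prob += lpSum(O[i][j] * W[i] for i in range(len(W))) <= X + R  # weight limit per bucket

prob.solve()
print("max bucket size bound:", X.value())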
The next step is to figure out the minimum amount movement possible whilst keeping the bucket size less than X + R
Define a Stay variable Si that controls if Oi stays in bucket j
If Si is 0 it indicates that Oi stays where it was.
Constraints:
Si = 1 or 0.
Oij = 1 or 0.
Oij <= Si where j != original bucket of Object i
Oij != Si where j == original bucket of Object i
Sum(Oij for all j) == 1
Sum(Oij for all i) <= X + R
Objective:
minimize Sum(Si for all i)
Here Sum(Si for all i) represents the number of objects that have moved.
