I would like to find the maximum sums, taking only one value per row. I already wrote a brute-force solution, which is O(N^5). Now I would like to find an approach, with dynamic programming or otherwise, that reduces the complexity.
For example:
Matrix:
100 5 4 3 1
90 80 70 60 50
70 69 65 20 10
60 20 10 5 1
50 45 15 6 1
Solution for the top 5 sums:
100 + 90 + 70 + 60 + 50 = 370
100 + 90 + 69 + 60 + 50 = 369
100 + 90 + 70 + 60 + 45 = 365
100 + 90 + 65 + 60 + 50 = 365
100 + 90 + 69 + 60 + 45 = 364
Sum: 1833
Example of computing the sums with brute force:
for(int i=0; i<matrix[0].size(); i++) {
    for(int j=0; j<matrix[1].size(); j++) {
        for(int k=0; k<matrix[2].size(); k++) {
            for(int l=0; l<matrix[3].size(); l++) {
                for(int x=0; x<matrix[4].size(); x++) {
                    sum.push_back(matrix[0][i] + matrix[1][j] + matrix[2][k] + matrix[3][l] + matrix[4][x]);
                }
            }
        }
    }
}
sort(sum.begin(), sum.end(), mySort);
Thanks!
You can solve it in O(k log k) time with Dijkstra's algorithm. A node in the graph is represented by a list of 5 indexes, one pointing into each of the corresponding rows of the matrix.
For example in the matrix
100 5 4 3 1
90 80 70 60 50
70 69 65 20 10
60 20 10 5 1
50 45 15 6 1
the node [0, 0, 2, 0, 1] represents the numbers [100, 90, 65, 60, 45]
The initial node is [0, 0, 0, 0, 0]. Every node has up to 5 outgoing edges increasing 1 of the 5 indexes by 1, and the distance between nodes is the absolute difference in the sums of the indexed numbers.
So for that matrix the edges from the node [0, 0, 2, 0, 1] lead:
to [1, 0, 2, 0, 1] with distance 100 - 5 = 95
to [0, 1, 2, 0, 1] with distance 90 - 80 = 10
to [0, 0, 3, 0, 1] with distance 65 - 20 = 45
to [0, 0, 2, 1, 1] with distance 60 - 20 = 40
to [0, 0, 2, 0, 2] with distance 45 - 15 = 30
With this setup you can use Dijkstra's algorithm to find the k - 1 nodes closest to the initial node.
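Here is a minimal sketch of that idea in Python (my addition, not from the original answer). It assumes every row is sorted in descending order and, rather than a full Dijkstra implementation, uses a plain min-heap keyed by distance from the initial node, which is equivalent here because the distance to a node is path-independent. The helper name top_k_sums is invented:

import heapq

def top_k_sums(matrix, k):
    # node = tuple of column indexes, one per row
    start = (0,) * len(matrix)
    best = sum(row[0] for row in matrix)  # sum at the initial node
    heap = [(0, start)]                   # (distance from initial node, node)
    seen = {start}
    sums = []
    while heap and len(sums) < k:
        dist, node = heapq.heappop(heap)
        sums.append(best - dist)          # distance = how much sum was lost
        for r, c in enumerate(node):
            if c + 1 < len(matrix[r]):
                step = matrix[r][c] - matrix[r][c + 1]
                child = node[:r] + (c + 1,) + node[r + 1:]
                if child not in seen:
                    seen.add(child)
                    heapq.heappush(heap, (dist + step, child))
    return sums

For the example matrix, top_k_sums(m, 5) returns [370, 369, 365, 365, 364], matching the list above.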
Update: I previously used a greedy algorithm, which doesn't work for this problem. Here is a more general solution.
Suppose we've already found the combinations with the top m highest sums. The next highest combination (number m+1) must be 1 step away from one of these, where a step is defined as shifting focus one column to the right in one of the rows of the matrix. (Any combination that is more than one step away from all of the top m combinations cannot be the m+1 highest, because you can convert it to a higher one that is not in the top m by undoing one of those steps, i.e., moving back toward one of the existing combinations.)
For m = 1, we know that the "m highest combinations" just means the combination made by taking the first element of each row of the matrix (assuming each row is sorted from highest to lowest). So then we can work out from there:
1. Create a set of candidate combinations to consider for the next highest position. This will initially hold only the highest possible combination (first column of the matrix).
2. Identify the candidate with the highest sum and move that to the results.
3. Find all the combinations that are 1 step away from the one that was just added to the results. Add all of these to the set of candidate combinations. Only n of these will be added each round, where n is the number of rows in the matrix. Some may be duplicates of previously identified candidates, which should be ignored.
4. Go back to step 2. Repeat until there are 5 results.
Here is some Python code that does this:
m = [
    [100, 5, 4, 3, 1],
    [90, 80, 70, 60, 50],
    [70, 69, 65, 20, 10],
    [60, 20, 10, 5, 1],
    [50, 45, 15, 6, 1]
]
n_cols = len(m[0])  # matrix width

# helper function to calculate the sum for any combination,
# where a "combination" is a list of column indexes for each row
score = lambda combo: sum(m[r][c] for r, c in enumerate(combo))

# define candidate set, initially with single highest combination
# (this set could also store the score for each combination
# to avoid calculating it repeatedly)
candidates = {tuple(0 for row in m)}
results = set()

# get 5 highest-scoring combinations
for i in range(5):
    result = max(candidates, key=score)
    results.add(result)
    candidates.remove(result)  # don't test it again
    # find combinations one step away from latest result
    # and add them to the candidates set
    for j, c in enumerate(result):
        if c + 1 >= n_cols:
            continue  # don't step past edge of matrix
        combo = result[:j] + (c + 1,) + result[j + 1:]
        if combo not in results:
            candidates.add(combo)  # drops dups

# convert from column indexes to actual values
final = [
    [m[r][c] for r, c in enumerate(combo)]
    for combo in results
]
final.sort(key=sum, reverse=True)
print(final)
# [
#     [100, 90, 70, 60, 50],
#     [100, 90, 69, 60, 50],
#     [100, 90, 70, 60, 45],
#     [100, 90, 65, 60, 50],
#     [100, 90, 69, 60, 45],
# ]
If you want just the maximum sum, then sum the maximum value of each row.
That is,
M = [[100, 5, 4, 3, 1],
     [90, 80, 70, 60, 50],
     [70, 69, 65, 20, 10],
     [60, 20, 10, 5, 1],
     [50, 45, 15, 6, 1]]

sum(max(row) for row in M)
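For the example matrix this gives 100 + 90 + 70 + 60 + 50 = 370.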
Edit
It is not necessary to use dynamic programming, etc.
There is a simple rule: to get the next sum, advance the row where the difference between the current number and the next number in that row is smallest.
Here is code using numpy.
import numpy as np

M = np.array(M)
M = -np.sort(-M, axis=1)  # sort each row in descending order
k = 3
answer = []
ind = np.zeros(M.shape[0], dtype=int)
for _ in range(k):
    answer.append(sum(M[list(range(M.shape[0])), ind]))
    # advance the row with the smallest drop to its next value
    min_ind = np.argmin(M[list(range(len(ind))), ind] - M[list(range(len(ind))), ind + 1])
    ind[min_ind] += 1
Result is [370, 369, 365].
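As a quick sanity check (my addition), the same top three sums come out of an exhaustive search with itertools.product, which is fine at this 5x5 size:

from itertools import product

sums = sorted((sum(combo) for combo in product(*M)), reverse=True)
print(sums[:3])  # [370, 369, 365]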
For example, here is a matrix:
[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 0, 1, 0],
[1, 1, 1, 0],
[1, 1, 1, 1],
I want to find a set of rows whose element-wise sum equals [4, 3, 2, 1].
The expected answer is rows {0, 1, 3, 4}, because:
[1, 0, 0, 0] + [1, 1, 0, 0] + [1, 1, 1, 0] + [1, 1, 1, 1] = [4, 3, 2, 1]
Are there any well-known or related algorithms for solving this problem?
Thanks to @sascha and @N. Wouda for the comments. To clarify, here are some more details.
In my problem, the matrix will have about 50 rows and 25 columns, but each row will have fewer than 4 nonzero elements (the rest are zero), and every solution has 8 rows.
If I try all combinations, C(50, 8) is about 0.55 billion attempts, which is far too many, so I want to find a more efficient algorithm.
If you want to make the jump to using a solver, I'd recommend it; this is a pretty straightforward Integer Program. The solutions below use Python, Python's pyomo math programming package to formulate the problem, and COIN-OR's cbc solver for Integer Programs and Mixed Integer Programs, which needs to be installed separately (freeware), available here: https://www.coin-or.org/downloading/
Here is an example with your data, followed by an example with 100,000 rows. The first example solves instantly; the 100,000-row example takes about 2 seconds on my machine.
# row selection Integer Program
import pyomo.environ as pyo

data1 = [[1, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 0, 1, 0],
         [1, 1, 1, 0],
         [1, 1, 1, 1],]

data_dict = {(i, j): data1[i][j] for i in range(len(data1)) for j in range(len(data1[0]))}

model = pyo.ConcreteModel()

# sets
model.I = pyo.Set(initialize=range(len(data1)))     # a simple row index
model.J = pyo.Set(initialize=range(len(data1[0])))  # a simple column index

# parameters
model.matrix = pyo.Param(model.I, model.J, initialize=data_dict)  # hold the sparse matrix of values
magic_sum = [4, 3, 2, 1]

# variables
model.row_select = pyo.Var(model.I, domain=pyo.Boolean)  # row selection variable

# constraints
# ensure the columnar sum is at least the magic sum for all j
def min_sum(model, j):
    return sum(model.row_select[i] * model.matrix[(i, j)] for i in model.I) >= magic_sum[j]
model.c1 = pyo.Constraint(model.J, rule=min_sum)

# objective function
# minimize the overage
def objective(model):
    delta = 0
    for j in model.J:
        delta += sum(model.row_select[i] * model.matrix[i, j] for i in model.I) - magic_sum[j]
    return delta
model.OBJ = pyo.Objective(rule=objective)

model.pprint()  # verify everything

solver = pyo.SolverFactory('cbc')  # need to have cbc solver installed
result = solver.solve(model)
result.write()              # solver details
model.row_select.display()  # output
Output:
# ----------------------------------------------------------
#   Solver Information
# ----------------------------------------------------------
Solver:
- Status: ok
  User time: -1.0
  System time: 0.0
  Wallclock time: 0.0
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box:
      Number of iterations: 0
  Error rc: 0
  Time: 0.01792597770690918
# ----------------------------------------------------------
#   Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
  number of solutions displayed: 0

row_select : Size=5, Index=I
    Key : Lower : Value : Upper : Fixed : Stale : Domain
      0 :     0 :   1.0 :     1 : False : False : Boolean
      1 :     0 :   1.0 :     1 : False : False : Boolean
      2 :     0 :   0.0 :     1 : False : False : Boolean
      3 :     0 :   1.0 :     1 : False : False : Boolean
      4 :     0 :   1.0 :     1 : False : False : Boolean
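Rows 0, 1, 3, and 4 take value 1.0, i.e. they are selected, which matches the expected answer {0, 1, 3, 4}.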
A more stressful rendition with 100,000 rows:
# row selection Integer Program stress test
import pyomo.environ as pyo
import numpy as np

# make a large matrix 100,000 x 8
data1 = np.random.randint(0, 1000, size=(100_000, 8))
# inject "the right answer" into 3 rows
data1[42602] = [8, 0, 0, 0, 0, 0, 0, 0]
data1[3]     = [0, 0, 0, 0, 4, 3, 2, 1]
data1[10986] = [0, 7, 6, 5, 0, 0, 0, 0]

data_dict = {(i, j): data1[i][j] for i in range(len(data1)) for j in range(len(data1[0]))}

model = pyo.ConcreteModel()

# sets
model.I = pyo.Set(initialize=range(len(data1)))     # a simple row index
model.J = pyo.Set(initialize=range(len(data1[0])))  # a simple column index

# parameters
model.matrix = pyo.Param(model.I, model.J, initialize=data_dict)  # hold the matrix of values
magic_sum = [8, 7, 6, 5, 4, 3, 2, 1]

# variables
model.row_select = pyo.Var(model.I, domain=pyo.Boolean)  # row selection variable

# constraints
# ensure the columnar sum is at least the magic sum for all j
def min_sum(model, j):
    return sum(model.row_select[i] * model.matrix[(i, j)] for i in model.I) >= magic_sum[j]
model.c1 = pyo.Constraint(model.J, rule=min_sum)

# objective function
# minimize the overage
def objective(model):
    delta = 0
    for j in model.J:
        delta += sum(model.row_select[i] * model.matrix[i, j] for i in model.I) - magic_sum[j]
    return delta
model.OBJ = pyo.Objective(rule=objective)

solver = pyo.SolverFactory('cbc')
result = solver.solve(model)
result.write()

print('\n\n======== row selections =======')
for i in model.I:
    if model.row_select[i].value > 0:
        print(f'row {i} selected')
Output:
# ----------------------------------------------------------
#   Solver Information
# ----------------------------------------------------------
Solver:
- Status: ok
  User time: -1.0
  System time: 2.18
  Wallclock time: 2.61
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box:
      Number of iterations: 0
  Error rc: 0
  Time: 2.800779104232788
# ----------------------------------------------------------
#   Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
  number of solutions displayed: 0
======== row selections =======
row 3 selected
row 10986 selected
row 42602 selected
This one picks or skips each element, recursively. As soon as a subtree is impossible to solve (no elements left, or some target value negative) it returns false; when the target sums to 0, a solution has been found and is returned as the list of picked row indexes.
Feel free to add time and memory complexity in the comments. The worst case should be O(2^(n+1)).
Please let me know how it performs on your 8/50 data.
const elements = [
  [1, 0, 0, 0],
  [1, 1, 0, 0],
  [1, 0, 1, 0],
  [1, 1, 1, 0],
  [1, 1, 1, 1]
];
const target = [4, 3, 2, 1];

let iterations = 0;
console.log(iter(elements, target, [], 0));
console.log(`Iterations: ${iterations}`);

function iter(elements, target, picked, index) {
  iterations++;
  const sum = target.reduce(function(sum, element) {
    return sum + element;
  });
  if (sum === 0) return picked;             // target reached: return picked rows
  if (elements.length === 0) return false;  // no rows left

  // branch 1: skip the first remaining row
  const result = iter(
    removeElement(elements, 0),
    target,
    picked,
    index + 1
  );
  if (result !== false) return result;

  // branch 2: pick the first remaining row, unless it overshoots the target
  const newTarget = matrixSubtract(target, elements[0]);
  const hasNegatives = newTarget.some(function(element) {
    return element < 0;
  });
  if (hasNegatives) return false;

  return iter(
    removeElement(elements, 0),
    newTarget,
    picked.concat(index),
    index + 1
  );
}

function removeElement(target, i) {
  return target.slice(0, i).concat(target.slice(i + 1));
}

function matrixSubtract(minuend, subtrahend) {
  return minuend.map(function(element, i) {
    return element - subtrahend[i];
  });
}
Even though I found a few threads dealing with distance-matrix efficiency, they all use an int or float matrix. In my case I have to deal with vectors (OrderedDicts of frequencies), and I only end up with a very slow method that is not viable for a large DataFrame (300,000 x 300,000).
How can I make the process more efficient?
I would be very thankful for any help; this problem has been killing me :)
Considering DataFrame df such as:
>>> df
    vectors
id
1   {dict1}
2   {dict2}
3   {dict3}
4   {dict4}
where each {dict#} is an OrderedDict of event frequencies:

OrderedDict{event1: 1,
            event2: 5,
            event3: 0,
            ...}
A function to return the distance between two vectors:
def vectorDistance(a, b, df_vector):
# Calculate distance between a & b
# based on the vector from df_vector.
return distance
[in]: vectorDistance({dict1}, {dict2})
[out]: distance
A desired output:

       1      2      3      4
id
1      0      1<->2  1<->3  1<->4
2      1<->2  0      ...    ...
3      1<->3  ...    0      ...
4      1<->4  ...    ...    0
(where 1<->2 is a float distance between vector 1 & 2)
Method used:
import pandas as pd

matrix = pd.concat([df, df.T], axis=1)
for index in matrix.index:
    for col in matrix.columns:
        matrix.ix[col, index] = vectorDistance(col, index, df)
>>> matrix
5072142538 5072134420 4716823618 ...
udid
5072142538 0.00000 0.01501 0.06002 ...
5072134420 0.01501 0.00000 0.09037 ...
4716823618 0.06002 0.09037 0.00000 ...
... ... ... ...
EDIT:
Minimal example
Note: the events can differ from one {dict} to another, but that's fine when they're passed to the function. My issue is more about how to pass the right a & b to fill each cell quickly.
I am working with cosine distance, as it works rather well for vectors like mine.
from collections import Counter
from math import sqrt

import numpy as np
import pandas as pd
raw_data = {'counters_': {4716823618: Counter({51811: 1, 51820: 1, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 8, 51853: 5, 51854: 4, 51856: 24, 51903: 11, 51904: 12, 51905: 3, 51906: 19, 51908: 230, 51922: 24, 51927: 19, 51931: 2, 106282: 9, 112830: 1, 119453: 1, 165062: 80, 168904: 3, 180354: 19, 180437: 33, 185824: 117, 186171: 14, 187101: 1, 190827: 7, 201629: 1, 209318: 37}), 5072134420: Counter({51811: 1, 51812: 1, 51820: 1, 51833: 56, 51835: 9, 51843: 49, 51848: 2, 51852: 11, 51853: 4, 51854: 4, 51856: 28, 51885: 1, 51903: 17, 51904: 17, 51905: 9, 51906: 14, 51908: 225, 51927: 29, 51931: 2, 106282: 19, 112830: 2, 168904: 9, 180354: 14, 185824: 219, 186171: 7, 187101: 1, 190827: 6, 201629: 2, 209318: 41}), 5072142538: Counter({51811: 4, 51812: 4, 51820: 4, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 6, 51853: 3, 51854: 3, 51856: 18, 51885: 1, 51903: 17, 51904: 16, 51905: 3, 51906: 24, 51908: 258, 51927: 20, 51931: 8, 106282: 16, 112830: 2, 168904: 3, 180354: 24, 185824: 180, 186171: 10, 187101: 1, 190827: 7, 201629: 2, 209318: 52})}}
def vectorDistance(index, col):
    a = dict(df[df.index == index]["counters_"].values[0])
    b = dict(df[df.index == col]["counters_"].values[0])
    return abs(np.round(1 - similarity(a, b), 5))

def scalar(collection):
    total = 0
    for coin, count in collection.items():
        total += count * count
    return sqrt(total)

def similarity(A, B):
    total = 0
    for kind in A:
        if kind in B:
            total += A[kind] * B[kind]
    return float(total) / (scalar(A) * scalar(B))

df = pd.DataFrame(raw_data)
matrix = pd.concat([df, df.T], axis=1)
matrix.drop("counters_", 0, inplace=True)
matrix.drop("counters_", 1, inplace=True)

for index in matrix.index:
    for col in matrix.columns:
        matrix.ix[col, index] = vectorDistance(col, index)

matrix
This is certainly more efficient and easier to read than using for loops.
df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                  index=raw_data['counters_'].keys()).T
>>> df.head()
       4716823618  5072134420  5072142538
51811           1           1           4
51812         NaN           1           4
51820           1           1           4
51833          56          56          56
51835           8           9           8
# raw_data no longer needed. Delete to reduce memory footprint.
del raw_data
# Create scalars.
scalars = ((df ** 2).sum()) ** .5
>>> scalars
4716823618 289.679133
5072134420 330.548030
5072142538 331.957829
dtype: float64
def v_dist(col_1, col_2):
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() /
                (scalars.iloc[col_1] * scalars.iloc[col_2]))
>>> v_dist(0, 1)
0.09036665882900885
>>> v_dist(0, 2)
0.060016436804916085
>>> v_dist(1, 2)
0.015009898476505357
m = pd.DataFrame(np.nan * len(df.columns), index=df.columns, columns=df.columns)
>>> m
4716823618 5072134420 5072142538
4716823618 NaN NaN NaN
5072134420 NaN NaN NaN
5072142538 NaN NaN NaN
for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1]
        if row == col:
            # No need to calculate value for diagonal.
            m.iat[row, col] = 0
        else:
            # Do two calculations in one due to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)
>>> m
4716823618 5072134420 5072142538
4716823618 0.000000 0.090367 0.060016
5072134420 0.090367 0.000000 0.015010
5072142538 0.060016 0.015010 0.000000
Wrapping all of this into a function:
def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan * len(df.columns), index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
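Usage, reproducing the matrix shown above:

m = calc_matrix(raw_data)
print(m)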
You don't want to store dicts inside your dataframe. Read in your dataframe using the from_dict method:
df = pd.DataFrame.from_dict(raw_data['counters_'], orient='index')
Then you can apply the numpy/scipy vectorised methods for computing cosine similarity as in What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
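For example, here is a minimal sketch of that vectorised route (my addition; cdist is scipy's pairwise-distance function, and this assumes a dense matrix is acceptable, which it may not be at 300,000 ids):

import pandas as pd
from scipy.spatial.distance import cdist

# rows = ids, columns = events; absent events become 0
df = pd.DataFrame.from_dict(raw_data['counters_'], orient='index').fillna(0)

# all pairwise cosine distances in one vectorised call
dist = cdist(df.values, df.values, metric='cosine')
matrix = pd.DataFrame(dist, index=df.index, columns=df.index)

On the three-id example above this reproduces the matrix computed earlier; for truly large inputs you would want the sparse-matrix approach from the linked question.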
Given an array of ints, I want to quantize the values so that the quantized values sum to 100, with each quantized value still an integer. This works when the whole array is quantized, but the sum of a subset of the quantized values is not always consistent with the rest of the values.
For example, the values 44, 40, 7, 2, 0, 0 are quantized to 47, 43, 8, 2, 0, 0 (the sum of which is 100). If you take the last five quantized values, the sum is 53, which is consistent with the first value (i.e., 47 + 53 = 100).
But with the values 78, 7, 7, 1, 0, 0, the sum of the last five quantized values (8, 8, 1, 0, 0) is 17, while the first quantized value is 84, which when added to 17 does not equal 100. Clearly the reason for this is the rounding. Is there a way to adjust the rounding so that subsets are still consistent?
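To make the rounding issue concrete, here is the second example worked out (a quick check in Python; the raw total is 93):

vals = [78, 7, 7, 1, 0, 0]
q = [round(v * 100.0 / sum(vals)) for v in vals]
print(q, sum(q))  # [84, 8, 8, 1, 0, 0] 101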
Here is the Ruby code:
class Quantize
  def initialize(array)
    @array = array.map { |a| a.to_i }
  end

  def values
    @array.map { |a| quantize(a) }
  end

  def sub_total(i, j)
    @array[i..j].map { |a| quantize(a) }.reduce(:+)
  end

  private

  def quantize(val)
    (val * 100.0 / total).round(0)
  end

  def total
    @array.reduce(:+)
  end
end
And the (failing) tests:
require 'quantize'
describe Quantize do
  context 'first example' do
    let(:subject) { described_class.new([44, 40, 7, 2, 0, 0]) }

    context '#values' do
      it 'quantizes array to add up to 100' do
        expect(subject.values).to eq([47, 43, 8, 2, 0, 0])
      end
    end

    context '#sub_total' do
      it 'adds a subset of array' do
        expect(subject.sub_total(1, 5)).to eq(53)
      end
    end
  end

  context 'second example' do
    let(:subject) { described_class.new([78, 7, 7, 1, 0, 0]) }

    context '#values' do
      it 'quantizes array to add up to 100' do
        expect(subject.values).to eq([84, 8, 8, 1, 0, 0])
      end
    end

    context '#sub_total' do
      it 'adds a subset of array' do
        expect(subject.sub_total(1, 5)).to eq(16)
      end
    end
  end
end
As noted in the comments on the question, the quantization routine does not perform correctly: the second example [78, 7, 7, 1, 0, 0] is quantized as [84, 8, 8, 1, 0, 0] — which adds to 101 and not to 100.
Here is an approach that will yield correct results:
def quantize(array, value)
  quantized = array.map(&:to_i)
  total = array.reduce(:+)
  remainder = value - total
  index = 0

  if remainder > 0
    while remainder > 0
      quantized[index] += 1
      remainder -= 1
      index = (index + 1) % quantized.length
    end
  else
    while remainder < 0
      quantized[index] -= 1
      remainder += 1
      index = (index + 1) % quantized.length
    end
  end

  quantized
end
This solves your problem, as stated in the question. The troublesome result becomes [80, 8, 8, 2, 1, 1], which adds to 100 and maintains the subset relationship that you described. The solution can, of course, be made more performant — but it has the advantage of working and being dead simple to understand.
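For illustration, here is the same floor-then-distribute idea as a short Python sketch (my rendering, not part of the original answer; quantize_to is an invented name):

def quantize_to(values, target=100):
    # floor every value, then hand out the remaining units one at a time
    quantized = [int(v) for v in values]
    remainder = target - sum(quantized)
    step = 1 if remainder > 0 else -1
    i = 0
    while remainder != 0:
        quantized[i] += step
        remainder -= step
        i = (i + 1) % len(quantized)
    return quantized

print(quantize_to([78, 7, 7, 1, 0, 0]))  # [80, 8, 8, 2, 1, 1]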
I want to round any given number to an eighth or a third in Ruby, whichever is closest.
I'm hoping for output like 1/8 or 2/3.
I've tried the following:
scalar_in_eighths = (scalar * 8.0).round / 8.0
scalar_in_thirds = (scalar * 3.0).round / 3.0

thirds_difference = (scalar - scalar_in_thirds).abs
eighths_difference = (scalar - scalar_in_eighths).abs

compute_in_thirds = thirds_difference < eighths_difference

if compute_in_thirds
  less_than_eighth = false
  rounded_scalar = scalar_in_thirds
else
  less_than_eighth = false
  rounded_scalar = scalar_in_eighths
end

quotient, modulus = rounded_scalar.to_s.split '.'
quotient = quotient.to_f
modulus = ".#{modulus}".to_f
This works well for eighths, but for numbers like 1.32 it breaks down: calling modulus.numerator and modulus.denominator on the fractional component yields numbers like 6004799503160661 and 18014398509481984, the exact binary representation of the float.
Is there a better way to solve this?
Here's one way you could write it.
Code
def closest_fraction(f, *denominators)
  n, frac = denominators.map { |n| [n, round_to_fraction(f, n)] }
                        .min_by { |_, g| (f - g).abs }
  [(n * frac).round, n, frac]
end

def round_to_fraction(f, n)
  (f * n).round / n.to_f
end
Examples
closest_fraction(2.33, 3, 8)
#=> [7, 3, 2.3333333333333335]
closest_fraction(2.12, 3, 8)
#=> [17, 8, 2.125]
closest_fraction(2.46, 2, 3, 5)
#=> [5, 2, 2.5]
closest_fraction(2.76, 2, 3, 5, 7, 11, 13, 17)
#=> [47, 17, 2.764705882352941]
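Each result has the form [numerator, denominator, rounded value]: 2.33 comes back as 7/3 and 2.12 as 17/8, matching the 1/8-or-2/3 style of output you were after.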