Finding an optimal selection in a 2D matrix with given constraints - algorithm

Problem statement
Given an m x n matrix where m <= n, you have to select entries so that their sum is maximal.
However, you may select only one entry per row and at most one per column.
Performance is also a huge factor, which means it's OK to find selections that are not optimal in order to reduce complexity (as long as they are better than selecting random entries).
Example
[Figures omitted: examples of valid selections (exactly one entry per row, at most one per column) and invalid selections that violate these constraints.]
My Approaches
Select best of k random permutations
A = createRandomMatrix(m, n)
selections = list()
for trial in range(k):                      # 'try' is a reserved word in Python
    cols = createRandomIndexPermutation(m)  # m column indices, no duplicates
    total = 0                               # reset the sum for each trial
    for row in range(m):
        total += A[row, cols[row]]
    selections.append(total)
result = max(selections)
This approach performs poorly when n is significantly larger than m.
Best possible (not yet taken) column per row
A = createRandomMatrix(m, n)
takenCols = set()
result = 0
for row in range(m):
    col = getMaxColPossible(row, takenCols, A)  # best not-yet-taken column for this row
    result += A[row, col]
    takenCols.add(col)
This approach always favors the rows (or columns) that were processed first, which could lead to worse-than-average results, as the sketch below shows.
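A minimal sketch of that failure mode (the 2x2 matrix is a hypothetical example, not from the question): the greedy pass grabs the top-left 10 first, which forces row 1 into the 1, while giving up one point in row 0 would have freed the 10 in column 0.

import numpy as np

A = np.array([[10,  9],
              [10,  1]])

# Greedy by row order: row 0 takes column 0 (10), so row 1 is left with column 1 (1).
greedy_sum = A[0, 0] + A[1, 1]    # 11

# Optimal: row 0 takes column 1 (9), row 1 takes column 0 (10).
optimal_sum = A[0, 1] + A[1, 0]   # 19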

This sounds exactly like the rectangular linear assignment problem (RLAP). This problem can be solved efficiently (in terms of asymptotic complexity, roughly cubic time) to a global optimum, and a lot of software is available.
The basic approaches are LAP + dummy variables, LAP modifications, or more general algorithms like network flows (min-cost max-flow).
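As an illustration of the dummy-variable idea (a sketch assuming a maximization profit matrix; this is not code from the paper below): pad the m x n matrix with n - m all-zero rows so that a square-only LAP solver can be applied, then discard the assignments of the dummy rows, which contribute nothing to the objective.

import numpy as np

def pad_to_square(profit):
    # profit: m x n with m <= n. Any rectangular assignment extends to a
    # square one of equal value (dummy rows take the leftover columns at
    # zero profit), so solving the padded square LAP and keeping only the
    # real rows solves the rectangular problem.
    m, n = profit.shape
    padded = np.zeros((n, n), dtype=profit.dtype)
    padded[:m, :] = profit
    return padded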
You can start with (pdf):
Bijsterbosch, J., and A. Volgenant. "Solving the Rectangular assignment problem and applications." Annals of Operations Research 181.1 (2010): 443-462.
Small Python example using Python's common scientific stack:
Edit: as mentioned in the comments, negating the cost matrix (which I did, motivated by the LP description) is not what's done in the Munkres/Hungarian-method literature. The strategy is to build a profit matrix from the cost matrix, which is now reflected in the example. This approach leads to a non-negative cost matrix (which some implementations assume; whether that matters depends on the implementation). More information is available in this question.
Code
import numpy as np
import scipy.optimize as sopt     # RLAP solver
import matplotlib.pyplot as plt   # visualization
import seaborn as sns             # visualization

np.random.seed(1)

# Example data from
# https://matplotlib.org/gallery/images_contours_and_fields/image_annotated_heatmap.html
# removed a row; will be shuffled to make it more interesting!
harvest = np.array([[0.8, 2.4, 2.5, 3.9, 0.0, 4.0, 0.0],
                    [2.4, 0.0, 4.0, 1.0, 2.7, 0.0, 0.0],
                    [1.1, 2.4, 0.8, 4.3, 1.9, 4.4, 0.0],
                    [0.6, 0.0, 0.3, 0.0, 3.1, 0.0, 0.0],
                    [0.7, 1.7, 0.6, 2.6, 2.2, 6.2, 0.0],
                    [1.3, 1.2, 0.0, 0.0, 0.0, 3.2, 5.1]])
harvest = harvest[:, np.random.permutation(harvest.shape[1])]

# scipy: linear_sum_assignment -> able to take a rectangular problem!
# assumption: minimize -> convert cost matrix to profit matrix:
#   subtract the original cost from the maximum cost
# Kuhn, Harold W.:
#   "Variants of the Hungarian method for assignment problems."
max_cost = np.amax(harvest)
harvest_profit = max_cost - harvest

row_ind, col_ind = sopt.linear_sum_assignment(harvest_profit)
sol_map = np.zeros(harvest.shape, dtype=bool)
sol_map[row_ind, col_ind] = True

# Visualize
f, ax = plt.subplots(2, figsize=(9, 6))
sns.heatmap(harvest, annot=True, linewidths=.5, ax=ax[0], cbar=False,
            linecolor='black', cmap="YlGnBu")
sns.heatmap(harvest, annot=True, mask=~sol_map, linewidths=.5, ax=ax[1],
            linecolor='black', cbar=False, cmap="YlGnBu")
plt.tight_layout()
plt.show()
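To read off the quantity the original question asks for (the maximal sum), the selected entries follow directly from the returned indices:

print(list(zip(row_ind, col_ind)))      # the selected (row, column) pairs
print(harvest[row_ind, col_ind].sum())  # the maximized sum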
Output
[Two heatmaps omitted: the shuffled harvest matrix, and the same matrix masked to show only the selected entries.]

Related

Use Gekko and Python to fit a numerical ODE solution to data

Hi everyone!
I was wondering if it is possible to fit coefficients of an ODE using GEKKO.
I unsuccessfully tried to replicate the example given here.
This is what I have come up with (but it is flawed, and I should perhaps mention that my math skills are unfortunately rather poor):
import numpy as np
from gekko import GEKKO
tspan = [0, 0.1, 0.2, 0.4, 0.8, 1]
Ca_data = [2.0081, 1.5512, 1.1903, 0.7160, 0.2562, 0.1495]
m = GEKKO(remote=False)
t = m.Param(value=tspan)
m.time = t
Ca_m = m.Param(value=Ca_data)
Ca = m.Var()
k = m.FV(value=1.3)
k.STATUS = 1
m.Equation( Ca.dt() == -k * Ca)
m.Obj( ((Ca-Ca_m)**2)/Ca_m )
m.options.IMODE = 2
m.solve(disp=True)
print(k.value[0]) #2.58893455 is the solution
Can someone help me out here?
Thank you very much,
Martin
(This is my first post here – please be gentle, if I have done something not appropriate.)
Your solution was close but you needed:
More NODES (default=2) to improve the accuracy. Gekko only adds the points that you define. See additional information on collocation.
Define Ca as m.CV() to use the built-in error model, instead of m.Var() with m.Obj, and set NODES>=3. Otherwise the internal nodes of each collocation interval are also matched to the measurements, and this gives a slightly wrong answer.
Set EV_TYPE=2 to use a squared error. An absolute-value objective, EV_TYPE=1 (default), gives a correct but slightly different answer.
import numpy as np
from gekko import GEKKO
m = GEKKO(remote=False)
m.time = [0, 0.1, 0.2, 0.4, 0.8, 1]
Ca_data = [2.0081, 1.5512, 1.1903, 0.7160, 0.2562, 0.1495]
Ca = m.CV(value=Ca_data); Ca.FSTATUS = 1 # fit to measurement
k = m.FV(value=1.3); k.STATUS = 1 # adjustable parameter
m.Equation(Ca.dt()== -k * Ca) # differential equation
m.options.IMODE = 5 # dynamic estimation
m.options.NODES = 5 # collocation nodes
m.options.EV_TYPE = 2 # squared error
m.solve(disp=True) # display solver output
print(k.value[0]) # 2.58893455 is the curve_fit solution
The solution is k=2.5889717102. A plot shows the match to the measured values.
import matplotlib.pyplot as plt # plot solution
plt.plot(m.time,Ca_data,'ro')
plt.plot(m.time,Ca.value,'bx')
plt.show()
There are additional tutorials and course material on parameter estimation with differential and algebraic equation models.

Optimal parameters not found: Number of calls to function has reached maxfev = 1000

I'm new to Python. I'm trying to fit a curve to my data, but when I generate the graph, only the original data appears, together with the message "Optimal parameters not found: Number of calls to function has reached maxfev = 1000." Could you help me find my mistake?
%matplotlib inline
import matplotlib.pylab as m
from scipy.optimize import curve_fit
import numpy as num
import scipy.optimize as optimize
xData=num.array([0,0,100,200,250,300,400], dtype="float")
yData=num.array([0,0,0,0,75,100,100], dtype="float")
m.plot(xData, yData, 'ro', label='Datos originales')
def fun(x, a, b):
    return a + b * num.log(x)

popt, pcov = optimize.curve_fit(fun, xData, yData, p0=[1, 1], maxfev=1000)
print(popt)
x = num.linspace(1, 400, 7)
m.plot(x, fun(x, *popt), label='Función ajustada')
m.xlabel('concentración')
m.ylabel('% mortalidad')
m.legend()
m.grid()
The model in your code is "a + b * num.log(x)". Because your data contains x values of 0.0, the evaluation of log(0.0) yields non-finite values and will not allow the fitting software to function. Sometimes these x values of 0.0 can be replaced with very small numbers, since log(small number) does not fail (see the sketch below), but in this case the equation and the data do not appear to match, so that technique alone would not be sufficient here.
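A minimal sketch of that workaround, assuming one is willing to perturb the zero x values (the epsilon is arbitrary):

import numpy as np

x = np.array([0, 0, 100, 200, 250, 300, 400], dtype=float)
np.log(x[0])                       # -inf with a RuntimeWarning; this is what breaks the fit
x_safe = np.where(x > 0, x, 1e-9)  # replace zeros with a tiny positive number
np.log(x_safe[0])                  # finite (about -20.7), so the residuals stay finite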
My thought is that a different equation would be a better model for this data. I performed an equation search using your data and found that several different sigmoidal-type equations gave suspiciously good fits to this data set, which is not surprising given the small number of data points.
The sigmoidal equations I tried were all extremely sensitive to the initial parameter estimates. Here is a graphical Python fitter using scipy's Differential Evolution genetic algorithm module to determine the initial parameter estimates for curve_fit's non-linear solver. That scipy module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space and requires bounds within which to search. Here those bounds are taken from the data maximum and minimum values.
I personally would not use this fit, precisely because the small number of data points is giving such suspiciously good fits, and I strongly recommend taking additional data points if at all possible. I could, however, not find any equations with fewer than three parameters that would fit the data.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
xData=numpy.array([0,0,100,200,250,300,400], dtype="float")
yData=numpy.array([0,0,0,0,75,100,100], dtype="float")
def func(x, a, b, c):  # Sigmoid B equation from zunzun.com
    return a / (1.0 + numpy.exp(-1.0 * (x - b) / c))

# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")  # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)

def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)

    parameterBounds = []
    parameterBounds.append([minX, maxX])  # search bounds for a
    parameterBounds.append([minX, maxX])  # search bounds for b
    parameterBounds.append([0.0, 2.0])    # search bounds for c

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
# by default, differential_evolution finishes by polishing its best result
# with a local minimizer, respecting the parameter bounds
geneticParameters = generate_Initial_Parameters()

# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best-fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData), 100)
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label

    plt.show()
    plt.close('all')  # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Compute the (Aᵀ×A)·(Bᵀ×B) matrix using GSL, BLAS and LAPACK

Suppose I have 2 symmetric matrices:
A = {{1,2}, {2,3}}
B = {{2,3},{3,4}}
Can I compute the matrix (Aᵀ×A)·(Bᵀ×B) using GSL, BLAS and LAPACK?
I'm using
gsl_blas_dsyrk(CblasUpper, CblasTrans, 1.0, A, 0.0, ATA);
gsl_blas_dsyrk(CblasUpper, CblasTrans, 1.0, B, 0.0, BTB);
gsl_blas_dsymm(CblasLeft, CblasUpper, 1.0, ATA, BTB, 0.0, ATABTB); // It doesn't work
It returns:
(Aᵀ·A) = ATA = {{5, 8}, {0, 13}} -- OK: gsl_blas_dsyrk returns the symmetric matrix as an upper triangular matrix.
(Bᵀ·B) = BTB = {{13, 18}, {0, 25}} -- OK.
(Aᵀ·A)·(Bᵀ·B) = ATABTB = {{65, 290}, {104, 469}} -- this is wrong.
Symmetrize BTB and the problem will be solved.
As you noticed, the upper triangular parts of the symmetric matrices are computed by dsyrk(). Then dsymm() is applied. According to the definition of dsymm(), the following operation is performed, since the flag CblasLeft is used:
C := alpha*A*B + beta*C
where alpha and beta are scalars, A is a symmetric matrix, and B and C are m by n matrices.
Indeed, the B matrix is treated as a general matrix, not necessarily a symmetric one. As a result, ATA is multiplied by only the upper triangular part of BTB, since the lower triangular part of BTB was never computed.
Symmetrize BTB and the problem will be solved. To do so, a couple of for loops is a straightforward solution; see Convert symmetric matrix between packed and full storage?
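To see the effect numerically, here is a numpy stand-in for the GSL calls (an illustration only, not GSL code): treating the dsyrk output as a full matrix reproduces the wrong product from the question, and symmetrizing it restores the correct one.

import numpy as np

A = np.array([[1., 2.], [2., 3.]])
B = np.array([[2., 3.], [3., 4.]])

ATA = A.T @ A                 # [[5, 8], [8, 13]]
BTB_upper = np.triu(B.T @ B)  # what dsyrk stores: [[13, 18], [0, 25]]

wrong = ATA @ BTB_upper                    # [[65, 290], [104, 469]] -- the bad result
BTB = BTB_upper + np.triu(BTB_upper, 1).T  # copy the strict upper triangle down
right = ATA @ BTB                          # [[209, 290], [338, 469]] -- correct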

Efficient Parallel Sparse Matrix dot product in Scipy Python

I have a really big (1.5M x 16M) sparse CSR scipy matrix A. What I need to compute is the similarity of each pair of rows. I have defined the similarity as follows:
Assume a and b are two rows of matrix A
a = (0, 1, 0, 4)
b = (1, 0, 2, 3)
Similarity (a, b) = 0*1 + 1*0 + 0*2 + 4*3 = 12
To compute all pairwise row similarities I use this (or Cosine similarity):
AT = np.transpose(A)
pairs = A.dot(AT)
Now pairs[i, j] is the similarity of row i and row j for all such i and j.
This is quite similar to pairwise cosine similarity of rows, so if there is an efficient parallel algorithm that computes pairwise cosine similarity, it would work for me as well.
The problem: this dot product is very slow because it uses just one CPU (I have access to 64 of those CPUs on my server).
I can also export A and AT to a file and run any other external program that does the multiplication in parallel and get the results back to the Python program.
Is there any more efficient way of doing this dot product, or of computing the pairwise similarity in parallel?
I finally used the 'cosine' distance metric of scikit-learn and its pairwise_distances function, which supports sparse matrices and is highly parallelised.
sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)
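A usage sketch under the question's setup (note that cosine distance is 1 minus cosine similarity, and that the result comes back as a dense array, so for the full 1.5M-row matrix it would have to be applied block-wise; the small random matrix here is just a stand-in):

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances

A = sparse.random(100, 500, density=0.05, format='csr', random_state=0)
D = pairwise_distances(A, metric='cosine', n_jobs=-1)  # uses all available CPUs
S = 1.0 - D                                            # pairwise cosine similarities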
I could also divide A into n horizontal parts and use the parallel python package to run multiple multiplications and stack the results afterwards, along the lines of the sketch below.
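A rough sketch of that row-block idea (the helper names are made up for illustration): each worker multiplies one horizontal slice of A by A.T, and the resulting row blocks are stacked back together. Shipping a full copy of A.T to every worker is fine for a sketch but wasteful at the question's scale.

import numpy as np
from multiprocessing import Pool
from scipy import sparse

def _block_dot(args):
    block, AT = args
    return block.dot(AT)          # one horizontal slice of the final product

def parallel_pairwise(A, n_jobs=4):
    A = A.tocsr()
    AT = A.T.tocsc()
    # split the rows into n_jobs contiguous chunks
    bounds = np.linspace(0, A.shape[0], n_jobs + 1, dtype=int)
    chunks = [(A[lo:hi], AT) for lo, hi in zip(bounds[:-1], bounds[1:])]
    with Pool(n_jobs) as pool:
        parts = pool.map(_block_dot, chunks)
    return sparse.vstack(parts)   # reassemble the row blocks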
I wrote my own implementation using sklearn. It is not parallel, but it is quite fast for large matrices.
from scipy.sparse import spdiags
from sklearn.preprocessing import normalize

def get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix):
    sp_matrix = sp_matrix.tocsr()
    matrix = sp_matrix.dot(sp_matrix.T)
    # zero the diagonal (self-similarities)
    diag = spdiags(-matrix.diagonal(), [0], *matrix.shape, format='csr')
    matrix = matrix + diag
    return matrix

def get_similarity_by_cosine(sp_matrix):
    sp_matrix = normalize(sp_matrix.tocsr())
    return get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix)
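A small usage example (the random matrix is just a stand-in for A):

from scipy.sparse import random as sparse_random

A = sparse_random(1000, 5000, density=0.001, format='csr', random_state=1)
sim = get_similarity_by_cosine(A)  # sparse matrix of pairwise cosine similarities
print(sim.shape)                   # (1000, 1000)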

create ROC from 10 different thresholds

I have output from SVMlight which has x = predictions (0.1, -0.6, 1.2, -0.7, ...) and y = actual class {+1, -1}. I want to create an ROC curve for 10 specific thresholds (let t be a vector that contains the 10 threshold values). I checked the ROCR package, but I didn't see any option for supplying a threshold vector. I need to calculate the TPR and FPR for each threshold value and plot them. Is there any other way to do that? I am new to R programming.
ROCR creates an ROC curve by plotting the TPR and FPR for many different thresholds. This can be done with just one set of predictions and labels because if an observation is classified as positive for one threshold, it will also be classified as positive at a lower threshold. I found this paper to be helpful in explaining ROC curves in more detail.
You can create the plot as follows in ROCR, where x is the vector of predictions and y is the vector of class labels:
library(ROCR)
pred <- prediction(x, y)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
If you want to access the TPR and FPR associated with all the thresholds, you can examine the performance object 'perf':
str(perf)
The following answer shows how to obtain the threshold values in more detail:
https://stackoverflow.com/a/16347508/786220
You can do that with the pROC package. First create the ROC curve (for all thresholds):
library(pROC)
myROC <- roc(y, x) # with the x and y you defined in your question
And then you query this curve for the 10 (or any number of) thresholds that you stored in t:
coords(myROC, x = t, input="threshold", ret = c("threshold", "se", "1-sp"))
Sensitivity is your TPR while 1-Specificity is your FPR.
Disclaimer: I am the author of pROC.
You can use this function:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def roc_curve_new(y_true, y_pred, thresholds):
    fpr_list = []
    tpr_list = []
    thresholds_list = []
    for threshold in thresholds:
        thresholds_list.append(threshold)
        # binarize the scores at this threshold
        y_pred_b = np.where(y_pred >= threshold, 1, 0)
        tn, fp, fn, tp = confusion_matrix(list(y_true), list(y_pred_b)).ravel()
        # true positive rate
        tpr = tp / (tp + fn)
        # false positive rate
        fpr = fp / (fp + tn)
        fpr_list.append(fpr)
        tpr_list.append(tpr)
    return fpr_list, tpr_list, thresholds_list

thresholds = np.arange(0.1, 1.1, 0.1)
y = np.array([1, 1, 0, 1, 1, 0, 0])
scores = np.array([0.5, 0.4, 0.35, 0.75, 0.55, 0.4, 0.2])
fpr, tpr, _ = roc_curve_new(y, scores, thresholds)

plt.plot(fpr, tpr, '.-', color='b')
plt.plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
It will give you the ROC plot for those thresholds (image omitted).
