Euclidean distance based fitness function - genetic-algorithm

Does it make sense to use Euclidean distance as a fitness function in order to maximise based on multiple parameters? If not, what sort of fitness function should I be using for such a task?

Your biggest problem using Euclidean distance is that your multiple objectives might not be scaled the same way. For example, if objective A ranges from 1 to 1000 and objective B ranges from 0 to 1, you're going to favor objective A. If you're wedded to using a single aggregate objective rather than an MOEA that does Pareto ranking, like NSGA-II, pay attention to objective scaling, and also consider a satisficing formulation.
Satisficing is where you saturate an objective at a certain good-enough value. In Python, it might look like this (assuming minimization):
a_prime = max(a, 40)
b_prime = max(b, 0.1)
aggregate_objective = a_prime / 1000 + b_prime
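A slightly more general sketch of the same idea, combining min-max scaling with satisficing. The function name, the ranges, and the thresholds are illustrative assumptions, not part of any library; all objectives are assumed to be minimized.

```python
def satisficed_aggregate(objectives, ranges, thresholds):
    """Aggregate minimization objectives after satisficing and scaling.

    objectives: raw objective values (lower is better)
    ranges:     (low, high) per objective, used for min-max scaling
    thresholds: good-enough value per objective; going below it earns nothing
    """
    total = 0.0
    for value, (low, high), good_enough in zip(objectives, ranges, thresholds):
        saturated = max(value, good_enough)        # satisficing step
        total += (saturated - low) / (high - low)  # scale each objective to [0, 1]
    return total


# Hypothetical ranges and thresholds matching the example above.
ranges = [(1.0, 1000.0), (0.0, 1.0)]
thresholds = [40.0, 0.1]

print(satisficed_aggregate([500.0, 0.5], ranges, thresholds))
# Both objectives are already better than "good enough", so the score
# is the same as for the threshold values themselves.
print(satisficed_aggregate([10.0, 0.05], ranges, thresholds))
```

Because every objective is mapped to roughly the same [0, 1] scale before summing, no single objective dominates just because of its units.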

Algorithm to find desired direction with minimum amount of iterations

There are three components to this problem:
A three dimensional vector A.
A "smooth" function F.
A desired vector B (also three dimensional).
We want to find a vector A that when put through F will produce the vector B.
F(A) = B
F can be anything that somehow transforms or distorts A in some manner. The point is that we want to iteratively call F(A) until B is produced.
The question is:
How can we do this, but with the least amount of calls to F before finding a vector that equals B (within a reasonable threshold)?
I am assuming that what you call "smooth" is tantamount to being differentiable.
Since the concept of smoothness only makes sense in the rational / real numbers, I will also assume that you are solving a floating point-based problem.
In this case, I would formulate the problem as a nonlinear programming problem, i.e. minimizing the squared norm of the difference between F(A) and B, given by
(F(A)_1 - B_1)² + (F(A)_2 - B_2)² + (F(A)_3 - B_3)²
It should be clear that this expression is zero if and only if F(A) = B and positive otherwise. Therefore you would want to minimize it.
As an example, you could use the solvers built into the scipy optimization suite (available for python):
from scipy.optimize import minimize
# Example function F
f = lambda x: [x[0] + 1, x[2], 2*x[1]]
# Target vector B (an assumption: chosen to be consistent with the result quoted below)
B = [0, 0, 5]
# Optimization objective: squared norm of F(x) - B
fsq = lambda x: sum((v - b)**2 for v, b in zip(f(x), B))
# Initial guess
x0 = [0, 0, 0]
res = minimize(fsq, x0, tol=1e-6)
# res.x is the solution, in this case approximately
# array([-1.0, 2.5, 0.0])
A binary search (as pointed out above) only works if the function is 1-d, which is not the case here. You can try out different optimization methods by adding the method="name" to the call to minimize, see the API. It is not always clear which method works best for your problem without knowing more about the nature of your function. As a rule of thumb, the more information you give to the solver, the better. If you can compute the derivative of F explicitly, passing it to the solver will help reduce the number of required evaluations. If F has a Hessian (i.e., if it is twice differentiable), providing the Hessian will help as well.
As an alternative, you can use the least_squares function (also in scipy.optimize) on the residual directly via res = least_squares(r, x0), where r(x) returns the vector F(x) - B (for B = 0 this is just f itself). This could be faster since the solver can exploit the fact that you are solving a least squares problem rather than a generic optimization problem.
From a more general standpoint, the problem of restoring the function arguments producing a given value is called an Inverse Problem. These problems have been extensively studied.
Provided that F(A) = B, where F and B are known and A is unknown, you can start with a simple Newton-type search based on the linearization
F(A) ~= F(C) + F'(C)*(A - C) ~= B
where F'(C) is the Jacobian (the matrix of partial derivatives) of F evaluated at the point C. I'm assuming for now that you can calculate it analytically.
So you can choose a point C0 that you estimate is not very far from the solution and iterate:
C = C0;
while (true)
{
    Ai = inverse(F'(C)) * (B - F(C)) + C;
    convergence = norm(Ai - C);
    C = Ai;
    if (convergence < someThreshold)
        break;
}
If the Jacobian of F cannot be calculated analytically, you can estimate it. Let Ei, i = 1..3, be the orthonormal basis vectors; then
Fi'(C) = (F(C + Ei*d) - F(C - Ei*d)) / (2*d);
F'(C) = [F1'(C) | F2'(C) | F3'(C)];
and d can be chosen as fixed or as a function of the convergence value.
These algorithms suffer from local extrema, regions where the derivative vanishes, and so on, so for them to work the starting point C0 must be reasonably close to the solution, in a region where F behaves monotonically.
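The iteration with a finite-difference derivative estimate could be sketched in Python as follows. This is a minimal illustration under assumptions: the helper names (jacobian, solve, newton), the step size d, and the example F and B are all made up for the sketch, and there is no safeguard against a singular derivative matrix (the solve step would then divide by zero).

```python
def jacobian(F, c, d=1e-6):
    """Central-difference estimate of the matrix of partial derivatives of F at c."""
    n = len(c)
    cols = []
    for i in range(n):
        plus = [c[k] + (d if k == i else 0.0) for k in range(n)]
        minus = [c[k] - (d if k == i else 0.0) for k in range(n)]
        cols.append([(p - m) / (2 * d) for p, m in zip(F(plus), F(minus))])
    # cols[i][r] = dF_r/dx_i; transpose so that J[r][i] is row-major
    return [[cols[i][r] for i in range(n)] for r in range(len(cols[0]))]


def solve(J, rhs):
    """Solve J * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[r]] for r, row in enumerate(J)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def newton(F, B, c0, tol=1e-10, max_iter=50):
    """Iterate C <- C + inverse(F'(C)) * (B - F(C)) until the step is tiny."""
    c = list(c0)
    for _ in range(max_iter):
        residual = [b - f for b, f in zip(B, F(c))]
        step = solve(jacobian(F, c), residual)
        c = [ci + si for ci, si in zip(c, step)]
        if max(abs(s) for s in step) < tol:
            break
    return c


# Hypothetical example: F(x) = [x0 + 1, x2, 2*x1], target B = [0, 0, 5]
F = lambda x: [x[0] + 1, x[2], 2 * x[1]]
A = newton(F, [0.0, 0.0, 5.0], [0.0, 0.0, 0.0])
print(A)
```

Note that each derivative estimate costs six extra calls to F in three dimensions, which matters if the goal is to minimize the number of evaluations.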
It seems like you could try a metaheuristic approach for this.
A genetic algorithm (GA) might be a good fit.
You can generate a number of A vectors as an initial population and use the GA to evolve that population, so that later generations contain vectors Ax for which F(Ax) is closer to B.
Your fitness function can be a simple function that compares F(Ai) to B.
You can choose how to mutate your population in each generation.
A simple example about GA can be found here link
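For illustration, here is a minimal GA sketch in Python. Everything in it is an assumption made for the example: the function name evolve, the population size, the truncation selection, the uniform crossover, the Gaussian mutation, and the example F and B. Note that it calls F once per individual per generation, so it is far from minimal in the number of evaluations.

```python
import random


def evolve(F, B, pop_size=40, generations=200, sigma=0.3, seed=0):
    """Evolve 3-d vectors A so that F(A) approaches B; returns (best, history)."""
    rng = random.Random(seed)

    def error(a):
        # Squared distance between F(a) and B; lower means fitter.
        return sum((f - b) ** 2 for f, b in zip(F(a), B))

    population = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        # Truncation selection: the better half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(p, q)]    # uniform crossover
            child = [g + rng.gauss(0, sigma) for g in child]    # Gaussian mutation
            children.append(child)
        population = parents + children
    population.sort(key=error)
    history.append(error(population[0]))
    return population[0], history


# Hypothetical example function and target.
F = lambda a: [a[0] + 1, a[2], 2 * a[1]]
best, history = evolve(F, [0.0, 0.0, 5.0])
print(best, history[0], history[-1])
```

Because the surviving parents are carried over unchanged, the best error recorded in history never increases from one generation to the next.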

optimize integral f(x)exp(-x) from x=0,infinity

I need a robust integration algorithm for f(x)exp(-x) between x=0 and infinity, with f(x) a positive, differentiable function.
I do not know the array x a priori (it's an intermediate output of my routine). The x array is typically ~log-equispaced, but highly irregular.
Currently, I'm using the Simpson algorithm, but my problem is that often the domain is highly undersampled by the x array, which produces unrealistic values for the integral.
On each run of my code I need to do this integration thousands of times (each with a different set of x values), so I need to find an efficient and robust way to integrate this function.
More details:
The x array can have between 2 and N points (N known). The first value is always x[0] = 0.0. The last point is always a value greater than a tunable threshold x_max (such that exp(-x_max) is approximately 0). I only know the values of f at the points x[i] (though the function is a smooth function).
My first idea was to do a Laguerre-Gauss quadrature integration. However, this algorithm seems to be highly unreliable when one does not use the optimal quadrature points.
My current idea is to add a set of auxiliary points, interpolating f, such that the Simpson algorithm becomes more stable. If I do this, is there an optimal selection of auxiliary points?
I'd appreciate any advice,
Thanks.
Set t=1-exp(-x), then dt = exp(-x) dx and the integral value is equal to
integral[ f(-log(1-t)) , t=0..1 ]
which you can evaluate with the standard Simpson formula and hopefully get good results.
Note that piecewise linear interpolation will always result in an order 2 error for the integral, as the result amounts to a trapezoid formula even if the method was Simpson. For better errors in the Simpson method you will need higher interpolation degrees, ideally cubic splines. Cubic Bezier polynomials with estimated derivatives to compute the control points could be a fast compromise.
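A minimal sketch of the substitution in Python. To keep it short it uses the trapezoid rule on the transformed nodes as a stand-in for Simpson, so it is second order at best; the function name and the sample grid are assumptions made for the example.

```python
import math


def integrate_weighted(x, fx):
    """Approximate the integral of f(x)*exp(-x) over [x[0], x[-1]] from samples.

    Uses the substitution t = 1 - exp(-x), which absorbs the weight exp(-x)
    into dt, then applies the trapezoid rule on the non-uniform t nodes.
    """
    t = [-math.expm1(-xi) for xi in x]   # t = 1 - exp(-x), computed stably
    total = 0.0
    for i in range(len(x) - 1):
        total += 0.5 * (fx[i] + fx[i + 1]) * (t[i + 1] - t[i])
    return total


# Hypothetical test case: f(x) = x on a dense grid up to about x = 30,
# where the exact value of the truncated integral is known in closed form.
x = [0.01 * i for i in range(3001)]
approx = integrate_weighted(x, x)
exact = 1.0 - (1.0 + x[-1]) * math.exp(-x[-1])
print(approx, exact)
```

Only the sampled values f(x[i]) are needed, since f(-log(1 - t[i])) is just f(x[i]); no interpolation is required for this basic version.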

How is the decision boundary plotted after the parameters theta are updated

I have been learning about machine learning algorithms this semester, but I can't seem to understand how the parameters theta are used once gradient descent has been run and they have been updated, specifically in logistic regression. In short, my question is: how is the decision boundary plotted after the parameters theta are updated?
After you use gradient descent to estimate your parameters theta, you can use those calculated parameters to make predictions.
For any input x, you can now calculate a predicted outcome y.
Ultimately the goal of machine learning is to make predictions.
So you take a whole bunch of observations x and y. Where x is your input and y is your output. In case of logistic regression, y is one of two values. For example, take a bunch of emails (x) that are labeled spam or no spam (y is 1 for spam and 0 for no spam). Or take a bunch of medical images that are labeled healthy or non healthy. ...
Feed all that data in your machine learning algorithm. Your algorithm (gradient descent for example), will calculate the theta coefficients.
Now you can use these theta coefficients to make predictions for new values of x. For example, given a new email that the system has never seen, you can use the theta coefficients to predict whether it is spam or not.
As for plotting the decision boundary: this is feasible when you have two dimensions for x. You can put one dimension on each axis, and the dots in your graph would show your y values. You could color them differently or use a different shape depending on whether the result is one way or the other (i.e. whether your y is 0 or 1).
In practice, these plots are useful during a lecture to get a general gist of what you're trying to accomplish. In reality, every input x would probably be a vector of many values (way more than 2), and then it is no longer possible to plot the decision boundary directly.
Typically, logistic regression is parametrized in the following way:
cl(x|theta) = [ 1 / (1 + exp(-(SUM_{i=1}^d theta_i x_i + theta_0))) > 0.5 ]
which is equivalent to
cl(x|theta) = sign(SUM_{i=1}^d theta_i x_i + theta_0)
so once you have your theta, you make a prediction by computing a simple weighted sum of your data representation and checking the sign of that number. The decision boundary is the set of points where this weighted sum is exactly zero.
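As a small illustration in Python (the parameter values and function names below are made up for the example, not the output of any particular training run): prediction checks the sign of the weighted sum, and with two features the decision boundary is the straight line theta_0 + theta_1*x_1 + theta_2*x_2 = 0, which can be solved for x_2 and drawn.

```python
import math


def score(theta0, theta, x):
    """Weighted sum theta_0 + sum_i theta_i * x_i."""
    return theta0 + sum(t * xi for t, xi in zip(theta, x))


def predict_proba(theta0, theta, x):
    """Logistic function of the weighted sum (fine for moderate scores)."""
    return 1.0 / (1.0 + math.exp(-score(theta0, theta, x)))


def predict(theta0, theta, x):
    """Class 1 if the probability exceeds 0.5, i.e. if the score is positive."""
    return 1 if score(theta0, theta, x) > 0 else 0


def boundary_x2(theta0, theta, x1):
    """x2 on the decision boundary for a given x1 (two features, theta[1] != 0)."""
    return -(theta0 + theta[0] * x1) / theta[1]


theta0, theta = -3.0, [1.0, 2.0]   # hypothetical fitted parameters
for x1 in (0.0, 1.0, 2.0):
    # Points (x1, x2) to connect when drawing the boundary line.
    print(x1, boundary_x2(theta0, theta, x1))
print(predict(theta0, theta, [1.0, 1.5]), predict(theta0, theta, [1.0, 0.5]))
```

Feeding the printed (x1, x2) pairs to any plotting tool on top of the scatter plot of the training points gives the usual lecture-style picture.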

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
        Comp1   Comp2   Comp3
User1   20min   ??      10min
User2   25min   20min   12min
User3   30min   25min   ??
User4   30min   ??      ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover the correlation between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets
you most of the way to the solution. I'll codify your intuition that
the combination of ability of a user and difficulty of the course
yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users so that u_i is user i's
speed. Let the vector v denote the difficulty of the courses so
that v_j is course j's difficulty. Also when available, let t_ij be user i's time on
course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible
model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the
prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It
minimizes the above cost function by gradient descent. The gradients of
f with respect to u and v are (up to a constant factor of 2, with the sums running over the observed entries only)
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS
optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or
optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been
proposed. The resulting algorithms are just as simple to code up and seem to
work very well. Check out, for example Collaborative Filtering in a Non-Uniform World:
Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data into a uniform function with 0 mean and 1 std deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the std deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i,j
    if (Participant i participated in Race j)
        actual = ActualRaceResult(i,j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term on the loss function, for example square of the lengths of the feature vectors)
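The architecture above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tuned implementation: the function name factorize, the learning rate, the regularization weight lam, the initialization range, and the toy data are all made up, and plain fixed-step gradient descent is used for brevity.

```python
import random


def factorize(observed, n_rows, n_cols, n_features=2, lr=0.01, lam=0.01,
              sweeps=2000, seed=0):
    """Alternating gradient descent on the regularized squared error.

    observed: dict mapping (participant i, race j) -> known (normalized) result
    Returns participant vectors, race vectors, initial loss, and final loss.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.05, 0.15) for _ in range(n_features)] for _ in range(n_rows)]
    R = [[rng.uniform(0.05, 0.15) for _ in range(n_features)] for _ in range(n_cols)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def loss():
        err = sum((y - dot(P[i], R[j])) ** 2 for (i, j), y in observed.items())
        reg = sum(dot(v, v) for v in P + R)   # squared lengths of all vectors
        return err + lam * reg

    def step(update, other, index):
        # One gradient step on every vector in `update`, holding `other` fixed.
        grads = [[2 * lam * w for w in vec] for vec in update]
        for key, y in observed.items():
            a, b = key[index], key[1 - index]
            residual = dot(update[a], other[b]) - y
            for k in range(n_features):
                grads[a][k] += 2 * residual * other[b][k]
        for vec, g in zip(update, grads):
            for k in range(n_features):
                vec[k] -= lr * g[k]

    initial = loss()
    for _ in range(sweeps):
        step(P, R, 0)   # improve participant feature vectors
        step(R, P, 1)   # improve race feature vectors
    return P, R, initial, loss()


# Toy data (hypothetical): 3 participants, 3 races, two results missing.
observed = {(0, 0): 1.0, (0, 2): 2.0, (1, 0): 2.0, (1, 1): 1.0,
            (1, 2): 4.0, (2, 0): 3.0, (2, 1): 1.5}
P, R, initial, final = factorize(observed, 3, 3)
print(initial, final)
# Predictions for the missing entries are scalar products of the learned vectors.
print(sum(p * r for p, r in zip(P[0], R[1])), sum(p * r for p, r in zip(P[2], R[2])))
```

In practice the number of features, the learning rate, and lam would be chosen on a held-out validation set, as the steps above suggest.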
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical task of missing data recovery. There are several different methods. One that I can suggest is based on the Self-Organizing Feature Map (Kohonen's map).
Below it is assumed that every athlete record is a pattern, and every competition result is a feature.
Basically, you should divide your data into 2 sets: the first with fully defined patterns, and the second with patterns that have partially lost features. I assume this is acceptable because sparsity is 8%, that is, you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and some implementations are even available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weights corresponding to the missing features.
As an alternative, you could skip dividing the data into 2 sets and train the net on all patterns, including the ones with missing features. For such patterns the learning process should be altered in a similar way, that is, the BMU should be calculated only on the features that exist in each pattern.
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that selects the known entries of your matrix M' and constrains those entries in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope, the nuclear norm ||M||_*. Then you trade off the two terms by controlling the parameter lambda.

What's a good weighting function?

I'm trying to perform some calculations on a non-directed, cyclic, weighted graph, and I'm looking for a good function to calculate an aggregate weight.
Each edge has a distance value in the range [1,∞). The algorithm should give greater importance to lower distances (it should be monotonically decreasing), and it should assign the value 0 for the distance ∞.
My first instinct was simply 1/d, which meets both of those requirements. (Well, technically 1/∞ is undefined, but programmers tend to let that one slide more easily than do mathematicians.) The problem with 1/d is that the function cares a lot more about the difference between 1/1 and 1/2 than the difference between 1/34 and 1/35. I'd like to even that out a bit more. I could use √(1/d) or ∛(1/d) or even ∜(1/d), but I feel like I'm missing out on a whole class of possibilities. Any suggestions?
(I thought of ln(1/d), but that goes to -∞ as d goes to ∞, and I can't think of a good way to push that up to 0.)
Later:
I forgot a requirement: w(1) must be 1. (This doesn't invalidate the existing answers; a multiplicative constant is fine.)
perhaps:
exp(-d)
edit: something along the lines of
exp(k(1-d)), k real
will fit your extra requirement (I'm sure you knew that but what the hey).
How about 1/ln (d + k)?
Some of the above answers are versions of a Gaussian distribution which I agree is a good choice. The Gaussian or normal distribution can be found often in nature. It is a B-Spline basis function of order-infinity.
One drawback to using it as a blending function is its infinite support requires more calculations than a finite blending function. A blend is found as a summation of product series. In practice the summation may stop when the next term is less than a tolerance.
If possible form a static table to hold discrete Gaussian function values since calculating the values is computationally expensive. Interpolate table values if needed.
How about this?
w(d) = (1 + k)/(d + k) for some large k
d = 2 + k would be the place where w(d) = 1/2
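To compare the candidates suggested so far, here is a small Python sketch. The function names and the particular constants (p, k) are arbitrary choices for illustration; each function is normalized so that w(1) = 1, and each returns 0 for an infinite distance.

```python
import math


def w_power(d, p=0.5):
    """d**(-p): p = 1 gives 1/d, p = 1/2 gives sqrt(1/d)."""
    return d ** -p


def w_exp(d, k=0.1):
    """exp(k * (1 - d)), which equals 1 at d = 1."""
    return math.exp(k * (1.0 - d))


def w_log(d):
    """1 / ln(d + k) with k = e - 1, chosen so that w(1) = 1."""
    return 1.0 / math.log(d + math.e - 1.0)


def w_rational(d, k=10.0):
    """(1 + k) / (d + k); reaches 1/2 at d = 2 + k."""
    return (1.0 + k) / (d + k)


for d in (1.0, 2.0, 12.0, 35.0, math.inf):
    print(d, w_power(d), w_exp(d), w_log(d), w_rational(d))
```

Printing a few distances side by side makes it easier to judge how strongly each function separates small distances from large ones.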
It seems you are in effect looking for a linear decrease, something along the lines of infinity - d. Obviously this solution is garbage as written, but since you are probably not using an arbitrary-precision data type for the distance, you could use yourDatatype.MaxValue - d to get a linearly decreasing function.
In fact, you might consider using (yourDatatype.MaxValue - d) + 1 if you are using doubles, because you could then assign a weight of 0 when your distance is "infinity" (since doubles actually have a value for that).
Of course you still have to consider implementation details like w(d) = double.infinity or w(d) = integer.MaxValue, but these should be easy to spot if you know the actual data types you are using ;)
