I am interested in plotting the performance of individual subpopulations in island-based distributed genetic algorithms. I have seen a couple of research works that calculate the rank and order of the subpopulations and plot the rank against the generations to show how the subpopulations evolve.
I could not understand how the rank of each subpopulation is calculated.
Could anyone please explain?
Generally, the rank of a sub-population is assigned based on some quality of that sub-population. Common choices are the average fitness value of the sub-population, the best fitness value of the sub-population, and so on.
The rank can then be used as a measure to order the sub-populations.
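A minimal sketch of that idea, assuming each island's fitness values for one generation are available as a plain list and that higher fitness is better. The island names, the numbers, and the function name rank_islands are made up for illustration; the papers you read may use a different quality measure.

```python
from statistics import mean

def rank_islands(fitness_by_island, quality=mean, higher_is_better=True):
    """Return {island_name: rank}, where rank 1 is the best island."""
    # One quality score per island (average fitness by default).
    scores = {name: quality(values) for name, values in fitness_by_island.items()}
    # Order the island names by their score, best first.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position for position, name in enumerate(ordered, start=1)}

# Hypothetical fitness values of three islands in one generation.
generation = {
    "island_a": [0.2, 0.5, 0.9],
    "island_b": [0.6, 0.7, 0.8],
    "island_c": [0.1, 0.3, 0.4],
}

ranks_by_mean = rank_islands(generation)               # rank on average fitness
ranks_by_best = rank_islands(generation, quality=max)  # rank on best fitness
```

Calling this once per generation and storing the result gives the rank-versus-generation series to plot. Note that the two quality measures can disagree: here island_b ranks first on average fitness while island_a ranks first on best fitness.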
I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the calinski-harabaz score but some score metrics will be monotone increasing/decreasing with respect to increasing K. For instance the mean squared error for linear regression will always decrease each time a new feature is added to the model so other scores that add penalties for increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
SUMMARY
The metric decreases with each increase of K; this strongly suggests that there is no natural clustering in the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
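As a rough sketch of the sum-of-squares metric just described, assuming the clusters are already available as lists of points. The function name and the sample points are made up, and treating a single-point cluster as contributing zero is my own assumption, since n-1 would be zero there.

```python
def mean_within_cluster_variance(clusters):
    """clusters: list of clusters, each a list of points (tuples of floats)."""
    per_cluster = []
    for points in clusters:
        n = len(points)
        if n < 2:
            per_cluster.append(0.0)  # assumption: a singleton contributes zero
            continue
        dims = len(points[0])
        centroid = [sum(p[d] for p in points) / n for d in range(dims)]
        # Sum of squared distances from the centroid, divided by n - 1.
        squared = sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
        )
        per_cluster.append(squared / (n - 1))
    # Average the per-cluster values over all clusters.
    return sum(per_cluster) / len(per_cluster)

clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 14.0), (10.0, 12.0)],
]
score = mean_within_cluster_variance(clusters)
```

Like the other metrics here, this one tends to shrink as K grows, so it is more useful for spotting where the improvement levels off than for picking the smallest value.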
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. The score is the ratio of between-cluster dispersion to within-cluster dispersion: you want to maximize the former (the numerator) and minimize the latter (the denominator). As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.
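To make the "higher is better" point concrete, here is a small hand-rolled version of the ratio, using the usual definition with the between-cluster dispersion divided by k-1 and the within-cluster dispersion divided by n-k. It is only a sketch for checking intuition, not sklearn's implementation, and the four points and both labelings are made up.

```python
def calinski_harabasz(points, labels):
    """Between/within dispersion ratio for a labeled set of points."""
    n = len(points)
    dims = len(points[0])
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("the ratio needs 2 <= k <= n - 1")
    overall = [sum(p[d] for p in points) / n for d in range(dims)]
    between = within = 0.0
    for members in groups.values():
        center = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Numerator part: how far each cluster center is from the overall mean.
        between += len(members) * sum(
            (center[d] - overall[d]) ** 2 for d in range(dims)
        )
        # Denominator part: how spread out the members are around their center.
        within += sum(
            sum((p[d] - center[d]) ** 2 for d in range(dims)) for p in members
        )
    # Raises ZeroDivisionError if every cluster has zero spread.
    return (between / (k - 1)) / (within / (n - k))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
scores = {
    "left_vs_right": calinski_harabasz(points, [0, 0, 1, 1]),
    "bottom_vs_top": calinski_harabasz(points, [0, 1, 0, 1]),
}
best = max(scores, key=scores.get)  # pick the maximum, not the minimum
```

The labeling that separates the two distant groups scores far higher than the one that cuts across them, which is why the search over K should look for the largest score.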
Consider I have the following weights and quantitative parameters: w_1..w_n, p_1..p_n. 0 <= w <= 1. I also have a selection of cases of parameters and associated values.
What algorithms exist for finding the optimal weights to minimize the errors of predicting the value given the parameters? And what algorithms have typically achieved the best results?
I am trying to predict the quality of an apple based on the parameters p_1=transport_time, p_2=days_since_picking. The quality is measured using a subjective Likert scale.
Fifty people have rated apples with scores from 1 to 5 and I know p_1 and p_2 for all those apples. How do I predict and find the weights for p_1 and p_2 that minimize the total errors in the cases?
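For reference, a minimal sketch of what an ordinary least-squares fit of this kind looks like, written in plain Python so it runs without extra libraries. The cases and ratings below are synthetic (generated from a known linear rule so the result can be checked), the function names are made up, and the 0 <= w <= 1 bound from the question is not enforced here; that would need a constrained solver.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_least_squares(rows, targets):
    """Return [intercept, w_1, ..., w_n] minimizing the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept term
    k = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic cases: (transport_time, days_since_picking), not real survey data.
cases = [(1, 1), (2, 1), (1, 3), (3, 4), (4, 2), (5, 6)]
ratings = [5 - 0.3 * t - 0.4 * d for t, d in cases]

intercept, w_time, w_days = fit_least_squares(cases, ratings)
```

Because the synthetic ratings follow the rule exactly, the fit recovers an intercept near 5 and weights near -0.3 and -0.4. With real Likert ratings the fit will not be exact, and the residual error tells you how much a linear model can explain.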
I agree with the comment that you should run a web search on "linear regression". At least three other sources for lists of algorithms come to mind:
NLopt: http://ab-initio.mit.edu/wiki/index.php/NLopt_Algorithms (and my C# wrapper for it: https://github.com/BrannonKing/NLoptNet)
S. Boyd's book: http://stanford.edu/~boyd/cvxbook/
You could probably use a supervised AI algorithm. Neural networks are typically made up of "weights": https://en.wikipedia.org/wiki/Supervised_learning
You could also use a genetic algorithm in conjunction with Gray code weight encoding.
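For illustration, a short sketch of Gray code encoding for a weight in [0, 1], assuming each weight is stored as a fixed-width unsigned integer in the chromosome. The helper names and the 8-bit width are my own choices.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)

def from_gray(g):
    """Convert a reflected Gray code back to the ordinary integer."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def decode_weight(gray_bits, n_bits):
    """Map an n_bits-wide Gray-coded integer to a weight in [0, 1]."""
    return from_gray(gray_bits) / (2 ** n_bits - 1)

weight = decode_weight(to_gray(255), 8)  # largest 8-bit value maps to 1.0
```

The point of the encoding is that neighbouring integers differ in exactly one bit, so a single-bit mutation can always move a weight to an adjacent value, which plain binary does not guarantee (7 to 8 flips four bits).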
I am trying to solve this problem using a genetic algorithm and am having difficulty choosing the fitness function.
My problem is a little different from the original Traveling Salesman Problem, since the individuals in the population, and possibly also the winning individual, do not necessarily contain all the cities.
So, for each individual I have the number of cities it visits, the total time, and the order in which it visits the cities.
I tried 2-3 fitness functions, but they don't give a good solution.
I need an idea for a good fitness function that takes into account both the number of cities visited and the total time.
Thanks!
In addition to Peladao's suggestions of using a Pareto approach or some kind of weighted sum, there are two more possibilities that I'd like to mention for the sake of completeness.
First, you could prioritize your fitness functions, so that the individuals in the population are ranked by the first goal, then the second goal, then the third goal. Two individuals are compared by the second goal only if they are equal in the first goal. If there is a clear dominance among your goals, this may be a feasible approach.
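A minimal sketch of such a prioritized ranking, assuming (for illustration only) that visiting more cities is the first goal and a shorter total time is the second. The tours below are made up.

```python
# Each hypothetical individual is summarized as (cities_visited, total_time).
tours = [(5, 40.0), (7, 95.0), (7, 80.0), (6, 30.0)]

def priority_key(tour):
    cities, total_time = tour
    # First goal: more cities (negated so that more cities sorts first).
    # Second goal: less time, consulted only when the first goal is tied.
    return (-cities, total_time)

ranked = sorted(tours, key=priority_key)
```

Python compares tuples element by element, which is exactly the "second goal only on ties" rule. Here the two 7-city tours come first, ordered by time, even though the 6-city tour is much faster.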
Second, you could define two of your goals as constraints that you penalize only when they exceed a certain threshold. This may be feasible when, e.g., the number of cities should be in a certain range such as [4;7], but it doesn't matter whether it's 4 or 5. This is similar to a weighted sum approach where the contributions of the individual goals to the combined fitness value differ by several orders of magnitude.
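A rough sketch of the penalty idea, under my reading that the city count is meant to stay inside a range such as [4, 7] and that total time is the objective to minimize. The function name and the penalty size of 1000 are arbitrary illustrative choices.

```python
def penalized_fitness(total_time, cities, low=4, high=7, penalty=1000.0):
    """Lower is better. Time is the objective; the city count is a constraint."""
    # Number of cities by which the tour falls outside the allowed range.
    violation = max(0, low - cities) + max(0, cities - high)
    # Inside the range the penalty term is zero, so only the time counts.
    return total_time + penalty * violation

inside_a = penalized_fitness(50.0, 4)
inside_b = penalized_fitness(50.0, 5)
too_few = penalized_fitness(50.0, 3)
too_many = penalized_fitness(50.0, 9)
```

Inside the range, 4 and 5 cities give the same fitness for the same time; outside it, the penalty dominates the time term, which is the "several orders of magnitude" effect mentioned above.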
The Pareto approach is the only one that treats all objectives with equal importance. It requires special algorithms suited for multiobjective optimization though, such as NSGA-II, SPEA2, AbYSS, PAES, MO-TS, ...
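To show what "Pareto" means in practice, here is a sketch of the dominance test and a naive non-dominated filter. This is only the building block, not NSGA-II or any of the other algorithms listed; the candidates are made up, and the city count is negated so that both objectives are minimized.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    a and b are tuples of objective values, all to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# (negated cities visited, total time): both are minimized.
candidates = [(-5, 40.0), (-7, 95.0), (-7, 80.0), (-6, 30.0), (-4, 35.0)]
front = pareto_front(candidates)
```

The result is a set of trade-offs rather than a single winner: the 7-city tour taking 80 and the 6-city tour taking 30 both survive, because neither is better on both objectives.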
In any case, it would be good if you could show the 2-3 fitness functions that you tried. Maybe there were rather simple errors.
Multiple-objective fitness functions can be implemented using Pareto optimality.
You could also use a weighted sum of different fitness values.
For a good and readable introduction into multiple-objective optimisation and GA: http://www.calresco.org/lucas/pmo.htm
I see that for k-means, we have Lloyd's algorithm, Elkan's algorithm, and we also have hierarchical version of k-means.
Among these algorithms, I see that Elkan's algorithm can provide a boost in terms of speed. But what I want to know about is the quality of the results from all these k-means algorithms. Each time we run these algorithms, the result is different, due to their heuristic and probabilistic nature. So my question is: for a clustering algorithm like k-means, if we want a better-quality result (as in lower distortion, etc.), which of these algorithms gives the better quality? Is it possible to measure such a thing?
A better solution is usually one that has a better (lower) J(x,c) value, where:
J(x,c) = 1/|x| * Sum(distance(x(i),c(centroid(i)))) for each i in [1,|x|]
Where:
x is the list of samples
|x| is the size of x (number of elements)
[1,|x|] all the numbers from 1 to |x| (inclusive)
c is the list of centroids (or means) of clusters (i.e., for k clusters |c| = k)
distance(a,b) (sometimes denoted as ||a-b||) is the distance between "point" a and "point" b (in Euclidean 2D space it is sqrt((a.x-b.x)^2 + (a.y-b.y)^2))
centroid(i) - the centroid/mean which is closest to x(i)
Note that this approach does not require switching to supervised technique and can be fully automated!
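A direct sketch of J(x,c) as defined above, assuming the samples and centroids are tuples of floats and the distance is Euclidean. The sample points and the two sets of centroids are made up to stand in for the output of two different runs.

```python
from math import dist

def distortion(samples, centroids):
    """Mean distance from each sample to its closest centroid (lower is better)."""
    return sum(min(dist(x, c) for c in centroids) for x in samples) / len(samples)

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
run_a = [(0.0, 1.0), (10.0, 1.0)]  # centroids from one hypothetical run
run_b = [(5.0, 0.0), (5.0, 2.0)]   # centroids from another hypothetical run

j_a = distortion(samples, run_a)
j_b = distortion(samples, run_b)
better = "run_a" if j_a < j_b else "run_b"
```

Running each algorithm several times and comparing the resulting J values (or their averages) is one way to compare quality. Many k-means implementations report the sum of squared distances instead of the mean distance; either can be used as long as the same measure is applied to every run.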
As I understand it, you need some data with labels to cross-validate your clustering algorithm.
How about the pathological case of the two-moons dataset? Unsupervised k-means will fail badly. A high quality method I am aware of employs a more probabilistic approach using mutual information and combinatorial optimization. Basically you cast the clustering problem as the problem of finding the optimal [cluster] subset of the full point-set for the case of two clusters.
You can find the relevant paper here (page 42) and the corresponding Matlab code here to play with (check out the two-moons case). If you are interested in a high-performance C++ implementation of that with a speedup of >30x, you can find it here: HPSFO.
To compare the quality, you should have a labeled dataset and measure the results by some criterion like NMI.
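For anyone unfamiliar with NMI (normalized mutual information), a small self-contained sketch follows. Several normalizations exist; this one divides by the arithmetic mean of the two entropies, which is an assumption on my part, as is returning 1.0 when both labelings consist of a single cluster.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(truth, predicted):
    """Mutual information of two labelings, normalized by their mean entropy."""
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    count_t, count_p = Counter(truth), Counter(predicted)
    mutual = sum(
        (c / n) * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(truth) + entropy(predicted)) / 2
    # Convention assumed here: two single-cluster labelings agree perfectly.
    return mutual / denom if denom > 0 else 1.0

truth = [0, 0, 1, 1]
same_up_to_renaming = nmi(truth, [1, 1, 0, 0])
unrelated = nmi(truth, [0, 1, 0, 1])
```

The score does not care what the clusters are called, only whether the grouping matches: the relabeled but identical clustering scores about 1, while the clustering that cuts across the true classes scores 0.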
In a typical genetic algorithm, is there any guideline for estimating the generations required to converge given the amount of entropy in the description of an individual in the population?
Also, I suppose it is reasonable to also require the number of offspring per generation and rate of mutation, but adjustment of those parameters is of less interest to me at the moment.
Well, there are not any concrete guidelines in the form of mathematical models, but there are several concepts that people use to communicate about parameter settings and advice on how to choose them. One of these concepts is diversity, which would be similar to the entropy that you mentioned. The other concept is called selection pressure and determines the chance an individual has to be selected based on its relative fitness.
Diversity and selection pressure can be computed for each generation, but the change between generations is very difficult to estimate. You would also need models that predict the expected quality of your crossover and mutation operator in order to estimate the fitness distribution in the next generation.
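As a hedged illustration of computing both per generation: there is no single agreed definition of either quantity, so the two below are just simple choices of mine. Diversity is taken as the mean per-position entropy of a bit-string population, and selection pressure as best fitness over mean fitness, which is the expected number of copies of the best individual under fitness-proportional selection.

```python
from math import log2

def bit_diversity(population):
    """Mean per-position Shannon entropy (in bits) of equal-length bit strings."""
    n = len(population)
    total = 0.0
    for position in zip(*population):  # iterate over columns (loci)
        p = position.count("1") / n
        if 0 < p < 1:  # a fully converged locus contributes zero entropy
            total += -(p * log2(p) + (1 - p) * log2(1 - p))
    return total / len(population[0])

def proportional_selection_pressure(fitnesses):
    """Best fitness divided by mean fitness (assumes positive fitness values)."""
    return max(fitnesses) / (sum(fitnesses) / len(fitnesses))

converged = bit_diversity(["0101", "0101"])
maximal = bit_diversity(["00", "11"])
pressure = proportional_selection_pressure([1.0, 1.0, 2.0])
```

Tracking these two numbers over the generations shows when the population has converged (diversity near 0), but as noted above it does not by itself predict how many generations that will take.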
There has been work published on these topics very recently:
* Chicano and Alba. 2011. Exact Computation of the Expectation Curves of the Bit-Flip Mutation using Landscapes Theory
* Chicano, Whitley, and Alba. 2012. Exact computation of the expectation curves for uniform crossover
Does your question come from a general research interest, or are you seeking practical guidance?
No. If you define a mathematical model of the algorithm (initial population, combination function, mutation function) you can use normal mathematical methods to calculate what you want to know, but "typical genetic algorithm" is too vague to have any meaningful answer.
If you want to set the hyperparameters of some genetic algorithm (e.g. the number of "DNA" bits), then this is typically done in the usual way for any machine learning algorithm, with a cross-validation set.