I"m working with flow job scheduling issue. is there a mathematical equation to compute the fitness instead of the Gantt Chart ? thanks in advance.
It seems the most common fitness measure would be the makespan. Specifically, you would be minimizing the total length of the schedule (smaller makespan --> better fitness).
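For a permutation flow shop, the makespan follows directly from the completion-time recurrence C(j, m) = max(C(j−1, m), C(j, m−1)) + p(j, m), where p(j, m) is the processing time of job j on machine m, so no chart is needed. A minimal Python sketch, with made-up processing times:

```python
def makespan(order, times):
    """Makespan of running jobs in `order`; times[j][m] is job j's time on machine m."""
    n_machines = len(times[0])
    completion = [0] * n_machines  # completion time of the previous job on each machine
    for j in order:
        for m in range(n_machines):
            # A job starts on machine m when both the machine and the job are free.
            start = max(completion[m], completion[m - 1] if m > 0 else 0)
            completion[m] = start + times[j][m]
    return completion[-1]

# Made-up example: 3 jobs on 2 machines; smaller makespan = better fitness.
times = [[3, 2], [1, 4], [2, 2]]
print(makespan([0, 1, 2], times))  # evaluates the schedule 0 -> 1 -> 2
```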
I am developing a simulation model to compare different delivery route options. A critical criterion for selecting a route is to evaluate both the transportation time and cost; the route achieving the best balance between time and cost (or the best score according to weights assigned to time and cost) will be selected. The problem is that time and cost are different measures, so there needs to be a way to transform the two isolated measures into a single uniform measure. What are the usual methods/algorithms for doing this?
Choosing the best method for decision making depends entirely on the assumptions that hold in your problem.
The first thing you should consider is whether the two parameters are completely independent. If we assume transportation time and cost are independent, then there is a simple trade-off between them. "On communication cost vs. load balancing in Content Delivery Networks" is a published paper that investigates this trade-off in a CDN.
I suggest you read the three basic methods proposed in that paper. They are general enough to apply to any independent trade-off problem, so I think they will be enough to get the basic idea.
Added information, in case you have trouble accessing the paper:
The first step in comparing cost and time is to scale the two variables so that they can be compared directly.
Wikipedia has a good article on this; feature scaling would be a good solution for you.
One of the simplest methods for decision making in your problem is calculating the following parameter for each possible solution:
w_i = α·c_i + (1 − α)·t_i
where c_i denotes the scaled cost of picking the ith solution and t_i the scaled time of choosing it. The solution with the minimum value of w_i is the best answer.
In this algorithm, 0 ≤ α ≤ 1 determines the relative importance of time and cost: with α = 1 you decide based only on cost, and with α = 0 time is the only parameter that matters.
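A minimal sketch of the scale-then-combine idea, assuming min-max scaling and made-up cost/time numbers:

```python
import numpy as np

# Hypothetical cost and time for each candidate route.
costs = np.array([120.0, 95.0, 150.0, 110.0])
times = np.array([4.5, 6.0, 3.0, 5.0])

def min_max_scale(x):
    # Rescale to [0, 1] so cost and time become directly comparable.
    return (x - x.min()) / (x.max() - x.min())

c, t = min_max_scale(costs), min_max_scale(times)

alpha = 0.6  # importance of cost; (1 - alpha) is the importance of time
w = alpha * c + (1 - alpha) * t

best = int(np.argmin(w))  # the route with the smallest w_i wins
print(best, w)
```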
I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know much about the Calinski-Harabasz score, but some scoring metrics are monotone increasing/decreasing with respect to increasing K. For instance, the mean squared error for linear regression always decreases each time a new feature is added to the model, which is why scores that add penalties for an increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
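A minimal sketch of that scan, using sklearn (recent versions spell it calinski_harabasz_score) on placeholder data; in practice you would plot the scores and pick the K past which improvement flattens out:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Placeholder data standing in for one of the ~125 datasets.
X = np.random.RandomState(0).rand(1000, 8)

scores = {}
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# Higher CH is better; look for the K where the curve stops improving 'much'.
for k, s in scores.items():
    print(k, round(s, 1))
```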
SUMMARY
The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering in the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is the sum of squares, which is already part of the clustering computation. For each cluster, sum the squares of the distances from the centroid and divide by n−1 (where n is the cluster population); then add/average those values over all clusters.
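A minimal sketch of that metric, assuming `X` is the data matrix and `labels` the cluster assignments:

```python
import numpy as np

def mean_within_cluster_ss(X, labels):
    # Per cluster: sum of squared distances to the centroid, divided by n - 1,
    # then averaged over all clusters.
    values = []
    for c in np.unique(labels):
        pts = X[labels == c]
        if len(pts) < 2:
            continue  # a singleton cluster has no within-cluster spread
        centroid = pts.mean(axis=0)
        values.append(((pts - centroid) ** 2).sum() / (len(pts) - 1))
    return float(np.mean(values))

# Toy usage with random placeholder data and labels.
X = np.random.rand(100, 8)
labels = np.random.randint(0, 4, size=100)
print(mean_within_cluster_ss(X, labels))
```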
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabasz score is the maximum, whereas my question assumed the 'best' was the minimum. The score is computed from the ratio of between-cluster dispersion to within-cluster dispersion; you want to maximize the former (the numerator) and minimize the latter (the denominator). As it turned out, the 'best' CH score for this dataset occurred with 2 clusters (the minimum available for comparison). I also ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.
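For reference, the index is commonly written as
CH(k) = [tr(B_k) / (k − 1)] / [tr(W_k) / (n − k)]
where tr(B_k) is the between-cluster dispersion, tr(W_k) the within-cluster dispersion, n the number of points, and k the number of clusters; larger values indicate tighter, better-separated clusters.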
I am interested in plotting the performance of individual sub-populations in island-based distributed genetic algorithms. I have seen a couple of research works that calculate the rank of the sub-populations and plot the rank against the generations to understand how the sub-populations evolve.
I could not understand how the rank of each sub-population is calculated.
Could anyone please explain?
Generally, the rank of a sub-population is assigned based on some quality measure of the sub-population. Common choices are the average fitness value of the sub-population, the best fitness value of the sub-population, etc.
The rank may then be used as a measure to order the sub-populations.
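A minimal sketch of this ranking, assuming hypothetical island names and fitness lists, with the best fitness as the quality measure:

```python
# Hypothetical sub-populations (islands) with their individuals' fitness values.
subpops = {
    "island_0": [0.61, 0.72, 0.55],
    "island_1": [0.80, 0.78, 0.66],
    "island_2": [0.40, 0.52, 0.49],
}

# Quality measure: best fitness per island (the average works the same way).
quality = {name: max(fits) for name, fits in subpops.items()}

# Rank islands from best to worst; recomputing this every generation and
# plotting rank vs. generation shows how the sub-populations evolve.
ranked = sorted(quality, key=quality.get, reverse=True)
ranks = {name: i + 1 for i, name in enumerate(ranked)}
print(ranks)
```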
A couple of weeks ago I encountered a problem that essentially reduces to a variation of the traveling salesman problem. The twists are:
There are multiple salesmen.
The list of cities grows dynamically (as in, live input).
Each city is only fully profitable for a limited amount of time; after a certain time the city returns less of a reward.
And there is an overall time limit.
Obviously, this problem is NP-hard. I was wondering whether there are any good TSP approximations that could be modified to fit this problem?
You may be able to use some operations research software to solve your problem, e.g. Coin-OR, the reason being that it's generally easier to add new constraints / objectives to an OR constraint/linear/integer/etc programming solver than to e.g. a specialized TSP solver written in C or Fortran or whatever (and it's not likely that you'll find some C/Fortran code to solve your precise problem). Here is a tutorial on how to code a Tabu search to solve the basic TSP using Coin-OR. In addition, here is an article on modeling the time-dependent TSP (the article uses branch-and-bound to solve the problem which probably isn't appropriate for your problem as it doesn't scale beyond a hundred cities or so, but the model should carry over to an approximate solver like Coin-OR).
To account for having multiple salesmen, you may want to look into graph partitioning to divide up the cities among the different salesmen, for example here is a fast online graph partitioning algorithm. The advantage is that once you've partitioned the graphs you can reduce or even eliminate synchronization between the different salesmen.
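A minimal sketch of the partition-then-route idea, using KMeans as a simple stand-in for the online graph-partitioning algorithm linked above and a greedy nearest-neighbour heuristic per salesman; the coordinates are placeholders, and a real system would repartition as new cities arrive:

```python
import numpy as np
from sklearn.cluster import KMeans

cities = np.random.RandomState(1).rand(60, 2)  # placeholder city coordinates
n_salesmen = 4

# Partition the cities spatially, one cluster per salesman.
assignment = KMeans(n_clusters=n_salesmen, n_init=10, random_state=1).fit_predict(cities)

def nearest_neighbour_tour(points):
    # Greedy heuristic: always visit the closest unvisited city next.
    unvisited = list(range(len(points)))
    tour = [unvisited.pop(0)]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: np.linalg.norm(points[i] - last))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

for s in range(n_salesmen):
    idx = np.where(assignment == s)[0]
    if len(idx) == 0:
        continue  # skip an empty partition
    tour = nearest_neighbour_tour(cities[idx])
    print(f"salesman {s}: visits cities {idx[tour].tolist()}")
```

Once partitioned, each salesman's sub-tour can be improved independently (e.g., with the tabu search mentioned above) with little or no synchronization between salesmen.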
I am trying to solve this problem using a genetic algorithm and am finding it difficult to choose the fitness function.
My problem is a little different from the original Traveling Salesman Problem, since an individual in the population (and maybe also the winning individual) does not necessarily visit all the cities.
So I have the following values for each individual: the number of cities it visits, the total time, and the order in which it visits the cities.
I tried 2-3 fitness functions, but they don't give good solutions.
I need an idea for a good fitness function that takes into account both the number of cities visited and the total time.
Thanks!
In addition to Peladao's suggestions of using a Pareto approach or some kind of weighted sum, there are two more possibilities that I'd like to mention for the sake of completeness.
First, you could prioritize your fitness functions, so that the individuals in the population are ranked by the first goal, then the second goal, then the third goal. Only if two individuals are equal on the first goal are they compared on the second goal. If there is a clear dominance among your goals, this may be a feasible approach; a sketch follows below.
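A minimal sketch of such a prioritized comparison, assuming hypothetical (cities, time) individuals where more cities is better and time only breaks ties:

```python
population = [
    {"cities": 7, "time": 42.0},
    {"cities": 9, "time": 55.0},
    {"cities": 9, "time": 48.0},
]

# Rank by first goal (cities, descending), then second goal (time, ascending).
ranked = sorted(population, key=lambda ind: (-ind["cities"], ind["time"]))
print(ranked[0])  # most cities; among those, the fastest
```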
Second, you could define two of your goals as constraints that are penalized only when they exceed a certain threshold. This may be feasible when, e.g., the number of cities should lie within a certain range, say [4, 7], but it doesn't matter whether it is 4 or 5. This is similar to a weighted-sum approach in which the contributions of the individual goals to the combined fitness value differ by several orders of magnitude.
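A minimal sketch of the penalty variant, with a made-up target range and penalty weight:

```python
LOW, HIGH, PENALTY = 4, 7, 1_000.0  # assumed range [4, 7] and penalty weight

def penalized_fitness(cities_visited, total_time):
    # Inside the range the city count is ignored; outside, a large penalty dominates.
    penalty = 0.0
    if cities_visited < LOW:
        penalty = PENALTY * (LOW - cities_visited)
    elif cities_visited > HIGH:
        penalty = PENALTY * (cities_visited - HIGH)
    return total_time + penalty  # minimize

print(penalized_fitness(5, 42.0))  # in range: pure time
print(penalized_fitness(2, 30.0))  # out of range: heavily penalized
```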
The Pareto approach is the only one that treats all objectives with equal importance. It requires special algorithms suited for multi-objective optimization, though, such as NSGA-II, SPEA2, AbYSS, PAES, MO-TS, ...
In any case, it would be good if you could show the 2-3 fitness functions that you tried. Maybe there were rather simple errors.
Multiple-objective fitness functions can be implemented using Pareto optimality.
You could also use a weighted sum of different fitness values.
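For this particular problem, a weighted sum could look like the following sketch; the weights and normalizing constants are assumptions you would tune for your instance:

```python
TOTAL_CITIES = 20     # assumed size of the (current) city list
TIME_LIMIT = 100.0    # assumed overall time limit
W_CITIES, W_TIME = 0.7, 0.3

def fitness(cities_visited, total_time):
    coverage = cities_visited / TOTAL_CITIES                 # in [0, 1], higher is better
    speed = 1.0 - min(total_time, TIME_LIMIT) / TIME_LIMIT   # higher is better
    return W_CITIES * coverage + W_TIME * speed              # maximize

print(fitness(14, 63.0))
```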
For a good and readable introduction into multiple-objective optimisation and GA: http://www.calresco.org/lucas/pmo.htm