I've tried to research about this but couldn't find a satisfying answer. How does the length of an individuals DNA impact on the overall performance of the GA?
Imagine that I'm trying to find a solution for a combinatorial problem with many dimensions (x, y, z, w) for example. Representing the DNA in binary would yield a very long sequence.. are there any suggestions to how far I am allowed to go?
In my opinion the search space (should) increase exponentially with the amount of elements exposed to mutation, am I wrong?
What are some guidelines or techniques (preferably derived from experience) that somebody could provide on how to reduce the length of the DNA?
The GA specific operations such as crossover and mutation should not take long time even for very large chromosome sizes, since they are computationally simple operations (clone a matrix, change a bit of a matrix, etc).
Most of the time (I have seen around 85-95%) will be spent running the evaluation function for your individuals. Is entirely up to the specific problem you are solving to determine if a large DNA will impact the performance or not. Also, depending on the problem you try to solve, it may be impossible to shorten the size of the possible results.
I have a dataset with unknown number of clusters and I aim to cluster them. Since, I don't know the number of clusters in advance, I tried to use density-based algorithms especially DBSCAN. The problem that I have with DBSCAN is that how to detect appropriate epsilon. The method suggested in the DBSCAN paper assume there are some noises and when we plot sorted k-dist graph we can detect valley and define the threshold for epsilon. But, my dataset obtained from a controlled environment and there are no noise.
Does anybody have an idea of how to detect epsilon? Or, suggest better clustering algorithm could fit this problem.
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your observation suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease by a factor of 3. Repeat your runs until you get the desired results.
I need to cluster a matrix which contains mostly zeros values...Is K-means appropriate for these kind of data or do I need to consider a different algorithm?
No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics than your actual data; they will often end up being more similar to each other than to actual documents!
There are some modifications that improve k-means for sparse data such as spherical k-means.
But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.
k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.
There are a few things to keep in mind:
If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
Euclidean distance isn't always the best metric for sparse vectors, but normalizing them to unit length may give better results.
The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
Doing dimensionality reduction, e.g. SVD, on the samples may boost the running time and cluster quality a lot.
Currently, I'm studying genetic algorithms (personal, not required) and I've come across some topics I'm unfamiliar or just basically familiar with and they are:
Search Space
The "extreme" of a Function
I understand that one's search space is a collection of all possible solutions but I also wish to know how one would decide the range of their search space. Furthermore I would like to know what an extreme is in relation to functions and how it is calculated.
I know I should probably understand what these are but so far I've only taken Algebra 2 and Geometry but I have ventured into physics, matrix/vector math, and data structures on my own so please excuse me if I seem naive.
Generally, all algorithms which are looking for a specific item in a collection of items are called search algorithms. When the collection of items is defined by a mathematical function (opposed to existing in a database), it is called a search space.
One of the most famous problems of this kind is the travelling salesman problem, where an algorithm is sought which will, given a list of cities and their distances, find the shortest route for visiting each city only once. For this problem, the exact solution can be found only by examining all possible routes (the entire search space), and finding the shortest one (the route which has the minimum distance, which is the extreme value in the search space). The best time complexity of such an algorithm (called an exhaustive search) is exponential (although it is still possible that there may be a better solution), meaning that the worst-case running time increases exponentially as the number of cities increases.
This is where genetic algorithms come into play. Similar to other heuristic algorithms, genetic algorithms try to get close to the optimal solution by improving a candidate solution iteratively, with no guarantee that an optimal solution will actually be found.
This iterative approach has the problem that the algorithm can easily get "stuck" in a local extreme (while trying to improve a solution), not knowing that there is a potentially better solution somewhere further away:
The figure shows that, in order to get to the actual, optimal solution (the global minimum), an algorithm currently examining the solution around the local minimum needs to "jump over" a large maximum in the search space. A genetic algorithm will rapidly locate such local optimums, but it will usually fail to "sacrifice" this short-term gain to get a potentially better solution.
So, a summary would be:
exhaustive search
examines the entire search space (long time)
finds global extremes
heuristic (e.g. genetic algorithms)
examines a part of the search space (short time)
finds local extremes
Genetic algorithms are not good in tuning to a local optimum. If you want to find a global optimum at least you should be able to approach or find a strategy to approach the local optimum. Recently some improvements have been developed to better find the local optima.
"GENETIC ALGORITHM FOR INFORMATIVE BASIS FUNCTION SELECTION
FROM THE WAVELET PACKET DECOMPOSITION WITH APPLICATION TO
CORROSION IDENTIFICATION USING ACOUSTIC EMISSION"
http://gbiomed.kuleuven.be/english/research/50000666/50000669/50488669/neuro_research/neuro_research_mvanhulle/comp_pdf/Chemometrics.pdf
In general, "search space" means, what type of answers are you looking for. For example, if you are writing a genetic algorithm which builds bridges, tests them out, and then builds more, the answers you are looking for are bridge models (in some form). As another example, if you're trying to find a function which agrees with a set of sample inputs on some number of points, you might try to find a polynomial which has this property. In this instance your search space might be polynomials. You might make this simpler by putting a bound on the number of terms, maximum degree of the polynomial, etc... So you could specify that you wanted to search for polynomials with integer exponents in the range [-4, 4]. In genetic algorithms, the search space is the set of possible solutions you could generate. In genetic algorithms you need to carefully limit your search space so you avoid answers which are completely dumb. At my former university, a physics student wrote a program which was a GA to calculate the best configuration of atoms in a molecule to have low energy properties: they found a great solution having almost no energy. Unfortunately, their solution put all the atoms at the exact center of the molecule, which is physically impossible :-). GAs really hone in on good solutions to your fitness functions, so it's important to choose your search space so that it doesn't produce solutions with good fitness but are in reality "impossible answers."
As for the "extreme" of a function. This is simply the point at which the function takes its maximum value. With respect to genetic algorithms, you want the best solution to the problem you're trying to solve. If you're building a bridge, you're looking for the best bridge. In this scenario, you have a fitness function that can tell you "this bridge can take 80 pounds of weight" and "that bridge can take 120 pounds of weight" then you look around for solutions which have higher fitness values than others. Some functions have simple extremes: you can find the extreme of a polynomial using simple high school calculus. Other functions don't have a simple way to calculate their extremes. Notably, highly nonlinear functions have extremes which might be difficult to find. Genetic algorithms excel at finding these solutions using a clever search technique which looks around for high points and then finds others. It's worth noting that there are other algorithms that do this as well, hill climbers in particular. The things that make GAs different is that if you find a local maximum, other types of algorithms can get "stuck," blinded by a locally good solution, so that they never see a possibly much better solution farther away in the search space. There are other ways to adapt hill climbers to this as well, simulated annealing, for one.
The range space usually requires some intuitive understanding of the problem you're trying to solve-- some expertise in the domain of the problem. There's really no guaranteed method to pick the range.
The extremes are just the minimum and maximum values of the function.
So for instance, if you're coding up a GA just for practice, to find the minimum of, say, f(x) = x^2, you know pretty well that your range should be +/- something because you already know that you're going to find the answer at x=0. But then of course, you wouldn't use a GA for that because you already have the answer, and even if you didn't, you could use calculus to find it.
One of the tricks in genetic algorithms is to take some real-world problem (often an engineering or scientific problem) and translate it, so to speak, into some mathematical function that can be minimized or maximized. But if you're doing that, you probably already have some basic notion where the solutions might lie, so it's not as hopeless as it sounds.
The term "search space" does not restrict to genetic algorithms. I actually just means the set of solutions to your optimization problem. An "extremum" is one solution that minimizes or maximizes the target function with respect to the search space.
Search space simply put is the space of all possible solutions. If you're looking for a shortest tour, the search space consists of all possible tours that can be formed. However, beware that it's not the space of all feasible solutions! It only depends on your encoding. If your encoding is e.g. a permutation, than the search space is that of the permutation which is n! (factorial) in size. If you're looking to minimize a certain function the search space with real valued input the search space is bounded by the hypercube of the real valued inputs. It's basically infinite, but of course limited by the precision of the computer.
If you're interested in genetic algorithms, maybe you're interested in experimenting with our software. We're using it to teach heuristic optimization in classes. It's GUI driven and windows based so you can start right away. We have included a number of problems such as real-valued test functions, traveling salesman, vehicle routing, etc. This allows you to e.g. look at how the best solution of a certain TSP is improving over the generations. It also exposes the problem of parameterization of metaheuristics and lets you find better parameters that will solve the problems more effectively. You can get it at http://dev.heuristiclab.com.
Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
For the evaluation metric, I am using a reciprocal of a simplified version of the Dunns Index. Simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is so that the index works in the correct direction (ie. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d according to the figure shown:
Note: I haven't read the paper you linked to yet..