Impact of DNA length on Genetic Algorithm performance - genetic-algorithm

I've tried to research this but couldn't find a satisfying answer. How does the length of an individual's DNA impact the overall performance of the GA?
Imagine that I'm trying to find a solution to a combinatorial problem with many dimensions (x, y, z, w, for example). Representing the DNA in binary would yield a very long sequence. Are there any suggestions on how far I am allowed to go?
In my opinion the search space should increase exponentially with the number of elements exposed to mutation; am I wrong?
What are some guidelines or techniques (preferably derived from experience) that somebody could provide on how to reduce the length of the DNA?

The GA-specific operations such as crossover and mutation should not take a long time even for very large chromosome sizes, since they are computationally simple operations (clone a matrix, flip a bit of a matrix, etc.).
Most of the time (I have seen around 85-95%) will be spent running the evaluation function for your individuals. It is entirely up to the specific problem you are solving whether a large DNA will impact performance or not. Also, depending on the problem you are trying to solve, it may be impossible to shorten the size of the possible results.
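If you want to see where the time goes for your own problem, a rough way is to time the genetic operators and the fitness evaluations separately. Below is a minimal, hypothetical sketch in Python (not from the answer above); the sleep simply stands in for an expensive evaluation, and the split you measure will of course depend entirely on your real evaluation cost.

```python
import random
import time

def fitness(bits):
    # stand-in for an expensive evaluation (simulation, scoring, etc.)
    time.sleep(0.01)
    return sum(bits)

def crossover(a, b):
    # single-point crossover on bit lists
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.001):
    # flip each bit independently with a small probability
    return [b ^ (random.random() < rate) for b in bits]

population = [[random.randint(0, 1) for _ in range(10_000)] for _ in range(50)]

t0 = time.perf_counter()
scores = [fitness(ind) for ind in population]
t_eval = time.perf_counter() - t0

t0 = time.perf_counter()
offspring = [mutate(crossover(*random.sample(population, 2))) for _ in range(50)]
t_ops = time.perf_counter() - t0

print(f"evaluation: {t_eval:.2f}s   crossover+mutation: {t_ops:.2f}s")
```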

Related

Intuitive reason why minimization is harder in multiple dimensions for separable functions

Let's say I have N positive-valued 1-d functions. Does it take more function evaluations for a numerical minimizer to minimize their product in N-dimensional space rather than do N individual 1d minimizations?
If so, is there an intuitive way to understand this? Somehow I feel like both problems should be equal in complexity.
Minimizing their product is minimizing the sum of their logs. There are many algorithms for min(max)imizing N-dimensional functions. One is the old standby OPTIF9.
If you have to use hard limits, so you're minimizing in a box, that can be a lot harder, but you can usually avoid it.
The complexity is not linear in the number of variables. Typically n small problems are cheaper to solve than one big problem. In other words: making the problem twice as big (in terms of variables) will make it more than twice as expensive to solve.
In some special cases it may be somewhat beneficial to batch a few problems, mainly due to fixed overhead (some solvers do a lot of things before actually starting iterating).
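As a concrete (and entirely illustrative) way to see the trade-off with SciPy: for positive separable functions, minimizing the product is the same as minimizing the sum of the logs, which splits into N cheap 1-d problems, while a coupled N-d minimization of the product typically needs many more function evaluations.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# N separable, positive 1-d functions with known minimizers at 1, 2, ..., N
N = 5
centers = np.arange(1.0, N + 1.0)
funcs = [lambda x, c=c: (x - c) ** 2 + 1.0 for c in centers]

# N independent 1-d minimizations
x_separate = [minimize_scalar(f).x for f in funcs]

# one coupled N-d minimization of the product
product = lambda x: np.prod([f(xi) for f, xi in zip(funcs, x)])
res = minimize(product, x0=np.zeros(N), method='Nelder-Mead')

print("1-d results:", np.round(x_separate, 3))
print("N-d result: ", np.round(res.x, 3), "function evaluations:", res.nfev)
```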

In regards to genetic algorithms

Currently, I'm studying genetic algorithms (personal, not required) and I've come across some topics I'm unfamiliar or just basically familiar with and they are:
Search Space
The "extreme" of a Function
I understand that one's search space is a collection of all possible solutions but I also wish to know how one would decide the range of their search space. Furthermore I would like to know what an extreme is in relation to functions and how it is calculated.
I know I should probably understand what these are but so far I've only taken Algebra 2 and Geometry but I have ventured into physics, matrix/vector math, and data structures on my own so please excuse me if I seem naive.
Generally, all algorithms that look for a specific item in a collection of items are called search algorithms. When the collection of items is defined by a mathematical function (as opposed to existing in a database), it is called a search space.
One of the most famous problems of this kind is the travelling salesman problem, where an algorithm is sought which will, given a list of cities and their distances, find the shortest route that visits each city only once. For this problem, the exact solution can in general be found only by examining all possible routes (the entire search space) and picking the shortest one (the route with the minimum distance, which is the extreme value in the search space). The time complexity of such an algorithm (called an exhaustive search) is exponential (although a faster exact algorithm may still exist), meaning that the worst-case running time increases exponentially as the number of cities increases.
This is where genetic algorithms come into play. Similar to other heuristic algorithms, genetic algorithms try to get close to the optimal solution by improving a candidate solution iteratively, with no guarantee that an optimal solution will actually be found.
This iterative approach has the problem that the algorithm can easily get "stuck" in a local extreme (while trying to improve a solution), not knowing that there is a potentially better solution somewhere further away:
The figure shows that, in order to get to the actual optimal solution (the global minimum), an algorithm currently examining the solution around the local minimum needs to "jump over" a large maximum in the search space. A genetic algorithm will rapidly locate such local optima, but it will usually fail to "sacrifice" this short-term gain to reach a potentially better solution.
So, a summary would be:
exhaustive search
examines the entire search space (long time)
finds global extremes
heuristic (e.g. genetic algorithms)
examines a part of the search space (short time)
finds local extremes
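To make the "stuck in a local extreme" point concrete, here is a small toy sketch (my own example, not from the answer above) comparing an exhaustive scan of a discretized search space with a greedy hill climber on a multi-modal function:

```python
import numpy as np

def f(x):
    # multi-modal test function: global minimum near x = -0.5,
    # with many shallower local minima elsewhere
    return x**2 + 10 * np.sin(3 * x)

def hill_climb(x, step=0.05, iters=1000):
    # greedy local search: only accepts moves that immediately improve f
    for _ in range(iters):
        for cand in (x - step, x + step):
            if f(cand) < f(x):
                x = cand
    return x

# exhaustive search over a fine grid finds the global extreme (slow in general)
grid = np.linspace(-10, 10, 100_001)
print("exhaustive:", grid[np.argmin(f(grid))])

# a hill climber started far away gets trapped in a local minimum
print("hill climber from x=6:", hill_climb(6.0))
```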
Genetic algorithms are not good at fine-tuning toward a local optimum. If you want to find a global optimum, you should at least be able to approach, or have a strategy for approaching, a local optimum. Recently some improvements have been developed to better find the local optima.
"GENETIC ALGORITHM FOR INFORMATIVE BASIS FUNCTION SELECTION
FROM THE WAVELET PACKET DECOMPOSITION WITH APPLICATION TO
CORROSION IDENTIFICATION USING ACOUSTIC EMISSION"
http://gbiomed.kuleuven.be/english/research/50000666/50000669/50488669/neuro_research/neuro_research_mvanhulle/comp_pdf/Chemometrics.pdf
In general, "search space" means, what type of answers are you looking for. For example, if you are writing a genetic algorithm which builds bridges, tests them out, and then builds more, the answers you are looking for are bridge models (in some form). As another example, if you're trying to find a function which agrees with a set of sample inputs on some number of points, you might try to find a polynomial which has this property. In this instance your search space might be polynomials. You might make this simpler by putting a bound on the number of terms, maximum degree of the polynomial, etc... So you could specify that you wanted to search for polynomials with integer exponents in the range [-4, 4]. In genetic algorithms, the search space is the set of possible solutions you could generate. In genetic algorithms you need to carefully limit your search space so you avoid answers which are completely dumb. At my former university, a physics student wrote a program which was a GA to calculate the best configuration of atoms in a molecule to have low energy properties: they found a great solution having almost no energy. Unfortunately, their solution put all the atoms at the exact center of the molecule, which is physically impossible :-). GAs really hone in on good solutions to your fitness functions, so it's important to choose your search space so that it doesn't produce solutions with good fitness but are in reality "impossible answers."
As for the "extreme" of a function. This is simply the point at which the function takes its maximum value. With respect to genetic algorithms, you want the best solution to the problem you're trying to solve. If you're building a bridge, you're looking for the best bridge. In this scenario, you have a fitness function that can tell you "this bridge can take 80 pounds of weight" and "that bridge can take 120 pounds of weight" then you look around for solutions which have higher fitness values than others. Some functions have simple extremes: you can find the extreme of a polynomial using simple high school calculus. Other functions don't have a simple way to calculate their extremes. Notably, highly nonlinear functions have extremes which might be difficult to find. Genetic algorithms excel at finding these solutions using a clever search technique which looks around for high points and then finds others. It's worth noting that there are other algorithms that do this as well, hill climbers in particular. The things that make GAs different is that if you find a local maximum, other types of algorithms can get "stuck," blinded by a locally good solution, so that they never see a possibly much better solution farther away in the search space. There are other ways to adapt hill climbers to this as well, simulated annealing, for one.
The range space usually requires some intuitive understanding of the problem you're trying to solve-- some expertise in the domain of the problem. There's really no guaranteed method to pick the range.
The extremes are just the minimum and maximum values of the function.
So for instance, if you're coding up a GA just for practice, to find the minimum of, say, f(x) = x^2, you know pretty well that your range should be +/- something because you already know that you're going to find the answer at x=0. But then of course, you wouldn't use a GA for that because you already have the answer, and even if you didn't, you could use calculus to find it.
One of the tricks in genetic algorithms is to take some real-world problem (often an engineering or scientific problem) and translate it, so to speak, into some mathematical function that can be minimized or maximized. But if you're doing that, you probably already have some basic notion where the solutions might lie, so it's not as hopeless as it sounds.
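For the f(x) = x^2 toy case mentioned above, a complete (if deliberately naive) GA fits in a few lines. This sketch is my own illustration: it uses a real-valued gene and arithmetic crossover rather than a binary encoding, purely to keep it short.

```python
import random

def fitness(x):
    # higher fitness means smaller x**2
    return -(x ** 2)

def evolve(pop_size=50, generations=100, mutation_sigma=0.5):
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a + b) / 2                         # arithmetic "crossover"
            child += random.gauss(0, mutation_sigma)    # Gaussian mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())   # should land close to 0
```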
The term "search space" does not restrict to genetic algorithms. I actually just means the set of solutions to your optimization problem. An "extremum" is one solution that minimizes or maximizes the target function with respect to the search space.
Search space, simply put, is the space of all possible solutions. If you're looking for a shortest tour, the search space consists of all possible tours that can be formed. However, beware that it's not necessarily the space of all feasible solutions! It only depends on your encoding. If your encoding is e.g. a permutation, then the search space is that of permutations, which is n! (factorial) in size. If you're looking to minimize a certain function of real-valued inputs, the search space is the hypercube bounded by the ranges of those inputs. It's basically infinite, but of course limited by the precision of the computer.
If you're interested in genetic algorithms, maybe you're interested in experimenting with our software. We're using it to teach heuristic optimization in classes. It's GUI-driven and Windows-based, so you can start right away. We have included a number of problems such as real-valued test functions, traveling salesman, vehicle routing, etc. This allows you to, for example, look at how the best solution of a certain TSP improves over the generations. It also exposes the problem of parameterizing metaheuristics and lets you find better parameters that will solve the problems more effectively. You can get it at http://dev.heuristiclab.com.

What algorithms have high time complexity, to help "burn" more CPU cycles?

I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O, which takes (lots of) seconds, kills the demo.
So, I am looking for some kind of a calculation with higher complexity than n^3, the more the better (but preferably easy to program and to explain/understand) to make the computation part more dominant in the time budget, where the dataset is preferably bound to about 16KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
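A sketch of the n-body idea (my own minimal version, assuming NumPy): the per-step state is small, but each RK4 step does O(n^2) floating-point work in the pairwise accelerations.

```python
import numpy as np

def accelerations(pos, masses, G=1.0, eps=1e-3):
    # pairwise gravitational accelerations: O(n^2) floating-point work
    n = len(masses)
    acc = np.zeros_like(pos)
    for i in range(n):
        diff = pos - pos[i]                                   # vectors to every other body
        dist3 = (np.sum(diff**2, axis=1) + eps**2) ** 1.5     # softened |r|^3
        dist3[i] = np.inf                                     # no self-interaction
        acc[i] = G * np.sum(masses[:, None] * diff / dist3[:, None], axis=0)
    return acc

def rk4_step(pos, vel, masses, dt):
    # classic 4th-order Runge-Kutta on the coupled position/velocity system
    k1v = accelerations(pos, masses)
    k1x = vel
    k2v = accelerations(pos + 0.5 * dt * k1x, masses)
    k2x = vel + 0.5 * dt * k1v
    k3v = accelerations(pos + 0.5 * dt * k2x, masses)
    k3x = vel + 0.5 * dt * k2v
    k4v = accelerations(pos + dt * k3x, masses)
    k4x = vel + dt * k3v
    vel = vel + dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
    pos = pos + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
    return pos, vel

rng = np.random.default_rng(0)
n = 256                                   # small working set, O(n^2) work per step
pos = rng.standard_normal((n, 3))
vel = 0.1 * rng.standard_normal((n, 3))
masses = rng.uniform(0.5, 1.5, n)
for _ in range(100):
    pos, vel = rk4_step(pos, vel, masses, dt=1e-3)
```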
What you can do is:
Declare a std::vector of int
Populate it with 0 to N-1 (sorted ascending)
Keep calling std::next_permutation repeatedly until the values are sorted again, i.e. next_permutation returns false.
With N integers this will need O(N!) calculations and is also deterministic.
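A Python analogue of the same idea (using itertools.permutations rather than std::next_permutation; the details here are just for illustration):

```python
from itertools import permutations

def burn(N):
    # walk through all N! orderings of 0..N-1, doing trivial work for each;
    # deterministic, O(N!) in time and O(N) in space
    total = 0
    for perm in permutations(range(N)):
        total += perm[0]
    return total

print(burn(10))   # 10! = 3,628,800 iterations; raising N by 1 multiplies the work by N+1
```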
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
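A rough sketch of the repeated-squaring formulation (my own illustration, assuming NumPy): each dense matrix-matrix multiply is O(n^3) floating-point work, and the matrix size controls the working set.

```python
import numpy as np

def pagerank_by_squaring(A, damping=0.85, iters=20):
    # A is a column-stochastic link matrix; repeatedly square the "Google matrix"
    n = A.shape[0]
    G = damping * A + (1 - damping) / n * np.ones((n, n))
    for _ in range(iters):
        G = G @ G                    # O(n^3) work per squaring
        G /= G.sum(axis=0)           # renormalize columns against FP drift
    return G[:, 0]                   # every column converges to the PageRank vector

n = 512
rng = np.random.default_rng(1)
A = rng.random((n, n))
A /= A.sum(axis=0)                   # make it column-stochastic
print(pagerank_by_squaring(A)[:5])
```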
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
Well, one way to go would be to implement a brute-force solver for the Traveling Salesman Problem in some M-dimensional space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should be neither too hard nor too costly to parallelize across the cores, though it does take some overhead. (I can demonstrate, if needed.)
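A sketch of such a brute-force solver (illustrative only; the point count is kept tiny so it finishes, but the N! growth is easy to see):

```python
from itertools import permutations
import math
import random

def tour_length(order, points):
    # each consecutive pair costs M subtractions, M multiplications and one sqrt
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

def brute_force_tsp(points):
    best, best_len = None, float('inf')
    for perm in permutations(range(len(points))):      # N! candidate tours, no pruning
        length = tour_length(perm, points)
        if length < best_len:
            best, best_len = perm, length
    return best, best_len

M, N = 3, 9                                            # 9! = 362,880 tours
points = [tuple(random.random() for _ in range(M)) for _ in range(N)]
print(brute_force_tsp(points))
```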
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia sets and Lyapunov fractals come to mind as potential candidates, but any should do.
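A minimal, unoptimized Mandelbrot sketch of the idea (pure Python loops on purpose, so the per-cell iteration cutoff is the knob that burns cycles; every grid cell is independent and could go to a different core):

```python
import numpy as np

def mandelbrot(width=256, height=256, max_iter=1000):
    # one independent, purely floating-point iteration per grid cell
    xs = np.linspace(-2.0, 1.0, width)
    ys = np.linspace(-1.5, 1.5, height)
    counts = np.zeros((height, width), dtype=np.int32)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            c = complex(x, y)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:   # "cut off" after max_iter steps
                z = z * z + c
                n += 1
            counts[i, j] = n
    return counts

img = mandelbrot()   # raise max_iter (or the grid size) to make it arbitrarily heavy
```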

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2 if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data lives in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbour search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbour, but rather a good approximation of it (for example, the 4th NN to your query, while you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You can read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to solving the nearest neighbour problem in high dimensions. The key idea is that points which lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighbouring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
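Tying the "reduce the number of dimensions, then search" advice above to concrete code, here is a rough sketch (my own example, assuming scikit-learn is available; the component count and k are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((16_000, 75))        # ~16,000 points, 75 dimensions

# reduce dimensionality first, then run an exact (brute-force) kNN query
X_red = PCA(n_components=20).fit_transform(X)
nn = NearestNeighbors(n_neighbors=3, algorithm='brute').fit(X_red)
dist, idx = nn.kneighbors(X_red)             # column 0 is each point itself,
print(idx[:5, 1:])                           # columns 1..2 are its 2 nearest neighbours
```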
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbours. That's n squared for the distance calculation and n log(n) for each sort, but you have to do the sort n times (once for every observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the Nearest Neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect sorting, for example numpy.argpartition in the Python NumPy library, to select only the closest K values you are interested in. There is no need to sort the entire array.
Grembo's answer above can be cut down significantly, as you only need the K nearest values and there is no need to sort all of the distances from each point.
If you just need the K neighbours, this method will work very well, reducing your computational cost and time complexity.
If you need the K neighbours sorted, sort the output again.
See the documentation for argpartition.
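A small NumPy sketch of that argpartition approach (my own illustration; the point count is kept modest so the full distance matrix fits in memory):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 75))
K = 2

# squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(d2, np.inf)                      # exclude each point itself

# indices of the K smallest distances per row, without sorting the whole row
knn = np.argpartition(d2, K, axis=1)[:, :K]

# if you need those K neighbours ordered, sort only the K columns you kept
rows = np.arange(len(X))[:, None]
knn_sorted = knn[rows, np.argsort(d2[rows, knn], axis=1)]
print(knn_sorted[:5])
```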

Using a smoother with the L Method to determine the number of K-Means clusters

Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
For the evaluation metric, I am using the reciprocal of a simplified version of the Dunn Index, simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is so that the index works in the correct direction (i.e. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
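For reference, one possible shape of such a metric (this is my own guess at a simplification, not necessarily the one used above): take each cluster's diameter as twice the largest member-to-centroid distance, take separation as the centroid-to-centroid distance, and return the reciprocal so that lower is better.

```python
import numpy as np

def reciprocal_simplified_dunn(X, labels):
    # simplified diameters: twice the largest distance from a member to its centroid
    # simplified separations: distances between cluster centroids
    clusters = [X[labels == c] for c in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]
    diameters = [2 * np.linalg.norm(c - m, axis=1).max()
                 for c, m in zip(clusters, centroids)]
    separations = [np.linalg.norm(a - b)
                   for i, a in enumerate(centroids) for b in centroids[i + 1:]]
    # reciprocal of the (simplified) Dunn index: max diameter / min separation
    return max(diameters) / min(separations)
```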
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d (measured, in the figure shown there, from the straight line joining the curve's endpoints).
Note: I haven't read the paper you linked to yet.
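A small sketch of that maximum-distance knee test (my interpretation of the figure: the distance of each point on the evaluation curve to the chord joining its endpoints):

```python
import numpy as np

def knee_by_max_distance(x, y):
    # the knee is the point farthest from the straight line joining the endpoints
    x, y = np.asarray(x, float), np.asarray(y, float)
    p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p2 - p1) / np.linalg.norm(p2 - p1)
    vecs = np.column_stack([x, y]) - p1
    proj = np.outer(vecs @ chord, chord)          # projection onto the chord
    dist = np.linalg.norm(vecs - proj, axis=1)    # perpendicular distance
    return int(np.argmax(dist))

# example: an "elbow" curve of an evaluation metric vs. number of clusters
ks = np.arange(1, 21)
metric = 10.0 / ks + 0.05 * ks
print("knee at k =", ks[knee_by_max_distance(ks, metric)])
```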
