Genetic algorithm not producing better results

I'm running a genetic algorithm to train a set of Hunters to learn to capture as many Elephants as possible. Basically, there are ~20 Hunters that move around in a 2D grid environment, and 4 Hunters must surround an Elephant to capture it (there are ~20 Elephants in the world). Each time this simulation runs, all Hunters and Elephants are placed in random starting locations. Each Hunter's movement is controlled by its chromosome; the Elephants' movements are random.
All of the Hunters share the same chromosome (the team is homogeneous), so I run this simulation once each time I assign a fitness to a chromosome.
My fitness function simply rewards the chromosome for the total number of Elephants captured in the simulation:
double fitness = totalElephantsCaptured() * 100;
The results of the generations' fitnesses are basically random. The most fit individual in each generation does not become more fit as generations progress, and the total fitness of each generation does not increase. I sense that my fitness function is too primitive, but I'm not sure how to change it to yield better results. It does not seem like the Hunters are learning anything.
Details on my GA (Numbers in a range because I've tried various values):
Population size: 64 or 128.
Generation cap: 500 - 1000.
Elitism: 5% - 10% of population
Mutation: 1% - 4%
Chromosome size: 180
Selection: Roulette wheel.
Crossover: Simple crossover in middle of chromosome.
The chromosome is designed in 'blocks' of movement rules based on sensor data:
Each block has a sensor criteria (e.g. "if nearest Elephant is X distance away") and a movement rule (e.g. "move towards nearest Elephant for Y amount of time"). Hunters can either move towards nearest Elephant or towards nearest teammate Hunter.
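For illustration only, a chromosome like the one described might be represented as a list of rule blocks, something like this (the field names and value ranges are my own assumptions, not the actual implementation):

from dataclasses import dataclass
import random

@dataclass
class RuleBlock:
    elephant_dist_threshold: float  # sensor criterion: fire if the nearest Elephant is closer than this
    move_target: int                # 0 = move towards nearest Elephant, 1 = move towards nearest teammate
    duration: int                   # how many time steps to keep following this rule

def random_chromosome(n_blocks=20):
    return [RuleBlock(random.uniform(1, 30), random.randint(0, 1), random.randint(1, 10))
            for _ in range(n_blocks)]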
Does anyone have any suggestions to make the fitness of the Hunters gradually increase as generations increase?

Related

Algorithm to detect a sawtooth-like time series

What you see here is a graph of acceleration along the vertical (head-to-toe) axis of a person walking.
I want to implement a reliable method to recognise this pattern of motion and count the number of steps.
As we can immediately notice, each step corresponds to a spike and a dip around the mean, roughly the 10-10.5 m/s^2 line.
Earlier I planned on a threshold-detection-based mechanism, but that yielded very poor results because there are some variables:
If the person walks slower or faster, the graph stretches or compresses along the time axis.
If a person steps lighter or harder, the spikes and dips are smaller or larger respectively.
However, in all of these cases the pattern is still the same: a spike and a dip at almost regular intervals.
What is the best reasonable algorithm to detect this pattern with reasonable accuracy and computing time?
Never mind, I figured it out; it was actually very simple. All I had to do was decide on a noise threshold and a base (zero) level, then run a peak detector on it.
The abstract procedure is as follows:
The base level is calculated in real time as the average of the last 30 samples.
Values more than the noise threshold above the base level were considered positive spikes.
Values more than the noise threshold below the base level were considered negative spikes.
A pair of consecutive positive and negative spikes detected within a short interval of about ~500 ms is counted as a step.
With proper tuning the accuracy is ~98%, and it counts the number of steps taken very reliably.
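A minimal sketch of that procedure in Python (my own reconstruction; the 30-sample window, the noise threshold idea, and the ~500 ms pairing window come from the description above, while the sample period and threshold value are assumptions):

def count_steps(samples, sample_period_ms=20, noise_threshold=0.8, window=30, pair_ms=500):
    steps = 0
    last_pos_spike = None  # index of the most recent positive spike
    for i, value in enumerate(samples):
        # base level = running average of the last `window` samples
        recent = samples[max(0, i - window):i] or [value]
        base = sum(recent) / len(recent)
        if value > base + noise_threshold:
            last_pos_spike = i
        elif value < base - noise_threshold and last_pos_spike is not None:
            # a negative spike shortly after a positive spike counts as one step
            if (i - last_pos_spike) * sample_period_ms <= pair_ms:
                steps += 1
            last_pos_spike = None
    return steps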

Why do you need fitness scaling in Genetic Algorithms?

In the book "Genetic Algorithms" by David E. Goldberg, the author mentions fitness scaling in Genetic Algorithms.
My understanding of this function is to constrain the strongest candidates so that they don't flood the pool for reproduction.
Why would you want to constrain the best candidates? In my mind having as many of the best candidates as early as possible would help get to the optimal solution as fast as possible.
What if your early best candidates later on turn out to be evolutionary dead ends? Say, your early fittest candidates are big, strong agents that dominate smaller, weaker candidates. If all the weaker ones are eliminated, you're stuck with large beasts that maybe have a weakness to an aspect of the environment that hasn't been encountered yet that the weak ones can handle: think dinosaurs vs tiny mammals after an asteroid impact. Or, in a more deterministic setting that is more likely the case in a GA, the weaker candidates may be one or a small amount of evolutionary steps away from exploring a whole new fruitful part of the fitness landscape: imagine the weak small critters evolving flight, opening up a whole new world of possibilities that the big beasts most likely will never touch.
The underlying problem is that your early strongest candidates may actually be in or around a local maximum in fitness space, that may be difficult to come out of. It could be that the weaker candidates are actually closer to the global maximum.
In any case, by pruning your population aggressively, you reduce the genetic diversity of your population, which in general reduces the search space you are covering and limits how fast you can search this space. For instance, maybe your best candidates are relatively close to the global best solution, but just inbreeding that group may not move it much closer to it, and you may have to wait for enough random positive mutations to happen. However, perhaps one of the weak candidates that you wanted to cut out has some gene that on its own doesn't help much, but when crossed with the genes from your strong candidates it may cause a big evolutionary jump! Imagine, say, a human crossed with spider DNA.
sgvd's answer makes valid points but I would like to elaborate more.
First of all, we need to define what fitness scaling actually means. If it means just multiplying the fitnesses by some factor then this does not change the relationships in the population - if the best individual had 10 times higher fitness than the worst one, after such multiplication this is still true (unless you multiply by zero which makes no real sense). So, a much more sensible fitness scaling is an affine transformation of the fitness values:
scaled(f) = a * f + b
i.e. the values are multiplied by some number and offset by another number, up or down.
Fitness scaling makes sense only with certain types of selection strategies, namely those where the selection probability is proportional to the fitness of the individuals¹.
Fitness scaling plays, in fact, two roles. The first one is merely practical - if you want a probability to be proportional to the fitness, you need the fitness to be positive. So, if your raw fitness value can be negative (but is limited from below), you can adjust it so you can compute probabilities out of it. Example: if your fitness gives values from the range [-10, 10], you can just add 10 to the values to get all positive values.
The second role is, as you and sgvd already mentioned, to limit the capability of the strongest solutions to overwhelm the weaker ones. The best illustration would be with an example.
Suppose that your raw fitness function gives values in the range [0, 100]. If you left it this way, the worst individuals would have zero probability of being selected, and the best ones would have up to 100x higher probability than the worst ones (excluding the really worst ones). However, let's set the scaling factors to a = 1/2, b = 50. Then the range is transformed to [50, 100], and right away two things happen:
Even the worst individuals have non-zero probability of being selected.
The best individuals are now only 2x more likely to be selected than the worst ones.
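For concreteness, here is what that worked example looks like in code (the raw fitness values below are made up):

def scaled(f, a=0.5, b=50):
    return a * f + b

raw = [0, 10, 60, 100]            # raw fitness values in [0, 100]
adj = [scaled(f) for f in raw]    # -> [50.0, 55.0, 80.0, 100.0]
total = sum(adj)
probs = [f / total for f in adj]  # roulette-wheel selection probabilities
print(probs)                      # even the worst individual now has a non-zero chance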
Exploration vs. exploitation
By setting the scaling factors you can control whether the algorithm does more exploration or more exploitation. The more "compressed"² the values are after scaling, the more exploration will be done (because the likelihood of the best individuals being selected compared to the worst ones is decreased). And vice versa: the more "expanded"² the values are, the more exploitation will be done (because the likelihood of the best individuals being selected compared to the worst ones is increased).
Other selection strategies
As I have already written at the beginning, fitness scaling only makes sense with selection strategies which derive the selection probability proportionally from the fitness values. There are, however, other selection strategies that do not work like this.
Ranking selection
Ranking selection is identical to roulette wheel selection but the numbers the probabilities are derived from are not the raw fitness values. Instead, the whole population is sorted by the raw fitness values and the rank (i.e. the position in the sorted list) is the number you derive the selection probability from.
This totally erases the discrepancy when there are one or two "big" individuals and a lot of "small" ones. They will just be ranked.
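As a small sketch of the idea (linear ranking, with selection probability proportional to rank; the weighting scheme here is my own choice):

def rank_probabilities(fitnesses):
    # sort indices from worst to best; rank 1 = worst, rank N = best
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = {idx: rank + 1 for rank, idx in enumerate(order)}
    total = sum(ranks.values())
    return [ranks[i] / total for i in range(len(fitnesses))]

print(rank_probabilities([1.0, 1000.0, 2.0]))  # the huge outlier is just "rank 3", not 500x more likely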
Tournament selection
In this type of selection you don't even need to know the absolute fitness values at all, you just need to be able to compare two of them and tell which one is better. To select one individual using tournament selection, you randomly pick a number of individuals from the population (this number is a parameter) and you pick the best one of them. You repeat that as long as you have selected enough individuals.
Here you can also control the exploration vs. exploitation thing by the size of the tournament - the larger the tournament is the higher is the chance that the best individuals will take part in the tournaments.
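For illustration, a minimal tournament selection might look like this (the tournament size k is the knob mentioned above):

import random

def tournament_select(population, fitness, k=3):
    # pick k individuals at random and keep the best of them
    contestants = random.sample(population, k)
    return max(contestants, key=fitness)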
¹ An example of such a selection strategy is the classical roulette wheel selection. In this selection strategy, each individual has its own section of the roulette wheel which is proportional in size to that particular individual's fitness.
² Assuming the raw values are positive, the scaled values get compressed as a goes down to zero and as b goes up. Expansion works the other way around.

Nearest neighbors in high-dimensional data?

I asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and, before I proceed further, because I am not from the domain of Machine Learning nor Math, I am beginning to ask myself some fundamental questions:
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, I was suggested to use kd-Trees but the Wikipedia page clearly says that for high-dimensions, kd-Tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest-neighbors in a million point dataset efficiently?
Can someone please clarify some (or all) of the above questions?
I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
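To give a feel for the idea, here is a toy random-hyperplane LSH for cosine similarity (note this is not the p-stable scheme from [1]; the number of bits and the data are arbitrary assumptions):

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 21, 12
hyperplanes = rng.normal(size=(n_bits, dim))  # one random hyperplane per hash bit

def lsh_key(x):
    # each bit records which side of a hyperplane the point falls on
    return tuple((hyperplanes @ x > 0).astype(int))

data = rng.normal(size=(10_000, dim))
table = defaultdict(list)
for i, x in enumerate(data):
    table[lsh_key(x)].append(i)

query = rng.normal(size=dim)
candidates = table[lsh_key(query)]  # only these points need exact distance checks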
[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.
I. The Distance Metric
First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:
the underlying statistical distribution of your data;
the relationship among the features that comprise your data (are they independent, i.e., what does the covariance matrix look like); and
the coordinate space from which your data was obtained.
If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.
The Euclidean metric is used in mega-scale Web recommendation engines as well as in current academic research. Distances calculated with the Euclidean metric have intuitive meaning and the computation scales--i.e., Euclidean distance is calculated the same way whether the two points are in two-dimensional or in twenty-two-dimensional space.
It has only failed for me a few times, and in each of those cases Euclidean distance failed because the underlying (Cartesian) coordinate system was a poor choice. You'll usually recognize this because, for instance, path lengths (distances) are no longer additive--e.g., when the metric space is a chessboard, Manhattan distance is better than Euclidean; likewise, when the metric space is the Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is 2.5 hours, Vienna to St. Petersburg is another 3 hrs, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours; instead, it is a little over 3 hrs).
But apart from those cases in which your data belongs in a non-Cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student comparing several distance metrics by examining their effect on a kNN classifier--chi-square gives the best results, but the differences are not large. A more comprehensive study is in the academic paper Comparative Study of Distance Functions for Nearest Neighbors--Mahalanobis (essentially Euclidean, normalized to account for covariance between dimensions) was the best in that study.)
One important proviso: for distance metric calculations to be meaningful, you must re-scale your data--rarely is it possible to build a kNN model that generates accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your explanatory variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution by bodyfat % will be almost negligible. Put another way, if instead the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 66.1 would become 66,100, which would have a large effect on your results--which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd are calculated separately for each column, or feature, in the data set; X refers to an individual entry/cell within a data row):
X_new = (X_old - mu) / sigma
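As a small sketch of that rescaling with numpy (the first row is the toy data point from above, the second is made up):

import numpy as np

X = np.array([[180.4, 66.1, 11.3, 71],
              [165.0, 80.2, 22.5, 58]])
X_new = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column (feature) separately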
II. The Data Structure
If you are concerned about the performance of the kd-tree structure, a Voronoi tessellation is a conceptually simple container that will drastically improve performance and scales better than kd-trees.
This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, are well-documented (see e.g. this Microsoft Research report). The practical significance of this is that, provided you are using a 'mainstream' language (e.g., in the TIOBE Index) then you ought to find a library to perform VT. I know in Python and R, there are multiple options for each language (e.g., the voronoi package for R available on CRAN)
Using a VT for kNN works like this:
From your data, randomly select w points--these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to each center. Imagine assigning a different color to each of the Voronoi centers, so that each point assigned to a given center is painted that color. As long as you have sufficient density, doing this will nicely show the boundaries of each Voronoi cell (as the boundary that separates two colors).
How do you select the Voronoi centers? I use two orthogonal guidelines. After randomly selecting the w points, calculate the VT for your training data. Next, check the number of data points assigned to each Voronoi center--these values should be about the same (given uniform point density across your data space); in two dimensions, this would produce a VT with tiles of the same size. That's the first rule; here's the second: select w by iteration--run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).
So imagine you have one million data points. If the points were persisted in an ordinary 2D data structure, or in a kd-tree, you would perform on average a couple million distance calculations for each new data point whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a VT, the nearest-neighbor search is performed in two steps, one after the other, against two different populations of data--first against the Voronoi centers, then, once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations). Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tessellate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far less, on average just 125 + 2,000.
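A rough sketch of that two-step look-up (random centers, plain numpy, brute force inside each cell; not a real Voronoi library, and note the result is only approximate for query points near cell boundaries):

import numpy as np

rng = np.random.default_rng(1)
data = rng.random((100_000, 21))  # toy stand-in for the 1M points
w = 250
centers = data[rng.choice(len(data), size=w, replace=False)]

# assign every point to its nearest center, one center at a time to keep memory low
cell_of = np.zeros(len(data), dtype=int)
best = np.full(len(data), np.inf)
for j, c in enumerate(centers):
    d = ((data - c) ** 2).sum(axis=1)
    closer = d < best
    cell_of[closer], best[closer] = j, d[closer]

def nearest_neighbor(query):
    # step 1: ~250 distance calculations against the Voronoi centers
    j = ((centers - query) ** 2).sum(axis=1).argmin()
    # step 2: brute force only inside that center's cell (~400 points on average here)
    members = np.where(cell_of == j)[0]
    return members[((data[members] - query) ** 2).sum(axis=1).argmin()]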
III. Calculating the Result (the predicted response variable)
There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.
W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.
The second component is how to weight the contribution of each of the neighbors (assuming n > 1).
The simplest weighting technique is just multiplying each neighbor by a weighting coefficient of 1/(dist * K), i.e., the inverse of the distance from that neighbor to the test instance, often multiplied by some empirically derived constant K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.
A much better weighting function, which substantially avoids this limitation, is the Gaussian function, which in Python looks like this:
import math

def weight_gauss(dist, sig=2.0):
    return math.e ** (-dist**2 / (2 * sig**2))
To calculate a predicted value using your kNN code, you would identify the n nearest neighbors to the data point whose response variable you wish to predict (the 'test instance'), then call the weight_gauss function once for each of the n neighbors, passing in the distance between each neighbor and the test point. This function returns the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted-average calculation.
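Putting that together, a rough sketch of the whole prediction step (brute-force neighbor search; the helper name and data layout are my own assumptions, and weight_gauss is repeated so the snippet is self-contained):

import math

def weight_gauss(dist, sig=2.0):
    return math.e ** (-dist**2 / (2 * sig**2))

def knn_predict(train_X, train_y, test_x, n=3):
    # distance from the test instance to every training point
    dists = [math.dist(x, test_x) for x in train_X]
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:n]
    weights = [weight_gauss(dists[i]) for i in nearest]
    # weighted average of the neighbors' response values
    return sum(w * train_y[i] for w, i in zip(weights, nearest)) / sum(weights)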
What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions and possibly find a linear transformation which would allow you to use less than 21 with approximately the same result quality.
Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland and I think Matlab code for it is publicly available for download. PCA is a more widely used technique and I believe you should be able to find its R or other software implementation. PCA is performed by solving linear equations iteratively. I've done it too long ago to remember how. = )
The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
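A minimal sketch of that idea with numpy (PCA via eigendecomposition of the covariance matrix; the 'keep 95% of the variance' cutoff is my own choice, and the data is random filler):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 21))  # stand-in for your 21-dimensional data
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))  # eigenvalues come out in ascending order
order = eigvals.argsort()[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# keep just enough components to explain 95% of the variance
k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.95)) + 1
X_reduced = X @ eigvecs[:, :k]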
Top answers are good but old, so I'd like to add a 2016 answer.
As said, in a high-dimensional space the curse of dimensionality lurks around the corner, making traditional approaches, such as the popular k-d tree, as slow as a brute-force approach. As a result, we turn our interest to Approximate Nearest Neighbor Search (ANNS), which sacrifices some accuracy to speed up the process. You get a good approximation of the exact NN, with good probability.
Hot topics that might be worth looking into:
Modern approaches of LSH, such as Razenshteyn's.
RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
or in a more recent approach I was part of, kd-GeRaF.
LOPQ, which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko/Lempitsky approach.
You can also check my relevant answers:
Two sets of high dimensional points: Find the nearest neighbour in the other set
Comparison of the runtime of Nearest Neighbor queries on different data structures
PCL kd-tree implementation extremely slow
To answer your questions one by one:
No, Euclidean distance is a bad metric in high-dimensional space. In high dimensions, data points are all far apart from each other, which shrinks the relative difference between a given data point's distances to its nearest and farthest neighbours.
There is a lot of research on high-dimensional data, but most of it requires a lot of mathematical sophistication.
A KD tree is bad for high-dimensional data ... avoid it at all costs.
Here is a nice paper to get you started in the right direction: "When Is 'Nearest Neighbor' Meaningful?" by Beyer et al.
I work with text data of dimensions 20K and above. If you want some text related advice, I might be able to help you out.
Cosine similarity is a common way to compare high-dimensional vectors. Note that since it's a similarity, not a distance, you'd want to maximize it, not minimize it. You can also use a domain-specific way to compare the data; for example, if your data was DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.
The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.
kd-trees indeed won't work very well on high-dimensional data, because the pruning step no longer helps much: the distance to the closest edge--a one-dimensional deviation--will almost always be smaller than the full-dimensional distance to the known nearest neighbors.
But furthermore, as far as I know kd-trees only work well with Lp norms, and there is the distance concentration effect that makes distance-based algorithms degrade with increasing dimensionality.
For further information, you may want to read up on the curse of dimensionality, and the various variants of it (there is more than one side to it!)
I'm not convinced there is a lot of use in just blindly approximating Euclidean nearest neighbors, e.g., using LSH or random projections. It may be necessary to use a much more fine-tuned distance function in the first place!
A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.
I think cosine on tf-idf of boolean features would work well for most problems. That's because it's a time-proven heuristic used in many search engines such as Lucene. Euclidean distance in my experience shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.
iDistance is probably the best for exact knn retrieval in high-dimensional data. You can view it as an approximate Voronoi tessellation.
I've experienced the same problem and can say the following.
Euclidean distance is a good distance metric; however, it's computationally more expensive than the Manhattan distance, and sometimes yields slightly poorer results, thus I'd choose the latter.
The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value.
Both Euclidean and Manhattan distances respect the Triangle inequality, thus you can use them in metric trees. Indeed, KD-trees have their performance severely degraded when the data have more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.
KD-trees work fine for 21 dimensions if you quit early, after looking at say 5% of all the points. FLANN does this (and other speedups) to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric, and the fast and solid scipy.spatial.cKDTree does only Lp metrics; these may or may not be adequate for your data.) There is of course a speed-accuracy tradeoff here. (If you could describe your Ndata, Nquery, and data distribution, that might help people to try similar data.)
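For reference, a small stock cKDTree example with the approximation knob eps (the data here is random filler; the percentage cutoff used in the timings below is a custom modification, not a stock scipy feature):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((100_000, 21))
tree = cKDTree(data, leafsize=10)

queries = rng.random((1000, 21))
# eps > 0 allows approximate answers: returned neighbors are within (1 + eps) of the true distance
dist, idx = tree.query(queries, k=2, eps=0.5, p=2)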
Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.253
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.245
You could try a z-order curve. It's easy for 3 dimensions.
I had a similar question a while back. For fast Approximate Nearest Neighbor Search you can use the annoy library from spotify: https://github.com/spotify/annoy
This is some example code for the Python API, which is optimized in C++.
from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10)  # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors
They provide different distance measurements. Which distance measurement you want to apply depends highly on your individual problem. Also consider prescaling (meaning weighting) certain dimensions for importance first. Those dimension or feature importance weights might be calculated by something like entropy loss or, if you have a supervised learning problem, Gini impurity gain or mean average loss, where you check how much worse your machine learning model performs if you scramble that dimension's values.
Often the direction of the vector is more important than its absolute value. For example, in the semantic analysis of text documents, we want document vectors to be close when their semantics are similar, not when their lengths are. Thus we can either normalize those vectors to unit length or use angular distance (i.e. cosine similarity) as a distance measurement.
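For instance, normalizing to unit length makes Euclidean distance and cosine similarity rank neighbors identically (a small numpy sketch; the vectors are random filler):

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

a, b = np.random.rand(300), np.random.rand(300)
cosine_similarity = normalize(a) @ normalize(b)  # depends on direction only, not length
angular_distance = np.arccos(np.clip(cosine_similarity, -1.0, 1.0))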
Hope this is helpful.
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
I would suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are calculated to find the most relevant dimensions. You can use these weights when using euclidean distance, for example. See curse of dimensionality for common problems and also this article can enlighten you somehow:
A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets

What is Crossover Probability & Mutation Probability in Genetic Algorithm or Genetic Programming?

What are crossover probability and mutation probability in a Genetic Algorithm or Genetic Programming? Could someone explain them from an implementation perspective?
Mutation probability (or ratio) is basically a measure of the likelihood that random elements of your chromosome will be flipped into something else. For example, if your chromosome is encoded as a binary string of length 100 and you have a 1% mutation probability, it means that 1 out of your 100 bits (on average), picked at random, will be flipped.
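In code, per-bit mutation might look something like this (a sketch, not any particular library's API):

import random

def mutate(chromosome, mutation_prob=0.01):
    # flip each bit independently with probability mutation_prob
    return [bit ^ 1 if random.random() < mutation_prob else bit for bit in chromosome]

child = mutate([random.randint(0, 1) for _ in range(100)])  # ~1 bit flipped on average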
Crossover basically simulates sexual genetic recombination (as in human reproduction) and there are a number of ways it is usually implemented in GAs. Sometimes crossover is applied with moderation in GAs (as it breaks symmetry, which is not always good, and you could also go blind) so we talk about crossover probability to indicate a ratio of how many couples will be picked for mating (they are usually picked by following selection criteria - but that's another story).
This is the short story - if you want the long one you'll have to make an effort and follow the link Amber posted. Or do some googling - which last time I checked was still a good option too :)
According to Goldberg (Genetic Algorithms in Search, Optimization and Machine Learning) the probability of crossover is the probability that crossover will occur at a particular mating; that is, not all matings must reproduce by crossover, but one could choose Pc=1.0.
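In implementation terms that usually reads something like this (a sketch; with Pc = 1.0 every mating goes through crossover):

import random

def mate(parent_a, parent_b, crossover_prob=0.7):
    if random.random() < crossover_prob:
        point = random.randint(1, len(parent_a) - 1)  # single-point crossover
        return parent_a[:point] + parent_b[point:]
    return parent_a[:]  # no crossover: the offspring is a copy of a parent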
The probability of mutation is as JohnIdol describes.
It shows the proportion of features that are inherited from the parents in crossover!
Note: if crossover probability is 100%, then all offspring are made by crossover. If it is 0%, the whole new generation is made from exact copies of chromosomes from the old population (but this does not mean that the new generation is the same!).
Here is a short explanation of these two probabilities:
http://www.optiwater.com/optiga/ga.html
JohnIdol's answer on mutation probability is exactly what the website says:
"Each bit in each chromosome is checked for possible mutation by generating a random number between zero and one and if this number is less than or equal to the given mutation probability e.g. 0.001 then the bit value is changed."
For crossover probability, maybe it is the ratio of the next generation's population born by the crossover operation, while the rest of the population... maybe comes from the previous selection, or you can define them as the best-fit survivors.

Channel allocation algorithm

We have a set of radio nodes in close proximity to each other and would like to allocate the frequencies for them to minimize overlap. To get complete coverage of the area, radio channels need to be oversubscribed and so we will have nearby radios transmitting on the same frequency.
Sample data:
5 Frequencies
343 Radios
4158 Edges
My current best guess is to randomly generate a population of frequency allocations and to swap frequencies between radios until the best score does not improve for 10 generations. Score is the sum of 1/range^2 for radios on the same frequency.
Each edge is the distance between the radios, corrected for walls and floors. Edges above 2* the max radio range have been culled from the list.
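For concreteness, the scoring described above might be computed like this (a sketch; edges is assumed to be a list of (radio1, radio2, distance) tuples and assignment maps each radio to one of the 5 frequencies):

def interference_score(assignment, edges):
    # sum of 1/d^2 over every pair of radios sharing a frequency; lower is better
    return sum(1.0 / (d * d)
               for r1, r2, d in edges
               if assignment[r1] == assignment[r2])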
Is there a better way?
This is basically a graph-coloring problem with a twist. Rather than all proper colorings being equally good, some proper colorings are better than others, as defined by your scoring algorithm.
I think your genetic approach is practical and will yield good (if not provably optimal) solutions, but I would definitely suggest looking at some graph-coloring papers and seeing how applicable they are. It is very likely that you will get some great ideas for deciding how your algorithm should consider the available choices.
I agree that a simulation based on random initial assignment followed by some optimization is a good approach, but you're describing an optimization procedure which does not seem optimal, if I understand correctly (you're planning to swap frequencies at random). At each optimization step you could pick a "reasonable" improvement by taking one radio from each frequency group and considering the 5*4/2 = 10 possible swaps of frequencies between two of them, and either choose the best, or (say) one of those with a positive delta score, with probabilities proportional to the deltas in the scores.
In the spirit of "simulated annealing", once the overall score seems to have more or less stabilized, you may want to switch for a small number of steps to "high temperature" (high randomness) where you just pick the set of 5 radios and swap them all e.g. with a circular permutation of frequency assignments -- do that a few times then go to the "cooling down" part again with the procedure in the above paragraph (which tries to get a cheap simulation of a maximum-gradient descent;-).
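A rough sketch of one such improvement step (my own reconstruction of the "one radio per frequency, try the 10 swaps" idea; score_fn is assumed to be something like the interference_score helper sketched under the question, with lower scores being better):

import random
from itertools import combinations

def improvement_step(assignment, edges, frequencies, score_fn):
    # pick one radio from each frequency group
    by_freq = {f: [r for r, rf in assignment.items() if rf == f] for f in frequencies}
    picks = {f: random.choice(radios) for f, radios in by_freq.items() if radios}
    base = score_fn(assignment, edges)
    best_delta, best_pair = 0.0, None
    # consider the 5*4/2 = 10 possible frequency swaps between the picked radios
    for fa, fb in combinations(picks, 2):
        trial = dict(assignment)
        trial[picks[fa]], trial[picks[fb]] = fb, fa
        delta = base - score_fn(trial, edges)  # positive delta = less interference
        if delta > best_delta:
            best_delta, best_pair = delta, (fa, fb)
    if best_pair:
        fa, fb = best_pair
        assignment[picks[fa]], assignment[picks[fb]] = fb, fa
    return assignment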
My quick stab at it would be to use a thin plate spline (or possibly a similar, cleverer linear algebra technique) to fit a plane to the function of frequency density. The average 'altitude' of each plane (per frequency) would then tell you whether a frequency is overused (i.e. when it's higher than the others); the slope would be an indication of the spatial distribution.
