I am writing a genetic algorithm. My population quickly develops a monoculture. I am using a small population (32 individuals) with a small number of discrete genes (24 genes per individual) and a single point cross-over mating approach. Combine that with a roulette wheel selection strategy and it is easy to see how all the genetic diversity is lost in just a few dozen generations.
What I would like to know is, what is the appropriate response? I do not have academic-level knowledge on GAs and only a few solutions come to mind:
Use a larger population. (slow)
Use runtime checks to prevent in-breeding. (slow)
Use more cross-over points. (not very effective)
Raise the mutation rate.
What are some appropriate responses to the situation?
I would look at a larger population; 32 individuals is a very small population. From experience, I usually run GAs with a population at least on the order of (number of chromosomes)² to get a good starting distribution of individuals.
A possible way to speed things up with a larger population is to spawn different threads (one per individual, possibly in batches) when running your fitness function (usually the most expensive part of a GA).
Assuming a population of 32 and a quad-core system, spawn threads in batches of 8 (two threads per core will interleave nicely) and you should be able to run approximately 4x faster.
So if you have a time limit on how long your GA may run, this may be a solution.
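As a sketch of that batched evaluation (Python; the names are mine, and I use processes rather than threads because CPython threads won't parallelize a CPU-bound fitness function):

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate_population(population, fitness, workers=8):
    """Evaluate the fitness function over the whole population in parallel.

    Processes (rather than threads) sidestep Python's GIL for CPU-bound
    fitness functions; workers=8 mirrors the batches-of-8 suggestion above.
    Note: `fitness` must be a picklable top-level function.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))
```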
You can add to that:
tournament selection instead of roulette wheel (see the sketch after this list)
island-separated multi-population scheme, with migration
restarts
incorporating ideas from estimation of distribution algorithms (EDA) (resampling the domain close to promising areas to introduce new individuals)
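For the first item, tournament selection is only a few lines. A minimal sketch (Python; the fitness callable and the tournament size k are placeholders to tune):

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick the fittest of k randomly chosen individuals.

    Selection pressure depends on rank rather than raw fitness, so a single
    super-fit individual cannot flood the mating pool the way it can with
    roulette-wheel selection; tune k to trade pressure against diversity.
    """
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)
```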
I have a normal assignment problem, where I want to match workers to jobs. But there are several kinds of jobs, each with a set amount of positions. So for example I would need 10,000 builders, 5,000 welders etc. Each worker has of course the same preference for each position of the same kind of job.
My current approach is to use the Hungarian algorithm and to simply extend the matrix columns to account for that; for example, it would have 10,000 builder columns, 5,000 welder columns, etc. Of course, with O(n³) complexity and a matrix that big, getting results may take a while.
Is there any variation of the Hungarian algorithm, or a different algorithm, which exploits the fact that there can be multiple connections to one job node? Or should I rather look into Monte Carlo or genetic search tree algorithms?
edit:
Formal description as Sascha proposed:
Set $W$ for workers, $J$ for jobs, weight function $c : W \times J \to \mathbb{R}$ for the preference, and function $a : J \to \mathbb{N}$ for the amount of positions available per job.
So the function I want to minimize would be:
$$\min \sum_{w \in W} \sum_{j \in J} c(w, j)\, x_{w,j}$$
where $x_{w,j} \in \{0, 1\}$ indicates whether worker $w$ is assigned to job $j$.
Constraints would be:
$$\sum_{j \in J} x_{w,j} = 1 \quad \forall w \in W$$
and
$$\sum_{w \in W} x_{w,j} \le a(j) \quad \forall j \in J$$
As asked by Yay295, it would be OK if it ran for a day or two on a normal consumer machine. There are 50k workers right now, with 10 kinds of jobs and 50k jobs total. So the matrix is 50k x 50k (extended) in the case of the Hungarian algorithm I'm using right now, or 50k x 10 for LP with the additional constraint $\sum_{w \in W} x_{w,j} \le a(j)$, while $\sum_{j \in J} a(j) = 50\text{k}$, and preference values in the matrix would go from 0 to 100.
This is actually called the Transportation Problem. The Transportation Problem is similar to the Assignment Problem in that they both have sources and destinations, but the Transportation Problem has two more values: each source has a supply, and each destination has a demand. The Assignment Problem is a simplification of the Transportation Problem in which the supply of each source and the demand of each destination is 1.
In your case, you have 50,000 sources (your workers) each with a supply of 1 (each worker can only work one job). You also have 10 destinations (the job types) each with some amount of demand (the number of openings for that type).
The Transportation Problem is traditionally solved with the Simplex Algorithm. I couldn't tell you how it works off the top of my head, but there is plenty of information available elsewhere online on how to do it. I would recommend these two videos: first, second.
Alternatively, the Transportation Problem can actually also be solved using the Hungarian Algorithm. The idea is to keep track of your supply and demand separately, and then use the Hungarian Algorithm (or any other algorithm for the Assignment Problem) to solve it as if the supply and demand were 1 (this can be incredibly fast when it's as lopsided as 50,000 sources to 10 destinations as in your case). Once you've solved it once, use the results to decrement the supply and demand of the assigned solution appropriately. Repeat until the sum of either supply or demand is zero.
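A rough sketch of that iterative idea (Python, with SciPy's linear_sum_assignment standing in for a hand-written Hungarian/assignment solver; the function name and the lower-cost-is-better convention are my assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iterative_transport(cost, demand):
    """Transportation problem with unit supplies via repeated assignment.

    cost:   (n_workers, n_job_types) array, lower = more preferred
    demand: open positions per job type; assumed to sum to n_workers
    Returns result[w] = job type assigned to worker w.
    """
    demand = np.asarray(demand).copy()
    result = np.full(cost.shape[0], -1)
    while (result == -1).any():
        unassigned = np.flatnonzero(result == -1)
        open_jobs = np.flatnonzero(demand > 0)
        if open_jobs.size == 0:
            break  # demand exhausted before all workers were placed
        # Solve the reduced (unassigned x open-jobs) assignment problem;
        # each round places at most one worker per remaining job type.
        rows, cols = linear_sum_assignment(cost[np.ix_(unassigned, open_jobs)])
        for r, c in zip(rows, cols):
            result[unassigned[r]] = open_jobs[c]
            demand[open_jobs[c]] -= 1
    return result
```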
However, none of this may be necessary. I wrote my own Assignment Problem solver in C++ a few years ago, and despite using 2.5GB of RAM, it can solve a 50,000 by 50,000 assignment problem in less than 5 seconds. The trick is to write your own. Before I wrote mine I had a look around at what was available online, and they were all pretty bad, often with obvious bugs. If you are going to write your own code for this though, it would be better to use the Simplex Algorithm as described in the videos I linked above. I don't know that one is faster than the other, but the Hungarian Algorithm wasn't made for the Transportation Problem.
PS: The same person who did the two lectures I linked above also did one on the Assignment Problem and the Hungarian Algorithm.
This is a semi-broad question, but it's one that I feel on some level is answerable or at least approachable.
I've spent the last month or so making a fairly extensive simulation. In order to protect the interests of my employer, I won't state specifically what it does, but it can be explained by analogy to a high school dance.
A girl or boy enters the dance floor, and based on the selection of free dance partners, an optimal choice is made. After a period of time, two dancers finish dancing and are now free for a new partnership.
I've been making partner selection algorithms designed to maximize average match outcome while not sacrificing wait time for a partner too much.
I want a way to gauge and compare versions of my algorithms in order to select the optimal algorithm for any situation. This is difficult, however, since the inputs to my simulation are extremely large matrices of input parameters (2-5 per dancer), and the simulation takes several minutes to run (which makes it difficult to test a large number of simulation inputs). I have a few output metrics, but linking them to the large number of inputs is extremely hard. I'm also interested in finding which algorithms fail completely under certain input conditions.
Any pro tips or online resources that might help me define input constraints and output variables that would give clarity on the optimal algorithm?
I might not understand what you exactly want. But here is my suggestion. Let me know if my solution is inaccurate/irrelevant and I will edit/delete accordingly.
Assume you have a certain metric (say, compatibility of the pairs, or waiting time). If you just have the average or total of this metric over all users, it is of limited use. Instead, you might want to find the distribution of this metric over all users; at minimum, you should always keep track of the variance. Once you have the distribution, you can calculate the probability that a particular algorithm A is better than algorithm B for a certain metric.
If you do not have the distribution of the metric within a single experiment, you can always run multiple experiments; the number of experiments you need depends on the variance of the metric and on the difference between the two algorithms.
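For example, given per-run samples of the metric under each algorithm, a quick Monte Carlo estimate of P(A beats B) takes only a few lines (a sketch assuming lower metric values are better, e.g. waiting time; the names are mine):

```python
import numpy as np

def prob_a_beats_b(samples_a, samples_b, n_pairs=100_000, seed=None):
    """Estimate P(metric of A < metric of B) by sampling random pairs.

    Inputs are per-run (or per-user) measurements of the same metric
    under each algorithm; lower is assumed better.
    """
    rng = np.random.default_rng(seed)
    a = rng.choice(samples_a, size=n_pairs, replace=True)
    b = rng.choice(samples_b, size=n_pairs, replace=True)
    return float(np.mean(a < b))
```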
I work on tracking objects through images without prior training. I use two features: the color of the region (the ab channels of the Lab space) and HOG (histogram of oriented gradients). In my initial experiments, I found that using a minimum-distance classifier with the HOG feature alone has the advantage of low false positives (FP) but high false negatives (FN). On the other hand, using the minimum-distance classifier with color alone increases the TP and decreases the FN, but at the price of increasing FP.
My question is how to combine the two classifiers. I'd like to know the standard algorithm for doing that in an unsupervised way.
I tried to combine the two features into one feature (after normalization), but HOG dominates the results. Even when I weight the combined feature, results are worse than with either of the two alone.
The best results I have reached so far come from cascading the two classifiers: run the color classifier first to widen the set of candidates, then run HOG (with a threshold a little higher than the one used with HOG alone). I googled the topic, but I don't have enough background in classification to find the standard methods.
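For what it's worth, that cascade boils down to something like the following sketch (NumPy; the distance arrays, thresholds, and function name are placeholders of mine):

```python
import numpy as np

def cascade_predict(color_dist, hog_dist, color_thresh, hog_thresh):
    """Two-stage cascade of minimum-distance classifiers.

    color_dist, hog_dist: per-candidate distances to the target model.
    Stage 1 (color) is permissive, keeping TP high; stage 2 (HOG) then
    prunes false positives. hog_thresh is set a bit higher than the
    value you would use for HOG alone, as described above.
    """
    passed_color = color_dist <= color_thresh
    return passed_color & (hog_dist <= hog_thresh)
```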
I wrote a very simple distributed computing platform (based on the Map/Reduce paradigm), and I'm in the process of writing some demos and showcases. I have a very small team and have to prioritize which demos I'll write first.
To prioritize, I need to rank the demos, weighting roughly 70% on being a relevant, common, significant use case of distributed computing and 30% on being easy to write.
So far I have it ordered like this:
Discovering pi digits with Monte Carlo
Numerical integration with Monte Carlo
Large matrix multiplication (dense matrices)
Linear regressions
Large matrix inversion
Multiple regressions
Sorting
Clustering (K-Means)
Clustering (Hierarchical)
Number 1 is on the list because it took 10 minutes to write, although it's completely useless (I'm not sure, but I figure there are not a lot of people trying to find more digits of pi).
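For reference, the map/reduce shape of that first demo is tiny. A sequential Python sketch (on the actual platform, each map call would run on its own node; the names are mine):

```python
import random
from functools import reduce

def count_hits(n_samples, seed):
    """Map step: count random points landing inside the unit quarter-circle."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def monte_carlo_pi(n_workers=8, samples_per_worker=1_000_000):
    # Each count_hits call is an independent map task.
    partials = [count_hits(samples_per_worker, seed) for seed in range(n_workers)]
    hits = reduce(lambda a, b: a + b, partials)  # the reduce step
    return 4.0 * hits / (n_workers * samples_per_worker)
```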
Due to the nature of my platform, it will shine most in tasks that are embarrassingly parallel, and not I/O-bound or reduce-dominated.
How would you change my list? What would you add to it? Is sorting useful at all in the enterprise world or is it only for benchmarking distributed computing platforms?
Your list suggests that you are not distinguishing between parallel computing and distributed computing. This is not necessarily wrong, but someone looking for a demonstration of the excellence of a distributed computing platform might be left tepidly enthused upon seeing parallel computations, such as your items 2-5, being performed.
Sorting is certainly useful everywhere there is data: large enterprises, small enterprises, in your desk drawers, across the Googlesphere. So too is searching, which is a surprising omission from your list. The other omission which strikes me immediately is any sort of data fusion, merging large datasets to get information from their intersections beyond what can be extracted from the datasets individually.
I second Mark in that you are mixing distributed computing and HPC. Here are some comments on each of your topics:
(1) There are people trying to compute as many digits of pi as they can, but the Monte Carlo algorithm is completely useless there: its precision scales with the inverse square root of the number of trials, so to get one more decimal digit of precision you would need roughly 100 times more trials. There are other algorithms; see if you can implement some of them using Map/Reduce.
(2) This one is fine, although seldom used - same problem with precision as (1).
(5) Pure matrix inversions are seldom performed, mainly because of numerical instabilities. How about solving a dense system of linear equations instead?
I would say that you are missing one of the main uses of M/R processing nowadays, namely graph processing (read: social and other network/flow analysis). Also, a more general optimisation problem might be nice, e.g. genetic algorithms.
How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers, often taking a multiple of the average time. (The behavior is understandable considering the factors contributing to first-execution penalty.)
My goal is a model that reasonably reflects this and whose parameters can be calculated from aggregate values (as for the Gaussian, where mu and sigma are calculated from N, the sum of values, and the sum of squares). In other words, the number of repetitions is unlimited, but memory and calculation requirements should be minimized.
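Those Gaussian aggregates fit in constant memory; a minimal sketch (Python; feeding in log(x) instead of x would yield the log-normal fit suggested in the answers below):

```python
class StreamingMoments:
    """Keep only N, sum(x) and sum(x^2); derive mu and sigma on demand.

    Note: for very long runs, Welford's online algorithm is numerically
    safer than the raw sum-of-squares used here for brevity.
    """
    def __init__(self):
        self.n, self.s, self.s2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        self.s += x
        self.s2 += x * x

    def mu(self):
        return self.s / self.n

    def sigma(self):
        m = self.mu()
        return max(self.s2 / self.n - m * m, 0.0) ** 0.5
```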
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: sometimes you get bimodal ("two humps") distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating" probability-density calculation that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly). I haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked):
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from its performance in a normal run-time environment, which involves varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This could be solved trivially by accumulating all the data. However, to gather this data in production, the data produced needs to be limited. For the analysis of isolated pieces of code, a Gaussian plus a first-run penalty is OK. That no longer works for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be the sum of two normals having different mean. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average; you deal with its geometric mean, which is the same as its median.
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
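That recipe is short enough to sketch directly (Python; the 12-uniforms trick is the approximation described above):

```python
import math
import random

def qq_points_lognormal(samples):
    """QQ-plot points for a log-normality check, per the recipe above:
    sorted log-samples (Y) against an equal number of sorted draws from
    a standard normal (X). A roughly straight line suggests log-normality.
    """
    logs = sorted(math.log(x) for x in samples)
    normals = sorted(
        sum(random.uniform(-0.5, 0.5) for _ in range(12))  # ~ N(0, 1)
        for _ in samples
    )
    return list(zip(normals, logs))
```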
The problem you describe is called "distribution fitting" and has nothing to do with performance measurements specifically; it is the generic problem of fitting a suitable distribution to any gathered/measured data sample.
The standard process is something like this:
Guess the best distribution.
Estimate its parameters from the data.
Run hypothesis tests to check how well it describes the gathered data.
Repeat 1-3 if the fit is not good enough.
You can find an interesting article describing how this can be done with the open-source R software system here. I think the function fitdistr may be especially useful to you.
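If you'd rather stay in Python than R, the same guess-fit-test loop looks roughly like this with scipy.stats (a sketch; the log-normal guess and the synthetic placeholder data are my assumptions):

```python
import numpy as np
from scipy import stats

# Placeholder data; substitute your collected timings here.
samples = np.random.lognormal(mean=1.0, sigma=0.4, size=10_000)

# Steps 1-2: guess a distribution and fit its parameters by MLE.
shape, loc, scale = stats.lognorm.fit(samples, floc=0)

# Step 3: goodness-of-fit hypothesis test (Kolmogorov-Smirnov).
stat, p_value = stats.kstest(samples, "lognorm", args=(shape, loc, scale))
print(f"KS statistic = {stat:.4f}, p = {p_value:.4f}")  # small p => reject the fit
```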
In addition to the answers already given, consider empirical distributions. I have had success using empirical distributions for the performance analysis of several distributed systems. The idea is very straightforward: you build a histogram of the performance measurements, discretized to a given accuracy. Once you have the histogram, you can do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate a sequence of response times according to the distribution. This one is very useful for performance modeling.
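A minimal sketch of such a histogram-backed empirical distribution (Python; the class name and bin-width parameter are mine):

```python
import random

class EmpiricalDistribution:
    """Histogram of measurements, discretized to a fixed bin width."""
    def __init__(self, bin_width):
        self.bin_width = bin_width
        self.counts = {}
        self.total = 0

    def add(self, x):
        b = int(x / self.bin_width)
        self.counts[b] = self.counts.get(b, 0) + 1
        self.total += 1

    def pdf(self, x):
        """Probability of the bin containing x (bounded by the accuracy)."""
        return self.counts.get(int(x / self.bin_width), 0) / self.total

    def sample(self):
        """Draw a response time according to the observed distribution."""
        r = random.randrange(self.total)
        for b in sorted(self.counts):
            r -= self.counts[b]
            if r < 0:
                return b * self.bin_width
```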
Try the gamma distribution: http://en.wikipedia.org/wiki/Gamma_distribution
From Wikipedia:
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized arrival times in performance modelling is the exponential distribution for inter-arrival times (the count of arrivals in a fixed interval then follows a Poisson distribution).
Not exactly answering your question, but still relevant: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, "The effect of heavy-tailed job size distributions on computer systems design" (1999). She found that the size of jobs submitted to her distributed task assignment system followed a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that jobs should be well load-balanced, had awful consequences for submitters of jobs. She has done good follow-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?