What's the ideal population size and number of iterations for Genetic Algorithm - genetic-algorithm

I am using a GA to evaluate a continuous function for a vector with approximately 40,000 variables. Currently I am using a population size of 200 where every member of the population has 40,000 variables. I am using 50 iterations.
With these numbers, the GA is not getting me really close to the optimum solution. I was wondering if there is a way to determine the best population size and number of iterations for a vector of huge size (40,000 variables).

Yes, it's called trial and error. I suggest starting with a much larger population size and seeing how close you get, then decreasing it repeatedly until you find the point where it starts giving unacceptable results.
You should also check that the population size is actually the problem. There might be a bug in your algorithm such that, no matter the population size or number of iterations, you still won't get a good solution.

I have answered a similar question here. Basically, you have a very large number of variables and an extremely small number of generations. I would look into parallelising your algorithm, increasing your population size, and increasing the number of iterations (see the sketch below).
Also, @Peter Lawrey makes good suggestions.
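Since the question is really about tuning these knobs, here is a hedged sketch of a plain mutation-only evolutionary loop where the population size and generation count are explicit parameters and fitness evaluation is parallelised with multiprocessing. The sphere objective and all sizes are illustrative placeholders, not recommendations for the 40,000-variable problem.

import numpy as np
from multiprocessing import Pool

N_VARS = 1_000          # stand-in; the real problem has ~40,000
POP_SIZE = 200          # first knob to increase when results stall
N_GENERATIONS = 100     # second knob; 50 is very small for a large search space

def fitness(x):
    # Illustrative objective: maximise the negative sphere function.
    return -float(np.sum(x ** 2))

def evolve():
    rng = np.random.default_rng(0)
    pop = rng.uniform(-1.0, 1.0, size=(POP_SIZE, N_VARS))
    with Pool() as pool:
        for _ in range(N_GENERATIONS):
            scores = np.array(pool.map(fitness, pop))
            elite = pop[np.argsort(scores)[-POP_SIZE // 2:]]                  # keep the best half
            children = elite + rng.normal(scale=0.01, size=elite.shape)       # Gaussian mutation
            pop = np.vstack([elite, children])
    return max(pop, key=fitness)

if __name__ == "__main__":
    print(fitness(evolve()))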

Related

lightgbm how to deal with No further splits with positive gain, best gain: -inf

how to deal with [Warning] No further splits with positive gain, best gain: -inf
Are any of my parameters unsuitable?
Some explanation from LightGBM's issues:
It means the learning of the tree in the current iteration should stop, because it cannot split any more.
I think this is caused by "min_data_in_leaf": 1000; you can set it to a smaller value.
This is not a bug, it is a feature.
The output message is to warn the user that your parameters may be wrong, or your dataset is not easy to learn.
link: https://github.com/Microsoft/LightGBM/issues/640
So on the contrary, the data is hard to fit.
This means that no improvement can be gained by adding additional leaves to the tree subject to the restrictions of the hyperparameters. It is not necessarily a bad thing, as limiting the depth of the tree can prevent overfitting. However, if the tree is underfitting the data, try tweaking these hyperparameters (a combined example follows the list):
decrease min_data_in_leaf - minimum number of data points in a leaf
decrease min_sum_hessian_in_leaf - Minimum sum of the Hessian (second derivative of the objective function evaluated for each observation) for observations in a leaf. For some regression objectives, this is just the minimum number of records that have to fall into each node. For classification objectives, it represents a sum over a distribution of probabilities. It works like min_child_weight in xgboost.
increase max_bin or max_bin_by_feature when creating dataset
LightGBM training buckets continuous features into discrete bins to improve training speed and reduce memory requirements for training. This binning is done one time during Dataset construction. Increasing the number of bins per feature can increase the number of splits that can be made.
max_bin controls the maximum number of bins that features will bucketed into. It is also possible to set this maximum feature-by-feature, by passing max_bin_by_feature.
set 'verbosity': -1 in params, it works!
Increase max_depth or set it to -1.
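Putting those suggestions together, here is a hedged sketch with the LightGBM Python API; all values are illustrative starting points, not recommendations:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=5000)

# max_bin is a Dataset-construction parameter: more bins -> more candidate splits.
train_set = lgb.Dataset(X, y, params={"max_bin": 511})

params = {
    "objective": "regression",
    "min_data_in_leaf": 20,            # much smaller than the 1000 that triggered the warning
    "min_sum_hessian_in_leaf": 1e-3,
    "max_depth": -1,                   # no depth limit
    "verbosity": -1,                   # also silences the warning itself
}
booster = lgb.train(params, train_set, num_boost_round=100)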

KMeans evaluation metric not converging. Is this normal behavior or no?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the Calinski-Harabasz score, but some score metrics will be monotone increasing or decreasing with respect to increasing K. For instance, the mean squared error for linear regression always decreases each time a new feature is added to the model, which is why other scores that penalise the number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
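As a rough sketch of that "plot K against the score and look for the knee" approach, using scikit-learn on synthetic data (the real datasets and K range will of course differ):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))          # placeholder for one of the ~125 datasets

for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))
# Pick the K after which the score stops improving 'much'.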
SUMMARY
The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering upon the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
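A minimal sketch of that per-cluster sum-of-squares metric, assuming scikit-learn's KMeans is used for the clustering (names and sizes are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

score = 0.0
for label, centroid in enumerate(km.cluster_centers_):
    members = X[km.labels_ == label]
    n = len(members)
    if n > 1:
        score += np.sum((members - centroid) ** 2) / (n - 1)   # squared distances / (n-1)
score /= km.n_clusters                                          # average over all clusters
print(score)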
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed by analyzing the ratio of between-cluster dispersion vs. within-cluster dispersion, the former/numerator you want to maximize, the latter/denominator you want to minimize. As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean (average) of a large list. This list could contain trillions or quadrillions of numbers. Each individual number is of manageable magnitude: in the hundreds, thousands, or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of numbers?
How should I deal with a list of such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach is to recursively split the list into two halves and calculate the mean of the sublists' means. The recursion terminates at a list of size 1, whose mean is simply its only element. If you encounter a list of odd size, make either the first or the second sublist longer; this is pretty much arbitrary and doesn't even have to be consistent.
If, however, your list is so giant that its size can't be computed, there's no way to split it into two sublists. In that case, the recursive approach works pretty much the other way around: instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their means immediately). So you calculate the mean of elements 1 and 2, and that becomes your new element 1; the mean of elements 3 and 4 is your new element 2, and so on. Then apply the same algorithm to the new list until only one element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one; if you add one, try to pick a value as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
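A tiny sketch of that pairwise "mean of means" reduction (exact for power-of-two lengths; padding with the last element is one simple, approximate way to handle odd sizes):

def approx_mean(values):
    values = list(values)
    while len(values) > 1:
        if len(values) % 2 == 1:
            values.append(values[-1])     # pad; one simple (approximate) choice
        values = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return values[0]

print(approx_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))   # 4.5, exact here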
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. If could mean a list or array data structure in memory, or the "list" could be the result of a database query, the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N of this size, that could take an impractically long time.
An alternative is to split into sublists of equal size, calculate their averages, and then take the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed (a count-weighted sketch follows after these points).
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)
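For the "average of partition averages" point above, here is a small sketch that combines partial means weighted by their counts, which keeps the running values small and limits overflow (chunk size and data source are illustrative):

from itertools import islice

def chunked(iterable, size):
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def mean_of_stream(stream, chunk_size=1_000_000):
    total_count = 0
    running_mean = 0.0
    for chunk in chunked(stream, chunk_size):
        c = len(chunk)
        partial_mean = sum(chunk) / c
        total_count += c
        # Exact count-weighted combination of partial means.
        running_mean += (partial_mean - running_mean) * c / total_count
    return running_mean

print(mean_of_stream(range(10_000_000)))   # 4999999.5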

"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?
I'd like to calculate the basic statistics:
mean: arithmetic average
variance: average of squared deviations from the mean
standard deviation: square root of the variance
median: value that separates larger half of the numbers from the smaller half
mode: most frequent value found in the set
skewness: tl; dr
kurtosis: tl; dr
The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.
My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.
The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.
I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:
Initialize three variables: count, sum, and sum_of_squares
For each value:
Increment count.
Add the value to sum.
Add the square of the value to sum_of_squares.
Divide sum by count, storing as the variable mean.
Divide sum_of_squares by count, storing as the variable mean_of_squares.
Square mean, storing as square_of_mean.
Subtract square_of_mean from mean_of_squares, storing as variance.
Output mean and variance.
This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.
But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).
Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
I would love to see an incremental mode estimator of a similar form...
UPDATE (2011-09-19)
I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
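A small Python sketch of that quantile update (eta, p, and the sample stream are illustrative; convergence needs enough samples relative to eta):

import numpy as np

def sgn(x):
    return (x > 0) - (x < 0)

def incremental_quantile(samples, p=0.5, eta=0.01):
    q = 0.0
    for x in samples:
        q += eta * (sgn(x - q) + 2.0 * p - 1.0)
    return q

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200_000)
print(incremental_quantile(data, p=0.5))   # close to the true median, ~10.0
print(incremental_quantile(data, p=0.9))   # close to the 90th percentile, ~12.6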
UPDATE (2021-11-19)
For further details about the median estimator described here, I'd like to highlight this paper linked in the comments below: Bylander & Rosen, 1997, A Perceptron-Like Online Algorithm for Tracking the Median. Here is a postscript version from the author's website.
Skewness and Kurtosis
For the on-line algorithms for skewness and kurtosis (along the lines of the variance), see, on the same wiki page, the parallel algorithms for higher-moment statistics.
Median
Median is tough without sorted data. If you know how many data points you have, in theory you only have to partially sort, e.g. by using a selection algorithm. However, that doesn't help too much with billions of values. I would suggest using frequency counts; see the next section.
Median and Mode with Frequency Counts
If it is integers, I would count frequencies, probably cutting off the highest and lowest values beyond some point where I am sure they are no longer relevant. For floats (or too many distinct integers), I would probably create buckets/intervals and then use the same approach as for integers. (Approximate) mode and median calculation then becomes easy, based on the frequency table.
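A rough sketch of that bucketed frequency-count approach for an approximate median and mode (the value range and bucket count are assumptions you would set per dataset):

import numpy as np

def bucket_median_mode(stream, lo, hi, n_buckets=10_000):
    counts = np.zeros(n_buckets, dtype=np.int64)
    width = (hi - lo) / n_buckets
    total = 0
    for x in stream:
        idx = min(n_buckets - 1, max(0, int((x - lo) / width)))   # clamp out-of-range values
        counts[idx] += 1
        total += 1
    mode = lo + (np.argmax(counts) + 0.5) * width                  # centre of the fullest bucket
    median_idx = int(np.searchsorted(np.cumsum(counts), total / 2))
    median = lo + (median_idx + 0.5) * width                       # bucket holding the middle value
    return median, mode

rng = np.random.default_rng(0)
print(bucket_median_mode(rng.normal(5.0, 1.0, 1_000_000), lo=0.0, hi=10.0))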
Normally Distributed Random Variables
If it is normally distributed, I would use the sample mean, variance, skewness, and kurtosis of a small subset as maximum likelihood estimators. The (on-line) algorithms to calculate those you already know. E.g., read in a couple of hundred thousand or a million data points until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100,000 values). The same approach can also be used for estimating mode and median in the normal case (for both, the sample mean is an estimator).
Further comments
All the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if this helps.
I have always assumed (with the exception of the section on the normal distribution) that we talk about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.
In general, sampling the data (i.e. only looking at a sub-set) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distributions) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) for the Cauchy Distribution do not exist. In this case, the sample mean of a "small" sub-set might be massively off from the sample mean of the whole sample.
I implemented the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic that you mention except for mode. I have not yet found a satisfactory solution for mode estimation.
Ryan, I'm afraid you are not doing the mean and variance right... This came up a few weeks ago here. One of the strong points of the online version (which actually goes by the name of Welford's method) is that it is especially accurate and stable; see the discussion here. Another strong point is that you do not need to store the total sum or total sum of squares...
I can't think of any on-line approach to the mode and median, which seem to require considering the whole list at once. But it may very well be that a similar approach to the one for the variance and mean will also work for the skewness and kurtosis...
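For completeness, a minimal sketch of Welford's method mentioned above (the standard textbook formulation, not the questioner's code):

def welford(stream):
    n = 0
    mean = 0.0
    m2 = 0.0                         # sum of squared deviations from the running mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else float("nan")   # population variance; use (n - 1) for sample
    return mean, variance

print(welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))   # (5.0, 4.0)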
The Wikipedia article quoted in the question contains the formulas for calculating skewness and kurtosis on-line.
For the mode, I believe there is no way of doing this on-line. Why? Assume that all values in your input are different except the last one, which duplicates a previous value. In this case you have to remember all values already seen in the input to detect that the last value duplicates a value seen before, making it the most frequent one.
For the median it is almost the same: up to the last input you don't know which value will become the median if all input values are different, because it could lie before or after the current median. If you know the length of the input, you can find the median without storing all values in memory, but you will still have to store many of them (I would guess around half), because a bad input sequence could shift the median heavily in the second half, possibly making any value from the first half the median.
(Note that I am referring to exact calculation only.)
If you have billions of data points, then it's not likely that you need exact answers, as opposed to close answers. Generally, if you have billions of data points the underlying process which generates them will likely obey some kind of statistical stationarity / ergodicity / mixing property. Also it may matter whether you expect the distributions to be reasonably continuous or not.
In these circumstances, there exist algorithms for on-line, low memory, estimation of quantiles (the median is a special case of 0.5 quantile), as well as modes, if you don't need exact answers. This is an active field of statistics.
quantile estimation example: http://www.computer.org/portal/web/csdl/doi/10.1109/WSC.2006.323014
mode estimation example: Bickel DR. Robust estimators of the mode and skewness of continuous data. Computational Statistics and Data Analysis. 2002;39:153–163. doi: 10.1016/S0167-9473(01)00057-3.
These are active fields of computational statistics. You are getting into the fields where there isn't any single best exact algorithm, but a diversity of them (statistical estimators, in truth), which have different properties, assumptions and performance. It's experimental mathematics. There are probably hundreds to thousands of papers on the subject.
The final question is whether you really need skewness and kurtosis by themselves, or more likely some other parameters which may be more reliable at characterizing the probability distribution (assuming you have a probability distribution!). Are you expecting a Gaussian?
Do you have ways of cleaning/preprocessing the data to make it mostly Gaussianish? (for instance, financial transaction amounts are often somewhat Gaussian after taking logarithms). Do you expect finite standard deviations? Do you expect fat tails? Are the quantities you care about in the tails or in the bulk?
Everyone keeps saying that you can't compute the mode in an online manner, but that is simply not true. Here is an article describing an algorithm for exactly this problem, invented in 1982 by Michael J. Fischer and Steven L. Salzberg of Yale University. From the article:
The majority-finding algorithm uses one of its registers for temporary
storage of a single item from the stream; this item is the current
candidate for majority element. The second register is a counter
initialized to 0. For each element of the stream, we ask the algorithm
to perform the following routine. If the counter reads 0, install the
current stream element as the new majority candidate (displacing any
other element that might already be in the register). Then, if the
current element matches the majority candidate, increment the counter;
otherwise, decrement the counter. At this point in the cycle, if the
part of the stream seen so far has a majority element, that element is
in the candidate register, and the counter holds a value greater than
0. What if there is no majority element? Without making a second pass through the data—which isn't possible in a stream environment—the
algorithm cannot always give an unambiguous answer in this
circumstance. It merely promises to correctly identify the majority
element if there is one.
It can also be extended to find the top N with more memory but this should solve it for the mode.
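A minimal sketch of that single-pass candidate-and-counter routine (essentially the Boyer-Moore majority vote):

def majority_candidate(stream):
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate = item          # install a new majority candidate
            count = 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate                  # correct only if a true majority element exists

print(majority_candidate([3, 1, 3, 2, 3, 3, 1]))   # 3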
Ultimately if you have no a priori parametric knowledge of the distribution I think you have to store all the values.
That said, unless you are dealing with some sort of pathological situation, the remedian (Rousseeuw and Bassett 1990) may well be good enough for your purposes.
Very simply it involves calculating the median of batches of medians.
Median and mode can't be calculated online using only constant space. However, because median and mode are anyway more "descriptive" than "quantitative", you can estimate them, e.g. by sampling the data set.
If the data is normally distributed in the long run, then you could just use your mean to estimate the median.
You can also estimate median using the following technique: establish a median estimation M[i] for every, say, 1,000,000 entries in the data stream so that M[0] is the median of the first one million entries, M[1] the median of the second one million entries etc. Then use the median of M[0]...M[k] as the median estimator. This of course saves space, and you can control how much you want to use space by "tuning" the parameter 1,000,000. This can be also generalized recursively.
I would tend to use buckets, which could be adaptive. The bucket size should be the accuracy you need. Then as each data point comes in you add one to the relevant bucket's count.
These should give you simple approximations to median and kurtosis, by counting each bucket as its value weighted by its count.
The one problem could be loss of resolution in floating point after billions of operations, i.e. adding one does not change the value any more! To get round this, if the maximum bucket size exceeds some limit you could take a large number off all the counts.
OK dude, try these:
For C++:
double skew(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);
    double mu = avg(v, n);
    double* t = new double[n];
    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu) / sigma, 3);
    }
    double ret = avg(t, n);
    delete [] t;
    return ret;
}

double kurt(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);
    double mu = avg(v, n);
    double* t = new double[n];
    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu) / sigma, 4) - 3;   // excess kurtosis
    }
    double ret = avg(t, n);
    delete [] t;
    return ret;
}
where svar (sample variance) and avg (average) are the functions you say you already have; point those at your own implementations.
Also, have a look at Pearson's approximation; on such a large dataset it would be pretty similar:
skewness ≈ 3 (mean − median) / standard deviation
where you can take the midrange, (max + min) / 2, as a cheap stand-in for the median.
For floats, mode has no meaning; one would typically put them in bins of a significant size (like 1/100 × (max − min)).
This problem was solved by Pebay et al:
https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf
Median
Two recent percentile approximation algorithms and their python implementations can be found here:
t-Digests
https://arxiv.org/abs/1902.04023
https://github.com/CamDavidsonPilon/tdigest
DDSketch
https://arxiv.org/abs/1908.10693
https://github.com/DataDog/sketches-py
Both algorithms bucket data. As T-Digest uses smaller bins near the tails, the accuracy is better at the extremes (and weaker close to the median). DDSketch additionally provides relative error guarantees.
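A hedged usage sketch with the tdigest package linked above; the TDigest/update/percentile names are taken from that project's README and should be checked against the installed version:

import numpy as np
from tdigest import TDigest            # assumed API from the project's README

digest = TDigest()
rng = np.random.default_rng(0)
for x in rng.normal(loc=0.0, scale=1.0, size=100_000):
    digest.update(x)

print(digest.percentile(50))   # approximate median, ~0.0
print(digest.percentile(99))   # approximate 99th percentile, ~2.33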

Summarise unlimited sequence using constant storage

Assume we're frequently sampling a particular value and want to keep statistics on the samples. The simplest approach is to store every sample so we can calculate whatever stats we want, but this requires unbounded storage. Using a constant amount of storage, we can keep track of some stats like minimum and maximum values. What else can we track using only constant storage? I am thinking of percentiles, standard deviation, and any other useful statistics.
That's the theoretical question. In my actual situation, the samples are simply millisecond timings: profiling information for a long-running application. There will be millions of samples but not much more than a billion or so. So what stats can be kept for the samples using no more than, say, 10 variables?
Minimum, maximum, average, total count, variance are all easy and useful. That's 5 values.
Usually you'll store sum and not average, and when you need the average you can just divide the sum by the count.
So, in your loop
maxVal=max(x, maxVal);
minVal=min(x, minVal);
count+=1;
sum+=x;
secondorder+=x*x;
later, you may print any of these stats. Mean and standard deviation can be computed at any time and are:
mean=sum/count;
std=sqrt(secondorder/count - mean*mean);
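A compact sketch of that constant-storage summary as a small class (it uses the same naive sum-of-squares variance as the loop above, with the same precision caveat):

import math

class RunningStats:
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.second_order = 0.0
        self.min_val = math.inf
        self.max_val = -math.inf

    def add(self, x):
        self.count += 1
        self.total += x
        self.second_order += x * x
        self.min_val = min(self.min_val, x)
        self.max_val = max(self.max_val, x)

    def mean(self):
        return self.total / self.count

    def std(self):
        return math.sqrt(self.second_order / self.count - self.mean() ** 2)

stats = RunningStats()
for x in [3.0, 1.0, 4.0, 1.0, 5.0]:
    stats.add(x)
print(stats.min_val, stats.max_val, stats.mean(), stats.std())   # 1.0 5.0 2.8 1.6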
Median and percentile estimation are more difficult, but possible. The usual trick is to make a set of histogram bins and fill the bins when a sample is found inside them.
You can then estimate median and such by looking at the distribution of those bin populations.
This is only an approximation to the distribution, but often enough. To find the exact median, you must store all samples.
Well, lots of them. Consider, e.g., σ: the square root of the variance, which can be computed incrementally from the running average and each new data point. At point x_i you have an estimated variance for all the points up to this one; you then fold (x_i − μ)² into it for the next estimate. There's a nice section on that in Wikipedia.
What you're looking for, in general, is called "on-line methods of calculation".

Resources