Change of density during iterations in NSGA-II - genetic-algorithm

I am working with NSGA II and I am wondering, how the density of the population changes while the algorithm is running. Suppose you initialize your population according to a uniform distribution. Your population is changed after each iteration. How does this afflict the density of the population? E.g. If the sample size is huge will the population after n steps still be roughly uniform distributed? Does anybody know an answer to this?

Candidate solutions will close after a while. It's normal. Uniform distributon necessary just for first population. If you hesitate to get stuck on a local optimum, you can raise mutation operator probability. Mutation operator provides to jump another area.

Related

Computational Complexity of Finding Area Under Discrete Curve

I apologize if my questions are extremely misguided or loosely scoped. Math is not my strongest subject. For context, I am trying to figure out the computational complexity of calculating the area under a discrete curve. In the particular use case that I am interested in, the y-axis is the length of a queue and the x-axis is time. The curve will always have the following bounds: it begins at zero, it is composed of multiple timestamped samples that are greater than zero, and it eventually shrinks to zero. My initial research has yielded two potential mathematical approaches to this problem. The first is a Reimann sum over domain [a, b] where a is initially zero and b eventually becomes zero (not sure if my understanding is completely correct there). I think the mathematical representation of this the formula found here:
https://en.wikipedia.org/wiki/Riemann_sum#Connection_with_integration.
The second is a discrete convolution. However, I am unable to tell the difference between, and applicability of, a discrete convolution and a Reimann sum over domain [a, b] where a is initially zero and b eventually becomes zero.
My questions are:
Is there are difference between the two?
Which approach is most applicable/efficient for what I am trying to figure out?
Is it even appropriate ask the computation complexity of either mathematical approach? If so, what are the complexities of each in this particular application?
Edit:
For added context, there will be a function calculating average queue length by taking the sum of the area under two separate curves and dividing it by the total time interval spanning those two curves. The particular application can be seen on page 168 of this paper: https://www.cse.wustl.edu/~jain/cv/raj_jain_paper4_decbit.pdf
Is there are difference between the two?
A discrete convolution requires two functions. If the first one corresponds to the discrete curve, what is the second one?
Which approach is most applicable/efficient for what I am trying to figure out?
A Riemann sum is an approximation of an integral. It's typically used to approximate the area under a continuous curve. You can of course use it on a discrete curve, but it's not an approximation anymore, and I'm not sure you can call it a "Riemann" sum.
Is it even appropriate ask the computation complexity of either mathematical approach? If so, what are the complexities of each in this particular application?
In any case, the complexity of computing the area under a dicrete curve is linear in the number of samples, and it's pretty straightforward to find why: you need to do something with each sample, once or twice.
What you probably want looks like a Riemann sum with the trapezoidal rule. Pick the first two samples, calculate their average, and multiply that by the distance between two samples. Repeat for every adjacent pair and sum it all.
So, this is for the router feedback filter in the referenced paper...
That algorithm is specifically designed so that you can implement it without storing a lot of samples and timestamps.
It works by accumulating total queue_length * time during each cycle.
At the start of each "cycle", record the current queue length and current clock time and set the current cycle's total to 0. (The paper defines the cycle so that the queue length is 0 at the start, but that's not important here)
every time the queue length changes, get the new current clock time and add (new_clock_time - previous_clock_time) * previous_queue_length to the total. Also do this at the end of the cycle. Then, record new new current queue length and current clock time.
When you need to calculate the current "average queue length", it's just (previous_cycle_total + current_cycle_total + (current_clock_time - previous_clock_time)*previous_queue_length) / total_time_since_previous_cycle_start

Which behaviour should I expect of the extended KALMAN FILTER covariance?

I have an extended kalman filter setting, which converges to the reference value pretty well. However, if i check the covariance matrix I cannot see a fast trend to a certain value of the diagonal elements (3x3), which is actually what I expected.
How should the covariance matrix behave? Can somebody give me a hint on that?
The diagonal elements will increase in the prediction/propagation step by the process noise that is applied there. In the measurement/correction step, the diagonal elements should go down. If this is not the case, either your measurement noise is too large or the process noise noise is too large. In that case the noise added in the prediction step increases the covariance more than the measurements reduce them. At some point, there will be a balance between the two.
Of course your model or implementation could also be wrong....

KMeans evaluation metric not converging. Is this normal behavior or no?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the calinski-harabaz score but some score metrics will be monotone increasing/decreasing with respect to increasing K. For instance the mean squared error for linear regression will always decrease each time a new feature is added to the model so other scores that add penalties for increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
SUMMARY
The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering upon the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed by analyzing the ratio of between-cluster dispersion vs. within-cluster dispersion, the former/numerator you want to maximize, the latter/denominator you want to minimize. As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.

maxmin clustering algorithm

I read a a paper that mention max min clustering algorithm, but i don't really quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful result. does anybody know what this algorithm mean? this is an excerpt of the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
link to paper is here
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
def maxminclustering(observations, k):
observations = set(observations)
if k < 1 or not observations: return set()
centroids = set([observations.pop()])
for i in range(min(k - 1, len(observations))):
newcentroid = max(observations,
key=lambda observation:
min(distance(observation, centroid)
for centroid in centroids))
observations.remove(newcentroid)
centroids.add(newcentroid)
return centroids
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthes object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.

RANSAC variation: does an inlier membership probability distribution makes sense?

I'm using RANSAC to fit a geometric model to a point cloud with outliers. I know, because of the generation process of the point cloud, that 99.9% of the inlier distances to my model are distributed following a gaussian probability density function with known μ and σ, in the interval [−3σ,−3σ].
The first question is whether do you think that it is reasonable to evaluate the total number of inliers for a certain model adding the inlier membership probability instead of adding 1 for each inlier. That is, the traditional RANSAC assumes that everything that is in an interval delimited by a threshold is an inlier; I would like to know if I can bend that, giving to some inliers more weight than others, following a probability distribution for this purpose.
In case this is reasonable, the second question is, how do you think it affects the number of samples N:
1−(1−(1−e)^s)^N=p
being e the probability that a point is an outlier, s the number of points used in a sample, N the number of samples (RANSAC iterations), p the desired probability that we get a good sample.
If none of that is reasonable, how do you suggest I may introduce my prior information of the inlier distribution?
Thanks in advance,
Federico

Resources