Decreasing non-uniform distribution of a gene - genetic-algorithm

I currently have a gene that can take an integer value in [1, 1500].
I don't like the 1500 upper bound, which acts like a magic number. Each integer is currently selected uniformly, but when exploring I feel that a change from 1 to 2 is not the same as a change from 1001 to 1002 (yet both have the same probability).
Is there a common/known way to:
select a value non-uniformly, with a decreasing probability: say a 'small' value ('5' for instance) has a higher probability of being selected than a 'medium' value ('200' for instance), and a 'medium' value has a higher probability of being selected than a 'high' value ('1000' for instance), with 'higher' needing to be defined somehow...
explore the full integer range without an upper bound, while avoiding crazy high values from the start (maybe something like a bounded range that evolves)

Have you thought about a random distribution, for example a normal distribution?
With a normal distribution you specify the center of your random values as well as the width via the standard deviation. The random numbers generated by such a normal distribution are much more likely to lie near the specified center than in value regions that are 1, 2 or 3 standard deviations away. The distribution of values from a normal distribution looks like this:
[Figure omitted: bell curve of a normal distribution. Source: [3]]
You can read more about normal distributions and standard deviations on Wikipedia [2] [3].
If you don't like this particular probability distribution because you think it is too flat or too steep, there are plenty of other distributions to choose from.
In code, this could be implemented with NumPy like this:
import numpy as np

rng = np.random.default_rng()
# Fold the normal distribution onto the positive axis; values near 0 are the most likely.
# max(1, ...) keeps the result within the gene's lower bound of 1.
gene_value = max(1, int(np.abs(rng.normal(loc=0, scale=500))))
This makes use of the Numpy random normal generator [4]. The center is set to 0 and the standard deviation to 500, making values closer to 0 significantly more likely and values close to 1500 or even higher very unlikely.
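If you also care about the second point of the question (no hard upper bound, with strictly decreasing probability), an exponential distribution is one alternative; a minimal sketch, where the scale of 300 is just an assumption to tune for your problem:

import numpy as np

rng = np.random.default_rng()
# Strictly decreasing probability over an unbounded range:
# small gene values are the most likely, large ones are increasingly rare.
gene_value = 1 + int(rng.exponential(scale=300))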
[2] https://en.wikipedia.org/wiki/Normal_distribution
[3] https://en.wikipedia.org/wiki/Standard_deviation
[4] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.normal.html

Related

Random Normally Distributed Number Generator

I saw an article saying that one could generate a random, normally distributed number by averaging many random numbers (0 to 1).
But how many numbers are "enough"?
If the count is too small, the result will be "inaccurate".
But if the count is too large, the average is almost guaranteed to be the "true average" (0.5 in this case).
So what "threshold" should I use to generate a reliable random normal number?
Much appreciated!!!
A usual number is 12 - just add 12 uniform numbers in [0, 1) and subtract 6 to remove the bias. The reason is that the variance of a single uniform is 1/12, so the variance of the sum of n of them is n/12; by using 12 uniform numbers you get variance 1 and avoid the need for scaling. I first noticed the 12 in https://en.wikipedia.org/wiki/INTERCAL - but assume there are more serious libraries using it.
However, what counts as "enough" depends on your needs and on how good the pseudo-random generator is - and at some point using other formulas will be more efficient.
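A minimal sketch of this in Python (the function name is mine, not from the answer):

import random

def approx_standard_normal():
    # The sum of 12 U(0, 1) draws has mean 6 and variance 12 * (1/12) = 1;
    # subtracting 6 centres it, giving an approximately standard normal value.
    return sum(random.random() for _ in range(12)) - 6.0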

Random number that prefers a range

I'm trying to write a program that requires me to generate a random number.
I also need a variable chance of picking from a particular range.
In this case, I would be generating numbers between 1 and 10, and the range with the variable percent chance is 7-10.
How would I do this? Could I be supplied with a formula or something like that?
So if I'm understanding your question, you want two number ranges, and a variable-defined probability that one range will be selected. This can be described mathematically as a probability density function (PDF), which in this case would also take your "chance" variable as an argument. If, for example, your 7-10 range is more likely than the rest of the 1-10 range, your PDF would assign a higher, constant density over 7-10 than over the rest of 1-10.
One PDF such as a flat distribution can be transformed into another via a transformation function, which would allow you to generate a uniformly random number and transform it to your own density function. See here for the rigorous mathematics:
http://www.stat.cmu.edu/~shyun/probclass16/transformations.pdf
But since you're writing a program and not a mathematics thesis, I suggest that the easiest way is to just generate two random numbers. The first decides which range to use, and the second generates the number within your chosen range. Keep in mind that if your ranges overlap (1-10 and 7-10 obviously do) then the overlapping region will be even more likely, so you will probably want to make your ranges exclusive so you can more easily control the probabilities. You haven't said what language you're using, but here's a simple example in Python:
import random

range_chance = 30  # 30 percent chance of the 7-10 range
if random.uniform(0, 100) < range_chance:
    print(random.uniform(7, 10))
else:
    print(random.uniform(1, 7))  # 1-7 so that there is no overlapping region

Why do we multiply the 'most likely estimate' by 4 in three-point estimation?

I have used three-point estimation for one of my projects.
Formula is
Three Point Estimate = (O + 4M + L ) / 6
That means,
(Best Estimate + 4 x Most Likely Estimate + Worst Case Estimate) divided by 6.
Here, dividing by 6 means we are averaging over six weighted values, and there is less chance of the worst case or the best case happening. In good faith, the most likely estimate (M) is what it will take to get the job done.
But I don't know why they use 4(M). Why did they multiply by 4??? Why not 5, 6, 7, etc.?
Why is the most likely estimate weighted four times as much as the other two values?
There is a derivation here:
http://www.deepfriedbrainproject.com/2010/07/magical-formula-of-pert.html
In case the link goes dead, I'll provide a summary here.
So, taking a step back from the question for a moment, the goal here is to come up with a single mean (average) figure that we can say is the expected figure for any given 3-point estimate. That is to say, if I were to attempt the project X times, and add up all the costs of the project attempts for a total of $Y, then I expect the cost of one attempt to be $Y/X. Note that this number may or may not be the same as the mode (most likely) outcome, depending on the probability distribution.
An expected outcome is useful because we can do things like add up a whole list of expected outcomes to create an expected outcome for the project, even if we calculated each individual expected outcome differently.
A mode on the other hand, is not even necessarily unique per estimate, so that's one reason that it may be less useful than an expected outcome. For example, every number from 1-6 is the "most likely" for a dice roll, but 3.5 is the (only) expected average outcome.
The rationale/research behind a 3 point estimate is that in many (most?) real-world scenarios, these numbers can be more accurately/intuitively estimated by people than a single expected value:
A pessimistic outcome (P)
An optimistic outcome (O)
The most likely outcome (M)
However, to convert these three numbers into an expected value we need a probability distribution that interpolates all the other (potentially infinite) possible outcomes beyond the 3 we produced.
The fact that we're even doing a 3-point estimate presumes that we don't have enough historical data to simply lookup/calculate the expected value for what we're about to do, so we probably don't know what the actual probability distribution for what we're estimating is.
The idea behind the PERT estimates is that if we don't know the actual curve, we can plug some sane defaults into a Beta distribution (which is basically just a curve we can customise into many different shapes) and use those defaults for every problem we might face. Of course, if we know the real distribution, or have reason to believe that default Beta distribution prescribed by PERT is wrong for the problem at hand, we should NOT use the PERT equations for our project.
The Beta distribution has two parameters A and B that set the shape of the left and right hand side of the curve respectively. Conveniently, we can calculate the mode, mean and standard deviation of a Beta distribution simply by knowing the minimum/maximum values of the curve, as well as A and B.
PERT sets A and B to the following for every project/estimate:
If M > (O + P) / 2 then A = 3 + √2 and B = 3 - √2, otherwise the values of A and B are swapped.
Now, it just so happens that if you make that specific assumption about the shape of your Beta distribution, the following formulas are exactly true:
Mean (expected value) = (O + 4M + P) / 6
Standard deviation = (P - O) / 6
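As a quick illustration (the function name and example numbers are mine, not part of the original answer), the two formulas in Python:

def pert_estimate(optimistic, most_likely, pessimistic):
    # PERT mean and standard deviation from a three-point estimate.
    mean = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return mean, std_dev

print(pert_estimate(2, 5, 14))  # -> (6.0, 2.0)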
So, in summary
The PERT formulas are not based on a normal distribution, they are based on a Beta distribution with a very specific shape
If your project's probability distribution matches the PERT Beta distribution then the PERT formula are exactly correct, they are not approximations
It is pretty unlikely that the specific curve chosen for PERT matches any given arbitrary project, and so the PERT formulas will be an approximation in practice
If you don't know anything about the probability distribution of your estimate, you may as well leverage PERT as it's documented, understood by many people and relatively easy to use
If you know something about the probability distribution of your estimate that suggests something about PERT is inappropriate (like the 4x weighting towards the mode), then don't use it, use whatever you think is appropriate instead
The reason why you multiply by 4 to get the Mean (and not 5, 6, 7, etc.) is because the number 4 is tied to the shape of the underlying probability curve
Of course, PERT could have been based off a Beta distribution that yields 5, 6, 7 or any other number when calculating the Mean, or even a normal distribution, or a uniform distribution, or pretty much any other probability curve, but I'd suggest that the question of why they chose the curve they did is out of scope for this answer and possibly quite open ended/subjective anyway
I dug into this once. I cleverly neglected to write down the trail, so this is from memory.
So far as I can make out, the standards documents got it from the textbooks. The textbooks got it from the original 1950s write-up in a statistics journal. The write-up in the journal was based on an internal report done by RAND as part of the overall work to develop PERT for the Polaris program.
And that's where the trail goes cold. Nobody seems to have a firm idea of why they chose that formula. The best guess seems to be that it's based on a rough approximation of a normal distribution -- strictly, it's a triangular distribution. A lumpy bell curve, basically, that assumes that the "likely case" falls within 1 standard deviation of the true mean estimate.
4/6ths approximates 66.7%, which approximates 68%, which approximates the area under a normal distribution within one standard deviation of the mean.
All that being said, there are two problems:
It's essentially made up. There doesn't seem to be a firm basis for picking it. There's some Operational Research literature arguing for alternative distributions. In what universe are estimates normally distributed around the true outcome? I'd very much like to move there.
The accuracy-improving effect of the 3-point / PERT estimation method might be more about the breaking down of tasks into subtasks than from any particular formula. Psychologists studying what they call "the planning fallacy" have found that breaking down tasks -- "unpacking", in their terminology -- consistently improves estimates by making them higher and thus reducing inaccuracy. So perhaps the magic in PERT/3-point is the unpacking, not the formulae.
Isn't it simply a rule-of-thumb number that works well?
The cone of uncertainty uses the factor 4 for the beginning phase of the project.
The book "Software Estimation" by Steve McConnell is based around the "cone of uncertainty" model and gives many "thumb-rules". However every approximated number or a thumb-rule is based on statistics from COCOMO or similar solid researches, models or studies.
Ideally these factors for O, M and L are derived using historical data for other projects in the same company in the same environment. In other words, the company should have 4 projects completed within the M estimate, 1 within O and 1 within L. If my company/team had 1 project completed within the original O estimate, 2 projects within M and 2 within L, I would use another formula - (O + 2M + 2L) / 5. Does that make sense?
The cone of uncertainty was referenced above ... it's a well-known foundational element used in agile estimation practices.
What's the problem with it though? Doesn't it look too symmetrical - as if it's not natural, not really based on real data?
If you ever thought that, then you're right. The cone of uncertainty shown in the picture above is made up based on probabilities ... not actual raw data from real projects (but most of the time it's used as such).
Laurent Bossavit wrote a book and also gave a presentation where he presented his research on how that cone came to be (and other 'facts' we often believe in software engineering):
The Leprechauns of Software Engineering
https://www.amazon.com/Leprechauns-Software-Engineering-Laurent-Bossavit/dp/2954745509/
https://www.youtube.com/watch?v=0AkoddPeuxw
Is there some real data to support a cone of uncertainty? The closest he was able to find was a cone that can go up to 10x in the positive Y direction (that is, we can be off by up to 10 times on our estimate, with the project taking 10 times as long in the end).
Hardly anybody estimates a project that ends up finishing 4 times earlier ... or ... gasp ... 10 times earlier.

How to generate a longer random number from a short random number?

I have a short random number input, let's say an int in 0-999.
I don't know the distribution of the input. Now I want to generate a random number in the range 0-99999 based on the input, without changing the shape of the distribution.
I know there is a way to map the input to [0, 1] by dividing it by 999 and then multiplying by 99999 to get the result. However, this method doesn't cover all the possible values - most numbers in 0-99999 can never be produced.
Assuming your input is some kind of source of randomness...
You can take two consecutive inputs and combine them:
input() + 1000*(input()%100)
Be careful though. This relies on the source having plenty of entropy, so that a given input number isn't always followed by the same subsequent input number. If your source is a PRNG designed to cycle between the numbers 0–999 in some fashion, this technique won't work.
With most production entropy sources (e.g., /dev/urandom), this should work fine. OTOH, with a production entropy source, you could fetch a random number between 0–99999 fairly directly.
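A minimal sketch of this combination in Python, with draw() standing in for your 0-999 source (the names and the use of random.randrange are assumptions for illustration):

import random

def draw():
    # Stand-in for the 0-999 source of randomness described in the question.
    return random.randrange(1000)

def rand_0_99999():
    # The first draw supplies the low three digits, the second draw the high two digits.
    return draw() + 1000 * (draw() % 100)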
You can try something like the following:
(input * 100) + random
where random is a random number between 0 and 99.
The problem is that the input only specifies which block of one hundred to use. For instance, 50 just says you will have a number between 5000 and 5099 (to keep a similar distribution shape). Which number between 5000 and 5099 to pick is up to you.
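A small sketch of this approach, using random.randrange for the extra 0-99 draw (an assumption, since the question does not name a language):

import random

def widen(input_value):
    # input_value in [0, 999] selects a block of one hundred values;
    # a fresh draw in [0, 99] selects the position inside that block.
    return input_value * 100 + random.randrange(100)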

Peak detection of measured signal

We use a data acquisition card to take readings from a device that increases its signal to a peak and then falls back to near the original value. To find the peak value we currently search the array for the highest reading and use the index to determine the timing of the peak value which is used in our calculations.
This works well if the highest value is the peak we are looking for but if the device is not working correctly we can see a second peak which can be higher than the initial peak. We take 10 readings a second from 16 devices over a 90 second period.
My initial thoughts are to cycle through the readings, checking whether the previous and next points are less than the current one to find a peak, and to construct an array of peaks. Maybe we should be looking at an average of a number of points on either side of the current position to allow for noise in the system. Is this the best way to proceed or are there better techniques?
We do use LabVIEW and I have checked the LAVA forums and there are a number of interesting examples. This is part of our test software and we are trying to avoid using too many non-standard VI libraries so I was hoping for feedback on the process/algorithms involved rather than specific code.
There are lots and lots of classic peak detection methods, any of which might work. You'll have to see what, in particular, bounds the quality of your data. Here are basic descriptions:
1. Between any two points in your data, (x(0), y(0)) and (x(n), y(n)), add up |y(i + 1) - y(i)| for 0 <= i < n and call this T ("travel"), and set R ("rise") to y(n) - y(0) + k for suitably small k. T/R > 1 indicates a peak. This works OK if large travel due to noise is unlikely, or if noise distributes symmetrically around a base curve shape. For your application, accept the earliest peak with a score above a given threshold, or analyze the curve of travel-per-rise values for more interesting properties. (A small sketch of this scoring appears after the list.)
2. Use matched filters to score similarity to a standard peak shape (essentially, use a normalized dot product against some shape to get a cosine metric of similarity).
3. Deconvolve against a standard peak shape and check for high values (though I often find #2 to be less sensitive to noise for simple instrumentation output).
4. Smooth the data and check for triplets of equally spaced points where, if x0 < x1 < x2, y1 > 0.5 * (y0 + y2), or check Euclidean distances like this: D((x0, y0), (x1, y1)) + D((x1, y1), (x2, y2)) > D((x0, y0), (x2, y2)), which relies on the triangle inequality. Using simple ratios will again provide you a scoring mechanism.
5. Fit a very simple 2-Gaussian mixture model to your data (for example, Numerical Recipes has a nice ready-made chunk of code). Take the earlier peak. This will deal correctly with overlapping peaks.
6. Find the best match in the data to a simple Gaussian, Cauchy, Poisson, or what-have-you curve. Evaluate this curve over a broad range and subtract it from a copy of the data after noting its peak location. Repeat. Take the earliest peak whose model parameters (standard deviation probably, but some applications might care about kurtosis or other features) meet some criterion. Watch out for artifacts left behind when peaks are subtracted from the data. Best match might be determined by the kind of match scoring suggested in #2 above.
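A minimal sketch of the travel-per-rise score from method 1, taking the absolute net rise so the score stays positive (a small deviation from the description above; the names are mine):

def travel_rise_score(y, k=1e-9):
    # T ("travel"): total vertical distance covered between consecutive samples.
    travel = sum(abs(y[i + 1] - y[i]) for i in range(len(y) - 1))
    # R ("rise"): net change across the window, plus a small k to avoid division by zero.
    rise = abs(y[-1] - y[0]) + k
    return travel / rise  # scores well above 1 suggest a peak (or dip) inside the window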
I've done what you're doing before: finding peaks in DNA sequence data, finding peaks in derivatives estimated from measured curves, and finding peaks in histograms.
I encourage you to attend carefully to proper baselining. Wiener filtering or other filtering or simple histogram analysis is often an easy way to baseline in the presence of noise.
Finally, if your data is typically noisy and you're getting data off the card as unreferenced single-ended output (or even referenced, just not differential), and if you're averaging lots of observations into each data point, try sorting those observations and throwing away the first and last quartile and averaging what remains. There are a host of such outlier elimination tactics that can be really useful.
You could try signal averaging, i.e. for each point, average the value with the surrounding 3 or more points. If the noise blips are huge, then even this may not help.
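For example, a minimal moving-average sketch with NumPy (the window size is an assumption to tune against your noise level):

import numpy as np

def smooth(y, window=5):
    # Replace each point by the average of itself and its neighbours.
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")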
I realise that this was language agnostic, but guessing that you are using LabVIEW, there are lots of pre-packaged signal processing VIs that come with LabVIEW that you can use to do smoothing and noise reduction. The NI forums are a great place to get more specialised help on this sort of thing.
This problem has been studied in some detail.
There are a set of very up-to-date implementations in the TSpectrum* classes of ROOT (a nuclear/particle physics analysis tool). The code works in one- to three-dimensional data.
The ROOT source code is available, so you can grab this implementation if you want.
From the TSpectrum class documentation:
The algorithms used in this class have been published in the following references:
[1] M. Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
[2] M. Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
[3] M. Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.
The papers are linked from the class documentation for those of you who don't have a NIM online subscription.
The short version of what is done is that the histogram is flattened to eliminate noise, and then local maxima are detected by brute force in the flattened histogram.
I would like to contribute to this thread an algorithm that I have developed myself:
It is based on the principle of dispersion: if a new datapoint is a given x number of standard deviations away from some moving mean, the algorithm signals (also called z-score). The algorithm is very robust because it constructs a separate moving mean and deviation, such that signals do not corrupt the threshold. Future signals are therefore identified with approximately the same accuracy, regardless of the amount of previous signals. The algorithm takes 3 inputs: lag = the lag of the moving window, threshold = the z-score at which the algorithm signals and influence = the influence (between 0 and 1) of new signals on the mean and standard deviation. For example, a lag of 5 will use the last 5 observations to smooth the data. A threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean. And an influence of 0.5 gives signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold: an influence of 0 is therefore the most robust option.
It works as follows:
Pseudocode
# Let y be a vector of timeseries data of at least length lag+2
# Let mean() be a function that calculates the mean
# Let std() be a function that calculates the standard deviaton
# Let absolute() be the absolute value function
# Settings (the ones below are examples: choose what is best for your data)
set lag to 5; # lag 5 for the smoothing functions
set threshold to 3.5; # 3.5 standard deviations for signal
set influence to 0.5; # between 0 and 1, where 1 is normal influence, 0.5 is half
# Initialise variables
set signals to vector 0,...,0 of length of y; # Initialise signal results
set filteredY to y(1,...,lag) # Initialise filtered series
set avgFilter to null; # Initialise average filter
set stdFilter to null; # Initialise std. filter
set avgFilter(lag) to mean(y(1,...,lag)); # Initialise first value
set stdFilter(lag) to std(y(1,...,lag)); # Initialise first value
for i=lag+1,...,t do
  if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
    if y(i) > avgFilter(i-1) then
      set signals(i) to +1;            # Positive signal
    else
      set signals(i) to -1;            # Negative signal
    end
    # Adjust the filters
    set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
    set avgFilter(i) to mean(filteredY(i-lag,i),lag);
    set stdFilter(i) to std(filteredY(i-lag,i),lag);
  else
    set signals(i) to 0;               # No signal
    # Adjust the filters
    set filteredY(i) to y(i);
    set avgFilter(i) to mean(filteredY(i-lag,i),lag);
    set stdFilter(i) to std(filteredY(i-lag,i),lag);
  end
end
Demo: for more information, see the original answer.
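For reference, a minimal Python/NumPy translation of the pseudocode above (a sketch, not the author's reference implementation):

import numpy as np

def zscore_signals(y, lag=5, threshold=3.5, influence=0.5):
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y))
    filtered_y = y.copy()
    avg_filter = np.zeros(len(y))
    std_filter = np.zeros(len(y))
    avg_filter[lag - 1] = np.mean(y[:lag])
    std_filter[lag - 1] = np.std(y[:lag])
    for i in range(lag, len(y)):
        if abs(y[i] - avg_filter[i - 1]) > threshold * std_filter[i - 1]:
            signals[i] = 1 if y[i] > avg_filter[i - 1] else -1
            # Signals only partially influence the moving statistics.
            filtered_y[i] = influence * y[i] + (1 - influence) * filtered_y[i - 1]
        else:
            signals[i] = 0
            filtered_y[i] = y[i]
        avg_filter[i] = np.mean(filtered_y[i - lag + 1 : i + 1])
        std_filter[i] = np.std(filtered_y[i - lag + 1 : i + 1])
    return signals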
This method is basically from David Marr's book "Vision":
Gaussian blur your signal with the expected width of your peaks; this gets rid of noise spikes while your phase data is undamaged.
Then edge detect (a LoG filter will do).
The edges you find will be the edges of features (like peaks).
Look between the edges for peaks, sort the peaks by size, and you're done.
I have used variations on this and they work very well.
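A rough sketch of these steps with SciPy, where the peak width and the use of zero crossings in the LoG response are assumptions on my part:

import numpy as np
from scipy.ndimage import gaussian_filter1d, gaussian_laplace

def marr_style_edges(y, peak_width=10.0):
    y = np.asarray(y, dtype=float)
    blurred = gaussian_filter1d(y, sigma=peak_width)      # blur at the expected peak width
    log_response = gaussian_laplace(y, sigma=peak_width)  # edge detection via Laplacian of Gaussian
    # Zero crossings of the LoG response mark feature edges; look between them for peaks.
    edges = np.where(np.diff(np.sign(log_response)) != 0)[0]
    return blurred, edges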
I think you want to cross-correlate your signal with an expected, exemplar signal. But, it has been such a long time since I studied signal processing and even then I didn't take much notice.
I don't know very much about instrumentation, so this might be totally impractical, but then again it might be a helpful different direction. If you know how the readings can fail, and there is a certain interval between peaks given such failures, why not do gradient descent at each interval? If the descent brings you back to an area you've searched before, you can abandon it. Depending upon the shape of the sampled surface, this might also help you find peaks faster than searching.
Is there a qualitative difference between the desired peak and the unwanted second peak? If both peaks are "sharp" -- i.e. short in time duration -- then when looking at the signal in the frequency domain (by doing an FFT) you'll get energy at most bands. But if the "good" peak reliably has energy present at frequencies not present in the "bad" peak, or vice versa, you may be able to differentiate them automatically that way.
You could apply some Standard Deviation to your logic and take notice of peaks over x%.

Resources