How to calculate alpha if error rate is zero (Adaboost)

I have been wondering what the value of alpha (the weight of a weak classifier) should be when it has an error rate of zero (perfect classification), since the formula for alpha divides by the error rate:
(0.5) * Math.log(((1 - errorRate) / errorRate))
Thank you.

If you're boosting by reweighting and passing to the weak learner the whole training data, I'd say that you found a weak classifier that is in fact strong, after all it flawlessly classified your data.
In this case, it should happen in the first Adaboost iteration. Add that weak classifier to your strong classifier with an alpha set to 1 and stop the training.
Now, if that happened while you're boosting by resampling, and your sample is only a subset of your training data, I believe you should discard this subset and retry with another sample.
I believe you reached such a result because you're playing with a very simple example, or because your training dataset is very small or isn't representative. It's also possible that your weak classifier is too weak and is approaching random guessing too quickly.

Nominally, the alpha for the weak classifier with zero error should be large because it classifies all training instances correctly. I'm assuming you're using all training data to estimate alpha. It's possible you're estimating alpha only with the training sample for that round of boosting as well--in which case your alpha should be slightly smaller based on the sample size--but same idea.
In theory, this alpha should be near infinity if your other alphas are unnormalized. In practice, the suggestion to check if your error is zero and give those alphas a very high value is reasonable, but error rates of zero or near zero typically indicate you're overfitting (or just have too little training data to estimate reliable alphas).
This is covered in section 4.2 of Schapire & Singer's Confidence Rated Predictions version of Adaboost. They suggest adding a small epsilon to your numerator and denominator for stability:
alpha = (0.5) * Math.log(((1 - errorRate + epsilon) / (errorRate + epsilon)))
In any event, this alpha shouldn't be set to a small value (it should be large). Setting it to 1 only makes sense if the alphas for all other rounds of boosting are normalized so that the sum of all alphas is close to 1.
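A minimal Python sketch of the smoothed formula above. The function name `classifier_alpha` and the default value of `epsilon` are assumptions made for illustration; they do not come from the paper.

```python
import math

def classifier_alpha(error_rate, epsilon=1e-10):
    """Weight of a weak classifier, smoothed so a zero error rate stays finite.

    epsilon is an illustrative default, not a recommended value.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    return 0.5 * math.log((1.0 - error_rate + epsilon) / (error_rate + epsilon))

# A perfect classifier gets a large but finite weight,
# while a classifier at chance level gets a weight of about zero.
perfect = classifier_alpha(0.0)
chance = classifier_alpha(0.5)
```

The smaller `epsilon` is, the larger the weight given to a perfect classifier, so the choice of `epsilon` effectively caps alpha.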

I ran into this problem a few times and usually what I do is to check if error is equal to 0 and if it is, set it equal to 1/10 of the minimum weight. It is a hack, but it usually ends up working pretty well.

It is actually better not to use such a classifier in your Adaboost prediction: it is not a weak classifier, so it would not improve the ensemble much, and it will tend to eat up all the weight.

Related

Why we multiply 'most likely estimate' by 4 in three point estimation?

I have used three-point estimation for one of my projects.
The formula is
Three Point Estimate = (O + 4M + L) / 6
That means
(Best Estimate + 4 x Most Likely Estimate + Worst Case Estimate) divided by 6
Here, dividing by 6 averages over the six weighted terms,
and there is less chance of the worst case or the best case happening. In good faith, the most likely estimate (M) is what it will take to get the job done.
But I don't know why they use 4M. Why multiply by 4, and not by 5, 6, 7, etc.?
Why is the most likely estimate weighted four times as much as the other two values?
There is a derivation here:
http://www.deepfriedbrainproject.com/2010/07/magical-formula-of-pert.html
In case the link goes dead, I'll provide a summary here.
So, taking a step back from the question for a moment, the goal here is to come up with a single mean (average) figure that we can say is the expected figure for any given 3 point estimate. That is to say, If I was to attempt the project X times, and add up all the costs of the project attempts for a total of $Y, then I expect the cost of one attempt to be $Y/X. Note that this number may or may not be the same as the mode (most likely) outcome, depending on the probability distribution.
An expected outcome is useful because we can do things like add up a whole list of expected outcomes to create an expected outcome for the project, even if we calculated each individual expected outcome differently.
A mode on the other hand, is not even necessarily unique per estimate, so that's one reason that it may be less useful than an expected outcome. For example, every number from 1-6 is the "most likely" for a dice roll, but 3.5 is the (only) expected average outcome.
The rationale/research behind a 3 point estimate is that in many (most?) real-world scenarios, these numbers can be more accurately/intuitively estimated by people than a single expected value:
A pessimistic outcome (P)
An optimistic outcome (O)
The most likely outcome (M)
However, to convert these three numbers into an expected value we need a probability distribution that interpolates all the other (potentially infinite) possible outcomes beyond the 3 we produced.
The fact that we're even doing a 3-point estimate presumes that we don't have enough historical data to simply lookup/calculate the expected value for what we're about to do, so we probably don't know what the actual probability distribution for what we're estimating is.
The idea behind the PERT estimates is that if we don't know the actual curve, we can plug some sane defaults into a Beta distribution (which is basically just a curve we can customise into many different shapes) and use those defaults for every problem we might face. Of course, if we know the real distribution, or have reason to believe that default Beta distribution prescribed by PERT is wrong for the problem at hand, we should NOT use the PERT equations for our project.
The Beta distribution has two parameters A and B that set the shape of the left and right hand side of the curve respectively. Conveniently, we can calculate the mode, mean and standard deviation of a Beta distribution simply by knowing the minimum/maximum values of the curve, as well as A and B.
PERT sets A and B to the following for every project/estimate:
If M > (O + P) / 2 then A = 3 + √2 and B = 3 - √2, otherwise the values of A and B are swapped.
Now, it just so happens that if you make that specific assumption about the shape of your Beta distribution, the following formulas are exactly true:
Mean (expected value) = (O + 4M + P) / 6
Standard deviation = (P - O) / 6
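The claim that the formulas are exact for this particular Beta shape can be checked numerically. The sketch below uses the standard closed-form mean, mode and variance of a Beta distribution stretched over the range from O to P. The function names and the sample values of O and P are illustrative assumptions, and the standard deviation is written as (P - O) / 6, taking P as the larger (pessimistic) value.

```python
import math

def pert_mean(o, m, p):
    """PERT expected value from optimistic, most likely and pessimistic figures."""
    return (o + 4.0 * m + p) / 6.0

def pert_std(o, p):
    """PERT standard deviation, with p the pessimistic (larger) figure."""
    return (p - o) / 6.0

def beta_stats(o, p, a, b):
    """Mean, mode and standard deviation of a Beta(a, b) curve stretched over [o, p]."""
    span = p - o
    mean = o + span * a / (a + b)
    mode = o + span * (a - 1.0) / (a + b - 2.0)
    std = span * math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return mean, mode, std

# Example figures (made up) with the shape parameters quoted above.
o, p = 2.0, 14.0
a, b = 3.0 + math.sqrt(2.0), 3.0 - math.sqrt(2.0)
mean, mode, std = beta_stats(o, p, a, b)

# Feeding the Beta mode back into the PERT formulas reproduces its mean and std.
mean_gap = abs(pert_mean(o, mode, p) - mean)
std_gap = abs(pert_std(o, p) - std)
```

Both gaps come out at floating-point noise level, which is what "exactly true" means here: the formulas match only when M is the mode of that specific curve.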
So, in summary
The PERT formulas are not based on a normal distribution, they are based on a Beta distribution with a very specific shape
If your project's probability distribution matches the PERT Beta distribution then the PERT formulas are exactly correct; they are not approximations
It is pretty unlikely that the specific curve chosen for PERT matches any given arbitrary project, and so the PERT formulas will be an approximation in practice
If you don't know anything about the probability distribution of your estimate, you may as well leverage PERT as it's documented, understood by many people and relatively easy to use
If you know something about the probability distribution of your estimate that suggests something about PERT is inappropriate (like the 4x weighting towards the mode), then don't use it, use whatever you think is appropriate instead
The reason why you multiply by 4 to get the Mean (and not 5, 6, 7, etc.) is because the number 4 is tied to the shape of the underlying probability curve
Of course, PERT could have been based off a Beta distribution that yields 5, 6, 7 or any other number when calculating the Mean, or even a normal distribution, or a uniform distribution, or pretty much any other probability curve, but I'd suggest that the question of why they chose the curve they did is out of scope for this answer and possibly quite open ended/subjective anyway
I dug into this once. I cleverly neglected to write down the trail, so this is from memory.
So far as I can make out, the standards documents got it from the textbooks. The textbooks got it from the original 1950s write-up in a statistics journal. The write-up in the journal was based on an internal report done by RAND as part of the overall work done to develop PERT for the Polaris program.
And that's where the trail goes cold. Nobody seems to have a firm idea of why they chose that formula. The best guess seems to be that it's based on a rough approximation of a normal distribution -- strictly, it's a triangular distribution. A lumpy bell curve, basically, that assumes that the "likely case" falls within 1 standard deviation of the true mean estimate.
4/6ths approximates 66.7%, which approximates 68%, which approximates the area under a normal distribution within one standard deviation of the mean.
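The arithmetic behind that guess is easy to check. This short Python sketch (variable names are illustrative) compares the 4/6 weight with the probability mass of a normal distribution within one standard deviation of its mean, computed from the error function.

```python
import math

# Probability that a normal variable falls within one standard deviation of its mean.
within_one_sigma = math.erf(1.0 / math.sqrt(2.0))

# Share of the total weight that the PERT formula puts on the most likely estimate.
weight_on_mode = 4.0 / 6.0

gap = within_one_sigma - weight_on_mode
```

The two numbers differ by a little under two percentage points, so the resemblance is loose, which fits the "rough approximation" reading.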
All that being said, there are two problems:
It's essentially made up. There doesn't seem to be a firm basis for picking it. There's some Operational Research literature arguing for alternative distributions. In what universe are estimates normally distributed around the true outcome? I'd very much like to move there.
The accuracy-improving effect of the 3-point / PERT estimation method might be more about the breaking down of tasks into subtasks than from any particular formula. Psychologists studying what they call "the planning fallacy" have found that breaking down tasks -- "unpacking", in their terminology -- consistently improves estimates by making them higher and thus reducing inaccuracy. So perhaps the magic in PERT/3-point is the unpacking, not the formulae.
Isn't it simply a rule-of-thumb number that works well?
The cone of uncertainty uses the factor 4 for the beginning phase of the project.
The book "Software Estimation" by Steve McConnell is based around the "cone of uncertainty" model and gives many rules of thumb. However, every approximated number or rule of thumb is based on statistics from COCOMO or similar solid research, models or studies.
Ideally these factors for O, M and L are derived using historical data for other projects in the same company in the same environment. In other words, the company should have 4 projects completed within M estimate, 1 within O and 1 within L. If my company/team had got 1 project completed within original O estimate, 2 projects within M and 2 within L, I would use another formula - (O + 2M + 2L) / 5. Does it make sense?
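A small Python sketch of that idea, with the weights passed in as counts of past projects that finished within each estimate. The function name `weighted_estimate` and the sample figures are made up for illustration.

```python
def weighted_estimate(o, m, l, counts=(1, 4, 1)):
    """Weighted average of optimistic (o), most likely (m) and worst-case (l) estimates.

    counts holds the weights for (o, m, l); the default reproduces (O + 4M + L) / 6.
    """
    w_o, w_m, w_l = counts
    total = w_o + w_m + w_l
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (w_o * o + w_m * m + w_l * l) / total

# Standard weighting versus the (O + 2M + 2L) / 5 variant from the text.
standard = weighted_estimate(2, 5, 14)
history_based = weighted_estimate(2, 5, 14, counts=(1, 2, 2))
```

With more weight on the worst case, the history-based figure lands above the standard one for the same three inputs.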
The cone of uncertainty was referenced above ... it's a well-known foundational element used in agile estimation practices.
What's the problem with it though? Doesn't it look too symmetrical - as if it's not natural, not really based on real data?
If you ever thought that, then you're right. The cone of uncertainty as usually drawn is made up based on probabilities, not on actual raw data from real projects (but most of the time it's used as if it were).
Laurent Bossavit wrote a book and also gave a presentation where he presented his research on how that cone came to be (and other 'facts' we often believe in software engineering):
The Leprechauns of Software Engineering
https://www.amazon.com/Leprechauns-Software-Engineering-Laurent-Bossavit/dp/2954745509/
https://www.youtube.com/watch?v=0AkoddPeuxw
Is there some real data to support a cone of uncertainty? The closest he was able to find was a cone that can go up to 10x in the positive Y direction (so we can be up to 10 times off on our estimation in terms of the project taking 10 times as long in the end).
Hardly anybody estimates a project that ends up finishing 4 times earlier ... or ... gasp ... 10 times earlier.

What if each round of boosting selects same Haar-feature in Viola-jones face detection method?

I am implementing Viola-Jones face detection to detect human faces. While training with Adaboost, the boosting rounds keep selecting the same Haar feature. For example, if the selected Haar features (x,y,w,h,f,p) for the first three rounds are (0,0,4,2,1,0), (1,3,5,2,3,1) and (2,4,7,2,4,1), then every remaining round of boosting selects the same Haar feature, so that the list of my selected Haar features becomes
[(0,0,4,2,1,0),(1,3,5,2,3,1),(2,4,7,2,4,1),(1,2,4,8,1,0),(1,2,4,8,1,0),(1,2,4,8,1,0),(1,2,4,8,1,0),(1,2,4,8,1,0)].
Here,
x,y = x and y coordinates, w = width of Haar feature, h = height of Haar feature, f = feature type, p = parity of Haar feature.
My Question:
1) If each round of boosting selects the same Haar feature, should I instead select the next Haar feature, i.e. the one with the next-lowest error?
Thanks!
No, you should not. Adaboost can indeed pick the same feature more than once per boosting run, but usually the feature will have a different weight value (alpha value).
The results you're getting might have many different causes. For instance, you may have a bug in your Adaboost code. You may also have a bug in your features or weak classifiers. Or you're not providing enough samples to your boosting algorithm. Or, your weak classifiers are too weak. Or your strong classifier is overfitting really fast.

Some details about adjusting cascaded AdaBoost stage threshold

I have implemented the AdaBoost sequence algorithm and I am currently trying to implement the so-called Cascaded AdaBoost, based on the original paper by P. Viola and M. Jones. Unfortunately I have some doubts about adjusting the threshold for one stage. In the original paper, the procedure is described in literally one sentence:
Decrease threshold for the ith classifier until the current cascaded classifier has a detection rate of at least d × D_{i-1} (this also affects F_i)
I am unsure about three things:
What is the threshold? Is it the value of the 0.5 * sum(alpha) expression, or only the 0.5 factor?
What should be the initial value of the threshold? (0.5?)
What does "decrease threshold" mean in detail? Do I need to iteratively select a new threshold, e.g. 0.5, 0.4, 0.3? What is the step size of the decrease?
I have tried to search this info in Google, but unfortunately I could not find any useful information.
Thank you for your help.
I had the exact same doubt and have not found any authoritative source so far. However, this is what is my best guess to this issue:
1. (0.5*sum(alpha)) is the threshold.
2. The initial value of the threshold is the one above. Next, classify the samples using the intermediate strong classifier (what you currently have). You'll get the score each sample attains, and depending on the current value of the threshold, some of the positive samples will be classified as negative, etc. So, depending on the detection rate desired for this stage (strong classifier), reduce the threshold so that that many positive samples get correctly classified,
eg:
say thresh. was 10, and these are the current classifier outputs for positive training samples:
9.5, 10.5, 10.2, 5.4, 6.7
and I want a detection rate of 80% => 80% of above 5 samples classified correctly => 4 of above => set threshold to 6.7
Clearly, changing the threshold also changes the FP rate, so update that, and if the desired FP rate for the stage is not reached, add another weak classifier to that stage.
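The worked example above can be sketched in Python. This is only one reading of the procedure, not something taken from the Viola-Jones paper: it assumes a sample is accepted when its score is greater than or equal to the threshold, and the function names are hypothetical.

```python
import math

def stage_threshold(positive_scores, detection_rate):
    """Highest threshold that still accepts the required share of positive samples.

    Assumes a sample is accepted when score >= threshold.
    """
    if not positive_scores:
        raise ValueError("need at least one positive score")
    if not 0.0 < detection_rate <= 1.0:
        raise ValueError("detection_rate must lie in (0, 1]")
    ranked = sorted(positive_scores, reverse=True)
    # Number of positives that must pass; the small tolerance guards against float noise.
    needed = max(1, math.ceil(detection_rate * len(ranked) - 1e-9))
    return ranked[needed - 1]

def false_positive_rate(negative_scores, threshold):
    """Share of negative samples that the stage would still accept."""
    if not negative_scores:
        return 0.0
    return sum(score >= threshold for score in negative_scores) / len(negative_scores)

# Scores from the example: an 80% detection rate means 4 of the 5 positives must pass.
threshold = stage_threshold([9.5, 10.5, 10.2, 5.4, 6.7], 0.8)
```

After picking the threshold this way, `false_positive_rate` on the negative samples tells you whether the stage already meets its FP target or needs another weak classifier.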
I have not done a formal course on ada-boost etc, but this is my observation based on some research papers I tried to implement. Please correct me if something is wrong. Thanks!
I have found a Master thesis on real-time face detection by Karim Ayachi (pdf) in which he describes the Viola Jones face detection method.
As it is written in Section 5.2 (Creating the Cascade using AdaBoost), we can set the maximal threshold of the strong classifier to sum(alpha) and the minimal threshold to 0 and then find the optimal threshold using binary search (see Table 5.1 for pseudocode).
Hope this helps!

Optimal population size, mutate rate and mate rate in genetic algorithm

I have written a game playing program for a competition, which relies on some 16 floating point "constants". Changing a constant can and will have dramatic impact on playing style and success rate.
I have also written a simple genetic algorithm to generate the optimal values for the constants. However the algorithm does not generate "optimal" constants.
The likely reasons:
The algorithm has errors (for the time being rule this out!)
The population is too small
The mutate rate is too high
The mate rate could be better
The algorithm goes like this:
First the initial population is created
Initial constants for each member are assigned (based on my bias multiplied with a random factor between 0.75 and 1.25)
For each generation members of the population are paired for a game match
The winner is cloned twice, if draw both are cloned once
The cloning mutates one gene if random() is less than mutate rate
Mutation multiplies a random constant with a random factor between 0.75 and 1.25
At fixed intervals, dependent on mate rate, the members are paired and genes are mixed
My current settings:
Population: 40 (too low)
Mutate rate 0.10 (10%)
Mate rate 0.20 (every 5 generations)
What would be better values for population size, mutate rate and mate rate?
Guesses are welcome, exact values are not expected!
Also, if you have insights from similar genetic algorithms that you would like to share, please do so.
P.S.: The game playing competition in question, if anyone is interested: http://ai-contest.com/
Your mutation size strikes me as surprisingly high. There's also a bit of bias inherent in it - the larger the current value is, the larger the mutation will be.
You might consider
Having a (much!) smaller mutation
Giving the mutation a fixed range
Distributing your mutation sizes differently - e.g. you could use a normal distribution with a mean of 1.
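As a rough sketch of the last suggestion, the Python function below mutates at most one gene by scaling it with a factor drawn from a normal distribution with mean 1. The function name, the `sigma` default and the `rng` parameter are assumptions for illustration; the 10% rate is the one from the question.

```python
import random

def mutate(genes, rate=0.10, sigma=0.05, rng=random):
    """Return a mutated copy of genes.

    With probability `rate`, one randomly chosen gene is multiplied by a factor
    drawn from a normal distribution with mean 1 and standard deviation `sigma`.
    """
    child = list(genes)
    if rng.random() < rate:
        index = rng.randrange(len(child))
        child[index] *= rng.gauss(1.0, sigma)
    return child

# Seeded generator so that a run can be repeated while tuning sigma.
parent = [1.0] * 16
child = mutate(parent, rate=1.0, rng=random.Random(0))
```

Most draws land close to 1, so most mutations are small tweaks, with the occasional larger jump. Because the change is still multiplicative, the bias towards larger values noted above remains; adding a zero-mean draw instead would give a fixed range.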
R.A. Fisher once compared the mutation size to focusing a microscope. If you change the focus, you might be going in the right direction, or the wrong direction. However, if you're fairly close to the optimum and turn it a lot - either you'll go in the wrong direction, or you'll overshoot the target. So a more subtle tweak is generally better!
Use the GAUL framework; it's really easy to use, so you could extract your objective function and plug it into GAUL. If you have a multi-core machine, you will want to compile with OpenMP to parallelize your evaluations (which I assume are time-consuming). This way you can have a bigger population size. http://gaul.sourceforge.net/
Normally people use high crossover and low mutation. Since you want creativity, I suggest high mutation and low crossover. http://games.slashdot.org/story/10/11/02/0211249/Developing-emStarCraft-2em-Build-Orders-With-Genetic-Algorithms?from=rss
Be really careful in your mutation function to stay inside your search space (inside 0.75, 1.25). Use GAUL's random functions such as random_double( min, max ). They are really well designed. Build your own mutation function. Make sure the parents die!
Then you may want to combine this with a simplex (Nelder-Mead) search, included in GAUL, because genetic programming with low crossover will find a non-optimal solution.

Peak detection of measured signal

We use a data acquisition card to take readings from a device that increases its signal to a peak and then falls back to near the original value. To find the peak value we currently search the array for the highest reading and use the index to determine the timing of the peak value which is used in our calculations.
This works well if the highest value is the peak we are looking for but if the device is not working correctly we can see a second peak which can be higher than the initial peak. We take 10 readings a second from 16 devices over a 90 second period.
My initial thoughts are to cycle through the readings, checking whether the previous and next points are less than the current one to find a peak, and construct an array of peaks. Maybe we should be looking at an average of a number of points either side of the current position to allow for noise in the system. Is this the best way to proceed or are there better techniques?
We do use LabVIEW and I have checked the LAVA forums and there are a number of interesting examples. This is part of our test software and we are trying to avoid using too many non-standard VI libraries so I was hoping for feedback on the process/algorithms involved rather than specific code.
There are lots and lots of classic peak detection methods, any of which might work. You'll have to see what, in particular, bounds the quality of your data. Here are basic descriptions:
1. Between any two points in your data, (x(0), y(0)) and (x(n), y(n)), add up |y(i + 1) - y(i)| for 0 <= i < n and call this T ("travel") and set R ("rise") to y(n) - y(0) + k for suitably small k. T/R > 1 indicates a peak. This works OK if large travel due to noise is unlikely or if noise distributes symmetrically around a base curve shape. For your application, accept the earliest peak with a score above a given threshold, or analyze the curve of travel per rise values for more interesting properties.
2. Use matched filters to score similarity to a standard peak shape (essentially, use a normalized dot-product against some shape to get a cosine-metric of similarity).
3. Deconvolve against a standard peak shape and check for high values (though I often find method 2 to be less sensitive to noise for simple instrumentation output).
4. Smooth the data and check for triplets of equally spaced points where, if x0 < x1 < x2, y1 > 0.5 * (y0 + y2), or check Euclidean distances like this: D((x0, y0), (x1, y1)) + D((x1, y1), (x2, y2)) > D((x0, y0),(x2, y2)), which relies on the triangle inequality. Using simple ratios will again provide you a scoring mechanism.
5. Fit a very simple 2-gaussian mixture model to your data (for example, Numerical Recipes has a nice ready-made chunk of code). Take the earlier peak. This will deal correctly with overlapping peaks.
6. Find the best match in the data to a simple Gaussian, Cauchy, Poisson, or what-have-you curve. Evaluate this curve over a broad range and subtract it from a copy of the data after noting its peak location. Repeat. Take the earliest peak whose model parameters (standard deviation probably, but some applications might care about kurtosis or other features) meet some criterion. Watch out for artifacts left behind when peaks are subtracted from the data. Best match might be determined by the kind of match scoring suggested in #2 above.
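The travel-per-rise score from the first method can be sketched in a few lines of Python. Two assumptions are made here: the travel is taken as the sum of absolute step sizes (a plain signed sum would telescope to y(n) - y(0)), and the rise is taken as an absolute value so the score stays positive for falling segments. The function name and the default `k` are illustrative.

```python
def travel_rise_score(y, k=1e-9):
    """Ratio of total up-and-down travel to net rise over a stretch of samples.

    A score close to 1 means the stretch is essentially monotonic; a large score
    means the signal went somewhere and came back, as it does around a peak.
    """
    if len(y) < 2:
        raise ValueError("need at least two samples")
    travel = sum(abs(b - a) for a, b in zip(y, y[1:]))
    rise = abs(y[-1] - y[0]) + k
    return travel / rise

ramp_score = travel_rise_score([0, 1, 2, 3])
peak_score = travel_rise_score([0, 1, 2, 3, 2, 1, 0.5])
```

Sliding this over windows of the readings and taking the earliest window whose score clears a threshold is one way to apply the "accept the earliest peak" advice.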
I've done what you're doing before: finding peaks in DNA sequence data, finding peaks in derivatives estimated from measured curves, and finding peaks in histograms.
I encourage you to attend carefully to proper baselining. Wiener filtering or other filtering or simple histogram analysis is often an easy way to baseline in the presence of noise.
Finally, if your data is typically noisy and you're getting data off the card as unreferenced single-ended output (or even referenced, just not differential), and if you're averaging lots of observations into each data point, try sorting those observations and throwing away the first and last quartile and averaging what remains. There are a host of such outlier elimination tactics that can be really useful.
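A minimal Python sketch of that trimming step, assuming the raw observations behind one data point are available as a list. The name `interquartile_mean` is illustrative, and the quartile size is rounded down for short lists.

```python
def interquartile_mean(observations):
    """Average the observations after discarding the lowest and highest quarter."""
    ordered = sorted(observations)
    count = len(ordered)
    if count == 0:
        raise ValueError("need at least one observation")
    cut = count // 4  # size of each discarded quartile, rounded down
    kept = ordered[cut:count - cut]
    return sum(kept) / len(kept)

# The two wild readings (-50 and 100) fall in the discarded quartiles.
value = interquartile_mean([1, 2, 3, 4, 100, 5, 6, -50])
```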
You could try signal averaging, i.e. for each point, average the value with the surrounding 3 or more points. If the noise blips are huge, then even this may not help.
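For illustration, a centred moving average in plain Python. The name `moving_average` and the `half_width` parameter are assumptions; near the ends of the series the window is simply cut short.

```python
def moving_average(y, half_width=1):
    """Replace each sample by the mean of itself and half_width neighbours on each side."""
    count = len(y)
    smoothed = []
    for i in range(count):
        start = max(0, i - half_width)
        stop = min(count, i + half_width + 1)
        window = y[start:stop]
        smoothed.append(sum(window) / len(window))
    return smoothed

# A single-sample blip of height 9 is spread out and reduced to 3.
smoothed = moving_average([0, 0, 9, 0, 0])
```

A large blip is flattened but not removed, which is the limitation mentioned above.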
I realise that this was language agnostic, but guessing that you are using LabVIEW, there are lots of pre-packaged signal processing VIs that come with LabVIEW that you can use to do smoothing and noise reduction. The NI forums are a great place to get more specialised help on this sort of thing.
This problem has been studied in some detail.
There are a set of very up-to-date implementations in the TSpectrum* classes of ROOT (a nuclear/particle physics analysis tool). The code works in one- to three-dimensional data.
The ROOT source code is available, so you can grab this implementation if you want.
From the TSpectrum class documentation:
The algorithms used in this class have been published in the following references:
[1] M. Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
[2] M. Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
[3] M. Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.
The papers are linked from the class documentation for those of you who don't have a NIM online subscription.
The short version of what is done is that the histogram is flattened to eliminate noise, and then local maxima are detected by brute force in the flattened histogram.
I would like to contribute to this thread an algorithm that I have developed myself:
It is based on the principle of dispersion: if a new datapoint is a given x number of standard deviations away from some moving mean, the algorithm signals (also called z-score). The algorithm is very robust because it constructs a separate moving mean and deviation, such that signals do not corrupt the threshold. Future signals are therefore identified with approximately the same accuracy, regardless of the amount of previous signals. The algorithm takes 3 inputs: lag = the lag of the moving window, threshold = the z-score at which the algorithm signals and influence = the influence (between 0 and 1) of new signals on the mean and standard deviation. For example, a lag of 5 will use the last 5 observations to smooth the data. A threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean. And an influence of 0.5 gives signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold: an influence of 0 is therefore the most robust option.
It works as follows:
Pseudocode
# Let y be a vector of timeseries data of at least length lag+2
# Let t be the length of y
# Let mean() be a function that calculates the mean
# Let std() be a function that calculates the standard deviation
# Let absolute() be the absolute value function
# Settings (the ones below are examples: choose what is best for your data)
set lag to 5;          # lag 5 for the smoothing functions
set threshold to 3.5;  # 3.5 standard deviations for signal
set influence to 0.5;  # between 0 and 1, where 1 is normal influence, 0.5 is half
# Initialise variables
set signals to vector 0,...,0 of length of y;  # Initialise signal results
set filteredY to y(1,...,lag);                 # Initialise filtered series
set avgFilter to null;                         # Initialise average filter
set stdFilter to null;                         # Initialise std. filter
set avgFilter(lag) to mean(y(1,...,lag));      # Initialise first value
set stdFilter(lag) to std(y(1,...,lag));       # Initialise first value
for i=lag+1,...,t do
  if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
    if y(i) > avgFilter(i-1) then
      set signals(i) to +1;  # Positive signal
    else
      set signals(i) to -1;  # Negative signal
    end
    # Adjust the filters, damping the signal by the influence factor
    set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
  else
    set signals(i) to 0;     # No signal
    # Adjust the filters with the raw data point
    set filteredY(i) to y(i);
  end
  # Recompute the moving mean and deviation over the last lag filtered values
  set avgFilter(i) to mean(filteredY(i-lag+1,...,i));
  set stdFilter(i) to std(filteredY(i-lag+1,...,i));
end
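A runnable Python sketch of the pseudocode above, using only the standard library. It is one possible translation rather than the author's reference implementation: the function name `zscore_peaks` is made up, the moving window is taken as the last `lag` filtered values, and `std()` is read as the population standard deviation.

```python
import statistics

def zscore_peaks(y, lag=5, threshold=3.5, influence=0.5):
    """Return a list of +1 / -1 / 0 signals, one per sample of y."""
    if len(y) < lag + 2:
        raise ValueError("y must contain at least lag + 2 samples")
    signals = [0] * len(y)
    filtered = list(y[:lag])             # filtered series, seeded with raw data
    avg = statistics.fmean(filtered)     # moving mean of the filtered series
    std = statistics.pstdev(filtered)    # moving deviation (population form assumed)
    for i in range(lag, len(y)):
        if abs(y[i] - avg) > threshold * std:
            signals[i] = 1 if y[i] > avg else -1
            # Signals enter the filtered series only with reduced influence.
            filtered.append(influence * y[i] + (1 - influence) * filtered[-1])
        else:
            filtered.append(y[i])
        window = filtered[-lag:]
        avg = statistics.fmean(window)
        std = statistics.pstdev(window)
    return signals

# Made-up readings with one obvious spike at index 7.
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0, 1.02]
signals = zscore_peaks(readings, lag=5, threshold=3.5, influence=0.0)
```

With `influence=0.0` the spike leaves the moving mean and deviation untouched, so the readings after it are judged against the same baseline as before.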
This method is basically from David Marr's book "Vision"
Gaussian blur your signal with the expected width of your peaks.
This gets rid of noise spikes, and your phase data is undamaged.
Then edge detect (LoG, the Laplacian of Gaussian, will do).
Your edges are then the edges of features (like peaks).
Look between edges for peaks, sort the peaks by size, and you're done.
I have used variations on this and they work very well.
I think you want to cross-correlate your signal with an expected, exemplar signal. But, it has been such a long time since I studied signal processing and even then I didn't take much notice.
I don't know very much about instrumentation, so this might be totally impractical, but then again it might be a helpful different direction. If you know how the readings can fail, and there is a certain interval between peaks given such failures, why not do gradient descent at each interval. If the descent brings you back to an area you've searched before, you can abandon it. Depending upon the shape of the sampled surface, this also might help you find peaks faster than search.
Is there a qualitative difference between the desired peak and the unwanted second peak? If both peaks are "sharp" -- i.e. short in time duration -- when looking at the signal in the frequency domain (by doing FFT) you'll get energy at most bands. But if the "good" peak reliably has energy present at frequencies not existing in the "bad" peak, or vice versa, you may be able to automatically differentiate them that way.
You could apply some Standard Deviation to your logic and take notice of peaks over x%.
