Evaluation & Calculate Top-N Accuracy: Top 1 and Top 5

I have come across a few journal papers (on machine-learning classification problems) that evaluate accuracy with a Top-N approach. The reported data showed Top-1 accuracy = 42.5% and Top-5 accuracy = 72.5% under the same training and testing conditions.
I wonder how these top-1 and top-5 percentages are calculated.
Can someone show me an example and the steps to calculate them?
Thanks

Top-1 accuracy is the conventional accuracy: the model's answer (the one with the highest probability) must be exactly the expected answer.
Top-5 accuracy means that any of your model's 5 highest-probability answers must match the expected answer.
For instance, let's say you're applying machine learning to object recognition using a neural network. A picture of a cat is shown, and these are the outputs of your neural network:
Tiger: 0.4
Dog: 0.3
Cat: 0.1
Lynx: 0.09
Lion: 0.08
Bird: 0.02
Bear: 0.01
Using top-1 accuracy, you count this output as wrong, because it predicted a tiger.
Using top-5 accuracy, you count this output as correct, because cat is among the top-5 guesses.

The complement of the accuracy is the error. The top-1 error is the percentage of the time that the classifier did not give the correct class the highest probability score.
The top-5 error is the percentage of the time that the classifier did not include the correct class among its top 5 probabilities or guesses.
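The definitions above can be turned into a few lines of Python. This is only a minimal sketch: the function name top_k_accuracy and the dictionary layout of the predictions are assumptions made for illustration, not part of any particular library.

```python
def top_k_accuracy(predictions, expected, k):
    """Fraction of samples whose expected label is among the k highest-scoring classes."""
    hits = 0
    for scores, truth in zip(predictions, expected):
        # Class names sorted from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(expected)


# The single cat picture from the example above.
predictions = [{"Tiger": 0.4, "Dog": 0.3, "Cat": 0.1, "Lynx": 0.09,
                "Lion": 0.08, "Bird": 0.02, "Bear": 0.01}]
expected = ["Cat"]

print(top_k_accuracy(predictions, expected, k=1))  # 0.0
print(top_k_accuracy(predictions, expected, k=5))  # 1.0
```

Run over a whole test set instead of one picture, the same function gives the percentages quoted in papers; the top-k error is simply 1 minus the returned value.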

Normalizing workouts based on activity, total mileage, and total time

My friends and I are competing in our own fitness challenge (Sober October) where we are keeping track of Activity, Total Time Spent Moving, and Distance. Our activities include running (outdoors), running (treadmill), running (elliptical), rowing, biking (stationary), biking (outdoors), swimming, and stair stepper.
As a group, we weren't really interested in using a calorie estimation because those results can be easily manipulated by increasing the weight that the equation uses, so we wanted to keep it based on just distance and time.
What kind of equation should I use to best normalize such exercises? I'm looking for something that would weight distance and time differently based on the activity; for example, when compared to running, biking should give more weight to time than to mileage because it takes less work to go a mile on a bike than it does on foot.
I was able to find this article on how calories are calculated, and just thought about removing the weight portion of the equation to get our normalized number, but wanted to see if there was a better way to calculate what I'm looking for.

Objective measure
You are seeking an objective measurement which is independent of weight. Use METs.
A human expends a baseline of one MET sitting quietly. Maybe your measure will be excess-MET-hours.
Score = (METs - 1) × Hours
MET values
On that link above you can find reference METs values for various activities, including several of your target activities. These are independent of speed.
You can further improve the calculation by factoring in your distance/time measurements. For example, given cited METs figures:
Walking slowly (1 mph) = 2.0 MET
Walking (3 mph) = 3.0 MET
Jogging (6.8 mph) = 11.2 MET
You can fit them to a curve. Use Desmos.
So, with the fitted curve METs ≈ 1 + 0.2 × (miles/hours)^2, your score for walking/jogging/running is:
Excess MET-hours = [(1 + 0.2 × (miles/hours)^2) - 1] × hours = 0.2 × (miles/hours)^2 × hours
You can make similar estimations for other activities.
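As a rough sketch, the walking/jogging/running score could be computed like this. The function names are made up for illustration, and the 0.2 coefficient is just the rough curve fit used above, not a calibrated value.

```python
def running_mets(speed_mph):
    """Approximate METs on foot, using the rough quadratic fit from above (an assumption)."""
    return 1 + 0.2 * speed_mph ** 2


def excess_met_hours(miles, hours):
    """Score = (METs - 1) x hours, with METs estimated from the average speed."""
    speed_mph = miles / hours
    return (running_mets(speed_mph) - 1) * hours


# One hour of walking at 3 mph.
print(round(excess_met_hours(miles=3, hours=1), 2))  # 1.8
```

For an activity where only a fixed reference MET value is known, the same score reduces to (METs - 1) × hours with no speed term.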

suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is written to a Firebase Realtime Database every 2 minutes. The app has two requirements:
1) It should alert the user when the temperature exceeds 33 degrees or drops below 23 degrees. This part is done.
2) It should alert the user when there is a big temperature fluctuation, after analysing the data every 30 minutes. This is the part I'm confused about.
I don't know what algorithm to use to detect a big temperature fluctuation over a period of time and alert the user. Can someone help me with this?

For a period of 30 minutes, your app would give you 15 values.
One way to detect a big change in this data is the following method:
Calculate the mean and the standard deviation of the values.
Subtract the mean from each value and take the absolute value of the result.
Check whether that absolute value is greater than one standard deviation; if it is, you have a big fluctuation.
See this example for better understanding:
Let's suppose you have these values for 10 minutes:
25, 27, 24, 35, 28
First step:
Mean = 27.8
One standard deviation ≈ 3.9 (population standard deviation)
Second step: Absolute(Data - Mean)
abs(25-27.8) = 2.8
abs(27-27.8) = 0.8
abs(24-27.8) = 3.8
abs(35-27.8) = 7.2
abs(28-27.8) = 0.2
Third step:
Check whether any of the differences is greater than the standard deviation.
abs(35-27.8) gives 7.2, which is greater than 3.9.
So, there is a big fluctuation. If all the differences are less than the standard deviation, then there is no fluctuation.
You can refine the result by using two or three standard deviations instead of one.
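The mean-and-standard-deviation check described above might look like this in Python. It is a sketch only: the function name find_fluctuations and the n_sigma parameter are illustrative, and the population standard deviation is an assumption (the sample standard deviation would give a slightly larger cut-off).

```python
from statistics import mean, pstdev


def find_fluctuations(readings, n_sigma=1.0):
    """Return the readings that lie more than n_sigma standard deviations from the mean."""
    mu = mean(readings)
    sigma = pstdev(readings)  # population standard deviation (assumption)
    return [x for x in readings if abs(x - mu) > n_sigma * sigma]


readings = [25, 27, 24, 35, 28]
print(find_fluctuations(readings))             # [35]
print(find_fluctuations(readings, n_sigma=2))  # []
```

In the app, the same function would be called on the 15 readings of each 30-minute window, and a non-empty result would trigger the alert.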

Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and current value? That's defensible.
If the max allowable absolute value is some multiple of your 33-23 = 10 degrees you're in business.
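If fluctuation is defined as the difference between consecutive readings, a sketch could be as short as the following. The names max_step and has_big_jump, and the 5-degree limit in the usage line, are hypothetical placeholders for whatever definition you settle on.

```python
def max_step(readings):
    """Largest absolute change between two consecutive readings."""
    return max(abs(b - a) for a, b in zip(readings, readings[1:]))


def has_big_jump(readings, max_allowed):
    """True if any consecutive pair of readings differs by more than max_allowed."""
    return max_step(readings) > max_allowed


readings = [25, 27, 24, 35, 28]
print(max_step(readings))                     # 11
print(has_big_jump(readings, max_allowed=5))  # True
```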

Some details about adjusting cascaded AdaBoost stage threshold

I have implemented the AdaBoost sequence algorithm and am currently trying to implement the so-called Cascaded AdaBoost, based on P. Viola and M. Jones' original paper. Unfortunately I have some doubts about adjusting the threshold for one stage. In the original paper, the procedure is described in literally one sentence:
Decrease threshold for the ith classifier until the current
cascaded classifier has a detection rate of at least
d × D_{i-1} (this also affects F_i)
I am unsure about three things:
What is the threshold? Is it the value of the whole expression 0.5 * sum(alpha), or only the 0.5 factor?
What should be the initial value of the threshold? (0.5?)
What does "decrease threshold" mean in detail? Do I need to iteratively select a new threshold, e.g. 0.5, 0.4, 0.3? What is the step size of the decrease?
I have tried to search for this on Google, but unfortunately I could not find any useful information.
Thank you for your help.

I had the exact same doubt and have not found any authoritative source so far. However, this is my best guess:
1. (0.5*sum(alpha)) is the threshold.
2. The initial value of the threshold is the one above. Next, classify the samples using the intermediate strong classifier (what you currently have). You'll get the score each sample attains, and depending on the current threshold, some of the positive samples will be classified as negative, etc. So, depending on the detection rate desired for this stage (strong classifier), reduce the threshold so that the required number of positive samples are correctly classified,
e.g.:
say the threshold was 10, and these are the current classifier outputs for the positive training samples:
9.5, 10.5, 10.2, 5.4, 6.7
and I want a detection rate of 80% => 80% of the above 5 samples classified correctly => 4 of them => set the threshold to 6.7
Clearly, changing the threshold also changes the FP rate, so update that, and if the desired FP rate for the stage is not reached, add another classifier to that stage.
I have not done a formal course on ada-boost etc, but this is my observation based on some research papers I tried to implement. Please correct me if something is wrong. Thanks!
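The threshold selection in the example above can be sketched as follows. This reflects the guess in this answer rather than anything confirmed by the paper; the function name stage_threshold is made up, and it assumes a sample counts as detected when its score is greater than or equal to the threshold.

```python
import math


def stage_threshold(positive_scores, detection_rate):
    """Highest threshold that still detects the required fraction of positive samples."""
    ranked = sorted(positive_scores, reverse=True)
    # Number of positives that must score at or above the threshold
    # (the small epsilon guards against floating-point round-up).
    needed = max(1, math.ceil(detection_rate * len(ranked) - 1e-9))
    return ranked[needed - 1]


scores = [9.5, 10.5, 10.2, 5.4, 6.7]
print(stage_threshold(scores, 0.8))  # 6.7
```

After lowering the threshold this way, the false positive rate of the stage has to be re-measured on the negative samples with the new threshold.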

I have found a Master thesis on real-time face detection by Karim Ayachi (pdf) in which he describes the Viola Jones face detection method.
As it is written in Section 5.2 (Creating the Cascade using AdaBoost), we can set the maximal threshold of the strong classifier to sum(alpha) and the minimal threshold to 0 and then find the optimal threshold using binary search (see Table 5.1 for pseudocode).
Hope this helps!

Clustering algorithm to cluster objects based on their relation weight

I have n words and their relatedness weights, which gives me an n*n matrix. I'm going to use this for a search algorithm, but the problem is that I need to cluster the entered keywords based on their pairwise relations. So let's say the keywords are {tennis,federer,wimbledon,london,police} and we have the following data from our weight matrix:
           tennis  federer  wimbledon  london  police
tennis     1       0.8      0.6        0.4     0.0
federer    0.8     1        0.65       0.4     0.02
wimbledon  0.6     0.65     1          0.08    0.09
london     0.4     0.4      0.08       1       0.71
police     0.0     0.02     0.09       0.71    1
I need an algorithm to cluster them into 2 clusters: {tennis,federer,wimbledon} {london,police}. Is there any known clustering algorithm that can deal with such a thing? I did some research; it appears that K-means is the most well-known clustering algorithm, but apparently K-means doesn't suit this case.
I would greatly appreciate any help.

You can treat it as a network clustering problem. With a recent version of mcl software (http://micans.org/mcl), you can do this (I've called your example fe.data).
mcxarray -data fe.data -skipr 1 -skipc 1 -write-tab fe.tab -write-data fe.mci -co 0 -tf 'gq(0)' -o fe.cor
# the above computes correlations (put in data file fe.cor) and a network (put in data file fe.mci).
# below proceeds with the network.
mcl fe.mci -I 3 -o - -use-tab fe.tab
# this outputs the clustering you expect. -I is the 'inflation parameter'. The latter affects
# cluster granularity. With the default parameter 2, everything ends up in a single cluster.
Disclaimer: I wrote mcl and a slew of associated network loading/conversion and analysis programs recently rebranded as 'mcl-edge'. They all come together in a single software package. Seeing your example made me curious whether it would be doable with mcl-edge, so I quickly tested it.

Consider DBSCAN. If it suits your needs, you might wish to take a closer look at an optimised version, TI-DBSCAN, which uses triangle inequality for reducing spatial query cost.
DBSCAN's advantages and disadvantages are discussed on Wikipedia. It splits the input data into a set of clusters whose cardinality isn't known a priori. You'd have to transform your similarity matrix into a distance matrix, for example by taking 1 - similarity as the distance.

Check this book on Information retrieval
http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html
it explains very well what you want to do

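Your weights are higher for more similar words and lower for more different words. A clustering algorithm requires similar points/words to be closer spatially and different words to be distant. You should change the matrix M into 1-M and then use any clustering method that accepts a pairwise distance matrix. Note that standard k-means needs coordinate vectors rather than a distance matrix, so it would require embedding the words as points first.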

If you've got a distance matrix, it seems a shame not to try http://en.wikipedia.org/wiki/Single_linkage_clustering. By hand, I think you get the following clustering:
((federer, tennis), wimbledon) (london, police)
The similarity for the link that joins the two main groups (either tennis-london or federer-london) is smaller than any of the similarities that build the two groups: london-police, tennis-federer, and federer-wimbledon: this characteristic is guaranteed by single linkage clustering, since it binds together closest clusters at each stage, and the two main groups are linked by the last binding found.
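The by-hand clustering above can be reproduced with a small pure-Python sketch of single linkage. The function name single_linkage and the stopping rule (merge until k clusters remain) are choices made for illustration; it works directly on similarities, merging the pair of clusters whose most similar members are closest.

```python
def single_linkage(names, sim, k):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


names = ["tennis", "federer", "wimbledon", "london", "police"]
sim = [
    [1.0, 0.8, 0.6, 0.4, 0.0],
    [0.8, 1.0, 0.65, 0.4, 0.02],
    [0.6, 0.65, 1.0, 0.08, 0.09],
    [0.4, 0.4, 0.08, 1.0, 0.71],
    [0.0, 0.02, 0.09, 0.71, 1.0],
]
print(single_linkage(names, sim, k=2))
# [['federer', 'tennis', 'wimbledon'], ['london', 'police']]
```

For larger inputs a library implementation of hierarchical clustering would be a better fit than this quadratic-per-merge loop.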

DBSCAN (see other answers) and successors such as OPTICS are clearly an option.
While the examples are on vector data, all that the algorithms need is a distance function. If you have a similarity matrix, it can trivially be converted into a distance function (for example, 1 - similarity).
The example data set is probably a bit too small for them to produce meaningful results. If you have just this little data, any "hierarchical clustering" should be feasible and do the job for you. You then just need to decide on the best number of clusters.

How do I measure variability of a benchmark comprised of many sub-benchmarks?

(Not strictly programming, but a question that programmers need answered.)
I have a benchmark, X, which is made up of a lot of sub-benchmarks x1..xn. It's quite a noisy test, with the results being quite variable. To benchmark accurately, I must reduce that "variability", which requires that I first measure it.
I can easily calculate the variability of each sub-benchmark, using perhaps standard deviation or variance. However, I'd like to get a single number which represents the overall variability.
My own attempt at the problem is:
sum = 0
foreach i in 1..n
    calculate mean across the 60 runs of x_i
    foreach j in 1..60
        sum += abs(mean[i] - x_i[j])
variability = sum / 60

Best idea: ask at the statistics Stack Exchange once it hits public beta (in a week).
In the meantime: you might actually be more interested in the extremes of variability, rather than the central tendency (mean, etc.). For many applications, I imagine that there's relatively little to be gained by incrementally improving the typical user experience, but much to be gained by improving the worst user experiences. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, plot the standard deviations all together. If they're approximately normally distributed, I don't know of any reason why you couldn't just take the mean.

I think you're misunderstanding the standard deviation -- if you run your test 50 times and have 50 different runtimes, the standard deviation will be a single number that describes how tightly or loosely those 50 numbers are distributed around your average. In conjunction with your average run time, the standard deviation will help you see how much spread there is in your results.
Consider the following run times:
12 15 16 18 19 21 12 14
The mean of these run times is 15.875. The sample standard deviation of this set is 3.27. There's a good explanation of what 3.27 actually means (in a normally distributed population, roughly 68% of the samples will fall within one standard deviation of the mean: e.g., between 15.875-3.27 and 15.875+3.27) but I think you're just looking for a way to quantify how 'tight' or 'spread out' the results are around your mean.
Now consider a different set of run times (say, after you compiled all your tests with -O2):
14 16 14 17 19 21 12 14
The mean of these run times is also 15.875. The sample standard deviation of this set is 3.0. (So, roughly 68% of the samples will fall within 15.875-3.0 and 15.875+3.0.) This set is more closely grouped than the first set.
And you have a single number that summarizes how compact or loose a group of numbers is around the mean.
Caveats
The 68% rule above relies on the assumption of a normal distribution -- but your run times may not be normally distributed, so please be aware that standard deviation may be a rough guideline at best. Plot your run times in a histogram to see if your data looks roughly normal or uniform or multimodal or...
Also, I'm using the sample standard deviation because these are only a sample out of the population space of benchmark runs. I'm not a professional statistician, so even this basic assumption may be wrong. Either population standard deviation or sample standard deviation will give you good enough results in your application IFF you stick to either sample or population. Don't mix the two.
I mentioned that the standard deviation in conjunction with the mean will help you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very dispersed, and perhaps your process is not very repeatable. Interpreting a 3% speedup in the face of a large standard deviation is nearly useless, as you've recognized. And the best judge (in my experience) of the magnitude of the standard deviation is the magnitude of the average.
Last note: yes, you can calculate standard deviation by hand, but it is tedious after the first ten or so. Best to use a spreadsheet or wolfram alpha or your handy high-school calculator.
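Instead of a spreadsheet, the two sets of run times above can also be checked with Python's standard statistics module. This is just a convenience sketch; stdev is the sample standard deviation and pstdev would be the population one.

```python
from statistics import mean, stdev

first = [12, 15, 16, 18, 19, 21, 12, 14]
second = [14, 16, 14, 17, 19, 21, 12, 14]

for runs in (first, second):
    # Same mean for both sets, but a different spread around it.
    print("mean = %.3f, sample stddev = %.2f" % (mean(runs), stdev(runs)))
# mean = 15.875, sample stddev = 3.27
# mean = 15.875, sample stddev = 3.00
```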

From Variance:
"the variance of the total group is equal to the mean of the variances of the subgroups, plus the variance of the means of the subgroups."
I had to read that several times, then run it: 464 from this formula == 464, the standard deviation of all the data -- the single number you want.
#!/usr/bin/env python3
import sys
import numpy as np

N = 10
exec("\n".join(sys.argv[1:]))  # optional overrides: this.py N=...
np.set_printoptions(precision=1, threshold=100, suppress=True)  # .1f
np.random.seed(1)

data = np.random.exponential(size=(N, 60)) ** 5  # N rows, 60 cols
row_avs = np.mean(data, axis=-1)  # average of each row
row_devs = np.std(data, axis=-1)  # spread (stddev) of each row about its average
print("row averages:", row_avs)
print("row devs:", row_devs)
print("average row dev: %.3g" % np.mean(row_devs))

# http://en.wikipedia.org/wiki/Variance:
# variance of the total group
# = mean of the variances of the subgroups + variance of the means of the subgroups
avvar = np.mean(row_devs ** 2)
varavs = np.var(row_avs)
print("sqrt total variance: %.3g = sqrt( av var %.3g + var avs %.3g )" % (
    np.sqrt(avvar + varavs), avvar, varavs))
var_all = np.var(data)  # std^2 of all N x 60 values about the average of the lot
print("sqrt variance all: %.3g" % np.sqrt(var_all))
row averages: [ 49.6 151.4 58.1 35.7 59.7 48. 115.6 69.4 148.1 25. ]
row devs: [ 244.7 932.1 251.5 76.9 201.1 280. 513.7 295.9 798.9 159.3]
average row dev: 375
sqrt total variance: 464 = sqrt( av var 2.13e+05 + var avs 1.88e+03 )
sqrt variance all: 464
To see how group variance increases, run the example in Wikipedia Variance.
Say we have
60 men of heights 180 +- 10, exactly 30: 170 and 30: 190
60 women of heights 160 +- 7, 30: 153 and 30: 167.
The average standard dev is (10 + 7) / 2 = 8.5 .
Together though, the heights
-------|||----------|||-|||-----------------|||---
       153          167 170                 190
spread like 170 +- 13.2, much greater than 170 +- 8.5.
Why ? Because we have not only the spreads men +- 10 and women +- 7,
but also the spreads from 160 / 180 about the common mean 170.
Exercise: compute the spread 13.2 in two ways,
from the formula above, and directly.
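One way to check the exercise, sketched with the standard statistics module. It uses population variances throughout and relies on the two groups being the same size, which is what makes the plain (unweighted) means in the formula valid.

```python
from math import sqrt
from statistics import mean, pvariance

men = [170] * 30 + [190] * 30    # 180 +- 10
women = [153] * 30 + [167] * 30  # 160 +- 7

# Way 1: mean of the subgroup variances + variance of the subgroup means.
mean_of_variances = mean([pvariance(men), pvariance(women)])
variance_of_means = pvariance([mean(men), mean(women)])
spread_formula = sqrt(mean_of_variances + variance_of_means)

# Way 2: directly, over all 120 heights.
spread_direct = sqrt(pvariance(men + women))

print(round(spread_formula, 1), round(spread_direct, 1))  # 13.2 13.2
```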

This is a tricky problem because benchmarks can be of different natural lengths anyway. So, the first thing you need to do is to convert each of the individual sub-benchmark figures into scale-invariant values (e.g., “speed up factor” relative to some believed-good baseline) so that you at least have a chance to compare different benchmarks.
Then you need to pick a way to combine the figures. Some sort of average. There are, however, many types of average. We can reject the use of the mode and the median here; they throw away too much relevant information. But the different kinds of mean are useful because of the different ways they give weight to outliers. I used to know (but have forgotten) whether it was the geometric mean or the harmonic mean that was most useful in practice (the arithmetic mean is less good here). The geometric mean is basically an arithmetic mean in the log-domain, and a harmonic mean is similarly an arithmetic mean in the reciprocal-domain. (Spreadsheets make this trivial.)
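To make the log-domain and reciprocal-domain remark concrete, here is a small sketch. The speed-up factors are made-up numbers for illustration, and the helper names are not from any library.

```python
import math
from statistics import mean


def geometric_mean(values):
    """Arithmetic mean taken in the log domain, then mapped back."""
    return math.exp(mean([math.log(v) for v in values]))


def harmonic_mean(values):
    """Arithmetic mean taken in the reciprocal domain, then mapped back."""
    return 1 / mean([1 / v for v in values])


speedups = [1.0, 2.0, 4.0]  # hypothetical speed-up factors of three sub-benchmarks
print(round(mean(speedups), 3))            # 2.333
print(round(geometric_mean(speedups), 3))  # 2.0
print(round(harmonic_mean(speedups), 3))   # 1.714
```

The three means rank the same data differently: the arithmetic mean is pulled up most by the large outlier, the harmonic mean least.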
Now that you have a means to combine the values for a run of the benchmark suite into something suitably informative, you've then got to do lots of runs. You might want to have the computer do that while you get on with some other task. :-) Then try combining the values in various ways. In particular, look at the variance of the individual sub-benchmarks and the variance of the combined benchmark number. Also consider doing some of the analyses in the log and reciprocal domains.
Be aware that this is a slow business that is difficult to get right and it's usually uninformative to boot. A benchmark only does performance testing of exactly what's in the benchmark, and that's mostly not how people use the code. It's probably best to consider strictly time-boxing your benchmarking work and instead focus on whether users think the software is perceived as fast enough or whether required transaction rates are actually attained in deployment (there are many non-programming ways to screw things up).
Good luck!

You are trying to solve the wrong problem. Rather than measuring the variability, try to minimize it. The differences can be caused by caching.
Try running the code on a single (same) core, using the SetThreadAffinityMask() function on Windows.
Drop the first measurement.
Increase the thread priority.
Disable hyperthreading.
If you have many conditional jumps, they can introduce visible differences between calls with different input. (This could be solved by giving exactly the same input for the i-th iteration, and then comparing the measured times between these iterations.)
You can find here some useful hints: http://www.agner.org/optimize/optimizing_cpp.pdf
