Is the Haar-feature threshold calculated only in the way Viola and Jones describe in their paper?

I am implementing the Viola-Jones face detection algorithm and am a bit confused about the Haar-feature threshold. I am calculating the threshold of a Haar feature using the following steps:
a) Calculate the Haar-feature value at the same position in all positive (face) images.
b) Collect all feature values that lie between the minimum feature value and the average feature value into a list, MinToAvg = [].
c) For each value in MinToAvg, classify the data (positive and negative) and record the number of positive images classified as faces (Pos) and the number of false positives (FP).
d) The feature value for which max(Pos - FP) is obtained is taken as the threshold for that particular feature (see the sketch below).
With this approach the threshold of a Haar feature remains the same for every round of boosting; in contrast, the threshold of a Haar feature as discussed in the Viola-Jones paper changes with every round of boosting.
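For concreteness, here is a minimal Python sketch of steps a)-d) as I currently do them. The array names are mine, and it assumes the feature "fires" when its value is at or above the threshold (in practice a polarity bit would be learned as well):
import numpy as np

def min_to_avg_threshold(pos_values, neg_values):
    # pos_values / neg_values: the feature value at one fixed position,
    # computed for every positive (face) and negative (non-face) image.
    lo, avg = pos_values.min(), pos_values.mean()
    # b) candidate thresholds: feature values between the minimum and the average
    min_to_avg = [v for v in pos_values if lo <= v <= avg]
    best_thr, best_score = None, -np.inf
    for thr in min_to_avg:
        # c) classify everything with value >= thr as a face
        pos = np.sum(pos_values >= thr)   # positives classified as faces
        fp = np.sum(neg_values >= thr)    # negatives classified as faces
        # d) keep the threshold with the largest Pos - FP
        if pos - fp > best_score:
            best_score, best_thr = pos - fp, thr
    return best_thr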
My Question is:
1) Am I calculating the Haar-feature threshold the right way?
2) Does the threshold need to change for each round of boosting?
I am using Python.
Thanks!

@Ramiro, user2766019: I just have one doubt about your previous comment. How do the training weights affect the threshold? Aren't the thresholds decided only by using the feature value of each training sample and then calculating the error for each of those feature values with the equation:
e = min(S+ + (T- - S-), S- + (T+ - S+))
I mean the threshold of the weak classifier (a single feature), which is used to calculate the error as the sum of the weights of the misclassified samples. Am I right, or did I go wrong somewhere? Thank you.
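Regarding the weights: here is a minimal sketch (my own, not code from the paper) of the weighted threshold search for one feature, using the error formula above. values, labels and weights are assumed to be NumPy arrays over all training samples; weights are the current AdaBoost sample weights.
import numpy as np

def best_threshold(values, labels, weights):
    # values : feature value of every training sample
    # labels : 1 for face, 0 for non-face
    # weights: current AdaBoost sample weights (normalised to sum to 1)
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]

    t_pos = np.sum(w[y == 1])          # T+: total weight of positives
    t_neg = np.sum(w[y == 0])          # T-: total weight of negatives
    s_pos = np.cumsum(w * (y == 1))    # S+: positive weight at or below each value
    s_neg = np.cumsum(w * (y == 0))    # S-: negative weight at or below each value

    err_below = s_pos + (t_neg - s_neg)  # label everything below as non-face
    err_above = s_neg + (t_pos - s_pos)  # label everything below as face
    e = np.minimum(err_below, err_above)

    i = np.argmin(e)                   # threshold with the smallest weighted error
    polarity = 1 if err_below[i] <= err_above[i] else -1
    return v[i], polarity, e[i]
Because AdaBoost re-weights the samples after every round, the weighted sums S+, S-, T+ and T- change even though the feature values themselves do not, so the threshold (and polarity) that minimise e for a given feature generally change from one boosting round to the next. That is why the threshold is re-selected in every round rather than fixed once and for all.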

Related

How to calculate the standard errors of ERGM predicted probabilities?

I'm having trouble estimating the standard errors of the predicted probabilities from an ERGM model in order to calculate a confidence interval. Getting the predicted probabilities is not a problem, but I want to get a sense of the uncertainty surrounding the predictions.
Below is a reproducible example based on the data set of marriage and business ties among Renaissance Florentine families.
library(statnet)
data(flo)
flomarriage <- network(flo, directed = FALSE)  # undirected marriage network
flomarriage
flomarriage %v% "wealth" <- c(10,36,27,146,55,44,20,8,42,103,48,49,10,48,32,3)  # attach wealth as a vertex attribute
flomarriage
gest <- ergm(flomarriage ~ edges + absdiff("wealth"))  # edges term plus absolute wealth difference
summary(gest)
plogis(coef(gest)[['edges']] + coef(gest)[['absdiff.wealth']]*10)  # predicted tie probability at a wealth difference of 10
Based on the model, it is estimated that a wealth difference of 10 corresponds to a 0.182511 probability of a tie. My first question is: is this a correct interpretation? And my second question is: how does one calculate the standard error of this probability?
This is the correct interpretation for a dyad-independent model such as this one. For a dyad-dependent model, it would be the conditional probability given the rest of the network.
You can obtain the standard error of the prediction on the logit scale by rewriting your last line as a dot product of a weight vector and the coefficient vector:
eta <- sum(c(1,10)*coef(gest))
plogis(eta)
then, since vcov(gest) is the covariance matrix of parameter estimates, and using the variance formulas (e.g., http://www.stat.cmu.edu/~larry/=stat401/lecture-13.pdf),
(var.eta <- c(1,10)%*%vcov(gest)%*%c(1,10))
You can then get the variance (and the standard error) of the predicted probability using the Delta Method (e.g., https://blog.methodsconsultants.com/posts/delta-method-standard-errors/). However, unless you require the standard error as such, as opposed to a confidence interval, my advice would be to first calculate the interval for eta (i.e., eta + qnorm(c(0.025,0.0975))*sqrt(var.eta)), then call plogis() on the endpoints.
Really very helpful answer! Many thanks for it.
Here's just a small, almost cosmetic note about the suggested formula: 1) the upper bound in qnorm should be 0.975, not 0.0975, and 2) use matrix multiplication before sqrt --> eta + qnorm(c(0.025,0.975))%*%sqrt(var.eta).
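To make the arithmetic above easy to check outside R, here is a minimal NumPy sketch of the same logit-scale interval. The names beta and V are mine; they are assumed to hold the coefficient vector and covariance matrix exported from the fitted model (coef(gest) and vcov(gest)).
import numpy as np
from scipy.stats import norm
from scipy.special import expit   # inverse logit, the analogue of R's plogis

def tie_probability_ci(beta, V, wealth_diff=10.0, level=0.95):
    # beta: coefficient vector [edges, absdiff.wealth]; V: its 2x2 covariance matrix
    x = np.array([1.0, wealth_diff])        # weight vector for the prediction
    eta = x @ beta                          # linear predictor on the logit scale
    se_eta = np.sqrt(x @ V @ x)             # standard error of eta
    z = norm.ppf([(1 - level) / 2, (1 + level) / 2])
    lo, hi = expit(eta + z * se_eta)        # transform the interval endpoints
    return expit(eta), (lo, hi)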

KMeans evaluation metric not converging. Is this normal behavior or no?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the Calinski-Harabasz score, but some score metrics are monotone increasing or decreasing with respect to increasing K. For instance, the mean squared error in linear regression always decreases each time a new feature is added to the model, which is why scores that penalise an increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K against the score and choose the K beyond which the score is no longer improving 'much'. This is subjective, but it can still give good results; see the sketch below.
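For instance, a minimal scikit-learn sketch of this approach (X is assumed to be your 50K-by-8 data matrix; note that scikit-learn's current spelling is calinski_harabasz_score, and for this score higher is better):
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def ch_curve(X, k_values=range(2, 21)):
    # Fit KMeans for each K and collect the Calinski-Harabasz score.
    scores = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores.append(calinski_harabasz_score(X, labels))
    return list(k_values), scores

# Usage: eyeball the curve and pick the K where the gain flattens out.
# import matplotlib.pyplot as plt
# ks, scores = ch_curve(X)
# plt.plot(ks, scores, marker="o"); plt.xlabel("K"); plt.ylabel("CH score"); plt.show()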
SUMMARY
The metric decreases with each increase of K; this strongly suggests that there is no natural clustering in the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
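As a rough, hand-rolled sketch of the sum-of-squares metric described above (with scikit-learn, the raw per-cluster sums are closely related to the fitted model's inertia_):
import numpy as np

def mean_within_cluster_ss(X, labels, centroids):
    # Sum of squared distances to the centroid, divided by n-1
    # (n = cluster population), averaged over all clusters.
    per_cluster = []
    for k, c in enumerate(centroids):
        pts = X[labels == k]
        n = len(pts)
        if n > 1:
            per_cluster.append(np.sum((pts - c) ** 2) / (n - 1))
    return np.mean(per_cluster)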
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabasz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed from the ratio of between-cluster dispersion to within-cluster dispersion; you want to maximize the former (the numerator) and minimize the latter (the denominator). As it turned out, in this dataset the 'best' CH score was obtained with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.

What is a bad, decent, good, and excellent F1-measure range?

I understand that the F1-measure is the harmonic mean of precision and recall, but what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), which is a classifier that makes random guesses (i.e. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its F1 score.
In a binary classification problem with a balanced dataset (6198 samples in total: 3099 labelled 0 and 3099 labelled 1), the F1 score is 0.5 for both classes, and the weighted average is 0.5.
In a second example, using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case), the average of the F1 scores is 0.33, while the F1 score for label 0 is 0.00.
I consider these to be bad F1 scores, given the balanced dataset.
PS: the summaries were generated using sklearn.metrics.classification_report.
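For reference, a minimal sketch that reproduces this kind of baseline report on synthetic, balanced labels (the exact numbers will vary with the random guesses):
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

rng = np.random.RandomState(0)
X = rng.randn(6198, 5)              # the features are irrelevant to a dummy classifier
y = np.array([0, 1] * 3099)         # balanced labels: 3099 zeros, 3099 ones

for strategy in ("uniform", "constant"):
    extra = {"constant": 1} if strategy == "constant" else {}
    clf = DummyClassifier(strategy=strategy, random_state=0, **extra)
    clf.fit(X, y)
    print(strategy)
    print(classification_report(y, clf.predict(X), zero_division=0))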
You did not find any reference for an F1-measure range because there is no such range. The F1-measure is a combined metric of precision and recall.
Say you have two algorithms, one with higher precision and the other with higher recall. From this observation alone you cannot tell which algorithm is better, unless your goal is to maximize precision.
So, given this ambiguity about how to select the superior algorithm of the two (one with higher recall, the other with higher precision), we use the F1-measure to choose between them.
The F1-measure is a relative measure; that is why there is no absolute range that defines how good your algorithm is.

How to check for Page Rank convergence?

I am writing a small piece of (sequential) code to calculate PageRank for a modest dataset (although not a completely trivial one).
The algorithm goes like this:
while ( not converged ) {
// Do a bunch of things to calculate PR
}
I am clear on the algorithm apart from the 'convergence' criterion. What is the best way to check whether the algorithm has converged? Should I:
keep a copy of every node's PR from one iteration and then check that every node's PR in the next iteration has the same value?
This seems highly inefficient to me. Is this the right way to do it?
For each node, take the difference in score between the current iteration and the previous one; if this error falls below a certain threshold for every node, the graph has converged.
The paper for TextRank describes this quite well:
Starting from arbitrary values assigned to each node in the graph, the computation iterates until convergence below a given threshold is achieved.
Convergence is achieved when the error rate for any vertex in the graph falls below a given threshold. The error rate of a vertex is defined as the difference between the “real” score of the vertex S(Vi) and the score computed at iteration k, S^k(Vi). Since the real score is not known a priori, this error rate is approximated with the difference between the scores computed at two successive iterations: S^(k+1)(Vi) - S^(k)(Vi).
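In code, the per-node check amounts to something like the following minimal sketch (my own; graph is assumed to be a dict mapping each node to the list of nodes it links to, and dangling nodes are ignored for brevity):
def pagerank(graph, damping=0.85, threshold=1e-6, max_iter=100):
    nodes = list(graph)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    # precompute incoming links once
    incoming = {v: [u for u in nodes if v in graph[u]] for v in nodes}

    for _ in range(max_iter):
        new_pr = {
            v: (1 - damping) / n
               + damping * sum(pr[u] / len(graph[u]) for u in incoming[v])
            for v in nodes
        }
        # converged when the largest per-node change is below the threshold
        if max(abs(new_pr[v] - pr[v]) for v in nodes) < threshold:
            return new_pr
        pr = new_pr
    return pr
Note that this only keeps one previous score vector around, so the convergence check costs a single O(n) pass per iteration; it is not as inefficient as it might sound.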

RANSAC variation: does an inlier membership probability distribution make sense?

I'm using RANSAC to fit a geometric model to a point cloud with outliers. I know, because of the process that generated the point cloud, that 99.9% of the inlier distances to my model are distributed following a Gaussian probability density function with known μ and σ, within the interval [μ-3σ, μ+3σ].
The first question is whether you think it is reasonable to evaluate the total number of inliers for a certain model by summing the inlier membership probabilities instead of adding 1 for each inlier. That is, traditional RANSAC assumes that everything that lies in an interval delimited by a threshold is an inlier; I would like to know whether I can relax that, giving some inliers more weight than others according to a probability distribution.
If this is reasonable, the second question is: how do you think it affects the number of samples N in
1 - (1 - (1 - e)^s)^N = p
where e is the probability that a point is an outlier, s is the number of points used in a sample, N is the number of samples (RANSAC iterations), and p is the desired probability that we draw at least one good sample?
If none of that is reasonable, how do you suggest I introduce my prior information about the inlier distribution?
Thanks in advance,
Federico
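Not an answer, but to make the first idea concrete, here is a minimal sketch of the probability-weighted scoring described above. The helper residuals(model, points) is hypothetical and stands for whatever returns the signed point-to-model distances; mu and sigma are the known inlier parameters.
import numpy as np
from scipy.stats import norm

def soft_inlier_score(model, points, mu, sigma, residuals):
    # Classic RANSAC: every point with |r - mu| <= 3*sigma counts as 1 inlier.
    # Proposed variant: weight each such point by its Gaussian membership,
    # normalised so a residual exactly at mu contributes 1.
    r = residuals(model, points)
    in_band = np.abs(r - mu) <= 3 * sigma
    weights = norm.pdf(r, loc=mu, scale=sigma) / norm.pdf(mu, loc=mu, scale=sigma)
    hard_count = np.sum(in_band)              # traditional inlier count
    soft_count = np.sum(weights[in_band])     # probability-weighted count
    return hard_count, soft_count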
