Adjust predicted probability after SMOTE

I have an imbalanced data set, and I used SMOTE to oversample the minority class combined with undersampling of the majority class.
Now I want to check the test AUC using the model's predict_proba.
I have two questions:
1. Do I have to correct the probabilities if I am comparing AUCs?
2. How can I correct them after a combination of undersampling and oversampling?

(1) The good news is no, you don't have to correct when comparing AUCs. The resampling correction is a strictly increasing function of the uncorrected score, so it doesn't change the ranking of cases, and therefore the ROC curve, and hence the AUC, is exactly the same.
(2) There is a simple closed-form formula for correcting after under/over-sampling: re-weight the predicted odds by the ratio of the original to the resampled prior odds (a sketch is given below).
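For concreteness, here is a minimal sketch of that prior-shift correction in Python, assuming the resampling only changed the class proportions (with SMOTE the synthetic points make this approximate); the function name and the example priors are illustrative:

    import numpy as np

    def correct_probabilities(p_resampled, prior_original, prior_resampled):
        """Adjust positive-class probabilities for a change in class prior.

        p_resampled     : probabilities from the model trained on resampled data
        prior_original  : positive-class fraction in the original data
        prior_resampled : positive-class fraction in the resampled training data
        """
        # Re-weight the odds by the ratio of original to resampled priors.
        num = p_resampled * (prior_original / prior_resampled)
        den = num + (1.0 - p_resampled) * ((1.0 - prior_original) / (1.0 - prior_resampled))
        return num / den

    # Model trained on 50/50 resampled data, true positive rate 5%:
    # a predicted score of 0.5 maps back to the 0.05 base rate.
    print(correct_probabilities(np.array([0.2, 0.5, 0.9]), 0.05, 0.5))

Because this mapping is monotone, sorting by corrected or uncorrected scores gives identical AUCs, which is why the answer to (1) is no.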
Further discussion is best suited to stats.stackexchange.com.

Related

Method to choose overall winner across multiple categories

I have four numeric variables. All of them are measures of soil quality: the higher the variable, the higher the quality. The range of each is different:
Var1 from 1 to 10
Var2 from 1000 to 2000
Var3 from 150 to 300
Var4 from 0 to 5
I need to combine the four variables into a single soil quality score that will successfully rank-order the samples.
My idea is very simple: standardize all four variables, sum them up, and whatever you get is the score, which should rank-order. Do you see any problem with applying this approach? Is there any other (better) approach that you would recommend?
Thanks
Edit:
Thanks guys. A lot of discussion went into "domain expertise"... agriculture stuff... whereas I expected more stats talk. In terms of technique, I will probably use simple z-score summation plus logistic regression as an experiment. Because the vast majority of samples (90%) has poor quality, I'm going to combine 3 quality categories into one and basically have a binary problem (some quality vs. no quality). I kill two birds with one stone: I increase my sample in terms of event rate, and I make use of experts by getting them to classify my samples. The expert-classified samples will then be used to fit a logistic regression model to maximize the level of concordance/discordance with the experts... How does that sound to you?
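For what it's worth, a minimal sketch of that plan in Python (z-score summation alongside a logistic regression fitted to expert labels); the column names and the simulated data are placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "Var1": rng.uniform(1, 10, 200),
        "Var2": rng.uniform(1000, 2000, 200),
        "Var3": rng.uniform(150, 300, 200),
        "Var4": rng.uniform(0, 5, 200),
    })
    df["expert_label"] = rng.binomial(1, 0.1, 200)   # stand-in for expert ratings

    X = df[["Var1", "Var2", "Var3", "Var4"]]
    z = (X - X.mean()) / X.std()                     # z-score each variable

    df["zsum_score"] = z.sum(axis=1)                 # simple summation score

    # Logistic regression against the expert labels; its coefficients act as
    # data-driven weights and predict_proba gives an alternative ranking.
    logreg = LogisticRegression().fit(z, df["expert_label"])
    df["logreg_score"] = logreg.predict_proba(z)[:, 1]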
The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:
It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.
Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.
There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.
The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?
The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?
Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?
Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
One would hope that soils science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, it will let you compute the coefficients. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.
Has anyone looked at Russell G. Congalton, 'Review of Assessing the Accuracy of Classifications of Remotely Sensed Data' (1990)? It describes a technique known as the error matrix for comparing varying matrices, and also a term he uses called 'normalizing data', whereby one takes all the different vectors and 'normalizes', or rescales, them to equal ranges from 0 to 1.
One other thing you did not discuss is the scale of the measurements. Var1 and Var4 look like they are on a rank-order scale and the others seem not, so standardization may be skewing the score. You may be better off transforming all of the variables into ranks and determining a weighting for each variable, since it is highly unlikely that they carry equal weight; equal weighting is more of a "know nothing" default. You might want to do some correlation or regression analysis to come up with some a priori weights, as in the sketch below.
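A minimal sketch of the rank-transform idea (the data and the weights are placeholders; in practice the weights would come from correlation/regression analysis or expert input):

    import numpy as np
    from scipy.stats import rankdata

    rng = np.random.default_rng(0)
    # Placeholder samples of Var1..Var4 on their stated ranges.
    X = np.column_stack([
        rng.uniform(1, 10, 100),
        rng.uniform(1000, 2000, 100),
        rng.uniform(150, 300, 100),
        rng.uniform(0, 5, 100),
    ])

    # Rank each variable within the sample so the differing scales drop out,
    # then combine with (illustrative) unequal weights.
    ranks = np.apply_along_axis(rankdata, 0, X)
    weights = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical a priori weights
    score = ranks @ weights                     # higher score = higher quality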
I had a similar problem recently and thought I would add my approach to the nice answers. To find a simple way to determine which variables lead to the best ranking, one could turn your problem into a grid-search approach.
Basically, use a combined score for the ranking which is composed as:
Final_score = Var1 * A + Var2 * B + Var3 * C + ...
Then you can compute the final score with different values of A, B, C (sklearn's grid search could be used) and compare the resulting ranking to an expected ranking (some ground truth is needed to determine the goodness of your ranking). The best parameters become the weights of your individual variables, as in the sketch below.
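A minimal sketch of that grid search, assuming an expert-provided ranking is available as ground truth (everything here, including the grid of weights, is illustrative):

    import numpy as np
    from itertools import product
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))        # columns stand in for Var1..Var3
    expert_rank = rng.permutation(50)       # placeholder expert ranking of the samples

    best_weights, best_corr = None, -np.inf
    for a, b, c in product(np.linspace(0, 1, 11), repeat=3):
        if a == b == c == 0:
            continue                        # a constant score has undefined correlation
        score = X @ np.array([a, b, c])
        corr, _ = spearmanr(score, expert_rank)
        if corr > best_corr:
            best_weights, best_corr = (a, b, c), corr

    print(best_weights, best_corr)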
Following up on Ralph Winters' answer, you might use PCA (principal component analysis) on the matrix of suitably standardized scores. This will give you a "natural" weight vector that you can use to combine future scores.
Do this also after all scores have been transformed into ranks. If the results are very similar, you have good reasons to continue with either method. If there are discrepancies, this will lead to interesting questions and a better understanding.
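A minimal sketch of the PCA weighting (placeholder data; the sign flip is there because the first component is only defined up to sign):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.uniform(1, 10, 100),
        rng.uniform(1000, 2000, 100),
        rng.uniform(150, 300, 100),
        rng.uniform(0, 5, 100),
    ])

    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=1).fit(Z)

    # Loadings of the first component give a "natural" weight vector; flip the
    # sign if needed so that a higher score corresponds to higher quality.
    weights = pca.components_[0]
    if weights.sum() < 0:
        weights = -weights
    score = Z @ weights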

How to judge performance of algorithms for Text Clustering?

I am using the K-Means algorithm for text clustering, with initial seeding by K-Means++.
I try to make the algorithm more effective with some changes, such as changing the stop-word dictionary and increasing max_no_of_random_iterations.
I get different results. How do I compare them? I could not apply the idea of a confusion matrix here: the output is not in the form of a document getting some value or tag; a document goes to a set. It is just the relative "goodness" of the clustering, or the resulting sets, that matters.
So is there some standard way of measuring the performance for this output set?
If a confusion matrix is the answer, please explain how to apply it.
Thanks.
You could decide in advance how to measure the quality of the clusters, for example by counting how many are empty, or with statistics like the within-cluster sum of squares (see the sketch below).
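A minimal sketch with scikit-learn, assuming TF-IDF features (the tiny corpus is a placeholder); scikit-learn exposes the within-cluster sum of squares as inertia_:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["the cat sat on the mat",
            "dogs and cats are pets",
            "stock markets fell sharply today",
            "investors worry about interest rates"]   # placeholder corpus

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

    # Within-cluster sum of squares; lower means tighter clusters for the same k.
    print("WSS:", km.inertia_)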
This paper says:
"... three distinctive approaches to cluster validity are possible. The first approach relies on external criteria that investigate the existence of some predefined structure in clustered data set. The second approach makes use of internal criteria and the clustering results are evaluated by quantities describing the data set such as proximity matrix etc. Approaches based on internal and external criteria make use of statistical tests and their disadvantage is high computational cost. The third approach makes use of relative criteria and relies on finding the best clustering scheme that meets certain assumptions and requires predefined input parameters values"
Since clustering is unsupervised, you are asking for something difficult. I suggest researching how people cluster using genetic algorithms and seeing what fitness criteria they use.
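To make the external/internal distinction from the quoted paper concrete, here is a small scikit-learn sketch (the corpus and labels are placeholders); the silhouette score is an internal criterion, while the adjusted Rand index is an external one that needs labelled data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    docs = ["cats purr softly", "dogs bark loudly",
            "stocks fell today", "markets rallied strongly"]   # placeholder corpus
    true_labels = [0, 0, 1, 1]                                 # only if labels exist

    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print("silhouette (internal):", silhouette_score(X, labels))
    print("ARI (external):", adjusted_rand_score(true_labels, labels))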

Edge detection : Any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of the algorithm. If anyone could give me a reference or a method for how to proceed, it would be really helpful. :)
I do not have ground truth, and the data set includes color as well as grayscale images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc.). Calculate stats on your algo vs. reference algo performance, in terms of (a) the number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positives), or (b) the number of points which the reference algo classifies as edge but your algo does not (false negatives). You can also calculate a rank-correlation-type number for algos by looking at each point and seeing which algos do (or don't) classify it as an edge. (A sketch of this comparison follows these suggestions.)
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
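A minimal OpenCV sketch of the reference-comparison idea (the file name, the choice of Canny as the reference, and the Sobel threshold are all placeholders):

    import numpy as np
    import cv2

    img = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)   # placeholder image

    # Two reference edge maps, combined by "K out of N" voting (here K = N = 2).
    ref1 = cv2.Canny(img, 50, 150) > 0
    ref2 = cv2.Canny(img, 100, 200) > 0
    reference = ref1 & ref2

    # Edge map from the algorithm under test (a thresholded Sobel magnitude
    # stands in for it here).
    sobel = np.hypot(cv2.Sobel(img, cv2.CV_64F, 1, 0),
                     cv2.Sobel(img, cv2.CV_64F, 0, 1))
    mine = sobel > np.percentile(sobel, 95)

    false_pos = np.sum(mine & ~reference)   # my edges the references reject
    false_neg = np.sum(~mine & reference)   # reference edges I miss
    print(false_pos, false_neg)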
Good luck!
For comparisons, quantitative measures like the ones @Alex I explained are best. To do so, you need to define what "correct" means with a ground-truth set, plus a way to consistently determine whether a given image is correct or, at a more granular level, how correct it is (some number like a percentage). @Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
By a user study I mean polling people on how good a result is given the input image. You could give them a scale to rate things on and randomly show them samples from both your results and the results of another algorithm.
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).

Estimate good parameters for Algorithms with lots of arguments (Like for MSER in OpenCV)

I was wondering if there is a better way to estimate a good set of parameters for algorithms with lots of arguments than just randomly picking them. Specifically, I am trying to find some good parameters for the MSER feature detector, which takes 9 numeric parameters, so there is a huge space to search. I was thinking of alternately picking smaller and larger numbers around each default parameter value, with exponentially growing distance. Are there any good ideas that could help me?
Thanks!
First, you must define an objective function you want to optimize: what defines "better" parameters? In your case, I'd suggest using the number of correct matches found, or something similar.
Second, you must have an efficient way of looping over the virtually uncountable possibilities. Here it probably helps that there is a minimal step size beyond which the results don't meaningfully change. Since the objective function is not necessarily differentiable, I'd use a method similar to golden-section search in each dimension separately, and then repeat, until hopefully a global "good enough" optimum is reached.
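A minimal coordinate-wise sketch of that idea (the objective, the starting parameter vector, and the bounds are all placeholders; SciPy's bounded scalar minimizer stands in for a hand-rolled golden-section search):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def objective(params):
        # Placeholder: in practice, return a loss such as the negative number
        # of correct MSER matches obtained with these parameter values.
        return float(np.sum((params - np.linspace(1, 9, 9)) ** 2))

    params = np.full(9, 5.0)                  # illustrative starting point
    bounds = [(0.0, 10.0)] * 9                # illustrative search ranges

    for sweep in range(3):                    # a few coordinate sweeps
        for i, (lo, hi) in enumerate(bounds):
            def f(x, i=i):
                trial = params.copy()
                trial[i] = x                  # vary one parameter at a time
                return objective(trial)
            res = minimize_scalar(f, bounds=(lo, hi), method="bounded")
            params[i] = res.x

    print(params)                             # approaches 1..9 for this dummy objective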

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian mixture model (GMM) along with a scoring function (e.g., the Bayesian information criterion, BIC) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new data point x, you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
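A minimal scikit-learn sketch of the same GMM + BIC recipe (MCLUST in R automates this; here the data, the candidate component counts, and the 1% density threshold are placeholders):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # A year of daily values from the black box (placeholder, two regimes).
    history = np.concatenate([rng.normal(10, 1, 200), rng.normal(50, 2, 165)])
    X = history.reshape(-1, 1)

    # Fit mixtures with 1..5 components and keep the one with the lowest BIC.
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, 6)]
    best = min(models, key=lambda m: m.bic(X))

    # score_samples gives the log density under the fitted mixture; a new point
    # with very low density relative to the history is flagged as abnormal.
    threshold = np.quantile(best.score_samples(X), 0.01)
    new_point = np.array([[200.0]])
    print(best.score_samples(new_point)[0] < threshold)   # True -> abnormal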
You could try trend-line prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you can calculate the standard deviation of the points around the line.
If the new point lies within the trend line ± that standard deviation (or a small multiple of it), it should not be regarded as an abnormality.
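A small NumPy sketch of that idea (the simulated history and the 2-sigma band are placeholders):

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.arange(365)                          # one value per day for a year
    y = 2 * t + rng.normal(0, 10, size=365)     # placeholder history with a trend

    slope, intercept = np.polyfit(t, y, deg=1)
    residuals = y - (slope * t + intercept)
    sigma = residuals.std(ddof=2)

    def outside_band(t_new, y_new, k=2.0):
        # True if the new point lies more than k standard deviations off the trend.
        return abs(y_new - (slope * t_new + intercept)) > k * sigma

    print(outside_band(365, 2 * 365))           # on trend -> not abnormal
    print(outside_band(365, 2 * 365 + 100))     # far off trend -> abnormal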
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning more generally; it is a family of machine learning techniques that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers abruptly changing direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or to fall in some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put prior distributions on a and b, and you adjust them based on successive observations. Then the posterior likelihood of the information provided by the last point added should give you some insight into whether it is normal or not. The relative entropy between the posterior and prior distributions at each step is a good thing to monitor too. Consult a book on Bayesian methods for more information.
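A minimal sketch of the derived-quantity idea, with a simple empirical check standing in for the full Bayesian treatment described above (the 3-standard-deviation tolerance and the simulated history are placeholders):

    import numpy as np

    def is_abnormal(history, new_value, tol=3.0):
        # Derived quantity: first differences u_{n+1} - u_n of the recorded numbers.
        diffs = np.diff(history)
        new_diff = new_value - history[-1]
        mu, sigma = diffs.mean(), diffs.std(ddof=1)
        return abs(new_diff - mu) > tol * sigma

    rng = np.random.default_rng(0)
    history = np.cumsum(rng.integers(0, 3, size=365))   # a year of daily integers
    print(is_abnormal(history, history[-1] + 1))        # fits the pattern -> False
    print(is_abnormal(history, history[-1] + 50))       # abrupt jump -> True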
I see little point in complex traditional machine learning machinery (perceptron layers or SVMs, to name only two) if you want to detect outliers. These methods work great when classifying data that is known to be reasonably clean.
