ROC for predictions - how to define class labels

I have a set of predictions from a model, and a set of true values of the observations, and I want to create an ROC.
The quality of the prediction (in absolute error terms) is independent of the magnitude of the prediction. So I have a set of predictions (pred(1), pred(2), ..., pred(n)) and observations (obs(1), obs(2), ..., obs(n)).
Someone told me to create the elements of my binary classification vector label(i) as label(i) = ifelse(|obs(i) - pred(i)| < tol, 1, 0) and then calculate the AUC (tol is some prespecified tolerance). So for each prediction, if it is close to the corresponding observation, the corresponding label is 1; otherwise it is 0.
But I don't see how the suggested labeling is valid: higher pred() values will not necessarily discriminate my binary classification, i.e. the prediction values do not serve to "rank" the quality of my predictions (a given threshold does not divide my data naturally). Can someone please shed some light on what to do here? Is the suggestion given above valid, or is an ROC inappropriate to use here?

ROC analysis is defined for binary classification, where the observed labels can take two values (binary) and your predictions are any sort of numbers. There are extensions of ROC analysis to multi-class classification, but your question suggests that your observations are some sort of continuous measurement. You could binarize them (something like label(i) = ifelse(obs(i) > someValue, 1, 0)), but it would be invalid for the labels to depend on the classification: they must be some sort of ground truth that is independent of your classifier.
Alternatively, if your observations are continuous, you should assess the quality of your predictions with a correlation coefficient or a similar measure.
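A minimal sketch of both options in Python (the data, the cutoff some_value, and the choice of Pearson correlation are illustrative assumptions):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
obs = rng.normal(size=200)                     # continuous observations
pred = obs + rng.normal(scale=0.5, size=200)   # model predictions

# Option 1: binarize the observations with a cutoff chosen
# independently of the model, then score the predictions with AUC.
some_value = 0.0                               # hypothetical domain-driven cutoff
labels = (obs > some_value).astype(int)
print("AUC:", roc_auc_score(labels, pred))

# Option 2: keep everything continuous and report a correlation.
r, _ = pearsonr(obs, pred)
print("Pearson r:", r)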

Related

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on a knowledge graph, more precisely in the natural language processing field. To evaluate the components of my algorithm, it is necessary to be able to classify the good and the poor candidates. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the points (X, Y). What should I do?
Build another dataset for testing?
Split my dataset?
Or some other solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0, 1] range. In sklearn this is done by sklearn.metrics.precision_recall_curve; note how its 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at 0.5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures like the area under the ROC or precision-recall curve are used. They evaluate thresholds at different values, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes, and then compute precision and recall for each such threshold.
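A small sketch (the labels and scores are made up; in practice y_score would come from something like predict_proba rather than hard 0/1 decisions):

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # manual labels
y_score = np.array([0.9, 0.4, 0.7, 0.6, 0.2, 0.5, 0.8, 0.3])   # classifier scores

# One (precision, recall) point per candidate threshold over the scores.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("AUC-PR:", auc(recall, precision))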

Nearest Neighbor for partially unknown vector

Let's say we have a list of people and would like to find people like person X.
The feature vector has 3 items [weight, height, age] and there are 3 persons in our list. Note that we don't know height of person C.
A: [70kg, 170cm, 60y]
B: [60kg, 169cm, 50y]
C: [60kg, ?, 50y]
What would be the best way to find people closest to person A?
My guess
Let's calculate the average value for height and use it in place of the unknown value.
So, let's say we calculated that 170cm is the average height, and we redefine person C as [60kg, ~170cm, 50y].
Now we can find the people closest to A; the ranking will be A, C, B.
Problem
Now, the problem is that we put C, with a guessed ~170cm, ahead of B, with a known 169cm.
It kinda feels wrong. We humans are smarter than machines and know that there's little chance C is exactly 170cm. So it would be better to rank B, with 169cm, ahead of C.
But how can we calculate that penalty (preferably with a simple empirical algorithm)? Should we somehow penalise vectors with unknown values? And by how much (maybe by calculating the average difference between every two persons' heights in the set)?
And what would that penalisation look like in the general case, where the feature vector has dimension N with K known items and U unknown ones (K + U = N)?
In this particular example, would it be better to use linear regression to fill in the missing values instead of taking the average? That way you may have more confidence in the guessed value and may not need a penalty.
But if you want a penalty, one idea is to take the ratio of non-missing features. In the example, there are 3 features in total. C has values for 2 of them, so the ratio of non-missing features for C is 2/3. Adjust the similarity score by multiplying it by this ratio. For example, if the similarity between A and C is 0.9, the adjusted similarity is 0.9 * 2/3 = 0.6, whereas the similarity between A and B is not affected, since B has values for all the features and its ratio is 1.
You can also weight the features when computing the ratio. For example, give (weight, height, age) the weights (0.3, 0.4, 0.3) respectively. Then missing the height feature gives a weighted ratio of (0.3 + 0.3) = 0.6. You can see C is penalized even more, since we consider height more important than weight and age. A sketch of this idea follows.
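Here is a minimal sketch of the penalised similarity (the feature weights and the inverse-distance similarity are illustrative assumptions, not part of the question):

import numpy as np

FEATURE_WEIGHTS = np.array([0.3, 0.4, 0.3])  # weight, height, age (assumed)

def penalised_similarity(a, b):
    # Weighted ratio of features present (non-NaN) in both vectors.
    present = ~(np.isnan(a) | np.isnan(b))
    ratio = FEATURE_WEIGHTS[present].sum() / FEATURE_WEIGHTS.sum()
    # Inverse-distance similarity over the shared features, scaled by the ratio.
    dist = np.linalg.norm(a[present] - b[present])
    return ratio / (1.0 + dist)

A = np.array([70.0, 170.0, 60.0])
B = np.array([60.0, 169.0, 50.0])
C = np.array([60.0, np.nan, 50.0])
print(penalised_similarity(A, B))  # B now ranks ahead of ...
print(penalised_similarity(A, C))  # ... the penalised C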
I would suggest using the data points for which all attributes are known to train a learning model, e.g. a linear regression or a multilayer perceptron, that predicts the unknown attribute, and then using this model to fill in the unknown attributes. The average is a special case of a linear model.
You are interested in the problem of Data Imputation.
There are several approaches to solving this problem, and I am just going to list some:
Mean/Mode/Median Imputation: Imputation is a method to fill in missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. This can be further classified into generalized and similar-case imputation.
Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. In this case, we divide our data set into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training set of the model, while the second set, with missing values, is the test set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training set, and use it to populate the missing values of the test set.
KNN (k-nearest neighbour) Imputation: In this method, the missing values of an instance are imputed using a given number of instances that are most similar to the instance whose values are missing. The similarity of two instances is determined using a distance function.
Linear Regression: A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables denoted X. In prediction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.
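A hedged sketch of two of these options on the question's [weight, height, age] data, using sklearn's imputers:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[70.0, 170.0, 60.0],    # A
              [60.0, 169.0, 50.0],    # B
              [60.0, np.nan, 50.0]])  # C, height unknown

# Mean imputation: replace the missing height with the column mean.
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation: replace it with the height of the most similar row (B).
print(KNNImputer(n_neighbors=1).fit_transform(X))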

Error calculation in backpropagation (gradient descent)

Can someone please give an explanation of the calculation of the error in backpropagation, which is found in many code examples such as:
error=calculated-target
// then calculate error with respect to each parameter...
Is this the same for squared error and cross-entropy error? How?
Thanks...
I will denote by x an example from the training set, by f(x) the prediction of your network for this particular example, and by g_x the ground truth (label) associated with x.
The short answer is: the root mean squared (RMS) error is used when you have a network that can exactly, and differentiably, predict the labels that you want. The cross-entropy error is used when your network predicts scores for a set of discrete labels.
To clarify, you usually use the RMS error when you want to predict values that can change continuously. Imagine you want your network to predict vectors in R^n. This is the case when, for example, you want to predict surface normals or optical flow. These values change continuously, and ||f(x) - g_x|| is differentiable, so you can use backprop and train your network.
Cross-entropy, on the other hand, is useful in classification with m labels, for example in image classification. In that case, the g_x take the discrete values c_1, c_2, ..., c_m, where m is the number of classes.
Now, you cannot use the RMS error, because if your network predicts the exact labels (i.e. f(x) in {c_1, ..., c_m}), then ||f(x) - g_x|| is no longer differentiable and you cannot use back-propagation. So, you make a network that does not compute class labels directly, but instead computes a set of scores s_1, ..., s_m, one for each class label. Then, you maximize the probability of the correct class by applying a softmax to the scores and maximizing the probability it assigns to the correct label. This makes the loss function differentiable.
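To connect this back to the question's error=calculated-target line: for a squared error on a direct output, and for cross-entropy on softmax scores, the gradient of the loss with respect to the network's output takes the same prediction-minus-target form. A small numerical sketch (assuming one-hot targets):

import numpy as np

# Squared error L = 0.5 * (f - g)^2  =>  dL/df = f - g
f, g = 0.8, 1.0
print("squared-error gradient:", f - g)

# Softmax cross-entropy on raw scores s with one-hot target t:
# L = -sum(t * log(softmax(s)))  =>  dL/ds = softmax(s) - t
s = np.array([2.0, 1.0, 0.1])           # raw class scores
t = np.array([1.0, 0.0, 0.0])           # one-hot ground truth
p = np.exp(s - s.max()); p /= p.sum()   # numerically stable softmax
print("cross-entropy gradient:", p - t)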

What is a bad, decent, good, and excellent F1-measure range?

I understand the F1-measure is a harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), which is a classifier that makes random guesses (a.k.a. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its f1-score.
In a binary classification problem with a balanced dataset (6198 samples in total: 3099 labelled 0 and 3099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
In a second example, using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case), the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
PS. summary generated using sklearn.metrics.classification_report
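A sketch reproducing these baselines (the balanced labels are simulated; DummyClassifier ignores the features, so the all-zeros X is just a placeholder for the API):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

y = np.array([0] * 3099 + [1] * 3099)   # balanced labels, as in the example
X = np.zeros((len(y), 1))               # features are ignored by the dummies

for strategy in ("uniform", "constant"):
    clf = DummyClassifier(strategy=strategy, constant=1, random_state=0)
    y_pred = clf.fit(X, y).predict(X)
    print(strategy)
    print(classification_report(y, y_pred))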
You did not find any reference for an f1-measure range because there is no such range. The F1-measure is a combined metric of precision and recall.
Let's say you have two algorithms: one has higher precision and the other higher recall. From this observation alone, you cannot tell which algorithm is better, unless your goal is to maximize precision.
So, given this ambiguity about how to select the superior algorithm of the two (one with higher recall, the other with higher precision), we use the f1-measure to choose between them.
The f1-measure is a relative term; that's why there is no absolute range to define how good your algorithm is.
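For reference, the formula behind this trade-off, with a small worked example (the numbers are illustrative):

# F1 is the harmonic mean of precision (P) and recall (R):
#   F1 = 2 * P * R / (P + R)
# It punishes imbalance: high precision cannot hide poor recall.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.9, 0.1))  # 0.18: high precision, poor recall
print(f1(0.5, 0.5))  # 0.50: balanced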

Converting SVM hyperplane distance (response) to Likelihood

I am trying to use an SVM to train some image models. However, an SVM is not a probabilistic framework, so it outputs the distance from the hyperplane as a raw number.
Platt converted the output of an SVM to a likelihood by using some optimisation function, but I fail to understand it. Does the method assume the classes are equally probable, i.e. for a binary classifier, if the training sets are even and proportional, that labels 1 and -1 each occur 50% of the time?
Secondly, in some papers I read that for a binary SVM classifier they convert the -1 and 1 labels to the range 0 to 1 and compute the likelihood. But they do not mention anything about how to convert the SVM distance to a probability.
Sorry for my English. I would welcome any suggestions and comments. Thank you.
link to paper
Well, as far as I can tell, that paper proposes a mapping from the SVM output to the range [0, 1] using a sigmoid function.
From a simplified point of view, it is something like Sigmoid(RawSVM(x)) in [0, 1], so there is no explicit "weight" on the labels. The idea is that you take one label (let's say Y = +1), then take the output of the SVM and see how close the prediction for that pattern is to that label; if it is close, the sigmoid gives you a number close to 1, otherwise a number close to 0. And hence you have a sense of probability.
Secondly, in some papers I read that for a binary SVM classifier they convert the -1 and 1 labels to the range 0 to 1 and compute the likelihood. But they do not mention anything about how to convert the SVM distance to a probability.
Yes, you are correct, and some implementations work in the realm of [0, 1] instead of [-1, +1]; some even map the label to a factor depending on the value of C. In any case, that shouldn't affect the method proposed in the paper, since it would map any range to [0, 1]. Keep in mind that this "probabilistic" distribution is just a map from some range to [0, 1] assuming uniformity. I am oversimplifying, but the effect is the same.
One last thing: the sigmoid map is not static but data-driven, which means there is some training on the dataset to parametrize the sigmoid and adjust it to the data. In other words, for two different datasets you would probably get two different mapping functions.
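For what it's worth, a sketch of this data-driven fit in sklearn, where CalibratedClassifierCV with method='sigmoid' implements Platt scaling on top of the SVM's decision_function (the toy data is an assumption):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

svm = LinearSVC()  # raw output: signed distance to the hyperplane
# The sigmoid's parameters are fitted on held-out folds of this dataset.
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
platt.fit(X, y)
print(platt.predict_proba(X[:3]))  # probabilities in [0, 1]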
