How can I measure the probability of error of a trained model, in particular a random forest?

To perform binary classification of a set of images, I trained a random forest on a set of data.
I now want to evaluate the error probability of my model.
For that, I did two things, and I don't know which one corresponds to this error probability:
I calculated the accuracy using k-fold cross-validation.
I tested my model and calculated the ratio of misclassified images to the total number of images.
What is the correct way to calculate the probability of error for my trained model?
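For reference, a minimal sketch (my own, not from the question) of estimating the error probability with scikit-learn: the cross-validated error rate, i.e. 1 minus the mean fold accuracy, is the usual estimate of the probability of error on unseen data. The feature matrix X and labels y below are placeholders standing in for features extracted from the images.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 16)          # placeholder feature matrix (one row per image)
y = np.random.randint(0, 2, 200)     # placeholder binary labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on each of the k held-out folds
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

# The cross-validated error rate estimates the probability of error
error_rate = 1.0 - scores.mean()
print(f"estimated error probability: {error_rate:.3f} (+/- {scores.std():.3f})")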

Related

Average of predictions from many models versus prediction from the best fitted model?

I have a dataset containing about 700 data points. I randomly shuffle the dataset, then split it in an 80:20 ratio for training and testing. Then I use Gaussian process regression (GPR) to fit the training set.
I repeated this process 6 times, and each time I get a new model, with R² varying between 0.50 and 0.80. Now which model should I use for further predictions? Should I take an average of the predictions from all six models, or just choose the best-fitted model? If I choose the best-fitted model, will there be concerns about reproducibility? If there is a better approach for the fitting, please let me know. Thanks.
You can try a Gaussian mixture model that combines all your GP models and gives you a probability distribution over all your results. You can also give each GP model a different weight when forming the Gaussian mixture; the weights could be your R² values, for example.
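A rough sketch of that idea, under my own assumptions: each element of models is a fitted scikit-learn GaussianProcessRegressor, r2_scores are the corresponding R² values, and the predictive distributions are combined as a weighted mixture (the names models, r2_scores and mixture_predict are illustrative, not from the answer).

import numpy as np

def mixture_predict(models, r2_scores, X_new):
    """Weighted mixture of the GP predictive distributions."""
    w = np.asarray(r2_scores, dtype=float)
    w = w / w.sum()                               # normalise the weights

    means, stds = [], []
    for gp in models:                             # fitted GaussianProcessRegressor objects
        mu, sd = gp.predict(X_new, return_std=True)
        means.append(mu)
        stds.append(sd)
    means, stds = np.array(means), np.array(stds)

    # Mixture mean and variance (law of total variance)
    mix_mean = np.sum(w[:, None] * means, axis=0)
    mix_var = np.sum(w[:, None] * (stds**2 + means**2), axis=0) - mix_mean**2
    return mix_mean, np.sqrt(mix_var)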

Neural Network for Matrix Eigenvalues

I am interested in training a network that can take a square (symmetric, for the sake of real eigenvalues) 2x2 matrix as an input and output the corresponding eigenvalues.
So far I have attempted this using PyTorch. I create a training set by randomly generating matrices and then using scipy.linalg to solve for the eigenvalues:
import numpy as np
import scipy.linalg as la
A = np.random.randint(10, size=(2, 2))
A = (A + A.T) / 2    # make the matrix symmetric
results = la.eig(A)  # eigenvalues (and eigenvectors) of A
Training a CNN with PyTorch on a single input matrix, it can achieve a high training-set accuracy and approximate the eigenvalues well.
However, when using a training set with two or more input matrices, the cost function decreases somewhat, but the resulting eigenvalues have a low training-set accuracy.
I've tried changing the network depth/size, as well as using a regular ANN instead of a CNN, but I am unable to achieve a low training-set error when the training set contains more than one matrix.
Is there some fundamental reason why a NN cannot be applied to this kind of problem? Or is it just a problem with my implementation?
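For comparison, here is a minimal sketch (my own, not the poster's code) of regressing the eigenvalues of many random symmetric 2x2 matrices with a small fully connected network in PyTorch; the layer sizes, learning rate and epoch count are arbitrary choices for illustration.

import numpy as np
import torch
import torch.nn as nn

# Build a training set of many symmetric matrices and their eigenvalues
n = 5000
A = np.random.randint(10, size=(n, 2, 2)).astype(np.float32)
A = (A + A.transpose(0, 2, 1)) / 2        # make each matrix symmetric
eigvals = np.linalg.eigvalsh(A)           # real eigenvalues, in ascending order

X = torch.from_numpy(A.reshape(n, 4))
y = torch.from_numpy(eigvals.astype(np.float32))

model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print("final training MSE:", loss.item())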

Multiclass classification confidence score using predict_proba of SGDClassifier

I am using logistic regression in SGDClassifier to perform multi-class classification over ~10k categories.
To get a confidence score for the predicted result, I am using the predict_proba function.
But I am getting prediction probability values such as 0.00026091, 0.00049697, 0.00019632 for both correct and wrong predictions.
Please suggest a way to normalize the scores so that I can filter results by probability value.
If the probability values of all classes are very low, it might mean that your classifier has a hard time classifying the samples. You might want to do some feature engineering or try another model.
To normalize the values, have a look at scikit-learn's MinMaxScaler. This will scale the data to numbers between 0 and 1. But as I said, if the probability for all classes is very low, you won't get a good classification result.
Hope that helps
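As a quick illustration of what the answer suggests (with made-up placeholder data X and y): SGDClassifier exposes predict_proba when fitted with a logistic loss, and MinMaxScaler can rescale each sample's probability vector to the [0, 1] range. Note that with ~10k classes, probabilities near the uniform value 1/10000 ≈ 0.0001 simply indicate the model is not discriminating between classes.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(1000, 20)            # placeholder features
y = np.random.randint(0, 50, 1000)      # placeholder labels (50 classes here)

clf = SGDClassifier(loss="log_loss")    # logistic loss => predict_proba available ("log" in older scikit-learn)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])        # shape (5, n_classes); each row sums to 1

# Rescale each sample's probability vector to [0, 1]; applying the scaler to the
# transposed matrix makes it operate per sample instead of per class.
scaled = MinMaxScaler().fit_transform(proba.T).T
print(proba.max(axis=1), scaled.max(axis=1))   # the scaled maximum is always 1.0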

Error calculation in backpropagation (gradient descent)

Can someone please explain the calculation of the error in backpropagation, which is found in many code examples such as:
error=calculated-target
// then calculate error with respect to each parameter...
Is this the same for squared error and cross-entropy error? How?
Thanks...
I will denote by x an example from the training set, f(x) the prediction of your network for this particular example, and g_x the ground truth (label) associated with x.
The short answer is: root mean squared (RMS) error is used when your network can exactly, and differentiably, predict the labels that you want, while cross-entropy error is used when your network predicts scores for a set of discrete labels.
To clarify, you usually use the root mean squared (RMS) error when you want to predict values that can change continuously. Imagine you want your network to predict vectors in R^n. This is the case when, for example, you want to predict surface normals or optical flow. These values change continuously, and ||f(x)-g_x|| is differentiable, so you can use backprop and train your network.
Cross-entropy, on the other hand, is useful in classification with m discrete labels, for example in image classification. In that case, g_x takes the discrete values c_1, c_2, ..., c_m, where m is the number of classes.
Now, you cannot use RMS, because if you assume that your network predicts the exact labels (i.e. f(x) in {c_1, ..., c_m}), then ||f(x)-g_x|| is no longer differentiable, and you cannot use back-propagation. So you make a network that does not compute class labels directly, but instead computes a set of scores s_1, ..., s_m, one for each class label. Then you maximize the probability of the correct class by applying a softmax to the scores (equivalently, minimizing the cross-entropy of the softmax output). This makes the loss function differentiable.
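To connect this back to the error = calculated - target line from the question, here is a small numeric check (my own illustration, not part of the answer): with a linear output unit and the squared error (1/2)(f(x) - g_x)^2, and likewise with a softmax output and cross-entropy, the gradient of the loss with respect to the output layer's pre-activations reduces to prediction minus target, which is why that line appears for both losses.

import numpy as np

# Squared error with a linear output unit
z = np.array([0.8])            # pre-activation = prediction for a linear output
target = np.array([1.0])
grad_mse = z - target          # d/dz of 0.5*(z - target)**2 is (z - target)

# Cross-entropy with a softmax output
logits = np.array([2.0, 0.5, -1.0])
onehot = np.array([1.0, 0.0, 0.0])
probs = np.exp(logits) / np.exp(logits).sum()
grad_ce = probs - onehot       # d/dlogits of -sum(onehot * log(softmax(logits)))

print(grad_mse, grad_ce)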

k nearest neighbor classifier training sample size for each class

Could someone please tell me whether the training sample sizes for each class need to be equal?
Can I take this scenario?
         class1  class2  class3
samples     400     500     300
or should all the classes have equal sample sizes?
The KNN results basically depend on three things (apart from the value of k):
Density of your training data: you should have roughly the same number of samples for each class. It doesn't need to be exact, but I'd say not more than about 10% disparity; otherwise the boundaries will be very fuzzy.
Size of your whole training set: you need sufficiently many examples in your training set so your model can generalize to unknown samples.
Noise: KNN is very sensitive to noise by nature, so you want to avoid noise in your training set as much as possible.
Consider the following example where you're trying to learn a donut-like shape in a 2D space.
By having a different density in your training data (let's say you have more training samples inside the donut than outside), your decision boundary will be biased toward the denser region.
On the other hand, if your classes are relatively balanced, you'll get a much finer decision boundary that will be close to the actual shape of the donut.
So basically, I would advise trying to balance your dataset (by resampling it somehow), taking into consideration the two other items I mentioned above, and you should be fine.
In case you have to deal with imbalanced training data, you could also consider using the WKNN algorithm (just an optimization of KNN) to assign stronger weights to the class that has fewer elements; a sketch of this idea is below.
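A rough sketch of that class-weighted voting idea (my own illustration; the function and variable names are hypothetical): each neighbour's vote is weighted by the inverse frequency of its class, so the smaller class is not drowned out by the larger ones.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def class_weighted_knn_predict(X_train, y_train, X_query, k=5):
    # Weight for each class = 1 / (number of training samples in that class)
    classes, counts = np.unique(y_train, return_counts=True)
    class_weight = {c: 1.0 / n for c, n in zip(classes, counts)}

    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_query)

    preds = []
    for neighbours in idx:
        votes = {c: 0.0 for c in classes}
        for j in neighbours:
            votes[y_train[j]] += class_weight[y_train[j]]
        preds.append(max(votes, key=votes.get))
    return np.array(preds)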
The k-nearest-neighbor method does not strictly require equal sample sizes, so you can use your example sample sizes. For example, see the following paper on the KDD99 data set with k-nearest neighbors; KDD99 is a wildly imbalanced dataset, much more so than your example dataset.
