Statistics to validate model with independent data set - validation

I am modeling forest understory using a RandomForest classifier. The output is the predicted probability of understory tree occurrence. I also have an independent field dataset that was not used in model building, and I want to test how reliable the model's predictions are against these field data.
What statistics should I use to do this? I was considering a t-test, but I doubt it is the right statistic. Could I use kappa or other agreement statistics? I am not sure. I hope someone can help me with this. Thank you.
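For illustration, here is a minimal scikit-learn sketch of the kind of agreement statistics mentioned above, assuming the field data are presence/absence observations at the predicted locations; the arrays and the 0.5 threshold are placeholders, not part of the original question:

    import numpy as np
    from sklearn.metrics import brier_score_loss, cohen_kappa_score, roc_auc_score

    # Placeholder data: 0/1 field observations of understory occurrence and
    # the model's predicted probabilities at the same locations.
    field_obs = np.array([0, 1, 1, 0, 1, 0, 0, 1])
    pred_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.1, 0.4, 0.7])

    # Threshold-independent checks on the probabilities themselves
    print("AUC:", roc_auc_score(field_obs, pred_prob))
    print("Brier score:", brier_score_loss(field_obs, pred_prob))

    # Agreement statistics such as kappa need hard classes, so pick a threshold first
    pred_class = (pred_prob >= 0.5).astype(int)
    print("Cohen's kappa:", cohen_kappa_score(field_obs, pred_class))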

Related

Does DAI standardize/normalize during training, which methods does it try, and does the genetic algorithm try them all?

I'm often unsure how much to preprocess my data before using DAI. For a production-level model you typically want to reduce dimensionality, remove duplicate features, standardize/normalize, and so on. Is there a rule for where I should stop my own preprocessing and let DAI take over (e.g., only remove NaNs for a binary classification problem and let DAI do the rest)? Will it explicitly report which normalization technique it used, like a MinMaxScaler() from sklearn, for example?
Generally, no preprocessing is needed and the methods DAI uses for internal preprocessing are dependent on the algorithms behind the models.
However, there are specific use cases that may require preprocessing, and H2O can assist you with that if you contact them. For example, you need preprocessing if you want to predict something at the customer level but your data are transactions. Say you have grocery-store transactions and you want to predict how much the store will make tomorrow: you need to aggregate to the day-store level, since that is the level you want predictions at. Basically, any case where the data are more granular than the level you want predictions at needs preprocessing.
For missing values, it's best to let Driverless AI handle them unless you know why the values are missing and can use domain rules to fill them in. For example, if transaction = NA but you know that means no money was spent, you'd want to change the NA to 0.
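As a concrete illustration of the aggregation and domain-rule points above, a minimal pandas sketch (the column names and values are made up for the example):

    import pandas as pd

    # Hypothetical transaction-level data: one row per purchase
    tx = pd.DataFrame({
        "store": ["A", "A", "B", "B", "B"],
        "date": ["2021-03-01", "2021-03-01", "2021-03-01", "2021-03-02", "2021-03-02"],
        "amount": [12.5, None, 7.0, 3.5, 9.0],
    })

    # Domain rule: a missing amount means no money was spent, so fill with 0
    tx["amount"] = tx["amount"].fillna(0)

    # Aggregate to the day-store level, i.e. the level we want predictions at
    daily = tx.groupby(["store", "date"], as_index=False)["amount"].sum()
    print(daily)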
I think the following docs may be helpful: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/faq.html#data-experiments-predictions. In particular the sections 'Can Driverless AI handle data with missing values/nulls?' and 'Does Driverless AI standardize the data?'.
You also can find a lot of information about what your experiment is doing in the experiment report: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/experiment-summary.html. We don't currently report methods of standardization because it happens differently for each model in an ensemble that is potentially quite complex.

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and run 10-fold cross-validation. I think each of the 10 fits will have different coefficient values, so which of the 10 should I pick as my final model or parameter estimates?
Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?
If you build a linear regression model and run 10-fold cross-validation, then indeed each of the 10 fits will have different coefficient values. The reason you use cross-validation is to get a robust estimate of the error of your linear model, rather than evaluating it on a single train/test split, which could be unlucky or too lucky. CV is more robust because ten splits are very unlikely to all be lucky or all be unlucky.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
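A short scikit-learn sketch of that workflow, using synthetic placeholder data: 10-fold CV gives the error estimate, and the final coefficients come from a fit on the whole training set.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                              # placeholder features
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)  # placeholder target

    # 10-fold CV: a robust estimate of the generalization error
    scores = cross_val_score(LinearRegression(), X, y, cv=10,
                             scoring="neg_mean_squared_error")
    print("CV MSE: %.3f +/- %.3f" % (-scores.mean(), scores.std()))

    # The final model (and its coefficients) is fit on all the training data
    final_model = LinearRegression().fit(X, y)
    print("Final coefficients:", final_model.coef_)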
Cross-validation is used to see how good your model's predictions are. It cleverly makes multiple tests on the same data by splitting it, as you probably know (this is especially useful when you don't have much training data).
As an example, it might be used to make sure you aren't overfitting. So basically, once you've finished building your model, you check it with cross-validation, and if you see that the error grows a lot somewhere, you go back and tweak the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data in different ways and validate its performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In grid search, we try to find the parameter values that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, for different parameter combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
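To make the distinction concrete, a small scikit-learn sketch (the dataset and parameter values are arbitrary; note that in scikit-learn the grid search itself is usually scored with cross-validation rather than a single 70/30 split):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)

    # Cross-validation: the parameters stay fixed, only the splits change
    print(cross_val_score(SVC(C=1.0), X, y, cv=5).mean())

    # Grid search: different parameter combinations, each one scored
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)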
Cross-validation is mainly used for comparing different models.
For each model, you get the average generalization error over the k validation sets. Then you can choose the model with the lowest average generalization error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to determine which method (SVM, Random Forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, several models are generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After identifying the best method and/or the best parameters, we can train/retrain our model on the full training dataset.
The parameters or coefficients themselves can be determined by grid-search techniques (see grid search).
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation, and testing. Dividing such a small dataset into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case we don't need a separate validation set, but we still need to hold out the test data.
The model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the metric across the folds show how well the model can be expected to perform in practice.
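A minimal sketch of the method-comparison idea above with scikit-learn, reporting the mean and standard deviation across folds for each candidate method (the data are synthetic placeholders):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, random_state=0)

    candidates = [("SVM", SVC()), ("Random Forest", RandomForestClassifier(random_state=0))]
    for name, model in candidates:
        scores = cross_val_score(model, X, y, cv=5)
        print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))

    # The better-scoring method is then retrained on the full training set.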

Validation of hurdle model?

I built a hurdle model and then used it to predict unknown data points from the known ones with the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known outcomes). This gives you a general idea of how well your model works. Cross-validation works well for a wide variety of models; the downside is that it can get quite computationally heavy.
With large datasets, 10-fold cross-validation is enough. The smaller your dataset, the more "folds" you have to use (with very small datasets you end up doing leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take some effort to interpret in more complicated comparisons, but for your general question of how to assess the validity of the model, the results should be quite easy to use.
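For a hurdle model in particular, one way to apply this is to cross-validate the two parts separately, e.g. sensitivity/specificity for the zero part and an error metric for the positive counts, as the question suggests. The sketch below is only an illustration with scikit-learn stand-ins (a logistic model plus a plain Poisson regression on the positive counts, not a true truncated count model) and synthetic data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression, PoissonRegressor
    from sklearn.metrics import confusion_matrix, mean_absolute_error
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))                                           # placeholder covariates
    y = rng.poisson(lam=np.exp(0.5 * X[:, 0])) * rng.binomial(1, 0.6, 300)  # zero-inflated counts
    occ = (y > 0).astype(int)                                               # occurrence indicator

    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # Binomial part: does a non-zero count occur?
        clf = LogisticRegression().fit(X[train], occ[train])
        tn, fp, fn, tp = confusion_matrix(occ[test], clf.predict(X[test])).ravel()
        sens, spec = tp / (tp + fn), tn / (tn + fp)

        # Count part: fit and evaluate on the positive observations only
        pos_train, pos_test = train[y[train] > 0], test[y[test] > 0]
        reg = PoissonRegressor().fit(X[pos_train], y[pos_train])
        mae = mean_absolute_error(y[pos_test], reg.predict(X[pos_test]))

        print("sensitivity %.2f  specificity %.2f  MAE (positives) %.2f" % (sens, spec, mae))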

4 fold cross validation | Caffe

So I am trying to perform 4-fold cross-validation on my training set. I have divided my training data into four quarters; I use three quarters for training and one quarter for validation, and I repeat this three more times so that each quarter gets a turn as the validation set.
Now, after training, I have four caffemodels. I test each model on its validation set and get a different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that if the performance of the four models is similar and good enough, you re-train the model on all the available data, because you don't want to waste any of it.
n-fold cross-validation is a practical technique for getting some insight into the learning and generalization properties of the model you are trying to train when you don't have a lot of data to start with. You can find details all over the web, but I suggest the open-source book Introduction to Statistical Learning, Chapter 5.
The general rule is that after you have trained your n models, you average the prediction error (MSE, accuracy, or whatever) to get a general idea of the performance of that particular model (in your case, the network architecture and learning strategy) on that dataset.
The main idea is to assess the models learned on the training splits by checking whether they have acceptable performance on the validation set. If they do not, your models have probably overfitted the training data. If the errors on both the training and validation splits are high, the models should be reconsidered, since they have no predictive capacity.
In any case, I would also consider the advice of Yoshua Bengio, who says that for the kind of problems deep learning is meant for, you usually have enough data to simply go with a training/test split. In that case, this answer on Stack Overflow could be useful to you.
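To summarize the four runs the way this answer suggests, a tiny sketch (the accuracy values are placeholders for whatever your four validation runs produced):

    import numpy as np

    # Validation accuracy of each of the four caffemodels (placeholder numbers)
    fold_acc = np.array([0.81, 0.84, 0.79, 0.83])

    print("CV accuracy: %.3f +/- %.3f" % (fold_acc.mean(), fold_acc.std()))
    # If the spread is small and the mean is acceptable, retrain a single model
    # on the full training set rather than picking the best single fold.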

How to create training data

Can anybody tell me how to create training data for categorization? I am using OpenNLP for categorization. Is there any tool for creating training data, or, if I have to create it manually, how should it be done? I am a complete noob in this field. Please help.
Well, normally you have some kind of historical data from previous (manual) categorization. Otherwise you have to create the data you need somehow; such data is often created by observation.
It heavily depends on the data you are trying to categorize, though.
If you were able to generate the training data automatically, you would already have a perfect algorithm for the data and would not need to train a system, would you?
If it is not possible to get training data, you might have to look at algorithms that don't need to learn upfront, i.e. that learn as data comes in while someone constantly corrects the system's mistakes.
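For what it's worth, manual creation usually just means writing a file of labeled examples. A tiny sketch of producing such a file from Python; the categories and texts are invented, and I believe OpenNLP's document categorizer expects one document per line with the label first, but check the OpenNLP manual for the exact format it requires:

    # Hypothetical labeled examples: (category, document text)
    examples = [
        ("sports", "The home team won the final match in overtime"),
        ("politics", "Parliament passed the new budget bill today"),
        ("sports", "The striker signed a two-year contract extension"),
    ]

    # One document per line: label first, then the text
    with open("train.txt", "w", encoding="utf-8") as f:
        for label, text in examples:
            f.write("%s %s\n" % (label, text))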
