Cross-validating H2OStackedEnsembleEstimator?

The H2O docs claim that cross-validation is done by the train method "for all algos that support the nfolds parameter".
However, H2OStackedEnsembleEstimator does not support it:
H2OValueError: Unknown parameter nfolds = 5
So, how do I cross-validate such a model?

The CV parameter for Stacked Ensemble is called metalearner_nfolds instead of nfolds, to emphasize that the cross-validation is applied to the metalearning algorithm. The list of parameters for Stacked Ensemble can be found here.
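For concreteness, here is a minimal sketch of cross-validating a Stacked Ensemble (the data import and column names are placeholders; the parameter names come from the H2O Stacked Ensemble docs):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator, H2OStackedEnsembleEstimator

h2o.init()
train = h2o.import_file("train.csv")  # placeholder dataset
x = train.columns[:-1]
y = train.columns[-1]

# Base models use nfolds and must keep their cross-validation predictions
gbm = H2OGradientBoostingEstimator(nfolds=5, keep_cross_validation_predictions=True, seed=1)
gbm.train(x=x, y=y, training_frame=train)

# The ensemble's metalearner is cross-validated via metalearner_nfolds, not nfolds
ensemble = H2OStackedEnsembleEstimator(base_models=[gbm], metalearner_nfolds=5, seed=1)
ensemble.train(x=x, y=y, training_frame=train)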

Related

BayesSearchCV of LGBMRegressor: how to weight samples in both training and CV scoring?

While optimizing LightGBM hyperparameters, I'd like to individually weight samples during both training and CV scoring. From the BayesSearchCV docs, it seems one way to do that could be to insert an LGBMRegressor sample_weight key into the BayesSearchCV fit_params option. But this is not clear, because both BayesSearchCV and LGBMRegressor have fit methods.
To which fit method does the BayesSearchCV fit_params go? And is using fit_params really the way to weight samples during both training and CV scoring?
Based on the documentation, I believe fit_params is passed as an argument upon BayesSearchCV() instantiation, not when the .fit() method is called.
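A minimal sketch of that setup (skopt's BayesSearchCV with LightGBM; the data and search space are illustrative). Note that sample_weight passed this way is, in current scikit-learn versions, sliced to each CV training split and forwarded to LGBMRegressor.fit, so it weights training only; weighting the CV scoring itself would require a custom scorer. In recent skopt versions you may instead need to pass the fit parameters to .fit():

import numpy as np
from lightgbm import LGBMRegressor
from skopt import BayesSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.random(200)
weights = rng.random(200)  # hypothetical per-sample weights

opt = BayesSearchCV(
    LGBMRegressor(),
    search_spaces={"num_leaves": (7, 63)},
    n_iter=10,
    cv=3,
    fit_params={"sample_weight": weights},  # forwarded to LGBMRegressor.fit on each split
)
opt.fit(X, y)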

Use of validation_frame in H2O AutoML

Just started with H2O AutoML, so apologies in advance if I have missed something basic.
I have a binary classification problem where the data are observations from K years. I want to train on the first K-1 years, and tune the models and select the best one explicitly based on the remaining Kth year.
If I switch off cross-validation (with nfolds=0) to avoid randomly blending years into the folds, and define the data of year K as the validation_frame, then the ensemble is not created (as expected, according to the documentation), which in fact I need.
If I train with cross-validation (default nfolds) and define the validation frame to be the Kth-year data
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x, y=y, training_frame=first_k_minus_1_years, validation_frame=year_k)
then according to
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
the validation_frame is ignored
"...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."
Is there a way to tune the models and select the best one (ensemble or not) based on the Kth-year data only, while the ensemble of models is also available in the output?
Thanks a lot!
You don't want cross-validation (CV) if you are dealing with time-series (non-IID) data, since you don't want folds from the future predicting the past.
I would explicitly add nfolds=0 so that CV is disabled in AutoML:
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y, training_frame=first_k_minus_1_years, validation_frame=year_k)
To have an ensemble, add a blending_frame, which also applies to time-series. See more info here.
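A sketch of that setup (the frame names are illustrative; blending_frame is a parameter of aml.train):

aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y,
          training_frame=first_years,   # e.g. years 1..K-2
          blending_frame=blend_years,   # held-out slice used to fit the ensemble metalearner
          validation_frame=year_k)      # year K, used for early stopping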
Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags) so that your model gets information from previous years and their aggregates (e.g. a weighted moving average).

Problems generating PMML for 10-fold cross validation knime

I am working with KNIME and trying to train my Naive Bayes classification algorithm with test data. I tried to use 10-fold cross-validation to make my results accurate, but I am not able to generate the PMML model: I keep getting the error Loop end already assigned (start node has more than one end node). This is my KNIME workflow:
Exactly as the error message says, you have two loop end nodes (the PMML Ensemble Loop End and X-Aggregator nodes) but only one loop start (the X-Partitioner).
What is it you are trying to achieve? Normally the purpose of cross-validation is to estimate how well your predictive model is likely to perform on unknown data. If what you want at the end is a single trained Naive Bayes model that you can make predictions with, I think you want to delete the PMML Ensemble Loop End and instead connect the normalised data set to a second Naive Bayes Learner, configured the same as the first one, as well as to the X-Partitioner input. The output of the second learner is the model that you can then use for prediction. In that way the second learner node gets trained on the whole dataset, for the most accurate model, while the original one inside the cross-validation loop is just used to produce the estimate of how good the model is going to be.
If you want to make sure both learners are using the same settings, you can use flow variables to pass the setting values from the whole-dataset learner to the one inside the loop:
1. Show the flow variable ports on the whole-dataset learner and the X-Partitioner, and link the output of the former to the input of the latter.
2. In the Flow Variables tab of the whole-dataset learner configuration, type a name in the box for each parameter you want to pass on.
3. Run the whole-dataset learner.
4. In the Flow Variables tab of the learner in the loop, you should now be able to select the variable names that you created in the drop-down beside the corresponding parameter.

Trying to understand one-class SVM

I am trying to use one-class SVM with Python scikit-learn.
But I do not understand what the different variables X_outliers, n_error_train, n_error_test, n_error_outliers, etc. found at this address are. Why is X randomly generated rather than taken from a dataset?
The scikit-learn "documentation" did not help me much. Also, I found very few examples on the Internet.
Can I use one-class SVM for outlier detection when I have a huge amount of data and do not know whether there are anomalies in my training set?
One-class SVM is an Unsupervised Outlier Detection (here)
One-class SVM is not an outlier-detection method, but a novelty-detection method (here)
Is this possible?
Ok, so this is not really a Python question, more of an SVM comprehension question, but eh. A typical SVM is two-classed, and is an algorithm with two phases:
First, it learns relationships between variables and attributes. For example, you show your algorithm tomato pictures and banana pictures, telling it each time whether it's a banana or a tomato, and you tell it to count the number of red pixels in each picture. If you do it correctly, the SVM will be trained, meaning it will know that pictures with lots of red pixels are more likely to be tomatoes than bananas.
Then comes the predicting phase. You show it a picture of a tomato or a banana without telling it which it is. Since it has been trained before, it will count the red pixels and know which it is.
In your case of a one-class SVM, it's a bit simpler: the training phase is basically showing it a bunch of variables which are all supposed to be similar. You show it a bunch of tomato pictures, telling it "these are tomatoes; everything too different from these is not a tomato".
The code you link to is code to test the SVM's capability to learn. You start by creating the variables in X_train. Then you generate two other sets: X_test, which is similar to X_train (tomato pictures), and X_outliers, which is very different (banana pictures).
Then you show it the X_train variables and tell your SVM "this is the kind of variables we're looking for" with the line clf.fit(X_train). In my example, this is equivalent to showing it lots of tomato images so the SVM learns what a "tomato" is.
And then you test your SVM's capability to sort new variables by showing it your two other sets (X_test and X_outliers) and asking whether it thinks they are similar to X_train or not. You ask with the predict function, and for every element in the sets predict will yield either "1", i.e. "yes, this element is similar to X_train", or "-1", i.e. "this element is very different".
In an ideal case, the SVM would yield only "1" for X_test and only "-1" for X_outliers. But this code shows that this is not always the case. The n_error_ variables count the mistakes the SVM makes: misclassifying X_test elements as "not similar to X_train" and X_outliers elements as "similar to X_train". You can see that there are even errors when the SVM is asked to predict on the very set it has been trained on (n_error_train)!
Why are there such errors? Welcome to machine learning. The main difficulty with SVMs is setting up the training set so that it enables the SVM to learn to distinguish between classes efficiently. So you need to set carefully the number of examples you show it (and what it has to look out for; in my example it was the number of red pixels, in the code it is the value of the variable, but that is a different question).
In the code, the bounded but random initialization of the X sets means that, for example, during one run you could train the SVM on an X_train set with lots of values between -0.3 and 0, even though they are randomly initialized between -0.3 and 0.3 (especially if you have few elements per set, say 5, and you get [-0.2, -0.1, 0, -0.1, 0.1]). Then, when you show the SVM an element with a value of 0.2, it will have trouble associating it with X_train, because it will have learned that X_train elements are more likely to have negative values.
This is equivalent to showing your SVM a few yellow-ish tomatoes when you train it, so when you show it a really red tomato afterwards, it will have trouble classifying it as a tomato.
A one-class SVM is a classifier that determines whether entries are similar or dissimilar to the entries it was trained on.
The script generates three sets:
A training set.
A test-set of entries that are similar to the training set.
A test-set of entries that are dissimilar to the training set.
The error is the number of entries from each of the sets that have been classified wrongly. That is: entries that have been classified as dissimilar to the training set when they were similar (for sets 1 and 2), or that have been classified as similar to the training set when they were dissimilar (set 3).
X_outliers: This is set 3.
n_error_train: The number of classification errors for the elements in the train-set (1).
n_error_test: The number of classification errors for the elements in the test-set (2).
n_error_outliers: The number of classification errors for the elements in the outlier-set (3).
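For reference, here is a condensed sketch of the scikit-learn example both answers describe (the set sizes and parameter values are illustrative):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)                       # set 1: the "tomatoes"
X_test = 0.3 * rng.randn(20, 2)                         # set 2: similar to the training set
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))  # set 3: dissimilar ("bananas")

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)

# predict returns 1 for "similar to the training set" and -1 for "dissimilar"
n_error_train = (clf.predict(X_train) == -1).sum()
n_error_test = (clf.predict(X_test) == -1).sum()
n_error_outliers = (clf.predict(X_outliers) == 1).sum()
print(n_error_train, n_error_test, n_error_outliers)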
This answer is meant to complement the scikit-learn description, but I agree it is a bit technical. I will elaborate on some aspects of the One-Class SVM algorithm (OCSVM) here. OCSVM is designed to solve the unsupervised anomaly detection problem.
Given unstructured (unlabelled) data, it will find, in the d-dimensional feature space, a separating hyperplane described by a weight vector w (T below stands for transpose).
The decision function of all SVM-based methods (including OCSVM) is:
$$f(x) = \operatorname{sign}(w^T x + b)$$
where sign yields -1 (anomalous) or 1 (nominal), shifted by the bias term b.
In the classification problem, w is associated with the distance (margin) between the 2 classes, but this differs in OCSVM: since there is only 1 class, the margin is maximized from the origin (the original OCSVM paper demonstrates this).
As you see, it is a generic algorithm, because SVMs are a family of models that can approximate any non-linear boundary, much like neural networks. To achieve something complicated you have to construct your own kernel matrix. To do this you need to find some convenient mathematical property (suggestions to improve the answer are welcome at this point). But in most cases the Gaussian kernel is a kernel that has some quite nice mathematical properties and associated ML theorems, such as the law of large numbers.
The scikit-learn implementation provides a wrapper around the LIBSVM implementation of SVM and offers 4 such kernels.
nu: a problem-formulation parameter; it lets you tell the model how "dirty" your sample is. More formally, it turns the task into an outlier-detection problem, where you know your data is mixed (nominal and anomalous), as opposed to the pure case, where the problem is different and is called novelty detection.
kernel: one of the most important decisions. Mathematically, a kernel is a big matrix of numbers which, through multiplication, projects the data into a higher-dimensional space. A nice read demonstrating the issue is here, while the paper by Schölkopf, who created OCSVM, goes into more detail.
gamma: in the case of the RBF kernel, you essentially use a Gaussian projection. Disclaimer, my interpretation: essentially, the gamma parameter describes how big the variance of the Normal distribution $N(\mu, \sigma)$ is.
tol (tolerance): one-class SVM searches for the margin that best separates the training data from the origin. The tolerance is the stopping criterion: how close to optimal the quadratic optimization of the objective function must get before the solver stops. (The objective function is the thing that tells the SVM what the parameters should look like to describe a specific margin, i.e. the space between nominal and anomalous points.)
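For concreteness, here is how those parameters appear in scikit-learn's OneClassSVM (the values are illustrative, not recommendations):

from sklearn.svm import OneClassSVM

clf = OneClassSVM(
    nu=0.05,       # roughly, the assumed fraction of "dirty" (anomalous) training points
    kernel="rbf",  # the Gaussian projection; LIBSVM also offers linear, poly and sigmoid
    gamma=0.5,     # inverse width of the Gaussian: larger gamma means smaller variance
    tol=1e-3,      # stopping criterion for the quadratic optimization
)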
Most sklearn examples are based on randomly generated data. If you want to see an example of how OneClassSVM works on a real dataset for outlier detection, you can go through my post: https://justanoderbit.com/outlier-detection/one-class-svm/

Confusion Matrix of Bayesian Network

I'm trying to understand Bayesian networks. I have a data file which has 10 attributes, and I want to obtain the confusion matrix for this data. I thought I needed to calculate the TP, FP, FN, and TN of all fields. Is that right? And if it is, what do I need to do for a Bayesian network?
Really need some guidance, I'm lost.
The process usually goes like this:
You have some labeled data instances which you want to use to train a classifier, so that it can predict the class of new unlabeled instances.
Using your classifier of choice (neural networks, Bayes net, SVM, etc.) we build a model with your training data as input.
At this point, you usually would like to evaluate the performance of the model before deploying it. So, using a previously unused subset of the data (the test set), we compare the model's classification of these instances against the actual class. A good way to summarize these results is with a confusion matrix, which shows how each class of instances is predicted.
For binary classification tasks, the convention is to assign one class as positive and the other as negative. Thus, from the confusion matrix, the percentage of positive instances that are correctly classified as positive is known as the True Positive (TP) rate. The other definitions follow the same convention...
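A minimal sketch of that train/test/confusion-matrix workflow, using scikit-learn's Gaussian Naive Bayes and a toy dataset as stand-ins (a Bayesian network classifier would be evaluated the same way):

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # build the model from labeled training data
pred = model.predict(X_test)                # classify the previously unused test set
print(confusion_matrix(y_test, pred))       # rows = actual class, columns = predicted class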
A confusion matrix is used to evaluate the performance of a classifier, any classifier.
What you are asking about is a confusion matrix with more than two classes.
Here are the steps (a code sketch follows the list):
1. Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
2. Given the test document, apply each classifier separately.
3. Assign the document to the class with the maximum score, the maximum confidence value, or the maximum probability.
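A sketch of that one-vs-rest scheme; scikit-learn's OneVsRestClassifier automates steps 1-3 (the base estimator and dataset here are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
# One binary classifier per class; each instance is assigned to the highest-scoring class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))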
Here is the reference for the paper where you can find more information:
Picca, Davide, Benoît Curdy, and François Bavaud. 2006. Non-linear correspondence analysis in text retrieval: A kernel view. In Proc. JADT.
