Problems generating PMML for 10-fold cross validation knime - cross-validation

I am working with KNIME and trying to train my Naive Bayes classification algorithm with test data. I tried to use 10-fold cross-validation to make my results accurate, but I am not able to generate the PMML model: I keep getting the error "Loop end already assigned (start node has more than one end node)". This is my KNIME workflow:

Exactly as the error message says, you have two loop end nodes (the PMML Ensemble Loop End and X-Aggregator nodes) but only one loop start (the X-Partitioner).
What is it you are trying to achieve? Normally the purpose of cross-validation is to estimate how well your predictive model is likely to perform on unknown data. If what you want at the end is a single trained Naive Bayes model that you can make predictions with, I think you want to delete the PMML Ensemble Loop End and instead connect the normalised data set to a second Naive Bayes Learner, configured the same as the first one, as well as to the X-Partitioner input. The output of the second learner is the model that you can then use for prediction. In that way the second learner node gets trained on the whole dataset, for the most accurate model, while the original one inside the cross-validation loop is just used to produce the estimate of how good the model is going to be.
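As a rough sketch of that pattern outside KNIME (using R's e1071 naiveBayes and the built-in iris data purely as stand-ins for your learner and dataset), the cross-validation loop only produces the performance estimate, while the model you keep is trained once on the whole dataset:
library(e1071)
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(iris)))   # random 10-fold assignment
acc <- sapply(1:10, function(k) {
  fit  <- naiveBayes(Species ~ ., data = iris[folds != k, ])   # train on 9 folds
  pred <- predict(fit, iris[folds == k, ])                     # predict the held-out fold
  mean(pred == iris$Species[folds == k])
})
mean(acc)                                            # cross-validated accuracy estimate
final.model <- naiveBayes(Species ~ ., data = iris)  # model actually used for prediction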
If you want to make sure both learners are using the same settings, you can use flow variables to pass the setting values from the whole-dataset learner to the one inside the loop:
Show flow variable ports on the whole-dataset learner and the X-Partitioner and link the output of the former to the input of the latter.
In the Flow Variables tab of the whole-dataset learner configuration, type a name in the box for each parameter you want to pass on.
Run the whole-dataset learner.
In the Flow Variables tab of the learner in the loop, you should now be able to select the variable names that you created in the drop-down beside the corresponding parameter.

Related

Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection

I am using the H2O autoencoder in R for anomaly detection. I don't have a training dataset, so I am using data.hex to train the model, and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which is calculated by the model itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudo-code for the model.
# Deep learning autoencoder model
library(h2o)
h2o.init()
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex, autoencoder = TRUE,
                             activation = "Tanh", hidden = c(25, 25, 25),
                             variable_importances = TRUE)
# Anomaly detection: per-row reconstruction error
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to pick a subset of the 10 features before the data go into the deep learning model (with autoencoder = TRUE), in case some features are strongly associated with each other? Or do I not need to, since the data go into an autoencoder, which already compresses the data and keeps only the most important information, so feature selection would be redundant?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose actions are anomalous. Here are two examples of data.hex. Example B is a transformed version of Example A, obtained by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) MSE from Example A (~0.005) is 20+ times larger than MSE from Example B;
(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is steeper (e.g. skyrocketing) on the right end, while the reconstruction error curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to reduce the number of input features to the model. I can't say exactly what will happen during training, but collinear/associated features could effectively be eliminated in the hidden layers, as you said. You could consider adjusting your hidden nodes and see how the model behaves: hidden = c(25,25,25) -> hidden = c(25,10,25), or hidden = c(15,15), or even hidden = c(7, 5, 7) for your small number of features.
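For instance (an illustrative sketch only, reusing the x and data.hex objects from your question), you could fit a few bottleneck shapes and compare their overall reconstruction MSE:
for (h in list(c(25, 10, 25), c(15, 15), c(7, 5, 7))) {
  m <- h2o.deeplearning(x = x, training_frame = data.hex, autoencoder = TRUE,
                        activation = "Tanh", hidden = h)
  print(h2o.mse(m))   # reconstruction MSE for this architecture
}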
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver combinations" are anomalies or are you trying to determine which "Sender/Receiver + specific Action combo" are anomalies? If it's the former ("Sender/Receiver combinations") I would guess Example B is better.
If you want to know "Sender/Receiver combinations" and use Example A, then how would you aggregate all the actions for one Sender-Receiver combo? Will you average their error?
But it sounds like Example A gives more of a response for anomalies in the ascending-order plot (where only a few rows have high error). I would sample different rows and check, as a domain expert, whether the errors make sense, i.e. whether the higher-error rows tend to look like anomalies.
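If you do go with Example A, a rough sketch of the averaging idea mentioned above could look like the following (the Sender/Receiver column names, and Reconstruction.MSE as the name of the error column returned by h2o.anomaly, are assumptions about your data):
scored <- as.data.frame(h2o.cbind(data.hex[, c("Sender", "Receiver")], errors))
agg <- aggregate(Reconstruction.MSE ~ Sender + Receiver, data = scored, FUN = mean)
agg[order(-agg$Reconstruction.MSE), ]   # pairs with the largest average error first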

cb_explore input format : Use of providing probability value in training

The cb_explore input format requires specifying action:cost:action_probability for each example.
However, the cb algorithms are already trying to learn the optimal policy, i.e. the probability for each action, from the data. So why do they need the probability of each action in the input? Is it just for initialization?
If I understand correctly, you are asking why the label associated with cb_explore is a set of action/probability pairs.
The probability of the label action is used as an importance weight for training. This has the effect of amplifying the updates for actions that are played less frequently, making them less likely to be drowned out by actions played more frequently.
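As an illustration (made-up numbers, one example per line in the usual VW text format), each training line labels the chosen action with its cost and the probability with which the logging policy played it:
1:0.5:0.8 | feature_a feature_b
2:0.3:0.1 | feature_a feature_c
Roughly speaking, the update from the second example is weighted by about 1/0.1 = 10, versus 1/0.8 = 1.25 for the first, which is exactly the amplification of rarely played actions described above.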
This type of label is also very useful at prediction time, because it generates a log that can be used to perform unbiased counterfactual analyses. In other words, by logging the probability of playing each of the actions before sampling (see cb_sample, which implements how a single action/probability vector is sampled, as for example in the ccb reduction: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/cb_sample.cc#L37), we can then use the log to train another policy and compare how it performs against the original.
See the "A Multi-World Testing Decision Service" paper to describe the mechanism to do unbiased offline experimentation with logged data: https://arxiv.org/pdf/1606.03966v1.pdf

H2O document question for stopping_tolerance, score_each_iteration, score_tree_interval, etc

I have the following questions that still confuse me after reading the H2O documentation. Can someone provide some explanation?
For stopping_tolerance = 0.001, let's use AUC as an example and say the current AUC is 0.8. Does that mean the AUC needs to increase to 0.8 + 0.001, or to 0.8 * (1 + 0.1%)?
For score_each_iteration, the H2O documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_each_iteration.html) just says "iteration". But what exactly is the definition of an "iteration": is it each tree, each grid search, each fold of k-fold cross-validation, or something else?
Can I define score_tree_interval and set score_each_iteration = True at the same time, or can I only use one of them to make the grid search repeatable?
Is there any difference between putting stopping_metric, stopping_tolerance, and stopping_rounds in H2OGradientBoostingEstimator versus in the search_criteria of H2OGridSearch? I found that putting them in H2OGradientBoostingEstimator makes the code run much faster when I test it in a Spark environment.
stopping_tolerance is a relative tolerance: 0.001 is the same as 0.1%. For AUC, where bigger is better, the metric needs to keep improving by at least that relative amount over the specified number of scoring rounds (stopping_rounds), otherwise training stops.
You have linked to a portion of the documentation that is specific to the algorithms listed under "Available in" at the top of the page, so let's stick to answering this question with respect to individual models and not grid search. If you want to see what is being scored at each iteration, take a look at your model results in Flow or use my_model.plot() (for the Python API). For GBM and DRF an iteration corresponds to a tree (ntrees), but since different algorithms iterate over different things, the more generic word "iteration" is used.
Did you test this out? What did you find when you did? Take a look at the scoring history plot in Flow and notice what happens when you set both score_tree_interval and score_each_iteration = True versus when you only set score_tree_interval. (I would recommend trying to understand these parameters at the individual model level before you use grid search.)
Yes: in one case you are specifying early stopping as you build an individual model; in the case of grid search you are indicating whether or not to build more models.
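For example, at the individual-model level in R (predictors, response and train here are placeholders for your own column names and frame), early stopping and the scoring interval are set directly on the model:
gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
               ntrees = 500,
               score_tree_interval = 5,        # score every 5 trees
               stopping_metric = "AUC",
               stopping_rounds = 3,
               stopping_tolerance = 0.001)
h2o.scoreHistory(gbm)   # shows at which tree counts the model was scored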

Back propagation algorithm when we have two outputs

I have a big problem. I want to implement my neural network with 2 output neurons, something like this:
And I want to use the backpropagation algorithm, but I don't know how to calculate the error, because I have an output with 2 neurons. When there is only one neuron in the output it is very easy to apply backpropagation from the single output error, but with two neurons? I was thinking about calculating the error for each output separately, but then I would have to calculate backpropagation separately for the 2 cases and I would get "two different hidden layers" (for every neuron in the hidden layer I would have weights for the two cases). Maybe someone knows a better solution?
I will be very grateful for any help.
Logically thinking, the first layer of weights should give you a representation (the hidden layer) that is useful for predicting both outputs. So this layer should be updated based on the error made in both outputs. But the next layer of weights is separate for each output node, so it should get separate weight updates.
So, for the second-layer weights, the updates are calculated separately based on the respective outputs. For the first layer of weights, I would first calculate the error derivatives backpropagating from each output separately and then simply combine (sum) them to get the final error derivative. Then apply the learning rate to get the weight updates.
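Here is a minimal numeric sketch of that bookkeeping in R (a made-up network with 3 inputs, 4 hidden units, 2 sigmoid outputs and squared error; one gradient step, not a full training loop):
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
x      <- c(0.5, 0.1, 0.9)                       # one input example
target <- c(1, 0)                                # one target per output neuron
W1 <- matrix(rnorm(4 * 3, sd = 0.1), nrow = 4)   # input -> hidden weights
W2 <- matrix(rnorm(2 * 4, sd = 0.1), nrow = 2)   # hidden -> output weights
lr <- 0.1
h <- sigmoid(W1 %*% x)                           # hidden activations
y <- sigmoid(W2 %*% h)                           # two outputs
delta_out <- (y - target) * y * (1 - y)          # one delta per output neuron
grad_W2   <- delta_out %*% t(h)                  # separate updates for each output's weights
delta_hid <- (t(W2) %*% delta_out) * h * (1 - h) # hidden deltas sum the signals from BOTH outputs
grad_W1   <- delta_hid %*% t(x)                  # single update for the shared first layer
W2 <- W2 - lr * grad_W2
W1 <- W1 - lr * grad_W1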
Watch out for the dynamic range of your outputs. For example, if one output produces real values in the range [0, 10] and another produces values in the range [-1000, 1000], then your updates will be dominated by the one with the larger range. You can either add a preprocessing step that rescales your data set so that both outputs have the same dynamic range (with a postprocessing step to restore the actual range), or formulate the error functions for each output so that they produce error values of the same dynamic range.

Confusion Matrix of Bayesian Network

I'm trying to understand Bayesian networks. I have a data file which has 10 attributes, and I want to obtain the confusion matrix for this data. I thought I need to calculate TP, FP, FN, and TN for all fields. Is that true? If it is, what do I need to do for a Bayesian network?
Really need some guidance, I'm lost.
The process usually goes like this:
You have some labeled data instances which you want to use to train a classifier, so that it can predict the class of new unlabeled instances.
Using your classifier of choice (neural networks, Bayes net, SVM, etc.) we build a model with your training data as input.
At this point, you usually would like to evaluate the performance of the model before deploying it. So using a previously unused subset of the data (the test set), we compare the model classification for these instances against the actual class. A good way to summarize these results is by a confusion matrix, which shows how each class of instances is predicted.
For binary classification tasks, the convention is to assign one class as positive and the other as negative. Thus from the confusion matrix, the percentage of positive instances that are correctly classified as positive is known as the True Positive (TP) rate. The other definitions follow the same convention...
A confusion matrix is used to evaluate the performance of a classifier, any classifier.
What you are asking about is a confusion matrix with more than two classes.
Here are the steps:
Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
Given the test document, apply each classifier separately.
Assign the document to the class with the maximum score, the maximum confidence value, or the maximum probability.
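As a concrete illustration of the confusion matrix itself (using e1071's naiveBayes and the built-in iris data purely as stand-ins for your Bayesian network and your 10-attribute file):
library(e1071)
set.seed(42)
idx   <- sample(nrow(iris), 100)                 # rows used for training
train <- iris[idx, ]
test  <- iris[-idx, ]
model <- naiveBayes(Species ~ ., data = train)
pred  <- predict(model, test)
table(Predicted = pred, Actual = test$Species)   # multi-class confusion matrix
Each row of the table shows how the instances assigned to one predicted class are distributed over the actual classes, which is exactly the per-class TP/FP/FN/TN information you asked about.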
Here is a reference to a paper where you can find more information:
Picca, Davide, Benoît Curdy, and François Bavaud. 2006. Non-linear correspondence analysis in text retrieval: A kernel view. In Proc. JADT.
