H2O: Is there a way to fix the threshold in H2ORandomForestEstimator performance during training and testing?

I have built a model with H2ORandomForestEstimator, and the results look like the output below.
The threshold keeps changing (0.5 on training and 0.313725489027 on validation), and I would like to fix the threshold in H2ORandomForestEstimator so that I can compare runs during fine-tuning. Is there a way to set the threshold?
According to http://h2o-release.s3.amazonaws.com/h2o/master/3484/docs-website/h2o-py/docs/modeling.html#h2orandomforestestimator, there is no such parameter.
If there is no way to set this, how do we know which threshold our model is built on?
rf_v1
** Reported on train data. **
MSE: 2.75013548238e-05
RMSE: 0.00524417341664
LogLoss:0.000494320913199
Mean Per-Class Error: 0.0188802936476
AUC: 0.974221763605
Gini: 0.948443527211
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.5:
0 1 Error Rate
----- ------ --- ------- --------------
0 161692 1 0 (1.0/161693.0)
1 3 50 0.0566 (3.0/53.0)
Total 161695 51 0 (4.0/161746.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.5 0.961538 19
max f2 0.25 0.955056 21
max f0point5 0.571429 0.983936 18
max accuracy 0.571429 0.999975 18
max precision 1 1 0
max recall 0 1 69
max specificity 1 1 0
max absolute_mcc 0.5 0.961704 19
max min_per_class_accuracy 0.25 0.962264 21
max mean_per_class_accuracy 0.25 0.98112 21
Gains/Lift Table: Avg response rate: 0.03 %
** Reported on validation data. **
MSE: 1.00535766226e-05
RMSE: 0.00317073755183
LogLoss: 4.53885183426e-05
Mean Per-Class Error: 0.0
AUC: 1.0
Gini: 1.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.313725489027:
0 1 Error Rate
----- ----- --- ------- -------------
0 53715 0 0 (0.0/53715.0)
1 0 16 0 (0.0/16.0)
Total 53715 16 0 (0.0/53731.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- ------- -----
max f1 0.313725 1 5
max f2 0.313725 1 5
max f0point5 0.313725 1 5
max accuracy 0.313725 1 5
max precision 1 1 0
max recall 0.313725 1 5
max specificity 1 1 0
max absolute_mcc 0.313725 1 5
max min_per_class_accuracy 0.313725 1 5
max mean_per_class_accuracy 0.313725 1 5

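The reported threshold is the one that maximizes F1 on the dataset being scored, which is why it differs between the training and validation reports.
If you want to apply your own threshold, you will have to take the predicted probability of the positive class and compare it to your threshold yourself to produce the label you want.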
If you use your web browser to connect to the H2O Flow Web UI inside of H2O-3, you can mouse over the ROC curve and visually browse the confusion matrix for each threshold, which is convenient.
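Comparing the positive-class probability against a fixed cut-off can be sketched in a few lines of plain Python. This is only an illustration with made-up numbers: the helper names and the data are hypothetical, and it assumes you have already pulled the positive-class probabilities out of the prediction frame (in H2O's Python API this is, as far as I know, the p1 column returned by predict, but check that against your version).

```python
def apply_threshold(positive_probs, threshold):
    """Label a row 1 when its positive-class probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in positive_probs]


def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for a, y in zip(actual, predicted):
        if a == 1 and y == 1:
            tp += 1
        elif a == 1:
            fn += 1
        elif y == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# One fixed threshold, used for both datasets so the numbers are comparable.
FIXED_THRESHOLD = 0.5

# Hypothetical labels and positive-class probabilities (not from the question).
train_actual = [0, 0, 1, 1]
train_p1 = [0.05, 0.40, 0.55, 0.90]
valid_actual = [0, 1, 1]
valid_p1 = [0.10, 0.35, 0.80]

train_counts = confusion_counts(train_actual, apply_threshold(train_p1, FIXED_THRESHOLD))
valid_counts = confusion_counts(valid_actual, apply_threshold(valid_p1, FIXED_THRESHOLD))
print("train (tn, fp, fn, tp):", train_counts)
print("valid (tn, fp, fn, tp):", valid_counts)
```

Because both confusion matrices now come from the same cut-off, differences between them reflect the model rather than a moving max-F1 threshold.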

Related

Ceres Solver Evaluate() "successful step 1" Jacobian and residual evaluate only once

I use Evaluate() to build the Jacobian. The problem appears once the optimization iterations are over: the Jacobian is computed only at iteration 0 and never again, and the initial cost is equal to the final cost. Only one of the iteration steps succeeds and the rest fail. I don't know what is causing this.
summary.BriefReport()
iter cost cost_change |gradient| |step| tr_ratio tr_radius
0 1.2e+05 0 6.9e+04 0 0. 1e+4
1 4.4e+05 -3.2e+05 0 3.83 -2.64 5e+3
2 4.4e+05 -3.2e+05 0 3.83 -2.64 2.5e+3
3 4.4e+05 -3.2e+05 0 3.83 -2.64 1.25e+3
4 4.4e+05 -3.2e+05 0 3.83 -2.64 6.25e+2
5 4.4e+05 -3.2e+05 0 3.31 -2.70 3.12e+2
6 4.6e+05 -3.4e+05 0 3.13 -2.84 1.56e+2
7 2.8e+05 -1.6e+05 0 1.93 -1.74 7.81e+1
8 1.9e+05 -7.3e+04 0 9.7e-1 -1.30 3.91e+1
9 1.5e+05 -3.4e+04 0 3.8e-1 -1.14 1.95e+1
10 ...
summary.FullReport()
Solver Summary (v 1.14.0-eigen-(3.2.9)-lapack-suitesparse-(5.7.1)-cxsparse-(3.2.0)-eigensparse-openmp-no_tbb)
Original Reduced
Parameter blocks 15 15
Parameters 564 564
Effective parameters 561 561
Residual blocks 6 6
Residuals 80 80
Minimizer TRUST_REGION
Sparse linear algebra library SUITE_SPARSE
Trust region strategy LEVENBERG_MARQUARDT
Given Used
Linear solver SPARSE_NORMAL_CHOLESKY SPARSE_NORMAL_CHOLESKY
Threads 1 1
Linear solver ordering AUTOMATIC 15
Cost:
Initial 3.682558e+04
Final 3.682558e+04
Change 0.000000e+00
Minimizer iterations 13
Successful steps 1
Unsuccessful steps 12
Time (in seconds):
Preprocessor 0.000045
Residual only evaluation 0.030806 (13)
Jacobian & residual evaluation 0.004554 (1)
Linear solver 0.181772 (13)
Minimizer 0.217993
Postprocessor 0.000007
Total 0.218046
Termination: CONVERGENCE (Function tolerance reached. |cost_change|/cost: 0.000000e+00 <= 1.000000e-16)

A variant of the Knapsack algorithm

I have a list of items, a, b, c,..., each of which has a weight and a value.
The 'ordinary' Knapsack algorithm will find the selection of items that maximises the value of the selected items, whilst ensuring that the weight is below a given constraint.
The problem I have is slightly different. I wish to minimise the value (easy enough by using the reciprocal of the value), whilst ensuring that the weight is at least the value of the given constraint, not less than or equal to the constraint.
I have tried re-routing the idea through the ordinary Knapsack algorithm, but this can't be done. I was hoping there is another combinatorial algorithm that I am not aware of that does this.
In the German wiki it's formalized as:
finite set of objects U
w: weight-function
v: value-function
w: U -> R
v: U -> R
B in R # constraint rhs
Find subset K in U subject to:
sum( w(u) ) <= B | all u in K
such that:
max sum( v(u) ) | all u in K
So there is no restriction like nonnegativity.
Just use negative weights, negative values and a negative B.
The basic concept is:
sum( w(u) ) <= B | all u in K
<->
-sum( w(u) ) >= -B | all u in K
So in your case:
classic constraint: x0 + x1 <= B | 3 + 7 <= 12 Y | 3 + 10 <= 12 N
becomes: -x0 - x1 <= -B |-3 - 7 <=-12 N |-3 - 10 <=-12 Y
So for a given implementation it depends on the software if this is allowed. In terms of the optimization-problem, there is no problem. The integer-programming formulation for your case is as natural as the classic one (and bounded).
Python Demo based on Integer-Programming
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex
np.random.seed(1)
""" INSTANCE """
weight = np.random.randint(50, size = 5)
value = np.random.randint(50, size = 5)
capacity = 50
""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = value # MODIFICATION: default = minimize!
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int) # assumes existence
print("INSTANCE")
print(" weights: ", weight)
print(" values: ", value)
print(" capacity: ", capacity)
print("Solution")
print(x_sol)
print("sum weight: ", x_sol.dot(weight))
print("value: ", x_sol.dot(value))
Small remarks
This code is just a demo using a somewhat low-level library, and there are other tools available which might be better suited (e.g. on Windows: pulp)
it's the classic integer-programming formulation from the wiki, modified as mentioned above
it should scale well, as the underlying solver (CBC) is pretty good
as written, it solves the 0-1 knapsack (only the variable bounds would need to be changed for other variants)
Small look at the core-code:
# create model
model = CyClpSimplex()
# create one variable for each how-often-do-i-pick-this-item decision
# variable needs to be integer (or binary for 0-1 knapsack)
x = model.addVariable('x', n, isInt=True)
# the objective value of our IP: a linear-function
# cylp only needs the coefficients of this function: c0*x0 + c1*x1 + c2*x2...
# we only need our value vector
model.objective = value # MODIFICATION: default = minimize!
# WARNING: typically one should always use variable-bounds
# (cylp problems...)
# workaround: express bounds lower_bound <= var <= upper_bound as two constraints
# a constraint is an affine-expression
# sp.eye creates a sparse-diagonal with 1's
# example: sp.eye(3) * x >= 5
# 1 0 0 -> 1 * x0 + 0 * x1 + 0 * x2 >= 5
# 0 1 0 -> 0 * x0 + 1 * x1 + 0 * x2 >= 5
# 0 0 1 -> 0 * x0 + 0 * x1 + 1 * x2 >= 5
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
# cylp somewhat outdated: need numpy's matrix class
# apart from that it's just the weight-constraint as defined at wiki
# same affine-expression as above (but only a row-vector-like matrix)
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
# internal conversion of type needed to treat it as IP (or else it would be LP)
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
# type-casting
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is 4.88372 - 0.00 seconds
Cgl0004I processed model has 1 rows, 4 columns (4 integer (4 of which binary)) and 4 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 5
Cbc0038I Before mini branch and bound, 4 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I After 0.00 seconds - Feasibility pump exiting with objective of 5 - took 0.00 seconds
Cbc0012I Integer solution of 5 found by feasibility pump after 0 iterations and 0 nodes (0.00 seconds)
Cbc0001I Search completed - best objective 5, took 0 iterations and 0 nodes (0.00 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 5 to 5
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
INSTANCE
weights: [37 43 12 8 9]
values: [11 5 15 0 16]
capacity: 50
Solution
[0 1 0 1 0]
sum weight: 51
value: 5
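For small integer instances the result can be cross-checked without a solver. The sketch below is not the integer-programming approach used above. It is a plain dynamic program for the same "minimise value, weight at least the capacity" problem, and it assumes non-negative integer weights and a non-negative integer capacity. The function name min_value_cover is made up for illustration.

```python
import math


def min_value_cover(weights, values, capacity):
    """Cheapest 0-1 selection whose total weight is at least `capacity`.

    Assumes non-negative integer weights and capacity.
    Returns (best_value, chosen_indices), or (math.inf, None) if infeasible.
    """
    # best[w] holds the cheapest (value, chosen) whose weight, capped at
    # `capacity`, equals w. Capping is safe because any weight beyond the
    # capacity satisfies the constraint equally well.
    best = [(math.inf, None)] * (capacity + 1)
    best[0] = (0, ())
    for i, (wt, val) in enumerate(zip(weights, values)):
        new_best = list(best)  # work on a copy so each item is used at most once
        for w, (cur_val, chosen) in enumerate(best):
            if chosen is None:
                continue  # weight level w is not reachable yet
            target = min(capacity, w + wt)
            if cur_val + val < new_best[target][0]:
                new_best[target] = (cur_val + val, chosen + (i,))
        best = new_best
    return best[capacity]


# Same instance as in the demo output above.
weights = [37, 43, 12, 8, 9]
values = [11, 5, 15, 0, 16]
best_value, chosen = min_value_cover(weights, values, 50)
print("value:", best_value)
print("items:", chosen)
print("sum weight:", sum(weights[i] for i in chosen))
```

The table has capacity + 1 entries and is rebuilt once per item, so this is only practical when the capacity is a modest integer. For anything larger, the integer-programming formulation is the better tool.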

How can you improve computation time when predicting KNN Imputation?

I feel like my run time is extremely slow for my data set, this is the code:
library(caret)
library(data.table)
knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])
The preProcess call that creates knnImputeValues runs fairly quickly; however, the predict function takes a tremendous amount of time. When I timed it on a subset of the data, this was the result:
testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[ 1:15000, imputeColumns]))
testtime
user 969.78
system 38.70
elapsed 1010.72
Additionally, it should be noted that caret preprocess uses "RANN".
Now my full dataset is:
str(mainData[ , imputeColumns])
'data.frame': 1809032 obs. of 16 variables:
$ V1: int 3 5 5 4 4 4 3 4 3 3 ...
$ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
$ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V4: int 2 5 5 12 4 5 11 8 7 8 ...
$ V5: int 2 0 0 2 0 0 1 3 2 8 ...
$ V6: int 648 489 489 472 472 472 497 642 696 696 ...
$ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
$ V8: int 0 0 0 0 0 0 0 1 1 1 ...
$ V9: num 0 0 0 0 0 ...
$ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
$ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V12: num 2 5 5 12 4 5 11 8 7 8 ...
$ V13: num 2 0 0 2 0 0 1 3 2 8 ...
$ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
$ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
$ V16: num 657 756 756 756 756 ...
So is there something I'm doing wrong, or is this typical for how long it will take to run? If you do a back-of-the-envelope extrapolation (which I know isn't entirely accurate), you'd get what, 33 days?
Also, it looks like system time is very low and user time is very high. Is that normal?
My computer is a laptop with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz processor.
Additionally would this improve the runtime of the predict function?
cl <- makeCluster(4)
registerDoParallel()
I tried it, and it didn't seem to make a difference other than all the processors looked more active in my task manager.
FOCUSED QUESTION: I'm using the caret package to do KNN imputation on 1.8 million rows. The way I'm currently doing it will take over a month to run. How do I write this so that it runs much faster (if possible)?
Thank you for any help provided. The answer might very well be "that's how long it takes, don't bother"; I just want to rule out any possible mistakes.
You can speed this up with the imputation package, which supports canopies and can be installed from GitHub:
Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")
devtools::install_github("alexwhitworth/imputation")
Canopies use a cheap distance metric--in this case distance from the data mean vector--to get approximate neighbors. In general, we wish to keep the canopies each sized < 100k so for 1.8M rows, we'll use 20 canopies:
library("imputation")
to_impute <- mainData[trainingRows, imputeColumns] ## OP undefined
imputed <- kNN_impute(to_impute, k= 10, q= 2, verbose= TRUE,
parallel= TRUE, n_canopies= 20)
NOTE:
The imputation package requires numeric data inputs. You have several factor variables in your str output. They will cause this to fail.
You'll also get some mean vector imputation if you have fully missing rows.
# note this example data is too small for canopies to be useful
# meant solely to illustrate
set.seed(2143L)
x1 <- matrix(rnorm(1000), 100, 10)
x1[sample(1:1000, size= 50, replace= FALSE)] <- NA
x_imp <- kNN_impute(x1, k=5, q=2, n_canopies= 10)
sum(is.na(x_imp[[1]])) # 0
# with fully missing rows
x2 <- x1; x2[5,] <- NA
x_imp <- kNN_impute(x2, k=5, q=2, n_canopies= 10)
[1] "Computing canopies kNN solution provided within canopies"
[1] "Canopies complete... calculating kNN."
row(s) 1 are entirely missing.
These row(s)' values will be imputed to column means.
Warning message:
In FUN(X[[i]], ...) :
Rows with entirely missing values imputed to column means.
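To make the canopy idea concrete, here is a rough pure-Python sketch. It is not the imputation package's implementation and makes no claim to match its results: rows are ranked by their distance from the column-mean vector (the cheap metric), cut into equally sized canopies, and the expensive nearest-neighbour search is then run only inside each canopy. The function name, the use of None for missing values, and the choice to average mean-filled neighbour values are all assumptions made for illustration.

```python
import math


def canopy_knn_impute(rows, k=5, n_canopies=4):
    """Fill None entries with the average of k neighbours from the same canopy.

    Assumes every column has at least one observed (non-None) value.
    """
    n, p = len(rows), len(rows[0])
    # Column means over the observed values only.
    means = []
    for j in range(p):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    # Mean-filled copy, used for distances and for neighbour values.
    filled = [[means[j] if r[j] is None else r[j] for j in range(p)] for r in rows]

    # Cheap metric: rank rows by distance from the data mean vector,
    # then cut the ranking into canopies of roughly equal size.
    order = sorted(range(n), key=lambda i: math.dist(filled[i], means))
    size = math.ceil(n / n_canopies)
    canopies = [order[s:s + size] for s in range(0, n, size)]

    result = [list(r) for r in rows]
    for canopy in canopies:
        for i in canopy:
            missing = [j for j in range(p) if rows[i][j] is None]
            if not missing:
                continue
            # Expensive metric, but only against rows in the same canopy.
            others = sorted((c for c in canopy if c != i),
                            key=lambda c: math.dist(filled[i], filled[c]))
            neighbours = others[:k] or [i]  # lone row: falls back to column means
            for j in missing:
                result[i][j] = sum(filled[c][j] for c in neighbours) / len(neighbours)
    return result
```

With c canopies of roughly n/c rows each, the neighbour search costs on the order of n²/c distance computations instead of n², which is where the speed-up comes from. The price is that a row's true nearest neighbours may sit in a different canopy.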

Evaluating the model in WEKA

I applied a classification algorithm to a dataset and got the statistics below:
Correctly Classified Instances 684 76.1693 %
Incorrectly Classified Instances 214 23.8307 %
Kappa statistic 0
Mean absolute error 0.1343
Root mean squared error 0.2582
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 898
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0 0 0 0 0 0.5 1
0 0 0 0 0 0.5 2
1 1 0.762 1 0.865 0.5 3
0 0 0 0 0 ? 4
0 0 0 0 0 0.5 5
0 0 0 0 0 0.5 U
Weighted Avg. 0.762 0.762 0.58 0.762 0.659 0.5
=== Confusion Matrix ===
a b c d e f <-- classified as
0 0 8 0 0 0 | a = 1
0 0 99 0 0 0 | b = 2
0 0 684 0 0 0 | c = 3
0 0 0 0 0 0 | d = 4
0 0 67 0 0 0 | e = 5
0 0 40 0 0 0 | f = U
I can understand much of the output; however, since I am new to Weka, I have trouble interpreting the values:
1. Which error rate should I report overall?
2. How can I tell whether there is something interesting about the model?
1) Overall error measure
The triplet Precision, Recall and F-Measure is reported together quite often because each number represents a different aspect of the model.
If you would like to have a single number only, then take Percent (In)correctly Classified Instances or the Weighted Avg. F-Measure.
The other error measures are also useful, but they require deeper knowledge of statistics (which I'm lacking :-)
2) Something interesting about the model
From Detailed Accuracy By Class and Confusion Matrix you can see that the model is quite simple. It classifies everything as class 3. The error measures look quite successful, but that is only because 76% of the instances in the dataset have class 3. The model corresponds to the often-used baseline algorithm called "most common class".
The ROC area is also useful in terms of evaluating accuracy and interpreting how interesting a model is. Simply speaking, the true positive rate is plotted against the false positive rate and the ROC area is calculated as the area underneath this curve. A high ROC area, say 0.9 to 1, indicates that the model is very good at classifying instances, whereas a ROC area of 0.5 (as in your model) means that the model is no better at classification than a random method like flipping coins.
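The "most common class" reading can be checked directly from the confusion matrix in the question. The short sketch below (helper names are made up for illustration) recomputes the accuracy, the accuracy of always predicting the largest class, and Cohen's kappa from the matrix rows. The two accuracies coincide and kappa is 0, which is what you would expect from a model that always predicts class 3.

```python
def accuracy_and_kappa(matrix):
    """Observed accuracy and Cohen's kappa from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, given the actual and predicted class totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


def majority_baseline_accuracy(matrix):
    """Accuracy obtained by always predicting the most frequent actual class."""
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)


# Confusion matrix copied from the Weka output above.
confusion = [
    [0, 0, 8, 0, 0, 0],
    [0, 0, 99, 0, 0, 0],
    [0, 0, 684, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 67, 0, 0, 0],
    [0, 0, 40, 0, 0, 0],
]

accuracy, kappa = accuracy_and_kappa(confusion)
baseline = majority_baseline_accuracy(confusion)
print(f"accuracy = {accuracy:.4%}, baseline = {baseline:.4%}, kappa = {kappa:.4f}")
```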

GAMS, matrix direct assignment

I want to assign values to a 3-D table in GAMS, but it doesn't seem to work the way it does in Matlab. Any ideas? The code is as follows, and the problem is in the last few lines:
Sets
n nodes / Sto , Lon , Par , Ber , War , Mad , Rom /
i scenarios / 1 * 4 /
k capacity level / L, N, H / ;
alias(n,m);
Table balance(n,i) traffic balance for different nodes
1 2 3 4
Sto 50 50 -50 -50
Lon -40 40 -40 40
Par 0 0 0 0
Ber 0 0 0 0
War 40 -40 40 -40
Mad 0 0 0 0
Rom -50 -50 50 50 ;
Scalar r fluctuation rate of the capacity level
/0.15/;
Parameter p(k) probability of each level
/ L 0.25
N 0.5
H 0.25 / ;
Table nor_cap(n,m) Normal capacity level from n to m
Sto Lon Par Ber War Mad Rom
Sto 0 11 14 25 30 0 0
Lon 11 0 21 0 0 14 0
Par 14 21 0 22 0 31 19
Ber 25 0 22 0 26 0 18
War 30 0 0 26 0 18 22
Mad 0 14 31 0 18 0 15
Rom 0 0 19 18 22 15 0 ;
Table max_cap(n,m,k) capacity level under each k
max_cap(n,m,'N')=nor_cap(n,m)
max_cap(n,m,'L')=nor_cap(n,m)*(1-r)
max_cap(n,m,'H')=nor_cap(n,m)*(1+r);
The final assignment to a 3-D matrix should be done with PARAMETER as opposed to TABLE: declare max_cap(n,m,k) as a PARAMETER, end that declaration with a semicolon, and then write the three assignments as separate statements, each terminated by its own semicolon. In general I would also note that TABLE is very restrictive (2-dimensional, text input inside the code). You might want to consider $GDXIN (or EXECUTE_LOAD) and some of the GAMS utilities for loading xls or csv files.
As a user of both MATLAB and GAMS, I would note that GAMS depends on "indices" for every array, but otherwise they can be quite similar. In your case max_cap(n,m,k) would be something like the maximum capacity from from_city to to_city under each capacity-level scenario. Your matrix needs to be declared as a PARAMETER, which can be any n-dimensional (indexed) matrix, including even a SCALAR.
Also, try the GAMS mailing list if you really need an answer quickly. The number of proficient GAMS users globally can't be more than a few thousand, so it might be hard to get a quick answer on StackOverflow, awesome as it is for the more common languages.
