Threshold for spark XGBoost Classification model - apache-spark-mllib

How do I set an optimal threshold for an XGBoost classifier? The default value used by the algorithm is 0.5. Is there any feature or built-in function I can use to change this?

If you are using Python, you are looking for the predict_proba() API instead of the usual predict() API. With predict_proba() you get class probabilities, which you can then map to either class using whatever threshold you choose.
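A minimal sketch with the Python xgboost package (the data names and the 0.4 cutoff are just placeholders):

import numpy as np
import xgboost as xgb

# X_train, y_train, X_test are assumed to already exist
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]   # probability of the positive class

threshold = 0.4                              # illustrative value, not a recommendation
preds = (proba >= threshold).astype(int)     # map probabilities to classes yourself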
Since you mentioned Spark MLlib, you might be using Scala or Java with xgboost4j. Options exist in that case as well; for example, in https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j/ml/dmlc/xgboost4j/scala/Booster.html#predict(data:ml.dmlc.xgboost4j.scala.DMatrix,outPutMargin:Boolean,treeLimit:Int):Array[Array[Float]] you are looking for outPutMargin.
To decide on a threshold, you can use the ROC curve, or evaluate your business outcome against the XGBoost scores, e.g. if all cases below a score of 0.8 are loss-making, you can set the threshold to 0.8.
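One way to pick a cutoff from the ROC curve is Youden's J statistic (a sketch with scikit-learn; y_true and proba are assumed to be your true labels and predicted probabilities):

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, proba)

# Youden's J: maximize TPR - FPR (one of several reasonable criteria)
best = np.argmax(tpr - fpr)
print("chosen threshold:", thresholds[best])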

Related

h2o AutoML use in class imbalance mode

I have a use case with a very imbalanced data set. I undersampled the training data set and tried running AutoML in H2O, but it gave me a great AUC (over 0.99) and a very bad AUCPR (0.09).
Is this related to the imbalance issue?
I ran with the weights_column option (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html) but it didn't help.
Should I use the balance_classes option instead? (When I run both options it fails with an "h2oFrame is empty" message.)
The train and test sets are split on a date-time range, and the test data set has the proper ratio between the majority and minority classes.
The large difference between AUC and AUCPR is most likely caused, as you suggest, by the class imbalance. You can either set balance_classes = True or provide a weights column that weights the minority class more, e.g. by the inverse of the class frequency. If you have a really small number of observations for the minority class, you can try to synthesise more, e.g. using SMOTE.
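For reference, a rough sketch of both options with the h2o Python API (the file name and column names are made up for illustration):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical training frame
train["label"] = train["label"].asfactor()

# Option 1: let H2O over/under-sample to balance the classes
aml = H2OAutoML(max_models=20, balance_classes=True,
                sort_metric="AUCPR", seed=1)
aml.train(y="label", training_frame=train)

# Option 2: build a weight column (e.g. inverse class frequency) and pass
# weights_column="weight" to train() instead of using balance_classes.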

H2O document question for stopping_tolerance, score_each_iteration, score_tree_interval, etc

I have the following questions that still confuse me after reading the H2O documentation. Can someone provide some explanation?
For stopping_tolerance = 0.001, let's use AUC as an example, with a current AUC of 0.8. Does that mean the AUC needs to increase to 0.8 + 0.001, or to 0.8 * (1 + 0.1%)?
For score_each_iteration, the H2O documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_each_iteration.html) just says "iteration". But what exactly is the definition of an "iteration": each tree, each grid search, each k-fold cross-validation, or something else?
Can I define score_tree_interval and set score_each_iteration = True at the same time, or can I only use one of them to make the grid search repeatable?
Is there any difference between putting 'stopping_metric', 'stopping_tolerance', and 'stopping_rounds' in H2OGradientBoostingEstimator versus in the search_criteria of H2OGridSearch? I found that putting them in H2OGradientBoostingEstimator makes the code run much faster when I test it in a Spark environment.
0.001 is the same as 0.1%. For AUC, since bigger is better, you will want to see an increase of at least 0.001 after the specified number of scoring rounds.
You have linked to a portion of the documentation that is specific to the algorithms listed under "Available in" at the top of the page, so let's answer this question with respect to individual models rather than grid search. If you want to see what is being scored at each iteration, take a look at your model results in Flow, or use my_model.plot() (for the Python API). For GBM and DRF the iteration corresponds to ntrees, but since different algorithms differ in what changes between scoring events, the more generic word "iteration" is used.
Did you test this out? What did you find when you did? Take a look at the scoring-history plot in Flow and notice what happens when you set both score_tree_interval and score_each_iteration = True versus when you only set score_tree_interval (I would recommend trying to understand these parameters at the individual-model level before you use grid search).
Yes, in one case you are specifying early stopping as you build an individual model; in the case of grid search you are indicating whether or not to build more models.
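For the individual-model case, a sketch of those parameters on a GBM (illustrative values; train and "label" are a hypothetical frame and response column):

from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=500,
    score_tree_interval=10,      # score every 10 trees
    stopping_metric="AUC",
    stopping_rounds=5,           # stop after 5 scoring events without improvement
    stopping_tolerance=0.001,    # "improvement" means at least 0.001 in AUC
    seed=1,
)
gbm.train(y="label", training_frame=train)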

AutoML Vision: Predictions include --other-- field

I have just trained a new model with a binary outcome (elite/non-elite). The model trained well, but when I tested a new image in the GUI it returned a third label, --other--. I am not sure how or why that has appeared. Any ideas?
When multi-class (single-label) classification is used, there is an assumption that the confidences of all predictions must sum to 1 (since exactly one valid label is assumed). This is achieved with a softmax function, which normalizes all predictions to sum to 1. That has some drawbacks: if all raw predictions are very low, for example "elite" at 0.0001 and Non_elite at 0.0002, then after normalization the predictions would be 0.333 and 0.666 respectively.
To work around that, the AutoML system allows an extra label (--other--) to indicate that none of the allowed predictions seems valid. This label is an implementation detail and shouldn't be returned by the system (it should be filtered out). This should get fixed in the near future.
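To see the effect the answer describes, here is the normalization step in isolation (plain renormalization for illustration; the service itself applies a softmax to the model outputs):

raw = {"elite": 0.0001, "Non_elite": 0.0002}    # both confidences are tiny
total = sum(raw.values())
normalized = {label: score / total for label, score in raw.items()}
print(normalized)   # {'elite': 0.333..., 'Non_elite': 0.666...}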

Get cross_validation_holdout_predictions() of models from a grid search

I'm trying to calculate performance in a different way than what is currently built into the models.
I would like to access the raw predictions made during cross-validation, so I can calculate performance on my own.
import h2o

g = h2o.get_grid(grid_id)
rrc = {}  # model_id -> cross-validated holdout predictions
for m in g.models:
    print("Model %s" % m.model_id)
    rrc[m.model_id] = m.cross_validation_holdout_predictions()
I could just run predictions with a model on my dataset, but I think that test might be biased because the model has already seen this data, or not? Can I take new predictions made on the same data set and use them to calculate performance?
I would like to access raw predictions during cross-validation, so I can calculate performance on my own.
If you want to calculate a custom metric on the cross-validated predictions, then set keep_cross_validation_predictions = True and you can access the raw predicted values using the .cross_validation_holdout_predictions() method like you have above.
Can I take new predictions made on the same data set and use it to calculate performance?
It sounds like you're asking if you can use only training data to estimate model performance? Yes, using cross-validation. If you set nfolds > 1, H2O will do cross-validation and compute a handful of cross-validated performance metrics for you. Also, if you tell H2O to save the cross-validated predictions, you can compute "cross-validated metrics" of your own.
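A sketch of that setup with a single H2O model (hypothetical frame and column names):

from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    nfolds=5,
    keep_cross_validation_predictions=True,   # required to retrieve the holdout predictions
    seed=1,
)
gbm.train(y="label", training_frame=train)

# one frame of out-of-fold predictions, row-aligned with the training frame
holdout = gbm.cross_validation_holdout_predictions()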

Some details about adjusting cascaded AdaBoost stage threshold

I have implemented the AdaBoost sequence algorithm and am currently trying to implement the so-called Cascaded AdaBoost, based on the original paper by P. Viola and M. Jones. Unfortunately I have some doubts connected with adjusting the threshold for one stage. As we can read in the original paper, the procedure is described in literally one sentence:
Decrease threshold for the ith classifier until the current cascaded classifier has a detection rate of at least d × D_{i-1} (this also affects F_i)
I am mainly unsure about the following:
What is the threshold? Is it the value of the expression 0.5 * sum(alpha), or only the 0.5 factor?
What should be the initial value of the threshold? (0.5?)
What does "decrease threshold" mean in detail? Do I need to iteratively select a new threshold, e.g. 0.5, 0.4, 0.3? What is the step of decreasing?
I have tried to search for this information on Google, but unfortunately I could not find anything useful.
Thank you for your help.
I had the exact same doubt and have not found any authoritative source so far. However, this is my best guess on the issue:
1. (0.5 * sum(alpha)) is the threshold.
2. The initial value of the threshold is the expression above. Next, try to classify the samples using the intermediate strong classifier (what you currently have). You'll get the scores each of the samples attains, and depending on the current value of the threshold, some of the positive samples will be classified as negative, etc. So, depending on the detection rate desired for this stage (strong classifier), reduce the threshold so that that many positive samples are correctly classified.
For example, say the threshold was 10, and these are the current classifier outputs for the positive training samples:
9.5, 10.5, 10.2, 5.4, 6.7
If I want a detection rate of 80%, then 80% of the above 5 samples must be classified correctly, i.e. 4 of them, so I set the threshold to 6.7.
Clearly, by changing the threshold, the FP rate also changes, so update that as well, and if the desired FP rate for the stage is not reached, add another classifier to that stage.
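A rough sketch of that selection rule (purely illustrative, not taken from the paper; pos_scores are the strong-classifier outputs on the positive samples):

import math

def adjust_stage_threshold(pos_scores, target_detection_rate):
    # accept the k highest-scoring positives, where k meets the detection rate;
    # the new threshold is the score of the lowest accepted positive
    k = math.ceil(target_detection_rate * len(pos_scores))
    return sorted(pos_scores, reverse=True)[k - 1]

print(adjust_stage_threshold([9.5, 10.5, 10.2, 5.4, 6.7], 0.8))   # prints 6.7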
I have not taken a formal course on AdaBoost, etc., but this is my observation based on some research papers I tried to implement. Please correct me if something is wrong. Thanks!
I have found a Master's thesis on real-time face detection by Karim Ayachi (pdf) in which he describes the Viola-Jones face detection method.
As it is written in Section 5.2 (Creating the Cascade using AdaBoost), we can set the maximal threshold of the strong classifier to sum(alpha) and the minimal threshold to 0 and then find the optimal threshold using binary search (see Table 5.1 for pseudocode).
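A sketch of that binary-search idea (names and tolerance are made up; pos_scores are the strong-classifier outputs on the positive samples):

def find_stage_threshold(pos_scores, target_rate, sum_alpha, tol=1e-6):
    lo, hi = 0.0, sum_alpha          # search range: [0, sum(alpha)]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        rate = sum(s >= mid for s in pos_scores) / float(len(pos_scores))
        if rate >= target_rate:
            lo = mid                 # detection rate still high enough, try a stricter threshold
        else:
            hi = mid                 # too strict, relax the threshold
    return lo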
Hope this helps!
