ROC on multiple test sets in h2o (python) - h2o

I had a use-case that I thought was really simple but couldn't find a way to do it with h2o. I thought you might know.
I want to train my model once, and then evaluate its ROC on a few different test sets (e.g. a validation set and a test set, though in reality I have more than 2) without having to retrain the model. The way I know to do it now requires retraining the model each time:
train, valid, test = fr.split_frame([0.2, 0.25], seed=1234)
rf_v1 = H2ORandomForestEstimator( ... )
rf_v1.train(features, var_y, training_frame=train, validation_frame=valid)
roc = rf_v1.roc(valid=1)
rf_v1.train(features, var_y, training_frame=train, validation_frame=test) # training again with the same training set - can I avoid this?
roc2 = rf_v1.roc(valid=1)
I can also use model_performance(), which gives me some metrics on an arbitrary test set without retraining, but not the ROC. Is there a way to get the ROC out of the H2OModelMetrics object?
Thanks!

You can use the h2o flow to inspect the model performance. Simply go to: http://localhost:54321/flow/index.html (if you changed the default port change it in the link); type "getModel "rf_v1"" in a cell and it will show you all the measurements of the model in multiple cells in the flow. It's quite handy.
If you are using Python, you can find the performance in your IDE like this:
rf_perf1 = rf_v1.model_performance(test)
and then print the ROC like this:
print (rf_perf1.auc())

Yes, indirectly. Get the TPRs and FPRs from the H2OModelMetrics object:
out = rf_v1.model_performance(test)
fprs = out.fprs
tprs = out.tprs
roc = zip(fprs, tprs)
(By the way, my H2ORandomForestEstimator object does not seem to have an roc() method at all, so I'm not 100% sure that this output is in the exact same format. I'm using h2o version 3.10.4.7.)

Related

What's difference RobertaModel, RobertaSequenceClassification (hugging face)

I try to use hugging face transformers api.
As I import library , I have some questions. If anyone who know the answer, please tell me your knowledge.
transformers library have several models that are trained. transformers provide not only bare model like 'BertModel, RobertaModel, ... but also convenient heads like 'ModelForMultipleChoice' , 'ModelForSequenceClassification', 'ModelForTokenClassification' , ModelForQuestionAnswering.
I wonder what's difference between bare model adding new linear transformation myself and modelforsequenceclassification.
what's different custom model (pretrained model with random intialized linear) and transformers modelforsequenceclassification.
is ModelforSequenceClassification trained from glue data?
I look forward to someone's reply Thanks.
I think it's easiest to understand if we have a look at the actual implementation, where I randomly chose RobertaModel and RobertaForSequenceClassification as an example. However, the conclusion is valid for all other models, too.
You can find the implementation for RobertaForSequenceClassification here, which looks roughly like this:
class RobertaForSequenceClassification(RobertaPreTrainedModel):
authorized_missing_keys = [r"position_ids"]
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.roberta = RobertaModel(config, add_pooling_layer=False)
self.classifier = RobertaClassificationHead(config)
self.init_weights()
[...]
def forward([...]):
[...]
As we can see, there is no indication about the pretraining here, and it simply adds another linear layer on top (the implementation of the RobertaClassificationHead can be found a bit further down, namely here):
class RobertaClassificationHead(nn.Module):
"""Head for sentence-level classification tasks."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, features, **kwargs):
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
return x
So, to answer your question: These models come without any pretrained additional layers on top, and you could easily implement them yourself*.
Now for the asterisk: While it could be easy to wrap this yourself, also note that it is an inherited class RobertaPreTrainedModel. This has several advantages, the most important one being a consistent design between different implementations (sequence classification model, sequence tagging model, etc.). Further, there are some neat functionalities that they are providing, like the forward call including extensive parameters (padding, masking, attention output, ...), which would cost quite some time to implement.
Last but not least, there are existing trained models based on these specific implementations, which you can search for on the Huggingface Model Hub. There, you might find models that are fine-tuned on a sequence classification task (e.g., this one), and then directly load its weights in a RobertaForSequenceClassification model. If you had your own implementation of a sequence classification model, loading and aligning these pre-trained weights would be incredibly more complicated.
I hope this answers your main concern, but feel free to elaborate (either as comment or new question) on any points that have not been addressed!

Change lgbm internal parameter (threshold) by hand

I have trained a model with lgbm. I can dump its interval values with
booster.dump_model()
and see all the internal parameters that has been optimized during the training (leaf values, threshold, index of the variables for each split, ...). For testing purpose I would like to change some. Is there a way? I guess that changing just the output of dump_model will do nothing.
You can save your model to a human-understandable format using
booster.save_model('model.txt'), do your modifications on model.txt, and load back the modified model using modified_booster = lightgbm.Booster(model_file='model.txt').
I hope it helps!

Why the result different everytime after I run the same query in spark?

I use collaborative filtering in mllib to train an ALS model on my spark cluster. I use python api.
als = ALS(maxIter=10, rank=10, regParam=0.01, userCol="member_srl", itemCol="productid", ratingCol="rating")
model = als.fit(training)
pred = model.transform(test)
but when I check the pred, it seems it is changing.
If I run the query pred.count(), the result is not steady with a little variance. So why the result is always changing.
for example:
pred.count()
4219257
run it again:
pred.count()
4220723
run it again:
pred.count()
4222431
ALS is an algorithm that is initialized with random numbers. So a little variance is to be expected. Try it with an explicit seed number, like:
als.setSeed(123)

Spark RDD into Matrix

I have an RDD like:
(A,AA,1)
(A,BB,0)
(A,CC,0)
(B,AA,2)
(B,BB,1)
(B,CC,4)
and I want to convert it into the following RRD:
([1,0,0],[2,1,4])
the order is important for me since the main propose is using RowMatrix to convert the second RDD to a matrix.
Your need to be careful with the wording, when you ask for a Matrix, do you mean something like the spark.mllib.matrix ? If so, you will need to follow very specific instructions to create one. However, it seems to me that your problem can be solved in a much easier way. Just using zipWithIndex with groupBy
//Here is how I see it
val test = sc.parallelize(Array(("A","AA",1),("A","BB",0),("A","CC",0),("B","AA",2),("B","BB",1),("B","CC",4))).zipWithIndex
val grouptest = test.groupBy(_._1._1).map(x=>(Vectors.dense(x._2.map(y=>(y._2,y._1._3)).toArray.sortBy(_._1).map(z=>z._2.toDouble))))
In your example, you seem to want the result as a vector? So I used spark's Vector (which by the way, only allows Doubles).
Result looks like:
[1.0,0.0,0.0]
[2.0,1.0,4.0]

How to implement custom Gibbs sampling scheme in pymc

I have a hidden Markov stochastic volatility model (represented as a linear state space model). I am using a hand-written Gibbs sampling scheme to estimate parameters for the model. The actual sampler requires some fairly sophisticated update rules that I believe I need to write by hand. You can see an example of a Julia version of these update rules here.
My question is the following: how can I specify the model in a custom way and then hand the job of running the sampler and collecting the samples to pymc? In other words, I am happy to provide code to do all the heavy lifting (how to update each block of parameters on each scan -- utilizing full conditionals within each block), but I want to let pymc handle the "accounting" for me.
I realize that I will probably need to provide more information so that others can answer this question. The problem is I am not sure exactly what information will be useful. So, if you feel you can help me out with this, but need more information -- please let me know in a comment and I will update the question.
Here is an example of a custom sampler in PyMC2:
class BDSTMetropolis(mc.Metropolis):
def __init__(self, stochastic):
mc.Metropolis.__init__(self, stochastic, scale=1., proposal_sd='custom',
proposal_distribution='custom', verbose=None, tally=False)
def propose(self):
T = self.stochastic.value
T.u_new, T.v_new = T.edges()[0]
while T.has_edge(T.u_new, T.v_new):
T.u_new, T.v_new = random.choice(T.base_graph.edges())
T.path = nx.shortest_path(T, T.u_new, T.v_new)
i = random.randrange(len(T.path)-1)
T.u_old, T.v_old = T.path[i], T.path[i+1]
T.remove_edge(T.u_old, T.v_old)
T.add_edge(T.u_new, T.v_new)
self.stochastic.value = T
def reject(self):
T = self.stochastic.value
T.add_edge(T.u_old, T.v_old)
T.remove_edge(T.u_new, T.v_new)
self.stochastic.value = T
It pretty different than your model, but it should demonstrate all the parts. Does that give you enough to go on?

Resources