How to get the highest predicted value in multiclass classification problem using H2O AI? - h2o

When predicting values in a multiclass classification problem, I would like to get the probability of the predicted value.
I tried to solve this by using H2O's apply function:
predicted_df = modelo_assessor.predict(to_predict_h2o_frame)
predicted_df.apply((lambda x: x.max()), axis=1)
But it does not work:
'ValueError: unimpl bytecode instr: CALL_METHOD'
Maybe it doesn't work because h2o's max does not take an axis parameter the way h2o's mean does?
I couldn't find documentation on which operations are supported inside the apply function.
I would like to solve the problem using h2o data manipulation similarly to this pandas code:
predicted_df = modelo_assessor.predict(to_predict_h2o_frame).as_data_frame()
predicted_df['PROB_PREDICTED'] = predicted_df.iloc[:, 1:].max(axis=1)

I was able to solve the problem by downgrading to Python 3.6.x. The error happens whenever apply is used under Python 3.7: CALL_METHOD is a bytecode instruction introduced in Python 3.7, and the lambda bytecode parser in this version of h2o (astfun.py, shown in the traceback below) does not handle it. Even the example from the H2O documentation fails:
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
python_lists = [[1,2,3,4], [1,2,3,4]]
h2oframe = h2o.H2OFrame(python_obj=python_lists,
                        na_strings=['NA'])
colMean = h2oframe.apply(lambda x: x.mean(), axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-43-8da6b76c71bd> in <module>
2 h2oframe = h2o.H2OFrame(python_obj=python_lists,
3 na_strings=['NA'])
----> 4 colMean = h2oframe.apply(lambda x: x.mean(), axis=0)
~/anaconda3/envs/h2o1/lib/python3.7/site-packages/h2o/frame.py in apply(self, fun, axis)
4910 assert_is_type(fun, FunctionType)
4911 assert_satisfies(fun, fun.__name__ == "<lambda>")
-> 4912 res = lambda_to_expr(fun)
4913 return H2OFrame._expr(expr=ExprNode("apply", self, 1 + (axis == 0), *res))
4914
~/anaconda3/envs/h2o1/lib/python3.7/site-packages/h2o/astfun.py in lambda_to_expr(fun)
133 code = fun.__code__
134 lambda_dis = _disassemble_lambda(code)
--> 135 return _lambda_bytecode_to_ast(code, lambda_dis)
136
137 def _lambda_bytecode_to_ast(co, ops):
~/anaconda3/envs/h2o1/lib/python3.7/site-packages/h2o/astfun.py in _lambda_bytecode_to_ast(co, ops)
147 body, s = _opcode_read_arg(s, ops, keys)
148 else:
--> 149 raise ValueError("unimpl bytecode instr: " + instr)
150 if s > 0:
151 print("Dumping disassembled code: ")
ValueError: unimpl bytecode instr: CALL_METHOD
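If downgrading Python is not an option, a workaround that avoids apply entirely is the pandas route already sketched in the question: convert the prediction frame to pandas and take the row-wise maximum over the per-class probability columns. A minimal sketch, assuming a trained multiclass model modelo_assessor and an H2OFrame to_predict_h2o_frame as in the question:
predicted_df = modelo_assessor.predict(to_predict_h2o_frame).as_data_frame()
# column 0 ('predict') holds the predicted label; the remaining columns hold the
# per-class probabilities, so the row-wise max is the probability of the predicted class
predicted_df['PROB_PREDICTED'] = predicted_df.iloc[:, 1:].max(axis=1)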

Related

Error running healpy.sphtfunc.alm2map with alm size different from l_max required

I run healpy.sphtfunc.alm2map, passing alm as an array generated with l_max = 256 and asking for an output map with l_max = 191, but the function does not seem to accept the new l_max correctly.
Here is example code of what I am doing, with artificially generated alms:
import healpy as hp
import numpy as np
#nside and lmax
nside_high = 128
lmax_high= 2*nside_high
nside_low = 64
lmax_low= 3*nside_low-1
#Cl
Cl = np.ones(lmax_high)
#Alm
Alm = hp.synalm(Cl, lmax = lmax_high)
#Map
Map = hp.alm2map(Alm, nside = nside_low, lmax = lmax_low)
I get this error:
ValueError Traceback (most recent call last)
/var/folders/k_/2j08yy711tb17zmmsznfx1zh0000gn/T/ipykernel_9189/2207532576.py in <cell line: 18>()
16
17 #Map
---> 18 Map = hp.alm2map(Alm, nside = nside_l, lmax = lmax_l)
/opt/anaconda3/lib/python3.8/site-packages/astropy/utils/decorators.py in wrapper(*args, **kwargs)
552 warnings.warn(msg, warning_type, stacklevel=2)
553
--> 554 return function(*args, **kwargs)
555
556 return wrapper
/opt/anaconda3/lib/python3.8/site-packages/healpy/sphtfunc.py in alm2map(alms, nside, lmax, mmax, pixwin, fwhm, sigma, pol, inplace, verbose)
502 mmax = -1
503 if pol:
--> 504 output = sphtlib._alm2map(
505 alms_new[0] if lonely else tuple(alms_new), nside, lmax=lmax, mmax=mmax
506 )
ValueError: Wrong alm size.
Can anyone help?
The purpose of the lmax argument in alm2map is just to handle input alms that have mmax != lmax; it cannot be used to clip the alms.
Currently the easiest way to clip alms is:
alm_clipped = hp.almxfl(Alm, np.ones(lmax_low+1))
Map = hp.alm2map(alm_clipped, nside = nside_low)
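Putting the question's setup together with this workaround, a minimal sketch (same variable names as above):
import healpy as hp
import numpy as np

nside_high, nside_low = 128, 64
lmax_high = 2 * nside_high      # lmax of the generated alms
lmax_low = 3 * nside_low - 1    # lmax wanted for the output map

Cl = np.ones(lmax_high)
Alm = hp.synalm(Cl, lmax=lmax_high)

# almxfl multiplies the alms by a function of l that is taken to be zero where it
# is not defined, so a filter of ones of length lmax_low + 1 zeroes out every
# multipole above lmax_low
alm_clipped = hp.almxfl(Alm, np.ones(lmax_low + 1))
Map = hp.alm2map(alm_clipped, nside=nside_low)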
The next version of healpy will have a dedicated function resize_alm:
https://github.com/healpy/healpy/pull/803
In the future it would be nice for alm2map to use that function to handle this automatically; I opened an issue to track it:
https://github.com/healpy/healpy/issues/817

Fitting Lightgbm distributed with lgb.train hangs

I'm trying to learn how to use lightgbm distributed.
I wrote a simple hello-world kind of program: I use the iris dataset with 150 rows and split it into train (100 rows) and test (50 rows). Then the train and test sets are each further split into two parts, and each part is fed to one of two machines with the appropriate rank.
The problem I see is that lgb.train hangs.
Here is the code:
import argparse
import logging
import lightgbm as lgb
import pandas as pd
from sklearn import datasets
import socket

print('lightgbm', lgb.__version__)

HOST = socket.gethostname()
ip_address = socket.gethostbyname(HOST)
print("IP=", ip_address)

# looks like lightgbm operates only with ip addresses
IPS = ['10.121.22.166', '10.121.22.83']
assert ip_address in IPS

logger = logging.getLogger(__name__)

pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 10000)
pd.set_option('max_colwidth', 100)
pd.set_option('precision', 5)


def read_train_data(rank):
    iris = datasets.load_iris()
    iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

    partition = rank
    assert partition < 2

    separate = 100
    train_df = iris_df.iloc[:separate]
    test_df = iris_df.iloc[separate:]

    separate_train = 60
    separate_test = 30
    if partition == 0:
        train_df = train_df.iloc[:separate_train]
        test_df = test_df.iloc[:separate_test]
    else:
        train_df = train_df.iloc[separate_train:]
        test_df = test_df.iloc[separate_test:]

    def get_lgb_dataset(df):
        target_column = df.columns[-1]
        columns = df.columns[:-1]
        assert target_column not in columns
        print('Target column', target_column)

        x = df[columns]
        y = df[target_column]
        print(x)

        ds = lgb.Dataset(free_raw_data=False, data=x, label=y, params={
            "enable_bundle": False
        })
        ds.construct()
        return ds

    dtrain = get_lgb_dataset(train_df)
    dtest = get_lgb_dataset(test_df)
    return dtrain, dtest


def train(args):
    port0 = 56456

    rank = IPS.index(ip_address)
    print("Rank=", rank, HOST)
    print("RR", rank)

    dtrain, dtest = read_train_data(rank=rank)

    params = {'boosting_type': 'gbdt',
              'class_weight': None,
              'colsample_bytree': 1.0,
              'importance_type': 'split',
              'learning_rate': 0.1,
              'max_depth': 2,
              'min_child_samples': 20,
              'min_child_weight': 0.001,
              'min_split_gain': 0.0,
              'n_estimators': 1,
              'num_leaves': 31,
              'objective': 'regression',
              'metric': 'rmse',
              'random_state': None,
              'reg_alpha': 0.0,
              'reg_lambda': 0.0,
              'silent': False,
              'subsample': 1.0,
              'subsample_for_bin': 200000,
              'subsample_freq': 0,
              'tree_learner': 'data_parallel',
              'num_threads': 48,
              'machines': ','.join([f'{machine}:{port0}' for i, machine in enumerate(IPS)]),
              'local_listen_port': port0,
              'time_out': 120,
              'num_machines': len(IPS)
              }
    print(params)

    logging.info("starting to train lgb at node with rank %d", rank)
    evals_result = {}
    if args.scikit == 1:
        print("Using scikit learn")
        bst = lgb.sklearn.LGBMRegressor(**params)
        bst.fit(
            dtrain.data,
            dtrain.label,
            eval_set=[(dtest.data, dtest.label)],
        )
    else:
        print("Using regular LGB")
        bst = lgb.train(params,
                        dtrain,
                        valid_sets=[dtest],
                        evals_result=evals_result)
    print(evals_result)

    logging.info("finish xgboost training at node with rank %d", rank)
    return bst


def main(args):
    logging.info("starting the train job")
    model = train(args)

    pd.set_option('display.max_rows', 500)
    print("OUT", model.__class__)
    try:
        print(model.trees_to_dataframe())
    except:
        print(model.booster_.trees_to_dataframe())


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--scikit',
        help='scikit',
        default=0,
        type=int,
    )
    main(parser.parse_args())
I can run it with the scikit fit interface by running python simple_distributed_lgb_test.py --scikit 1 on the two machines. It produces a reasonable result.
However, when I use --scikit 0 (which uses lgb.train), fitting just hangs on both nodes. The last messages before it hangs are:
[LightGBM] [Info] Total Bins 22
[LightGBM] [Info] Number of data points in the train set: 40, number of used features: 2
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Start training from score 0.873750
Is that a bug or expected behavior? dask.py in lightgbm does use the scikit-learn fit interface.
I use an overnight master version, 3.2.1.99 (commit 5b7a6f3e7150aeb704d1dd2b852d246af3e913a3, from Jul 12).
UPDATE 1
I'm trying to dig into the code. So far I see a few things:
The scikit fit interface appears to have an extra synchronization step before fitting the first tree; lgb.train doesn't have it. I don't know yet where it comes from. (I see some Network::Allreduce operations.)
With the scikit interface the workers appear to be synchronized: each worker knows the correct sizes of the blocks to send and receive during the ReduceScatter operations. For example, on the first allreduce, worker1 sends 208 blocks and receives 368 blocks of data (in Linkers::SendRecv), while worker2 is the reverse: it sends 368 and receives 208. So the allreduce completes fine.
With lgb.train, on the contrary, the workers are not synchronized: each worker has its own counts of blocks to send and receive for the ReduceScatter at the first DataParallelTreeLearner::FindBestSplits, and they don't match. Worker1 sends 208 and wants to receive 400; worker2 sends 192 and wants to receive 176. So the worker that wants to receive more just hangs, and eventually the other worker hangs too.
Possibly it has something to do with lgb.Dataset; it may need to have the same bins on both workers or something. I tried to force that with the forcedbins_filename parameter, but it doesn't seem to help with lgb.train.
UPDATE 2
Success. If I remove the following line from the example:
ds.construct()
everything works. So I guess we can't call construct() on the Dataset when using distributed training.
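For reference, a trimmed sketch of the dataset helper with that line removed (prints and asserts from the original omitted):
import lightgbm as lgb

def get_lgb_dataset(df):
    # target is the last column, the remaining columns are the features
    target_column = df.columns[-1]
    feature_columns = df.columns[:-1]

    # Note: no ds.construct() here. As described in the updates above, pre-constructing
    # the Dataset on each worker makes lgb.train hang, presumably because the bins are
    # then built locally instead of being synchronized across machines; leaving the
    # Dataset unconstructed lets lgb.train handle that during distributed setup.
    return lgb.Dataset(
        data=df[feature_columns],
        label=df[target_column],
        free_raw_data=False,
        params={"enable_bundle": False},
    )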

KeyError when using non-default models in Huggingface transformers pipeline

I have no problems using the default model in the sentiment analysis pipeline.
# Allocate a pipeline for sentiment-analysis
nlp = pipeline('sentiment-analysis')
nlp('I am a black man.')
>>>[{'label': 'NEGATIVE', 'score': 0.5723695158958435}]
But when I try to customise the pipeline a little by adding a specific model, it throws a KeyError.
nlp = pipeline('sentiment-analysis',
               tokenizer=AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
               model=AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
nlp('I am a black man.')
>>>---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-55-af7e46d6c6c9> in <module>
3 tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
4 model = AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
----> 5 nlp('I am a black man.')
6
7
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in <listcomp>(.0)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
KeyError: 58129
I am facing the same problem. I am working with a model from XLM-R fine-tuned on the squadv2 data set ("a-ware/xlmroberta-squadv2"). In my case, the KeyError is 16.
Looking for help on the issue, I found the following information (link); I hope you find it helpful.
Answer (from the link)
The pipeline throws an exception when the model predicts a token that is not part of the document (e.g. final special token [SEP])
My problem:
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
from transformers import pipeline
nlp = pipeline('question-answering',
               model=XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2'),
               tokenizer=XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2'))
nlp(question = "Who was Jim Henson?", context ="Jim Henson was a nice puppet")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-b5a8ece5e525> in <module>()
1 context = "Jim Henson was a nice puppet"
2 # --------------- CON INTERROGACIONES
----> 3 nlp(question = "Who was Jim Henson?", context =context)
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
1745 ),
1746 }
-> 1747 for s, e, score in zip(starts, ends, scores)
1748 ]
1749
KeyError: 16
Solution 1: Adding punctuation at the end of the context
In order to avoid the bug of trying to extract the final token (which may be a special one such as [SEP]), I added an element (in this case a punctuation mark) at the end of the context:
nlp(question = "Who was Jim Henson?", context ="Jim Henson was a nice puppet.")
[OUT]
{'answer': 'nice puppet.', 'end': 28, 'score': 0.5742837190628052, 'start': 17}
Solution 2: Do not use pipeline()
The original model can retrieve the correct token's index by itself.
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
import torch
tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
Update
Looking at your case in more detail, I found that the default model for the Conversational task in the pipeline is distilbert-base-cased (source code).
The first solution I posted is not a good solution, indeed; trying other questions I got the same error. However, the model itself works fine outside the pipeline (as I showed in solution 2). Thus, I believe that not all models can be used in the pipeline. If anyone has more information about it, please help us out. Thanks.

How to calculate shap values for ADABoost model?

I am running 3 different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF but not for ADA, which fails with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on Git that states:
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is add another if statement in the TreeEnsemble constructor similar to the one for gradient boosting.
But I really don't know how to implement it since I'm quite new to this.
I had the same problem, and what I did was modify the file mentioned in the git issue you are referring to.
In my case I use Windows, so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can also double-click on the error message and the file will open.
The next step is to add another elif, as the answer in the git issue says. In my case I did it from line 404, as follows:
1) Modify the source code.
...
            self.objective = objective_name_map.get(model.criterion, None)
            self.tree_output = "probability"
        elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"):  # From this line I have modified the code
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
            self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line is done to get the decision criteria, for example gini.
            self.tree_output = "probability"  # This is the last line I added
        elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"):  # TODO: add unit test for this case
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note that for the other models, the shap code needs the attribute 'criterion', which the AdaBoost classifier does not expose directly. So in this case the attribute is obtained from the "weak" classifiers the AdaBoost was trained with; that's why I added model.base_estimator_.criterion.
Finally, you have to import the library again, train your model, and get the shap values. I leave an example:
2) Import the library again and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the SHAP summary bar plot.
3) Get your new results:
It seems that the shap package has been updated and still does not contain the AdaBoostClassifier. Based on the previous answer, I've modified it to work with the shap/explainers/tree.py file at lines 598-610:
        ### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
        ### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
        ### https://github.com/slundberg/shap/issues/335
        elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weighted_boosting.AdaBoostClassifier"]):
            assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
            self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
            self.input_dtype = np.float32
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
            self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line is done to get the decision criteria, for example gini.
            self.tree_output = "probability"  # This is the last line added
I'm also working on tests so this can be added to the package :)

RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2

I have 4 GPUs (0,1,2,3) and I want to run one Jupyter notebook on GPU 2 and another one on GPU 0. Thus, after executing,
export CUDA_VISIBLE_DEVICES=0,1,2,3
for the GPU 2 notebook I do,
device = torch.device( f'cuda:{2}' if torch.cuda.is_available() else 'cpu')
device, torch.cuda.device_count(), torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.get_device_properties(1)
and after creating a new model or loading one,
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to(device)
Then, when I start training the model, I get,
RuntimeError Traceback (most recent call last)
<ipython-input-18-849ffcb53e16> in <module>
46 with torch.set_grad_enabled( phase == 'train'):
47 # [N, Nclass, H, W]
---> 48 prediction = model(X)
49 # print( prediction.shape, y.shape)
50 loss_matrix = criterion( prediction, y)
~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)
~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
144 raise RuntimeError("module must have its parameters and buffers "
145 "on device {} (device_ids[0]) but found one of "
--> 146 "them on device: {}".format(self.src_device_obj, t.device))
147
148 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
DataParallel requires every input tensor to be provided on the first device in its device_ids list.
It basically uses that device as a staging area before scattering to the other GPUs, and it is the device where the final outputs are gathered before forward returns. If you want device 2 to be the primary device, you just need to put it at the front of the list as follows:
model = nn.DataParallel(model, device_ids = [2, 0, 1, 3])
model.to(f'cuda:{model.device_ids[0]}')
After which all tensors provided to model should be on the first device as well.
x = ... # input tensor
x = x.to(f'cuda:{model.device_ids[0]}')
y = model(x)
This error can also happen when using torch if the model and the data are not both on CUDA.
Try some code like this to put the model and the data on CUDA:
model = model.to('cuda')
images = images.to('cuda')
For me even the following works:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    network = nn.DataParallel(network)
network.to(device)
tnsr = tnsr.to(device)
