Vowpal Wabbit python wrapper empty prediction file - vowpalwabbit

The prediction file is empty when Vowpal Wabbit is called through the pyvw wrapper.
For example, I am doing something like this:
vw = pyvw.vw(" -i cb.model --cb_explore 50 --cover 10 -p prediction.txt")
ex = vw.example(" | label label2")
vw.predict(ex)
vw.finish()
ex.finish()
This creates prediction.txt but does not write anything to it.
I would greatly appreciate any guidance.
Thank you!

This snippet might help you:
from vowpalwabbit import pyvw
def to_vw(clf, text, str_label):
    # Build a VW-format line: "<label> |f <text>"
    vw_example = '{} |f {} '.format(str_label, text)
    return clf.example(vw_example)
clf = vw = pyvw.vw(
    loss_function='logistic', oaa=2,
    link='logistic', raw_predictions='output.txt'
)
ex = to_vw(clf, 'I like vowpal wabbit. But not that much.', '1')
clf.learn(ex)
clf.predict(ex, labelType=pyvw.pylibvw.vw.lMulticlass)
You should have the probabilities written in output.txt file.

Related

Python plot multiple z-test result with confidence interval (visualize A/B test results)

[Image: what I want to plot]
Hi, I want to visualize results for one A/B test. The experiment tracks 4 metrics, and I want to show them in one plot altogether. The schema of my dataframe is:
test_control | metric1 | metric2 | metric3 | metric4
Does anyone know how to plot this with matplotlib, pandas, or seaborn?
Thanks in advance!
I found this is probably easier to do in R.
In Python, I calculated the error bars and then used matplotlib.pyplot.errorbar to plot them:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# get CI (df is the dataframe from the question)
metrics = ['metric1', 'metric2', 'metric3', 'metric4']
# every list must end up the same length, otherwise pd.DataFrame raises
kpi_map = {'kpi': [], 'mean_diff': [], 'err': []}
for col in metrics:
    sp1 = df.loc[df['test_control'] == 'test'][col]
    sp2 = df.loc[df['test_control'] == 'control'][col]
    std1 = np.std(sp1, ddof=1)
    std2 = np.std(sp2, ddof=1)
    # standard error of the difference between the two group means
    mean_diff_std = (std1**2 / len(sp1) + std2**2 / len(sp2)) ** 0.5
    mean_diff = sp1.mean() - sp2.mean()
    kpi_map['kpi'].append(col)
    kpi_map['mean_diff'].append(mean_diff)
    kpi_map['err'].append(1.96 * mean_diff_std)  # half-width of the 95% CI
# plot
df_kpi = pd.DataFrame(data=kpi_map)
plt.errorbar(y=df_kpi['kpi'], x=df_kpi['mean_diff'], xerr=df_kpi['err'], fmt='o', elinewidth=2, capsize=4, capthick=2)
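The interval computed in the answer above is a normal-approximation 95% confidence interval for a difference of two means. As a minimal sketch of that one step, using only the standard library, the function below returns the difference, the half-width of the interval, and a two-sided z-test p-value (the answer does not compute a p-value). The name mean_diff_ci and the sample numbers are made up for illustration.

```python
import math
from statistics import mean, variance


def mean_diff_ci(test, control, z=1.96):
    """Return (mean difference, CI half-width, two-sided p-value).

    Uses the normal approximation: the standard error of the difference
    is sqrt(s1^2/n1 + s2^2/n2), with sample variances (ddof=1).
    """
    diff = mean(test) - mean(control)
    se = math.sqrt(variance(test) / len(test) + variance(control) / len(control))
    z_stat = diff / se
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return diff, z * se, p_value


# Illustrative data only, not from the question
diff, half_width, p_value = mean_diff_ci([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(diff, half_width, p_value)
```

The half-width can be passed directly as xerr to plt.errorbar, one value per metric.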

How to calculate shap values for ADABoost model?

I am running 3 different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF, but for AdaBoost I get the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on GitHub that states:
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is add another if statement in the TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I am quite new to this.
I had the same problem, and what I did was modify the file discussed in the GitHub issue you mention.
In my case I use Windows, so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can double-click the error message and the file will open.
The next step is to add another elif, as the answer in the GitHub issue says. In my case I added it at line 404, as follows:
1) Modify the source code.
...
    self.objective = objective_name_map.get(model.criterion, None)
    self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"): #From this line I have modified the code
    scaling = 1.0 / len(model.estimators_) # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
    self.tree_output = "probability" #This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"): # TODO: add unit test for this case
    scaling = 1.0 / len(model.estimators_) # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note that for the other models, the shap code reads the attribute 'criterion', which the AdaBoost classifier does not expose directly. In this case the attribute is taken from the "weak" classifiers that AdaBoost was trained with, which is why I use model.base_estimator_.criterion.
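The elif in the patch matches on the string form of the model's class, and that string depends on the installed scikit-learn version: the traceback in the question shows sklearn.ensemble._weight_boosting, with a leading underscore, while the patch matches sklearn.ensemble.weight_boosting. Before editing tree.py it may help to print what your own model's class looks like. The helper below is a hypothetical convenience, not part of shap, and is shown on a standard-library class so it runs without scikit-learn.

```python
from collections import OrderedDict


def class_repr_and_path(obj):
    """Return (str(type(obj)), "module.ClassName") for any object."""
    cls = type(obj)
    return str(cls), f"{cls.__module__}.{cls.__qualname__}"


# Stand-in object; pass your fitted AdaBoost model here instead
text, path = class_repr_and_path(OrderedDict())
print(text)  # the form an endswith(...) check on str(type(model)) sees
print(path)  # the dotted module path plus class name
```

Whatever the first line prints for your model is the string the endswith check has to match.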
Finally, import the library again, train your model, and get the SHAP values. Here is an example:
2) Import the library again and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
3) Get your new results:
[Image: SHAP summary bar plot]
It seems that the shap package has been updated and still does not support AdaBoostClassifier. Based on the previous answer, I adapted the change to work with the shap/explainers/tree.py file, at lines 598-610:
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weight_boosting.AdaBoostClassifier"]):
    assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
    self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
    self.input_dtype = np.float32
    scaling = 1.0 / len(model.estimators_) # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
    self.tree_output = "probability" #This is the last line added
Also working on testing to add this to the package :)

H2O GLM model: saved MOJO's prediction is very different when running on the same validation data

I built a GLM model using H2O (ver 3.14) in R. Please note that the training data contains integers and many NAs, which I handle with MeanImputation.
glm <- h2o.glm(
  training_frame = train.truth,
  x = getColNames(train.truth),
  y = "isFemale",
  family = "binomial",
  missing_values_handling = "MeanImputation",
  seed = 1000000)
I then use a validation data set to look at the perf, and the Precision looks good to me:
h2o.performance(glm, newdata=valid.truth)%>% h2o.confusionMatrix()
Confusion Matrix (vertical: actual; across: predicted) for max f1 # threshold = 0.529384526696015:
            0      1     Error          Rate
0       41962    300  0.007099    =300/42262
1         863  13460  0.060253    =863/14323
Totals  42825  13760  0.020553   =1163/56585
I then saved the model as a MOJO:
h2o.download_mojo(glm, path="models/mojo", get_genmodel_jar=TRUE)
I exported the validation DF to a CSV file:
dt.valid <- data.table(as.data.frame(valid.truth))
write.table(dt.valid, row.names = F, na="", file="models/test.csv")
I tried to use the saved mojo to do the same prediction by running this on my Linux shell:
java -cp h2o-genmodel.jar hex.genmodel.tools.PredictCsv \
--mojo GLM_model_R_1511161743608_15 \
--decimal --mojo GLM_model_R_1511161743608_15.zip \
--input ../test.csv --output output.csv
However, the result is terrible. All the records were predicted as 0, which is very different from what I got when I ran the model in R.
I have been stuck on this for a day and couldn't figure out what went wrong. Can anyone shed some light on this?

Vowpal Wabbit - How to get prediction probabilities from contextual bandit model on a test sample

Given a trained contextual bandit model, how can I retrieve a prediction vector on test samples?
For example, let's say I have a train set named "train.dat" containing lines formatted as below
1:-1:0.3 | a b c # <action:cost:probability | features>
2:2:0.3 | a d d
3:-1:0.3 | a b e
....
And I run below command.
vw -d train.dat --cb 30 -f cb.model --save_resume
This produces a file, 'cb.model'. Now, let's say I have a test dataset as below
| a d d
| a b e
I'd like to see probabilities as below
0.2 0.7 0.1
The interpretation of these probabilities would be that action 1 should be picked 20% of the time, action 2 - 70%, and action 3 - 10% of the time.
Is there a way to get something like this?
When you use "--cb K", the prediction is the optimal arm/action under an argmax policy, which is a static policy.
When using "--cb_explore K", the prediction output contains the probability for each arm/action. Depending on the exploration policy you pick, the probabilities are calculated differently.
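To make "calculated differently depending on the policy" concrete, here is a minimal sketch of one such policy, epsilon-greedy: every action gets epsilon / K, and the action the model currently thinks is best gets the remaining 1 - epsilon on top. This is a standalone illustration, not Vowpal Wabbit's code; the function name is hypothetical and actions are 0-indexed here, whereas VW labels number actions from 1.

```python
def epsilon_greedy_pmf(best_action, num_actions, epsilon=0.05):
    """Return one probability per action (0-indexed) under epsilon-greedy."""
    if not 0 <= best_action < num_actions:
        raise ValueError("best_action must be in [0, num_actions)")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    # Spread epsilon uniformly over all actions ...
    pmf = [epsilon / num_actions] * num_actions
    # ... and give the rest of the mass to the greedy (argmax) action.
    pmf[best_action] += 1.0 - epsilon
    return pmf


# Three actions, the second one is currently best, 30% exploration
print(epsilon_greedy_pmf(best_action=1, num_actions=3, epsilon=0.3))
```

With epsilon set to 0 this collapses to the argmax policy described for "--cb K": all probability on one action.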
If you send those lines to a daemon running your model, you'd get just that. You send a context, and the reply is a probability distribution across the number of allowed actions, presumably comprising the "recommendation" provided by the model.
Say you have 3 actions, like in your example. Start a contextual bandits daemon:
vowpalwabbit/vw -d train.dat --cb_explore 3 -t --daemon --quiet --port 26542
Then send a context to it:
| a d d
You'll get just what you want as the reply.
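Sending a context to the daemon can be done from any TCP client. The sketch below assumes the daemon from the command above is listening on port 26542 on the same machine and answers each input line with one line of text; the function name is hypothetical, and the reply is returned as a raw string because its exact format depends on the VW options in use.

```python
import socket


def query_vw_daemon(context_line, host="localhost", port=26542, timeout=5.0):
    """Send one VW-format example line and return the daemon's one-line reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The daemon reads newline-terminated examples
        sock.sendall(context_line.strip("\n").encode("utf-8") + b"\n")
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()


# Usage, once the daemon is running:
#   reply = query_vw_daemon("| a d d")
#   print(reply)
```

Opening a new connection per example keeps the sketch short; for many examples it would be cheaper to keep one connection open and send the lines in sequence.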
In the Workspace class, initialize the object and then call the method predict(prediction_type: int). The corresponding parameter values are:
class PredictionType(IntEnum):
    SCALAR = pylibvw.vw.pSCALAR
    SCALARS = pylibvw.vw.pSCALARS
    ACTION_SCORES = pylibvw.vw.pACTION_SCORES
    ACTION_PROBS = pylibvw.vw.pACTION_PROBS
    MULTICLASS = pylibvw.vw.pMULTICLASS
    MULTILABELS = pylibvw.vw.pMULTILABELS
    PROB = pylibvw.vw.pPROB
    MULTICLASSPROBS = pylibvw.vw.pMULTICLASSPROBS
    DECISION_SCORES = pylibvw.vw.pDECISION_SCORES
    ACTION_PDF_VALUE = pylibvw.vw.pACTION_PDF_VALUE
    PDF = pylibvw.vw.pPDF
    ACTIVE_MULTICLASS = pylibvw.vw.pACTIVE_MULTICLASS
    NOPRED = pylibvw.vw.pNOPRED

Improve genbank feature addition

I am trying to add more than 70000 new features to a genbank file using biopython.
I have this code:
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
fi = "myoriginal.gbk"
fo = "mynewfile.gbk"
for result in results:
    start = 0
    end = 0
    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])
    for record in SeqIO.parse(original, "gb"):
        record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
        SeqIO.write(record, fo, "gb")
Results is just a list of lists containing the start and end of each one of the features I need to add to the original gbk file.
This solution is extremely costly for my computer and I do not know how to improve the performance. Any good idea?
You should parse the GenBank file just once. I do not know exactly what results contains, because some pieces of code are missing from your example, but I would guess that something like this modification of your code would improve performance:
fi = "myoriginal.gbk"
fo = "mynewfile.gbk"
# Parse the file once and keep the records in memory
original_records = list(SeqIO.parse(fi, "gb"))
for result in results:
    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])
    for record in original_records:
        record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
# Write all records once, after every feature has been added
SeqIO.write(original_records, fo, "gb")
