Shap explainer gives an error with ECFP4 fingerprints - bioinformatics

I am training a Random Forest on molecular fingerprints and then creating a SHAP explainer with the shap package:
explainer = shap.Explainer(forest)
and it gives me the error:
"ExplainerError: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. Consider retrying with the feature_perturbation='interventional' option. This check failed because for one of the samples the sum of the SHAP values was 28208132836061024.000000, while the model output was 0.846000. If this difference is acceptable you can set check_additivity=False to disable this check."
Now, the error seems very straightforward, but what I cannot make sense of is the following:
I did the exact same thing with MACCS fingerprints, and it worked.
I checked the shape of the data: it is (2334, 2048) for training and (193, 2048) for validation; with MACCS it was analogous.
The validation set consists only of 1s and 0s, as it should.
The fingerprints all have the same length, so there are no errors there.
I ran an external validation with the validation set and there were no problems:
roc = metrics.roc_auc_score(labels_val, predicted)
tn, fp, fn, tp = confusion_matrix(labels_val, predicted).ravel()
I checked that the forest I trained with was the forest I was using.
And yes, I even restarted my computer.
If someone has any idea what could cause this problem, please let me know!

Related

gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars

# dynamic topic model
def run_dtm(num_topics=18):
    docs, years, titles = preprocessing(datasetType=2)
    # re-sort documents by year
    Z = zip(years, docs)
    Z = sorted(Z, reverse=False)
    years_new, docs_new = zip(*Z)
    # generate time slice
    time_slice = Counter(years_new).values()
    for year in Counter(years_new):
        print year,' --- ',Counter(years_new)[year]
    print '********* data set loaded ********'
    dictionary = corpora.Dictionary(docs_new)
    corpus = [dictionary.doc2bow(text) for text in docs_new]
    print '********* train lda seq model ********'
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)
    print '********* lda seq model done ********'
    ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic models in the gensim package for topic analysis, following this tutorial: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb. However, I always get the same unexpected warning. Can anyone give me some guidance? I'm really puzzled, even though I have tried several different datasets for generating the corpus and dictionary.
The error is like this:
/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
The np.fabs error means it is encountering an error with NumPy. What NumPy and gensim versions are you using?
NumPy no longer supports Python 2.7, and Ldaseq was added to Gensim in 2016, so you might just not have a compatible version available. If you are recoding a Python 3+ tutorial to a 2.7 variant, you obviously understand a little bit about the version differences. Try running it in, say, a Python 3.6.8 environment (you will have to upgrade sometime anyway; 2020 is the end of 2.7 support from Python itself). That might already help: I've gone through the tutorial and did not encounter this with my own data.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code fully in a function, can you try to go through it line by line (or look at your DEBUG-level log) and check whether your output has the expected properties: for example, that your corpus is not empty and does not contain empty documents?
If it does, fix the preprocessing steps and try again. That at least helped me, and it helped with the same ldamodel error on the mailing list.
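The empty-document check suggested above can be sketched in plain Python, before anything is handed to gensim. This is only an illustration under stated assumptions: `drop_empty_docs` is a hypothetical helper (not part of gensim), the documents are assumed to be lists of tokens, and the sample data is made up. The point it shows is that the time slices have to be recounted after empty documents are dropped, in year order, so that they still sum to the number of documents.

```python
from collections import Counter

def drop_empty_docs(years, docs):
    """Hypothetical helper: sort documents by year, drop empty ones,
    and recount how many documents fall into each year."""
    # Sort on the year only, so the token lists themselves are never compared.
    pairs = sorted(zip(years, docs), key=lambda pair: pair[0])
    kept = [(year, doc) for year, doc in pairs if doc]
    years_kept = [year for year, _ in kept]
    docs_kept = [doc for _, doc in kept]
    counts = Counter(years_kept)
    # One count per year, in ascending year order.
    time_slice = [counts[year] for year in sorted(counts)]
    return years_kept, docs_kept, time_slice

# Made-up sample data: the second document is empty after preprocessing.
years = [2001, 2000, 2001, 2000, 2002]
docs = [["topic", "model"], [], ["dynamic"], ["gensim", "corpus"], ["lda"]]

years_kept, docs_kept, time_slice = drop_empty_docs(years, docs)
print(years_kept, time_slice)
```

The resulting `docs_kept` and `time_slice` would then be used to build the dictionary and corpus in place of the unfiltered lists.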
PS: not commenting because I lack the reputation, feel free to edit this.
This is an issue with the source code of ldaseqmodel.py itself.
With the latest gensim package (version 3.8.3) I am getting the same warning at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
Now, if you go through the code, you will see that the difference between bound and old_bound is divided by old_bound (which is also visible from the warning).
If you analyze further, you will see that at line 263 old_bound is initialized to zero, and this is the main reason you are getting the divide-by-zero warning.
For further information, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output I received is: [screenshot not available]
So, in short, you are getting this warning because of the source code of ldaseqmodel.py in the package, not because of any empty document. However, if you do not remove the empty documents from your corpus, you will receive another warning. So I suggest that you remove any empty documents from your corpus and simply ignore the division-by-zero warning above.
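To see why a zero starting value triggers the warning only once and is otherwise harmless, here is a small stand-alone sketch of such a convergence loop. It is not gensim's code: `relative_change`, `iterations_until_converged`, the threshold, and the bound values are all made up for illustration. With NumPy floats, dividing by zero produces a warning and an infinite value; the sketch returns infinity explicitly for that first pass instead.

```python
import math

def relative_change(bound, old_bound):
    """Same formula as in the warning, |(bound - old_bound) / old_bound|,
    with the old_bound == 0 case made explicit."""
    if old_bound == 0:
        return math.inf  # first pass: nothing to compare against yet
    return abs((bound - old_bound) / old_bound)

def iterations_until_converged(bounds, threshold=1e-4):
    """Walk through a sequence of bound values and report the step at which
    the relative change drops below the threshold (None if it never does)."""
    old_bound = 0.0  # starts at zero, as described above
    for step, bound in enumerate(bounds, start=1):
        if relative_change(bound, old_bound) < threshold:
            return step
        old_bound = bound
    return None

# Made-up bound values for illustration only.
steps = iterations_until_converged([-1000.0, -900.0, -899.99, -899.99])
print(steps)
```

On the first pass the change is infinite, so the loop simply continues; from the second pass on, old_bound is non-zero and the comparison behaves normally.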

Clustering using Representatives (CURE)

I need a numerical example that demonstrates how clustering with the CURE algorithm works.
https://www.cs.ucsb.edu/~veronika/MAE/summary_CURE_01guha.pdf
The pyclustering library has a number of clustering algorithms with examples, and example code on their GitHub. Here is a link to the CURE example.
Googling "CURE algorithm example" also turned up a fair bit.
Hopefully that helps!
Using the pyclustering library, you can extract information about the representative points and means with the corresponding methods (see the generated pyclustering documentation for CURE):
# create an instance of the algorithm
cure_instance = cure(<algorithm parameters>)
# start processing
cure_instance.process()
# get the allocated clusters
clusters = cure_instance.get_clusters()
# get the representative points
representative = cure_instance.get_representors()
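For a small numerical illustration of what the representative points are, the following stand-alone sketch works through one cluster by hand. It does not use pyclustering and is not its implementation: the function names, the five sample points, the number of representatives (2), and the shrink factor (0.5) are all chosen for illustration. It follows the usual description of CURE: pick well-scattered points (first the point farthest from the mean, then repeatedly the point farthest from those already picked) and move each of them toward the mean by the shrink factor.

```python
import math

def centroid(points):
    """Mean of a list of equally sized coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def scattered_points(points, count):
    """Pick up to `count` well-scattered points with a farthest-point heuristic."""
    center = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(count, len(points)):
        remaining = [p for p in points if p not in chosen]
        # Take the point whose nearest already-chosen point is farthest away.
        chosen.append(max(remaining,
                          key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen

def shrink(representatives, center, alpha):
    """Move each representative toward the centroid by the fraction alpha."""
    return [tuple(r[d] + alpha * (center[d] - r[d]) for d in range(len(center)))
            for r in representatives]

# Illustrative cluster: four corners of a square plus its middle point.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
center = centroid(cluster)
reps = scattered_points(cluster, 2)
shrunk = shrink(reps, center, 0.5)
print(center, reps, shrunk)
```

Here the mean is (2, 2), the two scattered points are the opposite corners (0, 0) and (4, 4), and after shrinking by 0.5 the representatives become (1, 1) and (3, 3). In CURE, the distance between two clusters is then taken as the smallest distance between their shrunk representatives, and the closest pair of clusters is merged.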
You can also modify the source code of the CURE algorithm to display the changes after each step, for example by printing them to the console or even visualizing them. Here is an example of how to modify the code to display the changes at each clustering step (after line 219), where a star marks a representative point, small points are the data points themselves, and big points are the means:
# New cluster and updated clusters should be relocated in queue
self.__insert_cluster(merged_cluster)
for item in cluster_relocation_requests:
    self.__relocate_cluster(item)
#
# ADD THE FOLLOWING PIECE OF CODE TO DISPLAY CHANGES ON EACH STEP
#
temp_clusters = [cure_cluster_unit.indexes for cure_cluster_unit in self.__queue]
temp_representors = [cure_cluster_unit.rep for cure_cluster_unit in self.__queue]
temp_means = [cure_cluster_unit.mean for cure_cluster_unit in self.__queue]
visualizer = cluster_visualizer()
visualizer.append_clusters(temp_clusters, self.__pointer_data)
for cluster_index in range(len(temp_clusters)):
    # representative points are drawn as stars, means as big points
    visualizer.append_cluster_attribute(0, cluster_index, temp_representors[cluster_index], '*', 7)
    visualizer.append_cluster_attribute(0, cluster_index, [temp_means[cluster_index]], 'o')
visualizer.show()
You will see a sequence of images, one for each clustering step.
In the same way, you can display any information that you need.
I would also add that you can use the C++ implementation of the algorithm (also part of pyclustering) for visualization: https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/cure.cpp

ROC on multiple test sets in h2o (python)

I had a use-case that I thought was really simple but couldn't find a way to do it with h2o. I thought you might know.
I want to train my model once, and then evaluate its ROC on a few different test sets (e.g. a validation set and a test set, though in reality I have more than 2) without having to retrain the model. The way I know to do it now requires retraining the model each time:
train, valid, test = fr.split_frame([0.2, 0.25], seed=1234)
rf_v1 = H2ORandomForestEstimator( ... )
rf_v1.train(features, var_y, training_frame=train, validation_frame=valid)
roc = rf_v1.roc(valid=1)
rf_v1.train(features, var_y, training_frame=train, validation_frame=test) # training again with the same training set - can I avoid this?
roc2 = rf_v1.roc(valid=1)
I can also use model_performance(), which gives me some metrics on an arbitrary test set without retraining, but not the ROC. Is there a way to get the ROC out of the H2OModelMetrics object?
Thanks!
You can use the h2o flow to inspect the model performance. Simply go to: http://localhost:54321/flow/index.html (if you changed the default port change it in the link); type "getModel "rf_v1"" in a cell and it will show you all the measurements of the model in multiple cells in the flow. It's quite handy.
If you are using Python, you can find the performance in your IDE like this:
rf_perf1 = rf_v1.model_performance(test)
and then print the AUC (the area under the ROC curve) like this:
print(rf_perf1.auc())
Yes, indirectly. Get the TPRs and FPRs from the H2OModelMetrics object:
out = rf_v1.model_performance(test)
fprs = out.fprs
tprs = out.tprs
roc = list(zip(fprs, tprs))  # list of (FPR, TPR) points
(By the way, my H2ORandomForestEstimator object does not seem to have an roc() method at all, so I'm not 100% sure that this output is in the exact same format. I'm using h2o version 3.10.4.7.)
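Once the FPR/TPR values are extracted, comparing several test sets needs no retraining. The sketch below is plain Python and does not call h2o: `roc_points` and `auc_trapezoid` are hypothetical helper names, and the numbers stand in for the `fprs` and `tprs` you would get from `model_performance()` on each frame. It pairs the values into ROC points and approximates the area under each curve with the trapezoidal rule.

```python
def roc_points(fprs, tprs):
    """Pair up FPR/TPR values and sort them so the curve runs left to right."""
    return sorted(zip(fprs, tprs))

def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up values standing in for the fprs/tprs of two different test sets.
test_sets = {
    "valid": ([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0]),
    "test": ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]),
}

aucs = {name: auc_trapezoid(roc_points(fprs, tprs))
        for name, (fprs, tprs) in test_sets.items()}
for name, value in aucs.items():
    print(name, value)
```

The same sorted points can be passed to any plotting library to draw one ROC curve per test set.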

MATLAB ConnectedComponentLabeler does not work in for loop

I am trying to get a set of binary images' eccentricity and solidity values using the regionprops function. I obtain the label matrix using the vision.ConnectedComponentLabeler function.
This is the code I have so far:
files = getFiles('images');
ecc = zeros(length(files)); %eccentricity values
sol = zeros(length(files)); %solidity values
ccl = vision.ConnectedComponentLabeler;
for i=1:length(files)
    I = imread(files{i});
    [L NUM] = step(ccl, I);
    for j=1:NUM
        L = changem(L==j, 1, j); %*
    end
    stats = regionprops(L, 'all');
    ecc(i) = stats.Eccentricity;
    sol(i) = stats.Solidity;
end
However, when I run this, I get an error pointing at the line marked with *:
Error using ConnectedComponentLabeler/step
Variable-size input signals are not supported when the OutputDataType property is set to 'Automatic'.
I do not understand what MATLAB is talking about and I do not have any idea about how to get rid of it.
Edit
I have gone back to the bwlabel function and have no problems now.
The error is a bit hard to understand, but I can explain what exactly it means. When you use the CVST Connected Components Labeller, it assumes that all of the images you're going to use with the function are the same size. That error happens because it looks like the images aren't, hence the mention of "Variable-size input signals".
The "Automatic" setting means that the output data type of the label images is chosen automatically, so you don't have to worry about whether the output is uint8, uint16, etc. If you want to remove this error, you need to set the output data type of the images produced by this labeller manually, i.e. set the OutputDataType property to a fixed type. Hopefully, the images in the directory you're reading are all the same data type, so override this property with a data type that this function accepts. The available types are uint8, uint16 and uint32. Therefore, assuming your images were uint8, for example, do this before you run your loop:
ccl = vision.ConnectedComponentLabeler;
ccl.OutputDataType = 'uint8';
Now run your code, and it should work. Bear in mind that the input needs to be logical for this to have any meaningful output.
Minor comment
Why are you using the CVST Connected Component Labeller when the Image Processing Toolbox bwlabel function works exactly the same way? As you are using regionprops, you have access to the Image Processing Toolbox, so this should be available to you. It's much simpler to use and requires no setup: http://www.mathworks.com/help/images/ref/bwlabel.html

How to handle the Nominal Data by Weka J48

When I ran Weka's J48 with the binary split option, the following decision tree was built:
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
The input explanatory variable is a single nominal attribute made by combining a question ID and an answer ID.
Each transaction holds one such nominal value.
I'm wondering why the tree grows on only one side.
Is it caused by my data set, by the table definition, or by the way binary splits work?
I'd like the tree to have nodes on both sides.
If you know of such an option, please show me.
Sample data (please ignore the dot '・'):
usr,qa,class
A,11,1
A,21,1
A,31,1
B,12,2
B,22,2
B,32,2
C,13,3
C,23,3
C,33,3
D,11,4
D,22,4
D,31,4
E,11,1
E,23,1
E,31,1
F,12,2
F,22,2
F,33,2
G,13,3
G,22,3
G,32,3
H,12,4
H,21,4
H,33,4
There's no error in the tree that was built, and no option would really modify it. If your question is related to your same Akinator project, please reformat your data to get all questions (i.e. 11, 21, 31) on the same instance/line, with the answer as the target class.
PS: if you import this data as CSV, Weka will treat it as numeric (not as nominal). You should therefore add a non-digit character (i.e. #1, #2, #3...) so that Weka treats it as nominal.
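The reshaping described above can be sketched with Python's standard csv module. This is only one possible reading of the advice, under explicit assumptions: the first digit of each qa code is taken to be the question ID, the existing class column is kept as the target, the `#` and `c` prefixes are arbitrary non-digit markers so the values are read as nominal, and `pivot` is a hypothetical helper name. Only the first two users of the sample data are used to keep it short.

```python
import csv
import io

# First two users from the sample data in the question.
RAW = """usr,qa,class
A,11,1
A,21,1
A,31,1
B,12,2
B,22,2
B,32,2
"""

def pivot(raw_csv):
    """Collect all answers of one user on a single row, keyed by question."""
    rows = {}
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        # Assumption: the class is the same on every line of a given user.
        entry = rows.setdefault(rec["usr"], {"class": "c" + rec["class"]})
        question = "q" + rec["qa"][0]      # assumption: first digit = question ID
        entry[question] = "#" + rec["qa"]  # non-digit prefix keeps it nominal
    return rows

wide = pivot(RAW)
print("usr,q1,q2,q3,class")
for usr, entry in wide.items():
    print(",".join([usr, entry["q1"], entry["q2"], entry["q3"], entry["class"]]))
```

Each user now occupies one line with one column per question, which gives J48 several attributes to split on instead of a single one.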

Resources