How to protonate a molecule in rdkit? - rdkit

I m tring to present the workflow for the positive ion ESI mass spectra, based on the fragmentation of [M+H]+ ions. I want to simulate the ionisation by the addition of one proton to heteroatoms. For example,
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG
# read mol
mol = Chem.MolFromSmiles('O=C(O)C1=CC(=NNC2=CC=C(C=C2)C(=O)NCCC(=O)O)C=CC1=O')
# draw the mol
dr = rdMolDraw2D.MolDraw2DSVG(800,800)
dr.SetFontSize(0.3)
op = dr.drawOptions()
for i in range(mol.GetNumAtoms()) :
op.atomLabels[i] = mol.GetAtomWithIdx(i).GetSymbol() + str((i+1))
AllChem.Compute2DCoords(mol)
dr.DrawMolecule(mol)
dr.FinishDrawing()
svg = dr.GetDrawingText()
SVG(svg)
I want to add one proton to the N atom with the index of #17 and to ionize the molecule. How to achieve this in rdkit?

Are these functions what you are looking for?
atom = mol.GetAtomWithIdx(17)
atom.SetNumExplicitHs(1)
atom.SetFormalCharge(1)

Related

Two StatsModels modules have totally different 'end-runs'

I'm running StatsModels to estimate parameters of a multiple regression model, using county-level data for 3085 counties. When I use statsmodels.formula.api, and drop a few rows from the data, I get desired results. All seems well enough.
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
%matplotlib inline
from statsmodels.compat import lzip
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
eg=pd.read_csv(r'C:/Users/user/anaconda3/une_edu_pipc_06.csv')
pd.options.display.precision = 3
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
sm_col = eg["lt_hsd_17"] + eg["hsd_17"]
eg["ut_hsd_17"] = sm_col
sm_col2 = eg["sm_col_17"] + eg["col_17"]
eg["bnd_hsd_17"] = sm_col2
eg["d_09"]= eg["Rate_09"]-eg["Rate_06"]
eg["d_10"]= eg["Rate_10"]-eg["Rate_06"]
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
res = sm.ols(formula = "Rate_18 ~ p_c_inc_18 + ut_hsd_17 + d_10 + inc_2",
data=eg, missing='drop').fit()
print(res.summary()).
(BTW, eg["p_c_inc_18"]is per-capita income, and inc_2 is p_c_inc_18 squarred).
But when I wish to use import statsmodels.api as smas the module, everything else staying pretty much the same, and run the following code after all appropriate preliminaries,
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
X = eg[["p_c_inc_18","ut_hsd_17","d_10","inc_2"]]
y = eg["Rate_18"]
X = sm.add_constant(X)
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
then things fall apart, and the Python interpreter throws an error, as follows:
[......]
KeyError: "['inc_2'] not in index"
BTW, the only difference between the two 'runs' is that 15 rows are dropped during the first, successful, model run, while I don't as yet know how to drop missing rows from the second model formulation. Could that difference be responsible for why the second run fails? (I chose to omit large parts of the error message, to reduce clutter.)
You need to assign inc_2 in your DataFrame.
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
should be
eg["inc_2"] = eg["p_c_inc_18"]*eg["p_c_inc_18"]

How to calculate shap values for ADABoost model?

I am running 3 different model (Random forest, Gradient Boosting, Ada Boost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF but not for ADA with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on Git that state
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is and add another if statement in the
TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I quite new to this.
I had the same problem and what I did, was to modify the file in the git you are commenting.
In my case I use windows so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers but you can do double click over the error message and the file will be opened.
The next step is to add another elif as the answer of the git help says. In my case I did it from the line 404 as following:
1) Modify the source code.
...
self.objective = objective_name_map.get(model.criterion, None)
self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"): #From this line I have modified the code
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"): # TODO: add unit test for this case
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note in the other models, the code of shap needs the attribute 'criterion' that the AdaBoost classifier doesn't have in a direct way. So in this case this attribute is obtained from the "weak" classifiers with the AdaBoost has been trained, that's why I add model.base_estimator_.criterion .
Finally you have to import the library again, train your model and get the shap values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
It seems that the shap package has been updated and still does not contain the AdaBoostClassifier. Based on the previous answer, I've modified the previous answer to work with the shap/explainers/tree.py file in lines 598-610
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weighted_boosting.AdaBoostClassifier"]):
assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
self.input_dtype = np.float32
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line added
Also working on testing to add this to the package :)

Dimensionality problem in creation of patches

I have a binary image of 650x650 size. I want to create patches of 50x50. This means that I need 169 patches. I want to examine if there is in every patch is at least "ONE" element.
I need also the result to be pairs of every patch.
Here there is an example of 2d:
2d example
So far so good.When I implement the view_as_blocks function from skimage.util.shape it returns a list of (13,13,50,50).
The way i search for "ONES" will be numpy.where but i am lost in dimensions...
here is my code:
def distinguish_patches_for_label(image_path):
im1=cv2.imread(image_path)
im2 = cv2.cvtColor(im1, cv2.COLOR_BGR2GRAY)
im3=(im2>100).astype(np.uint8)
im4 = cv2.resize(im3, (650,650))
patches=view_as_blocks(im4, block_shape=(50,50))
for i in range(patches.shape[0]):
for j in range(patches.shape[0]):
indexes_normal=list(zip(*np.where(patches[i,j,:,:]== 0)))
for i in range(patches.shape[0]):
for j in range(patches.shape[0]):
indexes_XDs=list(zip(*np.where(patches[i,j,:,:]== 1)))
list1=[]
list2=[]
for i in range(len(indexes_normal)):
list1.append(indexes_normal[i][0])
list2.append(indexes_normal[i][1])
zipped_indexes_normal=list(zip(list1,list2))
list3=[]
list4=[]
for i in range(len(indexes_XDs)):
list3.append(indexes_XDs[i][0])
list4.append(indexes_XDs[i][1])
zipped_indexes_XDs=list(zip(list3,list4))
return zipped_indexes_normal,zipped_indexes_XDs
I'm not exactly sure what you are trying to get as your output, but the below will find the number of 1s in each patch:
import numpy as np
from skimage import io, util
def count_ones_in_patches(image_path):
image = io.imread(image_path)
patches = util.view_as_blocks(image, (50, 50))
ones_per_patch = np.sum(patches, axis=(2, 3))
return ones_per_patch
Using the numpy axis= keyword argument is a very good idea in general. =)

LSTM - LSTM - future value prediction error

After some research, I was able to predict the future value using the LSTM code below. I have also attached the Dmd1ahr.csv file in the github link that I am using.
https://github.com/ukeshchawal/hello-world/blob/master/Dmd1ahr.csv
As you all can see below, 90 data points are training sets and 91st to 100th are future value prediction.
However some of the questions that I still have are:
In order to predict these values I had to originally take more than hundred data sets (here, I have taken 500 data sets) which is not exactly what my primary goal is. Is there a way that given 500 data sets, it will predict the rest 10 or 20 out of sample data points? If yes, will you please write me a sample code where you can just take 500 data points from Dmd1ahr.csv file attached below and it will predict some future values (say 501 to 520) based on those 500 points?
The prediction are way off compared to the one who have in your blogs (definitely indicates for parameter tuning - I tried changing epochs, LSTM layers, Activation, optimizer). What other parameter tuning I can do to make it more robust?
Thank you'll in advance.
import numpy as np
import matplotlib.pyplot as plt
import pandas
# By twaking the architecture it could be made more robust
np.random.seed(7)
numOfSamples = 500
lengthTrain = 90
lengthValidation = 100
look_back = 1 # Can be set higher, in my experiments it made performance worse though
transientTime = 90 # Time to "burn in" time series
series = pandas.read_csv('Dmd1ahr.csv')
def generateTrainData(series, i, look_back):
return series[i:look_back+i+1]
trainX = np.stack([generateTrainData(series, i, look_back) for i in range(lengthTrain)])
testX = np.stack([generateTrainData(series, lengthTrain + i, look_back) for i in range(lengthValidation)])
trainX = trainX.reshape((lengthTrain,look_back+1,1))
testX = testX.reshape((lengthValidation, look_back + 1, 1))
trainY = trainX[:,1:,:]
trainX = trainX[:,:-1,:]
testY = testX[:,1:,:]
testX = testX[:,:-1,:]
############### Build Model ###############
import keras
from keras.models import Model
from keras import layers
from keras import regularizers
inputs = layers.Input(batch_shape=(1,look_back,1), name="main_input")
inputsAux = layers.Input(batch_shape=(1,look_back,1), name="aux_input")
# this layer makes the actual prediction, i.e. decides if and how much it goes up or down
x = layers.recurrent.LSTM(300,return_sequences=True, stateful=True)(inputs)
x = layers.recurrent.LSTM(200,return_sequences=True, stateful=True)(inputs)
x = layers.recurrent.LSTM(100,return_sequences=True, stateful=True)(inputs)
x = layers.recurrent.LSTM(50,return_sequences=True, stateful=True)(inputs)
x = layers.wrappers.TimeDistributed(layers.Dense(1, activation="linear",
kernel_regularizer=regularizers.l2(0.005),
activity_regularizer=regularizers.l1(0.005)))(x)
# auxillary input, the current input will be feed directly to the output
# this way the prediction from the step before will be used as a "base", and the Network just have to
# learn if it goes a little up or down
auxX = layers.wrappers.TimeDistributed(layers.Dense(1,
kernel_initializer=keras.initializers.Constant(value=1),
bias_initializer='zeros',
input_shape=(1,1), activation="linear", trainable=False
))(inputsAux)
outputs = layers.add([x, auxX], name="main_output")
model = Model(inputs=[inputs, inputsAux], outputs=outputs)
model.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['mean_squared_error'])
#model.summary()
#model.fit({"main_input": trainX, "aux_input": trainX[look_back-1,look_back,:]},{"main_output": trainY}, epochs=4, batch_size=1, shuffle=False)
model.fit({"main_input": trainX, "aux_input": trainX[:,look_back-1,:].reshape(lengthTrain,1,1)},{"main_output": trainY}, epochs=100, batch_size=1, shuffle=False)
############### make predictions ###############
burnedInPredictions = np.zeros(transientTime)
testPredictions = np.zeros(len(testX))
# burn series in, here use first transitionTime number of samples from test data
for i in range(transientTime):
prediction = model.predict([np.array(testX[i, :, 0].reshape(1, look_back, 1)), np.array(testX[i, look_back - 1, 0].reshape(1, 1, 1))])
testPredictions[i] = prediction[0,0,0]
burnedInPredictions[:] = testPredictions[:transientTime]
# prediction, now dont use any previous data whatsoever anymore, network just has to run on its own output
for i in range(transientTime, len(testX)):
prediction = model.predict([prediction, prediction])
testPredictions[i] = prediction[0,0,0]
# for plotting reasons
testPredictions[:np.size(burnedInPredictions)-1] = np.nan
############### plot results ###############
#import matplotlib.pyplot as plt
plt.plot(testX[:, 0, 0])
plt.show()
plt.plot(burnedInPredictions, label = "training")
plt.plot(testPredictions, label = "prediction")
plt.legend()
plt.show()

Use Gensim or other python LDA packages to use trained LDA model from Mallet

I have an LDA model trained through Mallet in Java. Three files are generated from the Mallet LDA model, which allow me to run the model from files and infer the topic distribution of a new text.
Now I would like to implement a Python tool which is able to infer a topic distribution given a new text, based on the trained LDA model. I do not want to re-trained the LDA model in Python. Therefore, I wonder if it is possible to load the trained Mallet LDA model into Gensim or any other python LDA package. If so, how can I do it?
Thanks for any answers or comments.
In short yes you can! That is what is nice about using mallet is that once it is run you don't have to go through and relabel topics. I'm doing something very similar - I'll post my code below with a few helpful links. Once your model is trained save the notebook widget state and you'll be free to run your model on new and different data-sets with the same topic allocation. This code includes a test and validation set. Make sure you've downloaded mallet and java then try this:
# future bridges python 2 and 3
from __future__ import print_function
# pandas works with data structures, data manipulation, and analysis specifically for numerical tables, and series like
# the csv we are using here today
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
# Gensim unsupervised topic modeling, natural language processing, statistical machine learning
import gensim
# convert a document to a list of tolkens
from gensim.utils import simple_preprocess
# remove stopwords - words that are not telling: "it" "I" "the" "and" ect.
from gensim.parsing.preprocessing import STOPWORDS
# corpus iterator
from gensim import corpora, models
# nltk - Natural Language Toolkit
# lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed
# into present.
# stemmed — words are reduced to their root form.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
# NumPy - multidimensional arrays, matrices, and high-level mathematical formulas
import numpy as np
np.random.seed(2018)
import os
from gensim.models.wrappers import LdaMallet
from pathlib import Path
import codecs
import logging
import re
import numpy as np
import pandas as pd
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)
data = pd.read_csv('YourData.csv', encoding = "ISO-8859-1");
data_text = data[['Preprocessed Document or your comments column title']]
data_text['index'] = data_text.index
documents = data_text
# Create functions to lemmatize stem, and preprocess
# turn beautiful, beautifuly, beautified into stem beauti
def lemmatize_stemming(text):
stemmer = PorterStemmer()
return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
# parse docs into individual words ignoring words that are less than 3 letters long
# and stopwords: him, her, them, for, there, ect since "their" is not a topic.
# then append the tolkens into a list
def preprocess(text):
result = []
for token in gensim.utils.simple_preprocess(text):
newStopWords = ['yourStopWord1', 'yourStopWord2']
if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
nltk.bigrams(token)
result.append(lemmatize_stemming(token))
return result
# gensim.parsing.preprocessing.STOPWORDS
# look at a random row 4310 and see if things worked out
# note that the document created was already preprocessed
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
# let’s look at ten rows passed through the lemmatize stemming and preprocess
documents = documents.dropna(subset=['Preprocessed Document'])
processed_docs = documents['Preprocessed Document'].map(preprocess)
processed_docs[:10]
# we create a dictionary of all the words in the csv by iterating through
# contains the number of times a word appears in the training set.
dictionary_valid = gensim.corpora.Dictionary(processed_docs[20000:])
count = 0
for k, v in dictionary_valid.iteritems():
print(k, v)
count += 1
if count > 30:
break
# we create a dictionary of all the words in the csv by iterating through
# contains the number of times a word appears in the training set.
dictionary_test = gensim.corpora.Dictionary(processed_docs[:20000])
count = 0
for k, v in dictionary_test.iteritems():
print(k, v)
count += 1
if count > 30:
break
# we want to throw out words that are so frequent that they tell us little about the topic
# as well as words that are too infrequent >15 rows then keep just 100,000 words
dictionary_valid.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# we want to throw out words that are so frequent that they tell us little about the topic
# as well as words that are too infrequent >15 rows then keep just 100,000 words
dictionary_test.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# the words become numbers and are then counted for frequency
# consider a random row 4310 - it has 8 words word indexed 2 shows up once
# preview the bag of words
bow_corpus_valid = [dictionary_valid.doc2bow(doc) for doc in processed_docs]
bow_corpus_valid[4310]
# the words become numbers and are then counted for frequency
# consider a random row 4310 - it has 8 words word indexed 2 shows up once
# preview the bag of words
bow_corpus_test = [dictionary_test.doc2bow(doc) for doc in processed_docs]
bow_corpus_test[4310]
# same thing in more words
bow_doc_4310 = bow_corpus_test[4310]
for i in range(len(bow_doc_4310)):
print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0],
dictionary_test[bow_doc_4310[i][0]],
bow_doc_4310[i][1]))
mallet_path = 'C:/mallet/mallet-2.0.8/bin/mallet.bat'
ldamallet_test = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus_test, num_topics=20, id2word=dictionary_test)
result = (ldamallet_test.show_topics(num_topics=20, num_words=10,formatted=False))
for each in result:
print (each)
mallet_path = 'C:/mallet/mallet-2.0.8/bin/mallet.bat'
ldamallet_valid = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus_valid, num_topics=20, id2word=dictionary_valid)
result = (ldamallet_valid.show_topics(num_topics=20, num_words=10,formatted=False))
for each in result:
print (each)
# Show Topics
for idx, topic in ldamallet_test.print_topics(-1):
print('Topic: {} \nWords: {}'.format(idx, topic))
# Show Topics
for idx, topic in ldamallet_valid.print_topics(-1):
print('Topic: {} \nWords: {}'.format(idx, topic))
# check out the topics - 30 words - 20 topics
ldamallet_valid.print_topics(idx, 30)
# check out the topics - 30 words - 20 topics
ldamallet_test.print_topics(idx, 30)
# Compute Coherence Score
coherence_model_ldamallet_valid = CoherenceModel(model=ldamallet_valid, texts=processed_docs, dictionary=dictionary_valid, coherence='c_v')
coherence_ldamallet_valid = coherence_model_ldamallet_valid.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet_valid)
# Compute Coherence Score
coherence_model_ldamallet_test = CoherenceModel(model=ldamallet_test, texts=processed_docs, dictionary=dictionary_test, coherence='c_v')
coherence_ldamallet_test = coherence_model_ldamallet_test.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet_test)
Look at 16: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
This helped: https://rare-technologies.com/tutorial-on-mallet-in-python/
and this: https://radimrehurek.com/gensim/models/wrappers/ldamallet.html
I hope this helps and good luck :)

Resources