gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars - gensim

# dynamic topic model
from collections import Counter
from gensim import corpora
from gensim.models import ldaseqmodel

def run_dtm(num_topics=18):
    # preprocessing() is my own loader returning docs, years, titles
    docs, years, titles = preprocessing(datasetType=2)
    # re-sort documents by year
    Z = zip(years, docs)
    Z = sorted(Z, reverse=False)
    years_new, docs_new = zip(*Z)
    # generate time slices
    time_slice = Counter(years_new).values()
    for year in Counter(years_new):
        print year, ' --- ', Counter(years_new)[year]
    print '********* data set loaded ********'
    dictionary = corpora.Dictionary(docs_new)
    corpus = [dictionary.doc2bow(text) for text in docs_new]
    print '********* train lda seq model ********'
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)
    print '********* lda seq model done ********'
    ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic models in the gensim package for topic analysis, following this tutorial: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb. However, I always get the same unexpected error. Can anyone give me some guidance? I'm really puzzled, even though I have tried several different datasets for generating the corpus and dictionary.
The error is like this:
/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)

The np.fabs warning is coming from NumPy itself (the division inside the convergence calculation). What NumPy and gensim versions are you using?
NumPy no longer supports Python 2.7, and LdaSeqModel was only added to gensim in 2016, so you might simply not have a compatible combination of versions. If you are recoding a Python 3+ tutorial into a 2.7 variant, you clearly understand a bit about the version differences, so try running it in, say, a 3.6.8 environment (you will have to upgrade at some point anyway; 2020 is the end of Python 2.7 support). That alone might already help: I've gone through the tutorial with my own data and did not encounter this.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code entirely inside a function, can you try going through it line by line (or look at your DEBUG-level log) and check whether your output has the expected properties: for example, that your corpus is not empty and contains no empty documents?
If the corpus is empty or contains empty documents, fix the preprocessing steps and try again; that at least helped me, and it resolved the same LdaModel error reported on the mailing list.
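A minimal sanity check along those lines, reusing the corpus and docs_new names from your code (just a sketch, assuming they are built as shown above):
empty_bows = [i for i, bow in enumerate(corpus) if len(bow) == 0]
empty_docs = [i for i, doc in enumerate(docs_new) if len(doc) == 0]
print('%d documents in corpus' % len(corpus))
print('%d empty bag-of-words entries' % len(empty_bows))
print('%d empty token lists' % len(empty_docs))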
PS: not commenting because I lack the reputation, feel free to edit this.

This is an issue with the source code of ldaseqmodel.py itself.
With the latest gensim package (version 3.8.3), I get the same warning at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
Now, if you go through the code at that point (screenshot omitted), you can see that the difference between bound and old_bound is divided by old_bound, which is also visible in the warning itself.
If you analyze further, you will see that at line 263 old_bound is initialized to zero (screenshot omitted), and this is the main reason you are getting the divide-by-zero warning.
For further information, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output I received (screenshot omitted) confirms that old_bound is 0 when the warning is raised.
In short: you are getting this warning because of the source code of ldaseqmodel.py itself, not because of any empty documents. (If you do leave empty documents in your corpus, you will get a different warning.) So remove any empty documents from your corpus and simply ignore this particular divide-by-zero warning.
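If the warning clutters your logs, one way to silence just this floating-point division warning is NumPy's errstate context manager around the training call; a minimal sketch, reusing the run_dtm function from the question:
import numpy as np

# suppress only divide-by-zero floating point warnings while training
with np.errstate(divide='ignore'):
    run_dtm(num_topics=18)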

Related

Chem.RDKFingerprint did not match C++ signature for some SMILES, but okay for others

I'm trying to match ligands referenced in UniProt entries with the same ligands in PDB entries. For many ligands (e.g. FAD), the three-letter code is the same in both UniProt and PDB, but for some there is a slight difference. For example, for haemoglobin 1a9w chain A, the PDB file gives "HEM" while the corresponding UniProt entry (P69905) gives "heme b"; "heme b" (in the UniProt JSON) has ChEBI id CHEBI:60344.
I downloaded the full ChEBI SDF file from https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ and found three haems that are close to what I want. So far, so good.
If I use the following code to calculate Tanimoto coefficients with CHEBI:60344 as the reference, one of the haems is fine but the other raises a C++ exception that I haven't been able to catch in my Python code. The problem is that if my list of ChEBI ids is the other way round, the code always fails before I get a value for the Tanimoto coefficient.
My question is: is this a bug in my use of the RDKit code, a bug in RDKit itself, a bug in the ChEBI module of bioservices, an incorrectly written SMILES string in the ChEBI SDF file, or is there another issue?
This is all using conda-installed rdkit, bioservices, Python 3.9 etc. on an (old) Mac Pro running High Sierra (which can't be upgraded to a newer OS).
I ran this code:
from rdkit import Chem, DataStructs
from bioservices import ChEBI

heme = ChEBI()
heme_chebi_id = "CHEBI:60344"
heme_smiles = heme.getCompleteEntity(heme_chebi_id).smiles
target = Chem.MolFromSmiles(heme_smiles)
fp2 = Chem.RDKFingerprint(target)

for chebi_id in ["CHEBI:17627", "CHEBI:26355"]:
    ch = ChEBI()
    smiley = ch.getCompleteEntity(chebi_id).smiles
    print("reference:", heme_chebi_id)
    print("target:   ", chebi_id)
    print("reference:", heme_smiles)
    print("target:   ", smiley)
    ref = Chem.MolFromSmiles(smiley)
    fp1 = Chem.RDKFingerprint(ref)
    Tan = DataStructs.TanimotoSimilarity(fp1, fp2)
    print("Tanimoto coefficient:", Tan)
    print("-" * 64)
exit()
and got this output:
reference: CHEBI:60344
target: CHEBI:17627
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2
target: CC1=C(CCC(O)=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2
Tanimoto coefficient: 1.0
reference: CHEBI:60344
target: CHEBI:26355
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2
target: CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2
[12:36:26] Explicit valence for atom # 9 N, 4, is greater than permitted
Traceback (most recent call last):
File "/Volumes/Users/harry/icl/phyre2-ligand/./tanimoto_test.py", line 20, in <module>
fp1 = Chem.RDKFingerprint(ref)
Boost.Python.ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None)
This error means that the input to the function Chem.RDKFingerprint is None, which means that ref is None. You can try printing the value of ref to verify.
In this case, it is None because RDKit is not able to parse the given SMILES into a proper mol object. It even raised the following warning, if you look at the output carefully:
Explicit valence for atom # 9 N, 4, is greater than permitted
This is because of the coordinate bonds present in the molecule, which RDKit doesn't support. RDKit treats them as single bonds, which raises the valence of both nitrogen atoms to 4 and hence produces an invalid molecule. (Renderings of the same molecule from other sources were shown here as images.)
To deal with this error, you'll have to modify the SMILES manually so that either there is a charge on those nitrogen atoms or [Fe] is a separate atom rather than connected by a bond; see the sketch after the next paragraph.
This isn't really an issue with the SMILES itself but rather a limitation of RDKit, which cannot represent coordinate bonds. I have faced this issue many times and have always had to modify the SMILES manually to get around it. One suggestion: you can modify the SMILES programmatically, because this kind of error will most likely occur for metal-ligand complexes, where a coordinate bond is almost always present, so you can search for atoms like [Fe] or [Pt] in the SMILES string and modify them.
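For illustration, a minimal sketch of that kind of programmatic edit, using the failing CHEBI:26355 SMILES from the question's output. The replacements simply restore the formal charges already used in the CHEBI:17627/CHEBI:60344 strings; they are specific to this molecule, not a general recipe:
from rdkit import Chem

failing = "CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2"
# put the formal charges back so neither N ends up as a neutral 4-valent atom
patched = failing.replace("[N]", "[N+]").replace("[Fe]", "[Fe--]")
mol = Chem.MolFromSmiles(patched)
print(mol is not None)            # True: the patched SMILES now parses
fp = Chem.RDKFingerprint(mol)     # and can be fingerprinted as usual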
I've managed to find a couple of workarounds for this.
The problem arises because RDKit is (as of 30 Jan 2023) unable to process some IUPAC-compliant SMILES (as noted in betelgeuse's answer).
One option is to pass sanitize=False to rdkit.Chem.MolFromSmiles; this allows a non-None value to be returned for this SMILES, and subsequently rdkit.Chem.RDKFingerprint returns a useful value.
However, using the result of the sanitize=False option fails if I want to explore an alternative measure of similarity, e.g. FCFP4 fingerprints via rdkit.Chem.rdMolDescriptors.GetMorganFingerprint. The way I got around that was to test for None from MolFromSmiles (without sanitize=False), retrieve an alternative SMILES from PubChem and use that instead. Having said that, if I didn't really want the SMILES from PDBeChem, I could have done that in the first place...
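A minimal sketch of that first workaround, just as an illustration of the fallback logic described above, using the failing SMILES from the question's output:
from rdkit import Chem

smiles = "CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2"

mol = Chem.MolFromSmiles(smiles)            # returns None: valence error on the N atoms
if mol is None:
    mol = Chem.MolFromSmiles(smiles, sanitize=False)   # skip the valence check
fp = Chem.RDKFingerprint(mol)               # now returns a usable fingerprint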

PyQGIS - wrapped C/C++ object of type QgsVectorLayer has been deleted when editing the layer

I'm currently developing a QGIS plug-in.
When I start editing a layer, either with with edit(QgsVectorLayer) or with QgsVectorLayer.startEditing(), this RuntimeError occurs on the majority of runs: RuntimeError: wrapped C/C++ object of type QgsVectorLayer has been deleted. I can run the script 10 times with no error, then run it another 10 times and get the error 10 times in a row. It feels completely random.
From reading posts such as Understanding the "underlying C/C++ object has been deleted" error, I understand it might be a garbage-collection problem on the C++ side. But none of the posts I saw were about QgsVectorLayer, so I'm not sure that applies here.
It annoys me to the point where I have started creating empty layers to store the modified features instead of editing the layer.
I tried moving the start of the editing session before the loop, thinking that continually starting an edit session and committing changes for each feature might cause the issue, but the error still appears.
Then I thought it might be the use of break at the end, but removing it doesn't resolve the error either.
As this is the first time I have really used PyQGIS, I spent some time reading the developer cookbook and searching online (Anita Graser - creating and editing a new vector layer), but I could not find a solution.
I tried different versions, LTR or not, and even another computer out of desperation, but the issue is still there.
I also read somewhere that the progress bar could be the issue, so I removed the feedback from my script, also without success.
Here is a code example:
nodesLayer = self.parameterAsVectorLayer(parameters, self.INPUT_NODE, context)
arcsLayer = self.parameterAsVectorLayer(parameters, self.INPUT_LINE, context)

# Fill node Id_line_x
# Create spatial index
index = QgsSpatialIndex(nodesLayer.getFeatures())

for line in arcsLayer.getFeatures():
    # Construct a geometry engine to speed up spatial relationship
    engine = QgsGeometry.createGeometryEngine(line.geometry().constGet())
    engine.prepareGeometry()
    # Get potential neighbour
    candidateIds = index.intersects(line.geometry().boundingBox())
    request = QgsFeatureRequest().setFilterFids(candidateIds)
    for node in nodesLayer.getFeatures(request):
        # Get real neighbour
        if engine.intersects(node.geometry().constGet()):
            # Fill the Id_line fields for the number of neighbour
            for fld in range(1, node["Nb_seg"] + 1):
                if node["fk_Id_line_%d" % fld] == NULL:
                    with edit(nodesLayer):
                        node["fk_Id_line_%d" % fld] = line["Id_line"]
                        nodesLayer.updateFeature(node)
                    break
And the exact error:
Traceback (most recent call last):
File "/some/path/to/a/file.py", line 331, in processAlgorithm
nodesLayer.updateFeature(node)
RuntimeError: wrapped C/C++ object of type QgsVectorLayer has been deleted
I hope the example is enough. The goal of the code is for the nodes to be aware of their surroundings without having to go through the lines. It's just for processing; those fields would be removed in the final output.

Can't determine number of categories in trainImageCategoryClassifier in MATLAB

Working through this example, Matlab Image Category Classification, I ran into an error when trying to encode an image into a feature vector.
categoryClassifier = trainImageCategoryClassifier(trainingSet, bag);
Error using imageCategoryClassifier (line 436)
You need at least two image categories. That means that the number of elements in input array of imageSet
objects, imSets, must be at least two.
Error in imageCategoryClassifier.create (line 328)
this = imageCategoryClassifier(imgSet, bag, varargin{:});
Error in trainImageCategoryClassifier (line 82)
classifier = imageCategoryClassifier.create(imgSet, bag, varargin{:});
I have 3 categories, but it says I have one category in trainingSet. What should I do?!
Which version of MATLAB are you using? This documentation is for the 2017a version. If you have an older version, running that code from the document gives you a 3x1 vector of categories, which is why the classifier treats it as one category. For 3 categories you'll need a 1x3 vector.
You can go to the command window and open the reference page on imageCategoryClassifier or on bagOfFeatures. This gives you the documentation for the version you are actually running, and there is also a link to the example MATLAB program. That's what worked for me.
Hope this helps!

Is it easy to modify this Python code to use pandas, and would it help if I did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine, but it is very, very slow: a CSV I tried with 100 million lines took around 28 hours to complete. I did some googling and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am a pretty novice Python user, I am unsure whether using pandas would actually help at all and, if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file: it has 3 columns, the first is an IP address, the second is a URL and the third is a timestamp.
import csv

def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next()  # skip header line
        for row in csv_data:
            # Some lines in the csv have more/less than the 3 fields they should have,
            # so this is a cheat to get the script working, ignoring any wrong data
            if len(row) == 3:
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp)  # convert each time to epoch
                if current_ip not in ip_dict:
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip]:
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return ip_dict
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas would increase the speed, and would it require a complete rewrite, or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. Your current code already assumes the dictionary fits in memory, but the dictionary may be significantly smaller than the memory used by the DataFrame above, for two reasons.
If it is because most lines are dropped, then just parse the CSV in chunks: the skiprows and nrows arguments are your friends, followed by pd.concat.
If it is because IPs/URLs are repeated, then you will want to turn the IPs and URLs from normal columns into an index: parse in chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
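A rough sketch of what the chunked read plus groupby approach could look like (the column names follow this answer; the timestamp parsing and ddof=0 are assumptions made to match the question's numpy.std usage):
import pandas as pd

colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
chunks = []
for chunk in pd.read_csv("input.csv", names=colnames, skiprows=1, chunksize=1000000):
    # keep only complete 3-field rows
    chunk = chunk[~chunk.current_timestamp.isnull() & chunk.dummy.isnull()].copy()
    # convert timestamps to epoch seconds (assumes a format pd.to_datetime can parse)
    chunk["epoch"] = (pd.to_datetime(chunk.current_timestamp)
                      - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
    chunks.append(chunk[["current_IP", "URI", "epoch"]])

df = pd.concat(chunks)
# ddof=0 matches numpy.std's default (population standard deviation)
std_per_pair = df.groupby(["current_IP", "URI"])["epoch"].std(ddof=0)
print(std_per_pair.head())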
I won't be able to give you an exact solution, but here are a couple of ideas.
Based on your numbers, you read 100000000 / 28 / 60 / 60, i.e. roughly 1000 lines per second. That is not terribly slow, but I believe that just reading such a big file can be part of the problem.
So take a look at this performance comparison of ways to read a huge file. Basically, the author suggests that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something
can give you something like a 3x read speed-up. I also suggest you try collections.defaultdict instead of your "if the key is in the dict, append; otherwise create an empty list" pattern; see the sketch below.
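For illustration, a minimal sketch of that defaultdict suggestion, reusing the names from the question's parseCsvToDict (convert_time is the question's own helper):
from collections import defaultdict

ip_dict = defaultdict(lambda: defaultdict(list))
for row in csv_data:
    if len(row) == 3:
        current_ip, URI, current_timestamp = row
        # no membership checks needed: missing keys are created on first access
        ip_dict[current_ip][URI].append(convert_time(current_timestamp))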
And lastly, not related to Python: working in data analysis, I have found an amazing tool for working with CSV/JSON data. It is csvkit, which lets you manipulate CSV data with ease.
In addition to what Salvador Dali said in his answer: if you want to keep as much of your script's current code as possible, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)

R script line numbers at error? [duplicate]

This question already has answers here:
How to get R script line numbers at error?
(6 answers)
Closed 6 years ago.
I found this post from a year ago, and I'm using R version 2.11.1 (2010-05-31), but still getting error messages without line numbers.
Any solution?
The answers given there are still valid. Returning line numbers from a script isn't that straightforward, but R can give you a lot more information about where the error can be found.
You can use the error option to save the relevant information to a file, for example:
options(error = quote({
  sink(file = "error.txt")
  dump.frames()
  print(attr(last.dump, "error.message"))
  traceback()
  sink()
  q()
}))
The function findLineNum() can be used if you have the name of the source file available somewhere. If you have the error message, you could do something like:
dump.frames()
x <- attr(last.dump,"error.message")
ll <- gsub("Error in (.*) : .*","\\1",x)
lln <- findLineNum(srcfile,ll)
In the upcoming R 2.14, the core team is making progress toward implementing this feature. Functions in scripts loaded with source(file=..., keep.source=TRUE) will carry a srcref attribute, which identifies the range of characters corresponding to the function's definition in an in-memory copy of the source file, stored as an object of class srcfilecopy.
This does not immediately provide line-level debugging, but it lets you get approximate line numbers if you're willing to get your hands dirty. Also, it's progress.
