How to convert SMILES to fingerprints with RDKit?

I have to convert a list of SMILES strings into a list of fingerprints with RDKit, but I don't know how. I searched for solutions on the internet, but I could not find a working example...
Does anyone have experience converting a list of SMILES strings for molecules into fingerprints?
Thanks!

You can try this:
from rdkit import Chem
smiles_list = ["O=C(NCc1cc(OC)c(O)cc1)CCCC/C=C/C(C)C", "CC(C)CCCCCC(=O)NCC1=CC(=C(C=C1)O)OC", "c1(C=O)cc(OC)c(O)cc1"]
# create a list of mols
mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]
# create a list of fingerprints from mols
fps = [Chem.RDKFingerprint(mol) for mol in mols]
RDKit has a variety of built-in functionality for generating molecular fingerprints; the example above generates topological (RDKit) fingerprints. Please refer to the RDKit documentation for other options.
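If you prefer circular (Morgan/ECFP-like) fingerprints instead, here is a minimal sketch using the same mols list; the radius of 2 and the 2048-bit length are only illustrative choices:
from rdkit.Chem import AllChem
# Morgan fingerprints as explicit bit vectors (radius and size chosen for illustration)
morgan_fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]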

Related

How to get similar results to pydub.silence.detect_nonsilent() using librosa.effects.split()?

I love pydub. It is simple to understand. But when it comes to detecting non-silent chunks, librosa seems much faster. So I want to try using librosa in a project to speed my code up.
So far, I have been using pydub like this (segment is an AudioSegment):
thresh = segment.dBFS - (segment.max_dBFS - segment.dBFS)
non_silent_ranges = pydub.silence.detect_nonsilent(segment, min_silence_len=1000, silence_thresh=thresh)
The thresh formula works mostly well, and when it does not, moving it 5 or so dB up or down does the trick.
Using librosa, I am trying this (y is a numpy array loaded with librosa.load(), with an sr of 22050)
non_silent_ranges = librosa.effects.split(y, frame_length=sr, top_db=mistery)
To get similar results to pydub I tried setting mistery to the following:
mistery = y.mean() - (y.max() - y.mean())
and the same after converting y to dbs:
ydbs = librosa.amplitude_to_db(y)
mistery = ydbs.mean() - (ydbs.max() - ydbs.mean())
In both cases, the results are very different from what I get from pydub.
I have no background in audio processing and although I read about rms, dbFS, etc, I just don't get it--I guess I am getting old:)
Could somebody point me in the right direction? What would be the equivalent of my pydub solution in librosa? Or at least, could you explain how to get pydub's max_dBFS and dBFS values in librosa? (I am aware of how to convert an AudioSegment to the equivalent librosa numpy array thanks to the excellent answer here.)
max_dBFS is always 0 by its nature. dBFS is how much "quieter" the sound is than the maximum possible signal.
I suspect another part of your issue is that ydbs.max() is the maximum value among the data in ydbs, not the maximum possible value that can be stored (i.e., the highest integer or float possible).
Another difference from pydub is your use of ydbs.mean(): pydub uses RMS when computing dBFS.
You can convert an RMS value to dBFS like so (note: the RMS should be computed on the raw integer samples, not on the dB-converted array, since iinfo only works on integer dtypes such as int16):
from math import log10
from numpy import mean, sqrt, square, iinfo
max_sample_value = iinfo(samples.dtype).max  # samples = raw integer PCM array, not ydbs
rms = sqrt(mean(square(samples.astype(float))))
dbfs = 20 * log10(rms / max_sample_value)
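For reference, here is a rough sketch (my own approximation, not part of the original answer) of deriving a pydub-like threshold for librosa.effects.split, assuming y is the float array returned by librosa.load, which treats 1.0 as full scale:
import numpy as np
import librosa
y, sr = librosa.load("audio.wav", sr=22050)  # hypothetical input file
rms_db = 20 * np.log10(np.sqrt(np.mean(np.square(y))))  # overall RMS level, roughly pydub's dBFS
peak_db = 20 * np.log10(np.max(np.abs(y)))  # peak level, roughly pydub's max_dBFS
thresh_db = rms_db - (peak_db - rms_db)  # mirrors segment.dBFS - (segment.max_dBFS - segment.dBFS)
# librosa's top_db is measured below the reference (the peak by default), so convert:
non_silent_ranges = librosa.effects.split(y, top_db=peak_db - thresh_db)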

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tfidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?
You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which a token must appear, whereas no_above expects a maximum relative document frequency, e.g. 0.5. Afterwards you can construct your corpus with the filtered dictionary. According to the gensim docs it is also possible to construct a TfidfModel from only a dictionary.
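A minimal sketch putting this together (the thresholds 5 and 0.5 are only illustrative):
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
# drop tokens that appear in fewer than 5 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=5, no_above=0.5)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)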

Paraview rotate fields

I am using Paraview 5.0.1. If any solution requires updating, I can try.
I want to programmatically obtain field plots (and corresponding PlotOverLine) of displacements and stresses in rotated coordinate systems.
What are appropriate/convenient/possible ways of doing this?
So far, I have created one Calculator filter for each component of displacements and stresses.
For instance, I used Calculators in 2D with results
(displacement.iHat)*cos(0.7853981625)+(displacement.jHat)*sin(0.7853981625)
(stress_3-stress_0)*sin(45.0*3.14159265/180)*cos(45.0*3.14159265/180)+stress_1*((cos(45.0*3.14159265/180))^2-(sin(45.0*3.14159265/180))^2)
It works fine, but it is quite cumbersome, in several aspects:
Creating them (one filter per component).
Plotting several of them in a single XY plot
Exporting them (one export per component).
Is there a simple way to do this?
PS: The Transform filter does not accomplish this. It rotates the view, not the fields.
Two solutions:
Ugly, inefficient solution:
Use Transform and check "Transform All Input Vectors"
Add a Calculator and add a dummy array
Use Transform the other way around, without checking "Transform All Input Vectors"
Correct solution:
Compute the transformation yourself in a programmable filter
input = self.GetUnstructuredGridInput()
output = self.GetUnstructuredGridOutput()
output.ShallowCopy(input)
data = input.GetPointData().GetArray("YourArray")
# new array that will hold the transformed vectors
vec = vtk.vtkDoubleArray()
vec.SetNumberOfComponents(3)
vec.SetName("TransformedVectors")
numPoints = input.GetNumberOfPoints()
for i in xrange(0, numPoints):
    # implement transform() in Python; it must return the rotated tuple
    # (tuples are immutable, so it cannot be modified in place)
    vec.InsertNextTuple(transform(data.GetTuple(i)))
output.GetPointData().AddArray(vec)
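A hypothetical transform for a 45° rotation about the z axis (the angle used in the Calculator expressions above) could look like this:
import math
def transform(t, angle=45.0 * math.pi / 180.0):
    # rotate the (x, y) components of a 3-component tuple about the z axis
    x, y, z = t
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle),
            z)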

Cross validation of dataset separated on files

The dataset that I have is separated into different files, grouped into samples that belong together, i.e., they were created under similar conditions at a similar time.
The balance of the train-test split is important, so the samples of a group have to go entirely into train or into test and cannot be split up. So KFold is not simple to use with my scikit-learn code.
Right now, I am using something similar to LOO, doing something like:
train ~> cat ./dataset/!(1.txt)
test ~> cat ./dataset/1.txt
This is not comfortable and not very useful if I want to build test folds out of several files and do a "real" CV.
How would it be possible to do a good CV to check for real overfitting?
Looking at this answer, I've realized that pandas can concatenate dataframes. I checked that the process is 15-20% slower than the cat command line, but it makes it possible to build folds as I was expecting.
Anyway, I am quite sure that there should be a better way than this one:
import glob
import numpy as np
import pandas as pd
from sklearn.cross_validation import KFold
allFiles = glob.glob("./dataset/*.txt")
kf = KFold(len(allFiles), n_folds=3, shuffle=True)
for train_files, cv_files in kf:
    dataTrain = pd.concat((pd.read_csv(allFiles[idTrain], header=None) for idTrain in train_files))
    dataTest = pd.concat((pd.read_csv(allFiles[idTest], header=None) for idTest in cv_files))
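For what it's worth, with a newer scikit-learn the same per-file folding can be written with model_selection.KFold, whose API differs slightly; this is only a sketch under that assumption:
import glob
import pandas as pd
from sklearn.model_selection import KFold
all_files = glob.glob("./dataset/*.txt")
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(all_files):
    # each fold keeps whole files together, so grouped samples never straddle train and test
    data_train = pd.concat(pd.read_csv(all_files[i], header=None) for i in train_idx)
    data_test = pd.concat(pd.read_csv(all_files[i], header=None) for i in test_idx)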

Combining an image and shapefile in MATLAB

I've been trying to combine an image produced from a deforestation database called Hansen with a shapefile created in ArcGIS to make a georeferenced image. The script I've written so far is below, but I am unable to figure out how to combine the two (I've tried several scripts, including http://uk.mathworks.com/help/map/examples/creating-maps-using-mapshow.html?searchHighlight=overlay%20maps). Any assistance would be helpful!
Thank you,
Michelle
% Read in thresholded Hansen data
Data_FrenchGuiana = imread('FrenchGuiana_GFC_extract_thresholded.tif');
LossYear_FrenchGuiana = Data_FrenchGuiana(:,:,2);
LossYear_FrenchGuiana = double(LossYear_FrenchGuiana);
figure('color','white');
image(LossYear_FrenchGuiana)
imwrite(LossYear_FrenchGuiana,'LossYear_FrenchGuiana.tif')
country = shaperead('FrenchGuiana.shp');
figure; mapshow(country);
xlabel('easting in meters')
ylabel('northing in meters')
