Compound classification using RDKit

How can I classify compounds computationally using RDKit or other libraries? For example, how can I tell whether a compound is a halide, an amine, or an alcohol? Does RDKit have built-in functions for this kind of task?

There's no straightforward way to do that, but there are some hacks you can use to classify the compounds. RDKit's rdkit.Chem.Fragments module provides functions that count the fragments in a molecule, many of which are functional groups. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit.Chem.Fragments import fr_Al_OH
fr_Al_OH(mol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available, and some of them will be useful for your task. You can iterate over all of them, and whenever a function returns a value greater than or equal to 1, you can say that the molecule has that functional group. For example, if fr_Al_OH(mol) returns a value >= 1, the compound is an alcohol.
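The loop described above could be sketched roughly as follows. This is only a sketch under assumptions: present_groups and rdkit_fragment_counters are hypothetical helper names, not RDKit functions, and the second one assumes that every name in rdkit.Chem.Fragments starting with fr_ is a counting function. The stand-in counters at the bottom are crude string counts on a SMILES string, included only so the call shape can be tried without RDKit installed.

```python
def present_groups(mol, counters):
    """Return {name: count} for every counter that finds at least one match."""
    found = {}
    for name, count_fn in counters.items():
        count = count_fn(mol)
        if count >= 1:
            found[name] = count
    return found


def rdkit_fragment_counters():
    """Collect the fr_* functions from rdkit.Chem.Fragments (requires RDKit).

    Assumption: every public name starting with "fr_" is a counting function.
    """
    from rdkit.Chem import Fragments  # lazy import so the sketch loads without RDKit

    return {
        name: getattr(Fragments, name)
        for name in dir(Fragments)
        if name.startswith("fr_")
    }


# Stand-in counters that work on a SMILES string, only to show the call shape.
# They are NOT chemically meaningful; with RDKit, pass rdkit_fragment_counters()
# and a Chem.Mol object instead.
demo_counters = {
    "hydroxyl_like": lambda smiles: smiles.count("O"),
    "halogen_like": lambda smiles: smiles.count("Cl") + smiles.count("Br"),
}
demo_result = present_groups("CCO", demo_counters)
print(demo_result)
```

With RDKit installed, the corresponding call would be present_groups(Chem.MolFromSmiles("CCO"), rdkit_fragment_counters()), after from rdkit import Chem.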


Why is my Doc2Vec model in gensim not reproducible?

I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os
os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0,500]:
    documents = [TaggedDocument(doc, [i + a])
                 for i, doc in enumerate(common_texts)]
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
                    workers=1, epochs=10, dm=0, seed=0)
    reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))]))
reps[0].sum() == reps[1].sum()
This last line returns False. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.
First & foremost, your test isn't even comparing vectors corresponding to the same texts!
In run #1, the vector for the 1st text is in model.docvecs[0]. In run #2, the vector for the 1st text is in model.docvecs[1].
And, in run #2, the vector at model.docvecs[0] is just a randomly-initialized, but never-trained, vector - because none of the training texts had a document tag of (int) 0. (If using pure ints as the doc-tags, Doc2Vec uses them as literal indexes - potentially leaving any unused slots less than your highest tag allocated-and-initialized, but never-trained.)
Since common_texts only has 11 entries, by the time you reach run #12, all the vectors in your reps array of the first 11 vectors are garbage uncorrelated with any of your texts.
However, even after correcting that:
As explained in the Gensim FAQ answer #11, determinism in this algorithm shouldn't generally be expected, given many sources of potential randomness, and the fuzzy/approximate nature of the whole approach. If you're relying on it, or testing for it, you're probably making some unwarranted assumptions.
In general, tests of these algorithms should be evaluating "roughly equivalent usefulness in comparative uses" rather than "identical (or even similar) specific vectors". For example, a test of whether apple and orange are at roughly the same positions in each other's nearest-neighbor rankings makes more sense than checking their (somewhat arbitrary) exact vector positions or even cosine-similarity.
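A ranking-based check of that kind might look like the sketch below. It is plain Python rather than gensim code, and cosine, nearest and neighbor_overlap are hypothetical helper names. The toy vectors are made up: the second run is the first scaled by 2, so every coordinate differs while the neighbor rankings stay the same.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, key, k):
    """Return the k keys most cosine-similar to `key`, best first."""
    others = [name for name in vectors if name != key]
    others.sort(key=lambda name: cosine(vectors[key], vectors[name]), reverse=True)
    return others[:k]


def neighbor_overlap(run_a, run_b, key, k):
    """Fraction of the top-k neighbors of `key` shared by both runs."""
    shared = set(nearest(run_a, key, k)) & set(nearest(run_b, key, k))
    return len(shared) / k


# Toy vectors: run_b is run_a scaled by 2, so coordinates differ but rankings do not.
run_a = {
    "apple": [1.0, 0.1],
    "orange": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.2, 0.9],
}
run_b = {name: [2.0 * x for x in vec] for name, vec in run_a.items()}

overlap = neighbor_overlap(run_a, run_b, "apple", k=1)
print(overlap)
```

A test would then assert that the overlap stays above some chosen threshold across runs instead of asserting that the vectors are equal.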
Additionally:
tiny toy datasets like common_texts won't show the algorithm's usual behavior/benefits
PYTHONHASHSEED is only consulted by the Python interpreter at startup; setting it from Python can't have any effect. But also, the kind of indeterminism it introduces only comes up with separate interpreter launches: a tight loop within a single interpreter run like this wouldn't be affected by that in any case.
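The point about interpreter startup can be checked with a small sketch that launches fresh interpreters with PYTHONHASHSEED already set in their environment. The helper name string_hash_in_fresh_interpreter is hypothetical, and the sketch assumes Python 3.7 or later for capture_output.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(seed):
    """Start a new interpreter with PYTHONHASHSEED set and return hash('apple')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    done = subprocess.run(
        [sys.executable, "-c", "print(hash('apple'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(done.stdout)


# The seed is in place before each child interpreter starts, so both launches agree.
first = string_hash_in_fresh_interpreter(0)
second = string_hash_in_fresh_interpreter(0)
print(first == second)
```

Setting os.environ['PYTHONHASHSEED'] inside an already running script changes only the environment seen by child processes, which is why the value has to be present before the interpreter starts.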
Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
gives an aggregate difference of -1.2598932e-05 when I run it.
Comparing dimension-wise:
diff = reps[0] - reps[1]
eps = 10**-4
over = (np.abs(diff) <= eps).all()
returns True on the vast majority of runs, which means that you are getting quite reproducible results given the complexity of the calculations.
I would blame the numerical stability of the calculations or uncontrolled randomness. Even though you try to control the random seed, NumPy and the standard-library random module each have their own seed, so you are not controlling all of the sources of randomness. This can also influence the results, but I did not check the actual implementation in gensim and its dependencies.
Change
import os
os.environ['PYTHONHASHSEED'] = '0'
to
import os
import sys
hashseed = os.getenv('PYTHONHASHSEED')
if not hashseed:
    os.environ['PYTHONHASHSEED'] = '0'
    os.execv(sys.executable, [sys.executable] + sys.argv)

Ada random Integer in range of array length

It's a simple question, yet I can't find anything that could help me...
I want to create some random connections between graph nodes. To do this I want to draw two random indexes and then connect the nodes.
declare
   type randRange is range 0..100000;
   n1: randRange;
   n2: randRange;
   package Rand_Int is new ada.numerics.discrete_random(randRange);
   use Rand_Int;
   gen : Generator;
begin
   n1 := random(gen) mod n; -- first node
   n2 := random(gen) mod n;
I wanted to define the range using the length of my array, but I got errors. Even with a fixed range, it doesn't compile.
Also, I can't perform the modulo because n is a Natural.
75:15: "Generator" is not visible
75:15: multiple use clauses cause hiding
75:15: hidden declaration at a-nudira.ads:50, instance at line 73
75:15: hidden declaration at a-nuflra.ads:47
And I have no idea what these errors mean - obviously, something is wrong with my generator.
I would appreciate if someone showed me a proper way to do this simple thing.
As others have answered, the invisibility of Generator is due to you having several "use" clauses for packages all of which have a Generator. So you must specify "Rand_Int.Generator" to show that you want the Generator from the Rand_Int package.
The problem with the "non-static expression" happens because you try to define a new type randRange, and that means the compiler has to decide how many bits it needs to use for each value of the type, and for that the type must have compile-time, i.e. static, bounds. You can instead define it as a subtype:
subtype randRange is Natural range 0 .. n-1;
and then the compiler knows that it can use the same number of bits as it uses for the Natural type. (I assume here that "n" is an Integer, or Natural or Positive; otherwise, use whatever type "n" is.)
Using a subtype should also resolve the problem with the "expected type".
You don't show us the whole code necessary to reproduce the errors, but the error messages suggest you have another use clause somewhere, a use Ada.Numerics.Float_Random;. Either remove that, or specify which generator you want, i.e. gen : Rand_Int.Generator;.
As for mod, you should specify the exact range you want when instantiating Discrete_Random instead:
subtype randRange is Natural range 0 .. n-1; -- but why start at 0? A list of nodes is better described with 1..n
package Rand_Int is new ada.numerics.discrete_random(randRange);
Now there's no need for mod.
The error messages you mention have to do with the concept of visibility in Ada, which differs from most other languages. Understanding visibility is key to understanding Ada. I recommend that beginners avoid use <package> in order to avoid the visibility issues involved with such use clauses. As you gain experience with the language, you can experiment with use clauses for common packages such as Ada.Text_IO.
As you seem to come from a language in which arrays have to have integer indices starting from zero, I recommend Ada Distilled, which does an excellent job of describing visibility in Ada. It covers ISO/IEC 8652:2007, but you should have no difficulty picking up Ada 2012 from that basis.
If you're interested in the issues involved in obtaining a random integer value in a subrange of an RNG's result range, or from a floating-point random value, you can look at PragmARC.Randomness.Real_Ranges and PragmARC.Randomness.U32_Ranges in the PragmAda Reusable Components.
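The main issue with reducing a raw random value by mod is that some results become slightly more likely than others whenever the generator's range is not a multiple of n. One common remedy is rejection sampling, sketched here in Python rather than Ada. This is not the PragmARC implementation; unbiased_below is a hypothetical name, and the 32-bit source width is an assumption.

```python
import random


def unbiased_below(n, rng, bits=32):
    """Return an integer in [0, n), drawn uniformly from a `bits`-bit source.

    Raw values at or above the largest multiple of n are rejected and redrawn,
    so every result is equally likely (plain `raw % n` would favor small results).
    """
    span = 1 << bits
    if not 0 < n <= span:
        raise ValueError("n must be in 1 .. 2**bits")
    limit = span - (span % n)  # largest multiple of n that fits in the source range
    while True:
        raw = rng.getrandbits(bits)
        if raw < limit:
            return raw % n


# Connect two randomly chosen nodes out of node_count (indexes 0 .. node_count-1).
rng = random.Random(2024)
node_count = 7
n1 = unbiased_below(node_count, rng)
n2 = unbiased_below(node_count, rng)
print(n1, n2)
```

Instantiating Discrete_Random with exactly the wanted range, as suggested above, leaves this concern to the library.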

How can Clojure data-structures best be tagged with a type?

I'm writing a program that's manipulating polynomials. I'm defining polynomials recursively as either a term (base case) or a sum or product of polynomials (recursive cases).
Sums and products are completely identical as far as their contents are concerned. They just contain a sequence of polynomials. But they need to be processed very differently. So to distinguish them I have to somehow tag my sequences of polynomials.
Currently I have two records - Sum and Product - defined. But this is causing my code to be littered with the line (:polynomials sum-or-product) to extract the contents of polynomials. Also printing out even small polynomials in the REPL produces so much boilerplate that I have to run everything through a dedicated prettyprinting routine if I want to make sense of it.
Alternatives I have considered are tagging my sums and products using metadata instead, or putting a + or * symbol at the head of the sequence. But I'm not convinced that either of these approaches is good style, and I'm wondering whether there's another option I haven't considered yet.
Putting a + or * symbol at the head of the sequence sounds like it would print out nicely. I would try implementing the processing of these two different "types" via multimethods, which keeps the calling convention neat and extensible. That document starts from an object-oriented programming point of view, but the "area of a shape" is a very neat example of what this approach can accomplish.
In your case you'd use first of the seq to determine if you are dealing with a sum or a product of polynomials, and the multimethod would automagically use the correct implementation.
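The shape of that dispatch can be illustrated with a rough Python analogue. It is not Clojure and not a real multimethod: a dictionary keyed on the head symbol plays the role of the dispatch function, and HANDLERS, handles and evaluate are hypothetical names. Polynomials are modeled as numbers, variable names, or tuples whose first element is "+" or "*". The sketch assumes Python 3.8 or later for math.prod.

```python
import math

HANDLERS = {}  # head symbol -> function that processes the remaining elements


def handles(tag):
    """Register the decorated function as the handler for sequences headed by `tag`."""
    def register(fn):
        HANDLERS[tag] = fn
        return fn
    return register


def evaluate(poly, env):
    """Evaluate a polynomial, dispatching on the first element of a sequence."""
    if isinstance(poly, (int, float)):
        return poly            # constant term
    if isinstance(poly, str):
        return env[poly]       # variable term
    tag, *parts = poly
    return HANDLERS[tag](parts, env)


@handles("+")
def evaluate_sum(parts, env):
    return sum(evaluate(p, env) for p in parts)


@handles("*")
def evaluate_product(parts, env):
    return math.prod(evaluate(p, env) for p in parts)


poly = ("+", ("*", 2, "x"), ("*", "x", "x"), 1)  # 2x + x*x + 1
result = evaluate(poly, {"x": 3})
print(result)
```

Supporting a new kind of node only needs another registered handler, which is the extensibility that multimethods provide in Clojure.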

How to implement a pseudo random function

I want to generate a sequence of random numbers that will be used to pick tiles for a "maze". Each maze will have an id, and I want to use that id as a seed to a pseudo random function. That way I can generate the same maze over and over given its maze id. Preferably I do not want to use a built-in pseudo random function in a language, since I do not have control over the algorithm and it could change from platform to platform. As such, I would like to know:
How should I go about implementing my own pseudo random function?
Is it even feasible to generate platform independent pseudo random numbers?
Yes, it is possible.
Here is an example of such an algorithm (and its use) for noise generation.
Those particular random functions (Noise1, Noise2, Noise3, ..) use input parameters and calculate the pseudo random values from there.
Their output range is from 0.0 to 1.0.
And there are many more out there (as mentioned in the comments).
UPDATE 2019
Looking back at this answer, a better-suited choice would be the Mersenne Twister mentioned below. Or you could find any implementation of xorshift.
The Mersenne Twister may be a good pick for this. As you can see from the pseudocode on Wikipedia, you can seed the RNG with whatever you prefer to produce identical values for any instance with that seed. In your case, that would be the maze ID or the hash of the maze ID.
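For the xorshift option, a hand-written generator is short enough to port to any language, which is what makes the output platform independent. Below is a minimal sketch of Marsaglia's 32-bit xorshift with the 13/17/5 shift triple. XorShift32 and maze_tiles are hypothetical names, and using the maze id directly as the seed is an assumption (the state must be non-zero, so an id of 0 would need remapping).

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


class XorShift32:
    """Marsaglia's 32-bit xorshift generator (shift triple 13, 17, 5)."""

    def __init__(self, seed):
        state = seed & MASK32
        if state == 0:
            raise ValueError("xorshift state must be non-zero")
        self.state = state

    def next_u32(self):
        x = self.state
        x ^= (x << 13) & MASK32
        x ^= x >> 17
        x ^= (x << 5) & MASK32
        self.state = x
        return x


def maze_tiles(maze_id, count, tile_kinds):
    """Pick `count` tile indexes in [0, tile_kinds) that depend only on maze_id."""
    rng = XorShift32(maze_id)
    return [rng.next_u32() % tile_kinds for _ in range(count)]


first_value = XorShift32(1).next_u32()
same_maze_twice = maze_tiles(42, 10, 4) == maze_tiles(42, 10, 4)
print(first_value, same_maze_twice)
```

Only integer shifts, XORs and masks are involved, so a port that masks to 32 bits in the same places should produce the same sequence. The final % tile_kinds has a slight bias when tile_kinds does not divide 2**32, which is usually negligible for picking tiles.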
If you are using Python, you can use the random module by typing import random at the beginning. Then, to use it, you type:
var = random.randint(1000, 9999)
This gives var a 4-digit number that can be used for its id.
If you are using another language, there is likely a similar module.
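To get the same sequence back for a given maze, the generator also has to be seeded with the maze id. A minimal sketch, in which maze_stream is a hypothetical name: it uses a private random.Random instance so other code drawing from the module-level generator cannot disturb the sequence. As far as I know, the Python documentation only promises a stable sequence across versions for the random() method with the same seed, so the sketch sticks to that method.

```python
import random


def maze_stream(maze_id, count):
    """Return `count` floats in [0.0, 1.0) that depend only on maze_id."""
    rng = random.Random(maze_id)  # private generator, seeded with the maze id
    return [rng.random() for _ in range(count)]


repeatable = maze_stream(7, 5) == maze_stream(7, 5)
print(repeatable)
```

This still relies on a built-in generator, so it fits only when the same Python implementation is used everywhere the maze is generated.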

Generating random number in a given range in Fortran 77

I am a beginner trying to do some engineering experiments using fortran 77. I am using Force 2.0 compiler and editor. I have the following queries:
How can I generate a random number between a specified range, e.g. if I need to generate a single random number between 3.0 and 10.0, how can I do that?
How can I read data from a text file and use it in calculations in my program? E.g. I have temperature, pressure and humidity values (hourly values for a day, so 24 values in total in each text file).
Do I also need to define in the program how many values are there in the text file?
Knuth has released into the public domain sources in both C and FORTRAN for the pseudo-random number generator described in section 3.6 of The Art of Computer Programming.
2nd question:
If your file, for example, looks like:
hour temperature pressure humidity
00 15 101325 60
01 15 101325 60
... 24 of them, for each hour one
this simple program will read it:
      implicit none
      integer hour, temp, hum, i
      real p
      character*80 junkline
      open(unit=1, file='name_of_file.dat', status='old')
      rewind(1)
C     skip the header line
      read(1,'(A)') junkline
      do 10 i=1,24
         read(1,*) hour, temp, p, hum
C        do something here ...
   10 continue
      close(1)
      end
(this is fixed-form source: statements start in column 7 and the label 10 sits in columns 1-5)
My advice: read up on data types (INTEGER, REAL, CHARACTER), arrays (DIMENSION), input/output (READ, WRITE, OPEN, CLOSE, REWIND), and loops (DO), and you'll be doing useful stuff in no time.
I never did anything with random numbers, so I cannot help you there, but I think there are some intrinsic functions in Fortran for that. I'll check it out and report tomorrow. As for the 3rd question, I'm not sure what you meant (you don't know how many lines of data you'll have in a file, or something else?).
You'll want to check your compiler manual for the specific random number generator function, but chances are it generates random numbers between 0 and 1. This is easy to handle - you just scale the interval to be the proper width, then shift it to match the proper starting point: i.e. to map r in [0, 1] to s in [a, b], use s = r*(b-a) + a, where r is the value you got from your random number generator and s is a random value in the range you want.
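The scale-and-shift step is the same in any language. Here is a small Python sketch of it for the 3.0 to 10.0 example from the question, with uniform_between as a hypothetical name and Python's random.random() standing in for whatever generator the Fortran compiler provides.

```python
import random


def uniform_between(a, b, r):
    """Map r in [0, 1] onto [a, b] using s = r*(b-a) + a."""
    return r * (b - a) + a


low_end = uniform_between(3.0, 10.0, 0.0)   # r = 0 maps to a
midpoint = uniform_between(3.0, 10.0, 0.5)  # r = 0.5 maps to the middle of the range
sample = uniform_between(3.0, 10.0, random.random())
print(low_end, midpoint, sample)
```

In Fortran the same expression applies unchanged once r holds the value returned by the compiler's generator.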
Idigas's answer covers your second question well - read in data using formatted input, then use them as you would any other variable.
For your third question, you will need to define how many lines there are in the text file only if you want to do something with all of them - if you're looking at reading the line, processing it, then moving on, you can get by without knowing the number of lines ahead of time. However, if you are looking to store all the values in the file (e.g. having arrays of temperature, humidity, and pressure so you can compute vapor pressure statistics), you'll need to set up storage somehow. Typically in FORTRAN 77, this is done by pre-allocating an array of a size larger than you think you'll need, but this can quickly become problematic. Is there any chance of switching to Fortran 90? The updated version has much better facilities for dealing with standardized dynamic memory allocation, not to mention many other advantages. I would strongly recommend using F90 if at all possible - you will make your life much easier.
Another option, depending on the type of processing you're doing, would be to investigate algorithms that use only single passes through data, so you won't need to store everything to compute things like means and standard deviations, for example.
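One well-known single-pass method is Welford's algorithm, which updates the mean and the sum of squared deviations as each value arrives, so nothing has to be stored. A Python sketch follows, with running_mean_std as a hypothetical name. It returns the population standard deviation; dividing by n - 1 instead would give the sample version.

```python
import math


def running_mean_std(values):
    """Single-pass (Welford) mean and population standard deviation."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("need at least one value")
    return mean, math.sqrt(m2 / n)


# An iterator is passed on purpose: the data is seen exactly once.
mean, std = running_mean_std(iter([2, 4, 4, 4, 5, 5, 7, 9]))
print(mean, std)
```

The loop body translates directly into a Fortran 77 DO loop around the READ statement, using three scalar variables and no arrays.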
This subroutine generates a random number in Fortran 77 between 0 and ifin,
where i is the seed: some large number such as 746397923.
      subroutine rnd001(xi,i,ifin)
      integer*4 i,ifin
      real*8 xi
C     relies on 32-bit integer overflow wrapping around
      i=i*54891
      xi=i*2.328306e-10+0.5D00
      xi=xi*ifin
      return
      end
You may modify it to return values in a different range.
