Which RDKit fingerprint corresponds to the ECFP4 fingerprint?

I have two questions about the Morgan fingerprint function of RDKit.
I couldn't figure out whether a Morgan fingerprint with radius 2 or radius 4 corresponds to ECFP4.
Furthermore, I couldn't figure out why the calculated similarity between two molecules is substantially smaller when using GetMorganFingerprintAsBitVect(nBits=2048) instead of GetMorganFingerprint.
Help or explanations would be very much appreciated.
Kind regards
Philipp

In answer to your first question: according to https://www.rdkit.org/docs/GettingStartedInPython.html, a radius of 2 is roughly equivalent to ECFP4.
The default atom invariants use connectivity information similar to
those used for the well known ECFP family of fingerprints.
Feature-based invariants, similar to those used for the FCFP
fingerprints, can also be used. The feature definitions used are
defined in the section Feature Definitions Used in the Morgan
Fingerprints. At times this can lead to quite different similarity
scores:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

m1 = Chem.MolFromSmiles('c1ccccn1')
m2 = Chem.MolFromSmiles('c1ccco1')
# Connectivity-based (ECFP-like) invariants
fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)
# Feature-based (FCFP-like) invariants
ffp1 = AllChem.GetMorganFingerprint(m1, 2, useFeatures=True)
ffp2 = AllChem.GetMorganFingerprint(m2, 2, useFeatures=True)
DataStructs.DiceSimilarity(fp1, fp2)
# 0.36...
DataStructs.DiceSimilarity(ffp1, ffp2)
# 0.90...
When comparing the ECFP/FCFP fingerprints and the Morgan fingerprints generated by the RDKit, remember that the 4 in ECFP4
corresponds to the diameter of the atom environments considered, while
the Morgan fingerprints take a radius parameter. So the examples
above, with radius=2, are roughly equivalent to ECFP4 and FCFP4.
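As for the second part of the question: GetMorganFingerprint returns a count-based, unfolded fingerprint, while GetMorganFingerprintAsBitVect discards the counts and folds the environments into a fixed number of bits, where distinct environments can collide, which is why the two calls can give noticeably different similarity values. A minimal sketch to see the difference, reusing the molecules from the example above:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

m1 = Chem.MolFromSmiles('c1ccccn1')
m2 = Chem.MolFromSmiles('c1ccco1')

# Count-based, unfolded (sparse) fingerprint
fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)
print(DataStructs.DiceSimilarity(fp1, fp2))

# Bit-based fingerprint folded into 2048 bits: counts are dropped and
# different environments can land on the same bit
bv1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, nBits=2048)
bv2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, nBits=2048)
print(DataStructs.DiceSimilarity(bv1, bv2))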


Non Negative Matrix Factorization: How to predict values for a new entry?

My current understanding:
I have tried reading a few papers and links regarding NMF. They all talk about how we can split an MxN matrix into MxR and RxN matrices, where R is the (much smaller) number of latent components.
Question:
I have a list of users (U) and some assignments (A) for each user. Now I split this matrix (UxA) using NMF and get two matrices, UxR and RxA. How do I use these to predict what assignments (A') a new user (U') should have?
Any help would be appreciated, as I couldn't work this out even after searching for the answer.
Side question (opinion-based):
Also, if anyone can tell me from their experience: how do you choose R, especially when the number of assignments is on the order of 50,000 or perhaps a hundred thousand? I have been trying this with the scikit-learn library.
Edit:
This can simply be done using model.inverse_transform(model.transform(User'))
You can think of this problem as a recommender system: you want to approximately decompose the matrix X into two nonnegative matrices W and H.
see https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html
For Python's scikit-learn, you can use:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])  # example data
model = NMF(n_components=2, init='random', random_state=0)
W = model.fit_transform(X)  # shape (n_samples, n_components)
H = model.components_       # shape (n_components, n_features)
where X is the matrix you want to decompose, and W and H are the nonnegative factors.
To predict the assignments (A') of a new user (U'), project the new user's row into the latent space with model.transform and multiply by H to complete the matrix.
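A minimal sketch of that prediction step; the tiny user-by-assignment matrix below is invented purely for illustration:

import numpy as np
from sklearn.decomposition import NMF

# Toy data: rows are users, columns are assignments (invented for illustration)
X = np.array([[1, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)

model = NMF(n_components=2, init='random', random_state=0)
W = model.fit_transform(X)  # U x R
H = model.components_       # R x A

# New user with one known assignment
u_new = np.array([[1, 0, 0, 0]], dtype=float)
w_new = model.transform(u_new)  # project into the latent space (1 x R)
predicted = w_new @ H           # reconstructed assignment scores (1 x A)
# Equivalent to the shortcut mentioned in the edit above:
# predicted = model.inverse_transform(model.transform(u_new))
print(predicted)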

Hashing algorithms for data summary

I am searching for a non-cryptographic hashing algorithm with a given set of properties, but I do not know how to describe it in Google-able terms.
Problem space: I have a vector of 64-bit integers which are mostly linearly distributed throughout that space. There are two exceptions to this rule: (1) the number 0 occurs considerably more frequently, and (2) if a number x occurs, it is more likely to occur again than chance (2^-64) would suggest. The goal is, given two vectors A and B, to have a convenient mechanism for quickly detecting whether A and B are not the same. Not all vectors are of a fixed size, but any vector I wish to compare to another will have the same size (i.e., a size check is trivial).
The only special requirement I have is the ability to "back out" a piece of data. In other words, given hash(A) with A[i] = x, it should be cheap to compute hash(A) after setting A[i] = y. This is why I want a non-cryptographic hash.
The most reasonable thing I have come up with is this (in Python-ish):
import random

# Imagine this uses a Mersenne Twister or some other seeded RNG...
def generate_numbers(seed, n=1024):
    rng = random.Random(seed)  # CPython's Random is a Mersenne Twister
    return [rng.getrandbits(64) for _ in range(n)]

NUMS = generate_numbers(seed=42)

def hash(a):
    out = 0
    for idx in range(len(a)):
        out ^= a[idx] ^ NUMS[idx]
    return out

def hash_replace(orig_hash, idx, orig_val, new_val):
    return orig_hash ^ (orig_val ^ NUMS[idx]) ^ (new_val ^ NUMS[idx])
It is an exceedingly simple algorithm and it probably works okay. However, all my experience with writing hashing algorithms tells me somebody else has already solved this problem in a better way.
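A quick sanity check for the sketch above, confirming that hash_replace agrees with rehashing from scratch (the masking values cancel, so only the old and new values matter):

a = [3, 5, 7, 11]
h = hash(a)
a[2] = 9  # replace element 2: old value 7, new value 9
assert hash_replace(h, 2, 7, 9) == hash(a)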
I think what you are looking for is called a homomorphic hashing algorithm, and it has already been discussed in the context of the Paillier cryptosystem.
As far as I can see from that discussion, there is no practical implementation nowadays.
The most interesting feature, the one for which I guess it fits your needs, is that:
H(x*y) = H(x)*H(y)
Because of that, you can freely define the lower limit of your unit and rely on that property.
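As a toy demonstration of that multiplicative property, here is a raw-RSA-style map; the constants are tiny and insecure, chosen only for illustration:

# H(x) = x^e mod n is multiplicative: H(x*y) = H(x)*H(y) mod n
e, n = 17, 3233  # insecure demo constants (n = 61 * 53)

def H(x):
    return pow(x, e, n)

x, y = 42, 99
assert H(x * y) == (H(x) * H(y)) % n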
I used the Paillier cryptosystem a few years ago during my studies (there was a Java implementation somewhere, but I no longer have the link), but it is far more complex than what you are looking for.
It has interesting features under certain constraints, like the following one:
n*C(x) = C(n*x)
Again, it looks similar to what you are looking for, so maybe you should search for this family of hashing algorithms. I'll try a Google search for a more specific link.
References:
This one is quite interesting, but maybe it is not a viable solution because your space is [0, 2^64) (unless you accept dealing with big numbers).

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y):
f(x, y) = 1, if characters x and y are similar
f(x, y) = 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when a character has relatively few "similar" characters, its performance drops exponentially as we increase the number of similar characters per character. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on edit distances alone).
One thing to note is that in the wild, the dictionary is going to be really large.
I might try to use cosine similarity, using the position of each character as a feature and mapping the product between features through a match function based on your character relations.
Not very specific advice, I know, but I hope it helps you.
edited: Expanded answer.
With cosine similarity, you compute how similar two vectors are. In your case the normalisation might not make sense, so what I would do is something very simple (I might be oversimplifying the problem). First, see the CxC matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This also allows you to have partial dependencies, to differentiate between perfect and partial matches. After this, compute, for each position, the probability that the letters from the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results using a sum.
This counts the number of positions that differ for a specific pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well, which is why I am not sure whether I have misunderstood your question.
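A minimal sketch of that per-position scoring idea on the example from the question; SIM, f, mismatch_score, and fuzzy_find are illustrative names, and SIM stands in for the programmer-supplied similarity relation:

# Programmer-supplied character similarity relation (from the question)
SIM = {('t', 'l'), ('a', 'o'), ('f', 't')}

def f(x, y):
    return 1.0 if x == y or (x, y) in SIM or (y, x) in SIM else 0.0

def mismatch_score(word, window):
    # Sum of per-position "probability of mismatch" (complement of f)
    return sum(1.0 - f(a, b) for a, b in zip(word, window))

def fuzzy_find(word, query):
    # Report 0-based start indexes where every position matches under f
    n = len(word)
    return [i for i in range(len(query) - n + 1)
            if mismatch_score(word, query[i:i + n]) == 0.0]

for w in ['cot', 'cat', 'catalyst']:
    print(w, fuzzy_find(w, 'cofatyst'))  # each matches at index 0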
I am using the Fuse JavaScript library for a project of mine. It is a JavaScript file which works on a JSON dataset, and it is quite fast. Have a look at it.
It implements the full Bitap algorithm, leveraging a modified version of the Diff, Match & Patch tool by Google (according to its site).
The code is simple enough to understand how the algorithm is implemented.

When to stop looping in random number generators?

I'm not sure StackOverflow is the right place to ask this question, because it is half programming and half mathematics. And I'm really sorry if my question is stupid ^_^
I'm studying Monte Carlo simulations via the "Monte Carlo Methods" book. One of the first things I must learn about is random number generators. The basic algorithm of an RNG is:
1. Initialize: draw the seed S_0 from the distribution µ on S. Set t = 1.
2. Transition: set S_t = f(S_{t-1}).
3. Output: set U_t = g(S_t).
4. Repeat: set t = t + 1 and return to Step 2.
(µ is a probability distribution on the finite set of states S, the input is S_0, and the random number we desire is the output U_t.)
It is not hard to understand, but the problem is that I don't see where the random factor lies, or how the number of repetitions is decided. How do we decide when to stop the loop of the RNG? All the example RNG implementations I have read loop 100 times, and they return the same values for a specific seed. That is not random at all >_<
Can someone explain what I'm missing here? Any help will be appreciated. Thanks everyone
You can't get a true sequence of random numbers on a computer, without specialized hardware. (Such specialized hardware performs the equivalent of an initial roll of the dice using physics to provide the randomness. Electronic ones often use the electronic noise of specialized diodes at constant temperatures; others use radioactive decay events.)
Without that specialized hardware, what you can generate are pseudorandom numbers which, as you've observed, always generate the same sequence of numbers for the same initial seed. For simple applications, you can often get away with generating an initial seed from the time of invocation, which is effectively random.
And when I say "simple applications," I am excluding cryptography. (Not just that, but especially that.)
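As for when to stop the loop: the generator itself never decides; you simply draw as many U_t values as your simulation needs. A minimal Python sketch of the seeding behaviour described above:

import random
import time

rng = random.Random(1)            # fixed seed: reproducible stream
print([rng.random() for _ in range(3)])

rng = random.Random(1)            # same seed again: identical output
print([rng.random() for _ in range(3)])

rng = random.Random(time.time())  # seed from the clock: varies per run
print([rng.random() for _ in range(3)])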
Sometimes when you are trying to debug a simulation, you actually want a reproducible stream of "random" numbers, so you might deliberately start the stream with a specific seed.
For instance, in the answer to "Creating a facet_wrap plot with ggplot2 with different annotations in each plot", rcs starts by creating a reproducible set of data using the following R code:
set.seed(1)
df <- data.frame(x=rnorm(300), y=rnorm(300), cl=gl(3,100)) # create test data
before going on to demonstrate how to answer the actual question.

How to translate a chromosome (or a set of chromosomes) to an attribute?

So, since the answer to this basically said that I really should look into encoding the genes of my creatures*, which I did!
So, I created the following neat little (byte[]-) structure:
Gene = { X, X, X, X, Y, Y, Y, Y, Z, Z, Z, Z }
Where
X = Represents a certain trait in a creature.
Y = These blocks control how, if, and when crossover and mutation will happen (16 possible values; I think that should be more than enough!)
Z = The length of the strand (basically, this is for future builds, where I want to let evolution control even the length of the entire strand).
(So Z and Y can be thought of as META-information)
(Before you ask: yes, that's 12 bytes :) )
My questions to you are as follows:
How would I connect these to the characteristics of each 'creature'?
Basically, I see it this way (and this will probably be how I will implement it):
Each 'creature' can run around, eat and breed; basic stuff. I don't think (at least I sure hope not!) I will need a fitness function per se, but I hope that evolution, as in the race for food, partners and space, will drive the creatures to evolve.
Is this view wrong? Would it be easier (do note, I'm a programmer, not a mathematician!) to look at it as one big graph and "simply" take it from there?
Or, tl;dr: Can you point me in the right direction to articles, studies and/or examples of implementation of this?
(Even more tl;dr; How do I translate a Gene to, say for example the Length of a leg?)
*Read the question, I'm building a sort of simulator.
I haven't seen metainformation like your Y and Z in a genetic algorithm's gene sequence before (in my limited exposure to the technique). Is your genetic algorithm a nontraditional one?
How many traits does a creature have? If your X's are representing the values of traits, and a gene sequence can have variable length (Z), then what happens if there are not enough X's defined for all the traits? What happens if there are more X's than the traits you have for a creature?
Z should be a fixed value.
Y should be parameters of your genetic evolution routine.
There should be an X (or set of X's) for every trait of a creature (no more, no less).
If the number of X's you have is fixed, then for each trait you assign a particular index (or set of indexes) to represent that trait.
EDIT:
You should determine the encoding of the traits that X represents. For the length of a leg, e.g., you could have a few bits represent the leg length. If bits 3-5 were the length of the leg, you could represent the length in your X-vector like this:
[...101......]
where the dots are other trait representations. The above snippet represents a leg length of 5 (whatever that means). The genome below still has 5 as the leg length, but with other traits filled out as well:
[001101011011]
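A minimal decoding sketch of the idea above; the bit positions and the trait layout are illustrative assumptions, not a fixed scheme:

def decode_leg_length(genome_bits, start=3, length=3):
    """Extract an integer trait from a bit-string genome.

    genome_bits: a string like '001101011011'
    start, length: the (assumed) slice holding the leg-length trait
    """
    return int(genome_bits[start:start + length], 2)

print(decode_leg_length('001101011011'))  # -> 5, as in the example above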
Looking at Mitchell, 1998, An Introduction to Genetic Algorithms, Chap. 3.3, I found a reference to Forrest and Jones, 1994, Modeling Complex Adaptive Systems with Echo, which describes the software Echo; it seems to do what you are looking for (evolving creatures in a world). For the moment I can't find a link to it, but here is a dissertation on implementing jEcho, by Brian McIndoe.
