Get structure from Morgan fingerprint, RDKit - rdkit

I made a model to predict molecules' solubility from their Morgan fingerprints, and I have now found the specific fingerprint bits the model had a hard time with.
I would like to see what part of a molecule's structure each fingerprint bit corresponds to. Thanks to the user rapelpy I found DrawMorganBits, but for that I need the mol (or SMILES) of the molecule, and I only have the fingerprints, not any specific molecule.
Is it possible to get the mol or SMILES back from a fingerprint, or can I draw the structures some other way using just the fingerprints?
Thanks in advance.

You can use DrawMorganBit() as described in the RDKit blog.

If you only have a molecular fingerprint, it is difficult to track back to the substructure that caused each bit to be set – and may even be impossible depending on which fingerprint you are using.
In the above RDKit blog, the bitInfo dict captures the substructure responsible for each bit being set, prior to "folding"/"hashing" the fingerprint. The process of hashing causes bit collisions, so it is not possible to map back deterministically without having this dictionary in the first place.
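If you can regenerate the fingerprints from your training molecules, a minimal sketch of keeping the bitInfo dict and drawing a bit might look like this (the SMILES, radius, fingerprint size, and bit chosen here are just placeholders):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

mol = Chem.MolFromSmiles("c1ccccc1O")  # placeholder molecule (phenol)
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048, bitInfo=bit_info)

# bit_info maps each set bit to a list of (atom index, radius) environments
bit_of_interest = next(iter(bit_info))  # pick any set bit; use your problematic bit instead
img = Draw.DrawMorganBit(mol, bit_of_interest, bit_info)
img  # in a notebook this renders the atom environment responsible for the bit
```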
If you have the willpower, and keeping track of the bitInfo really is not possible, you could try generating structures (or randomly sampling structures) that set the bit you are interested in; this will let you guess which substructures may originally have been responsible.
A place to start might be the GuacaMol benchmark codebase, which includes tasks and baseline methods that can generate molecules from their fingerprints.

Related

What is the output of a fingerprint scanner? Is there any deterministic identifying information?

I am planning on generating a set of public/private keys from a deterministic identifying piece of information from a person and was planning on using fingerprints.
My question, therefore, is: what is the output of a fingerprint scanner? Is there any deterministic output I could use, or is it always going to be a matter of "confidence level"? i.e. Do I always get a "number" which, if matched exactly to the database, will allow access, or do I rather get a number which, if "close enough" to the stored value on the database, allows access, based on a high degree of confidence, rather than an exact match?
I am quite sure the second option is the answer but just wanted to double-check. Is there any way to get some sort of deterministic output? My hope was to re-generate keys every time rather than actually storing fingerprint data. That way a wrong fingerprint would simply generate a new and useless key.
Any suggestions?
Thanks in advance.
I would advise against it for several reasons.
The fingerprints are not entirely deterministic. As suggested in @ImSimplyAnna's answer, you might 'round' the results to improve the chances of obtaining a deterministic value. But that would significantly reduce the number of possible/plausible fingerprints, and thus fail the search-space-size requirement for a cryptographic algorithm. On top of that, I suspect the entropy of such a result is rather low compared to the requirements of modern algorithms, which are always based on high-quality random numbers.
Fingerprints are not secret: we expose them to everyone all the time, and they can be revealed to an attacker at any moment, captured in a picture with a simple camera. A key must be a secret, and the only place we know we can store secrets without exposing them is our brain (which is why we use passwords).
An important feature of cryptographic keys is the ability to generate new ones if there is reason to believe the current ones might be compromised. This is not possible with fingerprints.
That is why I would advise against it. More generally, I discourage anyone (myself included) from writing their own cryptographic algorithm, because it is so easy to get wrong. It might be the easiest thing to get wrong, out of all the things you could write, because attackers are so vicious!
The only good approach, if you're not a skilled specialist, is to use libraries that are used all around, because they've been written by experts on the matter and have been subject to many attacks and attempts to break them, so the ones still standing offer much better protection than anything a non-specialist could write (or, really, anything a single human could write).
You can also have a look at this question on the Crypto Stack Exchange. They also discourage the OP from using anything other than a battle-hardened algorithm or protocol.
Edit:
I am planning on generating a set of public/private keys from a deterministic identifying piece of information
Actually, it did not strike me at first (it should have), but keys MUST NOT be generated from anything that is not random. NEVER.
You have to generate them randomly. If you don't, you are already giving the attacker more information than they should have. Being a programmer does not make you a cryptographer. Your users' information is at stake; do not take any chances (and if you're not a cryptographer, you don't stand a chance anyway).
A fingerprint scanner looks for features where the lines on the fingerprint either split or end. It then calculates the distances and angles between such features in an attempt to find a match.
Here's some more reading on the subject:
https://www.explainthatstuff.com/fingerprintscanners.html
in the section "How fingerprints are stored and compared".
That source is the best explanation I could find, but looking around some more, it seems that all fingerprint scanners use some variation of that algorithm to generate data that can be matched.
Storing raw fingerprints would not only take up way more space on a database but also be a pretty significant security risk if that information was ever leaked, so it's not really done unless absolutely necessary.
Judging by that algorithm, I would assume that there is always some "confidence level". The angles and distances will never be 100% equal between scans, so there has to be some leeway to make sure a match is still found even if the finger is pressed against the scanner a bit harder or the finger is at a slightly different angle.
Based on this, I'd assume that generating a key pair based on a fingerprint would be possible, if you can figure out a way to make similar scans result in the same information. Simply rounding the angles and distances may work, but may introduce cases where two different people generate the same key pairs, or cases where different scans of the same fingerprint have a high chance of generating several different keys.
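Purely as a toy illustration of that rounding idea (and keeping in mind the first answer's warning that fingerprint-derived keys are a bad idea), a sketch might look like the following. The feature representation, step sizes, and sample values are all invented for the example:

```python
import hashlib

def key_material_from_minutiae(minutiae, dist_step=5.0, angle_step=10.0):
    """Toy illustration only: quantize (distance, angle) pairs so that
    slightly different scans land in the same buckets, then hash them.
    This is NOT a sound way to derive real cryptographic keys."""
    buckets = sorted(
        (round(d / dist_step), round(a / angle_step)) for d, a in minutiae
    )
    canonical = ",".join(f"{d}:{a}" for d, a in buckets).encode()
    return hashlib.sha256(canonical).hexdigest()

# Two noisy scans of the "same" finger that happen to quantize identically:
scan_a = [(12.2, 31.0), (47.8, 118.5)]
scan_b = [(11.8, 29.4), (48.3, 121.0)]
print(key_material_from_minutiae(scan_a) == key_material_from_minutiae(scan_b))  # True

# A measurement near a bucket boundary still jumps buckets and changes the key,
# which is exactly the failure mode described above.
```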

Understanding the effect the distribution of data has on hashing

So I've read the Wikipedia page on Hash functions as I'm currently playing with some.
That page, and other sources I've read, mention that the distribution of the data affects the hash function.
Despite some explanations it is still unclear to me what exactly those effects are, and why. So my questions:
1. Just to make sure I've got it right: when they mention distribution, is this the frequency of each word in the input data set?
2. What effect does the distribution of input data have on hash functions? Of particular interest is the performance of the hash function, in terms of both speed and uniformity of the output produced by the hash algorithm.
EDIT 1:
I'm thinking specifically of the Wikipedia English corpus vs data from a more dynamic source, Twitter's tweets for example.
Usually you do not have as many input datasets as there are possible inputs. The distribution is therefore more of a probability that a certain input with certain features will be picked (essentially the same as you said, but with p < 1 for every word instead of some count n > 1). E.g. if you know that the first bit of the input will always be 1, then the data is not uniformly distributed.
If your hash were very simple, e.g. only taking the first byte as the 'hash', then this non-uniform distribution would lead to more collisions than anticipated (only 128 values are possible, even though you expected to get 256 different values).
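A quick way to see that effect is to feed a toy "first byte" hash inputs whose first bit is always set, and compare it with the first byte of a real hash such as SHA-256. Everything here (input size, sample count) is illustrative:

```python
import hashlib
import os
from collections import Counter

def first_byte_hash(data: bytes) -> int:
    # Naive "hash": just the first byte of the input.
    return data[0]

# Biased inputs: first bit always set (first byte in 128..255), as described above.
biased = [bytes([128 + os.urandom(1)[0] % 128]) + os.urandom(7) for _ in range(10_000)]

naive = Counter(first_byte_hash(d) for d in biased)
real = Counter(hashlib.sha256(d).digest()[0] for d in biased)

print(len(naive))  # at most 128 distinct buckets
print(len(real))   # close to 256 distinct buckets
```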
Most (cryptographic) hash functions that you might know by name are good enough that you do not have to care about this. For cryptography it is even an explicit requirement: you must not be able to tell how many bits of the input changed just by looking at the difference between the hashes. That does not mean it is impossible, though. I vaguely remember a paper reporting an increased collision rate for MD5 when only ASCII letters and digits were hashed. I cannot find it right now, so take this piece of information with care. But even if I have mixed something up, such a scenario is easily possible. And no matter whether it is MD5 or some other algorithm, if you actually have such a relation, then the distribution of your input datasets certainly becomes relevant again.

OpenCV: Fingerprint Image and Compare Against Database

I have a database of images. When I take a new picture, I want to compare it against the images in this database and receive a similarity score (using OpenCV). This way I want to detect, if I have an image, which is very similar to the fresh picture.
Is it possible to create a fingerprint/hash of my database images and match new ones against it?
I'm searching for an algorithm, code snippet, or technical demo, not a commercial solution.
Best,
Stefan
As Paul R has commented, this "fingerprint/hash" is usually a set of feature vectors or feature descriptors. But most feature vectors used in computer vision are too computationally expensive to search against a database. So this task needs a special kind of feature descriptor, because descriptors such as SURF and SIFT would take too much time to search even with various optimizations.
The only thing that OpenCV has for your task (object categorization) is an implementation of Bag of Visual Words (BOW).
It can compute a special kind of image feature and train a visual-word vocabulary. You can then use this vocabulary to find similar images in your database and compute a similarity score.
Here is the OpenCV documentation for bag of words. OpenCV also has a sample named bagofwords_classification.cpp; it is really big but might be helpful.
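A rough sketch of that pipeline in Python, assuming opencv-contrib-python with SIFT available; image_paths, the vocabulary size, and the query filename are placeholders, and the database images need enough keypoints to support the chosen vocabulary size:

```python
import cv2
import numpy as np

image_paths = ["db_image_1.jpg", "db_image_2.jpg"]  # placeholder database images

sift = cv2.SIFT_create()
trainer = cv2.BOWKMeansTrainer(100)  # 100 visual words, an arbitrary choice
for path in image_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        trainer.add(np.float32(desc))
vocabulary = trainer.cluster()  # k-means over all collected descriptors

extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
extractor.setVocabulary(vocabulary)

def bow_descriptor(path):
    """Histogram of visual words for one image (assumes the image has keypoints)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return extractor.compute(img, sift.detect(img, None))

query = bow_descriptor("new_picture.jpg")
for path in image_paths:
    hist = bow_descriptor(path)
    score = float(np.dot(query.ravel(), hist.ravel()))  # simple dot-product similarity
    print(path, score)
```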
Content-based image retrieval systems are still a field of active research: http://citeseerx.ist.psu.edu/search?q=content-based+image+retrieval
First you have to be clear about what constitutes 'similar' in your context:
Similar color distribution: use something like color descriptors for subdivisions of the image; you should get some fairly satisfying results.
Similar objects: since the computer does not know what an object is, you will not get very far unless you have extensive domain knowledge about the objects (or only a few object classes). A good overview of the current state of research can be seen here (results) and soon here.
There is no "serves all needs" algorithm for the problem you described. The more you can share about the specifics of your problem, the better answers you might get. Posting some representative images (if possible) and describing the desired outcome is also very helpful.
This would be a good question for computer-vision.stackexchange.com, if it already existed.
You can use the pHash algorithm and store the pHash value in the database, then use this code:
double const mismatch = algo->compare(image1Hash, image2Hash);
Here the 'mismatch' value tells you how similar the two images are.
Available hash functions:
AverageHash
PHASH
MarrHildrethHash
RadialVarianceHash
BlockMeanHash
ColorMomentHash
These functions are good enough to evaluate image similarity in most respects.
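For reference, the same idea in Python (the snippet above uses the C++ API); this assumes opencv-contrib-python, and the file names are placeholders:

```python
import cv2

# The img_hash module lives in opencv-contrib.
algo = cv2.img_hash.PHash_create()

img1 = cv2.imread("image1.jpg")  # placeholder paths
img2 = cv2.imread("image2.jpg")

hash1 = algo.compute(img1)  # compact perceptual hash, suitable for storing in a database
hash2 = algo.compute(img2)

mismatch = algo.compare(hash1, hash2)  # for PHash, smaller means more similar
print(mismatch)
```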

Ideal hashing method for wide distribution of values?

As part of a rhythm game that I'm working on, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the songs and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible; however, cryptographic strength isn't as important here as a wide, uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
Any cryptographic-strength algorithm should exhibit no collisions at all. Of course, collisions necessarily exist (there are more possible inputs than possible outputs), but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with work about 2^(n/2), so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision; they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and it provides 128 bits of output, which avoids random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. On even a basic PC, they already process data faster than a hard disk can supply it: if you hash a file, reading the file will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and to consider switching to another function only if some performance-related trouble is duly noticed and measured.
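A minimal sketch of that suggestion in Python; the function name and chunk size are arbitrary:

```python
import hashlib

def song_id(path: str) -> str:
    """Hash a song/notechart file with SHA-256 and truncate the output to 128 bits,
    as suggested above for a non-security setting."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:32]  # 32 hex characters == 128 bits
```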
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be perfectly satisfactory.
If cryptographic security is not a concern, then you can look at this link & this. The fastest and simplest (to implement) would be Pearson hashing, if you are planning to compute a hash of the title/name and later do lookups. Or you can have a look at SuperFastHash here; it is also very good for non-cryptographic use.
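For reference, Pearson hashing is only a few lines; the sketch below uses a shuffled table with a fixed seed so the results are reproducible, and note that its 8-bit output is far too narrow on its own for uniquely identifying uploads:

```python
import random

# Pearson hashing: a fixed 256-entry permutation table, one XOR + table lookup per byte.
random.seed(0)  # fixed seed so the table (and therefore the hashes) are reproducible
TABLE = list(range(256))
random.shuffle(TABLE)

def pearson_hash(data: bytes) -> int:
    """8-bit Pearson hash; chain several runs with different start values for a wider digest."""
    h = 0
    for b in data:
        h = TABLE[h ^ b]
    return h

print(pearson_hash(b"My Custom Song.chart"))
```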
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K to fit in two bytes) and 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
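A sketch of that length-plus-checksum idea, using CRC-32 as the (unspecified) 32-bit checksum:

```python
import zlib

def six_byte_hash(data: bytes) -> bytes:
    """File length mod 64K (2 bytes) + CRC-32 checksum (4 bytes) = 6-byte hash."""
    length_part = (len(data) % 65536).to_bytes(2, "big")
    crc_part = zlib.crc32(data).to_bytes(4, "big")
    return length_part + crc_part

print(six_byte_hash(b"example notechart contents").hex())
```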
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
Isn't cryptographic hashing overkill in this case, though I understand that modern computers do this calculation pretty fast? I assume your users will have a unique user id. When they upload, you just need to increment a number, so you would represent the songs internally as userid1_song_1, userid1_song_2, etc. You can store this in a database with that as the unique key, along with the user-specified name.
You also didn't mention the size of these songs. If it is MIDI, the files will be small. If the files are big (say 3 MB), then SHA calculations will not be instantaneous. On my Core 2 Duo laptop, sha256sum of a 3.8 MB file takes 0.25 seconds; sha1sum takes 0.2 seconds.
If you intend to use a cryptographic hash, then SHA-1 should be more than adequate and you don't need SHA-256. No collisions (though they exist) have been found yet. Git, Mercurial and other distributed version control systems use SHA-1. Git is a content-based system and uses SHA-1 to find out whether content has been modified.

Any caveats to generating unique filenames for random images by running MD5 over the image contents?

I want to generate unique filenames per image, so I'm using MD5 to make the filenames. Since two copies of the same image could come from different locations, I'd like to base the hash on the image contents. What caveats does this present?
(doing this with PHP5 for what it's worth)
It's a good approach. There is an extremely small possibility that two different images might hash to the same value, but in reality your data center has a greater probability of suffering a direct hit by an asteroid.
One caveat is that you should be careful when deleting images. If you delete an image record that points to some file and you delete the file too, then you may be deleting a file that has a different record pointing to the same image (that belongs to a different user, say).
Given completely random file contents and a good cryptographic hash, the probability that there will be two files with the same hash value reaches 50% when the number of files is roughly 2^(number of bits in the hash / 2). That is, for a 128-bit hash there will be a 50% chance of at least one collision when the number of files reaches 2^64.
Your file contents are decidedly not random, but I have no idea how strongly that influences the probability of collision. This is called the birthday attack, if you want to google for more.
It is a probabilistic game. If the number of images is substantially less than 2^64, you're probably fine. If you're still concerned, using a combination of SHA-1 plus MD5 (as another answer suggested) gets you to a total of 288 high-quality hash bits, which means you'll have a 50% chance of a collision once there are 2^144 files. 2^144 is a mighty big number. Mighty big. One might even say huge.
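To put rough numbers on that birthday bound, the standard approximation p ≈ 1 - exp(-k^2 / 2^(n+1)) for k files and an n-bit hash can be evaluated directly (a quick sketch; the function name is arbitrary):

```python
import math

def collision_probability(num_files: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ≈ 1 - exp(-k^2 / 2^(n+1))."""
    return -math.expm1(-(num_files ** 2) / 2 ** (hash_bits + 1))

print(collision_probability(2 ** 64, 128))  # ~0.39, in the 50% ballpark quoted above
print(collision_probability(10 ** 9, 128))  # ~1.5e-21: negligible for a billion files
```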
You should use SHA-1 instead of MD5, because MD5 is broken. There are pairs of different files with the same MD5 hash (not theoretical; these are actually known, and there are algorithms to generate even more pairs). For your application, this means someone could upload two different images which would have the same MD5 hash (or someone could generate such a pair of images and publish them somewhere in the Internet such that two of your users will later try to upload them, with confusing results).
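The question is about PHP5, but as a Python sketch of the same content-addressed-filename idea with SHA-1 (the helper name and chunked read are illustrative):

```python
import hashlib
import os
import shutil

def store_image(src_path: str, dest_dir: str) -> str:
    """Name the stored file after the SHA-1 of its contents and return the new path."""
    h = hashlib.sha1()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    ext = os.path.splitext(src_path)[1].lower()
    dest = os.path.join(dest_dir, h.hexdigest() + ext)
    if not os.path.exists(dest):  # identical images collapse to a single stored file
        shutil.copyfile(src_path, dest)
    return dest
```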
Seems fine to me, if you're ok with 32-character filenames.
Edit: I wouldn't use this as the basis of (say) the FBI's central database of terrorist mugshots, since a sufficiently motivated attacker could probably come up with an image that had the same MD5 as an existing one. If that were the case, you could use SHA-1 instead, which is somewhat more secure.
You could use a UUID instead?
If you have two identical images loaded from different places, say a stock photo, then you could end up over-writing the 'original'. However, that would mean you're only storing one copy, not two.
With that being said, I don't see any big issues with doing it in the way you described.
It will be time consuming. Why don't you just assign them sequential ids?
You might want to look into the technology P2P networks use to identify duplicate files. A solution involving MD5, SHA-1, and file length would be pretty reliable (and probably overkill).
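A sketch of combining those three signals into one lookup key (the function name is illustrative):

```python
import hashlib
import os

def duplicate_key(path: str):
    """Combine file length, MD5, and SHA-1 into a single key for duplicate detection,
    along the lines of the P2P-style suggestion above."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return (os.path.getsize(path), md5.hexdigest(), sha1.hexdigest())
```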
ImageMagick, and the PHP class Imagick that accesses it, can compare images more subjectively than hashing functions, by factors such as colour. There are countless methods and user preferences to consider, so here are some resources covering a few approaches, to see what might suit your intended application:
http://www.imagemagick.org/Usage/compare/
http://www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=8968&start=0
http://galleryproject.org/node/11198#comment-39927
Hashing functions like MD5 will only tell you whether the files are bit-for-bit identical; they do not check visual similarity (with a margin of error for lossy compression or slight crops).
