Is it possible to find the supported elliptic curves in Ruby? Searching through the OpenSSL gem, it doesn't seem to be.
The tmp_key method does not really return that information.
For example, with www.google.com I get X25519, which corresponds to what I am looking for. But for some hosts, like amazon.com, I get id-ecPublicKey (rather than secp256r1).
Regards
I made a model to predict molecules' solubility from their Morgan fingerprints, and I have now found the specific fingerprint bits the model had a hard time predicting on.
I would like to see what each bit of a fingerprint corresponds to in the structure of the molecule, and thanks to the user rapelpy I found DrawMorganBits. But that needs the mol (or SMILES) of a molecule, and I only have the fingerprints, not a specific molecule.
Is it possible to either get the mol or SMILES code from the fingerprints, or can I draw the structures just from the fingerprints in some other way?
Thanks in advance.
You can use DrawMorganBit() as described in the RDKit blog.
If you only have a molecular fingerprint, it is difficult to track back to the substructure that caused each bit to be set – and may even be impossible depending on which fingerprint you are using.
In the RDKit blog above, the bitInfo dict captures the substructure responsible for a bit being set prior to "folding"/"hashing" the fingerprint. The hashing process causes bit collisions, so it is not possible to map back deterministically without having this dictionary in the first place.
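As a minimal sketch of how that dictionary is captured and used with DrawMorganBit (the SMILES string here is just a placeholder example):

from rdkit import Chem
from rdkit.Chem import AllChem, Draw

mol = Chem.MolFromSmiles('c1ccccc1O')  # placeholder molecule (phenol)
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048, bitInfo=bit_info)

# bit_info maps each set bit to (atom index, radius) pairs describing the
# atom environment that produced it, which is what DrawMorganBit renders
some_bit = sorted(bit_info.keys())[0]
img = Draw.DrawMorganBit(mol, some_bit, bit_info)

If all you have is the folded bit vector, this mapping is exactly the information that has been lost.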
If you have the willpower and keeping track of the bitInfo is really not possible, you could try generating structures (or randomly sampling structures) that set the bit you are interested in; this will let you guess which substructures may originally have been responsible.
A place to start might be the GuacaMol benchmark codebase, which includes tasks and baseline methods that can generate molecules from their fingerprints.
I have a bunch of PDB files containing the structures of novel peptides, and I want to see if any of them will bind to LPS via computational prediction. I could do molecular dynamics, but that requires a lot of computing power unfortunately. Any other good options?
Thanks :)
One easy idea is to take the PDB of a peptide you are confident binds LPS and then argue that peptides that look structurally similar also have a shot at binding LPS. This is a semi-qualitative argument.
As an example, I found w/ a quick search this peptide sequence which binds LPS: KNYSSSISSIHAC
(source: https://www.ncbi.nlm.nih.gov/pubmed/20816904)
Then, to get the predicted structure of that sequence, I used PEP-Fold, which is an online tool
(source: http://mobyle.rpbs.univ-paris-diderot.fr/cgi-bin/portal.py#forms::PEP-FOLD3)
I think you can download the results from the job I submitted
(http://mobyle.rpbs.univ-paris-diderot.fr/data/jobs/PEP-FOLD3/C24081178740978)
After you download the PDB, you want to compute the RMSD of each of your peptide structures against this PDB; low RMSDs could suggest LPS binding, but it's definitely not definitive, and it's unclear where to draw the line of "close enough".
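If it helps, here is a rough Python sketch of that RMSD step using Biopython; the file names are placeholders, and truncating to the shorter peptide's CA atoms is a crude assumption (a proper structural alignment tool such as TM-align would handle different lengths better):

from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
ref = parser.get_structure('ref', 'lps_binder.pdb')      # known LPS-binding peptide
query = parser.get_structure('query', 'my_peptide.pdb')  # one of your peptides

ref_ca = [a for a in ref.get_atoms() if a.get_name() == 'CA']
query_ca = [a for a in query.get_atoms() if a.get_name() == 'CA']

# Superimposer needs equal-length atom lists, so crudely truncate to the shorter one
n = min(len(ref_ca), len(query_ca))
sup = Superimposer()
sup.set_atoms(ref_ca[:n], query_ca[:n])
print('RMSD:', sup.rms)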
This is all very wishy-washy, and as you say, molecular dynamics is a better option. You might also consider just doing sequence alignments with peptides that do bind LPS, without worrying about 3D.
You might get better responses at Biostars or some other forum, since your question is not coding-based.
I have a database of images. When I take a new picture, I want to compare it against the images in this database and receive a similarity score (using OpenCV). This way I want to detect whether I already have an image that is very similar to the fresh picture.
Is it possible to create a fingerprint/hash of my database images and match new ones against it?
I'm searching for an algorithm, code snippet, or technical demo, not a commercial solution.
Best,
Stefan
As Paul R has commented, this "fingerprint/hash" is usually a set of feature vectors or feature descriptors. But most feature vectors used in computer vision are too computationally expensive to search against a database, so this task needs a special kind of feature descriptor; descriptors such as SURF and SIFT take too much time to search with, even with various optimizations.
The only thing OpenCV has for your task (object categorization) is an implementation of Bag of Visual Words (BoW).
It can compute a special kind of image feature and train a visual-word vocabulary. You can then use this vocabulary to find similar images in your database and compute a similarity score.
Here is the OpenCV documentation for bag of words. OpenCV also has a sample named bagofwords_classification.cpp. It is really big, but it might be helpful.
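As a rough sketch of that pipeline in the Python bindings (not the sample itself); the image paths, vocabulary size, and choice of cosine similarity are all placeholders you would tune for your data:

import cv2
import numpy as np

sift = cv2.SIFT_create()                  # needs OpenCV >= 4.4 (or a contrib build)
bow_trainer = cv2.BOWKMeansTrainer(200)   # 200 visual words, arbitrary choice

database = ['img1.jpg', 'img2.jpg']       # placeholder image paths
for path in database:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        bow_trainer.add(desc)

vocabulary = bow_trainer.cluster()        # k-means over all collected descriptors

extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
extractor.setVocabulary(vocabulary)

def bow_histogram(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return extractor.compute(img, sift.detect(img, None))  # visual-word histogram

# similarity score between the new picture and a database image (cosine similarity)
h1 = bow_histogram('new_picture.jpg').ravel()
h2 = bow_histogram('img1.jpg').ravel()
print(float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2))))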
Content-based image retrieval systems are still a field of active research: http://citeseerx.ist.psu.edu/search?q=content-based+image+retrieval
First you have to be clear about what constitutes "similar" in your context:
Similar color distribution: Use something like color descriptors for subdivisions of the image; you should get some fairly satisfying results (see the sketch after this list).
Similar objects: Since the computer does not know what an object is, you will not get very far unless you have some extensive domain knowledge about the objects (or a few object classes). A good overview of the current state of research can be seen here (results) and soon here.
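For the color-distribution case, a sketch of per-region HSV histograms compared with OpenCV might look like the following; the grid size, bin counts, and file names are arbitrary placeholders.

import cv2
import numpy as np

def color_descriptor(path, grid=4):
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2HSV)
    h, w = img.shape[:2]
    hists = []
    for gy in range(grid):
        for gx in range(grid):
            cell = img[gy*h//grid:(gy+1)*h//grid, gx*w//grid:(gx+1)*w//grid]
            # 2D histogram over hue and saturation for this cell
            hist = cv2.calcHist([cell], [0, 1], None, [18, 8], [0, 180, 0, 256])
            hists.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(hists).astype(np.float32)

d1 = color_descriptor('database_image.jpg')
d2 = color_descriptor('new_picture.jpg')
# correlation: 1.0 means identical color distribution, lower means less similar
print(cv2.compareHist(d1, d2, cv2.HISTCMP_CORREL))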
There is no "serve all needs"-algorithm for the problem you described. The more you can share about the specifics of your problem, the better answers you might get. Posting some representative images (if possible) and describing the desired outcome is also very helpful.
This would be a good question for computer-vision.stackexchange.com, if it already existed.
You can use the pHash algorithm, store the hash values in your database, and then use this code:
double const mismatch = algo->compare(image1Hash, image2Hash);
Here the 'mismatch' value tells you how similar the two images are (for PHash it is the Hamming distance between the hashes, so lower means more similar).
Available hash functions (these are the algorithms in OpenCV's img_hash module):
AverageHash
PHash
MarrHildrethHash
RadialVarianceHash
BlockMeanHash
ColorMomentHash
These functions are good enough to evaluate image similarity in most respects.
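If you are working from Python rather than C++, a rough equivalent using the img_hash module from opencv-contrib-python might look like this (the image paths are placeholders):

import cv2

hasher = cv2.img_hash.PHash_create()

db_img = cv2.imread('database_image.jpg')
new_img = cv2.imread('new_picture.jpg')

db_hash = hasher.compute(db_img)    # small byte array, store this in your database
new_hash = hasher.compute(new_img)

# for PHash, compare() returns the Hamming distance between the two hashes:
# 0 means identical, larger values mean less similar
mismatch = hasher.compare(db_hash, new_hash)
print(mismatch)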
I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
Those questions ask how the algorithm actually works. My question is more like: let's assume Google did not exist, or maybe this feature did not exist, and we don't have user input. How would one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
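A toy sketch of that decomposition idea in Python, using a made-up frequency table as a stand-in for a real dictionary:

import math
from functools import lru_cache

WORD_FREQ = {'try': 500, 'to': 9000, 'reconnect': 40, 'you': 8000}  # placeholder counts

def word_score(word):
    # log-frequency for dictionary words, -inf for non-words
    return math.log(WORD_FREQ[word]) if word in WORD_FREQ else float('-inf')

@lru_cache(maxsize=None)
def segment(text):
    # best (score, words) split of text into dictionary words, or (score, None) if impossible
    if not text:
        return 0.0, ()
    best_score, best_words = float('-inf'), None
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        if word_score(head) == float('-inf'):
            continue
        tail_score, tail_words = segment(tail)
        if tail_words is None:
            continue
        if word_score(head) + tail_score > best_score:
            best_score, best_words = word_score(head) + tail_score, (head,) + tail_words
    return best_score, best_words

print(segment('trytoreconnectyou')[1])  # ('try', 'to', 'reconnect', 'you')

A real implementation would also let each chunk be a close typo of a dictionary word, as described above, rather than requiring exact matches.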
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
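A heavily condensed sketch of the approach from that post, with single-edit candidates ranked by corpus frequency; big.txt is a placeholder for whatever mostly-correct corpus you assemble:

import re
from collections import Counter

WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correct('qualfy'))  # should suggest 'qualify' given a reasonable corpus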
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check against all 1-grams (all dictionary words) and find that the closest match is pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), then 3-grams, and so on. When we try a 4-gram, we find a phrase that is at distance 0 from our search term. Since we can't do better than that, we return that phrase as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html
In a few words, it is a trade-off between query modification (at the character or word level) and increased coverage in the search documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character; therefore it is obvious that you mean apple.
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
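As a sketch of that retrieval workflow, here is a version using the datasketch library's MinHash LSH over character trigrams; this is a different LSH family from Charikar's random-hyperplane scheme mentioned above, but the approximate nearest-neighbour idea is the same, and the vocabulary and threshold are placeholders:

from datasketch import MinHash, MinHashLSH

def trigram_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    padded = '  ' + text.lower() + ' '
    for i in range(len(padded) - 2):
        m.update(padded[i:i + 3].encode('utf8'))  # character trigrams as the token set
    return m

terms = ['qualify', 'quality', 'quantify', 'reconnect']  # placeholder term list
lsh = MinHashLSH(threshold=0.4, num_perm=128)
for t in terms:
    lsh.insert(t, trigram_minhash(t))

print(lsh.query(trigram_minhash('qualfy')))  # candidate corrections, e.g. ['qualify']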
Note: links removed as I'm a new user - sorry.
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English-language words they should get you very close. You can also check out the Wikipedia page on phonetic algorithms for a list of other similar algorithms.
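To get a quick feel for these phonetic codes in Python, the jellyfish library ships Soundex and the original Metaphone (not Double Metaphone); the words below are just examples, and the matching strategy in the comment is one common approach rather than the only one:

import jellyfish

for word in ['qualify', 'qualfy', 'kwalify']:
    print(word, jellyfish.soundex(word), jellyfish.metaphone(word))

# a typical matching strategy: index your dictionary by phonetic code, look up the
# code of the misspelled word, and rank that bucket by frequency or edit distance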
Take a look at this: How does the Google "Did you mean?" Algorithm work?
I'm aware of SOUNDEX and (double) Metaphone, but these don't let me test for the similarity of words as a whole - for example "Hi" sounds very similar to "Bye", but both of these methods will mark them as completely different.
Are there any libraries in Ruby, or any methods you know of, that are capable of determining the similarity between two words? (Either a boolean is/isn't similar, or numerical 40% similar)
edit: Extra bonus points if there is an easy method to 'drop in' a different dialect or language!
I think you're describing Levenshtein distance. And yes, there are gems for that. If you're into pure Ruby, go for the text gem.
$ gem install text
The docs have more details, but here's the crux of it:
Text::Levenshtein.distance('test', 'test') # => 0
Text::Levenshtein.distance('test', 'tent') # => 1
If you're ok with native extensions...
$ gem install levenshtein
Its usage is similar. Its performance is very good. (It handles ~1000 spelling corrections per minute on my systems.)
If you need to know how similar two words are, use the distance divided by the word length.
If you want a simple similarity test, consider something like this:
Untested, but straightforward:
String.module_eval do
def similar?(other, threshold=2)
distance = Text::Levenshtein.distance(self, other)
distance <= threshold
end
end
What you need is a pronunciation dictionary. The best free one is the CMU Pronouncing Dictionary.
Map the strings to their pronunciations, then do a bit of preprocessing (for example, you'll probably want to remove the numbers that cmudict uses to indicate stress), and then you can use one of the techniques others have suggested, such as Levenshtein distance, on the pronunciation strings instead of the input strings.
For an example of something similar, see dict/dict.rb in Rhyme Ninja.
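The question is about Ruby, but as an illustration of the idea, here is a Python sketch using the pronouncing package (a cmudict wrapper), with a small edit-distance helper standing in for whatever similarity metric you prefer:

import pronouncing

def phonemes(word):
    prons = pronouncing.phones_for_word(word.lower())
    # strip the stress digits cmudict appends to vowels, e.g. 'AY1' -> 'AY'
    return [p.rstrip('012') for p in prons[0].split()] if prons else []

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

print(edit_distance(phonemes('hi'), phonemes('bye')))    # small: they differ by one phoneme
print(edit_distance(phonemes('hi'), phonemes('house')))  # larger distance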
You might first preprocess the words using a thesaurus database, which will convert words with similar meanings to the same word. There are various thesaurus databases out there; unfortunately I couldn't find a decent free one for English ( http://www.gutenberg.org/etext/3202 is the one I found, but it doesn't show what relation the specific words have (like similar, opposite, alternate meaning, etc.), so all words on the same line have some relation, but you won't know what that relation is).
But for Hungarian, for example, there is a good free thesaurus database - although you don't have Soundex/Metaphone for Hungarian texts...
If you have the database, writing a program that preprocesses the texts isn't too hard (ultimately it's a simple search-and-replace, but you might want to preprocess the thesaurus database using Soundex or Metaphone too).