RDKIT: Combine/Add particles - rdkit

I have a database of macrocycles and covalent organic cages, where I wish to add a molecule/ion into the cavity. I need to do this through RDKIT. Is there an easy method to accomplish this task?
For example:
from rdkit.Chem import AllChem
guest = [x_value, y_value, z_value]
cage = AllChem.MolFromMolFile('cage_file.mol')
cage_guest = cage + guest  # pseudocode: something along these lines
I am then hoping to be able to manipulate the cage_guest in the usual fashion.

I do not think this is possible natively in rdkit. You should take a look at stk, which uses rdkit for building organic cages. Here seems to be the system you require.
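That said, RDKit does ship Chem.CombineMols, which merges two molecules (and their coordinates, when both carry a conformer with the same id) into one disconnected molecule. Below is a minimal sketch of that route, not a tested recipe for real cages: the cyclododecane host is only a stand-in for the cage file, the sodium guest is an arbitrary choice, and placing the guest at the host centroid is an assumption about where the cavity sits.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms

# Stand-in host. With real data this would be something like
# Chem.MolFromMolFile('cage_file.mol', removeHs=False).
cage = Chem.AddHs(Chem.MolFromSmiles("C1CCCCCCCCCCC1"))
AllChem.EmbedMolecule(cage, randomSeed=42)  # gives the host 3D coordinates

# Assumption: the cavity centre coincides with the host centroid.
centre = rdMolTransforms.ComputeCentroid(cage.GetConformer())

# Build a one-atom guest (Na+, atomic number 11) with its own conformer.
guest = Chem.RWMol()
ion = Chem.Atom(11)
ion.SetFormalCharge(1)
guest.AddAtom(ion)
guest_conf = Chem.Conformer(1)
guest_conf.SetAtomPosition(0, centre)
guest.AddConformer(guest_conf)

# Merge host and guest into a single molecule with no bond between them.
cage_guest = Chem.CombineMols(cage, guest)
```

The result is an ordinary Mol, so it can be written out with Chem.MolToMolBlock or manipulated like any other molecule. Whether the guest then sits sensibly inside a real cage would still need checking.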

Related

fine tuning word2vec on a specific article, using transfer learning

I am trying to fine-tune an existing model on a specific article. I have tried transfer learning using gensim's build_vocab, adding GloVe vectors (converted to word2vec format) to a base model I trained on the article. But build_vocab does not change the base model: it stays very small and no words are added to its vocabulary.
this is the code:
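# imports (Gensim 3.x API, matching the calls below)
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors, Word2Vec
# load glove model
glove_file = datapath("/content/glove.6B.200d.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)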
(in here - len(glove_vectors.wv.vocab) = 40000)
#create good article basic model
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab([tokenizer.tokenize(data.text[0])])
total_examples = base_model.corpus_count
(in here - len(base_model.wv.vocab) = 24)
# add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
(in here- still - len(base_model_good_wv.vocab) = 24)
#training
base_model.train([tokenizer.tokenize(good_trump.text[0])], total_examples=total_examples, epochs=base_model.epochs+5)
base_model_wv = base_model.wv
I think that the
"base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)"
does nothing, so there is no transfer learning.
Any recommendations?
I relied on this article for the guideline...
Many articles at the 'Towards Data Science' site are very confused, to the point of misleading more than helping. Unfortunately, the article you've linked is a good example:
The author first uses an unsupported value (workers=-1) that manages to make his local-corpus training do nothing, and rather than discovering & fixing that error, incorrectly concludes he needs to use 'transfer learning'/'fine-tuning' instead. (He doesn't.)
He then tries to improvise a re-use of the GloVe vectors, but as you've noted, his build_vocab() only manages to add the word-tokens to the model's vocabulary. This operation does not copy over any of the actual vectors!
Then, by doing training in a model where the default workers=3 was still in effect, he finally does real training on just his own texts, with no contribution from GloVe values at all. He attributes the improvement to GloVe, but really multiple mistakes have just cancelled each other out.
I would avoid relying on a 'Towards Data Science' source if any other docs or tutorials are available.
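One likely reason the vocabulary in the question stays at 24 words: as far as I can tell, build_vocab(..., update=True) still applies min_count, and passing the whole GloVe word list as a single "sentence" gives every word a count of exactly 1, which is below the min_count=5 set on the base model. The sketch below mimics that frequency rule in plain Python. It does not call Gensim, and the helper name and word list are made up for illustration.

```python
from collections import Counter

def surviving_vocab(sentences, min_count):
    """Return the tokens whose total count reaches min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Hypothetical stand-in for list(glove_vectors.vocab.keys())
pretrained_words = ["king", "queen", "apple", "river"]
as_one_sentence = [pretrained_words]  # each word occurs exactly once

kept_strict = surviving_vocab(as_one_sentence, min_count=5)  # nothing survives
kept_loose = surviving_vocab(as_one_sentence, min_count=1)   # every word survives
```

Even with min_count=1 the new words would only get fresh random vectors, not the GloVe ones, so this explains the unchanged vocabulary size but does not rescue the approach.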
Further, many who think they want to do re-use of someone else's pretrained vectors, with a small update from their own texts, should really just improve their own training corpus, so that they have one unified, evenly-trained model that covers all their needed words.
There's no explicit support for 'fine-tuning' in Gensim. Bold advanced users can try to cobble it together from other methods, and tampering with the model between usual steps, but I've never seen a well-characterized & evaluated process for doing so. (Lots of the people fumbling through the process aren't even doing a good check of end-quality versus other approaches, just noting some improvement on a few ad hoc, perhaps unrepresentative tests.)
Are you sure you need to do this? What was wrong with vectors taught on just your corpus? Might extending your corpus with extra texts to expand its vocabulary work as well or better?
Or, you could try translating the new domain words from your limited corpus & model into the same coordinate space as some older larger set of pretrained vectors that you like. There's an example of that process in a Gensim demo notebook using its utility TranslationMatrix class: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
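The idea behind that kind of translation can be sketched with plain NumPy: take words present in both models as anchors, fit a linear map between the two spaces by least squares, then push the new words through the map. This is only an illustration of the principle with synthetic vectors. It is not the TranslationMatrix API, and every name and dimension below is an assumption.

```python
import numpy as np

def fit_translation(source_vecs, target_vecs):
    """Least-squares matrix W such that source_vecs @ W ~= target_vecs."""
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

rng = np.random.default_rng(0)

# Synthetic data: 20 anchor words known to both models.
true_map = rng.normal(size=(4, 6))         # hidden relation between the spaces
anchors_small = rng.normal(size=(20, 4))   # anchors in the small local model
anchors_big = anchors_small @ true_map     # same anchors in the pretrained space

W = fit_translation(anchors_small, anchors_big)

# Project a word that exists only in the small model into the big space.
new_word_small = rng.normal(size=4)
new_word_in_big_space = new_word_small @ W
```

With real embeddings the fit will not be exact, so the quality of the projected vectors should be checked on held-out anchor words before relying on them.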

Dutch pre-trained model not working in gensim

When trying to load the fasttext model (cc.nl.300.bin) in gensim I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz
model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code goes wrong when building the vocab with my own dataset. The format of that dataset is all right, as I already used it to build and train other (not pre-trained) Word2Vec and FastText models.
I saw others had the same error in this issue, however their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model'? However I was not able to import load_facebook_model at all? Is this even a good way to solve this problem?
Any other suggestions?
Are you sure you're using the latest version of Gensim, 4.0.1, with many improvements to the FastText implementation?
And, there you will definitely want to use .load_facebook_model() to load a full .bin Facebook-format model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
But also note: the post-training expansion of the vocabulary is best considered an advanced & experimental function. It may not offer any improvement on typical tasks. Indeed, without careful consideration of tradeoffs & balancing the influence of later training against earlier, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.
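The loading step and the subword fallback mentioned above can be sketched as follows. This assumes a recent Gensim where load_facebook_model lives in gensim.models.fasttext. The helper name is made up, and nothing here has been run against the actual cc.nl.300.bin file.

```python
def load_and_probe(bin_path, word):
    """Load a Facebook-format .bin FastText model and return the vector for word.

    For an out-of-vocabulary word the vector is synthesized from subword
    n-grams rather than looked up directly.
    """
    # Imported lazily so that defining this helper does not require gensim.
    from gensim.models.fasttext import load_facebook_model

    model = load_facebook_model(bin_path)
    return model.wv[word]

# Example call (needs the multi-gigabyte model file on disk):
#   vector = load_and_probe("cc.nl.300.bin", "fietsenstalling")
```

If the guess vectors obtained this way are good enough for the task, the build_vocab(..., update=True) step may not be needed at all.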

Use Google's libphonenumber with BaseX

I am using BaseX 9.2 to scrape an online phone directory. Nothing illegal, it belongs to a non-profit that my boss is a member of, so I have access to it. What I want is to add all those numbers to my personal phonebook so that I can know who is calling me (mainly to contact my boss). The data is in pretty bad shape, especially the numbers (about a thousand numbers, from all over the world). Some are in E164, some are not, some are downright invalid numbers.
I initially used OpenRefine 3.0 to clean up the data. It also plays very nicely with Google's libphonenumber to whip the numbers into shape. It was as simple as downloading the JAR from Maven, putting it in OpenRefine's lib directory and then invoking Jython like this on each phone number (numberStr):
from com.google.i18n.phonenumbers import PhoneNumberUtil
from com.google.i18n.phonenumbers.PhoneNumberUtil import PhoneNumberFormat
pu = PhoneNumberUtil.getInstance()
numberStr = str(int(value))
number = pu.parse('+' + numberStr, 'ZZ')
try: country = pu.getRegionCodeForNumber(number)
except: country = 'US'
number = pu.parse(numberStr, (country if pu.isValidNumberForRegion(number, country) else 'US'))
return pu.format(number, PhoneNumberFormat.E164)
I discovered XPath and BaseX recently and find them to be very succinct and powerful with HTML. While I could get OpenRefine to directly spit out a VCF, I can't find a way to plug libphonenumber into BaseX. Since both are in Java, I thought it would be straightforward.
I tried their documentation (http://docs.basex.org/wiki/Java_Bindings), but BaseX does not discover the libphonenumber JAR out-of-the-box. I tried various path, renaming and location combinations. The only way I see is to write a wrapper and make it into an XQuery module (XAR) and import it. This will need significant time and Java coding skills and I definitely don't have the latter.
Is there a simple way to hook up libphonenumber with BaseX? Or in general, is there a way to link external Java libs with XPath? I could go back to OpenRefine, but it has a very clumsy workflow IMHO. No way to ask the website admin to clean up his act, either. Or, if OpenRefine and BaseX are not the right tools for the job, any other way to clean up data, especially phone numbers? I need to do this every few months (for changes and updates on the site) and it's getting really tedious if I can't automate it fully.
Would want at least a basic working code sample for an answer .. (I directly work off the standalone BaseX JAR on a Windows 10 x64 machine)
Place libphonenumber-8.10.16.jar in the folder ..basex/lib/custom to get it on the classpath (see http://docs.basex.org/wiki/Startup#Full_Distributions) and run bin/basexgui.bat
declare namespace Pnu="java:com.google.i18n.phonenumbers.PhoneNumberUtil";
declare namespace Pn="java:com.google.i18n.phonenumbers.Phonenumber$PhoneNumber";
let $pnu:=Pnu:getInstance()
let $pn:= Pnu:parse($pnu,"044 668 18 00","CH")
return Pn:getCountryCode($pn)
Returns the string "41"
There is no standard way to call Java from XPath; however, many Java-based XPath implementations provide custom methods to do this.

Neo4j visualisation-manipulate the graph

I am currently using the Neo4j Python REST client and I would like to visualise the graph and be able to amend it, add new nodes, relationships etc. I would also like the changes to be saved in the Neo4j database. Is that possible? Also, can self-loops be visualised? I have read about D3.js, Neoclipse and Gephi in http://www.neo4j.org/develop/visualize but I am not sure which one to use.
Thanks in advance.
You can manipulate the graph in Neo4j using Cypher, in particular through the REST API.
Any tool that lets you interface with Cypher can potentially do what you are asking: it is a matter of combining some Cypher queries with the GUI.
That said, creating the right visualization for what you are doing might be tricky, and a general approach might not satisfy your needs: while Neoclipse lets you manipulate nodes and links in Neo4j (for free), you might want to do it in a particular way (for example, restricting which edits are allowed or which fields/properties can be changed).
Linkurious offers a solution for that as well, but under a commercial license.
Other solutions such as KeyLines, d3.js and sigma.js let you personalize that experience: note that they require you to build the interface yourself, but the result will be much better for a specific product, IMHO.
So weigh your time and requirements and go with the solution that fits.
For more tools have a look at the Neo4j visualization page: http://www.neo4j.org/develop/visualize
About self loops:
That's a tricky bit and there is no single right way to do those: imagine a scenario with hundreds of multi-self-loops.
Personally I would recommend NOT drawing them on the chart as links/edges, and representing them in some other way instead, e.g. glyphs, notes or bubbles on the node.
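A small sketch of that idea, independent of any drawing library: split the edge list into the edges you actually draw and a per-node self-loop count that can drive a glyph or badge. The function name and the sample edges are made up for illustration.

```python
from collections import Counter

def split_self_loops(edges):
    """Separate drawable edges from self-loops, counting the loops per node."""
    drawable = [(source, target) for source, target in edges if source != target]
    loop_counts = Counter(source for source, target in edges if source == target)
    return drawable, loop_counts

edges = [("a", "b"), ("a", "a"), ("b", "c"), ("a", "a"), ("c", "c")]
drawable, loop_counts = split_self_loops(edges)
# drawable holds only the node-to-node links; loop_counts maps each node
# to the number of self-loops to show as a glyph on that node.
```

The drawing layer can then render loop_counts["a"] as a small number on node "a" instead of stacking several loops on top of each other.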
I believe the only tool that allows this today is Neoclipse, but I don't think it's updated to use the Labels and Indexing features released in 2.0.
As such, your best bet will be using the Neo4j Browser to visualize and Cypher to mutate your graph. If you want richer functionality and want a fun project to hack on, it shouldn't be super hard to build a basic visualization for Neo that allows mutating the graph. I would have a look at sigma.js: http://linkurio.us/sigma-js-1-0-next-gen-graph-drawing-lib-web/
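Mutating the graph with Cypher over REST, as both answers suggest, could look roughly like the sketch below. It only builds the HTTP request and sends nothing. The URL is assumed to be the transactional Cypher endpoint of a local Neo4j 2.x server, the {node_id} parameter syntax is the one used by that generation of Cypher (newer versions use $node_id and a different endpoint path), and the relationship type and node id are hypothetical.

```python
import json
import urllib.request

# Assumption: transactional Cypher endpoint of a local Neo4j 2.x server.
NEO4J_TX_URL = "http://localhost:7474/db/data/transaction/commit"

def cypher_request(statement, parameters=None, url=NEO4J_TX_URL):
    """Build (but do not send) a POST request carrying one Cypher statement."""
    payload = {
        "statements": [
            {"statement": statement, "parameters": parameters or {}}
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )

# Hypothetical mutation: add a self-loop on the node with internal id 42.
add_self_loop = cypher_request(
    "MATCH (n) WHERE id(n) = {node_id} CREATE (n)-[:REFERS_TO]->(n)",
    {"node_id": 42},
)

# To run it against a live server:
#   with urllib.request.urlopen(add_self_loop) as response:
#       print(json.load(response))
```

A visualization front end would call something like this whenever the user edits the picture, then re-query to refresh the view.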

Pulling stats out of a text

I'd like to know what the most recurrent words are in a given text or group of texts (pulled from a database) in Ruby.
Does anyone know what are the best practices?
You might start with statistical natural language processing. Also, you may be able to leverage one or more of the libraries mentioned on the AI Ruby Plugins page.
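Before reaching for a full NLP library, a plain word count often goes a long way. The sketch below is in Python rather than Ruby, but the logic (tokenize, drop stopwords, count, take the top n) carries over directly. The tokenizer regex, the stopword set and the sample texts are simplifying assumptions.

```python
import re
from collections import Counter

def most_common_words(texts, n=3, stopwords=frozenset()):
    """Count lowercase word tokens across texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(n)

docs = ["The cat sat on the mat.", "The dog chased the cat."]
top = most_common_words(docs, n=2, stopwords={"the", "on"})
# The first entry of top is the most frequent non-stopword with its count.
```

For real text you would probably want a proper stopword list and some stemming, which is where the statistical NLP material comes in.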
