Dependency conversion: difference between versions of Stanford CoreNLP - stanford-nlp

I have used stanford-corenlp-full-2015-01-29 to convert basic dependencies generated by the MALT parser to CCprocessed dependencies.
In my experiments, I want to compare MALT and the Stanford Parser, so I am parsing the same texts with stanford-corenlp-full-2015-04-20, using the neural network model.
My question is: are there significant differences between 2015-04-20 and 2015-01-29, as far as the non-Universal (Stanford) Dependencies are concerned? If so, to avoid skewing my comparison, I'd need to either parse with the older version or redo the conversion with the newer one.
Thanks!

I think there were no Stanford Dependencies differences between 2015-01-29 and 2015-04-20, but it would be easy to run both conversions and diff the output. Did you train your own MALT parser? I think the one distributed on the MALT site was trained with an earlier version of our dependencies.

Related

Fine-tuning pre-trained Word2Vec model with Gensim 4.0

With Gensim < 4.0, we can retrain a word2vec model using the following code:
model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)
However, what I understand is that Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the KeyedVectors.
How to fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?
I don't think that code would ever have worked in Gensim versions before 4.0. A plain set of word vectors, like GoogleNews-vectors-negative300.bin, does not have (and never has had) enough info to continue training.
It's missing the hidden-to-output layer weights & word-frequency info essential for training.
Looking at past source code, as of release 1.0.0 (February 2017), that code would already have given a deprecation error with a pointer to the method for loading a plain set of word vectors - to address people with the mistaken notion that this could work - and raised other errors on any attempt to train() such a model. (Pre-1.0.0, the docs also warned that this would not work, and it would have failed with a less helpful error.)
As one of those errors mentioned, there has at times been experimental support for loading some of a prior set-of-word-vectors to clobber any words in an existing model's already-initialized vocabulary, via .intersect_word2vec_format(). But by default that both (1) locks the imported vectors against further change; (2) brings in no new words. That's unlike what people most often want from "fine-tuning", so it's not a ready-made help for that goal.
I believe some people have cobbled together custom code to achieve various kinds of fine-tuning in their projects – but I don't know of anyone who's published a reliable recipe or strong results. (And I suspect some of the people who think they're doing this well just haven't rigorously evaluated the steps they are taking.)
If you have any recipe you know worked pre-Gensim-4.0.0, it should be adaptable: the 4.0 changes to the Word2Vec-related classes were mainly refactorings, optimizations, & new options, with little to no removal of functionality. But a reliable description of what used to work, or of which particular fine-tuning strategy is being pursued and for what specific benefits, would be needed to make more specific recommendations.
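For reference, here is a minimal sketch of what is still straightforward in Gensim 4.x: loading the GoogleNews file as read-only KeyedVectors. (The file path is the one from your snippet; the point is that the resulting object supports lookups and similarity queries but has no train() method, for the reasons above.)

from gensim.models import KeyedVectors

# Load the plain word-vectors file as KeyedVectors (Gensim 4.x API).
# This object carries no hidden-layer weights or word-frequency counts,
# so there is nothing to continue training.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(kv.most_similar("king", topn=3))  # similarity queries still work
print(hasattr(kv, "train"))             # False: KeyedVectors has no training API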

Google protocol buffers - Binary compatibility between protoc-c and protoc

I have C code generated from proto2 definitions by the protoc-c compiler. What I would like to know is whether that code is binary compatible with serialization/deserialization code generated by the protoc compiler (which also understands version 3 of protobuf). For some reason I have not been able to get a definitive answer to this question. I am wondering because there are already backwards compatibility issues between version 3 and version 2, so I am a little uncertain about how the protoc-c and protoc toolkits handle versions.
Thanks!
Yes, these two implementations should be compatible: you can serialize messages with one implementation and successfully parse them with another. I have not personally tried protobuf-c, but based on its description it is just another implementation of the same protocol buffer wire format.
You mentioned differences between syntax = "proto2" and syntax = "proto3". It is true that these are different and you would have to be careful if you want to migrate from one to the other, but I think this issue is orthogonal to your question about compatibility between protobuf-c and Google's protobuf implementation.
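If it helps to see the round trip concretely, here is a hedged Python sketch of the same idea (the message type Sensor, its fields, and the generated module sensor_pb2 are hypothetical stand-ins for your own proto2 definitions): the bytes produced by one conforming implementation's serializer are plain wire-format bytes that any other conforming implementation, protobuf-c included, should be able to parse.

# Hypothetical: sensor.proto uses syntax = "proto2" and was compiled with
#   protoc --python_out=. sensor.proto
import sensor_pb2

# Build and serialize a message; the result is implementation-neutral
# wire-format bytes.
msg = sensor_pb2.Sensor()
msg.id = 42
msg.name = "thermo-1"
wire_bytes = msg.SerializeToString()

# Any conforming implementation (protoc-c generated C, protoc-generated C++,
# this Python binding, ...) can parse the same bytes back.
parsed = sensor_pb2.Sensor()
parsed.ParseFromString(wire_bytes)
assert parsed.id == 42 and parsed.name == "thermo-1"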

CoreNLP: How can I get only collapsed dependencies?

I'm parsing over 60,000 sentences with CoreNLP to get dependencies relations.
Because I only need collapsed dependencies, the other dependency types -- basic and collapsed-cc-processed -- are redundant for my use, and they complicate my own code, which takes the XML output as input.
Can I get only collapsed dependencies?
If so, please let me know.
Thanks.
There is currently no way to do this. Computing the additional representations takes very little computation, so they are always reported. They should be marked specially in the XML output, however; hopefully it's not hard to select the correct representation in your downstream code.
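In case it is useful, here is a minimal downstream-filtering sketch in Python, assuming the usual CoreNLP XML layout in which each <sentence> contains several <dependencies> elements distinguished by their type attribute (check the exact type string, e.g. "collapsed-dependencies", against your own output):

import xml.etree.ElementTree as ET

# Keep only the collapsed dependencies from a CoreNLP XML output file.
# Assumption: each <dependencies> element carries a type attribute such as
# "basic-dependencies", "collapsed-dependencies", or
# "collapsed-ccprocessed-dependencies".
tree = ET.parse("corenlp_output.xml")
for sentence in tree.iter("sentence"):
    for deps in sentence.findall("dependencies"):
        if deps.get("type") != "collapsed-dependencies":
            continue
        for dep in deps.findall("dep"):
            relation = dep.get("type")
            governor = dep.find("governor").text
            dependent = dep.find("dependent").text
            print(relation, governor, dependent)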

Get TypedDependencies using StanfordParser Shift Reduce Parser

I am trying to use the Stanford Shift Reduce Parser with the supplied Spanish model. I am noticing, however, that unlike the Lexicalized Parser, I cannot get the TypedDependencies, despite passing the appropriate flag, -outputFormat typedDependencies, as can be seen in lexparser.bat/sh.
Just in case, this is the Java code I'm using to pass the flags and create the parser.
ShiftReduceParser model = ShiftReduceParser.loadModel(modelPath);
model.setOptionFlags("-factored", "-outputFormat", "penn,typedDependencies");
ArrayList<TaggedWord> taggedWords = new ArrayList<TaggedWord>();
Thank you
The problem here is not the ShiftReduceParser, but simply that we don't currently support typed dependencies for Spanish - we only have them for English and Chinese.
(Looking ahead, the most likely thing to appear first is support for Universal Dependencies in the Neural Network Dependency Parser. Indeed, you could probably train such a model yourself now.)

Converting stylesheet from XSLT 1.0 to 2.0

I have an XSLT 1.0 stylesheet which needs to be converted to XSLT 2.0.
I found this question here: Convert XSLT 1.0 to 2.0 which deals with the same issue.
According to that, changing the version attribute to 2.0 would do the trick. But is that the only thing which needs to be done?
Thanks in advance
I think the choice of strategy for conversion depends on how good a set of regression tests you have.
If you have a good set of regression tests, or if the consequences of introducing an error are not severe, then I would recommend the following steps:
(a) change the version attribute to 2.0
(b) run your test cases using an XSLT 2.0 processor and see if they work
(c) examine any test discrepancies and identify their cause (perhaps 80% of the time it will work correctly first time with no discrepancies).
If you don't have good tests or if you can't afford to take any risks, then you might need a more cautious strategy. (The ultimate in caution, of course, is the "don't change anything" strategy - stick with 1.0). Perhaps the best advice in this case is to start the conversion project by writing more test cases. At the very least, collect together a sample of the source documents you are currently processing, and the output that is generated for these source documents, and then use a file comparison tool to compare the output you get after conversion.
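For that last comparison step, even a tiny script will do; here is a hedged sketch using Python's standard difflib (the file names are placeholders for the output of your current 1.0 pipeline and of the converted 2.0 stylesheet on the same source document):

import difflib

# Compare the output produced by the original 1.0 stylesheet with the output
# of the converted 2.0 stylesheet for the same source document.
with open("output_xslt10.xml") as old, open("output_xslt20.xml") as new:
    old_lines, new_lines = old.readlines(), new.readlines()

diff = difflib.unified_diff(
    old_lines, new_lines,
    fromfile="output_xslt10.xml",
    tofile="output_xslt20.xml",
)
for line in diff:
    print(line, end="")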
There are a few incompatibilities between 1.0 and 2.0; the one you are most likely to encounter is that xsl:value-of (and many other constructs) in 1.0 ignore all nodes in the supplied input sequence after the first, whereas XSLT 2.0 outputs all the nodes in the supplied sequence. There are two ways of dealing with this problem. Either (my recommendation) identify the places where this problem occurs, and fix them, usually by changing select="X" to select="X[1]"; or change the version attribute on the xsl:stylesheet back to version="1.0", which causes the XSLT 2.0 processor to run in backwards compatibility mode. The disadvantage of relying on backwards compatibility mode is that you lose the benefits of the stronger type-checking in XSLT 2.0, which makes complex stylesheet code much easier to debug.
In my experience the problems you encounter in conversion are more likely to depend on processor/implementation changes than on W3C language changes. Your code might be using vendor-defined extension functions that aren't supported in the 2.0 processor, or it might be relying on implementation-defined behaviour such as collating sequences that varies from one processor to another. I have seen code, for example, that relied on the specific format of the output produced by generate-id(), which is completely implementation-dependent.
"XSL Transformations (XSLT) Version 2.0", §J, "Changes from XSLT 1.0 (Non-Normative)" lists most the differences between XSLT 1.0 and XSLT 2.0 that you need to be aware of.
