Stanford dependency parser training data format

I would like to add a new language to the Stanford Dependency Parser, but cannot for the life of me figure out how.
What format should the training data be in?
How do I generate new language files?

The neural net dependency parser takes in CoNLL-X format data.
There is a description of the format in this paper:
https://ilk.uvt.nl/~emarsi/download/pubs/14964.pdf
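For reference, CoNLL-X data is one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), an underscore for empty fields, and a blank line between sentences. A minimal illustration (the POS tags and relation labels below follow English conventions; your treebank's label sets may differ):

1	John	_	NNP	NNP	_	2	nsubj	_	_
2	saw	_	VBD	VBD	_	0	root	_	_
3	Mary	_	NNP	NNP	_	2	dobj	_	_
4	.	_	.	.	_	2	punct	_	_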

Related

Does the constituency parse annotator include the full depparse?

From the constituency parse documentation it seems you can also get a dependency parse from the "parse" annotator. (Kind of like a bonus!) Is the dependency parse annotation produced by the constituency "parse" annotator the same output as the annotation produced by the "depparse" annotator?
In other words, if you run the constituency parse annotator, is it redundant to also run the "depparse" step?
I already use the dependency parser and want to start using the constituency parser as well. I don't want to double up on the parsers if I don't have to.
Thanks!
If you run the constituency parser, a rule-based process creates a dependency parse structure from the constituency parse, so yes, you automatically get a dependency parse for each sentence. If you want both types of parses, you only need to run the parse annotator.
It is important to note that this won't necessarily be the same dependency parse that the neural model would generate. In case 1 you create a statistical constituency parse and then convert it to a dependency parse with rules; in case 2 a neural model directly generates only a dependency parse. Quite regularly these parses are not identical.
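To illustrate, here is a minimal sketch of pulling both parses out of a pipeline that runs only the parse annotator (the class and annotation names are from the standard CoreNLP API; the example sentence is arbitrary):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class BothParses {
  public static void main(String[] args) {
    Properties props = new Properties();
    // No depparse here: the parse annotator also fills in the dependency annotations.
    props.setProperty("annotators", "tokenize,ssplit,pos,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("The quick brown fox jumped over the lazy dog.");
    pipeline.annotate(doc);
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      // Constituency tree produced by the parse annotator.
      Tree constituency = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
      // Dependencies derived from that tree by the rule-based conversion.
      SemanticGraph dependencies =
          sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
      System.out.println(constituency);
      System.out.println(dependencies);
    }
  }
}

Running depparse instead would populate the same dependency annotations from the neural model, which makes comparing the two outputs straightforward.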

Sentence segmentation with annotated corpus

I have a custom annotated corpus, in OpenNLP format. Ex:
<START:Person> John <END> went to <START:Location> London <END>. He visited <START:Organisation> ACME Co <END> in the afternoon.
What I need is to segment sentences from this corpus. But it won't always work as expected due to the annotations.
How can I do it without losing the entity annotations?
I am using OpenNLP.
If you want to create multiple NLP models for OpenNLP, each component needs its own training format:
The tokenizer requires one training format.
The sentence detector requires another.
The name finder requires yet another.
Therefore, you need to manage these different annotation layers in some way.
I created an annotation tool and a Maven plugin which help you do this; have a look here. All the information can be stored in a single file, and the Maven plugin will generate the NLP models for you.
Let me know if you have any further questions.
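As a stopgap, if you only need the segmentation itself, one approach is to strip the name-finder markup before running the sentence detector and then map the sentence offsets back onto the annotated text. A minimal sketch, assuming a pre-trained OpenNLP sentence model in en-sent.bin (the file name, the regexes, and the class SegmentAnnotated are assumptions for illustration):

import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SegmentAnnotated {
  public static void main(String[] args) throws Exception {
    String annotated = "<START:Person> John <END> went to <START:Location> London <END>. "
        + "He visited <START:Organisation> ACME Co <END> in the afternoon.";
    // Remove the name-finder markup before sentence detection (assumed tag pattern).
    String plain = annotated.replaceAll("<START:[^>]+> ", "").replaceAll(" <END>", "");
    try (FileInputStream in = new FileInputStream("en-sent.bin")) {
      SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
      for (String sentence : detector.sentDetect(plain)) {
        System.out.println(sentence);
      }
    }
  }
}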

Training Stanford CoreNLP coreference

I would like to use the Stanford CoreNLP library to do coreference resolution in Dutch.
My question is: how do I train CoreNLP to handle Dutch coreference resolution?
We've already created a Dutch NER model based on the 'conll2002' set (https://github.com/WillemJan/Stanford_ner_bugreport/raw/master/dutch.gz), but we would also like to use the coreference module in the same way.
Look at the class edu.stanford.nlp.scoref.StatisticalCorefTrainer.
The appropriate properties file for English is in:
edu/stanford/nlp/scoref/properties/scoref-train-conll.properties
You may have to get the latest code base from GitHub:
https://github.com/stanfordnlp/CoreNLP
While we are not currently supporting training of the statistical coreference models in the toolkit, I believe the code for training them is included, and it may well work right now; I have yet to verify that it functions properly.
Please let me know if you need any more assistance. If you encounter bugs I can try to fix them; we would definitely like to get statistical coreference training operational for future releases!
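If the trainer behaves like other CoreNLP main classes, the invocation would look something like the following; treat the -props flag, the memory setting, and the Dutch properties file name as assumptions, and check the class's main method for the exact arguments:

java -Xmx16g -cp "stanford-corenlp-*.jar:*" edu.stanford.nlp.scoref.StatisticalCorefTrainer -props scoref-train-dutch.properties

Here scoref-train-dutch.properties would be a copy of the English scoref-train-conll.properties with the data paths pointed at your Dutch corpus.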

Data format for Stanford POS-tagger

I am re-training the Stanford POS-tagger on my own data. I have trained two other taggers on the same data in the following one-token-per-line format:
word1_TAG
word2_TAG
word3_TAG
word4_TAG
.
Is this format ok for the Stanford tagger, or does it need to be one-sentence-per-line?
word1_TAG word2_TAG word3_TAG word4_TAG .
Could using the first format for training and testing affect Stanford tagging results?
You should have one sentence per line (your second example).
Using the first format will certainly affect tagging results: you'll effectively build a unigram tagger, in which all tagging is done without any sentence context at all.
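If your corpus is already in the one-token-per-line format, the conversion is mechanical. A minimal sketch (JoinSentences is a hypothetical helper, not part of the Stanford tools), assuming each sentence ends at a "." token:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class JoinSentences {
  public static void main(String[] args) throws IOException {
    // Reads word_TAG lines from stdin and writes one sentence per line.
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    List<String> sentence = new ArrayList<>();
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty()) continue;
      sentence.add(line);
      // The token is everything before the last '_' separator.
      String token = line.contains("_") ? line.substring(0, line.lastIndexOf('_')) : line;
      if (token.equals(".")) {  // assumed sentence boundary
        System.out.println(String.join(" ", sentence));
        sentence.clear();
      }
    }
    if (!sentence.isEmpty()) System.out.println(String.join(" ", sentence));
  }
}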

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another, and Boden as a third. The correct marking should be Ryan Fleck as one entity and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about:
Take your data and run it through Stanford NER or some other NER.
Look at the results and find all the mistakes.
Correctly tag the incorrect results and feed them back into your NER.
Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data, and more training; a sketch of such training data follows.
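For the retraining step, Stanford NER trains on one token per line with tab-separated token and label columns and a blank line between sentences; the columns are declared in the training properties with map = word=0,answer=1. A small illustration (how the hyphenated name gets tokenized depends on your tokenizer, so the split below is an assumption):

Ryan	PERSON
Fleck	PERSON
-	O
Anna	PERSON
Boden	PERSON
pair	O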
Instead of using Stanford CoreNLP you could try Apache OpenNLP. There is an option available to train a model on your own training data. As this model depends on the names you supply, it is able to detect the names of interest to you.
