What POS tag and dependency label sets are used within Parsey McParseface?

The POS tags and dependency labels output by Parsey McParseface are given in the tag-set and label-set files here, respectively.
The SyntaxNet README notes that the model was trained on the Penn Treebank, OntoNotes, and the English Web Treebank.
Is there a detailed description of the corresponding POS tags and dependency labels used in these treebanks, similar to that given by the Universal Dependencies project?

After a bit more searching, it looks like the answer is the Stanford dependency grammar detailed here and the POS tags detailed here.


Defining new language grammar rules?

Can you help me with how to edit the .tagger file using Stanford NLP? My problem is that I can't open and edit the file to define the grammar rules for a new language in order to generate part-of-speech tags.
The .tagger files are serialized statistical models used by a maximum-entropy-based sequence tagger. You can't edit them in any meaningful way.
If you want to create part-of-speech tags for a new language, you will have to create training data consisting of a large set of sentences in that language with the correct part-of-speech tag for each word, and then train a new part-of-speech tagging model on it.
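As a rough sketch (the file names and the arch feature string below are illustrative, not canonical), training a new model with Stanford's MaxentTagger looks like this; running the class with -genprops prints a template properties file documenting the available options:

java -mx4g -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props myLanguage.props

where myLanguage.props might contain:

# illustrative values; adjust for your language and corpus
model = myLanguage.tagger
trainFile = myLanguage-train.txt
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
tagSeparator = _
encoding = UTF-8

By default, each line of the training file is one sentence of word_tag tokens separated by spaces.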

Stanford CoreNLP models for the English language

I am using Stanford CoreNLP for a task. There are two model packages, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know what the difference between these two is.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English; that is the stanford-corenlp-3.6.0-models package you mentioned.
But the CoreNLP team also provides a jar that contains all of their English models, including various variant models, and in particular one optimized for working with uncased English (e.g., text that is mostly or entirely uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models, and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)
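If you use the bigger jar, you can point individual annotators at the variant models. A minimal sketch of selecting the caseless POS model (the model path is the one documented for the extended English models jar, but verify it against the jar you actually downloaded):

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CaselessTaggingDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos");
    // caseless POS model shipped in the extended English models jar
    props.setProperty("pos.model",
        "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("PLEASE SEND THE REPORT BY FRIDAY");
    pipeline.annotate(doc);
  }
}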

How to ignore MARKUPs in Ruta output or in the JCas?

I'm executing a Ruta script dynamically from a Java Maven project. The script annotates an HTML file and the output is processed further. The problem is that the coveredText contains HTML tags in between, as below:
(a+b)<sup>2</sup> ==> is MARKed as formula
But I want it as:
(a+b)2 ==> where the superscript is captured as a separate annotation and handled later.
How do I arrive at the expected result?
In UIMA, the document text is static. If you want to change the text, you need to create a new view/CAS. In Ruta, there are three components that can create a CAS with modified document text: HtmlConverter, RutaModifier and RutaCutter. If you want to process it further in the same pipeline, you need an aggregate AE with sofa mapping (or a sofa-aware analysis engine).
There is some documentation about these analysis engines and their usage. There is also an example project with such rules, and a Stack Overflow question which discusses some possible problems. Information about sofa mapping can be found in the UIMA documentation.
(DISCLAIMER: I am a developer of UIMA Ruta)
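As a rough illustration of the aggregate-with-sofa-mapping idea using uimaFIT (the descriptor parameters and the "plaintext" view name are assumptions for this sketch; HtmlConverter's output view name is configurable):

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.CAS;
import org.apache.uima.fit.factory.AggregateBuilder;
import org.apache.uima.resource.ResourceInitializationException;

public class ModifiedTextAggregate {
  // htmlConverterDesc and downstreamDesc are placeholders for your own
  // analysis engine descriptions (e.g. built with AnalysisEngineFactory)
  public static AnalysisEngineDescription build(
      AnalysisEngineDescription htmlConverterDesc,
      AnalysisEngineDescription downstreamDesc)
      throws ResourceInitializationException {
    AggregateBuilder builder = new AggregateBuilder();
    // runs on the original HTML view and writes the converted text
    // into a second view, assumed here to be named "plaintext"
    builder.add(htmlConverterDesc);
    // the downstream engine should see the converted text as its default
    // view, so map its default sofa onto the aggregate's "plaintext" sofa
    builder.add(downstreamDesc, CAS.NAME_DEFAULT_SOFA, "plaintext");
    return builder.createAggregateDescription();
  }
}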

Any hints on how to train an NN dependency parser on a new corpus?

We'd like to train the Stanford NN dependency parser on a Russian corpus; are there any hints on how to do it? The hyper-parameters are described in the paper, but it would be nice to understand how to prepare the training data (annotations, and specifically how to create word2vec embeddings). Any help or a reference to some document is greatly appreciated!
Thanks!
Here are some answers:
the site for word2vec if you want to build vector representations for Russian:
https://code.google.com/p/word2vec/
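for instance, after compiling word2vec, a command along these lines produces vectors in the plain-text format needed below (file names are placeholders; -binary 0 keeps the output as text):
./word2vec -train russian_corpus.txt -output russian_embeddings.txt -size 50 -window 5 -cbow 1 -min-count 5 -binary 0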
the dependencies need to be in the CoNLL-X format:
http://ilk.uvt.nl/conll/#dataformat
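a CoNLL-X file has one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences; a made-up English fragment for illustration (tabs shown as spaces):
1   The     the    DET   DT   _   2   det     _   _
2   cat     cat    NOUN  NN   _   3   nsubj   _   _
3   sleeps  sleep  VERB  VBZ  _   0   root    _   _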
The word embeddings should be in this format (each word vector on its own line):
WORD\tn0 n1 n2 n3 n4 ...
for instance (only four dimensions shown for brevity; the training command below assumes 50):
apple .45242 .392323 .111423 .999334
put your embeddings in a file called russian_embeddings.txt
the training command (this assumes your word vectors have dimension 50):
java edu.stanford.nlp.parser.nndep.DependencyParser -tlp edu.stanford.nlp.trees.international.RussianTreebankLanguagePack -trainFile russian/train.conll -devFile russian/dev.conll -embedFile russian_embeddings.txt -embeddingSize 50 -model nndep.russian.model.txt.gz
A big complication is that, as of this moment, edu.stanford.nlp.trees.international.RussianTreebankLanguagePack does not exist, so you will have to create this class and model it after the TreebankLanguagePacks for other languages; if you look around in the package edu.stanford.nlp.trees.international, you can see what these TreebankLanguagePack files look like for other languages (note: the French one is only 143 lines long, so making a similar class for Russian is not out of the question at all); I will consult with other group members and see if I can get some clarity on what you'd have to do to complete this task
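purely as an illustrative sketch of the shape of such a class (the tag and punctuation inventories here are invented placeholders, and the exact set of methods to override can differ between CoreNLP versions):

package edu.stanford.nlp.trees.international;

import edu.stanford.nlp.trees.AbstractTreebankLanguagePack;
import edu.stanford.nlp.trees.HeadFinder;

public class RussianTreebankLanguagePack extends AbstractTreebankLanguagePack {

  // illustrative inventories -- replace with the tags your treebank actually uses
  @Override
  public String[] punctuationTags() { return new String[] {"PUNCT"}; }

  @Override
  public String[] punctuationWords() {
    return new String[] {",", ".", "!", "?", ":", ";", "\"", "(", ")"};
  }

  @Override
  public String[] sentenceFinalPunctuationTags() { return new String[] {"PUNCT"}; }

  @Override
  public String[] sentenceFinalPunctuationWords() { return new String[] {".", "!", "?"}; }

  @Override
  public String[] startSymbols() { return new String[] {"ROOT"}; }

  @Override
  public String treebankFileExtension() { return "conll"; }

  // head finders matter mainly for constituency operations; returning null
  // is a placeholder here, not production behavior
  @Override
  public HeadFinder headFinder() { return null; }

  @Override
  public HeadFinder typedDependencyHeadFinder() { return null; }
}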
There are a lot of challenges in building this Russian NN dependency parser model. If you would like more help, please let me know. I will talk to the developers of the NN parser and see if I can give you more advice; these answers are meant as a starting point!

NLP POS tagger can't grok imperatives?

The Stanford NLP POS tagger claims imperative verbs were added to a recent version. I've input lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this part of speech?
There is no special tag for imperatives; they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data such that the POS tagger gets more of them right, i.e. tags the verb as VB.
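For example, an imperative like the following comes out with a plain VB on the verb (output formatted by hand for illustration):
Close/VB the/DT door/NN ./.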
