From the constituency parse documentation it seems obvious you can also get a dependency parse from the "parse" annotator. (Kind of like a bonus!) Is the dependency parse annotation produced by the constituency "parse" annotator the same output as the annotation produced by the "depparse" annotator?
In other words, if you run the constituency parse annotator, is it redundant to also run the "depparse" step?
I already use the dependency parser and want to start using the constituency parser as well. I don't want to double up on the parsers if I don't have to.
Thanks!
If you run the constituency parser, a rule-based process will create a dependency parse structure from the constituency parse, so yes, you will automatically get a dependency parse for each sentence. If you want both types of parses, you only need to run the parse annotator.
It is important to note that this won't necessarily be the same dependency parse that the neural model would generate. In case 1 you create a statistical constituency parse and then convert it to a dependency parse with rules; in case 2 you use a neural model to generate only a dependency parse. Quite regularly these parses are not identical.
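To check this for yourself, here is a minimal sketch (assuming a recent CoreNLP release and its models on the classpath; the annotator lists and the example sentence are placeholders) that runs both pipelines and prints the basic dependencies each one stores, so you can diff them:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class ParseVsDepparse {
  public static void main(String[] args) {
    String text = "The engineer fixed the machine.";

    // Case 1: constituency parser; dependencies come from rule-based conversion
    Properties p1 = new Properties();
    p1.setProperty("annotators", "tokenize,ssplit,pos,parse");

    // Case 2: neural dependency parser only
    Properties p2 = new Properties();
    p2.setProperty("annotators", "tokenize,ssplit,pos,depparse");

    for (Properties props : new Properties[]{p1, p2}) {
      StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
      Annotation doc = new Annotation(text);
      pipeline.annotate(doc);
      for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
        // Both pipelines fill in the same dependency annotation key
        System.out.println(sentence.get(
            SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class));
      }
    }
  }
}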
I have a custom annotated corpus, in OpenNLP format. Ex:
<START:Person> John <END> went to <START:Location> London <END>. He visited <START:Organisation> ACME Co <END> in the afternoon.
What I need is to segment this corpus into sentences. But sentence detection won't always work as expected because of the inline annotations.
How can I do it without losing the entity annotations?
I am using OpenNLP.
If you want to create multiple NLP models for OpenNLP, you need a different training format for each:
The tokenizer requires its own training format
The sentence detector requires its own training format
The name finder requires its own training format
Therefore, you need to manage these different annotation layers in some way; example formats are sketched after this list.
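For instance, roughly (see the OpenNLP manual for the exact details), the sentence detector is trained on one untagged sentence per line, while the name finder is trained on one whitespace-tokenized sentence per line with the entity markup kept:

Sentence detector training data:

John went to London.
He visited ACME Co in the afternoon.

Name finder training data:

<START:Person> John <END> went to <START:Location> London <END> .
He visited <START:Organisation> ACME Co <END> in the afternoon .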
I created an annotation tool and a Maven plugin which help you do this; have a look here. All information can be stored in a single file, and the Maven plugin will generate the NLP models for you.
Let me know if you have any further questions.
Can anyone explain why the POS tags in the tregex response differ from the tags obtained from the constituency parse, as shown in the figure below?
In the above figure, "engineer" is tagged as NN by the constituency parse annotator, but tregex outputs NNP.
Is it because the annotator pipeline used to perform the constituency parse uses a different parse model than the pipeline used to perform tregex?
It appears different pipelines are being used.
When you run the standard annotation process, it uses the pipeline you specify, which in your example appears to include the pos annotator. Since the pos annotator's tags are being used, you are seeing NN.
When you submit a tregex request, it simply runs a pipeline with tokenize,ssplit,parse (you can see this in the code for StanfordCoreNLPServer.java, which has a specific tregex handler).
This means it is using the constituency parser's part-of-speech tagging, which can produce different results than the dedicated part-of-speech tagger. In this case the constituency parser applies the tag NNP. I should note that if you use the shift-reduce parser, it will require the part-of-speech tags to be provided by the part-of-speech tagger, whereas the lexicalized parser has the ability to create its own part-of-speech tags.
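If you want to see the difference locally, here is a minimal sketch (assuming a recent CoreNLP on the classpath; the sentence is a placeholder) that runs both annotator lists and prints the token-level tags. Without the pos annotator, the parse annotator should fill in the token tags from the lexicalized parser itself:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class TagComparison {
  public static void main(String[] args) {
    String text = "She is an engineer.";
    // "tokenize,ssplit,parse" mirrors the server's tregex handler;
    // the first list is a typical user pipeline with the pos annotator
    for (String annotators :
        new String[]{"tokenize,ssplit,pos,parse", "tokenize,ssplit,parse"}) {
      Properties props = new Properties();
      props.setProperty("annotators", annotators);
      StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
      Annotation doc = new Annotation(text);
      pipeline.annotate(doc);
      System.out.println("annotators = " + annotators);
      for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
          // with pos in the pipeline the tag comes from the tagger;
          // without it, the lexicalized parser supplies the tag
          System.out.println("  " + token.word() + "/" + token.tag());
        }
      }
    }
  }
}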
I would like to add a new language to the Stanford Dependency Parser, but cannot for the life of me figure out how.
In what format should training data be?
How do I generate new language files?
The neural net dependency parser takes in CoNLL-X format data.
There is a description of the format in this paper:
https://ilk.uvt.nl/~emarsi/download/pubs/14964.pdf
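For reference, CoNLL-X data is one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), an underscore for empty fields, and a blank line between sentences. A small illustrative sample (columns are space-aligned here for readability, but real files use tabs; the POS and dependency labels depend on the tag set and dependency scheme you train on):

1   John     _   NNP   NNP   _   2   nsubj   _   _
2   went     _   VBD   VBD   _   0   root    _   _
3   to       _   TO    TO    _   4   case    _   _
4   London   _   NNP   NNP   _   2   nmod    _   _
5   .        _   .     .     _   2   punct   _   _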
We need to add terms to the named entity extraction tables/model in Stanford CoreNLP and can't figure out how. Use case: we need to build up a set of IED terms over time and want the Stanford pipeline to extract those terms when they are found in text files.
Looking to see if this is something someone has done before.
Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml
Could you specify the tags you want to apply?
To use RegexNER, all you have to do is build a file with one entry per line of the form:
TEXT_PATTERN\tTAG
You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt.
I am assuming that by IED you mean an improvised explosive device (https://en.wikipedia.org/wiki/Improvised_explosive_device)?
So your file might look like:
VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON
(Note: there should not be blank lines between entries; it should be one entry per line!)
If you then run this command:
java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt
you will tag sample_input.txt.
Updating is merely a matter of updating custom_dictionary.txt.
One thing to be on the lookout for: it matters whether you put "ner" or "regexner" first in your list of annotators.
If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other; the two orderings are shown below.
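Concretely, relative to the command above, the only difference is the order inside the -annotators list (per the caveat above, the regexner-first variant is the one to use if your specialized terms should win):

regexner first: -annotators 'tokenize,ssplit,pos,lemma,regexner,ner'
ner first: -annotators 'tokenize,ssplit,pos,lemma,ner,regexner'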
I want to use gate-EN-twitter.model for POS tagging in the process of parsing with the Stanford parser. Is there a command-line option that does this, like -pos.model gate-EN-twitter.model? Or do I have to run the Stanford POS tagger with the GATE model first and then use its output as input for the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on forcing in your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then alert the parser to the fact that the text is pre-tagged using some command-line options.
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as gold. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass that token list on to the parser; see the sketch below.
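For the programmatic route, here is a minimal sketch (the model paths and the example tweet are placeholders; check the parser FAQ and the Javadoc for your CoreNLP version):

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.Tree;
import java.util.ArrayList;
import java.util.List;

public class PreTaggedParsing {
  public static void main(String[] args) {
    // Load the Twitter POS model (path is a placeholder)
    MaxentTagger tagger = new MaxentTagger("gate-EN-twitter.model");

    // Pre-tokenized sentence; in practice run a tokenizer first
    List<HasWord> tokens = new ArrayList<>();
    for (String w : "ikr smh he asked fir yo last name".split(" ")) {
      tokens.add(new Word(w));
    }

    // TaggedWord implements HasTag, so the parser treats these tags as gold
    List<TaggedWord> tagged = tagger.tagSentence(tokens);

    LexicalizedParser parser = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
    Tree tree = parser.parse(tagged);
    tree.pennPrint();
  }
}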