I want to use gate-EN-twitter.model for POS tagging while parsing with the Stanford Parser. Is there a command-line option that does that, like -pos.model gate-EN-twitter.model? Or do I have to tag the text first with the Stanford POS Tagger and the GATE model, then use its output as input for the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on forcing in your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then alert the parser to the fact that the text is pre-tagged using some command-line options.
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as golden. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass on that token list to the parser.
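For the command-line route, the pre-tagged file is just whitespace-separated word/tag tokens, one sentence per line. Here is a minimal Python sketch that renders such lines and assembles the matching parser invocation. The helper names (to_pretagged_line, parser_command) are made up for illustration, and the flags are the ones I recall from the FAQ entry linked above, so please check them against the FAQ before relying on them.

```python
from typing import List, Sequence, Tuple

TaggedSentence = Sequence[Tuple[str, str]]


def to_pretagged_line(sentence: TaggedSentence, sep: str = "/") -> str:
    """Render one sentence as 'word/TAG word/TAG ...' for the parser."""
    for word, tag in sentence:
        # Whitespace inside a token would break whitespace tokenization.
        if not word or not tag or any(ch.isspace() for ch in word + tag):
            raise ValueError(f"bad token: {word!r}/{tag!r}")
    return " ".join(f"{word}{sep}{tag}" for word, tag in sentence)


def parser_command(
    input_path: str,
    model: str = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
    sep: str = "/",
) -> List[str]:
    """Build the argv for parsing pre-tagged, one-sentence-per-line input.

    Assumptions: flags as I remember them from the parser FAQ; the
    classpath "*" presumes you run from the directory holding the jars.
    """
    return [
        "java", "-mx1g", "-cp", "*",
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-sentences", "newline",
        "-tokenized",
        "-tagSeparator", sep,
        "-tokenizerFactory", "edu.stanford.nlp.process.WhitespaceTokenizer",
        "-tokenizerMethod", "newCoreLabelTokenizerFactory",
        model,
        input_path,
    ]
```

The list returned by parser_command can be handed to subprocess.run once the tagged lines have been written to input_path. Words that themselves contain the separator character are not handled here.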
Let's take the following:
client = CoreNLPClient(memory='1G', threads=1, annotators=['tokenize','ssplit','pos','lemma','ner','depparse'], timeout=1000)
ann = client.annotate('Wow a nice sentence here')
sentence = ann.sentence[0]
Then I process the tags, dependencies, etc. but I also want to use TokensRegex to extract specific words. I saw the answer using requests (here) however it seems odd to have to send another request (and do the tagging again) in order to use TokensRegex. Can we just use the already annotated sentence with TokensRegex?
Edit
I see that we can use client.tokensregex('Wow a nice sentence here', <pattern>), but I guess this still has to send another request.
There is a tokensregex annotator that you can place at the end of your pipeline that will run rules.
See here: https://stanfordnlp.github.io/CoreNLP/tokensregex.html
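As a rough illustration, the annotator can be appended to the annotator list and the rules file named in the pipeline properties. This sketch only builds the configuration and does not contact a server. The tokensregex.rules property name is the one I remember from the documentation linked above, and the rules file name is a hypothetical placeholder.

```python
from typing import Dict, List


def with_tokensregex(annotators: List[str], rules_file: str) -> Dict[str, str]:
    """Return CoreNLP properties that run tokensregex after the given annotators."""
    names = [a for a in annotators if a != "tokensregex"]
    names.append("tokensregex")  # placed last so it can see pos/lemma/ner
    return {
        "annotators": ",".join(names),
        # Assumption: property name as in the TokensRegex docs; verify there.
        "tokensregex.rules": rules_file,
    }


props = with_tokensregex(
    ["tokenize", "ssplit", "pos", "lemma", "ner", "depparse"],
    "my_patterns.rules",  # hypothetical rules file
)
```

If your client version accepts a properties dict, props could be passed when constructing CoreNLPClient, so that a single annotate call runs both the tagging and the rules.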
Can anyone explain why the POS tags in the tregex response differ from the tags obtained from the constituency parse, as shown in the figure below?
In the above figure, engineer is tagged as NN by constituency parse annotator, but tregex outputs it as NNP.
Is it because the annotator pipeline used to perform constituency parse uses a different parse model compared to the pipeline used to perform tregex?
It appears different pipelines are being used.
When you run the standard annotation process it will use the pipeline you specify, which in your example appears to include the pos annotator. Since the pos annotator's tags are being used, you are seeing the NN.
When you submit a tregex request, it simply runs a pipeline with tokenize,ssplit,parse (you can see this in the code for StanfordCoreNLPServer.java, which has a specific tregex handler).
This means it is using the constituency parser's part-of-speech tagging, which produces a different result from the dedicated part-of-speech tagger. In this case the constituency parser applies the tag NNP. I should note that if you use the shift-reduce parser, it requires the part-of-speech tags to be provided by the part-of-speech tagger, whereas the lexicalized parser is able to create its own part-of-speech tags.
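To see exactly where the two pipelines disagree, one can read the preterminal tags off the bracketed parse and line them up against the tagger's output. The sketch below is plain Python over a Penn-style tree string; the function names and the sample sentence are illustrative, not taken from the question.

```python
import re
from typing import List, Sequence, Tuple

# A preterminal looks like "(TAG word)", with no nested parentheses inside.
_PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tags_from_parse(tree: str) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs read off a bracketed constituency parse."""
    return [(word, tag) for tag, word in _PRETERMINAL.findall(tree)]


def tag_disagreements(
    parser_tags: Sequence[Tuple[str, str]],
    tagger_tags: Sequence[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """List (word, parser_tag, tagger_tag) wherever the two sources differ."""
    if len(parser_tags) != len(tagger_tags):
        raise ValueError("token counts differ; tokenization does not match")
    return [
        (word, p_tag, t_tag)
        for (word, p_tag), (_, t_tag) in zip(parser_tags, tagger_tags)
        if p_tag != t_tag
    ]


tree = "(ROOT (S (NP (PRP He)) (VP (VBZ is) (NP (DT an) (NNP engineer)))))"
tagger = [("He", "PRP"), ("is", "VBZ"), ("an", "DT"), ("engineer", "NN")]
print(tag_disagreements(tags_from_parse(tree), tagger))
# prints [('engineer', 'NNP', 'NN')]
```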
Is there an option in Stanford NER toolkit to force the output to have the same line splits as the input?
I'm looking for something similar to "-sentences newline" option in Stanford parser.
If you give the option -tokenizerOptions "tokenizeNLs=true,tokenizePerLine=true" then each line will be treated as a separate sentence, which should give the results you are hoping for - it will be written as one line if you use an outputFormat like slashTags or inlineXML.
CoreNLP also provides some options for processing text line by line, such as the ssplit.eolonly property, which splits sentences only at line ends.
The Stanford POS tagger page says that imperative verbs were added in a recent version. I've fed it lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this POS?
There is no special tag for imperatives; they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data such that the POS tagger gets more of them right, i.e. tags the verb as VB.
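Since the tagger gives no dedicated label, spotting imperatives has to happen downstream. One crude heuristic, which is my own assumption and not something the tagger provides, is to flag sentences whose first content token is tagged VB. A minimal Python sketch over (word, tag) pairs, with a hypothetical function name:

```python
from typing import Iterable, Tuple

# Tags that may precede the verb in an imperative ("Please sit", "Now, go").
_SKIPPABLE_TAGS = {"UH", "RB", ",", "``", "''"}


def looks_imperative(tagged: Iterable[Tuple[str, str]]) -> bool:
    """Heuristic: the first non-skippable token is a base-form verb (VB)."""
    for _word, tag in tagged:
        if tag in _SKIPPABLE_TAGS:
            continue
        return tag == "VB"
    return False  # empty sentence, or nothing but skippable tokens
```

For example, looks_imperative([("Close", "VB"), ("the", "DT"), ("door", "NN")]) returns True, while a sentence starting with a pronoun or modal returns False. Expect both false positives and misses; it is only as good as the VB tags it relies on.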
Can I detect sentences via the command line interface of Stanford NLP like Apache OpenNLP?
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect
Based on the docs, Stanford NLP requires tokenization as per http://nlp.stanford.edu/software/corenlp.shtml
Our pipeline requires that you tokenize first; we use these tokens in the sentence-splitting algorithm. If your text is pre-tokenized, you can use DocumentPreprocessor and request whitespace-only tokenization.
Let me know if I misunderstood your question.