NLP Postagger can't grok imperatives? - stanford-nlp

Stanford NLP postagger claims imperative verbs added to recent version. I've inputted lots of text with abundant and obvious imperatives, but there seems to be no tag for them on output. Must one, after all, train it for this pos?

There is no special tag for imperatives, they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data such that the POS tagger gets more of them right, i.e. tags the verb as VB.

Related

Does Stanford Core NLP support Russian sentence and word tokenization?

I could not see any Russian pre-trained tokenizer in Sandford-NLP and stanfordCoreNLP. Are there any models for Russian yet?
Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/) which is our Python package to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence split with Stanza, and then use the Stanford CoreNLP Server (which you can also use via Stanza) if you had any CoreNLP specific components you wanted to work with.
A group a while back submitted some models for Russian, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html

Stanford NLP core 4.0.0 no longer splitting verbs and pronouns in Spanish

Very helpfully Stanford NLP core 3.9.2 used to split rolled together Spanish verbs and pronouns
This is the 4.0.0 output:
The previous version had more .tagger files. These have not been included with the 4.0.0 distribution.
Is that the cause. Will be they added back?
There are some documentation updates that still need to be made for Stanford CoreNLP 4.0.0.
A major change is that a new multi-word-token annotator has been added, that makes tokenization conform with the UD standard. So the new default Spanish pipeline should run tokenize,ssplit,mwt,pos,depparse,ner. It may not be possible to run such a pipeline from the server demo at this time, as some modifications will need to be made. I can try to send you what such modifications would be soon. We will try to make a new release in early summer to handle issues like this that we missed.
It won't split the word in your example unfortunately, but I think in many cases it will do the correct thing. The Spanish mwt model is just based off of a large dictionary of terms, and was tuned to optimize performance on the Spanish training data.

Can I choose a pos.model in Stanford parser?

I want to use gate-EN-twitter.model for pos tagging when in the process of parsing by Stanford parser. Is there an option on command line that does that? like -pos.model gate-EN-twitter.model? Or do I have to use Stanford pos tagger with gate model for tagging first then use its output as input for the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on forcing in your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then alert the parser to the fact that the text is pre-tagged using some command-line options.
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as golden. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass on that token list to the parser.

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:-
Let the sentence be The film is directed by Ryan Fleck-Anna Boden pair.
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about
take your data and run it through Stanford NER or some other NER.
look at the results and find all the mistakes
correctly tag the incorrect results and feed them back into your NER.
lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, the the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, and tag them and train your NER on that?
You should get there by adding more features, more data and training.
Instead of using stanford-coreNLP you could try Apache opeNLP. There is option available to train your model based on your training data. As this model is dependent on the names supplied by you, it able to detect names of your interest.

How can I add more tagged words to the Stanford POS-Tagger's trained models?

I haven't found anything in the documentation about adding more tagged words to the tagger, specifically the bi-directional one.
Thanks
At present, you can't. Model training is an all-at-one-time operation. (Since the tagger uses weights that take into account contexts and frequencies, it isn't trivial to add new words to it post hoc.)
There is a workaround. It is ugly but should do the trick:
build a list of "your" words
scan text for these words
if any matches found to POS tagging yourself (NLTK can help you here)
feed it to Stanford parser.
FROM: http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
"You can also give it POS tagged text; the parser will try to use
your tags if they make sense.
You might want to do this if the parser makes tagging
mistakes in your text domain."

Resources