Is token normalization implemented in Stanford NLP? - stanford-nlp

I am trying to normalize tokens (potentially merging them if needed) before running the RegexNER annotator over them.
Is there something already implemented for this in Stanford CoreNLP or in Stanford NLP in general?
If not, what's the best way to implement it? Writing a custom annotator in CoreNLP?

There are definitely some options for token normalization in the tokenizer. You pass the -options flag with a comma-separated list of the options you want.
This is described in more detail here:
http://nlp.stanford.edu/software/tokenizer.shtml
Near the bottom there is a section about Options which shows a list of possibilities.
Are there other normalizations you are interested in that are not on that list?
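If the built-in options do not cover a case (merging tokens, for instance), the normalization can be done as a separate pass between tokenization and RegexNER. Here is a minimal sketch of that idea in plain Python. It does not use the CoreNLP API; the merge rules, the function names, and the "New_York" output form are all made up for illustration.

```python
import re

# Hypothetical merge rules: token sequences that should become a single token.
MERGE_RULES = {
    ("e", "-", "mail"): "email",
    ("New", "York"): "New_York",
}


def normalize_token(token):
    """Apply a few illustrative per-token normalizations."""
    token = token.replace("\u2019", "'")  # curly apostrophe -> straight apostrophe
    if re.fullmatch(r"\d{1,3}(,\d{3})+", token):
        token = token.replace(",", "")  # 1,000 -> 1000
    return token


def merge_tokens(tokens, rules=MERGE_RULES):
    """Greedily replace the longest matching token sequence at each position."""
    longest = max((len(key) for key in rules), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        for n in range(longest, 1, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                merged.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def normalize_and_merge(tokens):
    """Normalize each token first, then merge multi-token sequences."""
    return merge_tokens([normalize_token(t) for t in tokens])
```

In CoreNLP itself, the same logic would most likely live in a custom annotator placed after the tokenizer and before RegexNER, as the question suggests.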

Related

NLP Postagger can't grok imperatives?

The Stanford NLP POS tagger claims that imperative verbs were added in a recent version. I've fed it lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this POS?
There is no special tag for imperatives; they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data such that the POS tagger gets more of them right, i.e. tags the verb as VB.

Stanford NLP: Sentence splitting without tokenization?

Can I detect sentences via the command line interface of Stanford NLP like Apache OpenNLP?
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect
Based on the docs, Stanford NLP requires tokenization as per http://nlp.stanford.edu/software/corenlp.shtml
Our pipeline requires that you tokenize first; we use these tokens in the sentence-splitting algorithm. If your text is pre-tokenized, you can use DocumentPreprocessor and request whitespace-only tokenization.
Let me know if I misunderstood your question.
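To see why the tokens matter for sentence splitting, here is a deliberately simplified sketch in plain Python. It is not the algorithm CoreNLP uses; it only assumes that the input is already tokenized with whitespace between tokens and that a sentence ends at a standalone ".", "!" or "?" token.

```python
# Tokens that end a sentence in this simplified sketch.
SENTENCE_FINAL = {".", "!", "?"}


def split_sentences(pretokenized_text):
    """Split whitespace-tokenized text into sentences (lists of tokens)."""
    sentences = []
    current = []
    for token in pretokenized_text.split():
        current.append(token)
        if token in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that have no sentence-final punctuation.
        sentences.append(current)
    return sentences
```

Because "Dr." arrives as a single token, an input such as "Dr. Smith arrived . He sat down ." is split after "arrived ." and not after the abbreviation, which is the kind of decision that is hard to make on raw characters.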

Can I choose a pos.model in Stanford parser?

I want to use gate-EN-twitter.model for POS tagging while parsing with the Stanford Parser. Is there a command-line option for that, like -pos.model gate-EN-twitter.model? Or do I have to run the Stanford POS tagger with the GATE model first and then use its output as input for the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on forcing in your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then alert the parser to the fact that the text is pre-tagged using some command-line options.
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as golden. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass on that token list to the parser.
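For the command-line route, the pre-tagged file is typically plain text with one sentence per line and each word joined to its tag by a separator character. Here is a small sketch in plain Python that writes such a line from (word, tag) pairs. The "/" separator and the function name are assumptions for illustration; the parser FAQ describes the options that tell the parser which separator to expect.

```python
def to_tagged_line(tagged_tokens, separator="/"):
    """Render one sentence of (word, tag) pairs as 'word/TAG word/TAG ...'."""
    parts = []
    for word, tag in tagged_tokens:
        if separator in word:
            # A separator inside the word would make the line ambiguous.
            raise ValueError(f"token {word!r} contains the separator {separator!r}")
        parts.append(f"{word}{separator}{tag}")
    return " ".join(parts)


def write_tagged_file(path, sentences, separator="/"):
    """Write one pre-tagged sentence per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(to_tagged_line(sentence, separator) + "\n")
```

If your text contains tokens with a slash (URLs, fractions), pick a separator that cannot occur in a token and configure the parser to match.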

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem I am facing is described below.
Take the sentence: "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about
take your data and run it through Stanford NER or some other NER.
look at the results and find all the mistakes
correctly tag the incorrect results and feed them back into your NER.
lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data and training.
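Here is a rough sketch of how such synthetic training data could be generated, in plain Python. It makes two assumptions that you should check against your own setup: that the trainer reads one token and its label per line, separated by a tab, with a blank line between sentences, and that your tokenizer splits the hyphen off as its own token. If the tokenizer keeps "Fleck-Anna" as a single token, no labeling can separate the two names. The name lists and the sentence template are made up.

```python
import itertools

# Made-up name parts, for illustration only.
FIRST_NAMES = ["Ryan", "Anna", "Maria"]
LAST_NAMES = ["Fleck", "Boden", "Lopez"]


def tagged_sentence(person_a, person_b):
    """Build one labeled sentence with two people joined by a hyphen."""
    prefix = [(word, "O") for word in "The film is directed by".split()]
    names = (
        [(part, "PERSON") for part in person_a]
        + [("-", "O")]  # the hyphen separates the two entities
        + [(part, "PERSON") for part in person_b]
    )
    suffix = [("pair", "O"), (".", "O")]
    return prefix + names + suffix


def generate(limit=5):
    """Generate up to `limit` sentences from pairs of distinct made-up people."""
    people = list(itertools.product(FIRST_NAMES, LAST_NAMES))
    pairs = itertools.permutations(people, 2)
    return [tagged_sentence(a, b) for a, b in itertools.islice(pairs, limit)]


def to_training_text(sentences):
    """Serialize as 'token<TAB>label' lines, blank line between sentences."""
    blocks = ["\n".join(f"{tok}\t{label}" for tok, label in s) for s in sentences]
    return "\n\n".join(blocks) + "\n"
```

A single template will not teach the model much on its own; in practice you would vary the surrounding sentences and mix the synthetic examples into real annotated data.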
Instead of using Stanford CoreNLP you could try Apache OpenNLP. It has an option to train a model on your own training data. Because the model depends on the names you supply, it is able to detect the names you are interested in.

How can I add more tagged words to the Stanford POS-Tagger's trained models?

I haven't found anything in the documentation about adding more tagged words to the tagger, specifically the bi-directional one.
Thanks
At present, you can't. Model training is an all-at-one-time operation. (Since the tagger uses weights that take into account contexts and frequencies, it isn't trivial to add new words to it post hoc.)
There is a workaround. It is ugly but should do the trick:
build a list of "your" words
scan text for these words
if any matches are found, do the POS tagging yourself (NLTK can help you here)
feed it to Stanford parser.
FROM: http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
"You can also give it POS tagged text; the parser will try to use
your tags if they make sense.
You might want to do this if the parser makes tagging
mistakes in your text domain."
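The workaround above can be sketched in a few lines of plain Python. The sketch assumes you already have (word, tag) pairs from some tagger (NLTK, for example) and only shows the step of overriding tags for "your" words. The lexicon contents and function names are hypothetical.

```python
# Hypothetical domain lexicon: words whose tags you want to force.
DOMAIN_TAGS = {
    "retweet": "VB",
    "lol": "UH",
}


def apply_lexicon(tagged_tokens, lexicon=DOMAIN_TAGS):
    """Replace the tag of every word found in the lexicon; keep the rest."""
    return [(word, lexicon.get(word.lower(), tag)) for word, tag in tagged_tokens]


def changed_words(before, after):
    """List the words whose tag was overridden, for a quick sanity check."""
    return [w for (w, old), (_, new) in zip(before, after) if old != new]
```

The corrected pairs can then be written out as pre-tagged text and handed to the parser, which, per the quote above, will try to use your tags if they make sense.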
