Can I detect sentences via the command-line interface of Stanford NLP, the way Apache OpenNLP does?
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect
Based on the docs at http://nlp.stanford.edu/software/corenlp.shtml, Stanford NLP requires tokenization before sentence splitting:
Our pipeline requires that you tokenize first; we use these tokens in the sentence-splitting algorithm. If your text is pre-tokenized, you can use DocumentPreprocessor and request whitespace-only tokenization.
Let me know if I misunderstood your question.
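To answer the command-line part directly: you can run just the tokenize and ssplit annotators from the CoreNLP CLI. A minimal sketch, assuming you run it from the CoreNLP distribution directory and have a file input.txt:

java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -annotators tokenize,ssplit \
  -file input.txt \
  -outputFormat text

This writes an input.txt.out file listing each detected sentence along with its tokens.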
I could not find any pre-trained Russian tokenizer in stanfordnlp or Stanford CoreNLP. Are there any models for Russian yet?
Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/), our Python package, to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence-split with Stanza, and then use the Stanford CoreNLP Server (which you can also access via Stanza) if you had any CoreNLP-specific components you wanted to work with.
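For example, a minimal Stanza sketch for Russian tokenization and sentence splitting (the sample sentence is just an illustration; the models need to be downloaded once):

import stanza

# Download the Russian models (one-time setup)
stanza.download('ru')

# The tokenize processor also performs sentence splitting
nlp = stanza.Pipeline('ru', processors='tokenize')

doc = nlp('Это первое предложение. Это второе.')
for i, sentence in enumerate(doc.sentences):
    print(i, [token.text for token in sentence.tokens])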
A group submitted some models for Russian a while back, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
Is there an option in the Stanford NER toolkit to force the output to have the same line splits as the input?
I'm looking for something similar to the "-sentences newline" option in the Stanford parser.
If you give the option -tokenizerOptions "tokenizeNLs=true,tokenizePerLine=true", then each line will be treated as a separate sentence, which should give the results you are hoping for. The output will be written as one line per input line if you use an outputFormat such as slashTags or inlineXML.
CoreNLP also provides some options for processing text line by line.
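Putting that together, a sketch of the full NER command, assuming a standard Stanford NER download with stanford-ner.jar and the 3-class English classifier (adjust the paths to your setup):

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
  -textFile input.txt \
  -tokenizerOptions "tokenizeNLs=true,tokenizePerLine=true" \
  -outputFormat slashTags

In CoreNLP, the analogous setting is -ssplit.eolonly true, which treats each input line as one sentence.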
I am using the Python library NLTK to set up the Stanford NER tagger API, but I am seeing inconsistencies between the tags this API produces and those from the online demo on Stanford's NER website. Some words are tagged in the online demo but not by the API in Python, and some words are tagged differently. I have used the same classifiers as mentioned on the website. Can anyone tell me why this happens and what the solution is?
I was running into the same issue and determined that my code and the online demo were applying different formatting rules to the text. The pyner client shows one way to normalize the input (from https://github.com/dat/pyner/blob/master/ner/client.py):
for s in ('\f', '\n', '\r', '\t', '\v'):  # strip whitespace control characters
    text = text.replace(s, '')
text += '\n'  # ensure the text ends with a newline
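Applying the same normalization before calling the tagger from NLTK should make the results line up with the demo. A sketch, assuming a local Stanford NER download (the classifier and jar paths are placeholders for your own files):

from nltk.tag import StanfordNERTagger

st = StanfordNERTagger(
    'classifiers/english.all.3class.distsim.crf.ser.gz',  # classifier path (placeholder)
    'stanford-ner.jar')                                   # jar path (placeholder)

text = 'Rami Eid is studying at Stony Brook University in NY'
for s in ('\f', '\n', '\r', '\t', '\v'):  # same normalization as above
    text = text.replace(s, '')
print(st.tag(text.split()))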
I am trying to normalize tokens (potentially merging them if needed) before running the RegexNER annotator over them.
Is there something already implemented for this in Stanford CoreNLP or in Stanford NLP in general?
If not, what's the best way to implement it? Writing a custom annotator in CoreNLP?
There are definitely some options for token normalization. You apply the -options flag with a comma-separated list containing the options you want.
This is described in more detail at this link:
http://nlp.stanford.edu/software/tokenizer.shtml
Near the bottom there is an Options section that lists the possibilities.
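As a concrete illustration, a sketch of invoking the tokenizer directly with a couple of the documented normalization options (the jar name and the chosen options are assumptions; check the Options list on the page above for the full set):

java -cp stanford-corenlp.jar edu.stanford.nlp.process.PTBTokenizer \
  -options "americanize=true,normalizeParentheses=true" \
  input.txt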
Are there other normalizations you are interested in that are not on that list?
I want to use gate-EN-twitter.model for POS tagging during parsing with the Stanford parser. Is there a command-line option for that, like -pos.model gate-EN-twitter.model? Or do I have to run the Stanford POS tagger with the GATE model first and then use its output as input to the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on supplying your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then alert the parser to the fact that the text is pre-tagged using some command-line options (see the sketch after this list).
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as gold. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass that token list to the parser.
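For the command-line route, a sketch based on the FAQ entry (the jar and model paths are assumptions; the exact flags are documented there). The input file pre-tagged.txt contains one sentence per line, with word/TAG tokens separated by spaces:

java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -sentences newline -tokenized -tagSeparator / \
  -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer \
  -tokenizerMethod newCoreLabelTokenizerFactory \
  edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz pre-tagged.txt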