Euro symbol (€) is displayed as $ by Stanford NLP - stanford-nlp

We are trying to extract the euro value from a document. Stanford CoreNLP recognizes the money expression as expected; however, during extraction it converts € to $.

Here is a sample command that runs Stanford CoreNLP with currency normalization turned off:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file sample-sentence.txt -outputFormat text -tokenize.options "normalizeCurrency=false"

If you are running CoreNLP as a dedicated server, you can include the tokenize.options parameter in the URL when sending the request.
E.g.:
http://corenlp.run?properties={"timeout":"36000","annotators":"tokenize,ssplit,parse,lemma,ner,regexner","tokenize.options":"normalizeCurrency=false,invertible=true"}
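
As a rough sketch of that server route in Python: the snippet below posts raw text to a CoreNLP server with normalizeCurrency disabled and prints the tokens back. The localhost:9000 address and the sample sentence are placeholders, not from the original post:

import json
import requests

# Placeholder address; point this at your own CoreNLP server.
url = "http://localhost:9000/"
props = {
    "annotators": "tokenize,ssplit",
    "outputFormat": "json",
    "tokenize.options": "normalizeCurrency=false,invertible=true",
}
text = "The invoice total is €1,200."  # sample input for illustration

resp = requests.post(url,
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
doc = resp.json()
# With normalizeCurrency=false the € token should come back unchanged.
print([tok["word"] for tok in doc["sentences"][0]["tokens"]])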

Related

How to do 7-class NER on Chinese with the "Stanford Named Entity Recognizer" (not CoreNLP)

I found that I can do 7-class NER (with date, money, ...) on Chinese sentences with CoreNLP.
But I can only do 4-class NER on Chinese sentences with the "Stanford Named Entity Recognizer".
That is the case on the official demo website.
So how can I do 7-class NER on Chinese sentences with the "Stanford Named Entity Recognizer"?
If you run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text
it will run 7-class NER tagging on Chinese text.

Can I stop Stanford POS and NER taggers from removing "#" and "@" characters?

I'm doing some processing with the Stanford NLP software. First of all, thanks everyone at Stanford for all this great stuff!!! Here's my conundrum:
I have sentences that can contain URLs ("http://"), email addresses ("@"), and hashtags ("#"). I'm using Python to do the work. If I use the POS and NER tagging built into nltk, all these special characters are kept in their tokenized words. But it's very slow to use these, since each call fires up a new Java instance. So I've run the taggers in server mode instead. And when I pass the full sentences through, they come back with all those special characters stripped off. I'm using the Python sner package to interface with the servers.
Here's what I mean. To use the nltk StanfordPOSTagger, you have to pass in a pre-tokenized sentence. I'm using the StanfordTokenizer.
>>> from subprocess import Popen
>>> from nltk.tag.stanford import StanfordPOSTagger
>>> from nltk.tokenize import StanfordTokenizer
>>> from sner import Ner  # https://pypi.python.org/pypi/sner
>>> # homedir is assumed to point at the Stanford tools directory
>>> sent = "Here's an #example from me#y.ou url http://me.you"
>>> st = StanfordTokenizer(homedir + 'models/stanford-postagger.jar',
...                        options={"ptb3Ellipsis": False})
>>> nltk_pos = StanfordPOSTagger(homedir + 'models/english-bidirectional-distsim.tagger',
...                              homedir + 'models/stanford-postagger.jar')
>>> pos_args = ['java', '-mx300m', '-cp', homedir + 'models/stanford-postagger.jar',
...             'edu.stanford.nlp.tagger.maxent.MaxentTaggerServer', '-model',
...             homedir + 'models/english-bidirectional-distsim.tagger', '-port', '2020']
>>> POS = Popen(pos_args)
>>> sp = Ner(host="localhost", port=2020)
>>> nltk_pos.tag(st.tokenize(sent))
[(u'Here', u'RB'), (u"'s", u'VBZ'), (u'an', u'DT'),
 (u'#example', u'NN'), (u'from', u'IN'), (u'me#y.ou', u'NN'),
 (u'url', u'NN'), (u'http://me.you', u'NN')]
>>> sp.tag(sent)
[(u'Here', u'RB'), (u"'s", u'VBZ'), (u'an', u'DT'),
 (u'example', u'NN'), (u'from', u'IN'), (u'y.ou', u'NN'),
 (u'url', u'NN'), (u'//me.you', u'NN')]
I'm curious about the difference, and whether there is a way to get the servers not to strip out those characters. I've read that there are ways to pass flags to the POS server to use pre-tokenized text ("-tokenize false"), but I can't figure out how to pass that list of strings to the server with the Python interface. In the sner package, the text to be parsed is sent as a single string, not a list of strings as returned by a tokenizer.
-b
The problem is that nltk uses edu.stanford.nlp.process.WhitespaceTokenizer as its tokenizerFactory, while the server does not. You can just change the ner-server parameters to match, like this:
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions tokenizeNLs=false
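
A minimal sketch of the client side, assuming the server above is listening on port 9199 and using sner's get_entities call: since the server now splits only on whitespace, joining the pre-tokenized words with single spaces keeps "#", "@", and URL characters intact.

from sner import Ner

# Assumes the NERServer started above is listening on localhost:9199.
tagger = Ner(host="localhost", port=9199)

tokens = ["Here", "'s", "an", "#example", "from", "me#y.ou",
          "url", "http://me.you"]
# With WhitespaceTokenizer on the server side, the joined string is
# split back into exactly these tokens, special characters included.
print(tagger.get_entities(" ".join(tokens)))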

stanfordNLP Chinese relationship extractor

I would like to extract Chinese relations with stanfordNLP. Is there an example of training, or an already trained Chinese relation model? Very grateful!
Here is an example command for getting the TAC-KBP relations from Stanford CoreNLP:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref,entitymentions,kbp -file example.txt -outputFormat text
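
As a hedged sketch of consuming the result: if you swap -outputFormat text for json in the command above, the per-sentence KBP triples can be read back in Python roughly like this (the example.txt.json filename assumes CoreNLP's default of appending the format extension to the input name):

import json

# Read the JSON that CoreNLP writes next to the input file.
with open("example.txt.json", encoding="utf-8") as f:
    doc = json.load(f)

for sentence in doc["sentences"]:
    # Each KBP triple carries subject, relation, and object strings.
    for triple in sentence.get("kbp", []):
        print(triple["subject"], triple["relation"], triple["object"])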

Is there a combined model which can generate both POS and NER tags using Stanford's NLP library?

Here is an example of the desired text output:
Good/NNP afternoon/NNP Rajat/PERSON Raina/PERSON,/O how/WRB are/VBP you/PRP today/NN ?/O
There are many ways to do this.
This command will take in a file of input text and create an output JSON with all of the tokens, each carrying its POS tag and NER tag:
java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file input.txt -outputFormat json
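
To illustrate, a short sketch that reads that JSON back and prints each token with both tags in the slash-separated style shown above (the input.txt.json filename assumes CoreNLP's default output naming):

import json

# CoreNLP names the output after the input file by default.
with open("input.txt.json", encoding="utf-8") as f:
    doc = json.load(f)

for sentence in doc["sentences"]:
    # Every token object carries both a "pos" and a "ner" field.
    print(" ".join(f'{t["word"]}/{t["pos"]}/{t["ner"]}'
                   for t in sentence["tokens"]))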

Process many texts with Stanford parser

I am trying to process many snippets of text using the Stanford parser. I am outputting to XML using this command:
java -cp stanford-corenlp-3.3.1.jar:stanford-corenlp-3.3.1-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file test
All I need is the sentence parse of each of the snippets. The problem is that the snippets can have more than one sentence, and the output XML gives all the sentences together, so I can't tell which sentences belong to which snippet. I could add a separator word between different sentences, but I think there must be a built-in capability to show the separation.
There is a parameter -fileList that takes a string of comma-separated files as its input.
Example:
java -cp stanford-corenlp-3.3.1.jar:stanford-corenlp-3.3.1-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -fileList=file1.txt,file2.txt,file3.txt
Have a look at the SentimentPipeline.java (edu.stanford.nlp.sentiment) for further details.
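
As a sketch of how the pieces fit together (the snippet texts and file names here are made up): writing each snippet to its own file and passing the comma-separated list via -fileList yields one output per snippet, so the sentence grouping is never lost.

import subprocess

# Hypothetical snippets; each is written to its own file so CoreNLP
# produces one output per snippet.
snippets = ["First snippet. It has two sentences.", "Second snippet."]
paths = []
for i, text in enumerate(snippets):
    path = f"snippet{i}.txt"
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

# Same command as above, with the comma-separated file list.
cmd = [
    "java", "-Xmx3g",
    "-cp", "stanford-corenlp-3.3.1.jar:stanford-corenlp-3.3.1-models.jar:"
           "xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar",
    "edu.stanford.nlp.pipeline.StanfordCoreNLP",
    "-annotators", "tokenize,ssplit,parse",
    f"-fileList={','.join(paths)}",
]
subprocess.run(cmd, check=True)
# CoreNLP writes one output per input file (e.g. snippet0.txt.xml),
# so sentences stay grouped by snippet.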
