Named Entity Recognition for Arabic documents - stanford-nlp

I need your help, please. I am doing an NER project using NetBeans 8.0.2.
I need to extract the person names and places from any Arabic document file and categorize them as person name or place. I looked at all the Stanford tools: the POS tagger, the parser, and Stanford NER. I tried them all, and the tagger works fine for me.
But I had problems with the parser, especially with this line of code from ParserDemo:
LexicalizedParser lp = LexicalizedParser.loadModel(grammar, options);
No output comes up. Do I need the parser first to tokenize the document and then use the POS tagger, or can I just use the POS tagger with some editing (for example, using an if statement to combine all adjacent NNP tokens, and the same for places)?

First of all, at the moment we do not have any Arabic NER models.
Secondly, here are some steps for running the Stanford parser on Arabic text.
Get the Stanford parser: http://nlp.stanford.edu/software/lex-parser.shtml
Compile ParserDemo.java; you need the jars in the stanford-parser-full-2015-04-20 directory on the classpath to compile.
I ran this command at the command line from the stanford-parser-full-2015-04-20 directory (do the analogous thing in NetBeans):
java -cp ".:*" ParserDemo edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz data/arabic-onesent-utf8.txt
You should get a proper parse of the Arabic example sentence.
So when you run ParserDemo in NetBeans, make sure you provide "edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz" as the first argument to ParserDemo, so it knows to load the Arabic model.
For this input:
و نشر العدل من خلال قضاء مستقل
I get this output:
(ROOT
  (S (CC و)
    (VP (VBD نشر)
      (NP (DTNN العدل))
      (PP (IN من)
        (NP (NN خلال)
          (NP (NN قضاء) (JJ مستقل)))))
    (PUNC .)))
I am happy to help further; please let me know if you need any more info.
FYI here is some more info on the Arabic parser:
http://nlp.stanford.edu/software/parser-arabic-faq.shtml
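On the question of merging proper-noun tags by hand: here is a minimal sketch of that idea, not a substitute for a real NER model. It assumes the tagger output has already been written as space-separated word/TAG tokens (the separator is an assumption, adjust it to whatever your tagger prints) and that proper-noun tags end in "NNP". The class and method names are hypothetical. Note that this only yields candidate names; it cannot tell a person from a place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: merges runs of adjacent proper-noun tokens into candidate names.
class ProperNounChunker {

    // Assumption: input looks like "word/TAG word/TAG ..." and proper-noun tags end in "NNP".
    static List<String> chunk(String tagged) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            boolean proper = slash > 0 && token.substring(slash + 1).endsWith("NNP");
            if (proper) {
                // extend the current run with the word part of the token
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token, 0, slash);
            } else if (current.length() > 0) {
                // a non-proper-noun token closes the current run
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // English tokens are used here only to keep the example easy to read.
        System.out.println(chunk("Barack/NNP Obama/NNP visited/VBD New/NNP York/NNP ./."));
    }
}
```

Deciding whether each candidate is a person or a place would still need a gazetteer or a trained model on top of this.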

Related

Adding metadata into Stanford coreNLP input

I have a large corpus of sentences (about 1.1M) to parse with Stanford CoreNLP, but the output contains more sentences than the input. The system probably splits some sentences further, beyond the line-by-line segmentation I provide.
To keep track of what happens, I would like to include "tags" in the input. These tags should be recognizable in the output and should not affect parsing.
Something like
<0001>
I saw a man with a telescope .
</0001>
or
#0001#
I saw a man with a telescope .
#/0001#
I have tried many formats, in all cases the "tag" has been parsed as if it were part of the text.
Is there some way to tell the parser "do not parse this, just keep it as is in the output"?
===A few hours later===
As I'm getting no answer, here is an example: I would like to process the sentence “Manon espérait secrètement y revoir un garçon qui l'avait marquée autrefois.”, which carries the tag 151_0_4. My idea was to write the tag between two rows of equal signs on a separate line, followed by a period, so that the tag would, at worst, be processed as a separate sentence:
=====151_0_4======.
Manon espérait secrètement y revoir un garçon qui l'avait marquée autrefois.
=====151_0_4======.
And here is what this produced:
(ROOT (SENT (NP (SYM =)) (NP (SYM =) (PP (SYM =) (NP (SYM =) (PP (SYM =) (NP (NUM 151_0_4) (SYM =) (SYM =) (NP (SYM =) (PP (SYM =) (NP (SYM =) (SYM =))))))))) (PUNCT .)))
As you can see, the tag is definitely parsed as if it were text; there is no way to separate it out.
The same thing happened with XML-like tags such as <x151_0_4> and with tags using the hash character.
If your current data is strictly one sentence per line, then by far the easiest thing to do is to leave it like that and give the option -ssplit.eolonly=true.
Unfortunately, there isn't an option to pass certain kinds of metadata or delimiters through without attempting to parse or process them. You can, however, indicate that they should not be made part of other sentences by means of the ssplit.boundaryTokenRegex or ssplit.boundaryMultiTokenRegex properties. Your choices are then either to delete them (see ssplit.tokenPatternsToDiscard) or to process them as weird sentences, which you would then need to clean up.
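If the one-sentence-per-line route works for you, the tags do not need to go through the parser at all: they can live in a side list and be re-attached by position. Here is a rough sketch of that bookkeeping, assuming each input record is stored as id, a tab, then the sentence (that record format and the class name are my own assumptions), and assuming no sentence is empty, so that the i-th parse belongs to the i-th line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: keeps sentence ids outside the text that gets parsed.
class TagSideTable {
    final List<String> ids = new ArrayList<>();
    final List<String> sentences = new ArrayList<>();

    // Assumption: each record is "<id>\t<sentence>", one sentence per record.
    static TagSideTable fromRecords(List<String> records) {
        TagSideTable table = new TagSideTable();
        for (String record : records) {
            int tab = record.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("missing tab: " + record);
            }
            table.ids.add(record.substring(0, tab));
            // collapse whitespace so every sentence stays on a single line
            table.sentences.add(record.substring(tab + 1).replaceAll("\\s+", " ").trim());
        }
        return table;
    }

    // Text to hand to the pipeline: one sentence per line, no tags.
    String parserInput() {
        return String.join("\n", sentences);
    }

    // Re-attach the id to the i-th parse (0-based) that comes back.
    String label(int i, String parse) {
        return ids.get(i) + "\t" + parse;
    }

    public static void main(String[] args) {
        TagSideTable table = fromRecords(Arrays.asList(
                "0001\tI saw a man with a telescope .",
                "0002\tShe left ."));
        System.out.println(table.parserInput());
        System.out.println(table.label(0, "(ROOT ...)"));
    }
}
```

The output of parserInput() is what you would feed to the pipeline with the end-of-line-only splitting turned on.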

Extracting information from a sentence using NLP

I want to extract information from sentences. I am a newbie in this field. I have sentences such as:
 
"Andrew query pizza king what is today's deal"
 "Andrew order flower shop to send my wife roses"
Format : <Name> <command> <company name> <connecting word> <action>
How can I use the Stanford NLP parser to extract the parts of a sentence in the format above? For example, if I want to print the action of each sentence, it should give {is today's deal, send my wife roses}.
That's a hard task. If you have a very, very restricted set of sentences, you can try to use the parser dependencies and model your problem with rules. However, I ran your sentence through the Stanford parser and got an obviously wrong result:
(ROOT
  (FRAG
    (NP
      (NP (NNP Andrew) (NN query) (NN pizza) (NN king))
      (SBAR
        (WHNP (WP what))
        (S
          (VP (VBZ is)
            (NP
              (NP (NN today) (POS 's))
              (NN deal))))))))
As you can see, it treats "Andrew query pizza king" as a noun phrase, and it would do the same with "Andrew dog carrot soup what is today's deal". It misses the verb "query", the target "pizza king", and so on.
Even if that worked, the syntax parser models only syntax, ignoring semantics. You should check Semantic Role Labeling, Named Entity Recognition, Relation Extraction, etc. For your specific task most probably you would have to define your own semantics, then use a statistical algorithm for analyzing the text and extracting the needed information.
Here is a nice article about approaches to building chatbots: https://techinsight.com.vn/language/en/three-basic-nlp-problems-one-develops-chatbot-system-typical-approaches/
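For the "very restricted set of sentences" case, here is a rough sketch of the rule-based route. To be clear, it uses a plain regular expression instead of parser dependencies, and it assumes the commands ("query", "order") and connecting words ("what", "to") come from small closed lists taken from the two example sentences. The class name and the slot order are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for: <Name> <command> <company name> <connecting word> <action>
class CommandExtractor {
    // Assumption: commands and connecting words are known in advance.
    private static final Pattern FORMAT = Pattern.compile(
            "^(\\S+)\\s+(query|order)\\s+(.+?)\\s+(what|to)\\s+(.+)$");

    // Returns {name, command, company, connecting word, action}, or null if the sentence does not fit.
    static String[] extract(String sentence) {
        Matcher m = FORMAT.matcher(sentence.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }

    public static void main(String[] args) {
        String[] slots = extract("Andrew order flower shop to send my wife roses");
        System.out.println("company: " + slots[2]);
        System.out.println("action:  " + slots[4]);
    }
}
```

This breaks as soon as a company name contains one of the connecting words or a user phrases the request differently, which is why the approaches mentioned above are the more robust direction.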

Specific Part of Speech labels for Java Stanford NLP

What is the set of PoS labels produced by Stanford NLP (including PoS labels for punctuation tokens), and what are their descriptions?
I know this question has been asked several times, such as in:
Java Stanford NLP: Part of Speech labels?
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf
but those answers list some typical PoS labels that are not specific to Stanford NLP. For instance, none of those answers lists the -LRB- PoS label used by Stanford NLP for the ( punctuation.
Where can I find this list of PoS labels in the source code of the Stanford NLP?
Also, what are some token examples annotated with the SYM PoS label?
Also, how can I tell whether a token is punctuation?
Here they define isPunctation == true if its PoS is :|,|.|“|”|-LRB-|-RRB-|HYPH|NFP|SYM|PUNC. However, Stanford NLP does not have all these PoS labels.
It is the Penn Treebank POS set, but many descriptions of this tag set seem to omit punctuation marks. Here is a complete list of tags:
https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf
(But parentheses are tagged as -LRB- and -RRB-; I am not sure why the documentation doesn't mention this.)
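As for checking whether a token is punctuation, a simple lookup on the tag may be enough. The sketch below assumes the punctuation-like tags are the Penn Treebank ones with brackets written as -LRB- and -RRB-. The exact set is my assumption, so check it against what your tagger actually emits. Whether SYM, # or $ should count is an application choice and they are left out here. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: decides from the PoS tag alone whether a token is punctuation.
class PunctuationTags {
    // Assumption: Penn Treebank punctuation tags, with brackets as -LRB- / -RRB-.
    private static final Set<String> TAGS = new HashSet<>(Arrays.asList(
            ".", ",", ":", "``", "''", "-LRB-", "-RRB-"));

    static boolean isPunctuation(String posTag) {
        return TAGS.contains(posTag);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("-LRB-"));
        System.out.println(isPunctuation("NN"));
    }
}
```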

How can I expand stanford coreNLP spanish model/dictionary

I just ran a "hello world" using Stanford CoreNLP to get named entities from text. But some places are not recognized properly: "Ixhuatlancillo" and "Veracruz", both cities that should be labeled as LUG (place), are labeled as ORG.
I would like to expand the Spanish model or dictionary to add places (cities) in México and to add person names. How can I do this?
Thanks in advance.
The fastest and easiest way would be to use the regexner annotator. You can use this to manually build a dictionary.
Here is an example rule (the columns are separated by tabs; the first column can contain any number of words):
system administrator TITLE MISC 2
The columns are: token sequence, tag, tags that can be overwritten, priority.
The rule above would mark "system administrator" in text as TITLE.
For your case:
Veracruz LUG MISC,ORG,PERS 2
This will allow the dictionary to overwrite MISC, ORG, and PERS. Unless you list those tags in the third column, the rule won't overwrite NER tags that were already assigned.
You might use a command like this to run it:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -props StanfordCoreNLP-spanish.properties -regexner.mapping /path/to/new_spanish.rules -regexner.ignorecase -regexner.validpospattern "^(NN|JJ|NNP).*" -outputFormat text -file sample-text.txt
Note that -regexner.ignorecase makes matching case-insensitive, and -regexner.validpospattern restricts matches to token sequences whose POS tags match the specified pattern.
All of this being said, I just ran on the sentence:
Ella fue a Veracruz.
and it tagged it properly. Could you let me know what sentence you ran on that caused an incorrect tag for Veracruz?
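If you have a long list of cities, it may be easier to generate the rules file than to type it. Here is a small sketch that builds lines in the tab-separated format described above, using the LUG tag, the MISC,ORG,PERS overwrite list and priority 2 from the example rule. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for regexner mapping lines.
class RegexNerRules {

    // One rule: token sequence, tag, overwritable tags, priority, separated by tabs.
    static String rule(String tokens, String tag, String overwritable, int priority) {
        return String.join("\t", tokens, tag, overwritable, Integer.toString(priority));
    }

    // Builds one LUG rule per place name, mirroring the Veracruz example above.
    static List<String> placeRules(List<String> places) {
        List<String> lines = new ArrayList<>();
        for (String place : places) {
            lines.add(rule(place.trim(), "LUG", "MISC,ORG,PERS", 2));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Print the rules; redirect the output to the file you pass to -regexner.mapping.
        for (String line : placeRules(Arrays.asList("Veracruz", "Ixhuatlancillo"))) {
            System.out.println(line);
        }
    }
}
```

Generating the file this way also avoids the common mistake of separating columns with spaces instead of tabs.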

How to get the text for Tree in reverse way

Could anybody please help me extract the text from a Tree?
e.g. : (NP (NP (DT the) (JJ main) (NN road)) (PP (IN of) (NP (NNP Rontau))))
The text : "the main road of Rontau"
I'm using the Stanford trees package.
You want the "yield" of a parse tree. You can access this as a list of Label instances using the method Tree#yield.
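To make "yield" concrete without depending on the library, here is a small sketch that reads the leaves straight from the bracketed string. It is not the Tree#yield implementation; it simply assumes Penn-style bracketing in which every leaf is the word sitting immediately before a closing parenthesis. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the leaves of a bracketed parse string, left to right.
class BracketedYield {
    // A leaf is a token without spaces or brackets, preceded by whitespace and followed by ")".
    private static final Pattern LEAF = Pattern.compile("\\s([^\\s()]+)\\)");

    static String yieldOf(String bracketed) {
        List<String> leaves = new ArrayList<>();
        Matcher m = LEAF.matcher(bracketed);
        while (m.find()) {
            leaves.add(m.group(1));
        }
        return String.join(" ", leaves);
    }

    public static void main(String[] args) {
        String tree = "(NP (NP (DT the) (JJ main) (NN road)) (PP (IN of) (NP (NNP Rontau))))";
        System.out.println(yieldOf(tree)); // prints "the main road of Rontau"
    }
}
```

When you already hold a Tree object, calling its yield method and joining the labels gives the same words without any string handling.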
