How can I train my own Chinese NER model - stanford-nlp

I'm trying to train my own Chinese NER model following the instructions at https://nlp.stanford.edu/software/crf-faq.html. I converted the data to one Chinese character per line and labeled the entity after each character, like this:
红 ORG
帽 ORG
首 O
席 O
执 O
行 O
官 O
Jim PERSON
Whitehurst PERSON
曾 O
表 O
示 O
, O
亚 ORG
马 ORG
逊 ORG
公 O
共 O
云 O
有 O
许 O
多 O
...
After running the command java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop tech.prop, it generated the classifier (chinese.misc.distsim.crf.ser.gz). Then I checked how the classifier works on annotated test data with java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier chinese.misc.distsim.crf.ser.gz -testFile test.tsv, and it seems to work.
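For reference, my tech.prop is based on the template from the FAQ, roughly like this (abridged sketch; train.tsv stands in for my actual training file, and the feature flags are the FAQ defaults):
trainFile = train.tsv
serializeTo = chinese.misc.distsim.crf.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC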
But when I checked the classifier on a plain text paragraph instead of annotated test data, using java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier chinese.misc.distsim.crf.ser.gz -textFile test.txt, the classifier seemed useless: it didn't recognize entities in the word-segmented Chinese.
Did I do something wrong when training the new Chinese NER model?
One possible problem is that I converted the training data to one Chinese character per line. A single Chinese character is not a Chinese word; should I instead use word-segmented training data, one Chinese word per line, and label Chinese words rather than characters?

Flags that may be useful for handling different types of text input are:
-plainTextDocumentReaderAndWriter CLASSNAME Specify a class to read text documents (which extends DocumentReaderAndWriter)
-tokenizerFactory CLASSNAME Specify a class to do tokenization (which extends TokenizerFactory)
-tokenizerOptions "tokenizeNLs=true,asciiQuotes=true" Give options to the tokenizer, such as the two example options here.
This might be useful too:
https://stanfordnlp.github.io/CoreNLP/human-languages.html
Apart from these, you should take a look at Chinese word-segmenter features in SeqClassifierFlags.
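For example, one way to tag raw, unsegmented Chinese text with a custom model is to run it through the CoreNLP Chinese pipeline, which performs word segmentation before NER. A sketch, assuming stanford-corenlp and the Chinese models jar (which supplies StanfordCoreNLP-chinese.properties) are on the classpath:
java -Xmx4g -cp "stanford-corenlp.jar:stanford-chinese-corenlp-models.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,ner -ner.model chinese.misc.distsim.crf.ser.gz -file test.txt -outputFormat text
Here -ner.model swaps in your serialized classifier; this only helps if the segmenter produces units comparable to what you trained on, which is another reason to prefer word-segmented (one word per line) training data.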

Related

How can I expand the Stanford CoreNLP Spanish model/dictionary

I just ran a "hello world" with Stanford CoreNLP to get named entities from text, but some places are not recognized properly: "Ixhuatlancillo" and "Veracruz", both cities that should be labeled LUG (place), are labeled ORG.
I would like to expand the Spanish model or dictionary to add places (cities) from México, and to add person names. How can I do this?
Thanks in advance.
The fastest and easiest way would be to use the regexner annotator. You can use this to manually build a dictionary.
Here is an example rule format (columns separated by tabs; the first column can be any number of words):
system administrator TITLE MISC 2
token sequence tag tags-that-can-be-overwritten priority
That above rule would mark "system administrator" in text as TITLE.
For your case:
Veracruz LUG MISC,ORG,PERS 2
This allows the dictionary to overwrite MISC, ORG, and PERS. Without the extra tags in the third column, the rule won't overwrite previously assigned NER tags.
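So a small new_spanish.rules covering the two cities from your example might look like this (columns separated by tabs; add similar lines with PERS for person names):
Ixhuatlancillo	LUG	MISC,ORG,PERS	2
Veracruz	LUG	MISC,ORG,PERS	2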
You might use a command like this to run it:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -props StanfordCoreNLP-spanish.properties -regexner.mapping /path/to/new_spanish.rules -regexner.ignorecase -regexner.validpospattern "^(NN|JJ|NNP).*" -outputFormat text -file sample-text.txt
Note that -regexner.ignorecase makes matching caseless, and -regexner.validpospattern says to match only sequences with the specified POS tag pattern.
All of this being said, I just ran on the sentence:
Ella fue a Veracruz.
and it tagged it properly. Could you let me know what sentence you ran on that caused an incorrect tag for Veracruz?

stanfordnlp - Training space-separated words as a single token for Stanford NER model generation

I have read the detailed description at http://nlp.stanford.edu/software/crf-faq.shtml#a on training a model from a labelled input file according to the .prop file. But the article says:
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus has some space-separated words that together form a single token rather than separate tokens. For instance, "Wright State University" is a single token even though Wright, State, and University are separate words. I would like to generate the model with the above treated as one token. The article says the input file for generating the model should be tab-separated, with the first column the token and the second the label. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
If you don't have sentences, and you simply have a list of strings that should always be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html
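For the example above, a single tab-separated RegexNER mapping line would do (a sketch):
Wright State University	SCHOOL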

Stanford NER Classifier linefeed issue

I'm using the Stanford NER with a 3 class model to identify PERSON, LOCATION, and ORGANIZATION in a file. It works fine except when there are names separated by a newline:
JANE DOE
JOHN DOE
JANE SMITH
The NER tool treats these three names as one big name rather than three separate names. If I put a comma after each name, it picks up all three. How can I tell the tool to use the newline to separate the names?
If the names end up as successive tokens in the same "sentence", that is what will happen. The main thing you can do is have the system tokenize and sentence-split on newlines; then you get a separate sentence for each name and things work fine. In general this works well if your text is formatted as one paragraph per line (with soft line-wrapping, as is usual in modern text), but badly if your text has hard line breaks that don't fall at sentence/paragraph boundaries, because the system will then wrongly treat each line as a sentence. Commands that do this, via CoreNLP and via Stanford NER directly, are:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators "tokenize,ssplit,pos,lemma,ner" -file taylorswift.txt -outputFormat conll -ssplit.newlineIsSentenceBreak always
java edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz -textFile taylorswift.txt -tokenizerOptions tokenizeNLs=true
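With newline splitting enabled, each line becomes its own sentence, so the three names should be tagged separately, along these lines (simplified sketch of the output):
JANE/PERSON DOE/PERSON
JOHN/PERSON DOE/PERSON
JANE/PERSON SMITH/PERSON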

Stanford NER tool -- spaces in training file

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that uses only spaces to delimit the items the system expects. For instance:
/a/b/c sanferro 2
/d/e/f ginger 2
However, I run into errors while trying forms such as:
/a/b/c san ferro 2
Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output.
How can I encode spaces? I've tried enclosing a double quotes but that doesn't work.
Typically you use CoNLL style data to train a CRF. Here is an example:
-DOCSTART- O

John PERSON
Smith PERSON
went O
to O
France LOCATION
. O

Jane PERSON
Smith PERSON
went O
to O
Hawaii LOCATION
. O
A "\t" character separates the tokens and the tags. You put a blank space in between the sentences. You use the special symbol "-DOCSTART-" to indicate where a new document starts. Typically you provide a large set of sentences. This is the case when you are training a CRF.
If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
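Because RegexNER mapping files are tab-separated, a multi-word pattern like "san ferro" needs no quoting at all. A sketch of a mapping line, with a tab between the pattern and your label:
san ferro	2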
Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml

Replace accented character in Ruby

I have a Ruby script, and when I try to replace accented characters, gsub doesn't work for me. My folder name is "Réé Ab":
name = File.basename(Dir.getwd)
name.downcase!
name.gsub!(/[àáâãäå]/,'a')
name.gsub!(/æ/,'ae')
name.gsub!(/ç/, 'c')
name.gsub!(/[èéêë]/,'e')
name.gsub!(/[ìíîï]/,'i')
name.gsub!(/[ýÿ]/,'y')
name.gsub!(/[òóôõö]/,'o')
name.gsub!(/[ùúûü]/,'u')
the output "réé ab", why the accented characters stil there ?
Each é in your name is actually two Unicode codepoints: U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).
p 'é'.each_codepoint.map{|e|"U+#{e.to_s(16).upcase.rjust(4,'0')}"} * ' ' # => "U+0065 U+0301"
However, the é in your regex is a single codepoint: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Wikipedia has an article about Unicode equivalence, and the official Unicode FAQ also contains explanations of this topic.
How to normalize Unicode strings in Ruby depends on your Ruby version. Since 2.2, Ruby has built-in Unicode normalization support, so you don't have to require a library or install a gem as in previous versions. To normalize name, simply call String#unicode_normalize with :nfc or :nfkc as the argument to compose é (U+0065 U+0301) into é (U+00E9):
name = File.basename(Dir.getwd)
name.unicode_normalize! # thankfully :nfc is the default
name.downcase!
Of course, you could also use decomposed characters in your regular expressions, but that probably won't carry over to other file systems, and you would still have to normalize, using :nfd or :nfkd to decompose.
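For instance, a sketch of the decomposed approach:
name = "Réé Ab".unicode_normalize(:nfd).downcase
name.gsub!(/e\u0301/, 'e') # "e" followed by U+0301 COMBINING ACUTE ACCENT
name # => "ree ab"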
I should also point out that converting é to e or ü to u causes information loss. For example, the German word Müll (trash) would be converted to Mull (forest humus).
