How to replace the matched text with a word specifying by rules files? - stanford-nlp

I am currently working on a Stanford CoreNLP program that replaces a matched text with a specified word using a list of given rules. I checked TokensRegex Expression, I know there is a regex function can be used in Action field:
Replace(List<CoreMap>, tokensregex, replacement)<br>Match(String,regex,replacement)
to do that. However, it is not clear to me how to implement this function in my rules files. And I couldn't find any example on GitHub or other web pages.
Here is an example of a replacement:
Input text: John Smith is a member of the NLP lab.
Matched pattern: "John Smith" is replaced with "Student A" in the text.
Resulting text: Student A is a member of the NLP lab.
Anyone could help me? I am new to Stanford CoreNLP and have a lot of things to learn.

Related

Struggling to understand user dictionary format in Elasticsearch Kuromoji tokenizer

I wanted to use Elasticsearch Kuromoji plugin for Japanese language. However, I'm struggling to understand the user_dictionary format of the file in the tokenizer. It's explained in elastic doc https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html as the CSV of the following form:
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
So there is not much in the documentation about that.
When looking at the sample entry the doc shows, it can looks like below:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
So, breaking it down, the first element is the dictionary text:
東京スカイツリー - Tokyo Sky Tree
東京 スカイツリー - is Tokyo Sky tree - I assuming the space here is to denote token, but wondering why only "Tokyo" is a separate token, but sky tree is not split into "sky" "tree" ?
トウキョウ スカイツリー - Then we have a reading forms. And again, "Tokyo" "sky tree" - again, why it's splited such way. Can I specify more than one reading form of the text in this column (of course if there are any)
And the last is the part of speech, which is the bit I don't understand. カスタム名詞 means "Custom noun". I assuming I can define the part of speech such as verb, noun etc. But what are the rules, should it follow some format of part of speech name. I saw examples where it's specified as "noun" - 名詞. But in this example is custom noun.
Anyone have some ideas, materials especially around Part of speech field - such as what are the available values. Additionally, what impact has this field to the overall tokenizer capabilities ?
Thanks
Do you try to define "tokyo sky tree" like this
"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
I encounter another problem Found duplicate term [東京スカイツリー] in user dictionary at line [1]

entity recognize,just a word,not a whole sentence to reognize,what should use Algorithm about NLP

when i input a parameter "marry" into function,then,return "this is name",when i input a parameter "MIT",then,return "this is institution name",just a word,not a whole sentence to reognize,what should use Algorithm about NLP? thanks.
This looks like part-of-speech tagging and/or named entity recognition BUT if you are processing English, single words without context are potentially ambiguous. Also, single words may not be informative. "new" on it's own can be an adjective (POS) but "New York" is most likely a location (NER). Check some literature on both tasks and consider processing at least sentence-level features.

Defining TokensRegex for Stanford NLP RegexNER

I'm trying to create a regex token to tag Universities as schools in input text. For e.g. University of Wisconsin or Universidad Anahuac should get tagged as SCHOOL.
I have this as my pattern
( /University|Universidad/ /of?/ [ {ner:LOCATION}|{ner:ORGANIZATION} ]+ ) SCHOOL
I can't seem to get the syntax correct. Any help would be appreciated.

stanfordnlp - Training space separated words as a single token to Stanford NER model generation

I have read the detailed description given here- http://nlp.stanford.edu/software/crf-faq.shtml#a on training the model based on the labelled input file according to the .prop file. But the article says-
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus has some space separated words which are all combinedly form a token instead of single word. For instance, "Wright State University" is a single token though Wright, State and University are entities individually. I would like to generate the model with the above token as a single one. The article says that the input file to generate the model should be given as a tab separated words with first column being the token and the second column the label. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
If don't have sentences, and you simply have a list of strings that should be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html

Stanford NER tool -- spaces in training file

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that has spaces only to delimit the items the system expects. For instance,
/a/b/c sanferro 2
/d/e/f ginger 2
However, I run into errors while trying forms such as:
/a/b/c san ferro 2
Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output.
How can I encode spaces? I've tried enclosing a double quotes but that doesn't work.
Typically you use CoNLL style data to train a CRF. Here is an example:
-DOCSTART- O
John PERSON
Smith PERSON
went O
to O
France LOCATION
. O
Jane PERSON
Smith PERSON
went O
to O
Hawaii LOCATION
. O
A "\t" character separates the tokens and the tags. You put a blank space in between the sentences. You use the special symbol "-DOCSTART-" to indicate where a new document starts. Typically you provide a large set of sentences. This is the case when you are training a CRF.
If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml

Resources