entitymentions doesn't seem to work. I followed a similar approach to the one mentioned here by adding entitymentions as one of the annotators.
How can I detect named entities that have more than 1 word using CoreNLP's RegexNER?
Input : "Here is your 24 USD"
I have a TokensRegex rule:
{ ruleType: "tokens", pattern: ([{ner:"NUMBER"}] + [{word:"USD"}]), action: Annotate($0, ner, "NEW_MONEY"), result: "NEW_MONEY_RESULT" }
Init Pipeline:
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex,entitymentions");
props.setProperty("tokensregex.rules", "basic_ner.rules");
I still get two CoreEntityMentions instead of just one.
Both of them have the same value for edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation, which is NEW_MONEY,
but they have different edu.stanford.nlp.ling.CoreAnnotations$EntityMentionIndexAnnotation values:
0 for "24"
1 for "USD"
How can they be merged, given that they both have the same entity tag annotation?
I am using version 3.9.2 of the Stanford library.
The issue is that numbers have a normalized named entity tag, and entity mentions are split where the normalized tag differs, so the rule also needs to set a matching normalized tag on the whole span.
Here is a rules file that will work:
# these Java classes will be used by the rules
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normNER = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
# rule for recognizing new money expressions such as "24 USD"
{ ruleType: "tokens", pattern: ([{ner:"NUMBER"}] [{word:"USD"}]), action: (Annotate($0, ner, "NEW_MONEY"), Annotate($0, normNER, "NEW_MONEY")), result: "NEW_MONEY" }
You should not add an extra tokensregex annotator and entitymentions annotator at the end. The ner annotator will run these as sub-annotators.
Here is an example command:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules new_money.rules -file new_money_example.txt -outputFormat text
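If you are using the Java API, the same settings can be passed in a Properties object. A minimal sketch, mirroring the command above (the rules file name is the same hypothetical new_money.rules):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// ner runs tokensregex and entitymentions as sub-annotators,
// so neither needs to be listed explicitly
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
props.setProperty("ner.additional.tokensregex.rules", "new_money.rules");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);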
More documentation here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
https://stanfordnlp.github.io/CoreNLP/ner.html
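To verify the merge with the Java API, you can inspect the entity mentions after annotating. A sketch reusing the pipeline built above (CoreDocument and CoreEntityMention are available in the 3.9.x releases):

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;

CoreDocument doc = new CoreDocument("Here is your 24 USD");
pipeline.annotate(doc);
for (CoreEntityMention em : doc.entityMentions()) {
    // with the rules above this should print a single mention: "24 USD" with type NEW_MONEY
    System.out.println(em.text() + "\t" + em.entityType());
}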
I am trying to write a protoc plugin that requires me to use custom options. I defined my custom option as shown in the example (https://developers.google.com/protocol-buffers/docs/proto#customoptions):
import "google/protobuf/descriptor.proto";
extend google.protobuf.MessageOptions {
string my_option = 51234;
}
I use it as follows:
message Hello {
bool greeting = 1;
string name = 2;
int32 number = 3;
option (my_option) = "telephone";
}
However, when I read the parsed request, the options field is empty for the "Hello" message.
I am doing the following to read it:
import sys
from google.protobuf.compiler import plugin_pb2 as plugin

data = sys.stdin.buffer.read()  # read raw bytes (sys.stdin.buffer on Python 3)
request = plugin.CodeGeneratorRequest()
request.ParseFromString(data)
When I print "request," it just gives me this
message_type {
name: "Hello"
field {
name: "greeting"
number: 1
label: LABEL_REQUIRED
type: TYPE_BOOL
json_name: "greeting"
}
field {
name: "name"
number: 2
label: LABEL_REQUIRED
type: TYPE_STRING
json_name: "name"
}
field {
name: "number"
number: 3
label: LABEL_OPTIONAL
type: TYPE_INT32
json_name: "number"
}
options {
}
}
As seen, the options field is empty even though I defined options in my .proto file. Is my syntax incorrect for defining custom options? Or could it be a problem with my version of protoc?
I'm writing my own protobuf Python plugin. I ran into the same problem and found a solution:
Put your custom options in a file my_custom.proto.
Use protoc to generate a Python file from my_custom.proto => my_custom_pb2.py
In your Python plugin code, import that module: import my_custom_pb2
Turns out you need to have the _pb2.py file imported for the .proto file in which the custom option is defined. For example, if you are parsing a file (using ParseFromString) called example.proto which uses a custom option defined in option.proto, you must import option_pb2.py in the Python file that calls ParseFromString.
Is the Stanford CoreNLP package for Chinese able to detect chengyu (成语) and sayings (格言/谚语/惯用语, e.g. 冰冻三尺,非一日之寒)?
Surprised me too! It really does!
The following is generated by the Stanford CoreNLP pipeline (with Chinese models) using the annotators tokenize, ssplit, pos, lemma, ner:
[
[
{
"category2":null,
"offset-begin":"0",
"ner2":"O",
"lemma2":"冰冻三尺",
"word2":null,
"index":"1",
"index2":"1",
"lemma":"冰冻三尺",
"offset-begin2":"null",
"tag2":"",
"originalText":"",
"offset-end":"4",
"answer":null,
"pos":"VV",
"offset-end2":"null",
"ner":"O",
"tag":"VV",
"originalText2":null,
"category":null,
"word":"冰冻三尺",
"value":"冰冻三尺"
},
{
"category2":null,
"offset-begin":"4",
"ner2":"O",
"lemma2":",",
"word2":null,
"index":"2",
"index2":"2",
"lemma":",",
"offset-begin2":"null",
"tag2":"",
"originalText":"",
"offset-end":"5",
"answer":null,
"pos":"PU",
"offset-end2":"null",
"ner":"O",
"tag":"PU",
"originalText2":null,
"category":null,
"word":",",
"value":","
},
{
"category2":null,
"offset-begin":"5",
"ner2":"O",
"lemma2":"非一日之寒",
"word2":null,
"index":"3",
"index2":"3",
"lemma":"非一日之寒",
"offset-begin2":"null",
"tag2":"",
"originalText":"",
"offset-end":"10",
"answer":null,
"pos":"VV",
"offset-end2":"null",
"ner":"O",
"tag":"VV",
"originalText2":null,
"category":null,
"word":"非一日之寒",
"value":"非一日之寒"
}
]
]
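For reference, output like the above can be produced from the command line with the bundled Chinese properties file. A sketch, assuming the Chinese models jar is on the classpath and input.txt contains the sentence:

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,lemma,ner -file input.txt -outputFormat json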
I was finally able to get my TokensRegex code to give some kind of output for named entities. But the output is not exactly what I want. I believe the rules need some tweaking.
Here's the code:
public static void main(String[] args)
{
String rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
props.setProperty("ner.useSUTime", "0");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";
Annotation document = new Annotation(inputText);
pipeline.annotate(document);
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);
/* Next we can go over the annotated sentences and extract the annotated words,
Using the CoreLabel Object */
for (CoreMap sentence : sentences)
{
List<MatchedExpression> matched = extractor.extractExpressions(sentence);
for(MatchedExpression phrase : matched){
// Print out matched text and value
System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());
// Print out token information
CoreMap cm = phrase.getAnnotation();
for (CoreLabel token : cm.get(TokensAnnotation.class))
{
if (token.tag().equals("NNP")){
String leftContext = token.before();
String rightContext = token.after();
System.out.println(leftContext);
System.out.println(rightContext);
String word = token.get(TextAnnotation.class);
String lemma = token.get(LemmaAnnotation.class);
String pos = token.get(PartOfSpeechAnnotation.class);
String ne = token.get(NamedEntityTagAnnotation.class);
System.out.println("matched token: " + "word="+word + ", lemma="+lemma + ", pos=" + pos + "ne=" + ne);
}
}
}
}
}
And here's the rules file:
$TITLES_CORPORATE = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
$ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)
# For detecting organization names like 'Paragonix Inc.'
{ ruleType: "tokens",
pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") )
}
# For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.'
# (in the sentence given above, the words planning and expand are part of the $OrgContextWords macro)
{
ruleType: "tokens",
pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
result: ( Annotate($1, ner, "ORGANIZATION") )
}
# For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....
ENV.defaults["stage"] = 1
{
pattern: ( $TITLES_CORPORATE ),
action: ( Annotate($1, ner, "PERSON_TITLE"))
}
ENV.defaults["stage"] = 2
{
ruleType: "tokens",
pattern: ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
}
The output I get is:
matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNP, ne=PERSON
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION
The output I am expecting is:
matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNP, ne=ORGANIZATION
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION
Also, Bill Edelman does not get identified as a person here: the phrase containing Bill Edelman is not matched at all, although I have a rule in place for it. Do I need to stage my rules so that the entire phrase is matched against each rule, so that I don't miss any entities?
I have produced a jar from the latest Stanford CoreNLP code on the main GitHub page (as of April 14).
This command (with the latest code) should work for using the TokensRegexAnnotator (alternatively the tokensregex settings can be passed into a Properties object if using the Java API):
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
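If you are using the Java API, a sketch of the equivalent setup (the property names mirror the command-line flags above):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// tokensregex runs as its own annotator here, after ner
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
props.setProperty("tokensregex.rules", "example.rules");
props.setProperty("tokensregex.caseInsensitive", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);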
Here is a rule file I wrote that shows matching based on a sentence pattern:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$ORGANIZATION_TITLES = "/inc\.|corp\./"
$COMPANY_INDICATOR_WORDS = "/company|corporation/"
{ pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
{ pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
Note that $0 means the entire pattern, and $1 means the first capture group. So in this example, I put an extra parentheses around the text that represented what I wanted to match.
I ran this on the example: Paragonix Inc. is a company that Joe Smith works for.
This example shows using an extraction from a first round in a second round:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$ORGANIZATION_TITLES = "/inc\.|corp\./"
$COMPANY_INDICATOR_WORDS = "/company|corporation/"
ENV.defaults["stage"] = 1
{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
ENV.defaults["stage"] = 2
{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
This example should work properly on the sentence Joe Smith works for Paragonix Inc.
I am using Elasticsearch v1.3.2.
I would like to switch on logging of slow search execution times, but I would like to filter out searches against marvel.
Elasticsearch seems to say we can customise the logging based on the log4j v1.2 documentation (see the bottom of www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html).
I've had a look at the log4j v1.2 docs (https://logging.apache.org/log4j/1.2/apidocs/index.html and http://wiki.apache.org/logging-log4j/Log4jXmlFormat?highlight=%28filter%29), and it looks as though I should be able to add a StringMatchFilter to the index_search_slow_log_file appender, but everything I try gets rejected.
This is what I expected would work in logging.yml:
index_search_slow_log_file:
type: dailyRollingFile
file: ${path.logs}/${cluster.name}_index_search_slowlog.log
datePattern: "'.'yyyy-MM-dd"
layout:
type: pattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
filter:
type: stringMatchFilter
acceptOnMatch: true
stringToMatch: "marvel"
This gives this exception:
log4j:WARN Failed to set property [filter] to value "stringMatchFilter".
log4j:ERROR Could not instantiate class [true].
java.lang.ClassNotFoundException: true
at ..........
I've also tried:
index_search_slow_log_file:
...
filter:
type: stringMatch
acceptOnMatch: true
stringToMatch: "marvel"
and every other combination I can think of, including removing quotation marks.
Can anyone see what I am doing wrong?
Thanks!
Isabel
Your syntax is a bit incomplete :-), please use the following and see if it works for you. The filter syntax requires an identifier, hence the 1 in my configuration below. Also note that if you want to filter out the "marvel" entries you need acceptOnMatch: false.
filter:
1:
type: org.apache.log4j.varia.StringMatchFilter
StringToMatch: "marvel"
AcceptOnMatch: false
I just realised that the level at which you log to index_search_slow_log_file can be set per index.
So I don't actually need to filter out the marvel logs: I can leave the default at no logging in elasticsearch.yml (i.e. not change it), enable logging to index_search_slow_log_file, and then apply an index-specific override via the index settings API.
elasticsearch.yml: no change
logging.yml:
additivity:
index.search.slowlog: true
...
Index settings API:
PUT /index_name/_settings
{
"index": {
"search": {
"slowlog": {
"threshold": {
"fetch": {
"trace": "0ms",
"info": "500ms",
"warn": "1s"
}
}
}
}
}
}
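For example, the settings above can be applied with curl against a local node (index_name is a placeholder for your real index):

curl -XPUT 'http://localhost:9200/index_name/_settings' -d '{
  "index.search.slowlog.threshold.fetch.trace": "0ms",
  "index.search.slowlog.threshold.fetch.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'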
To add to andrei-stefan's answer, you can also invert the sense of the matching with AcceptOnMatch: true, but this requires adding an explicit DenyAllFilter afterwards. The full example looks like this:
filter:
1:
type: org.apache.log4j.varia.StringMatchFilter
StringToMatch: "my-important-index"
AcceptOnMatch: true
2:
type: org.apache.log4j.varia.DenyAllFilter
I have a 'doc' directory containing HTML documentation, and each HTML file contains placeholders for the application version and the SVN revision:
Welcome to the ... V${version} r${buildNumber}
In my Grails/Gant build script we create a doc package, for which we first copy the doc directory to a staging area before zipping it up. I now want to replace these placeholders with values like this (assume the variables appVersion and svnRevision are set properly):
ant.mkdir(dir: "${baseDocDir}")
ant.copy(todir: "${baseDocDir}") {
fileset(dir: "./src/main/doc", includes: '*.html')
filterset {
filter ( token : 'version' , value: appVersion )
filter ( token : 'buildNumber' , value : svnRevision )
}
}
The copy works but somehow the filter does not!
I can answer the question myself now. The following code works:
ant.copy(todir: "${baseDocDir}") {
filterset(begintoken: "\${", endtoken: "}") {
filter(token: "version", value: appVersion)
filter(token: "buildNumber", value: svnRevision)
}
fileset(dir: "./src/main/doc/", includes: "**/*.html")
}
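For anyone wondering why the first version silently did nothing: Ant's filterset defaults to @ as both begintoken and endtoken, so it was looking for @version@ rather than ${version}. Overriding the delimiters makes the filters match the placeholders in the HTML files.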