Chinese Lemmatization in StanfordNLP - stanford-nlp

Is the Stanford CoreNLP package for Chinese able to detect chengyu (成语) and sayings (格言/谚语/惯用语, e.g. 冰冻三尺,非一日之寒)?

Surprised me too! It really does!
The following is generated by the Stanford-NLP pipeline (with Chinese models): tokenize, ssplit, pos, lemma, ner
[
  [
    {
      "category2": null,
      "offset-begin": "0",
      "ner2": "O",
      "lemma2": "冰冻三尺",
      "word2": null,
      "index": "1",
      "index2": "1",
      "lemma": "冰冻三尺",
      "offset-begin2": "null",
      "tag2": "",
      "originalText": "",
      "offset-end": "4",
      "answer": null,
      "pos": "VV",
      "offset-end2": "null",
      "ner": "O",
      "tag": "VV",
      "originalText2": null,
      "category": null,
      "word": "冰冻三尺",
      "value": "冰冻三尺"
    },
    {
      "category2": null,
      "offset-begin": "4",
      "ner2": "O",
      "lemma2": ",",
      "word2": null,
      "index": "2",
      "index2": "2",
      "lemma": ",",
      "offset-begin2": "null",
      "tag2": "",
      "originalText": "",
      "offset-end": "5",
      "answer": null,
      "pos": "PU",
      "offset-end2": "null",
      "ner": "O",
      "tag": "PU",
      "originalText2": null,
      "category": null,
      "word": ",",
      "value": ","
    },
    {
      "category2": null,
      "offset-begin": "5",
      "ner2": "O",
      "lemma2": "非一日之寒",
      "word2": null,
      "index": "3",
      "index2": "3",
      "lemma": "非一日之寒",
      "offset-begin2": "null",
      "tag2": "",
      "originalText": "",
      "offset-end": "10",
      "answer": null,
      "pos": "VV",
      "offset-end2": "null",
      "ner": "O",
      "tag": "VV",
      "originalText2": null,
      "category": null,
      "word": "非一日之寒",
      "value": "非一日之寒"
    }
  ]
]
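For reference, a pipeline like the one that produced this output can be built in Java roughly as follows (a minimal sketch, not from the original post, assuming the Chinese models jar is on the classpath; the class name ChineseLemmaExample is just for illustration):

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.StringUtils;

public class ChineseLemmaExample {
    public static void main(String[] args) {
        // Load the default Chinese configuration shipped in the Chinese models jar.
        Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("冰冻三尺,非一日之寒");
        pipeline.annotate(document);
        // Tokens, lemmas, POS tags, and NER labels can then be read off the
        // annotation, as in the JSON dump above.
    }
}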

Related

Entity Mention Detection is not working properly with TokensRegex

entitymentions doesn't seem to work. I followed a similar approach to the one mentioned here, adding entitymentions as one of the annotators.
How can I detect named entities that have more than 1 word using CoreNLP's RegexNER?
Input: "Here is your 24 USD"
I have a TokensRegex:
{ ruleType: "tokens", pattern: ([{ner:"NUMBER"}] + [{word:"USD"}]), action: Annotate($0, ner, "NEW_MONEY"), result: "NEW_MONEY_RESULT" }
Init Pipeline:
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex,entitymentions");
props.setProperty("tokensregex.rules", "basic_ner.rules");
I still got 2 CoreEntityMentions instead of just 1.
Both of them have the same value for edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation, which is NEW_MONEY,
but they have different edu.stanford.nlp.ling.CoreAnnotations$EntityMentionIndexAnnotation values:
0 for 24
1 for USD
How can they be merged, since they both have the same entity tag annotation?
Version 3.9.2 of the Stanford library is used.
The issue is that numbers have a normalized named entity tag.
Here is a rules file that will work:
# these Java classes will be used by the rules
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normNER = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
# rule for recognizing the new money amounts
{ ruleType: "tokens", pattern: ([{ner:"NUMBER"}] [{word:"USD"}]), action: (Annotate($0, ner, "NEW_MONEY"), Annotate($0, normNER, "NEW_MONEY")), result: "NEW_MONEY" }
You should not add an extra tokensregex annotator and entitymentions annotator at the end. The ner annotator will run these as sub-annotators.
Here is an example command:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules new_money.rules -file new_money_example.txt -outputFormat text
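If you are using the Java API rather than the command line, the equivalent setup might look like the following minimal sketch (the class name NewMoneyExample is illustrative; new_money.rules is the rules file from the command above):

import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NewMoneyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // The ner annotator runs tokensregex and entitymentions as sub-annotators.
        props.setProperty("ner.additional.tokensregex.rules", "new_money.rules");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("Here is your 24 USD");
        pipeline.annotate(document);
        // "24 USD" should now come back as a single mention tagged NEW_MONEY.
        for (CoreEntityMention em : document.entityMentions()) {
            System.out.println(em.text() + "\t" + em.entityType());
        }
    }
}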
More documentation here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
https://stanfordnlp.github.io/CoreNLP/ner.html

Google Cloud Speech API. Help getting Google's own example to work

I'm trying to get hints to work with the Google Cloud Speech API, but I cannot get Google's own example to work for me. I get the same result with and without hints. I believe the following code is what the documentation recommends:
Here is my script:
#!/usr/bin/python
import os
import base64
import googleapiclient.discovery

speech_file = os.path.join(
    os.path.dirname(__file__),
    'resources',
    'shwazil_hoful.flac')

with open(speech_file, 'rb') as speech:
    b64speech = base64.urlsafe_b64encode(speech.read())

service = googleapiclient.discovery.build('speech', 'v1')

service_request = service.speech().recognize(
    body={
        "config": {
            "encoding": "FLAC",  # raw 16-bit signed LE samples
            "sampleRateHertz": 16000,  # 16 kHz
            "languageCode": "en-US",  # a BCP-47 language tag
            "speechContexts": [{
                "phrases": ["hoful", "shwazil"]
            }]
        },
        "audio": {
            "content": b64speech
            # "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
        }
    })

response = service_request.execute()

recognized_text = 'Transcribed Text: \n'
for i in range(len(response['results'])):
    recognized_text += response['results'][i]['alternatives'][0]['transcript']
print(recognized_text)
Output:
it's a swazzle Hopple day
I was expecting:
it's a swazzle hoful day
Is there anything I am doing wrong?
I've tried both Python 2 and Python 3.

How do I setup YAML linting in Arcanist?

I can't figure out how to do custom linting pre-diff in Arcanist (YAML, specifically). The instructions don't explain how to integrate a new linter into my existing .arclint configuration.
I figured this out on my own, and thought I'd share here in case anyone else has this issue.
The following .arclint file does the trick:
{
  "linters": {
    "yaml": {
      "type": "script-and-regex",
      "script-and-regex.script": "yamllint",
      "script-and-regex.regex": "/^(?P<line>\\d+):(?P<offset>\\d+) +(?P<severity>warning|error) +(?P<message>.*) +\\((?P<name>.*)\\)$/m",
      "include": "(\\.yml$)",
      "exclude": [ ]
    }
  }
}
I haven't extensively tested that regex, but it works for my purposes so far.
You can configure yamllint by placing a .yamllint file in the repository root.
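For example, a minimal .yamllint could look like this (a sketch; line-length and indentation are standard yamllint rules, and the values shown are arbitrary):

# .yamllint (hypothetical example configuration)
extends: default

rules:
  line-length:
    max: 120
  indentation:
    spaces: 2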

TokensRegex rules to get correct output for Named Entities

I was finally able to get my TokensRegex code to give some kind of output for named entities. But the output is not exactly what I want. I believe the rules need some tweaking.
Here's the code:
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.NodePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
import edu.stanford.nlp.util.CoreMap;

public class TokensRegexDemo {

    public static void main(String[] args) {
        String rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
        String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt";
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ner.useSUTime", "0");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));

        String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";
        Annotation document = new Annotation(inputText);
        pipeline.annotate(document);

        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);

        /* Next we can go over the annotated sentences and extract the annotated words,
           using the CoreLabel object. */
        for (CoreMap sentence : sentences) {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);
            for (MatchedExpression phrase : matched) {
                // Print out matched text and value
                System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class)) {
                    if (token.tag().equals("NNP")) {
                        String leftContext = token.before();
                        String rightContext = token.after();
                        System.out.println(leftContext);
                        System.out.println(rightContext);
                        String word = token.get(TextAnnotation.class);
                        String lemma = token.get(LemmaAnnotation.class);
                        String pos = token.get(PartOfSpeechAnnotation.class);
                        String ne = token.get(NamedEntityTagAnnotation.class);
                        System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + "ne=" + ne);
                    }
                }
            }
        }
    }
}
And here's the rules file:
$TITLES_CORPORATE = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
$ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)
# For detecting organization names like 'Paragonix Inc.'
{ ruleType: "tokens",
pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") )
}
# For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.'
# (in the sentence above, the words 'planning' and 'expand' are part of the $OrgContextWords macro)
{
ruleType: "tokens",
pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
result: ( Annotate($1, ner, "ORGANIZATION") )
}
# For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....
ENV.defaults["stage"] = 1
{
pattern: ( $TITLES_CORPORATE ),
action: ( Annotate($1, ner, "PERSON_TITLE"))
}
ENV.defaults["stage"] = 2
{
ruleType: "tokens",
pattern: ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
}
The output I get is:
matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=PERSON
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION
The output I am expecting is:
matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=ORGANIZATION
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION
Also, Bill Edelman does not get identified as a person here; the phrase containing Bill Edelman is not matched, although I have a rule in place for it. Do I need to stage my rules so that the entire phrase gets matched against each rule, so that I don't miss out on any entities?
I have produced a jar that represents the latest Stanford CoreNLP on the main GitHub page (as of April 14).
This command (with the latest code) should work for using the TokensRegexAnnotator (alternatively the tokensregex settings can be passed into a Properties object if using the Java API):
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
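If you prefer the Java API, the Properties equivalent of those flags might look like the following sketch (the class name TokensRegexPipelineSetup and the file name example.rules are illustrative):

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokensRegexPipelineSetup {
    public static void main(String[] args) {
        // Mirror the -tokensregex.* command-line flags as pipeline properties.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", "example.rules");
        props.setProperty("tokensregex.caseInsensitive", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Paragonix Inc. is a company that Joe Smith works for.");
        pipeline.annotate(document);
    }
}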
Here is a rule file I wrote that shows matching based on a sentence pattern:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$ORGANIZATION_TITLES = "/inc\.|corp\./"
$COMPANY_INDICATOR_WORDS = "/company|corporation/"
{ pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
{ pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
Note that $0 means the entire pattern, and $1 means the first capture group. So in this example, I put extra parentheses around the text that represented what I wanted to match.
I ran this on the example: Paragonix Inc. is a company that Joe Smith works for.
This example shows using an extraction from a first round in a second round:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$ORGANIZATION_TITLES = "/inc\.|corp\./"
$COMPANY_INDICATOR_WORDS = "/company|corporation/"
ENV.defaults["stage"] = 1
{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
ENV.defaults["stage"] = 2
{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
This example should work properly on the sentence Joe Smith works for Paragonix Inc.

How to load OBJ model with three.js in TypeScript

I am using TypeScript and three.d.ts from DefinitelyTyped. I have no problems using THREE.JSONLoader, but how do I use an OBJLoader from here in a TypeScript project? I probably need to create an OBJLoader.d.ts file, but I have no idea how to do that, or how to use the created definition afterwards. I tried simply copying the THREE.JSONLoader definition and renaming it to OBJLoader, but that didn't work.
The latest Three.js now has ES Module versions of all the classes in the examples/ folder, along with type declaration files. So, now you can:
import {OBJLoader} from 'three/examples/jsm/loaders/OBJLoader'
And it will be typed in TypeScript as expected (hover on it to see tooltips in VS Code).
This answer was correct at the time of posting, but it's out of date now in 2019. See trusktr's answer for a better solution today.
Having looked at the source of the OBJLoader here, (and with reference to three.d.ts) a simple objloader.d.ts file might look like this:
/// <reference path="three.d.ts" />
export class OBJLoader extends EventDispatcher {
    constructor();
    load(url: string, callback?: (response: any) => any): void;
    parse(data: any): any; // Not sure if the return value can be typed. Seems to be a group but I can't find a definition for that in three.d.ts?
}
Caveat: this is quickly hacked together and not tested, but may help you to get started.
You would then reference your objloader.d.ts in the same way you are currently using three.d.ts. Don't forget to include both the three.js and OBJLoader.js files in your HTML page, or import them if you are working with external modules.
Add the libraries to your index.html, or to your angular-cli.json if you're using the Angular CLI:
$ cat angular-cli.json
{
  "project": {
    "version": "1.0.0-beta.16",
    "name": "ssp"
  },
  "apps": [
    {
      "root": "src",
      "outDir": "dist",
      "assets": "assets",
      "index": "index.html",
      "main": "main.ts",
      "test": "test.ts",
      "tsconfig": "tsconfig.json",
      "prefix": "app",
      "mobile": false,
      "styles": [
        "styles.css"
      ],
      "scripts": [
        "../node_modules/three/build/three.js",
        "../node_modules/three/examples/js/controls/VRControls.js",
        "../node_modules/three/examples/js/effects/VREffect.js",
        "../node_modules/webvr-boilerplate/build/webvr-manager.js",
        "../node_modules/dat-gui/vendor/dat.gui.js",
        "../node_modules/stats-js/build/stats.min.js",
        "../node_modules/three/examples/js/controls/OrbitControls.js",
        "../node_modules/three/examples/js/loaders/OBJLoader.js",  <-- add
        "../node_modules/three/examples/js/loaders/MTLLoader.js"   <-- add
      ],
      "environments": {
        "source": "environments/environment.ts",
        "dev": "environments/environment.ts",
        "prod": "environments/environment.prod.ts"
      }
    }
  ],
Then reference the libraries through a cast such as (THREE as any):
var mtlLoader = new (THREE as any).MTLLoader();
mtlLoader.setPath( '../../assets/models' );
mtlLoader.load( 'myProject.mtl', function( materials ) {
    materials.preload();
    var loader = new (THREE as any).OBJLoader();
    loader.setMaterials(materials);
    loader.load( '../../assets/models/myProject.obj', function(object) {
        // ... do stuff
    });
});
You won't get type checking, but it's a quick way to get started until someone adds an entry for the loaders to DefinitelyTyped.
