Override named entity with RegexNER instead of CRF model

I am trying to detect named entities using Stanford CoreNLP in a task.
I have already added the following rule in my RegexNER mapping file:
Train VEHICLE_TYPE 2.0
But it is identifying "Train" as a CRIMINAL_CHARGE entity.
I have added the option ner.applyFineGrained and set it to true; maybe that's why my rule is being overridden by CoreNLP's CRF model.
My question is: how do I add exceptions like this in the RegexNER mapping file, or is there a better approach?

You should use these settings:
# run fine-grained NER with a custom rules file
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.fine.regexner.mapping custom.rules -file example.txt -outputFormat text
You need to make sure to set ner.fine.regexner.mapping to your custom rules file so that it is used instead of the default fine-grained rules, which label things such as CRIMINAL_CHARGE.
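For reference, the mapping file is tab-separated (tabs written as \t below), and the format also allows an optional third column of entity types your rule is permitted to overwrite and an optional fourth column for priority. A minimal sketch of what custom.rules could contain; listing CRIMINAL_CHARGE in the third column is an assumption and is only needed if another rule has already applied that tag:
Train\tVEHICLE_TYPE\tCRIMINAL_CHARGE,MISC\t2.0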

Related

Does SPIED CoreNLP support languages other than English?

I don't know Java at all so I'm struggling a bit to figure out whether SPIED could work with languages other than English.
I've tried substituting the default models jar with the Spanish-specific models jar and overriding the default props with the Spanish-specific props.
Still, edu.stanford.nlp.patterns.GetPatternsFromDataMultiClass seems to be using English-specific annotators.
The command executed was:
java -cp stanford-corenlp-3.9.2.jar:stanford-spanish-corenlp-2018-10-05-models.jar:javax.json.jar:joda-time.jar:jollyday.jar edu.stanford.nlp.patterns.GetPatternsFromDataMultiClass -props patterns/example.properties
Where example.properties contains the properties for the Spanish model (as in the default Spanish properties file) as well as the properties for the patterns module.
That didn't work. Is there any straightforward way to apply the patterns module to other languages?

How to build entitymentions from tokens tagged by the `regexner` annotator?

This question is similar to Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?
I have a set of TokensRegex rules that tag tokens with a different tag than the standard "LOCATION", "PERSON" etc.
The entitymentions annotator is very useful for multi-token named entities. How can I also build entitymentions for token sequences that are tagged by the regexner annotator? They don't appear to be built with standard settings.
I'm using CoreNLP 3.9.2 with the HTTP API.
Thanks for the help.
Here is an example command:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules example.rules -file example.txt -outputFormat text
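For reference, a minimal sketch of what example.rules could contain, following the TokensRegex rules format described in the NER documentation linked below (the SCHOOL tag and the pattern are just illustrations):
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ( /University/ /of/ [ {ner:LOCATION} ] ), action: ( Annotate($0, ner, "SCHOOL") ) }
The first line binds ner to the token's NER field so that Annotate knows which annotation to write; the second line tags the matched token span with SCHOOL.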
Some more info... The ner annotator will run a series of steps:
1. statistical NER
2. numeric sequences and SUTime
3. fine-grained NER (example: LOCATION --> STATE_OR_PROVINCE)
4. additional TokensRegexNER rules
5. additional TokensRegex rules
6. entity building
So after steps 1-5 are run, the entities will be built in step 6 and will carry the tags from your TokensRegex rules.
This is in the current GitHub code and version 3.9.2 (won't work with older versions).
More info here: https://stanfordnlp.github.io/CoreNLP/ner.html

CoreNLP API equivalent to command line?

For one of our projects, we are currently using the syntax analysis component from the command line. We want to move from this approach to the CoreNLP server (for better performance).
Our command line options are as follows:
java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -escaper edu.stanford.nlp.process.PTBEscapingProcessor -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -outputFormat "wordsAndTags,typedDependenciesCollapsed"
I've tried a few things but I didn't manage to find the proper options when using the CoreNLP API (with Python).
For instance, how do we specify that the text is already tokenised?
I would really appreciate any help.
In general, the server calls into CoreNLP rather than the individual NLP components, so the documentation on CoreNLP may be useful. The body of the text being annotated is sent to the server as the POST body; the properties are passed in as URL params. For example, for your case, I believe the following curl command should do the trick (and should be easy to adapt to the language of your choice):
curl -X POST -d "it's split on whitespace" \
'http://localhost:9000/?annotators=tokenize,ssplit,pos,parse&tokenize.whitespace=true&ssplit.eolonly=true'
Note that we're just passing the following properties into the server:
annotators = tokenize,ssplit,pos,parse (specifies that we want the parser, and all its prerequisites).
tokenize.whitespace = true will call the whitespace tokenizer.
ssplit.eolonly = true will split sentences on and only on newlines.
Other potentially useful options are documented on the parser annotator page.
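Since you mentioned Python, here is a minimal sketch of the same request using the third-party requests library, assuming a server already running on localhost:9000. The server also accepts all of its options as a single JSON-encoded properties URL parameter; the property names are exactly the ones from the curl command above.
import json
import requests

# Same properties as the curl example, passed as a JSON-encoded URL parameter.
props = {
    "annotators": "tokenize,ssplit,pos,parse",
    "tokenize.whitespace": "true",
    "ssplit.eolonly": "true",
    "outputFormat": "json",
}

# The raw text goes in the POST body; the properties go in the URL.
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="it's split on whitespace".encode("utf-8"),
)

# With the parse annotator enabled, each sentence carries a "parse" key.
print(resp.json()["sentences"][0]["parse"])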

Stanford NLP - NER & Models

I was looking at the online demo: http://nlp.stanford.edu:8080/ner/process
Try a simple testcase like: John Chambers studied in London (UK) and Mumbai (India).
The 3-class classifier identifies the person, but the 7-class classifier does not. It seems like I need to run the classifier with both models: once to identify Person, Location & Organization, and once just for Currency?
When I run this command it finds all of the appropriate entities on your example:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file sample-sentence.txt -outputFormat text
When you run the NERCombinerAnnotator, which corresponds to the annotator ner, it will run a combination of several models for you automatically.

Extending Stanford NER terms with new terms

We need to add terms to the named entity extraction tables/model in Stanford CoreNLP and can't figure out how. Use case: we need to build up a set of IED terms over time and want the Stanford pipeline to extract those terms when found in text files.
Looking to see if this is something someone has done before.
Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml
Could you specify the tags you want to apply?
To use RegexNER, all you have to do is build a file with one entry per line of the form:
TEXT_PATTERN\tTAG
You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt.
I am assuming that by IED you mean an improvised explosive device (https://en.wikipedia.org/wiki/Improvised_explosive_device)?
So your file might look like:
VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON
(Note: there should not be blank lines between entries; the file should have one entry per line.)
If you then run this command:
java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt
you will tag sample_input.txt
Updating is merely a matter of updating custom_dictionary.txt
One thing to be on the lookout for: it matters whether you put "ner" or "regexner" first in your list of annotators.
If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other.
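If you do need to run regexner after ner, the mapping file format also accepts an optional third column listing the NER tags a pattern is allowed to overwrite, and an optional fourth column for priority. A minimal sketch of such an entry; whether MISC is the tag you actually need to overwrite depends on what the statistical model produced for that span:
VBIED\tIED_TERM\tMISC\t1.0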
