Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP? - stanford-nlp

I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a very useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a Scala REPL session showing how I'm trying to set up an annotation pipeline.
# import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
# import edu.stanford.nlp.util.PropertiesUtils.asProperties
# val pipe = new StanfordCoreNLP(asProperties(
"customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
"annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
"ner.combinationMode", "HIGH_RECALL",
"tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP#2ce6a051
# val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
# pipe.annotate(doc)
# doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
# doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?
Also, is there a way to tell, from the document, two adjacent TokensRegex matches apart from a single match (assuming I have a more complicated set of regexes; in the current example I match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using regexner with a TokensRegex rule, as described at https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't get that working.
Since I'm working in Scala, I'm happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.

Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. We are aiming to release Stanford CoreNLP 3.9.2 fairly soon, and it will include these changes.
This page gives an overview of the full NER pipeline run by the NERCombinerAnnotator:
https://stanfordnlp.github.io/CoreNLP/ner.html
There is also a detailed write-up on TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically, what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This runs a TokensRegex sub-annotator as part of the full named entity recognition process. When the final entity mentions step runs, it operates on the rule-extracted named entities and creates entity mentions from them.
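To make the relationship between token-level tags and mentions concrete, here is a simplified Python sketch that merges runs of adjacent tokens sharing the same tag. It is only an illustration of the idea, not CoreNLP's actual entity mentions code (which does more than this), and the helper name group_mentions is made up for the example. It also shows why, from the tags alone, two adjacent matches with the same tag look like one match.

```python
from itertools import groupby


def group_mentions(tokens, ner_tags, outside="O"):
    """Merge runs of adjacent tokens that share a non-O tag into (tag, text) pairs."""
    mentions = []
    index = 0
    for tag, run in groupby(ner_tags):
        length = len(list(run))
        if tag != outside:
            mentions.append((tag, " ".join(tokens[index:index + length])))
        index += length
    return mentions


# Tokens and tags taken from the REPL session in the question
tokens = ["Adam", "Smith", "lived", "at", "123", "noun", "street",
          "in", "Glasgow", ",", "Scotland"]
ner_tags = ["PERSON", "PERSON", "O", "O", "address", "address", "address",
            "O", "CITY", "O", "COUNTRY"]

for tag, text in group_mentions(tokens, ner_tags):
    print(tag, text)
```

With tag-based grouping like this, two back-to-back address matches would collapse into a single run, so separating them needs extra information (for example, match boundaries recorded by the rule itself).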

Related

How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set (nlu_data.md file), with a sufficient number of training examples under each intent.
Here is an example:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At test time, all sentences from the training file work fine. But if an input query contains a spelling mistake, e.g. hotol/hetel/hotele for the keyword hotel, Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am only allowed to change the training data, and I may not write a custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorFeaturizer. That should already be in the default pipeline described on this page.
I would highly suggest using look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when the entity exists in your look-up table (by searching for typo variations of those entities). There's a whole blog post about it on Rasa.
There's a working implementation of fuzzywuzzy matching as a custom component:
import json
import os

from fuzzywuzzy import process
from nltk.corpus import stopwords
# Import path as in Rasa 1.x; older releases use rasa_nlu.components instead
from rasa.nlu.components import Component

# STOP_WORDS is just a set of stop words from NLTK
STOP_WORDS = set(stopwords.words("english"))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        # Nothing to train: matching relies only on the lookup table
        pass

    def process(self, message, **kwargs):
        entities = list(message.get('entities'))

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        # The with block closes the file, so no explicit close() is needed
        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')
        for token in tokens:
            if token.text not in STOP_WORDS:
                # Here `process` is the fuzzywuzzy module, not this method
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value']
                    if isinstance(a, dict) else a,
                    limit=10)
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)
I didn't implement it myself; it was implemented and validated here: Rasa forum
You then just add it to your NLU pipeline in the config.yml file.
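If you only want to see the core idea of fuzzy look-up without Rasa or fuzzywuzzy installed, here is a minimal sketch using difflib from the Python standard library. This is a swapped-in technique, not what the component above does: the LOOKUP table, the fuzzy_lookup helper and the 0.75 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical look-up table: canonical entity value -> entity type
LOOKUP = {"hotel": "place", "restaurant": "place", "airport": "place"}


def fuzzy_lookup(token, cutoff=0.75):
    """Return (canonical_value, entity_type) for the closest entry, or None."""
    matches = difflib.get_close_matches(
        token.lower(), list(LOOKUP), n=1, cutoff=cutoff)
    if not matches:
        return None
    return matches[0], LOOKUP[matches[0]]


for word in ["hotol", "hetel", "hotele", "Mumbai"]:
    print(word, "->", fuzzy_lookup(word))
```

The cutoff plays the same role as the threshold in the component: set it too low and unrelated words start matching, set it too high and real typos are missed.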
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take is to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data, e.g. Generate misspelled words (typos).
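As a rough idea of what such a tool could look like, here is a small sketch that generates single-edit misspellings of an entity value and prints training lines in the synonym format shown above. The functions typo_variants and training_lines, and the choice of edits (deletions, adjacent swaps, vowel substitutions), are assumptions made for this example and not part of Rasa.

```python
def typo_variants(word):
    """Generate single-edit misspellings of `word`, excluding the word itself."""
    vowels = "aeiou"
    variants = set()
    for i in range(len(word)):
        # Drop one character, but keep the first letter intact
        if i > 0:
            variants.add(word[:i] + word[i + 1:])
        # Swap this character with the next one (not touching the first letter)
        if 0 < i < len(word) - 1:
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        # Replace a vowel with each other vowel
        if word[i] in vowels:
            for vowel in vowels:
                variants.add(word[:i] + vowel + word[i + 1:])
    variants.discard(word)
    return sorted(variants)


def training_lines(word, entity, template):
    """Fill `template` with one [typo](entity:word) annotation per variant."""
    return [
        template.format("[{}]({}:{})".format(typo, entity, word))
        for typo in typo_variants(word)
    ]


for line in training_lines("hotel", "place", "- looking for a {} in Chennai"):
    print(line)
```

In practice you would probably sample a handful of variants per sentence template rather than emit all of them, so the typos don't swamp the correctly spelled examples.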
First of all, add samples for the most common typos for your entities, as advised here.
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not, you need to create a custom component. Otherwise, dealing with only training data is not feasible: you can't create samples for every typo.
Using fuzzywuzzy is one option, but it is generally slow and doesn't solve all the issues.
Universal Encoder is another solution.
There are probably more options for spell correction, but you will need to write code either way.

How to force NER class in Stanford NLP

Historically I have used OpenNLP for natural language processing. I decided to give Stanford NLP a try on my latest project and am running into issues with NER. Specifically, when a specific token is processed (TOKENP in my example), I would like it to classify this as a type of TOKENP.
I have read through the documentation multiple times, read through the response to this related SO post, and I cannot get it to reliably be assigned TOKENP.
Here is the rules file (labels.txt):
TOKENP TOKENP PERSON 5
Here is the input file (tmp.txt):
Michael Scott Dunder Mifflin TOKENP
Here is the command I am using:
java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner' -ner.fine.regexner.mapping labels.txt -outputFormat text -file tmp.txt
And here is the output:
Tokens:
[Text=Michael CharacterOffsetBegin=0 CharacterOffsetEnd=7 PartOfSpeech=NNP Lemma=Michael NamedEntityTag=PERSON]
[Text=Scott CharacterOffsetBegin=8 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Scott NamedEntityTag=PERSON]
[Text=Dunder CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NNP Lemma=Dunder NamedEntityTag=PERSON]
[Text=Mifflin CharacterOffsetBegin=21 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Mifflin NamedEntityTag=PERSON]
[Text=TOKENP CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=TOKENP NamedEntityTag=PERSON]
Extracted the following NER entity mentions:
Michael Scott Dunder Mifflin TOKENP PERSON
I expect the TOKENP as the last token in the input to receive a class of TOKENP based on the rules.
The issue is that the rules system won't break up the entity of "Michael Scott Dunder Mifflin TOKENP".
I could try adding an option that writes a rule-based tag over anything under all circumstances (or see whether such an option already exists).
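Until such an option exists, one workaround is to overwrite the tags yourself after the pipeline has run. The sketch below does this in plain Python on the token and tag lists from the output above. It is post-processing outside CoreNLP, not a CoreNLP feature, and the force_tags helper is hypothetical. Entity mentions already computed by the pipeline would not reflect the change.

```python
def force_tags(tokens, ner_tags, forced):
    """Return a copy of ner_tags where tokens listed in `forced` get the forced tag."""
    return [forced.get(token, tag) for token, tag in zip(tokens, ner_tags)]


# Tokens and tags as reported in the pipeline output above
tokens = ["Michael", "Scott", "Dunder", "Mifflin", "TOKENP"]
ner_tags = ["PERSON", "PERSON", "PERSON", "PERSON", "PERSON"]

print(force_tags(tokens, ner_tags, {"TOKENP": "TOKENP"}))
```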

How to set up training and feature template files for NER? - CRF++

For the problem of named entity recognition:
After tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is the POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate these?
What does the next column represent? A class like PERSON, LOCATION, etc.? And does it have to be in any particular format?
Is there any example of a completed training file and template for NER?
You can find example training and test data in the CRF++ repo here. The training data for noun phrase chunking looks like this:
Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...
The columns are arbitrary in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:
John/B Smith/I ate/O an/O apple/O ./O
In columnar format it would look like this:
John B
Smith I
ate O
an O
apple O
. O
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label, because Inc. is usually the last part of a company name.
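If your annotations are stored as entity spans rather than per-token tags, converting them to IOB columns is mechanical. Here is a minimal sketch. The span format (start and end token indices with the end exclusive, plus a label) and the helper names to_iob and to_columns are assumptions for this example.

```python
def to_iob(tokens, spans):
    """Build IOB tags from (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


def to_columns(tokens, tags):
    """Render one 'token tag' line per token, as in the columnar format above."""
    return "\n".join("{} {}".format(tok, tag) for tok, tag in zip(tokens, tags))


tokens = ["John", "Smith", "ate", "an", "apple", "."]
tags = to_iob(tokens, [(0, 2, "PERSON")])
print(to_columns(tokens, tags))
```

To add a POS column, you would insert the tagger's output between the token and the label on each line, and leave a blank line between sentences.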
Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
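import spacy

# 'en' was a shortcut in older spaCy releases; current releases load a named model
nlp = spacy.load('en_core_web_sm')
names = ['John', 'Jane']  # extend with the rest of your known names
text = nlp("John ate an apple.")
for word in text:
    person = 'O'  # default: not a person
    if str(word) in names:
        person = 'B-PERSON'
    # One line per token: word, POS tag, entity label
    print(str(word), word.pos_, person)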

How do I instruct NER SUTime to resolve-to-future?

I see that there is an option inside of SUTime to resolve ambiguous time references to the future, but I am not sure how to tell NER annotator to do so. For example, when annotating this sentence "let's go out on Friday" (and let's say that today's Sunday), I want SUTime to return next Friday's date, not the previous one, which appears by default, since it's closer to Sunday. Thanks.
You have to provide your own grammar file. You can copy the default one from CoreNLP. It should be located somewhere like stanford-sutime-models-1.3.5.jar:edu/stanford/nlp/models/sutime/english.sutime.txt
Then add the following code to the end of the section that starts with the comment # Final rules to determine how to resolve date:
{
pattern: ( [ $hasTemporal ] ),
action: VTag( $0[0].temporal.value, "resolveTo", RESOLVE_TO_FUTURE)
}
This will tag all temporals to be resolved into the future. Note that there are several predefined rules that resolve some time patterns into the past. You can delete or modify them too.
Then provide the resource path of your file to the TimeAnnotator constructor:
Properties props = new Properties();
props.setProperty("sutime.rules", "edu/stanford/nlp/models/sutime/defs.sutime.txt,PATH_TO_YOUR_RESOURCE_FOLDER/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt");
TimeAnnotator timeAnnotator = new TimeAnnotator("sutime", props);
There is also a small trick with DocDateAnnotation. If you want time patterns like "on Friday at 7pm" to be resolved correctly, you should provide an ISO-formatted datetime (not only a date like YYYY-MM-DD) in the DocDateAnnotation.
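To spell out what "resolve to future" means for the Friday example from the question, here is a small Python sketch of the date arithmetic. It only illustrates the intended semantics and is not how SUTime implements it. The resolve_weekday helper is hypothetical, and treating a same-day match as "one week away" is an assumption.

```python
from datetime import date, datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name, reference, to_future=True):
    """Resolve a weekday name to the nearest such day strictly after/before `reference`."""
    target = WEEKDAYS.index(name.lower())
    if to_future:
        # Days until the next occurrence; 0 becomes 7 so the result is never `reference`
        delta = (target - reference.weekday()) % 7 or 7
        return reference + timedelta(days=delta)
    delta = (reference.weekday() - target) % 7 or 7
    return reference - timedelta(days=delta)


today = date(2024, 1, 7)  # a Sunday
print(resolve_weekday("Friday", today))                   # resolved to the future
print(resolve_weekday("Friday", today, to_future=False))  # resolved to the past

# An ISO-formatted datetime (date plus time), as suggested for the document date
print(datetime(2024, 1, 7, 9, 0).isoformat())
```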

How to get CoreAnnotations.CoNLLDepAnnotation and CoreAnnotations.GovernorAnnotation as annotators through StanfordCorenlp pipeline?

There are standard names for annotators (like tokenize, ssplit, pos), but I am not sure what name should be specified for CoNLLDepAnnotation and GovernorAnnotation, or what other annotators these depend on.
Dependency parsing requires the parse annotator. All the dependency annotations you mentioned are produced by this annotator. The code below prints the semantic graph of each sentence in list format.
for (CoreMap sentenceAnnotation : art.sentences) {
    SemanticGraph deps = sentenceAnnotation.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
    System.out.println(deps.toList());
}
For example, the output for the sentence "Apple even went as far to make an electric guitar version ." will be:
root(ROOT-0, went-3)
nsubj(went-3, Apple-1)
advmod(went-3, even-2)
advmod(far-5, as-4)
advmod(went-3, far-5)
mark(make-7, to-6)
xcomp(went-3, make-7)
det(version-11, an-8)
amod(version-11, electric-9)
compound(version-11, guitar-10)
dobj(make-7, version-11)
punct(went-3, .-12)
Here, for the first token Apple, the relation is nsubj and the governor is the third token, went.
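If what you ultimately need is a per-token relation and governor (the information the CoNLL-style annotations carry), you can also recover it from this list output. Below is a minimal Python sketch that parses lines of the form relation(governor-index, dependent-index). The regular expression and the parse_dependencies helper are assumptions written for this output format, not part of CoreNLP.

```python
import re

# relation(governorWord-governorIndex, dependentWord-dependentIndex)
DEP_LINE = re.compile(r"^(\S+?)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependencies(text):
    """Map each dependent token index to (word, relation, governor_index)."""
    table = {}
    for line in text.strip().splitlines():
        match = DEP_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a dependency line
        relation, _gov_word, gov_index, dep_word, dep_index = match.groups()
        table[int(dep_index)] = (dep_word, relation, int(gov_index))
    return table


output = """
root(ROOT-0, went-3)
nsubj(went-3, Apple-1)
advmod(went-3, even-2)
advmod(far-5, as-4)
advmod(went-3, far-5)
mark(make-7, to-6)
xcomp(went-3, make-7)
det(version-11, an-8)
amod(version-11, electric-9)
compound(version-11, guitar-10)
dobj(make-7, version-11)
punct(went-3, .-12)
"""

table = parse_dependencies(output)
for index in sorted(table):
    word, relation, governor = table[index]
    print(index, word, governor, relation, sep="\t")
```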
