Annotating text with NER: Exception: couldn't read TokensRegexNER - stanford-nlp

I'm trying to annotate text with Stanford CoreNLP v3.9.1 in Java.
The annotators used are: tokenize, ssplit, pos, lemma, ner.
I've included the model jar from https://stanfordnlp.github.io/CoreNLP/download.html.
Both English model jars are included in my project (the standard one and the KBP one).
However, after loading the english.muc.7class.distsim.crf.ser.gz classifier, the following exception is thrown: Couldn't read TokensRegexNER from edu/stanford/nlp/models/kbp/regexner_caseless.tab.
After opening the downloaded model jar stanford-english-kbp-corenlp-2018-02-27-models.jar, I see that the correct path to regexner_caseless.tab is edu/stanford/nlp/models/kbp/english/regexner_caseless.tab (note the english subdirectory).
How do I make Stanford CoreNLP use the correct path?

You are missing the main models jar that comes with the distribution.
stanford-corenlp-2018-02-27-models.jar
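For reference, a minimal sketch of the pipeline setup described above; the jar names in the comment are the ones mentioned in this thread, and the rest is the standard StanfordCoreNLP API:

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NerExample {
    public static void main(String[] args) {
        // All of the jars from the download page must be on the classpath, e.g.:
        //   stanford-corenlp-3.9.1.jar
        //   stanford-corenlp-2018-02-27-models.jar                (main models jar)
        //   stanford-english-kbp-corenlp-2018-02-27-models.jar    (KBP models)
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Stanford University is in California.");
        pipeline.annotate(doc);
    }
}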

Related

Incompatible class of models when using models trained in Maven project A in Maven project B under the same Maven dependency

I have a big Maven project A in which I use the Weka library to train models. Now I have created another small Maven project B with the same Maven dependencies as A, and I want to use the models trained in A.
When I use the SerializationHelper class to read the models and use them to predict new instances in project B, I get errors about an incompatible class of the models (see picture below). I just wonder whether there is a way to use the models trained in A in project B if the Maven dependencies for A and B are the same, or whether I have to retrain the models in B and use them there. Thanks.
New exception
Classifier cls = (Classifier) weka.core.SerializationHelper.read(model);
double clsLabel = cls.classifyInstance(Data.instance(i));
SerializationHelper is supposed to work across different projects and different machines, so your mistake is probably elsewhere. Try the following suggestions.
Your projects A and B may have different Java compiler settings, e.g. A targets Java 5.0 and B targets Java 8.0.
Your model file may be corrupted. Try it first in the project where you saved it, project A.
You may be saving a different object, not a Classifier. Print the class name to System.out and check:
Object objModel = weka.core.SerializationHelper.read(model);
String modelClassName = objModel.getClass().getCanonicalName();
System.out.println(modelClassName);
For Weka classifiers, the ARFF files' headers must be exactly the same. If your training and testing ARFF headers are not identical, you will have problems.
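A short sketch of those checks in code; the file names are placeholders, and SerializationHelper.read, getCanonicalName and Instances.equalHeaders are the standard Weka/Java calls for this:

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelCheck {
    public static void main(String[] args) throws Exception {
        // 1. Check what was actually serialized in project A.
        Object objModel = SerializationHelper.read("model.bin");     // placeholder path
        System.out.println(objModel.getClass().getCanonicalName());  // should name a Classifier implementation

        // 2. Check that the training and test ARFF headers match exactly.
        Instances train = DataSource.read("train.arff");             // placeholder path
        Instances test = DataSource.read("test.arff");               // placeholder path
        System.out.println("Headers equal: " + train.equalHeaders(test));

        // 3. Only then cast and classify.
        Classifier cls = (Classifier) objModel;
        test.setClassIndex(test.numAttributes() - 1);
        System.out.println(cls.classifyInstance(test.instance(0)));
    }
}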

Stanford Ner : build my own model or use RegexNer?

I would like some advice about Stanford NER; I'm wondering what the best way is to detect new entities:
Use RegexNER to detect new entities?
Train my own NER model with the new entities?
Thank you in advance.
If you can easily generate a large list of the type of entity you want to tag, I would suggest using RegexNER. For instance, if you were trying to tag sports teams, it would probably be easier to just compile a large list of sports team names and match them directly. Building a large training set can take a lot of effort.
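For example, a minimal RegexNER setup might look like the sketch below; the mapping file name and its entries are made up for illustration, while the regexner annotator and the regexner.mapping property are the documented way to load such a list:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class RegexNerExample {
    public static void main(String[] args) {
        // sports_teams.txt (hypothetical) is a tab-separated mapping file, one entry per line:
        //   Golden State Warriors<TAB>SPORTS_TEAM
        //   Real Madrid<TAB>SPORTS_TEAM
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        props.setProperty("regexner.mapping", "sports_teams.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}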

How do I create my own annotator on Stanford's CoreNLP?

I am using Stanford's DeepDive project to annotate a huge list of public complaints about specific vehicles. My project is to use the problem descriptions to teach DeepDive how to categorize the problems based on the words in their sentences. For example, if a customer stated something like "the airbag malfunctioned", then DeepDive should be able to tell that this is a safety issue and that they are talking about a part of the car. So what I am trying to do is update Stanford CoreNLP's Named Entity Recognition (NER) list to start finding words like these as well and label them with tags such as "CAR SAFETY ISSUE". Could anybody explain in depth how to go about adding a new annotator so CoreNLP can analyze these sentences based on car parts and general issues?
Thank You
Did you look at the TokensRegexAnnotator? With rules you can extract such expressions and annotate tokens with a custom NER tag:
{
  ruleType: "tokens",
  pattern: (/airbag/ /malfunctioned/),
  result: Annotate($0, ner, 'CAR SAFETY ISSUE')
}
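If you save rules like the one above in a file, they can be wired into the pipeline through the tokensregex annotator; the property names below follow the TokensRegex documentation, and the rules file name is a placeholder:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokensRegexExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // car_issues.rules is a placeholder file containing rules like the one above.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", "car_issues.rules");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}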
@Blaise is correct that this sounds like a good fit for TokensRegex. However, if you do want to create a custom annotator, the process is laid out at http://nlp.stanford.edu/software/corenlp-faq.shtml#custom.
At a high level, you want to create a class inheriting from Annotator and implementing a 2-argument constructor MyClass(String name, Properties props). Then, in your properties file you pass into CoreNLP, you should specify customAnnotatorClass.your_annotator_name = your.annotator.Class. You can pass properties to this annotator in the usual way, by specifying your_annotator_name.key = value.
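A rough sketch of such a class, assuming the Annotator interface of CoreNLP 3.8+; the class name, the CAR_SAFETY_ISSUE label, and the single hard-coded keyword are only illustrative:

import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import edu.stanford.nlp.ling.CoreAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.Annotator;

public class CarIssueAnnotator implements Annotator {

    public CarIssueAnnotator(String name, Properties props) {
        // read your_annotator_name.key = value properties here if needed
    }

    @Override
    public void annotate(Annotation annotation) {
        // Tag every "airbag" token with an illustrative custom NER label.
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            if (token.word().equalsIgnoreCase("airbag")) {
                token.set(CoreAnnotations.NamedEntityTagAnnotation.class, "CAR_SAFETY_ISSUE");
            }
        }
    }

    @Override
    public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
        return Collections.singleton(CoreAnnotations.NamedEntityTagAnnotation.class);
    }

    @Override
    public Set<Class<? extends CoreAnnotation>> requires() {
        return Collections.singleton(CoreAnnotations.TokensAnnotation.class);
    }
}

It would then be registered with customAnnotatorClass.carissue = your.package.CarIssueAnnotator and added to the annotators list as carissue.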

Why does Stanford CoreNLP NER-annotator load 3 models by default?

When I add the "ner" annotator to my StanfordCoreNLP object pipeline, I can see that it loads 3 models, which takes a lot of time:
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [10.3 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [10.1 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [6.5 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Is there a way to just load a subset that will work equally? Particularly, I am unsure why it is loading the 3-class and 4-class NER models when it has the 7-class model, and I'm wondering if not loading these two will still work.
You can set which models are loaded in this manner:
command line:
-ner.model model_path1,model_path2
Java code:
props.put("ner.model", "model_path1,model_path2");
Where model_path1 and model_path2 should be something like:
"edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz"
The models are applied in layers. The first model is run and its tags are applied. Then the second, the third, and so on. If you want fewer models, you can put one or two models in the list instead of the default three, but this will change how the system performs.
If you set "ner.combinationMode" to "HIGH_RECALL", all models will be allowed to apply all of their tags. If you set "ner.combinationMode" to "NORMAL", then a future model cannot apply any tags set by previous models.
All three models in the default were trained on different data. For instance, the 3-class was trained with substantially more data than the 7-class model. So each model is doing something different and their results are all being combined to create the final tag sequence.
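Putting the two properties from this answer together in code, for example to load only the 7-class MUC model (property names as given above):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SingleNerModel {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Load just one model instead of the default three; this changes what gets tagged.
        props.setProperty("ner.model",
            "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz");
        // Only relevant when more than one model is listed.
        props.setProperty("ner.combinationMode", "NORMAL");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}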

New entities discovery from text

I'm working on discovering new entities in text and was wondering if Stanford NLP can be used for this purpose.
Actually, what I know is that Stanford NER requires trained classifiers to recognize entities, but if I'm not wrong it will only detect already-known entities. For example, if your model contains "stanford is a good university" and Stanford is already a known entity, then if I try "fooo is a good university" it won't recognize "fooo" as a new entity.
This project should be of interest to you:
http://nlp.stanford.edu/software/patternslearning.shtml
OK, if JavaScript is fine for you (Node.js/browser), please see: http://github.com/redaktor/nlp_compromise/
This is a "no training" solution. I worked especially on NER (named entity recognition) over the last few days; I just described it here: Named entity recognition with a small data set (corpus).
Feel free to ask me about it in the GitHub issues, because I have not documented the new methods yet (no time, and still working on it).
