Stanford NER: build my own model or use RegexNER? - stanford-nlp

I would like some advice on Stanford NER. I'm wondering what the best way is to detect new entities:
Use RegexNER to detect new entities?
Train my own NER model with new entities?
Thank you in advance.

If you can easily generate a large list of the type of entity you want to tag, I would suggest using RegexNER. For instance, if you were trying to tag sports teams, it would probably be easier to compile a large list of team names and match them directly. Building a large training set can take a lot of effort.
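To make that concrete: a RegexNER mapping file is a tab-separated list of token patterns and the tag to assign (the file name and team names below are just illustrative):

```
Golden State Warriors	SPORTS_TEAM
Real Madrid	SPORTS_TEAM
New York Yankees	SPORTS_TEAM
```

You then point the pipeline at it with the `regexner.mapping` property (e.g. `regexner.mapping = sports_teams.txt`) and include `regexner` in the `annotators` list, after the standard `ner` annotator.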

Related

ML.NET doesn't support resuming training for ImageClassificationTrainer

I want to continue training the model.zip file with more images without retraining from scratch on the baseline model. How do I do that?
This isn't possible at the moment. ML.NET's ImageClassificationTrainer already uses a pre-trained model, so you're using transfer learning to create your model. Any additions would have to be "from scratch" on the pre-trained model.
Also, looking at the existing trainers that can be re-trained, the ImageClassificationTrainer isn't listed among them.

How do I create my own annotator on Stanford's CoreNLP?

I am using Stanford's DeepDive project to annotate a huge list of public complaints about specific vehicles. My project is to use the problem descriptions and to teach DeepDive to categorize the problems based on the words in their sentences. For example, if a customer stated something like "the airbag malfunctioned", then DeepDive should be able to tell that this is a safety issue and that they are talking about a part of the car. So what I am trying to do is update Stanford CoreNLP's Named Entity Recognition (NER) list to start finding words like these as well and label them with tags such as "CAR SAFETY ISSUE". Could anybody explain in depth how to add a new annotator so CoreNLP could analyze these sentences based on car parts and general issues?
Thank you
Did you look at the TokensRegexAnnotator? With rules you can extract such expressions and annotate tokens with a custom NER tag:
{
  ruleType: "tokens",
  pattern: ( /airbag/ /malfunctioned/ ),
  result: Annotate($0, ner, 'CAR SAFETY ISSUE')
}
@Blaise is correct that this sounds like a good fit for TokensRegex. However, if you do want to create a custom annotator, the process is laid out at http://nlp.stanford.edu/software/corenlp-faq.shtml#custom.
At a high level, you want to create a class inheriting from Annotator and implementing a 2-argument constructor MyClass(String name, Properties props). Then, in your properties file you pass into CoreNLP, you should specify customAnnotatorClass.your_annotator_name = your.annotator.Class. You can pass properties to this annotator in the usual way, by specifying your_annotator_name.key = value.
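Putting the steps above together, a minimal properties file might look like the following (the annotator name `carIssues` and the class `com.example.CarIssueAnnotator` are placeholders, not real CoreNLP classes):

```
# Register the custom annotator class under the name "carIssues"
customAnnotatorClass.carIssues = com.example.CarIssueAnnotator

# Add it to the pipeline after the standard annotators
annotators = tokenize, ssplit, pos, lemma, ner, carIssues

# Properties passed to the annotator's (String name, Properties props) constructor
carIssues.rulesFile = car_issue_rules.txt
```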

Multiple Stanford CoreNLP model files made, which one is the correct one to use?

I made a sentiment analysis model using Stanford CoreNLP's library, so I have a bunch of ser.gz files with names like model-0014-93.73.ser.gz.
I was wondering what model to use in my java code, but based on a previous question,
I just used the model with the highest F1 score, which in this case is model-0014-93.73.ser.gz. And in my java code, I pointed to the model I want to use by using the following line:
props.put("sentiment.model", "/path/to/model-0014-93.73.ser.gz");
However, by referring to just that model, am I excluding the sentiment analysis from the other models that were made? Should I be referring to all the model files to make sure I "covered" all the bases or does the highest scoring model trump everything else?
You should point to only the single highest scoring model. The code has no way to make use of multiple models at the same time.
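If you want to select that file programmatically rather than by eye, here is a small sketch. It assumes the `model-NNNN-F1.ser.gz` naming convention shown above; `BestModelPicker` is a hypothetical helper, not part of CoreNLP:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BestModelPicker {
    // Extract the F1 score embedded in a filename like "model-0014-93.73.ser.gz".
    static double f1Of(String filename) {
        String[] parts = filename.replace(".ser.gz", "").split("-");
        return Double.parseDouble(parts[parts.length - 1]);
    }

    // Return the filename with the highest embedded F1 score.
    static String bestModel(List<String> filenames) {
        return filenames.stream()
                .max(Comparator.comparingDouble(BestModelPicker::f1Of))
                .orElseThrow(() -> new IllegalArgumentException("no models given"));
    }

    public static void main(String[] args) {
        List<String> models = Arrays.asList(
                "model-0001-88.50.ser.gz",
                "model-0014-93.73.ser.gz",
                "model-0009-91.20.ser.gz");
        // Prints the single model you would pass to sentiment.model
        System.out.println(bestModel(models)); // prints model-0014-93.73.ser.gz
    }
}
```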

New entities discovery from text

I'm working on new entity discovery from text and was wondering if Stanford NLP can be used for this purpose.
What I know is that Stanford requires trained classifiers to recognize entities, but if I'm not wrong, it will only detect already-known entities. For example, if your model contains "stanford is a good university" and Stanford is already a known entity, and I try "fooo is a good university", it won't recognize "fooo" as a new entity.
This project should be of interest to you:
http://nlp.stanford.edu/software/patternslearning.shtml
OK, if JavaScript is fine for you (Node.js/browser), please see http://github.com/redaktor/nlp_compromise/
This is a "no training" solution. I worked especially on NER (named entity recognition) over the last few days; I just described it here: Named entity recognition with a small data set (corpus).
Feel free to ask me about it in the GitHub issues, because I did not document the new methods yet (no time, and still working on it).

Train and retrain Stanford tagger using the API

I want to train the Stanford tagger using a corpus which consists of multiple files and will be extended in the future.
Is it possible to update an existing model, or do I have to train on the entire corpus every time?
Are there any examples of how to do the training using the API? The JavaDoc of MaxentTagger only covers training via command line.
Thank you!
At present, you have to train using the entire corpus every time. (Updating a model with additional data is theoretically possible, but it's not something that currently exists and it isn't on our front burner.)
We do all our training of models from the command line. Actually, looking at the code, the train method is private, so you'd need to make it more public to be able to do training from the API. We should fix that, and might try to do so.
If the access level was different, you could create a TaggerConfig and then call this method:
private static void trainAndSaveModel(TaggerConfig config) throws IOException { ... }
But, even then, it currently always saves its built tagger to disk. So, things could do with a bit of reworking to enable this smoothly.
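For reference, command-line training of the tagger looks roughly like this (the props/file names are placeholders; check the MaxentTagger documentation for the full set of training options):

```
java -mx4g -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger \
  -model my-model.tagger \
  -trainFile my-combined-corpus.txt
```

Since there is no incremental update, you would concatenate your corpus files into one training file (or regenerate it) and re-run this command whenever the corpus grows.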
