Online training for Stanford NER - stanford-nlp

I understand that Stanford NER only supports training from a file... is there a way to add more training data at a later stage to update the NER model once it is already trained?
I understand that I can keep all the training datasets from the past and retrain the model, but I am wondering if there is a way to update the NER model rather than retraining it from scratch.

For the larger audience: Stanford NER does not support online training. Marking this question as closed.
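As a practical workaround, the usual approach is to append the new annotations to the old training file and retrain. A minimal sketch, assuming the standard CLI invocation from the CRF FAQ; the file names (old.tsv, new.tsv, combined.tsv) and the prop file are hypothetical placeholders:

```python
import subprocess

# Hypothetical file names: old.tsv is the original training data and
# new.tsv the newly collected annotations, both in the two-column
# token<TAB>label format that CRFClassifier expects.
with open("combined.tsv", "w", encoding="utf-8") as out:
    for path in ("old.tsv", "new.tsv"):
        with open(path, encoding="utf-8") as f:
            out.write(f.read().rstrip("\n") + "\n\n")  # blank line between blocks

# Standard training invocation from the Stanford CRF FAQ, with the
# trainFile property overridden on the command line.
subprocess.run(
    ["java", "-cp", "stanford-ner.jar",
     "edu.stanford.nlp.ie.crf.CRFClassifier",
     "-prop", "austen.prop",
     "-trainFile", "combined.tsv",
     "-serializeTo", "updated-ner-model.ser.gz"],
    check=True,
)
```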

Related

If I train a custom tokenizer on my dataset, can I still leverage pre-trained model weights?

The title reads as a statement, but I'm not sure it is correct. Let me elaborate.
I have a considerably large dataset (23 GB). I'd like to pre-train RoBERTa-base or XLM-RoBERTa-base so the language model fits my data better for use in further downstream tasks.
I know I can just run pre-training against my dataset for a few epochs and get good results. But what if I also train the tokenizer to generate new vocab and merges files? Will the weights from the pre-trained model I started from still be usable, or will the new set of tokens demand complete training from scratch?
I'm asking because maybe some layers can still contribute knowledge, so the final model would get the best of both worlds: a tokenizer that fits my dataset, and the weights from previous training.
Does that make sense?
In short, no.
You cannot use your own pretrained tokenizer with a pretrained model. The reason is that your tokenizer's vocabulary differs from the vocabulary of the tokenizer that was used to pretrain the model. Thus a word-piece token present in your tokenizer's vocabulary may not be present in the pretrained model's vocabulary at all, and even tokens that appear in both will generally map to different IDs, so the pretrained embedding matrix no longer lines up.
Detailed answers can be found here.
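To see the mismatch concretely, here is a small sketch using transformers' train_new_from_iterator; the toy corpus is a stand-in for the real 23 GB dataset:

```python
from transformers import AutoTokenizer

# Retrain roberta-base's tokenizer on a toy corpus and compare the
# token IDs it produces with those of the original tokenizer.
old_tok = AutoTokenizer.from_pretrained("roberta-base")
corpus = ["de novo mutations observed in clinical exome data"] * 1000
new_tok = old_tok.train_new_from_iterator(corpus, vocab_size=5000)

text = " mutations"
print(old_tok.encode(text, add_special_tokens=False))  # IDs that index roberta-base's embedding matrix
print(new_tok.encode(text, add_special_tokens=False))  # different IDs: the old embedding rows no longer correspond
```

If you only need a handful of new domain tokens, tokenizer.add_tokens plus model.resize_token_embeddings lets you keep the pretrained weights and learn just the new embedding rows; a wholly new vocabulary, however, effectively forces pretraining from scratch.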

How to create a training pipeline for Hugging Face BERT base uncased clinical NER

The current BERT base uncased clinical NER model predicts clinical entities (Problem, Test, Treatment).
I want to train on a different clinical dataset to get entities like Disease, Medicine, and Problem.
How can I achieve that?
Model
There are several models on Hugging Face that are trained on medical-specific articles; those will very likely perform better than plain bert-base-uncased. BioELECTRA is one of them, and it managed to outperform existing biomedical NLP models in several benchmark tests.
There are three different versions of those models depending on their pretraining dataset, but I think these two will be the best to start with:
bioelectra-base-discriminator-pubmed: pretrained on PubMed
bioelectra-base-discriminator-pubmed-pmc: pretrained on PubMed and PMC
NER Datasets:
Now, coming to the NER dataset: there are several datasets you might like, or you might want to create a composite dataset. Some of these are:
BC5-disease, NCBI-disease, BC5CDR-disease from BLUE benchmark
[Let me know if you need any help with model creation or with setting up the fine-tuning. Also, please use proper metrics to evaluate the models, and do share the metrics dashboard after the run finishes.]
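A minimal fine-tuning sketch under some assumptions: kamalkraj/bioelectra-base-discriminator-pubmed is the hub id I believe hosts the BioELECTRA checkpoint, and the ncbi_disease dataset stands in for whichever corpus from the list above you pick:

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

checkpoint = "kamalkraj/bioelectra-base-discriminator-pubmed"  # assumed hub id
ds = load_dataset("ncbi_disease")  # tokens + ner_tags (O, B-Disease, I-Disease)
labels = ds["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_and_align(batch):
    # Tokenize pre-split words and copy each word's tag to its first
    # sub-token; remaining sub-tokens get -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for w in enc.word_ids(batch_index=i):
            row.append(-100 if w is None or w == prev else tags[w])
            prev = w
        enc_labels.append(row)
    enc["labels"] = enc_labels
    return enc

tokenized = ds.map(tokenize_and_align, batched=True,
                   remove_columns=ds["train"].column_names)

model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bioelectra-ner",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

For Medicine-type entities you would swap in or merge a drug/chemical corpus (BC5CDR also carries chemical annotations) and extend the label list accordingly.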

How to increase NER accuracy of Stanford's CRF for a small training dataset?

I am trying to train an NER model for a different domain using Stanford's Named Entity Recognizer module (https://nlp.stanford.edu/software/CRF-NER.html), and the tag set is quite varied. The size of my dataset is around 1,100 sentences for training and 150 sentences for testing. The problem I am facing is that my test accuracy comes out very low: 20-40%.
I tried playing with the different features provided in the NERFeatureFactory, but none of them are actually helping. What could be the actual problem here? Is it only because of the limited amount of training data, or am I missing something else as well?
The other thing I want to try is some fine-tuning and validation on the trained model. Is there a way to do so?
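For reference, feature experiments with NERFeatureFactory are driven from a properties file. A sketch along the lines of the sample austen.prop in the CRF FAQ, with hypothetical file names; treat the exact flag set as illustrative rather than a tuned recipe for small data:

```
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
# character n-grams often help most when training data is scarce
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
# word-shape features generalize beyond the words seen in training
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```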

Training non-English Stanford NER models

I'm seeing several posts about training the Stanford NER for other languages,
e.g.: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRFClassifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same JAR file?
https://nlp.stanford.edu/software/crf-faq.html
Training an NER classifier is language-independent. You have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English, but in German all nouns are capitalized, which makes this feature less useful.
In Stanford NER you can decide which features the classifier uses, and therefore you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
I hope I could clarify some things.
I agree with the previous answer that the NER classification model is language-independent.
If you have issues with training data, I can suggest this link with a large number of labeled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK, a library for the Estonian language, which can also fit language-independent NER models (documentation).
Also, here you can find an example of how to train an NER model using spaCy; a sketch in that spirit follows below.
I hope it helps. Good luck!
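A minimal spaCy sketch of the same language-independence point: a blank German pipeline with only an NER component, trained on two hypothetical sentences (no POS tagger involved):

```python
import spacy
from spacy.training import Example

# Two hypothetical German training sentences with character-offset spans.
TRAIN_DATA = [
    ("Angela Merkel besuchte Berlin.",
     {"entities": [(0, 13, "PER"), (23, 29, "LOC")]}),
    ("Siemens hat seinen Sitz in München.",
     {"entities": [(0, 7, "ORG"), (27, 34, "LOC")]}),
]

nlp = spacy.blank("de")        # no language-specific components at all
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(20):
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)
```

Only the tokenizer and the training data are language-specific here; the training loop itself is identical for any language.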

Conventions for making Stanford NER CRF training data

I have to build a good CRF-based NER model. I am targeting a vast domain, and the total number of classes I am targeting is 17. Through a lot of experiments I have also built a good feature set (austen.prop) that should work for me, yet the NER model is not producing good results. I need to know the limitations of CRF-based NER in the context of training data size, etc.
I have searched a lot, but so far I have been unable to find the conventions one should follow when making training data.
(Note: I know perfectly well how to build and use the model; I just need to know whether there are conventions, such as a certain percentage of each target class having to be present, etc.)
If anybody can guide me, I would be thankful.
For English, a standard training dataset is CoNLL 2003, which has something like 15,000 tagged sentences for 4 classes (ORG, PERSON, LOCATION, MISC).
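As for the file format itself, the CRF FAQ expects one token per line with the token and its class separated by a tab, and a blank line between sentences. A toy example with illustrative labels:

```
Jane	PERSON
Austen	PERSON
was	O
born	O
in	O
Steventon	LOCATION
.	O
```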
