How to increase NER accuracy of Stanford's CRF for a small training dataset? - stanford-nlp

I am trying to train an NER model for a different domain using Stanford's Named Entity Recognizer module (https://nlp.stanford.edu/software/CRF-NER.html), and the tag set is quite varied. The size of my dataset is around 1100 sentences for training and 150 sentences for testing. The problem I am facing is that my test accuracy comes out very low: 20-40%.
I tried playing with the different features provided in the NERFeatureFactory, but none of them actually help. What could be the real problem here? Is it only because of the limited amount of training data, or am I missing something else as well?
The other thing I want to try is some fine-tuning and validation on the trained model. Is there a way to do so?
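One way to get validation into the loop (a minimal sketch, not an official Stanford workflow) is to hold out part of your 1100 annotated sentences as a development set and write both splits in the tab-separated token/label format that the CRF trainer reads, then point trainFile and testFile in your properties file at them while you experiment with NERFeatureFactory flags. The file names, split ratio and toy corpus below are assumptions.

    import random

    # Each sentence is a list of (token, label) pairs, e.g. [("John", "PERSON"), ("lives", "O"), ...].
    # How you obtain these pairs depends on your own annotation pipeline (hypothetical input here).
    def write_tsv(sentences, path):
        """Write sentences in the tab-separated token<TAB>label format used by Stanford's CRFClassifier."""
        with open(path, "w", encoding="utf-8") as f:
            for sentence in sentences:
                for token, label in sentence:
                    f.write(f"{token}\t{label}\n")
                f.write("\n")  # blank line separates sentences

    def train_dev_split(sentences, dev_fraction=0.1, seed=42):
        """Hold out a fraction of sentences for validation/tuning (the split ratio is an arbitrary choice)."""
        random.seed(seed)
        shuffled = sentences[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - dev_fraction))
        return shuffled[:cut], shuffled[cut:]

    # Example usage with a toy corpus:
    corpus = [[("John", "PERSON"), ("lives", "O"), ("in", "O"), ("Paris", "LOCATION")]]
    train, dev = train_dev_split(corpus)
    write_tsv(train, "train.tsv")  # point trainFile at this in your .prop file
    write_tsv(dev, "dev.tsv")      # point testFile at this to compare feature configurations

That way the 150-sentence test set stays untouched until you have settled on a feature configuration.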

Related

How to train on very small data set?

We are trying to understand the underlying model of Rasa (the forums there still haven't given us an answer) on two main questions:
We understand that the Rasa model is a transformer-based architecture. Was it pre-trained on any data set (e.g. Wikipedia)?
Then, if we understand correctly, the intent classification is a fine-tuning task on top of that transformer. How come it works with such small training sets?
appreciate any insights!
thanks
Lior
The transformer model is not pre-trained on any dataset. We use quite a shallow stack of transformer layers, which is not as data-hungry as the deeper stacks used in large pre-trained language models.
Having said that, there isn't an exact number of data points that will be sufficient for training your assistant, as it varies with the domain and the problem. Usually a good estimate is 30-40 examples per intent; a rough way to check this is sketched below.
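As a sanity check against that 30-40 examples-per-intent estimate, you can count the examples per intent in your NLU training data. The sketch below assumes Rasa's YAML NLU format and a file named nlu.yml; the file name and the threshold are assumptions, so adapt them to your project.

    import yaml  # pip install pyyaml

    # Count training examples per intent in a Rasa-style NLU file (file name is an assumption).
    with open("nlu.yml", encoding="utf-8") as f:
        nlu = yaml.safe_load(f)

    counts = {}
    for block in nlu.get("nlu", []):
        if "intent" in block:
            # `examples` is a multi-line string where each example starts with "- "
            examples = [line for line in block.get("examples", "").splitlines()
                        if line.strip().startswith("- ")]
            counts[block["intent"]] = len(examples)

    for intent, n in sorted(counts.items(), key=lambda kv: kv[1]):
        flag = "  <- below ~30 examples" if n < 30 else ""
        print(f"{intent}: {n}{flag}")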

Keras «Powerful image classification with little data»: disparity between training and validation

I followed this post and first made it work on the «Cats vs dogs» dataset. Then I replaced this set with my own images, which show the presence of an object vs. the absence of that object. My dataset is even smaller than the one in the post: I only have 496 images containing the object for training and 160 such images for validation. For the «absent» class I have numerous samples (images without that object).
So far I haven't tried class_weight to tackle the imbalanced-data problem; I just randomly chose 496 and 160 images without that object for training and validation, respectively. Basically, I am doing two-class image classification with a smaller dataset using the techniques in the post. I expected somewhat worse performance due to the insufficient data, but the actual problem is that training does not converge, as shown in the figures.
Could you tell me possible reasons for the lack of convergence? I guess the problem is related to my dataset, since the model works perfectly for «cats vs dogs», but I don't know how to address it. Are there any good techniques to make it converge?
Thank you.
This performance plot is based on VGG16, keeping all layers up to the fully connected layers and training a small fully connected layer with 256 neurons.
This performance plot is also based on VGG16, but with 128 neurons instead of 256. I also set the number of epochs to 80.
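For reference, the setup described above corresponds roughly to the following Keras sketch; the input size, optimizer and the Dropout layer are assumptions on my part, not details taken from the question.

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras import layers, models

    # Frozen VGG16 convolutional base + a small fully connected head with 256 units.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
    base.trainable = False  # keep all convolutional layers fixed

    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),                    # assumption: helps against overfitting on ~500 images
        layers.Dense(1, activation="sigmoid"),  # binary: object present vs. absent
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()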
Based on the suggestions provided so far, I'm thinking of building a customized convnet model to fight the overfitting problem. But how should I do this? One of my worries is that a model with fewer layers will degrade the training performance. Are there any guidelines for designing a good model for little data? Thank you.
Updates:
Now I think I know half the reason for the convergence problem. Actually, I only have 100+ images of my own; the rest were downloaded from Flickr. I thought those images, with centered objects and better quality, would work for the model, but I later found they did not improve the accuracy and even worsened the output class probabilities. After removing these downloaded images, the performance bumped upward a little and the non-convergence was gone. Note that I now use only 64*2 images for training and 48*2 images for testing. I also found that image augmentation could not improve the performance on my dataset: without image augmentation, the training accuracy could reach 1, but if I add some augmentation, the training accuracy is only around 85%. Has anybody had this experience? Why doesn't data augmentation always work? Is it because of our specific dataset? Thank you very much.
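For what it's worth, here is roughly how augmentation is usually wired up in Keras, kept deliberately mild; the transform ranges and directory layout are illustrative assumptions, and overly aggressive rotations or zooms may be one reason augmentation hurts on a small, visually homogeneous set.

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Mild augmentation: small shifts and flips only (ranges are assumptions, not tuned values).
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=True,
    )
    val_datagen = ImageDataGenerator(rescale=1.0 / 255)  # never augment validation data

    # Directory names are hypothetical placeholders.
    train_gen = train_datagen.flow_from_directory(
        "data/train", target_size=(150, 150), batch_size=16, class_mode="binary")
    val_gen = val_datagen.flow_from_directory(
        "data/validation", target_size=(150, 150), batch_size=16, class_mode="binary")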
Your model is working great, but it's "overfitting". It means it's capable of memorizing all your training data without really "thinking". That leads to great training results and bad test results.
Common ways to avoid overfitting are:
More data - If you have little data, the chance of overfitting increases
Fewer units/layers - make the model less capable, so it will stop memorizing and start thinking.
Add dropout to your layers (something that randomly discards part of the activations to prevent the model from being too powerful); a minimal sketch combining these two ideas follows this list.
Do more layers mean more power and performance?
If by performance you mean capability of learning, yes. (If you mean "speed", no)
Yes, more layers mean more power. But too much power leads to overfitting: the model is so capable that it can memorize training data.
So there is an optimal point:
A model that is not very capable will not give you the proper results (both training and test results will be bad)
A model that is too capable will memorize the training data (excellent training results, but bad test results)
A balanced model will learn the right things (good training and test results)
That's exactly why we use test data: it's data that is not presented during training, so the model doesn't learn from it.

Conventions for making Stanford NER CRF training data

I have to build a good CRF-based NER model. I am targeting a vast domain, and the total number of classes I am targeting is 17. Through a lot of experiments I have also put together a feature set (austen.prop) that should work for me, but the NER is not producing good results. I need to know the limitations of CRF-based NER with respect to training data size, etc.
I have searched a lot, but so far I have been unable to find the conventions one should follow when making training data.
(Note: I know how to build the model and use it; I just need to know whether there are any conventions, e.g. that a certain percentage of each target class should be present.)
If anybody can guide me, I would be thankful.
For English, a standard training data set is CoNLL 2003, which has something like 15,000 tagged sentences for 4 classes (ORG, PERSON, LOCATION, MISC).
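As for class percentages, I am not aware of a hard convention, but it is worth checking that each of your 17 classes is reasonably represented in the training file. A quick count over a tab-separated training file (assuming the usual token<TAB>label layout and a hypothetical file name) could look like this:

    from collections import Counter

    # Count how many tokens of each label appear in a Stanford-NER-style training file
    # (token<TAB>label per line, blank line between sentences). The file name is an assumption.
    counts = Counter()
    with open("train.tsv", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank line = sentence boundary
            token, label = line.split("\t")
            counts[label] += 1

    entity_total = max(sum(n for label, n in counts.items() if label != "O"), 1)
    for label, n in counts.most_common():
        if label == "O":
            print(f"{label}\t{n}")
        else:
            print(f"{label}\t{n}\t{100.0 * n / entity_total:.1f}% of entity tokens")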

Good results when training and cross-validating a model, but test data set shows poor results

My problem is that I obtain a model with very good results (in training and cross-validation), but when I test it again (with a different data set) the results are poor.
I have a model that has been trained and tested with cross-validation. The model shows AUC=0.933, TPR=0.90 and FPR=0.04.
Looking at the plots (learning curve for error, learning curve for score, and the deviance curve), I believe there is no overfitting present.
The problem is that when I test this model with a different test data set, I obtain poor results, nothing like my previous ones: AUC=0.52, TPR=0.165 and FPR=0.105.
I used Gradient Boosting Classifier to train my model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750
I used SMOTE to balance the classes. It is a binary model. I vectorized my categorical attributes. I used 75% of my data set for training and 25% for testing. My model has a very low training error and a low test error, so I guess it is not overfitted. Since the training error is very low, I assume there are no outliers in the training and CV-test data sets. What can I do from here to find the problem? Thanks.
If the process generating your datasets is non-stationary, it could cause the behavior you describe.
In that case, the distribution of the dataset you're now using for testing was not represented in the data used for training.
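One quick, if crude, way to check for such a shift is to label each row by which dataset it came from and see whether a classifier can tell the two apart: a cross-validated AUC well above 0.5 suggests the new test set is drawn from a different distribution. A minimal sketch, assuming both sets are already vectorized NumPy arrays:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # X_cv: features used for training/cross-validation; X_new: the later test set (assumed arrays).
    def distribution_shift_auc(X_cv, X_new, seed=0):
        X = np.vstack([X_cv, X_new])
        y = np.concatenate([np.zeros(len(X_cv)), np.ones(len(X_new))])  # 0 = old data, 1 = new data
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
        return scores.mean()  # ~0.5: similar distributions; near 1.0: easily separable, i.e. shifted

    # Example with synthetic data (for illustration only):
    rng = np.random.default_rng(0)
    print(distribution_shift_auc(rng.normal(0, 1, (500, 10)), rng.normal(0.5, 1, (500, 10))))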

How to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes classifier, following this tutorial.
For the training data, I am using 308 questions and categorizing them into 26 manually tagged categories.
Before sending the data, I perform some NLP preprocessing: punctuation removal, tokenization, stopword removal and stemming.
I use this filtered data as input for Mahout.
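For reference, the preprocessing steps described above (punctuation removal, tokenization, stopword removal and stemming) look roughly like this in Python; this is a sketch using NLTK, not the exact pipeline from the question or the tutorial.

    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("punkt_tab", quiet=True)  # needed on newer NLTK versions
    nltk.download("stopwords", quiet=True)

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def preprocess(question):
        """Punctuation removal, tokenization, stopword removal and stemming, as described above."""
        no_punct = question.translate(str.maketrans("", "", string.punctuation))
        tokens = word_tokenize(no_punct.lower())
        return [stemmer.stem(t) for t in tokens if t not in stop_words]

    print(preprocess("How do I reset my password on the portal?"))
    # e.g. ['reset', 'password', 'portal']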
Using Mahout's NBC, I train on this data and get the model file. Now when I run the
mahout testnb
command, I get Correctly Classified Instances of 96%.
For my test data, I am using 100 questions which I have manually tagged. When I use the trained model with this test data, I get Correctly Classified Instances of 1%.
This is pissing me off.
Can anyone tell me what I am doing wrong, or suggest some ways to increase the performance of the NBC?
Also, ideally how much question data should I use for training and for testing?
This appears to be the classic problem of "overfitting"... where you get a very high % accuracy on the training set, but a low % in real situations.
You probably need more training instances. Also, there is the possibility that the 26 categories don't correlate with the features you have. Machine learning isn't magical and needs some sort of statistical relationship between the variables and the outcomes. Effectively, what the NBC might be doing here is "memorizing" the training set, which is completely useless for questions outside of that memory.