Conventions for making Stanford NER CRF training data - stanford-nlp

I need to build a good CRF-based NER model. I am targeting a broad domain, and the total number of classes I am targeting is 17. Through a lot of experiments I have also put together a feature set (austen.prop) that should work for me, but the NER model is still not producing good results. I need to know the limitations of CRF-based NER in the context of training data size, etc.
I have searched a lot, but so far I have been unable to find the conventions one should follow in making training data.
(Note: I know completely how to build the model and use it; I just need to know whether there are conventions, e.g. that a certain percentage of each target class should exist.)
If anybody can guide me, I would be thankful.

For English, a standard training data set is CoNLL 2003, which has roughly 15,000 tagged sentences for 4 classes (ORG, PERSON, LOCATION, MISC).
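To make the input conventions concrete: the CRF trainer reads tab-separated, one-token-per-line files, with the entity label (or O) in the second column and a blank line between sentences, as described on the crf-faq page linked in a later answer. Below is a minimal sketch of a converter; the sentences and the train.tsv file name are hypothetical.

```python
# Minimal sketch: write annotated sentences in the tab-separated,
# one-token-per-line format that Stanford's CRFClassifier trains on.
# The sentences and file name here are hypothetical.
sentences = [
    [("Jane", "PERSON"), ("Austen", "PERSON"), ("lived", "O"),
     ("in", "O"), ("Bath", "LOCATION"), (".", "O")],
    [("Chawton", "LOCATION"), ("is", "O"), ("a", "O"),
     ("village", "O"), (".", "O")],
]

with open("train.tsv", "w", encoding="utf-8") as f:
    for sentence in sentences:
        for token, label in sentence:
            f.write(f"{token}\t{label}\n")  # word TAB answer
        f.write("\n")  # blank line separates sentences
```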

Related

How to train on very small data set?

We are trying to understand the underlying model of Rasa; the forums there still haven't given us an answer on two main questions:
we understand that the Rasa model is a transformer-based architecture. Was it pre-trained on any data set (e.g. Wikipedia)?
then, if we understand correctly, the intent classification is a fine-tuning task on top of that transformer. How come it works with such small training sets?
Appreciate any insights!
Thanks,
Lior
The transformer model is not pre-trained on any dataset. We use quite a shallow transformer stack, which is not as data-hungry as the deeper stacks used in large pre-trained language models.
That said, there isn't an exact number of data points that will be sufficient for training your assistant, as it varies by domain and problem. A good estimate is usually 30-40 examples per intent.
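As a rough sanity check against that 30-40 heuristic, you can count examples per intent in the NLU training file. A hypothetical sketch, assuming Rasa's YAML NLU format, PyYAML, and a made-up nlu.yml path; the threshold is just the rule of thumb above, not anything Rasa enforces.

```python
# Hypothetical sketch: count training examples per intent in a Rasa-style
# NLU YAML file and flag intents below the ~30-example rule of thumb.
import yaml  # PyYAML

with open("nlu.yml", encoding="utf-8") as f:
    nlu = yaml.safe_load(f)

for item in nlu.get("nlu", []):
    if "intent" not in item:
        continue  # skip synonym/regex/lookup entries
    # In Rasa's format, examples are a literal block of "- ..." lines.
    count = sum(1 for line in item["examples"].splitlines()
                if line.strip().startswith("- "))
    flag = "" if count >= 30 else "  <-- consider adding more examples"
    print(f"{item['intent']}: {count}{flag}")
```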

How to increase NER accuracy of Stanford's CRF for a small training dataset?

I am trying to train an NER model for a different domain using Stanford's Named Entity Recognizer module (https://nlp.stanford.edu/software/CRF-NER.html), and the tag set is quite varied. The size of my dataset is around 1,100 sentences for training and 150 sentences for testing. The problem I am facing is that my test accuracy comes out very low: 20-40%.
I have tried playing with the different features provided in the NERFeatureFactory, but none of them actually help. What could the actual problem be? Is it only the limited amount of training data, or am I missing something else as well?
The other thing I want to try is some fine-tuning and validation of the trained model. Is there a way to do so?

Train a non-English Stanford NER model

I'm seeing several posts about training the Stanford NER for other languages,
e.g.: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF classifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same jar file?
https://nlp.stanford.edu/software/crf-faq.html
Training an NER classifier is language independent. You have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English, but in German all nouns are capitalized, which makes this feature less useful.
In Stanford NER you can decide which features the classifier uses, and therefore you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
I hope I could clarify some things.
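To make the "disabled by default" point concrete, here is a sketch of training with a properties file that uses only language-independent features: word identity, character n-grams, word shape, and label sequences, with no useTags (POS) flag. The feature flags mirror the sample properties file on the crf-faq page linked above; the file paths are placeholders.

```python
# Sketch: write a minimal Stanford NER properties file with no POS-based
# features, then launch training via the CLI. Paths are placeholders.
import subprocess

prop = """\
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
"""

with open("my-ner.prop", "w") as f:
    f.write(prop)

# Assumes stanford-ner.jar is in the working directory.
subprocess.run(["java", "-cp", "stanford-ner.jar",
                "edu.stanford.nlp.ie.crf.CRFClassifier",
                "-prop", "my-ner.prop"], check=True)
```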
I agree with the previous answer that the NER classification model is language independent.
If you have issues with training data, I can suggest this link with a huge number of labeled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK, a library for the Estonian language, though it can also fit language-independent NER models (documentation).
Also, here you can find an example of how to train an NER model using spaCy.
I hope it helps. Good luck!

How to create a gazetteer-based Named Entity Recognition (NER) system?

I have tried my hand at many NER tools (OpenNLP, Stanford NER, LingPipe, DBpedia Spotlight, etc.).
But what has constantly evaded me is a gazetteer/dictionary-based NER system where my free text is matched against a list of pre-defined entity names and potential matches are returned.
This way I could have various lists, like PERSON, ORGANIZATION, etc. I could dynamically change the lists and get different extractions. This would tremendously decrease training time (since most of these tools are based on a maximum entropy model, they generally involve tagging a large dataset, training the model, etc.).
I have built a very crude gazetteer-based NER system using an OpenNLP POS tagger, from which I take all the proper nouns (NP) and then look them up in a Lucene index created from my lists. This, however, gives me a lot of false positives. For example, if my Lucene index has "Samsung Electronics" and my POS tagger gives me "Electronics" as a proper noun, my approach returns "Samsung Electronics", since I am doing partial matches.
I have also read people talking about using gazetteers as a feature in CRF algorithms, but I have never been able to understand this approach.
Can any of you guide me towards a clear and solid approach that builds NER on gazetteers and dictionaries?
I'll try to make the use of gazetteers clearer, as I suspect this is what you are looking for. Whatever training algorithm is used (CRF, maxent, etc.), it takes into account features, which are most of the time:
tokens
part of speech
capitalization
gazetteers
(and much more)
Gazetteer features actually provide the model with intermediate information that the training step takes into account, without the model being explicitly dependent on the list of NEs present in the training corpora. Say you have a gazetteer of sports teams: once the model is trained, you can expand the list as much as you want without retraining the model. The model will consider any listed sports team as... a sports team, whatever its name.
In practice:
Use any NER or ML-based framework
Decide what gazetteers are useful (this is maybe the most crucial part)
Assign each gazetteer a relevant tag (e.g. sportteams, companies, cities, monuments, etc.)
Populate gazetteers with large lists of NEs
Make your model take into account those gazetteers as features
Train a model on a relevant corpus (it should contain many NEs from the gazetteers)
Update your list as much as you want
Hope this helps!
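Separately from using gazetteers as CRF features, the dictionary-only system described in the question is easy to sketch. The snippet below does greedy longest-match lookup over whole token spans, which avoids the partial-match false positives mentioned above ("Electronics" on its own would not match "Samsung Electronics"). The gazetteer contents are hypothetical.

```python
# Sketch of a dictionary-only NER: greedy longest-match of token spans
# against gazetteer entries. Matching whole spans (rather than single
# proper nouns) avoids partial-match false positives.
GAZETTEERS = {  # hypothetical entries
    "ORGANIZATION": {"samsung electronics", "stanford university"},
    "PERSON": {"jane austen"},
}
# Index entries by lower-cased token tuple for exact span lookup.
ENTRIES = {tuple(name.split()): label
           for label, names in GAZETTEERS.items()
           for name in names}
MAX_LEN = max(len(key) for key in ENTRIES)

def tag(tokens):
    """Return (start, end, label) for each longest gazetteer match."""
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in ENTRIES:
                matches.append((i, i + n, ENTRIES[span]))
                i += n  # consume the matched span
                break
        else:
            i += 1  # no entry starts here; move on
    return matches

print(tag("I bought a Samsung Electronics TV .".split()))
# [(3, 5, 'ORGANIZATION')]
```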
You can try this minimal bash Named-Entity Recognizer:
https://github.com/lasigeBioTM/MER
Demo: http://labs.fc.ul.pt/mer/

Stanford NER Tool -- training for a new domain

What is the number of sentences needed to effectively train the CRF for a domain like restaurants (restaurant names, addresses, cuisines) or music (artist name, song name, genre)?
As a point of reference, I believe the CoNLL training data for (location, organization, person, misc) NER has around 14,000 sentences.
It depends a lot on the kind of data you will be tagging and how variable it is. I've worked on a project also involving the restaurant and music domains. In my case we were handling user queries, which tend to be short and don't present that much variability (particularly for restaurants, though not for music, which is a very noisy domain).
For the restaurant domain, training with ~2k sentences was fine, but of course, if you can get more data, your model will be more accurate.
For music, the situation is a bit trickier, since song/band names can be virtually anything. In this case, data alone might not be enough to reach acceptable accuracy. In my project we used ~5k sentences for music, plus many features and some additional post-processing to get things right.
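Since most of these answers come down to how much data each class sees, a quick sanity check is to count labels in the training file. A minimal sketch, assuming the tab-separated format shown earlier and a hypothetical train.tsv path:

```python
# Sketch: per-label token counts (and sentence count) for a Stanford-style
# tab-separated training file (word TAB label, blank line between sentences).
from collections import Counter

labels, sentences = Counter(), 0
with open("train.tsv", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            sentences += 1  # a blank line ends a sentence
            continue
        labels[line.split("\t")[1]] += 1

total = sum(labels.values())
print(f"{sentences} sentences, {total} tokens")
for label, count in labels.most_common():
    print(f"{label:12s} {count:8d}  ({100 * count / total:.1f}%)")
```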
