Training non-English Stanford NER models - stanford-nlp

I'm seeing several posts about training the Stanford NER for other languages.
eg: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF classifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same JAR file?
https://nlp.stanford.edu/software/crf-faq.html

Training a NER classifier is language-independent: you have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English, but in German all nouns are capitalized, which makes this feature less useful.
In Stanford NER you can decide which features the classifier uses, and therefore you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
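Feature choices like these are controlled through a properties file passed to the CRFClassifier. A minimal sketch, modeled on the austen.prop example from the CRF FAQ (the file names are placeholders); note that no POS-based features are enabled here:

```
# Training data and output model (placeholder paths)
trainFile = my-language.train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1

# Feature toggles: word identity, character n-grams, context, word shape
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```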
I hope I could clarify some things.

I agree with the previous answer that the NER classification model is language-independent.
If you have trouble finding training data, I can suggest this link with a large number of labeled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK - a library for the Estonian language, but it can also fit language-independent NER models (documentation).
Also, here you can find an example of how to train a NER model using spaCy.
I hope it helps. Good luck!

Related

How to create a gazetteer-based Named Entity Recognition (NER) system?

I have tried my hand at many NER tools (OpenNLP, Stanford NER, LingPipe, DBpedia Spotlight, etc.).
But what has constantly eluded me is a gazetteer/dictionary-based NER system, where my free text is matched against a list of pre-defined entity names and potential matches are returned.
This way I could have various lists like PERSON, ORGANIZATION, etc. I could dynamically change the lists and get different extractions. This would tremendously decrease training time (since most of these tools are based on maximum-entropy models, using them generally involves tagging a large dataset, training the model, etc.).
I have built a very crude gazetteer-based NER system using an OpenNLP POS tagger, from which I take all the proper nouns (NNP) and look them up in a Lucene index created from my lists. This, however, gives me a lot of false positives. For example, if my Lucene index has "Samsung Electronics" and my POS tagger gives me "Electronics" as an NNP, my approach returns "Samsung Electronics" since I am doing partial matches.
I have also read people talking about using gazetteer as a feature in CRF algorithms. But I never could understand this approach.
Can any of you guide me towards a clear and solid approach that builds NER on gazetteer and dictionaries?
I'll try to make the use of gazetteers clearer, as I suspect this is what you are looking for. Whatever training algorithm is used (CRF, maxent, etc.), it takes into account features, which are most of the time:
tokens
part of speech
capitalization
gazetteers
(and much more)
In fact, gazetteer features provide the model with intermediate information that the training step takes into account, without being explicitly dependent on the list of NEs present in the training corpora. Say you have a gazetteer of sports teams: once the model is trained, you can expand the list as much as you want without retraining the model. The model will consider any listed sports team as... a sports team, whatever its name.
In practice:
Use any NER or ML-based framework
Decide what gazetteers are useful (this is maybe the most crucial part)
Assign each gazetteer a relevant tag (e.g. sportteams, companies, cities, monuments, etc.)
Populate gazetteers with large lists of NEs
Make your model take into account those gazetteers as features
Train a model on a relevant corpus (it should contain many NEs from the gazetteers)
Update your list as much as you want
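The steps above can be sketched in Python. A minimal, illustrative implementation of gazetteer membership as a token-level feature (the gazetteer contents, tag names, and feature-dict shape are made up; a real setup would feed these dicts to a CRF trainer such as CRFsuite):

```python
# Sketch: gazetteer lookup as a token-level feature for a CRF/maxent trainer.
# Gazetteers map a tag to a set of lowercased (possibly multi-word) entries.
GAZETTEERS = {
    "SPORT_TEAM": {"real madrid", "manchester united"},
    "CITY": {"madrid", "manchester"},
}

def gazetteer_features(tokens, i, max_len=3):
    """Return gazetteer-membership features for token i.

    Checks every n-gram (up to max_len words) that contains token i,
    so multi-word entries like "Real Madrid" are matched too.
    """
    feats = {}
    for n in range(1, max_len + 1):
        for start in range(max(0, i - n + 1), min(i + 1, len(tokens) - n + 1)):
            span = " ".join(tokens[start:start + n]).lower()
            for tag, entries in GAZETTEERS.items():
                if span in entries:
                    feats[f"in_gazetteer:{tag}"] = True
    return feats

tokens = "Real Madrid beat Manchester United".split()
print(gazetteer_features(tokens, 1))  # features for the token "Madrid"
```

Because the lookup happens at feature-extraction time, adding new names to GAZETTEERS later changes the features the trained model sees without retraining, which is exactly the expansion property described above.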
Hope this helps!
You can try this minimal bash Named-Entity Recognizer:
https://github.com/lasigeBioTM/MER
Demo: http://labs.fc.ul.pt/mer/

Stanford NLP: training DocumentPreprocessor

Does Stanford NLP provide a train method for the DocumentPreprocessor, so I can train it with my own corpora and create my own models for sentence splitting?
I am working with German sentences and need to create my own German model for sentence-splitting tasks. Therefore, I need to train the sentence splitter, DocumentPreprocessor.
Is there a way I can do it?
No. At present, tokenization of all European languages is done by a (hand-written) finite automaton. Machine learning-based tokenization is used for Chinese and Arabic. At present, sentence splitting for all languages is done by rule, exploiting the decisions of the tokenizer. (Of course, that's just how things are now, not how they have to be.)
At present we have no separate German tokenizer/sentence splitter. The current properties file just re-uses the English ones. This is clearly sub-optimal. If someone wanted to produce something for German, that would be great to have. (We may do it at some point, but German development is not currently at the top of the list of priorities.)

Conventions for making Stanford NER CRF training data

I need to build a good CRF-based NER model. I am targeting a vast domain, and the total number of classes I am targeting is 17. Through a lot of experiments I have also built a feature set (austen.prop) that should work for me, but NER is still not producing good results. I need to know the limitations of CRF-based NER in terms of training data size, etc.
I have searched a lot, but so far I have been unable to find the conventions one should follow when creating training data.
(Note: I know exactly how to build a model and use it; I just need to know whether there are conventions, e.g. that some percentage of each target class should be present, etc.)
If anybody can guide me, I would be thankful to you.
For English, a standard training data set is CoNLL 2003 which has something like 15,000 tagged sentences for 4 classes (ORG, PERSON, LOCATION, MISC).
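For reference, Stanford NER trains from tab-separated files with one token per line: the token in the first column and its gold class in the second (matching `map = word=0,answer=1` in the properties file), with O marking tokens outside any entity. The sentences and labels below are only an illustration; the class names are up to you:

```
Jane	PERSON
Austen	PERSON
wrote	O
novels	O
.	O

She	O
lived	O
in	O
Bath	LOCATION
.	O
```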

What is the typical way to improve model precision/recall for text classification?

I am working on a data mining project that tries to automatically classify text into categories.
It is multi-class supervised learning; the input features include title and body (both text).
The current accuracy is not good. Could you please advise some methods to improve it?
Here is what I have already tried.
Pre-processing:
Term extraction (could you please suggest a method to extract terms automatically?)
Stopword removal (could you please suggest some stopword sets for English?)
Stemming
Lemmatization
N-grams
Feature selection (Information Gain Ratio)
Algorithms: GBDT, LR, SVM, and others.
There are plenty of tools you can use to extract sensible, linguistically grounded feature types. It depends on what your favourite programming language/environment is, and whether you want a machine learning suite with text mining components in it or a text mining component only.
Have a look at:
Java: Weka (video about text classification), OpenNLP
Python: Scikit-learn and NLTK.
About the stopword lists:
http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
http://www.ranks.nl/stopwords
http://www.textfixer.com/resources/common-english-words.txt
http://norm.al/2009/04/14/list-of-english-stop-words/
http://snowball.tartarus.org/algorithms/english/stop.txt
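As a concrete starting point, here is a minimal scikit-learn sketch combining several of the steps you listed (stopword removal, n-grams, TF-IDF weighting) with one of your algorithms (LR). The toy corpus and labels are made up purely for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: in a real setup, title and body would be concatenated per document.
texts = [
    "cheap flights to paris book now",
    "new phone released with better camera",
    "hotel deals and travel packages",
    "smartphone review battery and screen",
]
labels = ["travel", "tech", "travel", "tech"]

pipeline = Pipeline([
    # TF-IDF with built-in English stopword removal and word uni-/bi-grams
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    # Logistic regression ("LR" from the question); swap in any classifier
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["best camera phone this year"]))
```

From here you can grid-search the vectorizer settings (n-gram range, min_df, sublinear TF) together with the classifier's hyperparameters, which is often where most of the accuracy gains come from.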

Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone maker) instead of apple (the fruit)?
Thanks for the advice!
Well, there are several methods.
I would start by checking for capital letters: usually, when referring to a name, the first letter is capitalized.
Before doing sentiment analysis, I would use part-of-speech tagging and Named Entity Recognition to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with; it will teach you the basic concepts.
An example from CoreNLP shows how the tags can help you.
As described by Ofiris, NER is only one way to solve your problem. I feel it's more effective to use word embeddings to represent your words; that way, the machine automatically recognizes the context of a word. For example, "apple" mostly occurs together with "eat", but if the given input "Apple" occurs with "mobile" or another word from that domain, the machine will understand it's the Apple iPhone instead of the apple fruit. Two popular ways to generate word embeddings are word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText.
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
In the presence of dates, famous brands, VIPs, or historical figures, you can use a NER (named entity recognition) algorithm; in that case, as suggested by Ofiris, Stanford CoreNLP offers a good named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good"), you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know of any freely downloadable library for this purpose.
This problem has already been addressed by many open-source pre-trained NER models. You can also try retraining an existing NER model to fine-tune it for this issue.
You can find a demo of NER results as produced by the spaCy NER here.
