Stanford Parser models

Stanford CoreNLP contains several models for parsing English sentences.
englishSR
english_SD
english_UD (default for depparse annotator)
englishRNN
englishFactored
englishPCFG (default for parse annotator)
englishPCFG.caseless
wsjRNN
wsjFactored
wsjPCFG
There are some comparisons in the following papers:
http://nlp.stanford.edu/software/stanford-dependencies.shtml#English
http://nlp.stanford.edu/pubs/lrecstanforddeps_final_final.pdf
http://nlp.stanford.edu/pubs/SocherBauerManningNg_ACL2013.pdf
http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf
I couldn't find a full description and comparison of all the models. Does one exist anywhere? If not, I think it would be worth creating.

I can't give a full list (maybe Chris will chime in?), but my understanding is that these models are:
englishSR: The shift-reduce model, trained on various standard treebanks plus some of Stanford's hand-annotated data. This is the fastest and most accurate model we have, but the model file is huge, so it is slow to load and memory-hungry.
english_SD: The NN Dependency Parser model for Stanford Dependencies. Deprecated in favor of english_UD -- the Universal Dependencies model.
english_UD: The NN Dependency Parser model for Universal Dependencies. This is the fastest and most accurate way to get dependency trees, but it won't give you constituency parses.
englishRNN: The hybrid PCFG + Neural constituency parser model. More accurate than any of the constituency parsers other than the shift-reduce model, but also noticeably slower.
englishFactored: Not 100% sure what this is, but my impression is that in both accuracy and speed it sits between englishPCFG and englishRNN.
englishPCFG: A regular old PCFG model for constituency parsing. Fast to load, and faster than any of the constituency models other than the shift-reduce parser, but its accuracy is somewhat mediocre by modern standards. Nonetheless, a good default.
englishPCFG.caseless: A caseless version of the PCFG model.
I assume the wsj* models are there to reproduce numbers in papers (trained on the proper WSJ splits), but again I'm not 100% sure what they are.
To help choose the right model based on speed, accuracy, and the memory it needs (speeds relative to englishPCFG; a usage sketch follows the list):
englishSR: ~10x speed, accurate, high memory
englishPCFG: 1x speed (baseline), OK accuracy, low memory
englishRNN: ~0.25x speed, accurate, low memory
english_UD: ~100x speed, accurate, low memory, dependency parses only
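To make choosing a model concrete, here is a minimal sketch of selecting a parser model at annotation time via the stanza Python client (it assumes a local CoreNLP installation; the englishPCFG path is the stock default, which you could swap for the SR or RNN model):

```python
from stanza.server import CoreNLPClient  # pip install stanza; needs CoreNLP installed locally

# parse.model selects the parser model; swap in the englishSR or englishRNN
# model path to trade load time and memory for speed/accuracy.
props = {
    "annotators": "tokenize,ssplit,pos,parse",
    "parse.model": "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
}

with CoreNLPClient(properties=props, memory="4G", timeout=60000) as client:
    ann = client.annotate("Stanford parsers trade speed for accuracy.")
    print(ann.sentence[0].parseTree)  # constituency tree from the chosen model
```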

Related

Train a non-English Stanford NER model

I'm seeing several posts about training the Stanford NER for other languages.
eg: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF classifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same JAR file?
https://nlp.stanford.edu/software/crf-faq.html
Training an NER classifier is language-independent. You have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English, but in German all nouns are capitalized, which makes this feature less useful.
In Stanford NER you can decide which features the classifier uses, so you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
I hope I could clarify some things.
I agree with the previous comment that NER classification is language-independent.
If you have trouble with training data, I can suggest this link to a large collection of labeled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK, a library for the Estonian language, which can also fit language-independent NER models (documentation).
Also, here you can find an example of how to train an NER model using spaCy (see the sketch below).
I hope it helps. Good luck!
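Since spaCy comes up above, here is a minimal, language-independent training sketch using the spaCy 3 API. The sentence, entity offsets, label names, and epoch count are illustrative toys; real training needs thousands of annotated examples:

```python
import spacy                      # spaCy 3.x
from spacy.training import Example

# Toy annotations: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("Siddharta wohnt in Berlin.", {"entities": [(0, 9, "PERSON"), (19, 25, "LOC")]}),
]

nlp = spacy.blank("de")           # language-independent: use any language code here
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in TRAIN_DATA]
nlp.initialize(lambda: examples)
for _ in range(20):               # a few epochs suffice for this toy set
    nlp.update(examples)

print(nlp("Siddharta wohnt in Berlin.").ents)
```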

How to create a gazetteer-based Named Entity Recognition (NER) system?

I have tried my hand at many NER tools (OpenNLP, Stanford NER, LingPipe, DBpedia Spotlight, etc.).
But what has constantly evaded me is a gazetteer/dictionary-based NER system where my free text is matched against a list of pre-defined entity names, and potential matches are returned.
This way I could have various lists like PERSON, ORGANIZATION, etc. I could dynamically change the lists and get different extractions. This would tremendously decrease training time, since most of these tools are based on maximum-entropy models and generally require tagging a large dataset, training the model, and so on.
I have built a very crude gazetteer-based NER system using an OpenNLP POS tagger: I take all the proper nouns (NNP) it finds and look them up in a Lucene index created from my lists. This, however, gives me a lot of false positives. For example, if my Lucene index has "Samsung Electronics" and my POS tagger gives me "Electronics" as a proper noun, my approach returns "Samsung Electronics" since I am doing partial matches.
I have also read about people using gazetteers as features in CRF algorithms, but I could never understand this approach.
Can any of you guide me towards a clear and solid approach that builds NER on gazetteer and dictionaries?
I'll try to make the use of gazetteers clearer, as I suspect this is what you are looking for. Whatever training algorithm is used (CRF, maxent, etc.), it takes into account features, which are most of the time:
tokens
part of speech
capitalization
gazetteers
(and much more)
Gazetteer features provide the model with intermediate information that the training step takes into account, without making the model explicitly dependent on the list of NEs present in the training corpora. Say you have a gazetteer of sports teams: once the model is trained, you can expand the list as much as you want without retraining. The model will consider any listed sports team as... a sports team, whatever its name.
In practice:
Use any NER or ML-based framework
Decide which gazetteers are useful (this is maybe the most crucial part)
Assign each gazetteer a relevant tag (e.g. sportteams, companies, cities, monuments, etc.)
Populate the gazetteers with large lists of NEs
Make your model take those gazetteers into account as features (see the sketch below)
Train a model on a relevant corpus (it should contain many NEs from the gazetteers)
Update your lists as much as you want
Hope this helps!
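To make the "gazetteers as features" step concrete, here is a minimal sketch of a gazetteer-membership feature in a CRF using the sklearn-crfsuite package. The gazetteer, labels, and single training sentence are illustrative stand-ins:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

SPORT_TEAMS = {"ajax", "juventus", "barcelona"}   # hypothetical gazetteer; extend at will

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # The gazetteer feature: the model learns what membership implies,
        # so teams added to the set later are still recognized without retraining.
        "in_sportteam_gazetteer": word.lower() in SPORT_TEAMS,
    }

def sent2features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Toy training data: one tokenized sentence with IOB labels.
X_train = [sent2features(["Ajax", "beat", "Juventus", "yesterday", "."])]
y_train = [["B-TEAM", "O", "B-TEAM", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```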
You can try this minimal bash Named-Entity Recognizer:
https://github.com/lasigeBioTM/MER
Demo: http://labs.fc.ul.pt/mer/

Validation of hurdle model?

I built a hurdle model, and then used it to predict from known data points to unknown ones with the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known results), which gives you a general sense of how well your model works. Cross-validation works quite well for a wide variety of models. The downside is that it can get quite computationally heavy.
With large datasets, 10-fold cross-validation is enough. The smaller your dataset, the more "folds" you have to use (with very small datasets, you have to do leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take some effort to interpret in more complicated comparisons, but for your general question of how to assess the validity of the model, the results should be quite easy to use.
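As an illustration, here is a minimal 10-fold sketch in Python on synthetic data, scoring the binomial (zero vs. nonzero) part with AUC. Everything here (data, model, metric) is a placeholder for your own; the same fold-and-score recipe carries over to a hurdle model in R with a manual loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # toy zero/nonzero indicator

aucs = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Score the held-out fold: AUC for the binary part of the model.
    aucs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))

print(f"10-fold mean AUC: {np.mean(aucs):.3f}")
```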

Conventions for making Stanford NER CRF training data

I have to build a good CRF-based NER model. I am targeting a vast domain, and the total number of classes I am targeting is 17. Through a lot of experimentation I have also put together a feature set (austen.prop) that should work for me, yet the NER is not producing good results. I need to know the limitations of CRF-based NER with respect to training data size, etc.
I have searched a lot, but so far I have been unable to find the conventions one should follow in creating training data.
(Note: I know completely how to make the model and use it; I just need to know whether there are conventions, e.g. that some percentage of each target class should be present.)
If anybody can guide me, I would be thankful to you.
For English, a standard training data set is CoNLL 2003 which has something like 15,000 tagged sentences for 4 classes (ORG, PERSON, LOCATION, MISC).
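For reference, Stanford's CRFClassifier trains on a simple two-column format: one token and its label per line, tab-separated, with (by common CoNLL convention) a blank line between sentences. A minimal sketch that writes such a file; the sentence and labels are made up:

```python
# Write training data in the token<TAB>label format Stanford NER expects.
sentences = [
    [("John", "PERSON"), ("lives", "O"), ("in", "O"), ("Boston", "LOCATION"), (".", "O")],
]

with open("train.tsv", "w", encoding="utf-8") as f:
    for sent in sentences:
        for token, label in sent:
            f.write(f"{token}\t{label}\n")
        f.write("\n")   # sentence boundary
```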

Information retrieval probabilistic model

Do you know where I can find source code (in any language) for an information retrieval system based on the probabilistic model?
I tried to search the web and found an algorithm named BM25 (or BMF25?), but I don't know if it is useful.
Basically, I'm trying to compare the performance of three IR algorithms: the vector space model, the boolean model, and the probabilistic model. Right now I have found the vector space and boolean models. Depending on the results, we will use the best of them to develop a question-answering system.
Thanks in advance.
If you are looking for an IR engine that has BM25 implemented, you can try the Terrier IR Platform.
The language is Java. You can either use the engine itself or look into the source code for implementations of BM25 and other term-weighting models.
The confusion here is that there are several probabilistic IR models (e.g. 2-Poisson, the Binary Independence Model, language-modeling variants), so the question is ambiguous. But in my experience, when people say "the probabilistic model" they usually mean some variant of the Binary Independence Model due to Robertson and Spärck Jones. BM25 (quite roughly) approximates this model, and that's what I'd use in this case. A canonical implementation of BM25 is included in the Lemur Toolkit. See:
http://www.lemurproject.org/doxygen/lemur/html/OkapiRetMethod_8hpp-source.html
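For comparison experiments, a self-contained BM25 scorer is only a few lines. Below is a minimal sketch of one common variant over a toy tokenized corpus; k1 and b are the usual textbook defaults, not values taken from the Lemur code above:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a token list) against a query, given the corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # smoothed IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["the", "cat", "sat"], ["dogs", "and", "cats"], ["probabilistic", "retrieval"]]
print(bm25_score(["cat"], corpus[0], corpus))
```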
