Where do I find the Stanford NLP Tagger training files for english-left3words-distsim.tagger?

I am trying to understand how to train my own tagger with Stanford NLP. How do I get the training files for english-left3words-distsim.tagger?
The props file lists these files, but I am not sure of the full download URL:
/u/nlp/software/CoreNLP-models/models/english-left3words-distsim-4.1.1-v5/data/wsj-train.tagged.txt;
/u/nlp/software/CoreNLP-models/models/english-left3words-distsim-4.1.1-v5/data/ewt-train.tagged.txt;
/u/nlp/software/CoreNLP-models/models/english-left3words-distsim-4.1.1-v5/data/ontonotes-train.tagged.txt;
/u/nlp/software/CoreNLP-models/models/english-left3words-distsim-4.1.1-v5/data/craft-train.tagged.txt;
/u/nlp/software/CoreNLP-models/models/english-left3words-distsim-4.1.1-v5/data/english-handparsed-train.tagged.txt;
/u/nlp/software/CoreNLP-models/models/english-left3words-distsim-4.1.1-v5/data/questionbank-train.tagged.txt
Any suggestions?
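For what it's worth, those /u/nlp/... paths look internal to Stanford's own filesystem, and the underlying WSJ and OntoNotes corpora are LDC-licensed, so the exact files are unlikely to be downloadable; training your own tagger only requires a tagged file you supply yourself. A minimal training props file might look like the following sketch (the file names and the arch string are illustrative, not Stanford's exact settings):

```properties
# Minimal MaxentTagger training properties (illustrative values)
model = my-left3words.tagger
arch = left3words,naacl2003unknowns,wordshapes(-1,1)
trainFile = format=TEXT,my-train.tagged.txt
tagSeparator = _
encoding = UTF-8
```

You would then train with `java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props my.props`, where my-train.tagged.txt contains one sentence per line of word_TAG tokens.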

Related

Stanford dependency parser training data format

I would like to add a new language to the Stanford Dependency Parser, but cannot for the life of me figure out how.
In what format should training data be?
How do I generate new language files?
The neural net dependency parser takes in CoNLL-X format data.
There is a description of the format in this paper:
https://ilk.uvt.nl/~emarsi/download/pubs/14964.pdf
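As a quick illustration, a two-token sentence in CoNLL-X looks like this: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), underscores for unused columns, and a blank line between sentences (the tag values here are just an example):

```
1	Dogs	dog	NOUN	NNS	_	2	nsubj	_	_
2	bark	bark	VERB	VBP	_	0	root	_	_
```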

Gazettes with Stanford NER

I am building my own CRF-based Stanford NER model by following the conventions given at this link. I want to add gazettes, also following that link. I list all of my gazettes with the property gazette=file1.txt;file2.txt and set useGazettes=true in austen.prop. After building the model, when I test on data taken from my gazettes, it is not tagged correctly: the tags I specified in the gazette files do not come out. These results surprise me, since Stanford NER is not assigning the same tags as in those files.
Are there limitations of Stanford NER with gazettes, or am I still missing something? Any help would be appreciated.
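One thing worth knowing: in the CRF, gazette entries are only features, not hard rules, so the model weighs them against all its other features and a gazette tag is not guaranteed to appear in the output. A props fragment that enables them might look like this (file names illustrative; cleanGazette requires the whole multi-word entry to match, sloppyGazette fires on any matching token):

```properties
useGazettes = true
sloppyGazette = true
gazette = file1.txt;file2.txt
```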

Inconsistency between NLTK Stanford NER tagger and Stanford NER tagger online demo

I am using Python's NLTK library to set up the Stanford NER tagger API, but I am seeing inconsistency between how this API tags words and how the online demo on Stanford's NER tagger website does. Some words are tagged in the online demo but not by the API in Python, and some words are tagged differently. I have used the same classifiers as mentioned on the website. Can anyone tell me why this problem occurs and what the solution is?
I was running into the same issue and determined that my code and the online demo were applying different formatting rules to the text.
https://github.com/dat/pyner/blob/master/ner/client.py
for s in ('\f', '\n', '\r', '\t', '\v'):  # strip whitespace control characters
    text = text.replace(s, '')
text += '\n'  # ensure end-of-line
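Wrapping that normalization in a small helper makes it easy to apply the same rules before every call to the tagger (the function name is my own, not part of pyner):

```python
def normalize_for_ner(text):
    """Mimic the pyner client: remove whitespace control characters
    and guarantee a trailing newline before sending text to the server."""
    for s in ('\f', '\n', '\r', '\t', '\v'):
        text = text.replace(s, '')
    return text + '\n'

# Tabs and newlines are removed entirely, not replaced with spaces:
normalize_for_ner('Stanford\tNER\ndemo')  # -> 'StanfordNERdemo\n'
```

Note that this deletes the whitespace rather than replacing it with spaces, which itself changes tokenization; that difference alone can explain divergent tags between two setups.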

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:-
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another, and Boden as a third. The correct marking should be Ryan Fleck as one entity and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about:
1. Take your data and run it through Stanford NER or some other NER.
2. Look at the results and find all the mistakes.
3. Correctly tag the incorrect results and feed them back into your NER.
4. Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data and training.
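The "make up hyphenated names" idea can be sketched as a tiny data-generation script. Stanford NER's training input is one token and tag per line, tab-separated; the sentence template, name lists, and function below are all my own illustration, not a recipe from Stanford:

```python
def hyphen_pair_example(f1, l1, f2, l2):
    """Emit one training sentence, one 'token<TAB>tag' line per token,
    for a hyphenated pairing like 'Ryan Fleck-Anna Boden'.
    The hyphenated token spans two people, so at the token level
    it still gets the PERSON tag."""
    tokens = [('The', 'O'), ('film', 'O'), ('is', 'O'), ('directed', 'O'),
              ('by', 'O'), (f1, 'PERSON'), (f'{l1}-{f2}', 'PERSON'),
              (l2, 'PERSON'), ('pair', 'O'), ('.', 'O')]
    return '\n'.join(f'{tok}\t{tag}' for tok, tag in tokens)

# Generate examples over whatever name lists you have:
example = hyphen_pair_example('Ryan', 'Fleck', 'Anna', 'Boden')
```

Looping this over a few hundred first/last name combinations and appending the output to your training file gives the CRF many instances of the hyphenated pattern.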
Instead of Stanford CoreNLP you could try Apache OpenNLP. It has an option to train a model on your own training data. Since that model depends on the names you supply, it is able to detect the names you are interested in.

stanford corenlp 3.3.1 language support

I'm starting to use the CoreNLP library 3.3.1 to analyze Italian text documents. Has anybody tried to run it on a language other than English? Did you find the models needed to train the algorithms?
Thanks
Carlo
At the moment, beyond English, we only package models for Chinese (see http://nlp.stanford.edu/software/corenlp.shtml#History), but people have also successfully used the German and French models that we distribute with the Stanford Parser, Stanford NER, or the Stanford POS Tagger inside CoreNLP. For Italian, you'd need annotated data available to train your own models. There are some treebanks available for Italian and the Stanford Parser has been trained for Italian. For info on resources for Italian, see: http://aclweb.org/aclwiki/index.php?title=Resources_for_Italian#Treebanks.
