Purpose of Penn Treebank and PCFG model in Stanford Parser - stanford-nlp

I am confused about the purpose of the englishPCFG model and the Penn Treebank annotation. The Stanford Parser package ships only the models, so I keep wondering how a model works if we already have the annotations from the Penn Treebank. Put simply: what does the Penn Treebank annotation do for the parser, and where does the model come from? When raw text is given to the parser, does it need to query the treebank again to predict trees?
I have been reading some materials, but I still don't know at which of the steps below the model is generated.
1, Choose an available treebank.
2, Choose a parser engine suitable for the treebank annotation.
3, Select training and test data.
4, Train the parser on the training set.
5, Evaluate the parser's accuracy on the test set.
6, Write a report on the project with experimental results.
Can anyone help?

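The model is the saved state of the parser after step 4. The treebank is needed only for training; once the model has been saved, you can use it to evaluate the parser or to parse text at any later time, without needing to retrain and without consulting the treebank again.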
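To make the relationship between treebank and model concrete, here is a minimal sketch in Python. It is not how the Stanford Parser is implemented; it only assumes that "training" means counting grammar rules in annotated trees and that "the model" is those estimated probabilities written out. The two toy trees and all function names are made up for illustration.

```python
import json
import re
from collections import Counter, defaultdict

# Two hand-written Penn-Treebank-style trees standing in for a real treebank.
TOY_TREEBANK = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
]

def read_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                                  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())           # nested constituent
            else:
                children.append(tokens[pos])      # a word (leaf)
                pos += 1
        pos += 1                                  # skip ")"
        return (label, children)

    return node()

def count_rules(tree, counts):
    """Count every rule LHS -> RHS that occurs in the tree."""
    label, children = tree
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    counts[label][rhs] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)

def train(treebank):
    """Step 4: estimate rule probabilities by relative frequency."""
    counts = defaultdict(Counter)
    for line in treebank:
        count_rules(read_tree(line), counts)
    return {lhs: {rhs: n / sum(rules.values()) for rhs, n in rules.items()}
            for lhs, rules in counts.items()}

model = train(TOY_TREEBANK)          # the treebank is used here and only here
serialized = json.dumps(model)       # stands in for the saved model file
reloaded = json.loads(serialized)    # later: load the model, no treebank needed
```

After the last line, `reloaded` holds everything a parser would need from training, for example the probability of `NN -> dog`; the trees themselves are no longer consulted.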

Related

Are there any alternate ways other than Named Entity Recognition to extract event names from sentences?

I'm a newbie to NLP and I'm working on NER using OpenNLP. I have a sentence like "We have a dinner party today". Here "dinner party" is an event type. Similarly, in the sentence "we have a room reservation", "room reservation" is an event type. My goal is to extract such phrases from sentences and label them as "Event_types" in the final output. This can be achieved reasonably well by creating custom NER models, annotating sentences with proper tags in the training dataset. But the event types can be heterogeneous and unpredictable, so it is very hard to label all possible patterns (i.e. event types can be anything like "security meeting", "family function", "parents teachers meeting", etc.). So I'm looking for an alternative way to solve this problem. An immediate response would be appreciated. Thanks! :)
Basically you have two options: 1) A list-based approach where you have lists of entities you will extract from text. To handle heterogeneous language use, one can train an embedding (e.g. Word2Vec or FastText) to identify contextually similar phrases for your list. 2) Train a custom CRF with data you have annotated (this obviously requires that you annotate a bunch of sentences with the corresponding tags). I guess the ideal solution really depends on the data and people's willingness to annotate it.
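As a rough illustration of option 1, the sketch below does plain longest-match lookup against a phrase list. The seed list and the function name are hypothetical, and the embedding-based expansion of the list is left out; in practice the list would be grown with similar phrases found by a trained embedding.

```python
import re

# Hypothetical seed list; an embedding model could be used to extend it.
EVENT_TYPES = {"dinner party", "room reservation", "security meeting"}

def extract_event_types(sentence, phrases=EVENT_TYPES):
    """Return listed phrases found in the sentence, preferring longer matches."""
    tokens = re.findall(r"\w+", sentence.lower())
    max_len = max(len(p.split()) for p in phrases)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                found.append(candidate)
                i += n
                break
        else:
            i += 1                    # no phrase starts at this token
    return found
```

Calling `extract_event_types("We have a dinner party today")` returns `["dinner party"]`. Phrases that are not on the list are simply missed, which is why the list needs to be extended or combined with a trained model.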

Add domain-specific entities to spaCy or Stanford NLP training set

We would like to add some custom entities to the training set of either Stanford NLP or spaCy, before re-training the model. We are willing to label our custom entities, but we would like to add these to the existing training set, so as to not spend too much time labeling.
We assume that the NLP model was trained on a large labeled data set, which includes labels for words that are labeled "O" ("other", i.e. nothing of interest) as well as words that are labeled "DATE", "PERSON", "ORGANIZATION", etc. We have a custom set of ORGANIZATION words, but we would like to add these to all the other labeled data, before re-training the model.
Is this possible? How can we do this? Do we have to get the labeled dataset that the models were trained on, so we can add our own data? If so, how can we do that?
We have built prototypes using both Stanford NLP and spaCy, so an answer for either one works for us.
For spaCy, you should just be able to call nlp.update(). This will make a weight update against the current weights, allowing you to resume training. If you want to make many updates, you might want to parse some text with the original model and mix that through your training, to avoid the "catastrophic forgetting" problem.
You can use this entity tagger tool by helkaroui to create your own training set.

Sentence segmentation with annotated corpus

I have a custom annotated corpus, in OpenNLP format. Ex:
<START:Person> John <END> went to <START:Location> London <END>. He visited <START:Organisation> ACME Co <END> in the afternoon.
What I need is to segment sentences from this corpus. But it won't always work as expected due to the annotations.
How can I do it without losing the entity annotations?
I am using OpenNLP.
If you want to create multiple NLP models for OpenNLP, you need multiple formats to train them:
The tokenizer requires a training format
The sentence detector requires a training format
The name finder requires a training format
Therefore, you need to manage these different annotation layers in some way.
I created an annotation tool and a Maven plugin which help you do this, have a look here. All information can be stored in a single file and the Maven plugin will generate the NLP models for you.
Let me know if you have any further questions.
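If the goal is only to cut the annotated text into sentences without breaking the tags, one possible workaround is to mask the entity spans, find sentence boundaries in the masked text, and cut the original text at the same offsets. The sketch below is a naive rule-based splitter in Python, not the OpenNLP sentence detector; the regular expressions and the function name are assumptions made for this example.

```python
import re

# One annotated entity, e.g. "<START:Person> John <END>".
ENTITY = re.compile(r"<START:[^>]+>.*?<END>")
# Naive boundary: end punctuation, whitespace, then an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(annotated):
    """Split annotated text into sentences while keeping entity tags intact."""
    # Replace each entity span by a same-length run of "X" so that offsets
    # stay aligned and punctuation inside an entity cannot trigger a split.
    masked = ENTITY.sub(lambda m: "X" * len(m.group()), annotated)
    sentences, start = [], 0
    for match in BOUNDARY.finditer(masked):
        sentences.append(annotated[start:match.start()])
        start = match.end()
    sentences.append(annotated[start:])
    return sentences

corpus = ("<START:Person> John <END> went to <START:Location> London <END>. "
          "He visited <START:Organisation> ACME Co <END> in the afternoon.")
sentences = split_sentences(corpus)
```

For the example from the question this gives two sentences, each with its annotations unchanged. Because the boundary rule is so simple, abbreviations outside entities (such as "Dr.") will still cause wrong splits.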

Training caseless NER models with Stanford CoreNLP

I know how to train an NER model as specified here and have a very successful one in fact. I also know about the 3 provided caseless models as talked about here. But what if I want to train my own caseless model, what is the trick there? I have a bunch of all uppercase documents for training. Do I use the same training process or are there special/different features for the caseless models or are there properties that need to be set? I can't find a description as to how the provided caseless models were created.
There is only one property change in our models, which is that you want to have it invoke a function that removes case information before words are processed for classification. We do that with this property value (which also maps some words to American spelling):
wordFunction = edu.stanford.nlp.process.LowercaseAndAmericanizeFunction
but there is also simply:
wordFunction = edu.stanford.nlp.process.LowercaseFunction
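To illustrate what such a word function does, here is a small Python sketch of the idea. It is not the code of the Stanford classes named above; the two-entry spelling table and the function names are hypothetical. The point is that every token is normalized before features are computed, so all-uppercase and mixed-case input look the same to the classifier.

```python
# Hypothetical two-entry spelling table, for illustration only.
BRITISH_TO_AMERICAN = {"colour": "color", "organisation": "organization"}

def lowercase(word):
    """Drop case information from a single token."""
    return word.lower()

def lowercase_and_americanize(word):
    """Drop case information, then map known British spellings to American."""
    lowered = word.lower()
    return BRITISH_TO_AMERICAN.get(lowered, lowered)

def normalize_tokens(tokens, word_function=lowercase):
    """Apply the word function to every token before feature extraction."""
    return [word_function(token) for token in tokens]

upper = normalize_tokens(["JOHN", "VISITED", "LONDON"])
mixed = normalize_tokens(["John", "visited", "London"])
```

Here `upper` and `mixed` are identical lists, which is why the same setting works for training on all-uppercase documents and for tagging text in any case.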
Having more automatic stuff for deciding document format (hard/soft line breaks), case, or even language would be nice, but at present we don't have any of those....

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below.
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about this:
1, Take your data and run it through Stanford NER or some other NER.
2, Look at the results and find all the mistakes.
3, Correctly tag the incorrect results and feed them back into your NER.
4, Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
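A sketch of how such made-up training data could be generated is shown below, in Python. It writes one token and its label per line, separated by a tab. The name lists, the sentence template and the function names are invented for this example, and it assumes that the hyphen is split off as its own token labelled O, which has to match the tokenization used at tagging time.

```python
import itertools

# Hypothetical name lists and sentence template.
FIRST_NAMES = ["Ryan", "Anna", "Maria"]
LAST_NAMES = ["Fleck", "Boden", "Lopez"]
TEMPLATE = ["The", "film", "is", "directed", "by", "{pair}", "pair", "."]

def make_examples(limit=3):
    """Build tagged sentences containing hyphen-joined pairs of names."""
    people = list(zip(FIRST_NAMES, LAST_NAMES))
    examples = []
    for (f1, l1), (f2, l2) in itertools.permutations(people, 2):
        rows = []
        for token in TEMPLATE:
            if token == "{pair}":
                # Assumption: the hyphen is a separate token with label O.
                rows += [(f1, "PERSON"), (l1, "PERSON"), ("-", "O"),
                         (f2, "PERSON"), (l2, "PERSON")]
            else:
                rows.append((token, "O"))
        examples.append(rows)
        if len(examples) == limit:
            break
    return examples

def to_tsv(example):
    """Render one sentence as tab-separated token/label lines."""
    return "\n".join(f"{token}\t{label}" for token, label in example)

examples = make_examples(limit=2)
training_text = "\n\n".join(to_tsv(e) for e in examples)
```

The first generated sentence is the one from the question, with "Ryan Fleck" and "Anna Boden" tagged as PERSON on either side of an untagged hyphen. Real training data would need far more variety in names and sentence contexts than this template provides.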
You should get there by adding more features, more data and training.
Instead of using Stanford CoreNLP you could try Apache OpenNLP. It has an option to train your own model on your training data. Since that model depends on the names you supply, it is able to detect the names you are interested in.
