I have trained a T5 model on a specific dataset for keyword extraction. I would like to use AllenNLP Interpret to compute various saliency maps for the inputs given to my model. Where do I need to make changes so that I can use the package?
The AllenNLP guide has a chapter on interpreting models: https://guide.allennlp.org/interpret
Also, for custom models, here's an example: How to use Allen NLP interpret on custom models
We would like to add some custom entities to the training set of either Stanford NLP or spaCy, before re-training the model. We are willing to label our custom entities, but we would like to add these to the existing training set, so as to not spend too much time labeling.
We assume that the NLP model was trained on a large labeled data set, which includes labels for words that are labeled "O" ("other", i.e. nothing of interest) as well as words that are labeled "DATE", "PERSON", "ORGANIZATION", etc. We have a custom set of ORGANIZATION words, but we would like to add these to all the other labeled data, before re-training the model.
Is this possible? How can we do this? Do we have to get the labeled dataset that the models were trained on, so we can add our own data? If so, how can we do that?
We have built prototypes using both Stanford NLP and spaCy, so an answer for either one works for us.
For spaCy, you should just be able to call nlp.update(). This will make a weight update against the current weights, allowing you to resume training. If you want to make many updates, you might want to parse some text with the original model and mix that through your training, to avoid the "catastrophic forgetting" problem.
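A minimal sketch of the "mix the original model's output through your training" idea, using only the standard library. The function name `build_mixed_training_set` and the stand-in `fake_original_model` are hypothetical; with spaCy, the predictor would presumably wrap the loaded pipeline, and the mixed examples would then be batched into `nlp.update()`.

```python
import random
from typing import Callable, Iterable, List, Tuple

# One training item: (text, [(start_char, end_char, label), ...])
Annotated = Tuple[str, List[Tuple[int, int, str]]]


def build_mixed_training_set(
    new_examples: List[Annotated],
    unlabeled_texts: Iterable[str],
    predict_entities: Callable[[str], List[Tuple[int, int, str]]],
    seed: int = 0,
) -> List[Annotated]:
    """Mix hand-labeled examples with texts labeled by the original model."""
    # "Rehearsal" examples: the original model's own predictions, so that
    # further updates are also pulled towards what it already knew.
    rehearsal = [(text, predict_entities(text)) for text in unlabeled_texts]
    mixed = list(new_examples) + rehearsal
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical stand-in for the original model. With spaCy this could be:
#   lambda t: [(e.start_char, e.end_char, e.label_) for e in nlp(t).ents]
def fake_original_model(text: str) -> List[Tuple[int, int, str]]:
    start = text.find("London")
    return [(start, start + 6, "GPE")] if start >= 0 else []


new_examples = [("ACME Co hired ten people.", [(0, 7, "ORG")])]
mixed = build_mixed_training_set(
    new_examples,
    ["She moved to London.", "Nothing here."],
    fake_original_model,
)
```

The ratio of rehearsal to new examples is a judgment call and is not prescribed by the answer above; one-to-one is only a starting point.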
You can use this entity tagger tool by helkaroui to create your own training set.
I have a custom annotated corpus, in OpenNLP format. Ex:
<START:Person> John <END> went to <START:Location> London <END>. He visited <START:Organisation> ACME Co <END> in the afternoon.
What I need is to segment sentences from this corpus. But it won't always work as expected due to the annotations.
How can I do it without losing the entity annotations?
I am using OpenNLP.
In case you want to create multiple NLP models for OpenNLP you need multiple formats to train them:
The tokenizer requires a training format
The sentence detector requires a training format
The name finder requires a training format
Therefore, you need to manage these different annotation layers in some way.
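One way to keep the layers apart, sketched in Python under the assumption that entities never span a sentence boundary: strip the `<START:...> ... <END>` tags while recording character offsets, run sentence detection on the plain text, then put the tags back per sentence. The function names are hypothetical, and the sentence bounds in the demo are hand-picked stand-ins for the output of a real sentence detector.

```python
import re

# Matches "<START:Type> entity text <END>" as used in the corpus above.
ENTITY = re.compile(r"<START:([^>]+)>\s*(.*?)\s*<END>")


def strip_annotations(annotated):
    """Return the plain text plus (start, end, type) character spans."""
    parts, spans = [], []
    pos = 0      # read position in the annotated string
    out_len = 0  # length of the plain text built so far
    for m in ENTITY.finditer(annotated):
        before = annotated[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        entity_text = m.group(2)
        spans.append((out_len, out_len + len(entity_text), m.group(1)))
        parts.append(entity_text)
        out_len += len(entity_text)
        pos = m.end()
    parts.append(annotated[pos:])
    return "".join(parts), spans


def reannotate(plain, spans, sentence_bounds):
    """Re-insert the tags into each sentence given (start, end) offsets."""
    sentences = []
    for s_start, s_end in sentence_bounds:
        out, pos = [], s_start
        for e_start, e_end, e_type in spans:
            # Only entities fully inside this sentence are re-tagged.
            if s_start <= e_start and e_end <= s_end:
                out.append(plain[pos:e_start])
                out.append(f"<START:{e_type}> {plain[e_start:e_end]} <END>")
                pos = e_end
        out.append(plain[pos:s_end])
        sentences.append("".join(out))
    return sentences


corpus = ("<START:Person> John <END> went to <START:Location> London <END>. "
          "He visited <START:Organisation> ACME Co <END> in the afternoon.")
plain, spans = strip_annotations(corpus)

# Hand-picked split for the demo; a real sentence detector would supply these.
cut = plain.index(". ") + 1
bounds = [(0, cut), (cut + 1, len(plain))]
sentences = reannotate(plain, spans, bounds)
```

Here `plain` no longer contains any markup, so the sentence detector sees ordinary text, and `sentences` holds one re-annotated sentence per entry.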
I created an annotation tool and a Maven plugin which help you do this; have a look here. All information can be stored in a single file, and the Maven plugin will generate the NLP models for you.
Let me know if you have any further questions.
I know how to train an NER model as specified here and have a very successful one in fact. I also know about the 3 provided caseless models as talked about here. But what if I want to train my own caseless model, what is the trick there? I have a bunch of all uppercase documents for training. Do I use the same training process or are there special/different features for the caseless models or are there properties that need to be set? I can't find a description as to how the provided caseless models were created.
There is only one property change in our models, which is that you want to have it invoke a function that removes case information before words are processed for classification. We do that with this property value (which also maps some words to American spelling):
wordFunction = edu.stanford.nlp.process.LowercaseAndAmericanizeFunction
but there is also simply:
wordFunction = edu.stanford.nlp.process.LowercaseFunction
Having more automatic stuff for deciding document format (hard/soft line breaks), case, or even language would be nice, but at present we don't have any of those....
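To illustrate what "removing case information before words are processed" amounts to, here is a small Python sketch. It is not Stanford's implementation and does not do the American-spelling mapping; it only shows the effect on the word column, assuming a `token<TAB>label` layout with blank lines between sentences. With the property set, this happens inside the classifier, so the training file itself would not need to be changed.

```python
def lowercase_word_column(lines):
    """Lowercase the token in 'token<TAB>label' lines; labels stay as-is.

    `lines` are expected without trailing newlines; an empty string marks
    a sentence boundary and is passed through unchanged.
    """
    out = []
    for line in lines:
        if not line.strip():
            out.append("")
            continue
        token, label = line.split("\t")
        out.append(f"{token.lower()}\t{label}")
    return out


rows = ["JOHN\tPERSON", "VISITED\tO", "LONDON\tLOCATION", "", "London\tLOCATION"]
caseless = lowercase_word_column(rows)
```

After this step "LONDON" and "London" yield the same word feature, which is why the same setting works for all-uppercase training documents.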
I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:-
Let the sentence be The film is directed by Ryan Fleck-Anna Boden pair.
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about
take your data and run it through Stanford NER or some other NER.
look at the results and find all the mistakes
correctly tag the incorrect results and feed them back into your NER.
lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data and training.
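The "make up a bunch of hyphenated names" suggestion could be sketched like this in Python. Everything here is illustrative: the function names, the template, and the extra names are made up, and the output assumes a tab-separated `token<TAB>label` training file with a blank line between sentences. It also assumes the tokenizer splits on the hyphen at tagging time; if "Fleck-Anna" stays a single token, no amount of training data can label the two halves differently.

```python
import random


def make_hyphenated_examples(first_names, last_names, templates, n, seed=0):
    """Build n synthetic sentences as lists of (token, label) rows."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n):
        a = (rng.choice(first_names), rng.choice(last_names))
        b = (rng.choice(first_names), rng.choice(last_names))
        rows = []
        for tok in rng.choice(templates).split():
            if tok == "{PAIR}":
                # "First Last-First Last", with the hyphen as its own token.
                rows += [(a[0], "PERSON"), (a[1], "PERSON"), ("-", "O"),
                         (b[0], "PERSON"), (b[1], "PERSON")]
            else:
                rows.append((tok, "O"))
        sentences.append(rows)
    return sentences


def to_tsv(sentences):
    """Serialize to 'token<TAB>label' lines, blank line between sentences."""
    blocks = ["\n".join(f"{t}\t{l}" for t, l in s) for s in sentences]
    return "\n\n".join(blocks) + "\n"


examples = make_hyphenated_examples(
    first_names=["Ryan", "Anna", "Maria", "Tom"],       # illustrative names
    last_names=["Fleck", "Boden", "Rossi", "Keller"],
    templates=["The film is directed by the {PAIR} pair ."],
    n=5,
)
training_text = to_tsv(examples)
```

Mixing these synthetic sentences into the real training data, rather than training on them alone, is probably the safer choice.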
Instead of Stanford CoreNLP, you could try Apache OpenNLP. It offers an option to train a model on your own training data. Because the model depends on the names you supply, it is able to detect the names you are interested in.
I want to experiment with NER on a specific domain, namely extracting location names from travel offers in Italian.
So far I've got that I need to prepare the training set by myself, so I'm going to put the
<START:something><END>
tags in some offers from my training set.
But looking at the OpenNLP documentation on how to train for NER, I ended up with a couple of questions:
1) When defining the START/END tags, am I free to use whatever name I like inside the tags (where I wrote "something" a few lines above), or is there a restricted set I am bound to?
2) I noticed that the call to the training tool
opennlp TokenNameFinderTrainer
takes a string representing the language as the first argument. What is that for? Considering I want to train a model on Italian language that is NOT supported, is there any additional task to be done before I could train for NER?
1) Yes, you can specify multiple types. If the training file contains multiple types, the created model will also be able to detect these multiple types.
2) I think the "lang" parameter has the same meaning and use as in other commands (e.g. opennlp TokenizerTrainer -lang it ...)
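For point 1, a small Python sketch of producing training lines with more than one type. The helper name `annotate` and the type names `location` and `hotel` are made up for illustration; the assumed layout is one whitespace-tokenized sentence per line with `<START:type> ... <END>` around each entity.

```python
def annotate(tokens, spans):
    """Wrap token spans in <START:type> ... <END> tags.

    tokens: list of str
    spans:  list of (start_token, end_token_exclusive, type), non-overlapping
    """
    starts = {start: etype for start, _, etype in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:  # entity that runs to the end of the sentence
        out.append("<END>")
    return " ".join(out)


line1 = annotate("Volo da Roma a Venezia".split(),
                 [(2, 3, "location"), (4, 5, "location")])
line2 = annotate("Hotel Danieli a Venezia".split(),
                 [(0, 2, "hotel"), (3, 4, "location")])
```

`line1` comes out as `Volo da <START:location> Roma <END> a <START:location> Venezia <END>`, and `line2` carries both a `hotel` and a `location` span in the same sentence.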