Training an existing CoreNLP model - stanford-nlp

I want to further train Stanford CoreNLP's existing english-left3words-distsim.bin model with some more data that fits my use case. I want to assign custom tags to certain words; for example, run would be tagged COMMAND.
Where can I get the training data set? I could follow something like model training

For the most part it is sections 0-18 of the WSJ Penn Treebank.
Link: https://catalog.ldc.upenn.edu/ldc99t42
We also have some extra data sets, which we don't distribute, that we add on top of the WSJ data.
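Once you have training data (WSJ-style or your own, with custom tags like COMMAND in the annotations), the tagger is retrained from a properties file. A minimal sketch, with placeholder paths and a simplified `arch` value; see the MaxentTagger documentation for the full option list:

```properties
## train.props -- sketch of a MaxentTagger training configuration
## (paths are placeholders; arch is a simplified feature architecture)
model = my-custom-model.tagger
trainFile = /path/to/my-training-data.txt
arch = left3words
tagSeparator = _
encoding = UTF-8
```

Training would then be invoked with something like `java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props train.props`, where each token in the training file is written as word_TAG.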

Related

How can I add or remove labels while fine-tuning a BERT NER model?

I want to fine-tune a BERT NER model and add new labels or remove existing ones.
For example,
I have these labels:
LOCATION MONEY ORGANIZATION PERSON PRODUCT TIME TVSHOW.
I want to add more labels or remove some while fine-tuning. Is this possible? If not, what are the alternatives?
I could not find a solution.
BERT enables you to do this, but you cannot start from an already fine-tuned model. For example, we tried this with a fine-tuned BERTurk model, but its architecture (the classification head) did not match our labels, so we switched to the original BERTurk base model and it worked. In short, BERT can be trained with a new label set when it has not already been trained for a downstream task such as NER.
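To make the head-mismatch point concrete, here is a minimal Python sketch of what changing the label set means: the label-to-id mapping changes size, and that size is the output dimension of the token-classification head, so the head must be freshly initialized (e.g. by loading a base checkpoint and passing the new maps to the model config) rather than inherited from a model fine-tuned on the old labels. The old label names are from the question; COMMAND is a hypothetical addition.

```python
# Sketch: what changing the label set means for the model head.
# Old labels from the question; COMMAND is a hypothetical new label.
old_labels = ["LOCATION", "MONEY", "ORGANIZATION", "PERSON",
              "PRODUCT", "TIME", "TVSHOW"]

# Remove TVSHOW and add COMMAND:
new_labels = [l for l in old_labels if l != "TVSHOW"] + ["COMMAND"]

# These maps are what you would hand to the model config (as id2label /
# label2id) when loading a *base* checkpoint. Their size is the output
# dimension of the classification head, which is why a checkpoint
# fine-tuned on the old label set cannot be reused directly.
label2id = {label: i for i, label in enumerate(new_labels)}
id2label = {i: label for i, label in enumerate(new_labels)}
```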

Is there a best way to train a custom, domain-specific text summarization model?

I tried some pretrained summarization models from Hugging Face, like BERT, T5, BART, etc., but the summaries miss some important content from the original data. I need an abstractive summary that still captures the relevant information from the original content.

How to export a Google AutoML Text Classification model?

I just finished training my AutoML Text Classification model (single-label).
I was planning to run a Batch Prediction using the console, but I just found out how expensive that will be because I have over 300,000 text records to analyze.
So now I want to export the model to my local machine and run the predictions there.
I found instructions here to export "AutoML Tabular Models" and "AutoML Edge Models". But there is nothing available for text classification models.
I tried following the "AutoML Tabular Model" instructions because that looked like the closest thing to a text classification model, but I could not find the "Export" button that was supposed to exist on the model detail page.
So I have some questions regarding this:
How do I export an AutoML Text Classification model?
Is an AutoML Text Classification model the same thing as an AutoML Tabular model? They seem very similar because my text classification model used a tabular CSV to assign labels and train the model.
If I cannot export an AutoML Text Classification model (urgh!), can I train a new "Tabular" model to do the same thing?
Currently, there is no feature to export an AutoML text classification model. A feature request already exists; you can follow its progress on this issue tracker.
Both the models are quite similar. A tabular data classification model analyzes your tabular data and returns a list of categories that describe the data. A text data classification model analyzes text data and returns a list of categories that apply to the text found in the data. Refer to this doc for more information about AutoML model types.
Yes, you can do the same thing in an AutoML tabular data classification model if your training data is in tabular CSV file format. Refer to this doc for more information about how to prepare tabular training data.
If your model trained successfully in an AutoML tabular data classification, you can find an Export option at the top. Refer to this doc for more information about how to export tabular classification models.
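For reference, tabular training data for a text classification task is just a CSV with one column holding the text and one holding the label. A sketch (column names and rows are made up for illustration):

```csv
text,label
"I love this product, it works perfectly",positive
"Arrived broken and support never replied",negative
```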

Unable to tag gazette entities using my own CRF model

I followed this question: Entities on my gazette are not recognized
Even after adding the minimal training-data example, with "Damiano" as a gazette entity, I am not able to recognize John or Andrea as PERSON.
I also tried this with larger training data and a larger gazette, but I am still not able to tag any gazette entity. Why?
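For reference, gazette support in CRFClassifier is switched on through training properties. A sketch, with placeholder paths and the commonly used gazette flags:

```properties
## ner.props -- sketch of CRFClassifier gazette settings
trainFile = /path/to/training-data.tsv
serializeTo = my-ner-model.ser.gz
useGazettes = true
cleanGazette = true
gazette = /path/to/my-gazette.txt
```

Note that gazette entries act as features, not hard rules: a gazette match raises the probability of a tag but does not force it, so with very little training data the model may still not tag gazette entries.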

If I don't specify a sentiment model in CoreNLP, what will it use to score the data?

I've been creating sentiment analysis models to use with Stanford CoreNLP, and I've been using the one with the highest F1 score in my java code, like so:
props.put("sentiment.model", "/path/to/model-0014-93.73.ser.gz");
But if I remove this line, what does CoreNLP use to score the data? Is there a default CoreNLP model that's used if the user does not specify one?
If no model is given, it'll use the default model included in the release trained on the Stanford Sentiment Treebank: http://nlp.stanford.edu/sentiment/treebank.html
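So omitting the property amounts to running the pipeline with the bundled sentiment model. A sketch of the equivalent properties, with no sentiment.model set:

```properties
## Default sentiment setup -- no custom model specified, so the
## bundled Sentiment Treebank model is used. The sentiment annotator
## needs the parser's (binarized) trees, hence parse in the list.
annotators = tokenize, ssplit, parse, sentiment
```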
