Is there a best way to train a custom, domain-specific text summarization model? - huggingface-transformers

I tried some pretrained summarization models from HuggingFace, like BERT, T5, BART, etc., but the summaries miss some important content from the original data. I need to produce an abstractive summary that still extracts the relevant information from the original content.
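What usually helps with domain-specific content is fine-tuning a pretrained seq2seq checkpoint on your own (document, summary) pairs. Below is a minimal fine-tuning sketch using the transformers Seq2SeqTrainer; the checkpoint, the toy dataset, and the hyperparameters are placeholders for your domain data, not a definitive recipe.

    # Minimal fine-tuning sketch (assumed checkpoint and toy data; tune for real use).
    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    checkpoint = "facebook/bart-base"  # any seq2seq summarization checkpoint works here
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Replace with your real domain-specific (document, summary) pairs.
    raw = Dataset.from_dict({
        "document": ["Long domain-specific text ..."],
        "summary": ["Short reference summary ..."],
    })

    def preprocess(batch):
        # Tokenize inputs and reference summaries; labels drive the seq2seq loss.
        inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="domain-summarizer",
            per_device_train_batch_size=4,
            num_train_epochs=3,
            predict_with_generate=True,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

One note on model choice: BERT is encoder-only, so it is not a natural fit for abstractive generation; encoder-decoder models such as BART, T5, or Pegasus are the usual starting points.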

Related

Annotation specs - AutoML (Vertex AI)

We're trying to build an image-based product search for our webshop using the Vertex AI image classification model (single label).
Currently we have around 20k products with xx images per product.
So our dataset contains 20k labels (one for each product - the product number), but on import we receive the following error message:
There are too many AnnotationSpecs in the dataset. Up to 5000 AnnotationSpecs are allowed in one Dataset. Check your csv/jsonl format with our public documentation.
It looks like no more than 5000 labels are allowed per Dataset... This quota is not really visible in the documentation - or we didn't find it.
Anyway, any ideas how we can make it work? Do we have to build 5 Datasets with 5 different Endpoints and then query every Endpoint for a match?
You can find those limits in the AutoML quotas and limits documentation.
It is possible to have multiple models for groups of products -- maybe even something like: one initial model to classify the product category (jewelry, watches, shoes, toys, etc.) and a second step with a category-specific model (to identify the specific product among the toys, or among the shoes, etc.). But to be honest, that seems a bit hard to support - though certainly worth trying.
A second option would be training a custom model, where you fine-tune some larger model (e.g. Inception, ResNet) to learn all your 20k+ classes (products). It adds a bit more work at first, but once established it gives you a single model for inference, and re-training becomes simpler using MLOps mechanisms (e.g. Vertex Pipelines).
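As a rough sketch of that second option, assuming an ImageFolder-style layout with one directory per product number (paths, batch size, and epochs are placeholders):

    # Fine-tune an ImageNet-pretrained ResNet to output one class per product.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    NUM_CLASSES = 20000  # one class per product number

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageFolder("data/train", transform=transform)  # assumed layout
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # swap the classification head

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(3):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()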

How to export a Google AutoML Text Classification model?

I just finished training my AutoML Text Classification model (single-label).
I was planning to run a Batch Prediction using the console, but I just found out how expensive that will be because I have over 300,000 text records to analyze.
So now I want to export the model to my local machine and run the predictions there.
I found instructions here to export "AutoML Tabular Models" and "AutoML Edge Models". But there is nothing available for text classification models.
I tried following the "AutoML Tabular Model" instructions because that looked like the closest thing to a text classification model, but I could not find the "Export" button that was supposed to exist on the model detail page.
So I have some questions regarding this:
How do I export an AutoML Text Classification model?
Is an AutoML Text Classification model the same thing as an AutoML Tabular model? They seem very similar, because my text classification model used a tabular CSV to assign labels and train the model.
If I cannot export an AutoML Text Classification model (urgh!), can I train a new "Tabular" model to do the same thing?
Currently, there is no feature to export an AutoML text classification model. A feature request already exists; you can follow its progress on this issue tracker.
Both models are quite similar. A tabular data classification model analyzes your tabular data and returns a list of categories that describe the data. A text data classification model analyzes text data and returns a list of categories that apply to the text found in the data. Refer to this doc for more information about AutoML model types.
Yes, you can do the same thing with an AutoML tabular data classification model if your training data is in tabular CSV format. Refer to this doc for more information about how to prepare tabular training data.
If your model trains successfully as an AutoML tabular data classification model, you will find an Export option at the top. Refer to this doc for more information about how to export tabular classification models.
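For illustration, the training data for a tabular classification model is just a CSV with feature columns plus a target column; the column names and rows below are made up, not taken from the original question:

    text,label
    "Package arrived two days late",shipping
    "Screen cracked after one week",product_quality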

Training existing core nlp model

I want to train Stanford CoreNLP's existing english-left3words-distsim.bin model with some more data that fits my use case. I want to assign custom tags to certain words; for example, run would be a COMMAND.
Where can I get the training data set? I could follow something like model training
For the most part it is sections 0-18 of the WSJ Penn Treebank.
Link: https://catalog.ldc.upenn.edu/ldc99t42
We also have some extra data sets, which we don't distribute, that we add on top of the WSJ data.
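As a rough sketch of retraining with a custom tag: the Stanford POS tagger trains from whitespace-separated word_tag tokens, one sentence per line, driven by a properties file. The file names and property values below are placeholders, so check the MaxentTagger documentation for the exact keys:

    # domain-train.txt -- one sentence per line, tokens as word_tag
    run_COMMAND the_DT nightly_JJ report_NN
    open_COMMAND the_DT file_NN

    # myTagger.props -- hypothetical training properties
    model = my-domain-model.tagger
    trainFile = domain-train.txt
    arch = left3words
    tagSeparator = _

    # train (sketch)
    java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props myTagger.props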

HippoCMS translated documents with shared fields

I am evaluating HippoCMS and am trying to model a schema of Venues. I want to model a document that has non-translatable features such as telephoneNumber and emailAddress, plus translatable features such as description.
How do I model this in HippoCMS? How do I ensure that the non-translated fields are shared between the different translations, to avoid each translated document having its own copy of a value? Obviously, no matter which language you are reading a site in, the telephoneNumber shouldn't change.
The only way I have found for the moment is to create a document called Venue and another document called VenueTranslation. Venue would contain the telephoneNumber and VenueTranslation would contain its description and a link back to the Venue document. There would then be VenueTranslation documents for each language.
Is this the correct approach?
That could work, but you will run into usability issues. I'd say it depends on how many venues you plan to enter into the system, how many languages you are targeting, and, in the end, how keen your CMS users are to pick the right Venue document for every VenueTranslation corresponding to a language. I can see how this could quickly become error-prone and cumbersome, but I don't have the numbers.
Regarding the final question: it's neither correct nor incorrect. Since the granularity of translations in Hippo is at the document level and not at the field level, you simply have to do it this way. Your model makes sense but is not well supported in the CMS; this use case is trivial in a CMS that supports the notion of a translatable field.
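For concreteness, the proposed split could be pictured roughly like this (plain data to illustrate the structure, not HippoCMS APIs; all values are made up):

    # Shared, non-translatable data lives once on the Venue document.
    venue = {
        "id": "venue-123",
        "telephoneNumber": "+44 20 0000 0000",
        "emailAddress": "info@example.com",
    }

    # Each language gets its own VenueTranslation linking back to the Venue.
    venue_translation_en = {"venue": "venue-123", "language": "en",
                            "description": "A riverside concert hall ..."}
    venue_translation_fr = {"venue": "venue-123", "language": "fr",
                            "description": "Une salle de concert au bord de l'eau ..."}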

How do you model big web forms in an MVC-based web app?

I'm talking about HUGE forms - like medical forms with 1000+ fields.
How do you logically create models for them? Do you include every single little field as a separate model? Do you make the whole form one HUGE model with every single field? Do you model form sections, where each section has a few fields?
I know this might be subjective, but I really want advice from someone who has dealt with this before, to save others a lot of time down the road by avoiding mistakes at the outset.
Your data model should follow an EAV (entity-attribute-value) approach. Medical systems are well suited to this, as not all patients are going to have all of this information filled in. This method allows you to store only what is appropriate for each patient and populate your model accordingly. It makes organizing the data easier as well.
As for organizing it in the view, I suggest you break it up into sections that are logically related (past history, family history, or by type of information), making the information easier to digest.
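Here is a minimal sketch of the EAV idea using sqlite3; the table and column names are illustrative, not prescribed by the answer:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient    (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE attribute  (id INTEGER PRIMARY KEY, name TEXT, section TEXT);
        CREATE TABLE attr_value (patient_id   INTEGER REFERENCES patient(id),
                                 attribute_id INTEGER REFERENCES attribute(id),
                                 value TEXT);
    """)

    conn.execute("INSERT INTO patient VALUES (1, 'Jane Doe')")
    conn.execute("INSERT INTO attribute VALUES (1, 'penicillin_allergy', 'past history')")
    # Only fields actually filled in on the form get a row -- sparse by design.
    conn.execute("INSERT INTO attr_value VALUES (1, 1, 'yes')")

    # Reassemble one patient's answers, grouped by form section.
    rows = conn.execute("""
        SELECT a.section, a.name, v.value
        FROM attr_value v JOIN attribute a ON a.id = v.attribute_id
        WHERE v.patient_id = 1
        ORDER BY a.section
    """).fetchall()
    print(rows)  # [('past history', 'penicillin_allergy', 'yes')]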
