Training Google-Cloud-Automl Model on multiple datasets - google-cloud-automl

I would like to train an automl model on gcp's vertex ai using multiple datasets. I would like to keep the datasets separate, since they come from different sources, want to train on them separately, etc. Is that possible? Or will I need to create a dataset containing both datasets? It looks like I can only select one dataset in the web UI.

It is possible via the Vertex AI API as long as your sources are in Google Cloud Storage: just provide a list of training data files in JSON or CSV format that follow the best practices for formatting training data.
See the code below for creating and importing datasets. See the documentation for code reference and further details.
from typing import List, Union

from google.cloud import aiplatform


def create_and_import_dataset_image_sample(
    project: str,
    location: str,
    display_name: str,
    src_uris: Union[str, List[str]],  # example: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
    sync: bool = True,
):
    aiplatform.init(project=project, location=location)

    ds = aiplatform.ImageDataset.create(
        display_name=display_name,
        gcs_source=src_uris,
        import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
        sync=sync,
    )

    ds.wait()

    print(ds.display_name)
    print(ds.resource_name)
    return ds
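For instance, a single Vertex AI dataset can be created from import files coming from two different sources. This is a hypothetical call to the sample above; the project, region, dataset name, and GCS paths are placeholders, not values from the question.
# Hypothetical usage of the sample above; all values are placeholders.
create_and_import_dataset_image_sample(
    project="my-project",
    location="us-central1",
    display_name="combined-sources-dataset",
    src_uris=[
        "gs://source-a-bucket/import_file.csv",
        "gs://source-b-bucket/import_file.csv",
    ],
)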
NOTE: The links provided are for Vertex AI AutoML Image. If you follow the links, there are options for other AutoML products such as Text, Tabular, and Video.

Related

How to combine results from multiple models in Google Vertex AI?

I have multiple models in Google Vertex AI and I want to create an endpoint to serve my predictions.
I need to run aggregation algorithms, like the Voting algorithm on the output of my models.
I have not found any ways of using the models together so that I can run the voting algorithms on the results.
Do I have to create a new model, curl my existing models and then run my algorithms on the results?
There is no built-in provision to implement aggregation algorithms in Vertex AI. To curl results from the models and then aggregate them, we would need to deploy all of them to individual endpoints. Instead, I would suggest the method below to deploy the models and the meta-model (aggregation model) to a single endpoint using custom containers for prediction. The custom container requirements can be found here.
You can load the model artifacts from GCS into a custom container. If the same set of models is used (i.e., the input models to the meta-model do not change), you can package them inside the container to reduce load time. Then, custom HTTP logic can be used to return the aggregated output, as shown below. This is sample custom server logic (the handler below uses a FastAPI-style async route).
# Assumed imports and app setup (not shown in the original snippet);
# the async handler below matches a FastAPI-style server.
import os

import numpy as np
from fastapi import FastAPI, Request

app = FastAPI()


def get_models_from_gcs():
    ## Pull the required model artifacts from GCS and load them here.
    models = [model_1, model_2, model_3]
    return models


def aggregate_predictions(predictions):
    ## Your aggregation algorithm here
    return aggregated_result


@app.post(os.environ['AIP_PREDICT_ROUTE'])
async def predict(request: Request):
    body = await request.json()
    instances = body["instances"]
    inputs = np.asarray(instances)
    # _preprocessor is assumed to be defined/loaded elsewhere in the container.
    preprocessed_inputs = _preprocessor.preprocess(inputs)

    models = get_models_from_gcs()
    predictions = []
    for model in models:
        predictions.append(model.predict(preprocessed_inputs))

    aggregated_result = aggregate_predictions(predictions)
    return {"aggregated_predictions": aggregated_result}
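Once the container image is built and pushed, the meta-model can be uploaded and deployed to a single endpoint with the Vertex AI SDK. This is a rough sketch, not part of the original answer; the project, region, image URI, routes, and machine type are placeholders.
# Rough sketch of uploading and deploying the custom container; all values are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="meta-model",
    serving_container_image_uri="us-central1-docker.pkg.dev/my-project/my-repo/meta-model:latest",
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)

endpoint = model.deploy(machine_type="n1-standard-4")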

How to provide parameter input for interaction variable in H2OGradientBoostingEstimator?

I need to use the interaction variable feature of multiclass classification in H2OGradientBoostingEstimator in H2O in Python. I am not sure which parameter to use and how to use it. Can anyone please help me out with this?
Currently, I am using the below code -
pros_gbm = H2OGradientBoostingEstimator(nfolds=0, seed=1234, keep_cross_validation_predictions=False,
                                        ntrees=10, max_depth=3, learn_rate=0.01, distribution='multinomial')
hist_gbm = pros_gbm.train(x=predictors, y=target, training_frame=hf_train, validation_frame=hf_test, verbose=True)
GBM inherently creates interactions. You can extract information about feature interactions using the .feature_interaction() extractor method (for an H2O Model). More information is provided in the user guide and the Python docs.
If you want to explicitly add a new column that is the interaction between two numerics, you could create that manually by multiplying the two (or more) columns together to get a new interaction column.
For categorical interactions, there's also the h2o.interaction() method in Python to create interaction columns in the data (prior to sending it to the GBM or any other algorithm).
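A rough sketch of both approaches, reusing the names from the question (hf_train, predictors, target); the categorical column names are placeholders, not columns from the question's data.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Train as in the question, then pull the pairwise feature interaction table
# from the tree model (available for GBM/XGBoost models in recent H2O versions).
pros_gbm = H2OGradientBoostingEstimator(ntrees=10, max_depth=3,
                                        learn_rate=0.01, distribution='multinomial')
pros_gbm.train(x=predictors, y=target, training_frame=hf_train)
interaction_table = pros_gbm.feature_interaction()

# Explicit interaction columns between categorical features, created on the data
# before training; "cat_col_a" and "cat_col_b" are placeholder column names.
cat_interactions = h2o.interaction(hf_train,
                                   factors=["cat_col_a", "cat_col_b"],
                                   pairwise=True,
                                   max_factors=100,
                                   min_occurrence=1)
hf_train_extended = hf_train.cbind(cat_interactions)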

Download pre-trained sentence-transformers model locally

I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) to create embeddings of sentences with the pre-trained model bert-base-nli-mean-tokens. I have an application that will be deployed to a device that does not have internet access. How to save the model has already been answered here: Download pre-trained BERT model locally. Yet I'm stuck at loading the saved model from the locally saved path.
When I try to save the model using the above-mentioned technique, these are the output files:
('/bert-base-nli-mean-tokens/tokenizer_config.json',
'/bert-base-nli-mean-tokens/special_tokens_map.json',
'/bert-base-nli-mean-tokens/vocab.txt',
'/bert-base-nli-mean-tokens/added_tokens.json')
When I try to load it in the memory, using
tokenizer = AutoTokenizer.from_pretrained(to_save_path)
I'm getting
Can't load config for '/bert-base-nli-mean-tokens'. Make sure that:
- '/bert-base-nli-mean-tokens' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/bert-base-nli-mean-tokens' is the correct path to a directory containing a config.json
You can download and load the model like this
from sentence_transformers import SentenceTransformer
modelPath = "local/path/to/model"
model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
model.save(modelPath)
model = SentenceTransformer(modelPath)
This worked for me. You can check the SBERT documentation for details of the SentenceTransformer class here: https://www.sbert.net/docs/package_reference/SentenceTransformer.html
There are many ways to solve this issue:
Assuming you have trained your BERT base model locally (in Colab or a notebook), to use it with the Huggingface AutoClass the model (along with the tokenizer, vocab.txt, configs, special tokens, and TF/PyTorch weights) has to be uploaded to Huggingface. The steps to do this are described here. Once it is uploaded, a repository will be created under your username, and then the model can be accessed as follows:
from transformers import AutoTokenizer
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("<username>/<model-name>")
The second way is to use the trained model locally, and this can be done by using pipelines. The following is an example of how to use a model trained (and saved) locally for your use case (the example is from my locally trained QA model):
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

nlp_QA = pipeline('question-answering',
                  model='./abhilash1910/distilbert-squadv1',
                  tokenizer='./abhilash1910/distilbert-squadv1')
QA_inp = {
    'question': 'What is the fund price of Huggingface in NYSE?',
    'context': 'Huggingface Co. has a total fund price of $19.6 million dollars'
}
result = nlp_QA(QA_inp)
result
The third way is to use Sentence Transformers directly from the Huggingface models repo, as sketched below.
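A minimal sketch of that third option; the Hub identifier is assumed to be the sentence-transformers namespace for the model named in the question, and the local path is a placeholder.
# Minimal sketch; the local save path is a placeholder.
from sentence_transformers import SentenceTransformer

# On a machine with internet access, pull the model from the Hub and save it:
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
model.save('/path/to/local/bert-base-nli-mean-tokens')

# On the offline device, load it back from the saved directory:
model = SentenceTransformer('/path/to/local/bert-base-nli-mean-tokens')
embeddings = model.encode(["This is an example sentence."])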
There are other ways to resolve this, but these might help. This list of pretrained models might also help.

How do I train a encoder-decoder model for a translation task using hugging face transformers?

I would like to train an encoder-decoder model, as configured below, for a translation task. Could someone guide me on how to set up a training pipeline for such a model? Any links or code snippets to help me understand would be appreciated.
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
# Initializing a BERT bert-base-uncased style configuration
config_encoder = BertConfig()
config_decoder = BertConfig()
config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
# Initializing a Bert2Bert model from the bert-base-uncased style configurations
model = EncoderDecoderModel(config=config)
Encoder-decoder models are used in the same way as any other model in Transformers. They accept batches of tokenized text as vocabulary indices (i.e., you need a tokenizer that is suitable for your sequence-to-sequence task). When you feed the model the input (input_ids) and the desired output (decoder_input_ids and labels), you get the loss value that you can optimize during training. Note that if the sentences in the batch have different lengths, you need to do masking too. This is a minimal example from the EncoderDecoderModel documentation:
from transformers import EncoderDecoderModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-uncased', 'bert-base-uncased')

input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)
outputs = model(
    input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids,
    return_dict=True)
loss = outputs.loss
If you do not want to write the training loop yourself, you can use dataset processing (DataCollatorForSeq2Seq) and training (Seq2SeqTrainer) utilities from Transformers. You can follow the Seq2Seq example on GitHub.
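A rough sketch of such a pipeline follows; it is not from the original answer. The "source"/"target" column names and hyperparameters are placeholders, and train_dataset/eval_dataset are assumed to be datasets.Dataset objects already mapped through the preprocessing function.
# Rough sketch; column names ("source", "target") and hyperparameters are placeholders.
from transformers import (BertTokenizer, EncoderDecoderModel,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-uncased', 'bert-base-uncased')

# Required so the model knows how to start decoding and how to pad.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id


def preprocess(batch):
    # Tokenize the source text; the tokenized target ids become the labels.
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=128)
    labels = tokenizer(batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# train_dataset / eval_dataset are assumed to already be mapped through preprocess().
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads inputs and labels per batch
args = Seq2SeqTrainingArguments(output_dir="bert2bert-translation",
                                per_device_train_batch_size=8,
                                num_train_epochs=3)
trainer = Seq2SeqTrainer(model=model, args=args,
                         data_collator=data_collator,
                         train_dataset=train_dataset,
                         eval_dataset=eval_dataset,
                         tokenizer=tokenizer)
trainer.train()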

How to fetch details of non-leader models generated by h2o automl?

after running automl (classification of 3 classes), I can see a list of models as follows:
model_id mean_per_class_error
StackedEnsemble_BestOfFamily_0_AutoML_20180420_174925 0.262355
StackedEnsemble_AllModels_0_AutoML_20180420_174925 0.262355
XRT_0_AutoML_20180420_174925 0.266606
DRF_0_AutoML_20180420_174925 0.278428
GLM_grid_0_AutoML_20180420_174925_model_0 0.442917
But mean_per_class_error is not a good metric for my case, where classes are unbalanced (one class has a very small population). How can I fetch details of the non-leader models and calculate other metrics? Thanks.
python version: 3.6.0
h2o version: 3.18.0.5
actually just figured this out myself (assuming aml is the h2o automl object after training):
for m in aml.leaderboard.as_data_frame()['model_id']:
    print(m)
    print(h2o.get_model(m))
You can also grab the corresponding model you're interested in using the following line:
model6 = h2o.get_model(aml.leaderboard.as_data_frame()['model_id'][6])
where 6 is the index number of the model in the leaderboard.
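To calculate other metrics for any of those models, one option is to score the model on a held-out frame and read off the metrics you care about. A sketch, assuming `test` is a placeholder H2OFrame with the same columns as the training data:
# Sketch; `test` is a placeholder H2OFrame holding your held-out data.
perf = model6.model_performance(test_data=test)
print(perf.logloss())
print(perf.confusion_matrix())
print(perf.mean_per_class_error())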
