PyTorch Lightning resuming from checkpoint with new data - pytorch-lightning

I want to continue the training process for a model using new data.
I understand that you can continue training a PyTorch Lightning model, e.g.
pl.Trainer(max_epochs=10, resume_from_checkpoint='./checkpoints/blahblah.ckpt'), for example, if your last checkpoint was saved at epoch 5. But is there a way to continue training while adding different data?

Yes. When you resume from a checkpoint, you can pass a new DataLoader (or DataModule) to trainer.fit, and training will resume from the last saved epoch with the new data.
trainer = pl.Trainer(max_epochs=10, resume_from_checkpoint='./checkpoints/blahblah.ckpt')
trainer.fit(model, new_train_dataloader)
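For reference, here is a fuller sketch of what that can look like end to end. The model class, dataset, and paths below are illustrative assumptions, not from the question; also note that newer Lightning versions replace the resume_from_checkpoint Trainer argument with a ckpt_path argument to trainer.fit:
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Hypothetical LightningModule and new dataset
model = MyLightningModule()
new_train_dataloader = DataLoader(my_new_dataset, batch_size=32, shuffle=True)

# Older API: pass the checkpoint path to the Trainer
trainer = pl.Trainer(max_epochs=10, resume_from_checkpoint='./checkpoints/blahblah.ckpt')
trainer.fit(model, new_train_dataloader)

# Newer API (Lightning >= 1.5): pass the checkpoint path to fit() instead
# trainer = pl.Trainer(max_epochs=10)
# trainer.fit(model, new_train_dataloader, ckpt_path='./checkpoints/blahblah.ckpt')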

Related

How do I see the training splits for an AutoML Tables training job?

When training, AutoML will create three data splits: training, validation, and test. How do I see these splits when training?
When doing custom code training, these splits will be materialized on GCS/BigQuery with URIs given by the environment variables: AIP_TRAINING_DATA_URI, AIP_VALIDATION_DATA_URI, and AIP_TEST_DATA_URI. Is there something similar for AutoML?
We don't provide the training/validation set if you're using the managed AutoML training API. But you can optionally export the test set when you create the training job. There's a checkbox in the training workflow.
However, if you're using AutoML through the new Tabular Workflows, you will have access to the split training data as an intermediate training artifact.
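For comparison, in the custom-code training case mentioned in the question, the split URIs can simply be read from the environment inside the training container. A minimal sketch (whether they point at GCS files or a BigQuery table depends on how the training job was configured):
import os

# Vertex AI custom training exposes the managed dataset splits through these
# environment variables (GCS wildcard URIs or a BigQuery URI, depending on
# the data format chosen when the training job was created).
train_uri = os.environ.get("AIP_TRAINING_DATA_URI")
val_uri = os.environ.get("AIP_VALIDATION_DATA_URI")
test_uri = os.environ.get("AIP_TEST_DATA_URI")

print(train_uri, val_uri, test_uri)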

Poor spark performance writing to csv

Context
I'm trying to write a DataFrame to .csv using PySpark. In other posts, I've seen users question doing this, but I need a .csv for business requirements.
What I've Tried
Almost everything. I've tried .repartition(), and I've tried increasing driver memory to 1T. I also tried caching my data first and then writing to csv (which is why the screenshots below indicate I'm trying to cache vs. write out to csv). Nothing seems to work.
What Happens
The UI does not show that any tasks fail. The job, whether it's writing to csv or caching first, gets close to completion and then just hangs.
Screenshots
(Screenshots omitted: the Spark UI job overview, a drill-down into the job, a further drill-down into its stages, and my Spark settings.)
You don't need to cache the DataFrame: caching only helps when multiple actions are performed on it. If the count isn't required, I would suggest removing it as well.
Now, while saving the DataFrame, make sure all the executors are being used.
If your DataFrame is around 50 GB, make sure you are not creating lots of small files, as that will degrade performance.
You can repartition the data before saving: if your DataFrame has a column which divides it evenly, use that, or else find an optimum number of partitions to repartition to.
df.repartition(10, 'col').write.csv('output_path')  # numPartitions comes first in PySpark; 'output_path' is a placeholder
Or
# you have 32 executors with 12 cores each, so repartition accordingly
df.repartition(300).write.csv('output_path')  # 'output_path' is a placeholder
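If you are unsure what number to pick, it can help to check how many partitions the DataFrame currently has and how much parallelism the cluster actually offers before writing. A quick sketch ('output_path' is again a placeholder):
# How many partitions do we have now, and how many cores can work in parallel?
current_parts = df.rdd.getNumPartitions()
total_cores = spark.sparkContext.defaultParallelism  # e.g. 32 executors * 12 cores
print(current_parts, total_cores)

# Aim for a partition count that keeps all cores busy without producing
# thousands of tiny CSV files.
df.repartition(total_cores).write.option('header', 'true').csv('output_path')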
Since you are using Databricks, can you try using the databricks spark-csv package and let us know?
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')

Cannot run a parallel grid search in H2O

I'm trying to use the new "parallelism" option of H2O's grid search to tune the hyperparameters of a GBM model with 3-fold cross-validation. However, the search is failing, or rather just stopping, as soon as the first batch of models is built.
A bit of context: I am submitting this grid search job to an H2O cluster on a remote Hadoop server. I'm creating the cluster with, say, 5 nodes. Here's an example: hadoop jar /usr/local/h2o/bin28/h2odriver.jar -nodes 5 -mapperXmx 30g -baseport 54364 -disown. I have an indicator column for the fold assignment.
With parallelism = 0, the grid search starts with 5 models in parallel (building 2 CV models for each first, and then the 3rd CV model once those are done). As soon as these 5 models complete, the search just finishes.
The grid search works fine if I run it sequentially with parallelism turned off, but I can't figure out why it won't work with parallelism.
I would appreciate any help with this.
Thank you!
EDIT:
Correction - it looks like the "parallelism = 1" option isn't working either. The search just stops after one model. This was not an issue with the previous version of H2O - v3.26.03.
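For reference, the kind of call being described looks roughly like the sketch below (h2o-py). The hyperparameters, column names, file paths, and the exact placement of the parallelism argument are illustrative assumptions based on recent releases, not code from the question:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.connect(ip="remote-host", port=54364)         # hypothetical cluster address

train = h2o.import_file("train_with_folds.csv")   # hypothetical data with a fold-indicator column

hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}

grid = H2OGridSearch(model=H2OGradientBoostingEstimator(ntrees=200),
                     hyper_params=hyper_params,
                     parallelism=0)                # 0 = adaptive parallelism, as described above

grid.train(x=["f1", "f2"], y="target",
           training_frame=train,
           fold_column="fold_id")                  # 3-fold CV via the indicator column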

ML model update in spark streaming

I have persisted a machine learning model in HDFS via a Spark batch job, and I am consuming it in my Spark Streaming job. Basically, the ML model is broadcast to all executors from the Spark driver.
Can someone suggest how I can update the model in real time without stopping the Spark Streaming job? Basically, a new ML model will get created as and when more data points become available, but I have no idea how the NEW model should be sent to the Spark executors.
Please post some sample code as well.
Regards,
Deepak.
The best approach is probably to update the model on each batch. Since you would rather not reload it too often, you probably want to check whether you actually need to load the model again and skip the load if possible.
In your case of a model stored on HDFS, you can just check for a new timestamp on the model file (or for a new model appearing in a directory) before updating the variable holding the loaded model.
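Since sample code was requested, here is a minimal sketch of that timestamp check with PySpark DStreams. The paths, the model class, and the helper names are illustrative assumptions, not from the question:
from pyspark.ml import PipelineModel

model_path = "hdfs:///models/current"        # hypothetical model location

# Driver-side holder for the currently loaded model and its timestamp.
state = {"model": None, "mtime": -1}

def hdfs_mtime(sc, path):
    # Read the modification time of the model path via the JVM Hadoop FileSystem API.
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return fs.getFileStatus(jvm.org.apache.hadoop.fs.Path(path)).getModificationTime()

def score_batch(rdd):
    sc = rdd.context
    mtime = hdfs_mtime(sc, model_path)
    if mtime > state["mtime"]:               # a newer model has been written by the batch job
        state["model"] = PipelineModel.load(model_path)
        state["mtime"] = mtime
    # ... score this batch with state["model"] ...

# stream.foreachRDD(score_batch)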

Spark streaming get pre computed features

I am trying to use Spark Streaming to process an order stream, and I have some previously computed features for the buyer_id of each order in the stream.
I need to get these features while the Spark Streaming is running.
Now, I store the buyer_id features in a Hive table, load them into an RDD, and then use
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
Another way to deal with this might be to save the features into an HBase table and fire a get for every buyer_id.
Which one is better? Or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, that operation will be repeated at every batch interval.
If you load from Hive each time, you should seriously consider the overhead and the possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program into a broadcast variable or, even better, into a final variable. Either that, or create an RDD before the DStream and keep it as a reference (which looks like what you are doing now), but remember to cache it (assuming you have enough space).
If you actually do need to read it at streaming time (for example, because you receive your query key from the stream), then try to do it once per partition inside foreachPartition and save the result in a local variable.
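A small sketch of the first option in PySpark terms (the question's code is Scala, but the idea is the same; load_buyer_features and order_stream are hypothetical names mirroring the question): load and cache the features once, before the streaming context starts, and reuse them in every batch.
# Load the pre-computed features once, on the driver, before streaming starts.
buyer_features = load_buyer_features()      # hypothetical loader -> RDD of (buyer_id, features)
buyer_features.cache()                      # keep it in memory so every batch reuses it

# Join each micro-batch against the cached RDD.
joined_stream = order_stream.transform(lambda rdd: rdd.leftOuterJoin(buyer_features))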
