PyTorch Lightning - How can I output a summary of training to the console

PyTorch Lightning can log to TensorBoard. How can I make it log a table to the console summarizing the training runs, similar to Hugging Face's Transformers (shown below)?
Epoch  Training Loss  Validation Loss  Runtime    Samples Per Second
1      1.220600       1.160322         39.574900  272.496000
2      0.945200       1.121690         39.706000  271.596000
3      0.773000       1.157358         39.734000  271.405000

You can write your own callback and add it to the trainer.
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_info

class LoggingCallback(pl.Callback):
    def on_validation_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
        rank_zero_info("***** Validation results *****")
        metrics = trainer.callback_metrics
        # Log all callback metrics, skipping the internal "log" and "progress_bar" entries
        for key in sorted(metrics):
            if key not in ["log", "progress_bar"]:
                rank_zero_info("{} = {}\n".format(key, str(metrics[key])))

Related

ljspeech Hugging Face examples not working

When trying to run the ljspeech example, I get the following error, even though the model is moved to the only GPU in the system. I am using CUDA 11.7, PyTorch 1.13.1, and Fairseq 0.12.2.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
The code used:
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
import IPython.display as ipd
import torch

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/fastspeech2-en-ljspeech",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)
model = models[0].to(torch.device('cuda'))
models[0] = model
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator(models, cfg)

text = "Hello, this is a test run."
sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
ipd.Audio(wav, rate=rate)
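There is no accepted fix in this thread, but the error names an index_select whose index tensor is on the CPU, which suggests the sample built by get_model_input stays on the CPU while the model sits on cuda:0. A hedged sketch of moving the sample over with fairseq's move_to_cuda helper (assuming it behaves this way in Fairseq 0.12):
from fairseq import utils

# Assumption: the sample is a nested dict of CPU tensors and
# utils.move_to_cuda recursively moves them to the default CUDA device
sample = TTSHubInterface.get_model_input(task, text)
sample = utils.move_to_cuda(sample)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)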

Loading multiple CSV files (silos) to compose a TensorFlow Federated dataset

I am working on preprocessed data that was already siloed into separate CSV files, representing separate local data for federated learning.
To correctly implement federated learning with these multiple CSVs on TensorFlow Federated, I am first trying to reproduce the approach with a toy example on the Iris dataset. However, when trying to use the method tff.simulation.datasets.TestClientData, I am getting the error:
TypeError: can't pickle _thread.RLock objects
The current code is as follows. First, load the three Iris dataset CSV files (50 samples each) into a dictionary from the filenames iris1.csv, iris2.csv, and iris3.csv:
silos = {}
for silo in silos_files:  # silos_files and silos_path are defined elsewhere
    silo_name = silo.replace(".csv", "")
    silos[silo_name] = pd.read_csv(silos_path + silo)
    silos[silo_name]["variety"].replace({"Setosa": 0, "Versicolor": 1, "Virginica": 2}, inplace=True)
Then, create a new dict of TensorFlow datasets:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = tf.data.Dataset.from_tensor_slices(
        (silo.drop(columns=["variety"]).values, silo["variety"].values))
Finally, try to convert the TensorFlow dataset dict into a TensorFlow Federated dataset:
tff_dataset = tff.simulation.datasets.TestClientData(
    silos_tf
)
That raises the error:
TypeError Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2 silos_tf
3 )
/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
59 """
60 py_typecheck.check_type(tensor_slices_dict, dict)
---> 61 tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
62 structures = list(tensor_slices_dict.values())
63 example_structure = structures[0]
...
/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
167 reductor = getattr(x, "__reduce_ex__", None)
168 if reductor:
--> 169 rv = reductor(4)
170 else:
171 reductor = getattr(x, "__reduce__", None)
TypeError: can't pickle _thread.RLock objects
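The failing step is visible in the traceback: TestClientData deepcopies the input dict, and a tf.data.Dataset apparently holds a thread lock that cannot be pickled, so the deepcopy fails. A minimal reproduction sketch (assuming TensorFlow 2.8, as in the question):
import copy
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0])
copy.deepcopy(ds)  # TypeError: can't pickle _thread.RLock objects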
I also tried using a plain Python dictionary instead of an OrderedDict, but the error is the same. For this experiment, I am using Google Colab with this notebook as a reference, running TensorFlow 2.8.0 and TensorFlow Federated 0.20.0. I also used these previous questions as references:
Is there a reasonable way to create tff clients datat sets?
'tensorflow_federated.python.simulation' has no attribute 'FromTensorSlicesClientData' when using tff-nightly
I am not sure whether this approach generalizes beyond the toy example, so any suggestion on how to bring already-siloed data into TFF tests would be appreciated.
I searched public code on GitHub that uses the class tff.simulation.datasets.TestClientData and found the following implementation (source here):
def to_ClientData(clientsData: np.ndarray, clientsDataLabels: np.ndarray,
                  ds_info, is_train=True) -> tff.simulation.datasets.TestClientData:
    """Transform dataset to be fed to fedjax

    :param clientsData: dataset for each client
    :param clientsDataLabels: labels for each client's dataset
    :param ds_info: dataset information
    :param is_train: True if processing the train split
    :return: dataset for each client cast into TestClientData
    """
    num_clients = ds_info['num_clients']
    client_data = collections.OrderedDict()
    for i in range(num_clients if is_train else 1):
        client_data[str(i)] = collections.OrderedDict(
            x=clientsData[i],
            y=clientsDataLabels[i])
    return tff.simulation.datasets.TestClientData(client_data)
I understood from this snippet that the tff.simulation.datasets.TestClientData class requires an OrderedDict of NumPy arrays as its argument, instead of a dict of tensors (as in my previous implementation), so I changed the code to the following:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = collections.OrderedDict(
        x=silo.drop(columns=["variety"]).values,
        y=silo["variety"].values)
Followed by:
tff_dataset = tff.simulation.datasets.TestClientData(
    silos_tf
)
That runs correctly, producing the following output:
>>> tff_dataset.client_ids
['iris3', 'iris1', 'iris2']
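As a sanity check (a sketch, not from the original post), each client's data can be materialized back into a tf.data.Dataset through the standard ClientData accessor:
# create_tf_dataset_for_client is part of the tff ClientData interface
client_ds = tff_dataset.create_tf_dataset_for_client('iris1')
for example in client_ds.take(1):
    print(example)  # OrderedDict with 'x' (features) and 'y' (label) tensors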

pycave kmeans-gpu: disable model trainer outputs from being printed

This is a code snippet to run KMeans using GPU.
Documentation link: https://pycave.borchero.com/sites/generated/clustering/kmeans/pycave.clustering.KMeans.html
import pandas as pd
import torch
from pycave.clustering import KMeans

X = torch.cat([
    torch.randn(1000, 6) - 5,
    torch.randn(1000, 6),
    torch.randn(1000, 6) + 5,
])
estimator = KMeans(num_clusters=3, trainer_params=dict(gpus=1,
                                                       enable_progress_bar=0,
                                                       max_epochs=100))
labels = estimator.fit_predict(X).numpy()
pd.value_counts(labels)
The issue is how to disable the console output from the estimator.
Current Output:
Running initialization...
{'batch_size': 3000, 'collate_fn': <function collate_tensor at 0x000002BE21221700>}
Fitting K-Means...
{'batch_size': 3000, 'collate_fn': <function collate_tensor at 0x000002BE21221700>}
{'batch_size': 1, 'sampler': None, 'batch_sampler': <pytorch_lightning.overrides.distributed.IndexBatchSamplerWrapper object at 0x000002BE593A55B0>, 'collate_fn': <function collate_tensor at 0x000002BE21221700>, 'shuffle': False, 'drop_last': False}
0 1000
2 1000
1 1000
dtype: int64
Expected Output:
0 1000
2 1000
1 1000
dtype: int64
Info regarding the trainer_params parameter, from the documentation:
(Optional[Dict[str, Any]]) -- Initialization parameters to use when initializing a PyTorch Lightning trainer. By default, it disables various stdout logs unless PyCave is configured to do verbose logging. Checkpointing and logging are disabled regardless of the log level.
The dictionaries that are printed should never be there; that's a bug in a dependency, resolved in the latest build.
As far as the PyCave logs are concerned (Running initialization... and Fitting K-Means...), you can turn them off easily by adding the following:
import logging
from pycave import set_logging_level
set_logging_level(logging.WARNING)
Note that set_logging_level(logging.WARNING) also turns off the progress bar and the model summary automatically so you don't have to set these flags explicitly.
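Putting both together, a minimal sketch of a quiet run (assuming the dependency build with the dictionary-printing bug fixed, and reusing X from the question):
import logging
import torch
from pycave import set_logging_level
from pycave.clustering import KMeans

set_logging_level(logging.WARNING)  # also disables the progress bar and model summary

X = torch.cat([torch.randn(1000, 6) - 5, torch.randn(1000, 6), torch.randn(1000, 6) + 5])
estimator = KMeans(num_clusters=3, trainer_params=dict(gpus=1, max_epochs=100))
labels = estimator.fit_predict(X).numpy()  # only the cluster labels, no console chatter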

Limit(n) vs Show(n) performance disparity in Pyspark

Trying to get a deeper understanding of how Spark works, I was playing around with the PySpark CLI (2.4.0). I was looking for the difference between limit(n).show() and show(n), and I ended up getting two very different performance times for two very similar queries. Below are the commands I ran. The Parquet file referenced in the code below has about 50 columns and is over 50 GB in size on remote HDFS.
# Create dataframe
>>> df = sqlContext.read.parquet('hdfs://hdfs.host/path/to.parquet')
# Create test1 dataframe
>>> test1 = df.select('test_col')
>>> test1.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test1.explain()
== Physical Plan ==
*(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
Batched: false,
Format: Parquet,
Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
PartitionCount: 25,
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<test_col:array<bigint>>
# Create test2 dataframe
>>> test2 = df.select('test_col').limit(5)
>>> test2.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test2.explain()
== Physical Plan ==
CollectLimit 5
+- *(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
Batched: false,
Format: Parquet,
Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
PartitionCount: 25,
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<test_col:array<bigint>>
Notice that the physical plans are almost identical for test1 and test2. The only exception is that test2's plan starts with "CollectLimit 5". After setting this up, I ran test1.show(5) and test2.show(5). Test 1 returned the results instantaneously. Test 2 showed a progress bar with 2010 tasks and took about 20 minutes to complete (I only had one executor).
Question
Why did test 2 (with limit) perform so poorly compared to test 1 (without limit)? The data set and result set were identical and the physical plan was nearly identical.
Keep in mind:
show() is an alias for show(20) and relies internally on take(n: Int): Array[T]
limit(n: Int) returns another dataset and is an expensive operation that reads the whole source
limit results in a new DataFrame and takes longer because predicate pushdown is currently not supported by your input file format: Spark therefore reads the entire dataset and only then applies the limit.
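To see the difference in the same session, a small sketch (reusing df from the question; the timings are environment-dependent):
# show(5) evaluates take-style: it scans partitions incrementally and
# stops as soon as 5 rows have been collected
df.select('test_col').show(5)

# limit(5) produces a new DataFrame whose plan starts with CollectLimit 5;
# here the whole source is read before the limit is applied
df.select('test_col').limit(5).show()

# Comparing the plans makes the extra CollectLimit node visible
df.select('test_col').explain()
df.select('test_col').limit(5).explain()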

GridSearch with XGBoost producing deprecation error in an infinite loop

I am trying to do hyperparameter tuning using GridSearchCV on XGBoost, but I'm getting the following error:
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:
This keeps on running forever. Given below is the code.
import xgboost as xgb
from sklearn.grid_search import GridSearchCV

classifier = xgb.XGBClassifier()

n_estimators = [10, 50, 100, 150, 200, 250, 300]
max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10]
learning_rate = [0.1, 0.01, 0.09, 0.08, 0.07, 0.001]
colsample_bytree = [0.5, 0.6, 0.7, 0.8, 0.9]
min_child_weight = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gamma = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1]
subsample = [0.5, 0.6, 0.7, 0.8, 0.9]

param_grid = dict(n_estimators=n_estimators, max_depth=max_depth,
                  learning_rate=learning_rate, colsample_bytree=colsample_bytree,
                  min_child_weight=min_child_weight, gamma=gamma, subsample=subsample)

grid = GridSearchCV(classifier, param_grid, cv=10, scoring='accuracy')
grid.fit(X, Y)

grid.grid_scores_
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

# Predicting the Test set results
Y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
I am using Python 3.5; XGBoost and the grid search library have already been preloaded. I am running this on Google Colaboratory.
Please suggest what is going wrong?
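For scale, a quick count of the grid as written (an aside, not from the original thread) shows why the search can appear to run forever:
# Sizes of the parameter lists defined above:
# n_estimators, max_depth, learning_rate, colsample_bytree,
# min_child_weight, gamma, subsample
sizes = [7, 9, 6, 5, 10, 8, 5]
combos = 1
for s in sizes:
    combos *= s
print(combos)       # 756000 parameter combinations
print(combos * 10)  # 7560000 model fits with cv=10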
