pycave kmeans-gpu disable model trainer outputs from being printed - pytorch-lightning

This is a code snippet to run KMeans using the GPU.
Documentation link: https://pycave.borchero.com/sites/generated/clustering/kmeans/pycave.clustering.KMeans.html
import pandas as pd
import torch
from pycave.clustering import KMeans

X = torch.cat([
    torch.randn(1000, 6) - 5,
    torch.randn(1000, 6),
    torch.randn(1000, 6) + 5,
])

estimator = KMeans(
    num_clusters=3,
    trainer_params=dict(
        gpus=1,
        enable_progress_bar=0,
        max_epochs=100,
    ),
)
labels = estimator.fit_predict(X).numpy()
pd.value_counts(labels)
The issue is how to disable the console output from the estimator.
Current Output:
Running initialization...
{'batch_size': 3000, 'collate_fn': <function collate_tensor at 0x000002BE21221700>}
Fitting K-Means...
{'batch_size': 3000, 'collate_fn': <function collate_tensor at 0x000002BE21221700>}
{'batch_size': 1, 'sampler': None, 'batch_sampler': <pytorch_lightning.overrides.distributed.IndexBatchSamplerWrapper object at 0x000002BE593A55B0>, 'collate_fn': <function collate_tensor at 0x000002BE21221700>, 'shuffle': False, 'drop_last': False}
0 1000
2 1000
1 1000
dtype: int64
Expected Output:
0 1000
2 1000
1 1000
dtype: int64
Info regarding the trainer_params parameter (Optional[Dict[str, Any]]):
Initialization parameters to use when initializing a PyTorch Lightning trainer. By default, it disables various stdout logs unless PyCave is configured to do verbose logging. Checkpointing and logging are disabled regardless of the log level.
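For reference, those flags can also be set explicitly through trainer_params (a sketch; enable_progress_bar and enable_model_summary are standard PyTorch Lightning Trainer arguments, not PyCave-specific ones):
estimator = KMeans(
    num_clusters=3,
    trainer_params=dict(
        gpus=1,
        max_epochs=100,
        enable_progress_bar=False,   # hide the Lightning progress bar
        enable_model_summary=False,  # hide the model summary table
    ),
)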

The dictionaries that are printed should never be there; that's a bug in a dependency which has been resolved in the latest build.
As far as the PyCave logs are concerned (Running initialization... and Fitting K-Means...), you can turn them off easily by adding the following:
import logging
from pycave import set_logging_level
set_logging_level(logging.WARNING)
Note that set_logging_level(logging.WARNING) also turns off the progress bar and the model summary automatically so you don't have to set these flags explicitly.
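Putting it together, a quieter version of the original snippet might look like this (a minimal sketch reusing the X defined above; set_logging_level is the PyCave helper shown above):
import logging

from pycave import set_logging_level
from pycave.clustering import KMeans

set_logging_level(logging.WARNING)  # silences the PyCave log lines, progress bar and model summary

estimator = KMeans(num_clusters=3, trainer_params=dict(gpus=1, max_epochs=100))
labels = estimator.fit_predict(X).numpy()  # returns only the cluster assignments, no console chatter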

Related

How to not exit a for loop inside a pytest when a few items fail

I would like to run the test for all the items in the for loop. The test should fail at the end, but it should first run for all the elements in the for loop.
The code looks like this:
@pytest.fixture
def library():
    return Library(spec_dir=service_spec_dir)

@pytest.fixture
def services(library):
    return list(library.service_map.keys())

def test_properties(library, services):
    for service_name in services:
        model = library.models[service_name]
        proxy = library.get_service(service_name)
        if len(model.properties) != 0:
            for prop in model.properties:
                try:
                    method = getattr(proxy, f'get_{prop.name}')
                    method()
                except Exception as ex:
                    pytest.fail(str(ex))
The above code fails if one property of one service fails. I am wondering if there is a way to run the test for all the services and get a list of failed cases for all of them.
I tried parametrize, but based on this Stack Overflow discussion, the parameter list has to be resolved during the collection phase, and in our case the library is loaded during the execution phase. Hence I am also not sure whether it can be parametrized.
The goal is to run all the services and their properties and get the list of failed items at the end.
I moved the variables to the global scope. I can parametrize the test now:
library = Library(spec_dir=service_spec_dir)
service_names = list(library.service_map.keys())

@pytest.mark.parametrize("service_name", service_names)
def test_properties(service_name):
    pass
Don't use pytest.fail, but pytest_check.check instead.
The point of fail is that it stops test execution on a condition, while check is made to collect how many cases failed.
import logging

import pytest
import pytest_check as check

def test_000():
    li = [1, 2, 3, 4, 5, 6]
    for i in li:
        logging.info(f"Test still running. i = {i}")
        if i % 2 > 0:
            check.is_true(False, msg=f"value of i is odd: {i}")
Output:
tests/main_test.py::test_000
-------------------------------- live log call --------------------------------
11:00:05 INFO Test still running. i = 1
11:00:05 INFO Test still running. i = 2
11:00:05 INFO Test still running. i = 3
11:00:05 INFO Test still running. i = 4
11:00:05 INFO Test still running. i = 5
11:00:05 INFO Test still running. i = 6
FAILED [100%]
================================== FAILURES ===================================
__________________________________ test_000 ___________________________________
FAILURE: value of i is odd: 1
assert False
FAILURE: value of i is odd: 3
assert False
FAILURE: value of i is odd: 5
assert False
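Applied to the question's loop, the same idea would collect every failing property instead of stopping at the first one (a sketch; the library and services fixtures, Library, service_spec_dir and the get_<name> accessors are assumed from the question):
import pytest_check as check

def test_properties(library, services):
    for service_name in services:
        proxy = library.get_service(service_name)
        for prop in library.models[service_name].properties:
            try:
                getattr(proxy, f'get_{prop.name}')()
            except Exception as ex:
                # record the failure but keep checking the remaining properties and services
                check.is_true(False, msg=f"{service_name}.get_{prop.name} raised {ex!r}")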

PyTorch Lightning - How can I output a summary of training to the console

PyTorch Lightning can log to TensorBoard. How can I make it log to the console a table summarizing the training runs (similar to Huggingface's Transformers, shown below):
Epoch Training Loss Validation Loss Runtime Samples Per Second
1 1.220600 1.160322 39.574900 272.496000
2 0.945200 1.121690 39.706000 271.596000
3 0.773000 1.157358 39.734000 271.405000
You can write your own callback and add it to the trainer.
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_info

class LoggingCallback(pl.Callback):
    def on_validation_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
        rank_zero_info("***** Test results *****")
        metrics = trainer.callback_metrics
        # Log the collected metrics, skipping the internal log/progress_bar entries
        for key in sorted(metrics):
            if key not in ["log", "progress_bar"]:
                rank_zero_info("{} = {}\n".format(key, str(metrics[key])))
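Registering the callback then looks roughly like this (a minimal sketch; MyLightningModule and the max_epochs value are placeholders, not part of the original answer):
model = MyLightningModule()  # placeholder for your own LightningModule
trainer = pl.Trainer(max_epochs=3, callbacks=[LoggingCallback()])
trainer.fit(model)  # the callback prints the metrics after every validation run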

Limit(n) vs Show(n) performance disparity in Pyspark

Trying to get a deeper understanding of how Spark works, I was playing around with the PySpark CLI (2.4.0). I was looking for the difference between using limit(n).show() and show(n), and I ended up getting two very different performance times for two very similar queries. Below are the commands I ran. The parquet file referenced in the code below has about 50 columns and is over 50 GB in size on remote HDFS.
# Create dataframe
>>> df = sqlContext.read.parquet('hdfs://hdfs.host/path/to.parquet')
# Create test1 dataframe
>>> test1 = df.select('test_col')
>>> test1.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test1.explain()
== Physical Plan ==
*(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
       Batched: false,
       Format: Parquet,
       Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
       PartitionCount: 25,
       PartitionFilters: [],
       PushedFilters: [],
       ReadSchema: struct<test_col:array<bigint>>
# Create test2 dataframe
>>> test2 = df.select('test_col').limit(5)
>>> test2.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test2.explain()
== Physical Plan ==
CollectLimit 5
+- *(1) Project [test_col#40]
   +- *(1) FileScan parquet [test_col#40]
          Batched: false,
          Format: Parquet,
          Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
          PartitionCount: 25,
          PartitionFilters: [],
          PushedFilters: [],
          ReadSchema: struct<test_col:array<bigint>>
Notice that the physical plan is almost identical for both test1 and test2. The only exception is that test2's plan starts with "CollectLimit 5". After setting this up, I ran test1.show(5) and test2.show(5). Test 1 returned the results instantaneously. Test 2 showed a progress bar with 2010 tasks and took about 20 minutes to complete (I only had one executor).
Question
Why did test 2 (with limit) perform so poorly compared to test 1 (without limit)? The data set and result set were identical and the physical plan was nearly identical.
Keep in mind:
show() is an alias for show(20) and relies internally on take(n: Int): Array[T]
limit(n: Int) returns another dataset and is an expensive operation that reads the whole source
limit results in a new DataFrame and takes longer because predicate pushdown is currently not supported for your input file format; hence the entire dataset is read before the limit is applied.
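In practice, when only a few rows are needed for inspection, calling show/take on the un-limited projection avoids the extra CollectLimit stage observed above (a sketch against the same df):
# Fast here: show()/take() stop as soon as enough rows have been produced
df.select('test_col').show(5)
rows = df.select('test_col').take(5)

# Slow here: limit() builds a new DataFrame (CollectLimit) and, without predicate
# pushdown for this input, ends up scanning far more of the source first
df.select('test_col').limit(5).show()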

How can you decode output sequences from TFGPT2Model?

I'm trying to get generated text from the TFGPT2Model in the Transformers library. I can see the output tensor, but I'm not able to decode it. Is the tokenizer not compatible with the TF model for decoding?
Code is:
import tensorflow as tf
from transformers import (
    TFGPT2Model,
    GPT2Tokenizer,
    GPT2Config,
)

model_name = "gpt2-medium"
config = GPT2Config.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = TFGPT2Model.from_pretrained(model_name, config=config)
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute",
                                         add_special_tokens=True))[None, :]  # Batch size 1
outputs = model(input_ids)
print(outputs[0])
result = tokenizer.decode(outputs[0])
print(result)
The resulting output is:
$ python run_tf_gpt2.py
2020-04-16 23:43:11.753181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-16 23:43:11.777487: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-16 23:43:27.617982: W tensorflow/python/util/util.cc:319] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2020-04-16 23:43:27.693316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-16 23:43:27.824075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
...
2020-04-16 23:43:38.149860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10565 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:25:00.0, compute capability: 3.7)
2020-04-16 23:43:38.150217: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-16 23:43:38.150913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10565 MB memory) -> physical GPU (device: 2, name: Tesla K80, pci bus id: 0000:26:00.0, compute capability: 3.7)
2020-04-16 23:43:44.438587: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
tf.Tensor(
[[[ 0.671073 0.60760975 -0.10744217 ... -0.51132596 -0.3369941
0.23458953]
[ 0.6403012 0.00396247 0.7443729 ... 0.2058892 -0.43869907
0.2180479 ]
[ 0.5131284 -0.35192695 0.12285632 ... -0.30060387 -1.0279727
0.13515341]
[ 0.3083361 -0.05588413 1.0543617 ... -0.11589152 -1.0487361
0.05204075]
[ 0.70787597 -0.40516227 0.4160383 ... 0.44217822 -0.34975922
0.02535546]
[-0.03940453 -0.1243843 0.40204537 ... 0.04586177 -0.48230025
0.5768887 ]]], shape=(1, 6, 1024), dtype=float32)
Traceback (most recent call last):
File "run_tf_gpt2.py", line 19, in <module>
result = tokenizer.decode(outputs[0])
File "/home/.../transformers/src/transformers/tokenization_utils.py", line 1605, in decode
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
File "/home/.../transformers/src/transformers/tokenization_utils.py", line 1575, in convert_ids_to_tokens
index = int(index)
File "/home/.../venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 853, in __int__
return int(self._numpy())
TypeError: only size-1 arrays can be converted to Python scalars
(I removed all the TF messages and modified paths of my environment)
Apparently, you are using the wrong GPT-2 model. I tried your example using TFGPT2LMHeadModel, which is the same transformer, just with a language modeling head on top; it also returns prediction_scores. In addition, you need to use model.generate(input_ids) in order to get an output that can be decoded. By default, a greedy search is performed.
import tensorflow as tf
from transformers import (
TFGPT2LMHeadModel,
GPT2Tokenizer,
GPT2Config,
)
model_name = "gpt2-medium"
config = GPT2Config.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = TFGPT2LMHeadModel.from_pretrained(model_name, config=config)
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :] # Batch size 1
outputs = model.generate(input_ids=input_ids)
print(outputs[0])
result = tokenizer.decode(outputs[0])
print(result)
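If the greedy output is too repetitive, generate also accepts the usual sampling arguments (a sketch; the specific values are arbitrary examples, not part of the original answer):
# Sampling instead of greedy decoding; max_length, do_sample, top_k and top_p
# are standard arguments of transformers' generate()
outputs = model.generate(
    input_ids=input_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))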

GridSearch with XGBoost producing DeprecationWarning and an infinite loop

I am trying to do hyperparameter tuning using GridSearchCV on XGBoost, but I'm getting the following error.
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.if diff:
This keeps on running forever. Given below is the code.
classifier = xgb.XGBClassifier()
from sklearn.grid_search import GridSearchCV

n_estimators = [10, 50, 100, 150, 200, 250, 300]
max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10]
learning_rate = [0.1, 0.01, 0.09, 0.08, 0.07, 0.001]
colsample_bytree = [0.5, 0.6, 0.7, 0.8, 0.9]
min_child_weight = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gamma = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1]
subsample = [0.5, 0.6, 0.7, 0.8, 0.9]

param_grid = dict(
    n_estimators=n_estimators,
    max_depth=max_depth,
    learning_rate=learning_rate,
    colsample_bytree=colsample_bytree,
    min_child_weight=min_child_weight,
    gamma=gamma,
    subsample=subsample,
)

grid = GridSearchCV(classifier, param_grid, cv=10, scoring='accuracy')
grid.fit(X, Y)
grid.grid_scores_
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

# Predicting the Test set results
Y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
I am using Python 3.5; XGBoost and the grid search library have already been preloaded. I am running this on Google Colaboratory.
Please suggest what is going wrong?
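For a sense of scale, counting the candidates in the grid defined above (a rough sketch based on the parameter lists in the code) shows how much work cv=10 implies:
# 7 * 9 * 6 * 5 * 10 * 8 * 5 parameter combinations, each fitted on 10 CV folds
n_candidates = 7 * 9 * 6 * 5 * 10 * 8 * 5   # 756,000 combinations
n_fits = n_candidates * 10                  # 7,560,000 model fits in total
print(n_candidates, n_fits)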
