Ray Tune ; combine population based training schedule with Hyperopt - performance

Are Population Based Training (PBT) and HyperOpt Search combinable ?
The AsyncHyperBandScheduler is used in the Hyperopt Example of ray.tune
Here config set some parameters for the run() function
config = {
"num_samples": 10 if args.smoke_test else 1000,
"config": {
"iterations": 100,
},
"stop": {
"timesteps_total": 100
},
}
and space is the argument for the hyperparameter Space inside the Tune Search Algorithm with hyperopt functions:
space = {
"width": hp.uniform("width", 0, 20),
"height": hp.uniform("height", -100, 100),
"activation": hp.choice("activation", ["relu", "tanh"])
}
algo = HyperOptSearch()
scheduler = AsyncHyperBandScheduler()
run(easy_objective, search_alg=algo, scheduler=scheduler, **config)
yet in the population-based example with Keras the hyperparameter space is loosely given by hyperparam_mutations inside the Tune Trials Scheduler with numpy functions
pbt = PopulationBasedTraining(
hyperparam_mutations={
"dropout": lambda: np.random.uniform(0, 1),
"lr": lambda: 10**np.random.randint(-10, 0),
"rho": lambda: np.random.uniform(0, 1)
})
and config is used in a different way:
To set starting parameters for individuals of the population
run(MemNNModel,scheduler=pbt,
config={
"batch_size": 32,
"epochs": 1,
"dropout": 0.3,
"lr": 0.01,
"rho": 0.9
})
To summarize:
Hyperopt takes its the Hyperparameter Space inside the Search Algorithm with hyperopt functions
Population-Based Training takes its the Hyperparameter Space inside the Trials Scheduler with numpy functions
Both use Config differently
From this answer I take that, the Hyperopt Search will only take completed Trials into account.
Is there a conflict with sampled hyperparameters from Hyperopt Search and "changed while running" hyperparameters from Population-Based Training?
quote: "a single trial may see many different hyperparameters over its lifetime"

Related

Binary Image Classification - Validation loss is much higher than training loss

I´m facing a strange behaviour which I can´t figure out why it is happening. I´m getting a really high loss(BinaryCrossentropy) on my validation batch around 20 or even higher while training. But after the training I do a prediction on the tet set and I get a loss which is lower than 1. Why is that? I went through my code over and over and can´t find the problem.
I´m doing a binary image classification for brian tumors on a dataset provided via kaggle(Link.
And you can find my notebook here: Google-Colab Notebook
My data is loaded this way:
batch_size = 20
train_ds = tf.keras.utils.image_dataset_from_directory(
train_data_path,
subset='training',
seed=42,
color_mode='grayscale',
batch_size=batch_size,
validation_split=0.30
)
valid_ds = tf.keras.utils.image_dataset_from_directory(
train_data_path,
subset='validation',
seed=42,
batch_size=batch_size,
color_mode='grayscale',
validation_split=0.30
)
test_ds = tf.keras.utils.image_dataset_from_directory(
test_data_path,
color_mode='grayscale',
batch_size=batch_size,
shuffle=False
)
This is my modle strcuture
input_shape = image_batch[0].shape
# set up the model structure
model = tf.keras.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
layers.MaxPooling2D((2,2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.3),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.3),
layers.Flatten(),
tf.keras.layers.Dense(32, activation="relu"),
layers.Dropout(0.3),
layers.Dense(1, activation="sigmoid")
])
model.summary()
This is my callback function which returns the plots during training:
class PlotLearning(tf.keras.callbacks.Callback):
"""
Callback to plot the learning curves of the model during training.
"""
def on_train_begin(self, logs={}):
self.metrics = {}
for metric in logs:
self.metrics[metric] = []
def on_epoch_end(self, epoch, logs={}):
# Storing metrics
print(logs)
for metric in logs:
if metric in self.metrics:
self.metrics[metric].append(logs.get(metric))
else:
self.metrics[metric] = [logs.get(metric)]
# Plotting
metrics = [x for x in logs if 'val' not in x]
f, axs = plt.subplots(1, len(metrics), figsize=(15,5))
clear_output(wait=True)
for i, metric in enumerate(metrics):
axs[i].plot(range(1, epoch + 2),
self.metrics[metric],
label=metric)
if logs['val_' + metric]:
axs[i].plot(range(1, epoch + 2),
self.metrics['val_' + metric],
label='val_' + metric)
axs[i].legend()
axs[i].grid()
plt.tight_layout()
plt.show()
callbacks_list = [PlotLearning()]
and this is the part where I start the training
# compile model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer,
loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
metrics=['accuracy']
)
# fit model
history = model.fit(prep_train_ds,
epochs=30,
validation_data=valid_ds,
callbacks=callbacks_list)
This is the output of the callback function after the last epoch run through:
As you can see the loss is really high and oscillating around 20, so I guess it is overfitting.
But as mentiod above, here is what I get when I make a prediction on the test set and calculate the binary crossentropy. The loss is again less than 1 and at least in the range of the training loss
I tried so many things like, chaning batch size, bcs. not enough samples of one class might be in one batch. Then I wanted to see if it is overfitting and changed the number of filters, applyed droput etc. But I couldn´t get the loss function down on the validation set. I´m quite new in the field of image classification and maybe I´m oversseing a thing.

Why does the number of observations change the prediction of a sarimax model with fixed coefficients?

After training a sarimax model, I had hoped to be able to preform forecasts in future using it with new observations without having to retrain it. However, I noticed that the number of observations i use in the newly applied forecast change the predictions.
From my understanding, provided that enough observations are given to allow the autoregression and moving average to be calculated correctly, the model would not even use the earlier historic observations to inform itself as the coefficients are not being retrained. In a (3,0,1) example i would have thought it would need atleast 3 observations to apply its trained coefficients. However this does not seem to be the case and i am questioning whether i have understood the model correctly.
as an example and test, i have applied a trained sarimax to the exact same data with the initial few observations removed to test the effect of the number of rows on the prediction with the following code:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX, SARIMAXResults
y = [348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]
ynew = y[10:]
print(ynew)
model = SARIMAX(endog=y, order=(3,0,1))
model = model.fit()
print(model.params)
pred1 = model.predict(start=len(y), end = len(y)+7)
model2 = model.apply(ynew)
print(model.params)
pred2 = model2.predict(start=len(ynew), end = len(ynew)+7)
print(pd.DataFrame({'pred1': pred1, 'pred2':pred2}))
The results are as follows:
pred1 pred2
0 472.246996 472.711770
1 494.753955 495.745968
2 498.092585 499.427285
3 489.428531 490.862153
4 477.678527 479.035869
5 469.023243 470.239459
6 465.576002 466.673790
7 466.338141 467.378903
Based on this, it means that if I were to produce a forecast from a trained model with new observations, the change in the number of observations itself would impact the integrity of the forecast.
What is the explanation for this? What is the standard practice for applying a trained model on new observations given the change in the number of them?
If i wanted to update the model but could not control for whether or not i had all of the original observations from the very start of my training set, this test would indicate that my forecast might as well be random numbers.
Main issue
The main problem here is that you are not using your new results object (model2) for your second set of predictions. You have:
pred2 = model.predict(start=len(ynew), end = len(ynew)+7)
but you should have:
pred2 = model2.predict(start=len(ynew), end = len(ynew)+7)
If you fix this, you get very similar predictions:
pred1 pred2
0 472.246996 472.711770
1 494.753955 495.745968
2 498.092585 499.427285
3 489.428531 490.862153
4 477.678527 479.035869
5 469.023243 470.239459
6 465.576002 466.673790
7 466.338141 467.378903
To understand why they're not identical, there is a second issue (which is not a problem in your code, but just a statistical feature of your data/model).
Secondary issue
Your estimated parameters imply an extremely persistent model:
print(params)
gives
ar.L1 2.134401
ar.L2 -1.683946
ar.L3 0.549369
ma.L1 -0.874801
sigma2 1807.187815
with is associated with a near-unit-root process (largest eigenvalue
= 0.99957719).
What this means is that it takes a very long time for the effects of a particular datapoint on the forecast to die out. In your case, this just means that there are still small effects on the forecasts from the first 10 periods.
This isn't a problem, it's just the way this particular estimated model works.

Fitting Lightgbm distributed with lgb.train hangs

I'm trying to learn how to use lightgbm distributed.
I wrote a simple hello world kind of code where I use iris dataset with 150 rows, split it into train (100 rows) and test(50 rows). Then training the train test set are further split into two parts. Each part is fed into two machines with appropriate rank.
The problem I see is that lgb.train hangs.
Here is the code:
import argparse
import logging
import lightgbm as lgb
import pandas as pd
from sklearn import datasets
import socket
print('lightgbm', lgb.__version__)
HOST = socket.gethostname()
ip_address = socket.gethostbyname(HOST)
print("IP=", ip_address)
# looks like lightgbm operates only with ip addresses
IPS = ['10.121.22.166', '10.121.22.83']
assert ip_address in IPS
logger = logging.getLogger(__name__)
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 10000)
pd.set_option('max_colwidth', 100)
pd.set_option('precision', 5)
def read_train_data(rank):
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
partition = rank
assert partition < 2
separate = 100
train_df = iris_df.iloc[:separate]
test_df = iris_df.iloc[separate:]
separate_train = 60
separate_test = 30
if partition == 0:
train_df = train_df.iloc[:separate_train]
test_df = test_df.iloc[:separate_test]
else:
train_df = train_df.iloc[separate_train:]
test_df = test_df.iloc[separate_test:]
def get_lgb_dataset(df):
target_column = df.columns[-1]
columns = df.columns[:-1]
assert target_column not in columns
print('Target column', target_column)
x = df[columns]
y = df[target_column]
print(x)
ds = lgb.Dataset(free_raw_data=False, data=x, label=y, params={
"enable_bundle": False
})
ds.construct()
return ds
dtrain = get_lgb_dataset(train_df)
dtest = get_lgb_dataset(test_df)
return dtrain, dtest
def train(args):
port0 = 56456
rank = IPS.index(ip_address)
print("Rank=", rank, HOST)
print("RR", rank)
dtrain, dtest = read_train_data(rank=rank)
params = {'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.1,
'max_depth': 2,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 1,
'num_leaves': 31,
'objective': 'regression',
'metric': 'rmse',
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': False,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'tree_learner': 'data_parallel',
'num_threads': 48,
'machines': ','.join([f'{machine}:{port0}' for i, machine in enumerate(IPS)]),
'local_listen_port': port0,
'time_out': 120,
'num_machines': len(IPS)
}
print(params)
logging.info("starting to train lgb at node with rank %d", rank)
evals_result = {}
if args.scikit == 1:
print("Using scikit learn")
bst = lgb.sklearn.LGBMRegressor(**params)
bst.fit(
dtrain.data,
dtrain.label,
eval_set=[(dtest.data, dtest.label)],
)
else:
print("Using regular LGB")
bst = lgb.train(params,
dtrain,
valid_sets=[dtest],
evals_result=evals_result)
print(evals_result)
logging.info("finish xgboost training at node with rank %d", rank)
return bst
def main(args):
logging.info("starting the train job")
model = train(args)
pd.set_option('display.max_rows', 500)
print("OUT", model.__class__)
try:
print(model.trees_to_dataframe())
except:
print(model.booster_.trees_to_dataframe())
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--scikit',
help='scikit',
default=0,
type=int,
)
main(parser.parse_args())
I can run it with the scikit fit interface by running: python simple_distributed_lgb_test.py --scikit 1
On the two machines. It produces a reasonable result.
However, when I use -- scikit 0 (which uses lgb.train), then fitting just hangs on both nodes. Last messages before it hangs:
[LightGBM] [Info] Total Bins 22
[LightGBM] [Info] Number of data points in the train set: 40, number of used features: 2
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Start training from score 0.873750
Is that a bug or an expected behavior? dask.py in lightgbm does use scikit learn fit interface.
I use an overnight master version 3.2.1.99. 5b7a6f3e7150aeb704d1dd2b852d246af3e913a3 tag to be exact from Jul 12.
UPDATE 1
I'm trying to dig into the code. So far I see few things:
scikit.train interface appears to have an extra syncronization step before fitting first tree. lgb.train doesn't have it. Dunno yet where it comes from. (I see some Network::Allreduce operations)
It appears that scikit.train has workers syncronized - each worker knows the correct sizes of the blocks to send and receive during reducescatter operations. For example one the first allreduce worker1 sends 208 blocks and receives 368 blocks of data (in Linkers::SendRecv), while worker2 is reversed - sends 368 and receives 208. So allreduce completes fine. ()
On the contrary, lgb.train has workers not syncronized - each worker has numbers for send and receive blocks during reducescatter at the first DataParallelTreeLearner::FindBestSplits encounter. But they don't match. Worker1 sends 208 abd wants to receive 400. Worker2 sends 192 and wants to receive 176. So, the worker that wants to receive more just hangs. The other worker eventually hangs too.
Possibly it has something to do with lgb.Dataset. That thing may need to have same bins or something. I tried to force it by forcedbins_filename parameter. But it doesn't seem to help with lgb.train.
UPDATE 2
Success. If I remove the following line from the example:
ds.construct()
Everything works. So I guess we can't use construct on Dataset when using distributed training.

pymc3 Multi-category Bayesian network - how to sample?

I have set up a Bayes net with 3 states per node as below, and can read logp's for particular states from it (as in the code).
Next I would like to sample from it. In the code below, sampling runs but I don't see distributions over the three states in the outputs; rather, I see a mean and variance as if they were continuous nodes. How do I get the posterior over the three states?
import numpy as np
import pymc3 as mc
import pylab, math
model = mc.Model()
with model:
rain = mc.Categorical('rain', p = np.array([0.5, 0. ,0.5]))
sprinkler = mc.Categorical('sprinkler', p=np.array([0.33,0.33,0.34]))
CPT = mc.math.constant(np.array([ [ [.1,.2,.7], [.2,.2,.6], [.3,.3,.4] ],\
[ [.8,.1,.1], [.3,.4,.3], [.1,.1,.8] ],\
[ [.6,.2,.2], [.4,.4,.2], [.2,.2,.6] ] ]))
p_wetgrass = CPT[rain, sprinkler]
wetgrass = mc.Categorical('wetgrass', p_wetgrass)
#brute force search (not working)
for val_rain in range(0,3):
for val_sprinkler in range(0,3):
for val_wetgrass in range(0,3):
lik = model.logp(rain=val_rain, sprinkler=val_sprinkler, wetgrass=val_wetgrass )
print([val_rain, val_sprinkler, val_wetgrass, lik])
#sampling (runs but don't understand output)
if 1:
niter = 10000 # 10000
tune = 5000 # 5000
print("SAMPLING:")
#trace = mc.sample(20000, step=[mc.BinaryGibbsMetropolis([rain, sprinkler])], tune=tune, random_seed=124)
trace = mc.sample(20000, tune=tune, random_seed=124)
print("trace summary")
mc.summary(trace)
answering own question: the trace does contain the discrete values but the mc.summary(trace) function is set up to compute continuous mean and variance stats. To make a histogram of the discrete states, use h = hist(trace.get_values(sprinkler)) :-)

Dirichlet vs Binomial in pymc3

I am having trouble sampling from a Dirichlet/Multinomial distribution with pymc3.
I tried to create a simple test-case to recreate a Beta/Binomial using Dirichlet/Multinomial with n=2, but I can't get it to work.
Below I have some code that works for Binomial but fails for Multinomial.
One of the obvious differences is that the Multinomial model is more constrained:
i.e. to start, rating is set to 10 in the Binomial model, and [10,10] in the Multinomial.
The pymc3 Dirichlet code does say "Only the first k-1 elements of x are expected" but only arrays of shape 2 seem to work in my code.
The output shows that num_friends and rating are being sampled in the Binomial case, but not in the Multinomial case. friends_ratings is being sampled in both. Thanks!
Oh, also Dirichlet('d', np.array([1,1])) crashes with "Floating point error 8". It only appears to fail when two integers of value 1 are passed in. np.array([1.,1.]) works.
import pymc as pm
import numpy as np
print "TEST BINOMIAL"
with pm.Model() as model:
friends_ratings = pm.Beta('friends_ratings', alpha=1, beta=2)
num_friends = pm.DiscreteUniform('num_friends', lower=0, upper=100)
rating = pm.Binomial('rating', n=num_friends, p=friends_ratings)
step = pm.Metropolis([num_friends, friends_ratings, rating])
start = {"friends_ratings":.5, "num_friends":20, 'rating':10}
tr = pm.sample(5, step, start=start, progressbar=False)
print "friends", [tr[i]['num_friends'] for i in range(len(tr))]
print "friends_ratings", [tr[i]['friends_ratings'] for i in range(len(tr))]
print "rating", [tr[i]['rating'] for i in range(len(tr))]
print "TEST DIRICHLET"
with pm.Model() as model:
friends_ratings = pm.Dirichlet('friends_ratings', np.array([1.,1.]), shape=2)
num_friends = pm.DiscreteUniform('num_friends', lower=0, upper=100)
rating = pm.Multinomial('rating', n=num_friends, p=friends_ratings, shape=2)
step = pm.Metropolis([num_friends, friends_ratings, rating])
start = {'friends_ratings': np.array([0.5,0.5]), 'num_friends': 20, 'rating': [10,10]}
tr = pm.sample(5, step, start=start, progressbar=False)
print "friends", [tr[i]['num_friends'] for i in range(len(tr))]
print "friends_ratings", [tr[i]['friends_ratings'] for i in range(len(tr))]
print "rating", [tr[i]['rating'] for i in range(len(tr))]
Output:
TEST BINOMIAL
friends [22.0, 24.0, 24.0, 23.0, 23.0]
friends_ratings [0.5, 0.5, 0.41, 0.41, 0.41]
ratingf [10.0, 11.0, 11.0, 11.0, 11.0]
TEST DIRICHLET
friends [20.0, 20.0, 20.0, 20.0, 20.0]
friends_ratings [array([ 0.51369621, 1.490608 ]), ... ]
rating [array([ 10., 10.]), array([ 10., 10.]), ... ]
PyMC3 does not automatically normalize the Dirichlet. So far you have to do this explicitly using simplextransform. See here for an example.
There is an issue of making this transform automatic though: https://github.com/pymc-devs/pymc3/issues/315
EDIT (9/14/2015): PyMC3 now automatically transforms the dirichlet distribution (as any other distribution). So you don't need to specify that manually anymore.

Resources