I have a challenge and I am trying to solve this in order to move forward, it is the final piece of the puzzle for my model operations.
What I am trying to do?:*
is verify the videos that are being used in the Xval_test variable via the split operations here as per the example via here In Python sklearn, how do I retrieve the names of samples/variables in test/training data? :
X_train, Xval_test, Y_train, Yval_test = train_test_split(
X, Y, train_size=0.8, test_size=0.2, random_state=1, shuffle=True)
1.
What I tried?:
is calling the name from the actual tag via file_path name, however that is not working. (every time the code runs the names from the file path are taken and not from the actual split operations Xval_test variable. This causes an issue during the model.fit() procedures as it changes the 1D flattened tensor to (a number of rows, 1 column)
file_paths = []
for file_name in os.listdir(root):
file_path = os.path.join(root, file_name)
if os.path.isfile(file_path):
file_paths.append(file_path)
print('**********************************************************')
print('ALL Directory File Paths Completed', file_paths)
I am not sure if the files are being shuffled properly with my weak attempt as per the guidelines from the split() forum. (based on my knowledge, every time I run the code, those files would be shuffled to a new Xval_test set relative to it's specified split parameter 80:20.
2.
I tried calling the model.predict(), that presents no labels for which I was hoping that it did (maybe I am using it the wrong way for calling the indices, I don't know).
my_pred = model.predict(Xval_test).argmax(axis=1)
I tried calling np.argsmax():( I KNOW THE TOTAL AMT OF FILES IN Xval_test is 16 based on the split())
Y_valpred = np.argmax(model.predict(Xval_test), axis=1) # model
This returns just the class label and not it's contents e.g. the classes in the datastore are folders containing (walking and fencing) rather than the actual video labels such as (walking0.avi....100/n and fencing0.avi.....100n/) !!!???!
I am not certain of the operation to get the folder content's tags, the actual file itself. It is this that I am trying to get from the X_test variable.
(or maybe its the wrong variable or functioning I using, again I am lacking the knowledge to understand this, please assist so that I can move to the next stage).
3.
I tried printing all of the variable's from the previous operations to see where that name tag would be stored and it is stored in the name variable below as per my operations: (but how do I call these folder content's file tags forward to the X_test variable or as per my choice the model.predict() outputs in a column together with the other metrics. So far, this causes issues with the model.fit() function???)
for files3 in files2:
name = os.path.join(namelist, files3)
name1 = name.strip("./dataset/")
name2 = name1.strip("Fencing/")
name3 = name2.strip("Stabing/")
name3 = name3.replace('.av', '')
name4 = name3.split()
# print("This is name1 ", name1)
# name5 = pd.DataFrame({"vid_names": name4}).to_csv("results.csv")
# name1 = name1.replace('[]', '')
with open('vid_names.csv', 'a',newline='') as f:
writer = csv.writer(f)
writer = writer.writerow(name4)
# print("My Video Names => ", name3)
3A.
Thank you in advance, I am grateful for any guidance provided, Please assist!
QUESTIONS: ############################################
Ques: 1.
Is it possible to see what video label tags are segmented within the X_Test Variable?
Ques: 1A.
If yes, may I request your guidance here, please, on how this can be done?:
I have been researching for weeks and cannot seem to get this sorted, your efforts would be greatly appreciated.
Ques: 2. MY Expected OUTCOME:
I am trying to access the prediction. So, In the end I would get an output relative to the actual video tag that insinuates the actual video that was used in the prediction operation along with its class tag (see below):
Initially, the model.predict() operations outputs numerical data relative to the class label.
I am trying to access the actual file label as well:
For example, what I want the predictions to look like is as follows:
X_test_labs Pred_labs Actual_File Pred_Score
0 Fencing Fencing fencing0.avi 0.99650866
1 Walking Fencing walking6.avi 0.9948837
2 Walking Walking walking21.avi 0.9967557
3 Fencing Fencing fencing32.avi 0.9930409
4 Walking Fencing walking43.avi 0.9961387
5 Walking Walking walking48.avi 0.6467387
6 Walking Walking walking50.avi 0.5465369
7 Walking Walking walking9.avi 0.3478027
8 Fencing Fencing fencing22.avi 0.1247543
9 Fencing Fencing fencing46.avi 0.7477777
10 Walking Walking walking37.avi 0.8499399
11 Fencing Fencing fencing19.avi 0.8887722
12 Walking Walking walking12.avi 0.7775351
13 Fencing Fencing fencing33.avi 0.4323323
14 Fencing Fencing fencing51.avi 0.7812434
15 Fencing Fencing fencing8.avi 0.8723476
I am not sure how to achieve this task, this one is a little more tricky for me than anticipated
This is my code*
'''*******Load Dependencies********'''
from keras.regularizers import l2
from keras.layers import Dense
from keras_tqdm import TQDMNotebookCallback
from tqdm.keras import TqdmCallback
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import math
import tensorflow as tf
from tqdm import tqdm
import videoto3d
import seaborn as sns
import scikitplot as skplt
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from keras.utils.vis_utils import plot_model
from keras.utils import np_utils
from tensorflow.keras.optimizers import Adam
from keras.models import Sequential
from keras.losses import categorical_crossentropy
from keras.layers import (Activation, Conv3D, Dense, Dropout, Flatten,MaxPooling3D)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import argparse
import time
import sys
import openpyxl
import os
import re
import csv
from keras import models
import cv2
import pickle
import glob
from numpy import load
np.seterr(divide='ignore', invalid='ignore')
print('**********************************************************')
print('Graphical Representation Of Accuracy & Validation Results Completed')
def plot_history(history, result_dir):
plt.plot(history.history['val_accuracy'], marker='.')
plt.plot(history.history['accuracy'], marker='.')
plt.title('model accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.grid()
plt.legend(['Val_acc', 'Test_acc'], loc='lower right')
plt.savefig(os.path.join(result_dir, 'model_accuracy.png'))
plt.close()
plt.plot(history.history['val_loss'], marker='.')
plt.plot(history.history['loss'], marker='.')
plt.title('model Loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.grid()
plt.legend(['Val_loss', 'Test_loss'], loc='upper right')
plt.savefig(os.path.join(result_dir, 'model_loss.png'))
plt.close()
# Saving History Accuracy & Validation Acuuracy Results To Directory
print('**********************************************************')
print('Generating History Acuuracy Results Completed')
def save_history(history, result_dir):
loss = history.history['loss']
acc = history.history['accuracy']
val_loss = history.history['val_loss']
val_acc = history.history['val_accuracy']
nb_epoch = len(acc)
# Creating The Results File To Directory = Store Results
print('**********************************************************')
print('Saving History Acuuracy Results To Directory Completed')
with open(os.path.join(result_dir, 'result.txt'), 'w') as fp:
fp.write('epoch\tloss\tacc\tval_loss\tval_acc\n')
# print(fp)
for i in range(nb_epoch):
fp.write('{}\t{}\t{}\t{}\t{}\n'.format(
i, loss[i], acc[i], val_loss[i], val_acc[i]))
print('**********************************************************')
print('Loading All Specified Video Data Samples From Directory Completed')
def loaddata(video_dir, vid3d, nclass, result_dir, color=False, skip=True):
files = os.listdir(video_dir)
with open('files.csv', 'w') as f:
writer = csv.writer(f)
writer.writerow(files)
root = '/Users/symbadian/3DCNN_latest_Version/3DCNNtesting/dataset/'
dirlist = [item for item in os.listdir(
root) if os.path.isdir(os.path.join(root, item))]
print('Get the filesname and path')
print('DIRLIST Directory Completed', dirlist)
file_paths = []
for file_name in os.listdir(root):
file_path = os.path.join(root, file_name)
if os.path.isfile(file_path):
file_paths.append(file_path)
print('**********************************************************')
print('ALL Directory File Paths Completed', file_paths)
roots, dirsy, fitte = next(os.walk(root), ([], [], []))
print('**********************************************************')
print('ALL Directory ROOTED', roots, fitte, dirsy)
X = []
print('X labels==>', X) # This stores all variable data in an object format
labellist = []
pbar = tqdm(total=len(files)) # generate progress bar for file processing
print('**********************************************************')
print('Generating/Join Class Labels For Video Dataset For Input Completed')
# Accessing files and labels from dataset directory
for filename in files:
pbar.update(1)
if filename == '.DS_Store':#.DS_Store
continue
namelist = os.path.join(video_dir, filename)
files2 = os.listdir(namelist)
###############################################################################
######### NEEDS TO FIX THIS Data Adding to CSV Rather Than REWRITTING #########
for files3 in files2:
name = os.path.join(namelist, files3)
#Call a function that extract the frames details of all file names
label = vid3d.get_UCF_classname(filename)
if label not in labellist:
if len(labellist) >= nclass:
continue
labellist.append(label)
# This X variable is the point where the lables are store (I think??!?!)
X.append(vid3d.video3d(name, color=color, skip=skip))
pbar.close()
# generating labellist/ writing to directory
print('******************************************************')
print('Saving All Class Labels For Referencing To Directory Completed')
with open(os.path.join(result_dir, 'classes.txt'), 'w') as fp:
for i in range(len(labellist)):
# print('These are labellist i classes',i) #Not This
fp.write('{}\n'.format(labellist[i]))
# print('These are my labels: ==>',mylabel)
for num, label in enumerate(labellist):
for i in range(len(labels)):
if label == labels[i]:
labels[i] = num
# print('This is labels i',labels[i]) #Not this
if color: # conforming image channels of image for input sequence
return np.array(X).transpose((0, 2, 3, 4, 1)), labels
else:
return np.array(X).transpose((0, 2, 3, 1)), labels
print('**********************************************************')
print('Generating Args Informative Messages/ Tuning Parameters Options Completed')
def main():
parser = argparse.ArgumentParser(description='A 3D Convolution Model For Action Recognition')
parser.add_argument('--batch', type=int, default=130)
parser.add_argument('--epoch', type=int, default=100)
parser.add_argument('--videos', type=str, default='dataset',help='Directory Where Videos Are Stored')# UCF101
parser.add_argument('--nclass', type=int, default= 2)
parser.add_argument('--output', type=str, required=True)
parser.add_argument('--color', type=bool, default=False)
parser.add_argument('--skip', type=bool, default=True)
parser.add_argument('--depth', type=int, default=10)
args = parser.parse_args()
# print('This is the Option Arguments ==>',args)
print('**********************************************************')
print('Specifying Input Size and Channels Completed')
img_rows, img_cols, frames = 32, 32, args.depth
channel = 3 if args.color else 1
print('**********************************************************')
print('Saving Dataset As NPZ To Directory Completed')
fname_npz = 'dataset_{}_{}_{}.npz'.format(args.nclass, args.depth, args.skip)
vid3d = videoto3d.Videoto3D(img_rows, img_cols, frames)
nb_classes = args.nclass
# loading the data
if os.path.exists(fname_npz):
loadeddata = np.load(fname_npz)
X, Y = loadeddata["X"], loadeddata["Y"]
else:
x, y = loaddata(args.videos, vid3d, args.nclass,args.output, args.color, args.skip)
X = x.reshape((x.shape[0], img_rows, img_cols, frames, channel))
Y = np_utils.to_categorical(y, nb_classes)
X = X.astype('float32')
#save npzdata to file
np.savez(fname_npz, X=X, Y=Y)
print('Saved Dataset To dataset.npz. Completed')
print('X_shape:{}\nY_shape:{}'.format(X.shape, Y.shape))
print('**********************************************************')
print('Initialise Model Layers & Layer Parameters Completed')
# Sequential groups a linear stack of layers into a tf.keras.Model.
# Sequential provides training and inference features on this model
model = Sequential()
model.add(Conv3D(32, kernel_size=(3, 3, 3),input_shape=(X.shape[1:]), padding='same'))
model.add(Activation('relu'))
model.add(Conv3D(32, kernel_size=(3, 3, 3), padding='same'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
model.add(Conv3D(128, kernel_size=(3, 3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv3D(128, kernel_size=(3, 3, 3), padding='same'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
model.add(Dropout(0.5))
model.add(Conv3D(256, kernel_size=(3, 3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv3D(256, kernel_size=(3, 3, 3), padding='same'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
model.add(Dropout(0.5))
model.add(Flatten())
# Dense function to convert FCL to 512 values
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes, activation='softmax'))
model.compile(loss=categorical_crossentropy,optimizer=Adam(), metrics=['accuracy'])
model.summary()
print('this is the model shape')
model.output_shape
plot_model(model, show_shapes=True,to_file=os.path.join(args.output, 'model.png'))
print('**********************************************************')
print("Train Test Method HoldOut Performance")
X_train, Xval_test, Y_train, Yval_test = train_test_split(
X, Y, train_size=0.8, test_size=0.2, random_state=1, stratify=Y, shuffle=True)
print('**********************************************************')
print('Deploying Data Fitting/ Performance Accuracy Guidance Completed')
#Stop operations when experiencing no learning
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1, mode='auto', min_delta=0.0001, cooldown=1, min_lr=0.0001)
# Fit the training data
history = model.fit(X_train, Y_train, validation_split=0.20, batch_size=args.batch,epochs=args.epoch, verbose=1, callbacks=[rlronp], shuffle=True)
# Predict X_Test (Xval_test) data and Labels
predict_labels = model.predict(Xval_test, batch_size=args.batch,verbose=1,use_multiprocessing=True)
classes = np.argmax(predict_labels, axis = 1)
label = np.argmax(Yval_test,axis = 1)
print('This the BATCH size', args.batch)
print('This the DEPTH size', args.depth)
print('This the EPOCH size', args.epoch)
print('This the TRAIN SPLIT size', len(X_train))
print('This the TEST SPLIT size', len(Xval_test))
# https://stackoverflow.com/questions/52261597/keras-model-fit-verbose-formatting
# A json file enhances the model performance by a simple to save/load model
model_json = model.to_json()
if not os.path.isdir(args.output):
os.makedirs(args.output)
with open(os.path.join(args.output, 'ucf101_3dcnnmodel.json'), 'w') as json_file:
json_file.write(model_json)
# hd5 contains multidimensional arrays of scientific data
model.save_weights(os.path.join(args.output, 'ucf101_3dcnnmodel.hd5'))
''' Evaluation is a process
'''
print('**********************************************************')
print('Displying Test Loss & Test Accuracy Completed')
loss, acc = model.evaluate(Xval_test, Yval_test, verbose=2, batch_size=args.batch, use_multiprocessing=True) # verbose 0
print('this is args output', args.output)
plot_history(history, args.output)
save_history(history, args.output)
print('**********************************************************')
# Generating Picture Of Confusion matrix
print('**********************************************************')
print('Generating CM InputData/Classification Report Completed')
#Ground truth (correct) target values.
y_valtest_arg = np.argmax(Yval_test, axis=1)
#Estimated targets as returned by a classifier
Y_valpred = np.argmax(model.predict(Xval_test), axis=1) # model
print('y_valtest_arg Shape is ==>', y_valtest_arg.shape)
print('Y_valpred Shape is ==>', Y_valpred.shape)
print('**********************************************************')
print('Classification_Report On Model Performance Completed==')
print(classification_report(y_valtest_arg.round(), Y_valpred.round(), target_names=filehandle, zero_division=1))
'''Intitate Confusion Matrix'''
# print('Model Confusion Matrix Per Test Data Completed===>')
cm = confusion_matrix(y_valtest_arg, Y_valpred, normalize=None)
print('Display Confusion Matrix ===>', cm)
print('**********************************************************')
print('Model Overall Accuracy')
print('Model Test loss:', loss)
print('**********************************************************')
print('Model Test accuracy:', acc)
print('**********************************************************')
if __name__ == '__main__':
main()
I Think the solution is around the prediction, train, test split and the evaluation arguments. However, I am lacking the knowledge to access the details required from the train, test, split(), if that's where the issue is.
I am really thankful for your guidance in advance, thanks a whole lot for clarifying this for me and closing the gaps in my understanding. really appreciate this!!!
I´m facing a strange behaviour which I can´t figure out why it is happening. I´m getting a really high loss(BinaryCrossentropy) on my validation batch around 20 or even higher while training. But after the training I do a prediction on the tet set and I get a loss which is lower than 1. Why is that? I went through my code over and over and can´t find the problem.
I´m doing a binary image classification for brian tumors on a dataset provided via kaggle(Link.
And you can find my notebook here: Google-Colab Notebook
My data is loaded this way:
batch_size = 20
train_ds = tf.keras.utils.image_dataset_from_directory(
train_data_path,
subset='training',
seed=42,
color_mode='grayscale',
batch_size=batch_size,
validation_split=0.30
)
valid_ds = tf.keras.utils.image_dataset_from_directory(
train_data_path,
subset='validation',
seed=42,
batch_size=batch_size,
color_mode='grayscale',
validation_split=0.30
)
test_ds = tf.keras.utils.image_dataset_from_directory(
test_data_path,
color_mode='grayscale',
batch_size=batch_size,
shuffle=False
)
This is my modle strcuture
input_shape = image_batch[0].shape
# set up the model structure
model = tf.keras.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
layers.MaxPooling2D((2,2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.3),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.3),
layers.Flatten(),
tf.keras.layers.Dense(32, activation="relu"),
layers.Dropout(0.3),
layers.Dense(1, activation="sigmoid")
])
model.summary()
This is my callback function which returns the plots during training:
class PlotLearning(tf.keras.callbacks.Callback):
"""
Callback to plot the learning curves of the model during training.
"""
def on_train_begin(self, logs={}):
self.metrics = {}
for metric in logs:
self.metrics[metric] = []
def on_epoch_end(self, epoch, logs={}):
# Storing metrics
print(logs)
for metric in logs:
if metric in self.metrics:
self.metrics[metric].append(logs.get(metric))
else:
self.metrics[metric] = [logs.get(metric)]
# Plotting
metrics = [x for x in logs if 'val' not in x]
f, axs = plt.subplots(1, len(metrics), figsize=(15,5))
clear_output(wait=True)
for i, metric in enumerate(metrics):
axs[i].plot(range(1, epoch + 2),
self.metrics[metric],
label=metric)
if logs['val_' + metric]:
axs[i].plot(range(1, epoch + 2),
self.metrics['val_' + metric],
label='val_' + metric)
axs[i].legend()
axs[i].grid()
plt.tight_layout()
plt.show()
callbacks_list = [PlotLearning()]
and this is the part where I start the training
# compile model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer,
loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
metrics=['accuracy']
)
# fit model
history = model.fit(prep_train_ds,
epochs=30,
validation_data=valid_ds,
callbacks=callbacks_list)
This is the output of the callback function after the last epoch run through:
As you can see the loss is really high and oscillating around 20, so I guess it is overfitting.
But as mentiod above, here is what I get when I make a prediction on the test set and calculate the binary crossentropy. The loss is again less than 1 and at least in the range of the training loss
I tried so many things like, chaning batch size, bcs. not enough samples of one class might be in one batch. Then I wanted to see if it is overfitting and changed the number of filters, applyed droput etc. But I couldn´t get the loss function down on the validation set. I´m quite new in the field of image classification and maybe I´m oversseing a thing.
I'm trying to learn how to use lightgbm distributed.
I wrote a simple hello world kind of code where I use iris dataset with 150 rows, split it into train (100 rows) and test(50 rows). Then training the train test set are further split into two parts. Each part is fed into two machines with appropriate rank.
The problem I see is that lgb.train hangs.
Here is the code:
import argparse
import logging
import lightgbm as lgb
import pandas as pd
from sklearn import datasets
import socket
print('lightgbm', lgb.__version__)
HOST = socket.gethostname()
ip_address = socket.gethostbyname(HOST)
print("IP=", ip_address)
# looks like lightgbm operates only with ip addresses
IPS = ['10.121.22.166', '10.121.22.83']
assert ip_address in IPS
logger = logging.getLogger(__name__)
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 10000)
pd.set_option('max_colwidth', 100)
pd.set_option('precision', 5)
def read_train_data(rank):
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
partition = rank
assert partition < 2
separate = 100
train_df = iris_df.iloc[:separate]
test_df = iris_df.iloc[separate:]
separate_train = 60
separate_test = 30
if partition == 0:
train_df = train_df.iloc[:separate_train]
test_df = test_df.iloc[:separate_test]
else:
train_df = train_df.iloc[separate_train:]
test_df = test_df.iloc[separate_test:]
def get_lgb_dataset(df):
target_column = df.columns[-1]
columns = df.columns[:-1]
assert target_column not in columns
print('Target column', target_column)
x = df[columns]
y = df[target_column]
print(x)
ds = lgb.Dataset(free_raw_data=False, data=x, label=y, params={
"enable_bundle": False
})
ds.construct()
return ds
dtrain = get_lgb_dataset(train_df)
dtest = get_lgb_dataset(test_df)
return dtrain, dtest
def train(args):
port0 = 56456
rank = IPS.index(ip_address)
print("Rank=", rank, HOST)
print("RR", rank)
dtrain, dtest = read_train_data(rank=rank)
params = {'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.1,
'max_depth': 2,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 1,
'num_leaves': 31,
'objective': 'regression',
'metric': 'rmse',
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': False,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'tree_learner': 'data_parallel',
'num_threads': 48,
'machines': ','.join([f'{machine}:{port0}' for i, machine in enumerate(IPS)]),
'local_listen_port': port0,
'time_out': 120,
'num_machines': len(IPS)
}
print(params)
logging.info("starting to train lgb at node with rank %d", rank)
evals_result = {}
if args.scikit == 1:
print("Using scikit learn")
bst = lgb.sklearn.LGBMRegressor(**params)
bst.fit(
dtrain.data,
dtrain.label,
eval_set=[(dtest.data, dtest.label)],
)
else:
print("Using regular LGB")
bst = lgb.train(params,
dtrain,
valid_sets=[dtest],
evals_result=evals_result)
print(evals_result)
logging.info("finish xgboost training at node with rank %d", rank)
return bst
def main(args):
logging.info("starting the train job")
model = train(args)
pd.set_option('display.max_rows', 500)
print("OUT", model.__class__)
try:
print(model.trees_to_dataframe())
except:
print(model.booster_.trees_to_dataframe())
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--scikit',
help='scikit',
default=0,
type=int,
)
main(parser.parse_args())
I can run it with the scikit fit interface by running: python simple_distributed_lgb_test.py --scikit 1
On the two machines. It produces a reasonable result.
However, when I use -- scikit 0 (which uses lgb.train), then fitting just hangs on both nodes. Last messages before it hangs:
[LightGBM] [Info] Total Bins 22
[LightGBM] [Info] Number of data points in the train set: 40, number of used features: 2
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Start training from score 0.873750
Is that a bug or an expected behavior? dask.py in lightgbm does use scikit learn fit interface.
I use an overnight master version 3.2.1.99. 5b7a6f3e7150aeb704d1dd2b852d246af3e913a3 tag to be exact from Jul 12.
UPDATE 1
I'm trying to dig into the code. So far I see few things:
scikit.train interface appears to have an extra syncronization step before fitting first tree. lgb.train doesn't have it. Dunno yet where it comes from. (I see some Network::Allreduce operations)
It appears that scikit.train has workers syncronized - each worker knows the correct sizes of the blocks to send and receive during reducescatter operations. For example one the first allreduce worker1 sends 208 blocks and receives 368 blocks of data (in Linkers::SendRecv), while worker2 is reversed - sends 368 and receives 208. So allreduce completes fine. ()
On the contrary, lgb.train has workers not syncronized - each worker has numbers for send and receive blocks during reducescatter at the first DataParallelTreeLearner::FindBestSplits encounter. But they don't match. Worker1 sends 208 abd wants to receive 400. Worker2 sends 192 and wants to receive 176. So, the worker that wants to receive more just hangs. The other worker eventually hangs too.
Possibly it has something to do with lgb.Dataset. That thing may need to have same bins or something. I tried to force it by forcedbins_filename parameter. But it doesn't seem to help with lgb.train.
UPDATE 2
Success. If I remove the following line from the example:
ds.construct()
Everything works. So I guess we can't use construct on Dataset when using distributed training.
I'm trying to figure out the reason behind speed different from two different models.
an LSTM RNN model built using tensorflow 1.x:
self.input_placeholder = tf.placeholder(
tf.int32, shape=[self.config.batch_size, self.config.num_steps], name='Input')
self.labels_placeholder = tf.placeholder(
tf.int32, shape=[self.config.batch_size, self.config.num_steps], name='Target')
embedding = tf.get_variable(
'Embedding', initializer = self.embedding_matrix, trainable = False)
inputs = tf.nn.embedding_lookup(embedding, self.input_placeholder)
inputs = [tf.squeeze(x, axis = 1) for x in tf.split(inputs, self.config.num_steps, axis = 1)]
self.initial_state = tf.zeros([self.config.batch_size, self.config.hidden_size])
lstm_cell = tf.contrib.rnn.BasicLSTMCell(self.config.hidden_size)
outputs, _ = tf.contrib.rnn.static_rnn(
lstm_cell, inputs, dtype = tf.float32,
sequence_length = [self.config.num_steps]*self.config.batch_size)
with tf.variable_scope('Projection'):
proj_U = tf.get_variable('Matrix', [self.config.hidden_size, self.config.vocab_size])
proj_b = tf.get_variable('Bias', [self.config.vocab_size])
outputs = [tf.matmul(o, proj_U) + proj_b for o in rnn_outputs]
the same model (at least from my understanding) built using tensorflow 2.0 keras:
def setup_model():
model = Sequential()
model.add(Embedding(input_dim=vocab_size,
output_dim=embedding_dim,
weights=[embedding_matrix],
input_length=4,
trainable=False))
model.add(LSTM(config.hidden_size, activation="tanh"))
model.add(Dense(vocab_size, activation="softmax"))
return model
The architecture is:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 4, 100) 55400
_________________________________________________________________
lstm (LSTM) (None, 100) 80400
_________________________________________________________________
dense (Dense) (None, 554) 55954
=================================================================
Total params: 191,754
Trainable params: 136,354
Non-trainable params: 55,400
_________________________________________________________________
I was expecting similar inference runtime but the one built with tensorflow 1.x is much much faster. I was trying to convert tensorflow 1.x model to tensorflow 2 using only native tensorlow functions, but I have trouble converting due to the big change in tensorflow from 1.x to 2, and I was only able to create it using tf.keras.
In terms of speed, since I'm using both for generating text sequences + getting word probabilities, so I don't have a single inference time difference (I can't modify existing API from tensorflow 1.x model to get this). But in general, I'm seeing at least 2x difference in time from my use cases.
What can be the possible reasons behind this difference of inference speed? I'm happy to provide more information if needed.
I am trying to fine-tune Inception-v3, but no matter which layer I choose to freeze I get random predictions. I found that other people are having the same problem: https://github.com/keras-team/keras/issues/9214 . It seems that the problem comes from setting the BN layer to not trainable.
Now I am trying to get the output of the last layer I want to freeze and use it as an input to the following layers, which I will then train:
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
target_size=size,
interpolation="bilinear",
classes=["a", "b", "c","d"],
batch_size=1,
shuffle=False) base_model = InceptionV3(weights='imagenet', include_top=True, input_shape=(299, 299, 3))
model_features = Model(inputs=base_model.input, outputs=base_model.get_layer(
self.Inception_Fine_Tune_Layers[layer_freeze]).output)
#I want to use this as input
values_train = model_features.predict_generator(train_generator, verbose=1)
However, I get Memory error like this, although I have 12Gb, which is more than what I need:
....
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3268864 totalling 3.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3489024 totalling 3.33MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4211968 totalling 4.02MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5129472 totalling 4.89MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 3.62GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 68719476736
InUse: 3886957312
MaxInUse: 3889054464
NumAllocs: 3709
MaxAllocSize: 8388608
Any suggestion how to fix that or another workaround to fine-tune Inception will be very helpful.
I can't tell if you're preprocessing your input properly from what you've provided. However, Keras provides functions for preprocessing that are specific to the pre-trained net, in this case Inception V3.
from keras.applications.inception_v3 import preprocess_input
Try adding this to your data generator as the preprocessing function like so...
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
preprocessing_function=preprocess_input, # <---
target_size=size,
interpolation="bilinear",
classes=["a", "b", "c","d"],
batch_size=1,
shuffle=False)
You should then be able to unfreeze all of the layers, or the select few that you want to train.
Hope that helps!