How to measure execution time for prediction per image (Keras)

I have a simple model created with Keras and I need to measure the execution time for prediction per image. Right now I just do this:
start = time.clock()
my_model.predict(images_test)
end = time.clock()
print("Time per image: {} ".format((end-start)/len(images_test)))
But I noticed that the calculated per-image time is larger when len(images_test) is smaller. For example, when len(images_test) = 32 I get 0.06, and when len(images_test) = 1024 I get 0.006.
Is there a "right" way to do this?

With TensorFlow there does not seem to be an asynchronicity problem, but with PyTorch there is: CUDA calls are asynchronous, so you need to synchronize before reading the clock.
In TensorFlow:
start = time.clock()
result = my_model.predict(images_test)
end = time.clock()
In PyTorch:
torch.cuda.synchronize()
start = time.clock()
my_model.predict(images_test)
torch.cuda.synchronize()
end = time.clock()
But I think you should loop model.predict 10 times and print the list of times (the first call is slower than the others because the machine needs to load and initialize the Keras model).
In TensorFlow:
pred_time_list = []
for i in range(10):
    start = time.clock()
    result = my_model.predict(images_test)
    end = time.clock()
    pred_time_list.append(end - start)
print(pred_time_list)
(Print pred_time_list and you may see why the measured times look off.)
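As a side note, time.clock() was deprecated and removed in Python 3.8; time.perf_counter() is the usual replacement. A minimal sketch of a warm-up plus repeated per-image timing (not from the original answer; it assumes my_model and images_test are defined as in the question):
import time

# Warm-up: the first predict() call includes model/graph initialization,
# so run it once before timing anything.
my_model.predict(images_test[:1])

n_runs = 10
per_image_times = []
for _ in range(n_runs):
    start = time.perf_counter()
    my_model.predict(images_test)
    end = time.perf_counter()
    per_image_times.append((end - start) / len(images_test))

print("Per-image times:", per_image_times)
print("Mean per-image time:", sum(per_image_times) / n_runs)
Keep in mind that predicting a large batch amortizes the per-call overhead across many images, so the per-image figure will naturally be smaller for larger batches; if single-image latency is what matters, time my_model.predict(images_test[:1]) instead.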
References:
[1] https://discuss.pytorch.org/t/doing-qr-decomposition-on-gpu-is-much-slower-than-on-cpu/21213/6
[2] https://discuss.pytorch.org/t/is-there-any-code-torch-backends-cudnn-benchmark-torch-cuda-synchronize-similar-in-tensorflow/51484/2

Related

Improve code result speed by multiprocessing

I'm self-studying Python and this is my first piece of code.
I'm working on analyzing logs from servers. Usually I need to analyze a full day of logs. I created a script (this is an example with simplified logic) just to check the speed. With straightforward sequential code, analyzing 20 million rows takes about 12-13 minutes. I need 200 million rows in 5 minutes.
What I tried:
Multiprocessing (I hit an issue with shared memory; I think I fixed it). But the result is 300K rows = 20 sec, no matter how many processes I use. (P.S.: I also need to control the number of processes in advance.)
Threading (I found it gives no speedup: 300K rows = 2 sec, but the plain sequential code is the same, 300K = 2 sec).
Asyncio (I think the script is slow because it needs to read many files). The result is the same as threading: 300K = 2 sec.
In the end I think all three of my scripts are incorrect and don't work properly.
P.S.: I try to avoid specialized Python modules (like pandas) because that would make the script harder to run on different servers; it's better to stick to the standard library.
Please help me check the first one - multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}

def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]

if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed by an independent process via def argument, and finally all results to be saved in the dict vod_live_cuts. For each process I added an independent list in the dict; I thought it would help with concurrent access to this parameter. But maybe it's the wrong approach :(
Using IPC is costly, so only use "shared objects" for saving the final result, not for intermediate results while parsing a file.
Limiting the number of processes is done with a multiprocessing.Pool; the following code uses it to reach the maximum hard-disk speed, and you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so if you need to improve performance further you should use an SSD or RAID0 for higher disk speed; you cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Manager, Pool

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}

def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to the process,
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)

if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()
    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))
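As a small follow-up on "you only need to post-process the results": each worker appends one per-file count to the shared lists, so the day totals are just the sums. A minimal sketch, assuming the code above has already run:
# Total MISS/HIT counts across all processed log files.
total_miss = sum(vod_live_cuts["vod_miss"])
total_hit = sum(vod_live_cuts["vod_hit"])
print("MISS:", total_miss, "HIT:", total_hit)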

PyTorch - Using more GPUs and increasing batch size makes training slower in DistributedDataParallel

I am trying to implement StyleGAN2. My code works well when I am using just a single GPU for training. I would like to speed up the training by utilizing 8 GPUs with DistributedDataParallel. However, I noticed that using more GPUs does not speed up the training for me at all. Instead, using more GPUs makes the training slower.
I also tried modifying the batch size and noticed that batch size = 8 trains the model fastest. Increasing the batch size makes the training significantly slower.
I tried to measure the time for each epoch and found the training time is significantly longer every 4 epochs.
EP0_elapsed_time: 3.3021082878112793 sec
EP1_elapsed_time: 0.8542821407318115 sec
EP2_elapsed_time: 0.7720010280609131 sec
EP3_elapsed_time: 7.11009407043457 sec
EP4_elapsed_time: 0.7670211791992188 sec
EP5_elapsed_time: 0.7623276710510254 sec
EP6_elapsed_time: 0.7690849304199219 sec
EP7_elapsed_time: 7.0614259243011475 sec
EP8_elapsed_time: 0.7806422710418701 sec
EP9_elapsed_time: 0.7751979827880859 sec
EP10_elapsed_time: 0.7685496807098389 sec
EP11_elapsed_time: 7.09734845161438 sec
EP12_elapsed_time: 0.7923364639282227 sec
EP13_elapsed_time: 0.7789566516876221 sec
EP14_elapsed_time: 0.7974681854248047 sec
EP15_elapsed_time: 7.120237350463867 sec
I noticed a similar post and it has not been solved:
No speedup doing multi GPU training with DistributedDataParallel vs. single GPU
How can I solve this issue?
main()
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking/index within the nodes')
    parser.add_argument('--epochs', default=400, type=int, metavar='N', help='number of total epochs to run')
    parser.add_argument('--model_dir', default='stylegan2ada_002', type=str, help='model dir name')
    parser.add_argument('--train_img_dir_path', default='./GAN/clean_2', type=str, help='training images dir path')
    parser.add_argument('--img_size', default=64, type=int, help='target image size')
    parser.add_argument('--batch_size', default=32, type=int, help='batch size')
    parser.add_argument('--g_latent_dim', default=512, type=int, help='dim of generator noise z and w')
    parser.add_argument('--mn_num_layers', default=8, type=int, help='number of layers in the mapping network (8 according to paper)')
    parser.add_argument('--g_lr', default=1e-3, type=float, help='generator learning rate')
    parser.add_argument('--d_lr', default=1e-3, type=float, help='discriminator learning rate')
    parser.add_argument('--mn_lr', default=1e-5, type=float, help='mapping network learning rate')
    parser.add_argument('--adam_betas', default=(0.0, 0.99), type=tuple, help='betas of adam optimizers')
    parser.add_argument('--gradient_accumulate_steps', default=1, type=int, help='gradient accumulate steps')
    parser.add_argument('--lazy_gradient_penalty_interval', default=4, type=int, help='lazy gradient penalty interval')
    parser.add_argument('--lazy_path_penalty_after', default=5000, type=int, help='the point that starts to apply lazy path penalty')
    parser.add_argument('--lazy_path_penalty_interval', default=32, type=int, help='lazy path penalty interval')
    parser.add_argument('--gradient_penalty_coefficient', default=10., type=float, help='gradient penalty coefficient')
    parser.add_argument('--style_mixing_prob', default=0.9, type=float, help='style mixing prob')
    parser.add_argument('--generate_img_interval', default=100, type=int, help='generate images every x epochs')
    parser.add_argument('--generate_img_after_percent', default=0.4, type=float, help='generate images after y% of the total epochs')
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes
    args.distributed = True
    args.dist_backend = 'nccl'
    args.dist_url = 'env://'
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '7788'
    mp.spawn(train, nprocs=args.gpus, args=(args,))
train()
def train(gpu, args):
    rank = args.nr * args.gpus + gpu
    torch.cuda.set_device(gpu)
    dist.init_process_group(
        backend=args.dist_backend,
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=rank,
    )
    dist.barrier()
    # Measure time
    start_init = time.time()
    # Make dirs
    ### Create model dir if not existed
    model_dir_path = f'./{args.model_dir}'
    if not (os.path.exists(model_dir_path)):
        try:
            os.makedirs(model_dir_path)
        except:
            pass
    ### Create 'images' dir if not existed
    img_dir_path = f'./{args.model_dir}/images'
    if not (os.path.exists(img_dir_path)):
        try:
            os.makedirs(img_dir_path)
        except:
            pass
    ### Create 'checkpoints' dir if not existed
    ckpt_dir_path = f'./{args.model_dir}/checkpoints'
    if not (os.path.exists(ckpt_dir_path)):
        try:
            os.makedirs(ckpt_dir_path)
        except:
            pass
    # Dataset and Dataloader
    ### Create the dataset
    dataset = ImageDataset(path=args.train_img_dir_path, image_size=args.img_size)
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset,
        num_replicas=args.world_size,
        rank=rank,
    )
    ### Create the dataloader
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=args.batch_size,
        num_workers=0,
        shuffle=False,
        drop_last=True,
        pin_memory=True,
        sampler=sampler,
    )
    # Initialization
    ### Setup gpu device
    device = torch.device('cuda', rank)
    ### Get log2 of the target image size
    log_resolution = log2(args.img_size)
    ### Create discriminator
    discriminator = Discriminator(log_resolution)
    ### Put the discriminator to the device
    discriminator.to(device)
    ### Apply DDP
    discriminator = nn.parallel.DistributedDataParallel(discriminator, device_ids=[gpu])
    ### Create discriminator loss
    discriminator_loss = DiscriminatorLoss().to(device)
    ### Create discriminator optimizer
    discriminator_optimizer = torch.optim.Adam(
        discriminator.parameters(),
        lr=args.d_lr,
        betas=args.adam_betas,
    )
    ### Create gradient penalty (gp) loss
    gradient_penalty = GradientPenalty()
    ### Create generator
    generator = Generator(device, log_resolution, args.g_latent_dim, args.style_mixing_prob)
    ### Put the generator to the device
    generator.to(device)
    ### Apply DDP
    generator = nn.parallel.DistributedDataParallel(generator, device_ids=[gpu])
    ### Create generator loss
    generator_loss = GeneratorLoss().to(device)
    ### Create generator optimizer
    generator_optimizer = torch.optim.Adam(
        generator.parameters(),
        lr=args.g_lr,
        betas=args.adam_betas,
    )
    ### Create path length penalty (PLP) loss
    path_length_penalty = PathLengthPenalty(0.99).to(device)
    ### Create mapping network
    mapping_network = MappingNetwork(args.g_latent_dim, args.mn_num_layers)
    ### Put the mapping network to the device
    mapping_network.to(device)
    ### Apply DDP
    mapping_network = nn.parallel.DistributedDataParallel(mapping_network, device_ids=[gpu])
    ### Create mapping network optimizer
    mapping_network_optimizer = torch.optim.Adam(
        mapping_network.parameters(),
        lr=args.mn_lr,
        betas=args.adam_betas,
    )
    generate_img_after = int(args.epochs * args.generate_img_after_percent)
    # Measure time
    torch.cuda.synchronize()
    end_init = time.time()
    init_time = end_init - start_init
    print(f'Init_time: {init_time} sec')
    # Training steps and losses tracking
    disc_loss_y = []
    gen_loss_y = []
    # Measure time
    times = []
    for i in range(args.epochs):
        start_epoch = time.time()
        disc_loss, gen_loss = step(
            i,
            device,
            args.batch_size,
            dataloader,
            args.gradient_accumulate_steps,
            args.style_mixing_prob,
            discriminator,
            discriminator_loss,
            discriminator_optimizer,
            gradient_penalty,
            args.gradient_penalty_coefficient,
            args.lazy_gradient_penalty_interval,
            generator,
            generator_loss,
            generator_optimizer,
            path_length_penalty,
            args.g_latent_dim,
            args.lazy_path_penalty_after,
            args.lazy_path_penalty_interval,
            mapping_network,
            mapping_network_optimizer,
            args.model_dir,
            args.generate_img_interval,
            generate_img_after,
        )
        # Measure time
        torch.cuda.synchronize()
        end_epoch = time.time()
        elapsed = end_epoch - start_epoch
        times.append(elapsed)
        print(f'EP{i}_elapsed_time: {elapsed} sec')
        ### Append losses of each step into the lists
        disc_loss_y.append(disc_loss)
        gen_loss_y.append(gen_loss)
    # Measure time
    avg_time = sum(times) / args.epochs
    print(f'avg_time: {avg_time} sec')
    ### Plot the losses
    epoch_x = np.linspace(1, args.epochs, args.epochs).astype(int)
    plt.plot(epoch_x, disc_loss_y, label='disc_loss')
    plt.plot(epoch_x, gen_loss_y, label='gen_loss')
    plt.legend()
    plt.savefig(f'{img_dir_path}/loss.png')

Running Out of RAM using FilePerUserClientData

I have a problem with training using tff.simulation.FilePerUserClientData - I am quickly running out of RAM after 5-6 rounds with 10 clients per round.
The RAM usage is steadily increasing with each round.
I tried to narrow it down and realized that the issue is not the actual iterative process but the creation of the client datasets.
Simply calling create_tf_dataset_for_client(client) in a loop causes the problem.
So this is a minimal version of my code:
import tensorflow as tf
import tensorflow_federated as tff
import numpy as np
import pickle

BATCH_SIZE = 16
EPOCHS = 2
MAX_SEQUENCE_LEN = 20
NUM_ROUNDS = 100
CLIENTS_PER_ROUND = 10

def decode_fn(record_bytes):
    return tf.io.parse_single_example(
        record_bytes,
        {"x": tf.io.FixedLenFeature([MAX_SEQUENCE_LEN], dtype=tf.string),
         "y": tf.io.FixedLenFeature([MAX_SEQUENCE_LEN], dtype=tf.string)}
    )

def dataset_fn(path):
    return tf.data.TFRecordDataset([path]).map(decode_fn).padded_batch(BATCH_SIZE).repeat(EPOCHS)

def sample_client_data(data, client_ids, sampling_prob):
    clients_total = len(client_ids)
    x = np.random.uniform(size=clients_total)
    sampled_ids = [client_ids[i] for i in range(clients_total) if x[i] < sampling_prob]
    data = [train_data.create_tf_dataset_for_client(client) for client in sampled_ids]
    return data

with open('users.pkl', 'rb') as f:
    users = pickle.load(f)

train_client_ids = users["train"]
client_id_to_train_file = {i: "reddit_leaf_tf/" + i for i in train_client_ids}

train_data = tff.simulation.datasets.FilePerUserClientData(
    client_ids_to_files=client_id_to_train_file,
    dataset_fn=dataset_fn
)

sampling_prob = CLIENTS_PER_ROUND / len(train_client_ids)

for round_num in range(0, NUM_ROUNDS):
    print('Round {r}'.format(r=round_num))
    participants_data = sample_client_data(train_data, train_client_ids, sampling_prob)
    print("Round Completed")
I am using tensorflow-federated 19.0.
Is there something wrong with the way I create the client datasets or is it somehow expected that the RAM from the previous round is not freed?
As schmana noticed, this occurs when changing the cardinality of the CLIENTS placement (a different number of client datasets) each round. This results in a cache filling up, as documented in http://github.com/tensorflow/federated/issues/1215.
A workaround in the immediate term would be to call
tff.framework.get_context_stack().current.executor_factory.clean_up_executors()
at the start or end of every round.
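In the question's sampling loop, that would look roughly like this (a sketch based on the loop above; only the final cleanup call is new, and the actual federated round is elided):
for round_num in range(0, NUM_ROUNDS):
    print('Round {r}'.format(r=round_num))
    participants_data = sample_client_data(train_data, train_client_ids, sampling_prob)
    # ... run the training round on participants_data ...
    print("Round Completed")
    # Workaround: clear cached executors so the client datasets from this
    # round can be garbage-collected before the next round starts.
    tff.framework.get_context_stack().current.executor_factory.clean_up_executors()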

Detectron2- How to log validation loss during training?

I copied the idea from mnslarcher and wrote the following two functions for my keypoint detector (resnet50 backbone) algorithm.
def build_valid_loader(cfg):
    _cfg = cfg.clone()
    _cfg.defrost()  # make this cfg mutable.
    _cfg.DATASETS.TRAIN = cfg.DATASETS.TEST
    return build_detection_train_loader(_cfg)

def store_valid_loss(model, data, storage):
    training_mode = model.training
    with torch.no_grad():
        loss_dict = model(data)
        losses = sum(loss_dict.values())
        assert torch.isfinite(losses).all(), loss_dict
        loss_dict_reduced = {k: v.item()
                             for k, v in comm.reduce_dict(loss_dict).items()}
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        if comm.is_main_process():
            storage.put_scalars(val_loss=losses_reduced, **loss_dict_reduced)
    model.train(training_mode)
Then in plain_train_net.py I am calling them as below:
val_data_loader = build_valid_loader(cfg)
logger.info("Starting training from iteration {}".format(start_iter))
with EventStorage(start_iter) as storage:
    for data, val_data, iteration in zip(data_loader, val_data_loader, range(start_iter, max_iter)):
        iteration = iteration + 1
        ..
        ..
        # At the end of the for loop:
        # Calculate and log validation loss.
        store_valid_loss(model, val_data, storage)
After 1k iterations, loss_keypoint is increasing, but total_loss is the same compared to running without the store_valid_loss call. What am I missing? Can anyone please help me understand?
I am using 4 GeForce RTX 2080 Ti.

Why does PyTorch training on CUDA work much slower than on CPU?

I guess I have made a mistake somewhere in the following simple neural network with PyTorch, because it runs much slower with CUDA than on the CPU. Can you find the mistake, please? Using a function like
def backward(ctx, input):
    return backward_sigm(ctx, input)
seems to have no real impact on performance.
import torch
import torch.nn as nn
import torch.nn.functional as f

dname = 'cuda:0'
dname = 'cpu'

device = torch.device(dname)
print(torch.version.cuda)

def forward_sigm(ctx, input):
    sigm = 1 / (1 + torch.exp(-input))
    ctx.save_for_backward(sigm)
    return sigm

def forward_step(ctx, input):
    return torch.tensor(input > 0.5, dtype=torch.float32, device=device)

def backward_sigm(ctx, grad_output):
    sigm, = ctx.saved_tensors
    return grad_output * sigm * (1 - sigm)

def backward_step(ctx, grad_output):
    return grad_output

class StepAF(torch.autograd.Function):
    #staticmethod
    def forward(ctx, input):
        return forward_sigm(ctx, input)

    #staticmethod
    def backward(ctx, input):
        return backward_sigm(ctx, input)
        #else return grad_output

class StepNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(StepNN, self).__init__()
        self.linear1 = torch.nn.Linear(input_size, hidden_size)
        #self.linear1.cuda()
        self.linear2 = torch.nn.Linear(hidden_size, output_size)
        #self.linear2.cuda()
        #self.StepAF = StepAF.apply

    def forward(self, x):
        h_line_1 = self.linear1(x)
        h_thrash_1 = StepAF.apply(h_line_1)
        h_line_2 = self.linear2(h_thrash_1)
        output = StepAF.apply(h_line_2)
        return output

inputs = torch.tensor([[1,0,1,0],[1,0,0,1],[0,1,0,1],[0,1,1,0],[1,0,0,0],[0,0,0,1],[1,1,0,1],[0,1,0,0],], dtype=torch.float32, device=device)
expected = torch.tensor([[1,0,0],[1,0,0],[0,1,0],[0,1,0],[1,0,0],[0,0,1],[0,1,0],[0,0,1],], dtype=torch.float32, device=device)

nn = StepNN(4, 8, 3)
#print(*(x for x in nn.parameters()))

criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(nn.parameters(), lr=1e-3)

steps = 50000
print_steps = steps // 20
good_loss = 1e-5
for t in range(steps):
    output = nn(inputs)
    loss = criterion(output, expected)
    if t % print_steps == 0:
        print('step ', t, ', loss :', loss.item())
    if loss < good_loss:
        print('step ', t, ', loss :', loss.item())
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

test = torch.tensor([[0,1,0,1],[0,1,1,0],[1,0,1,0],[1,1,0,1],], dtype=torch.float32, device=device)
print(nn(test))
Unless you have large enough data, you won't see any performance improvement when using a GPU. The problem is that GPUs rely on massive parallelism, so unless you have large amounts of data, the CPU can process the samples almost as fast as the GPU.
As far as I can see in your example, you are using 8 samples of size (4, 1). I would imagine that once you have hundreds or thousands of samples you would start to see the performance improvement on a GPU. In your case, the sample size is (4, 1) and the hidden layer size is 8, so the CPU can perform the calculations fairly quickly.
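To make the point concrete, here is a small sketch (not from the original answer; the helper name time_training and the batch/layer sizes are arbitrary) that times the same kind of training loop on CPU and, if available, on GPU. With tiny inputs like the 8x4 batch above the GPU shows no advantage, but as the batch and layer sizes grow the GPU side pulls ahead:
import time
import torch
import torch.nn as nn

def time_training(device, batch=8192, in_dim=512, hidden=1024, out_dim=10, steps=50):
    model = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)).to(device)
    x = torch.randn(batch, in_dim, device=device)
    y = torch.randn(batch, out_dim, device=device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    # Warm-up pass so CUDA initialization and allocations are not timed.
    criterion(model(x), y).backward()
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return time.perf_counter() - start

print('cpu :', time_training(torch.device('cpu')))
if torch.cuda.is_available():
    print('cuda:', time_training(torch.device('cuda:0')))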
There are lots of example notebooks online of people using MNIST data (it has around 60000 images for training), so you could load one in maybe Google Colab and then try training on the CPU and then on GPU and observe the training times. You could try this link for example. It uses TensorFlow instead of PyTorch but it will give you an idea of the performance improvement of a GPU.
Note: If you haven't used Google Colab before, then you need to change the runtime type (None for CPU and GPU for GPU) in the runtime menu at the top.
Also, I will post the results from that notebook here (look at the time mentioned in the brackets, and if you run it yourself, you can see firsthand how fast it runs):
On CPU :
INFO:tensorflow:loss = 294.3736, step = 1
INFO:tensorflow:loss = 28.285727, step = 101 (23.769 sec)
INFO:tensorflow:loss = 23.518856, step = 201 (24.128 sec)
On GPU :
INFO:tensorflow:loss = 295.08328, step = 0
INFO:tensorflow:loss = 47.37291, step = 100 (4.709 sec)
INFO:tensorflow:loss = 23.31364, step = 200 (4.581 sec)
INFO:tensorflow:loss = 9.980572, step = 300 (4.572 sec)
INFO:tensorflow:loss = 17.769928, step = 400 (4.560 sec)
INFO:tensorflow:loss = 16.345463, step = 500 (4.531 sec)
