Parallel hyperparameter optimization with PyTorch on a multi-GPU machine

I have access to a multi-GPU machine and I am running a grid-search loop for hyperparameter optimisation. I would like to know whether I can distribute several iterations of the loop across multiple GPUs at the same time and, if so, how to do it (what mechanism? threading? how do I gather the results if the loop executes asynchronously? etc.)
Thank you.

I'd suggest using Optuna to handle the hyperparameter search, which should in general perform better than grid search (you can still use it with grid sampling, though). I have modified Optuna's distributed example to use one GPU per process.
Create a training script like:
# optimize.py
import sys
import optuna
import your_model
DEVICE = 'cuda:' + sys.argv[1]
def objective(trial):
    hidden_size = trial.suggest_int('hidden_size', 8, 64, log=True)
    # define other hyperparameters
    return your_model.score(hidden_size=hidden_size, device=DEVICE)
if __name__ == '__main__':
    study = optuna.load_study(study_name='distributed-example', storage='sqlite:///example.db')
    study.optimize(objective, n_trials=100)
In terminal:
pip install optuna
optuna create-study --study-name "distributed-example" --storage "sqlite:///example.db"
Then for every GPU device:
python optimize.py 0
python optimize.py 1
...
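The per-GPU commands above can also be started from a single Python launcher instead of separate terminals. The following is a minimal sketch, assuming the script is named optimize.py as above; the helper names build_worker_commands and launch_workers are hypothetical and not part of Optuna. Each worker is an independent OS process, and the results are gathered through the shared study storage rather than through return values.

```python
import subprocess
import sys


def build_worker_commands(n_gpus, script="optimize.py"):
    """Return one command per GPU index, mirroring `python optimize.py <gpu>`."""
    return [[sys.executable, script, str(gpu)] for gpu in range(n_gpus)]


def launch_workers(commands):
    """Start all workers at once, wait for them, and return their exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]
```

Calling launch_workers(build_worker_commands(2)) would start one worker for GPU 0 and one for GPU 1 and block until both have finished their trials.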
Finally, best results can be easily discovered:
import optuna
study = optuna.create_study(study_name='distributed-example', storage='sqlite:///example.db', load_if_exists=True)
print(study.best_params)
print(study.best_value)
Or even visualized.

Related

Due to IPython and Windows limitation, python multiprocessing isn't available now. So `number_workers` is changed to 0 to avoid getting stuck

#id first_training
#caption Results from the first training
# CLICK ME
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
Due to IPython and Windows limitation, python multiprocessing isn't available now.
So number_workers is changed to 0 to avoid getting stuck
Hi, I am working through the fastai book and I ran this code locally, without Colab or Paperspace.
It is taking much longer than I expected, even though my computer is a workstation.
I am wondering whether, if I could clear that warning
and increase 'number_workers', it would run much faster than it does now.
How can I solve this problem?
Thanks

Train RoBERTa from scratch where dataset is larger than the capacity of RAM?

I have a corpus that is 16 GB large and my RAM is around 16 GB. If I load the entire dataset to train the language model RoBERTa from scratch, I am going to have a memory issue. I intend to train my RoBERTa using the script provided in Huggingface's tutorial from their blog post: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
However, their blog post suggests using LineByLineTextDataset, which loads the dataset eagerly.
class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """
    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)
        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)
Unexpectedly, my kernel crashed at the point where the lines are read. I wonder if there is a way to make it read lazily. Ideally, the suggested answer would require minimal code changes to the posted tutorial, since I'm rather new to Huggingface and afraid I won't be able to debug it on my own.
I would recommend using HuggingFace's own datasets library. The documentation says:
It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe) with a special focus on memory efficiency and speed. As a matter of example, loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in python.
The quick tour has good explanations and code snippets for creating a dataset object with your own data and it also explains how to train your own model.
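For readers who want to see what "reading lazily" means in terms of the class from the question, here is a minimal standard-library sketch. It is not the datasets library and not a HuggingFace API: the class name LazyLineDataset and the encode callable are hypothetical. The idea is to scan the file once to record the byte offset of every non-blank line, then read and encode a single line only when it is requested. The offsets list still grows with the number of lines, so this reduces memory use rather than making it constant.

```python
import os
from typing import Callable, List, Optional


class LazyLineDataset:
    """Index line offsets once, then read one line per __getitem__ call."""

    def __init__(self, file_path: str,
                 encode: Optional[Callable[[str], List[int]]] = None):
        assert os.path.isfile(file_path)
        self.file_path = file_path
        self.encode = encode  # hypothetical stand-in for a tokenizer call
        self.offsets: List[int] = []
        offset = 0
        with open(file_path, "rb") as f:
            for raw in f:  # iterates line by line without loading the file
                if raw.strip():  # skip empty and whitespace-only lines
                    self.offsets.append(offset)
                offset += len(raw)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, i: int):
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[i])
            line = f.readline().decode("utf-8").rstrip("\r\n")
        return self.encode(line) if self.encode else line
```

To use something like this with the tutorial, encode would wrap the tokenizer call and the result would be converted to a tensor, as in the original __getitem__.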

Unable to release GPU memory after training a CNN model using keras

On a Google Colab notebook with keras (2.2.4) and tensorflow (1.13.1) as a backend, I am trying to tune a CNN. I use a simple table of hyperparameters and run my tests in a set of loops.
My problem is that I can't free the GPU memory after each iteration, and Keras doesn't seem to be able to release GPU memory automatically. So every time I get a Resource Exhausted: Out Of Memory (OOM) error.
I did some digging and came across this function, which combines different solutions that have been suggested for this problem (it didn't work for me, though):
for _ in hyper_parameters :
    Run_model(_)
    reset_keras()
My set of hyperparameters is:
IMG_SIZE = 800,1000,3
BATCH_SIZEs = [4,8,16]
EPOCHSs = [5,10,50]
LRs = [0.008,0.01]
MOMENTUMs = [0.04,0.09]
DECAYs = [0.1]
VAL_SPLITs = [0.1]
And the function I used to free the GPU memory:
import gc
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session, clear_session, get_session
def reset_keras():
    sess = get_session()
    clear_session()
    sess.close()
    sess = get_session()
    try:
        del model # this is from global space - change this as you need
    except:
        pass
    print(gc.collect()) # if it's done something you should see a number being outputted
    # use the same config as you used to create the session
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 1
    config.gpu_options.visible_device_list = "0"
    set_session(tf.Session(config=config))
The only thing that I didn't fully grasp is the "use the same config as you used to create the session" comment, since with Keras we don't explicitly choose a configuration.
I get through one iteration, sometimes two, but I can't go beyond that. I have already tried changing the batch_size, and for the moment I can't afford a more powerful machine.
I am using a custom image generator that inherits from keras.utils.Sequence.
I monitor the state of GPU memory using this piece of code:
import os
import psutil
import GPUtil as GPU
def printmm():
    GPUs = GPU.getGPUs()
    gpu = GPUs[0]
    process = psutil.Process(os.getpid())
    return(" Util {0:.2f}% ".format(gpu.memoryUtil*100))
print(printmm())
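One workaround that does not depend on Keras releasing memory inside the notebook process is to run each configuration in its own child process, since the GPU memory held by a process is returned when that process exits. The sketch below is an assumption-laden illustration, not something from the question: the training script path and its JSON-in, JSON-out convention are hypothetical, and only the hyperparameter values are taken from the lists above.

```python
import itertools
import json
import subprocess
import sys

# Grid taken from the hyperparameter lists in the question.
GRID = {
    "batch_size": [4, 8, 16],
    "epochs": [5, 10, 50],
    "lr": [0.008, 0.01],
    "momentum": [0.04, 0.09],
    "decay": [0.1],
    "val_split": [0.1],
}


def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def run_trial(script_path, params):
    """Run one configuration in a child interpreter and return its JSON result.

    `script_path` is a hypothetical training script that reads the parameters
    from sys.argv[1] as JSON and prints a JSON object as its last output line.
    """
    completed = subprocess.run(
        [sys.executable, script_path, json.dumps(params)],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout.strip().splitlines()[-1])
```

A loop such as results = [run_trial("train_one.py", cfg) for cfg in iter_configs(GRID)] would then replace the in-process loop, where train_one.py is a placeholder name for a script that builds and fits the model for one configuration.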

Importing code as a module runs much slower than a single block of code in Python

I'm writing code to stream images from a Raspberry Pi 3 to my laptop.
I wrote two versions of the code, one using a module and the other as a single block of code.
I'm putting the code here (more details follow the code).
Code that runs fast:
import picamera
import socket
import struct
import time
import io
stream_client= socket.socket(socket.AF_INET,socket.SOCK_STREAM)
stream_client.connect(('192.168.43.34',9000))
print('connected')
camera = picamera.PiCamera()
camera.resolution=(320,240)
camera.color_effects=(128,128)
camera.framerate=18
time.sleep(2)
stream = io.BytesIO()
count=0
start=time.time()
try:
    for i in camera.capture_continuous(stream,'jpeg',use_video_port=True):
        count+=1
        stream_client.sendall(struct.pack('<L',stream.tell()))
        stream_client.sendall(struct.pack('<h',count))
        stream_client.sendall(stream.getvalue())
        stream.seek(0)
        stream.truncate()
        if(time.time()-start>10):
            break
finally:
    stream_client.sendall(struct.pack('<L',0))
    stream_client.close()
    camera.close()
    print('connection closed')
The code that runs slowly:
Module (contained in a different file "CameraStreamModule.py" in same folder):
'''
MODULE NAME: CAMERA STREAMING MODULE
OBJECTIVE: TO SEND IMAGE DATA FROM RASPBERRY-PI3 TO LAPTOP VIA TCP/IP STREAM
HARDWARE USED: RASPBERRY PI 3, PI CAMERA REV 1.3
PYTHON VERSION: 3.5.3
DATE WRITTEN: 8-1-2018
'''
'''************************************************************** IMPORTING MODULES ***********************************************************************'''
import picamera # PI CAMERA MODULE
import socket # MODULE TO HANDLE TCP/IP CONNECTION AND TRANSFER
import struct # MODULE TO CONVERT DATA TO BYTE-LIKE OBJECTS (REQUIRED BY SOCKET METHODS)
import time # MODULE FOR TIME RELATED FUNCTIONS
import io # MODULE USED TO CREATE IN-MEMORY DATA STREAMS
'''************************************************************ MODULE IMPORTS END HERE *******************************************************************'''
'''*********************************************************** DECLARING GLOBAL VARIABLES *****************************************************************'''
LIMITED_STREAM = False # Stream only for short time when true
LIMIT_TIME= 10 # Seconds
stream = io.BytesIO()
stream_client= socket.socket(socket.AF_INET,socket.SOCK_STREAM)
camera = picamera.PiCamera()
count=0
'''*********************************************************** GLOBAL VARIABLES END HERE ******************************************************************'''
'''************************************************************** METHOD DEFINITIONS ***********************************************************************'''
def Client_init (Server_ip,Server_port): # (str,int) expected as parameter
    global stream_client,start,LIMITED_STREAM
    stream_client.connect((Server_ip,Server_port))
    if LIMITED_STREAM :
        start=time.time()
    print('connected')
def Camera_init (Resolution_tuple,Colour_tuple,Frame_rate): # (int_tuple,int_tuple,int) expected as parameter
    global camera
    camera.resolution= Resolution_tuple
    camera.color_effects=Colour_tuple
    camera.framerate= Frame_rate
    time.sleep(2)
'''
THIS METHOD IS INTENDED TO BE CALLED INSIDE "for i in camera.capture_continuous(stream,'jpeg',use_video_port=True):" CONTINUOUSLY. ALSO .close() FOR SOCKET
AND CAMERA MUST BE CALLED SEPARATELY.
'''
def Send_frame ():
    global count,stream_client,stream,LIMITED_STREAM
    count+=1
    stream_client.sendall(struct.pack('<L',stream.tell()))
    stream_client.sendall(struct.pack('<h',count))
    stream_client.sendall(stream.getvalue())
    stream.seek(0)
    stream.truncate()
    if LIMITED_STREAM:
        if(time.time()-start>LIMIT_TIME):
            raise Exception('Time Finished')
'''********************************************************* METHOD DEFINITIONS END HERE ******************************************************************'''
The calling code:
import importlib
import struct # needed for struct.pack in the finally block
Camera_module= importlib.import_module('CameraStreamModule')
Camera_module.LIMITED_STREAM= True
Camera_module.Client_init('192.168.1.102',9000)
Camera_module.Camera_init((320,240),(128,128),18)
try:
    for i in Camera_module.camera.capture_continuous(Camera_module.stream,'jpeg',use_video_port=True):
        Camera_module.Send_frame()
finally:
    Camera_module.camera.close()
    Camera_module.stream_client.sendall(struct.pack('<L',0))
    Camera_module.stream_client.close()
    print('connection closed')
The first code streams about 179 images in 10 seconds and the second version does about 133 images which is a drastic reduction. I just wanted to create a module to make code more manageable and readable.
I have started coding in Python quite recently and I know that my method of coding may look ridiculous to more experienced coders (trust me, I'm trying to improve). Can anyone tell me what could be the reason of this slowdown?
I have observed that even changing the WiFi connection has an effect on the amount of data transferred in a given amount of time, so I have kept the WiFi connection the same for both versions of code.
I think that this slowdown occurs because I'm passing a lot of data back and forth between modules?
In any case, any advice/help regarding the code is welcome.
P.S: If you feel that my way of asking questions on this platform is not up to the mark and I need to give more or reduce the details, please let me know.
OK, I found the problem. In the receiver code that I created, I had written a lot of print() statements for debugging purposes. These were causing the delay, and because of it I was losing some data, since the socket transmission was asynchronous (and I did not handle data loss in the code). This, probably along with the delay in passing large array data back and forth between the modules, caused the problem.
Leaving this answer here just in case someone finds it useful.
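The receiver code is not shown in this thread, so the following is only a hypothetical sketch of a receiver for the framing used by the sender above: a 4-byte little-endian length ('<L'), a 2-byte frame counter ('<h'), then the JPEG bytes, with a length of 0 marking the end. Because TCP delivers a byte stream, a single recv call may return only part of a frame, so the helper recv_exact (a name introduced here) loops until the requested number of bytes has arrived.

```python
import socket
import struct


def recv_exact(sock, n):
    """Read exactly n bytes from sock, looping over partial recv results."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_frames(sock):
    """Yield (count, payload) pairs until the zero-length terminator arrives."""
    while True:
        (size,) = struct.unpack("<L", recv_exact(sock, 4))
        if size == 0:  # the sender signals the end with a length of 0
            return
        (count,) = struct.unpack("<h", recv_exact(sock, 2))
        yield count, recv_exact(sock, size)
```

On the laptop side, a loop such as for count, jpeg in recv_frames(conn): ... would process one complete frame per iteration; keeping per-frame work (including print calls) light helps the receiver keep up with the sender.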

Addressing multiple B200 devices through the UHD API

I have 2 B210 radios on USB3 on a Windows 10 system using the UHD USRP C API latest release and Python 3.6 as the programming environment. I can "sort of" run them simultaneously in separate processes but would like to know if it is possible to run them in a single thread? How?
1 Happy to move to Linux if it makes things easier, I'm just more familiar with Windows.
2 "sort of" = I sometimes get errors which might be the two processes colliding somewhere down the stack.
The code below illustrates the race condition: sometimes one or both processes fail with error code 40 (UHD_ERROR_ASSERTION) or occasionally code 11 (UHD_ERROR_KEY).
from ctypes import (windll, byref, c_void_p, c_char_p)
from multiprocessing import Process, current_process
def pread(argstring):
    # get handle for device
    usrp = c_void_p(0)
    uhdapi = windll.uhd
    p_str=c_char_p(argstring.encode("UTF8"))
    errNo = uhdapi.uhd_usrp_make(byref(usrp),p_str)
    if errNo != 0:
        print("\r*****************************************************************")
        print("ERROR: ",errNo," IN: ", current_process())
        print("=================================================================")
    if usrp.value != 0:
        uhdapi.uhd_usrp_free(byref(usrp))
    return
if __name__ == '__main__':
    while True:
        p2 = Process(target=pread, args=("",))
        p1 = Process(target=pread, args=("",))
        p1.start()
        p2.start()
        p1.join()
        p2.join()
    print("end")
Yes, you can have multiple multi_usrp handles.
By the way, note that UHD is natively C++, and the C API is just a wrapper around that. It's designed for generating scripting interfaces like the Python thing you're using (don't know which interface between Python and the C API you're using – something self-written?).
While it's possible, there's no good reason to call the recv and send functions from the same thread – most modern machines are multi-threaded, and you should make use of that. Real-time SDR is a CPU-intensive task and you should use all the CPU resources available to get data to and from the driver, as to avoid overflowing buffers.
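The one-thread-per-device layout suggested here can be sketched with the standard library alone. This is a structural illustration only: recv_block is a hypothetical callable standing in for whatever receive call is made on one device handle, and nothing below is a UHD API. It also assumes each handle was created with device arguments that select a distinct radio (for example by serial number) rather than an empty argument string.

```python
import queue
import threading


def stream_device(name, recv_block, out_queue, n_blocks):
    """Worker: pull n_blocks from one device and hand them to the consumer."""
    for _ in range(n_blocks):
        out_queue.put((name, recv_block()))


def run_devices(readers, n_blocks):
    """Run one thread per device and collect the received blocks per device.

    `readers` maps a device name to a hypothetical zero-argument callable
    that returns one block of samples from that device.
    """
    out_queue = queue.Queue()
    threads = [
        threading.Thread(target=stream_device, args=(name, fn, out_queue, n_blocks))
        for name, fn in readers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    results = {name: [] for name in readers}
    while not out_queue.empty():
        name, block = out_queue.get()
        results[name].append(block)
    return results
```

In a real application the consumer would drain the queue while the threads are still running instead of waiting for them to finish, so that buffers do not overflow.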
