Setting Hugging Face dataloader_num_workers for multi-GPU training - huggingface-transformers

Should the Hugging Face transformers TrainingArguments dataloader_num_workers argument be set per GPU, or as a total across GPUs? And does the answer change depending on whether training runs in DataParallel or DistributedDataParallel mode?
For example, if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any benefit in setting dataloader_num_workers greater than 12 (48 / 4)? Or would the workers all start contending for the same resources?
As I understand it, when running in DDP mode (with torch.distributed.launch or similar), one training process manages each device, but in the default DP mode one lead process manages everything. So maybe the answer is 12 for DDP but ~47 for DP?
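The arithmetic in the question can be written down as a small sketch. It assumes that dataloader_num_workers is handed to the DataLoader built inside each training process, so under DDP the value applies once per GPU process and under DP once in total. The helper name worker_budget is hypothetical, and the result is only an upper bound from CPU count, not a tuned value.

```python
def worker_budget(cpu_count: int, gpu_count: int, mode: str) -> int:
    """Rough upper bound for dataloader_num_workers in one training process.

    Assumption: each training process builds its own DataLoader, so the
    setting is applied per process rather than per machine.
    """
    if mode == "ddp":
        # One training process per GPU: the CPUs are shared between
        # gpu_count processes (48 / 4 = 12 in the question).
        return cpu_count // gpu_count
    if mode == "dp":
        # A single lead process drives every GPU: keep one CPU for it.
        return cpu_count - 1
    raise ValueError(f"unknown mode: {mode!r}")


print(worker_budget(48, 4, "ddp"))
print(worker_budget(48, 4, "dp"))
```

Whether the full budget is worth using depends on how expensive the preprocessing per batch is, so measuring throughput at a few values is still needed.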

Related

Test Intel Extension for PyTorch (IPEX) in multiple-choice from huggingface / transformers

I am trying out a Hugging Face example with the SWAG dataset:
https://github.com/huggingface/transformers/tree/master/examples/pytorch/multiple-choice
I would like to use Intel Extension for PyTorch in my code to increase the performance.
Here I am using the version without the Trainer (run_swag_no_trainer).
In run_swag_no_trainer.py, I made some changes to use ipex.
# Code before the change:
device = accelerator.device
model.to(device)
# After adding ipex:
import intel_pytorch_extension as ipex
device = ipex.DEVICE
model.to(device)
Running the command below takes too much time:
export DATASET_NAME=swag
accelerate launch run_swag_no_trainer.py \
--model_name_or_path bert-base-cased \
--dataset_name $DATASET_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir /tmp/$DATASET_NAME/
Is there any other method to test the same on Intel IPEX?
First you have to understand which factors actually increase the running time. These factors are:
A large input size.
The data distribution: a shifted mean and unnormalized values.
A large network depth and/or width.
A large number of epochs.
A batch size that does not fit the physically available memory.
A very small or very high learning rate.
To run faster, work on the factors above:
Reduce the input size to dimensions that preserve the important features.
Always preprocess the input to give it zero mean, and normalize it by dividing by the standard deviation or by the difference between the max and min values.
Keep the network depth and width neither too high nor too low, or use a standard architecture that is well established.
Keep an eye on the number of epochs. If the error or accuracy no longer improves beyond a defined threshold, there is no need to run more epochs.
Choose the batch size based on the available memory and the number of CPUs/GPUs. If a batch cannot be loaded fully into memory, processing slows down because of heavy paging between memory and the filesystem.
Determine an appropriate learning rate by trying several values and using the one that gives the best reduction in error with respect to the number of epochs.
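As an illustration of the preprocessing advice above, here is a minimal sketch of zero-mean, unit-variance scaling using only the standard library. The function name standardize and the sample values are made up for the example; a real pipeline would usually do this per feature with a numerical library.

```python
import statistics


def standardize(values):
    """Shift values to zero mean and scale them to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("cannot standardize constant input")
    return [(v - mean) / std for v in values]


# Hypothetical sample feature column
scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(scaled)
```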

A fast solution to obtain the best ARIMA model in R (function `auto.arima`)

I have a data series composed of 2775 elements:
mean(series)
[1] 21.24862
length(series)
[1] 2775
max(series)
[1] 81.22
min(series)
[1] 9.192
I would like to obtain the best ARIMA model by using function auto.arima of package forecast:
library(forecast)
fit=auto.arima(Netherlands,stepwise=F,approximation = F)
But I have a big problem: RStudio has been running for an hour and a half without producing results. (I run the R code on a Windows machine with a 2.80GHz Intel(R) Core(TM) i7 CPU and 16.0 GB RAM.) I suspect this is due to the length of the time series. Could parallelization be a solution? (I don't know how to apply it.)
Anyway, any suggestions to speed up this code? Thanks!
The forecast package has many of its functions built with parallel processing in mind. One of the arguments of the auto.arima() function is 'parallel'.
According to the package documentation, "If [parallel = ] TRUE and stepwise = FALSE, then the specification search is done in parallel. This can give a significant speedup on multicore machines."
If parallel = TRUE, it automatically selects how many 'cores' to use (on a laptop or desktop this is often the number of physical cores * 2; for example, I have 4 cores with 2 hardware threads each, giving 8 'cores'). If you want to set the number of cores manually, also use the argument num.cores.
I'd recommend the e-book written by Hyndman about the package. It is like a time-series forecasting bible.

Optimal number of filters in a Convolutional network

I'm building a convolutional network for image classification. My network is inspired by the VGG conv network, but I changed the number of layers and the number of filters per layer because my image dataset is quite simple.
Nevertheless, I'm wondering why the number of filters in VGG is always a power of 2: 64 -> 128 -> 256 -> 512 -> 4096
I guessed that's because each pooling divides the output size by 2 x 2, and therefore one would want to multiply the number of filters by 2.
But I'm still wondering what the real reason behind this choice is. Is it for optimization? Is it easier to distribute the calculation? And should I keep this logic in my network?
Yes, it is mainly for optimization. If the network is going to run on a GPU, threads on a GPU are organized into groups and blocks; a group normally has 32 threads.
Roughly speaking, if you have a layer with 40 filters, you will need 2 groups = 64 threads. So why not make use of the remaining threads and give the layer 64 filters, which can all be computed in parallel?
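The rounding argument can be sketched in a few lines. This follows the answer's simplification of one thread per filter and a group size of 32; real GPU kernels map work to threads in more involved ways, so treat it as an illustration of the arithmetic only. The function name threads_allocated is hypothetical.

```python
import math

GROUP_SIZE = 32  # threads per group, as stated in the answer


def threads_allocated(num_filters: int, group_size: int = GROUP_SIZE) -> int:
    """Threads reserved when filters are rounded up to whole groups.

    Assumption: one thread per filter, which is a simplification.
    """
    groups = math.ceil(num_filters / group_size)
    return groups * group_size


for filters in (40, 64):
    total = threads_allocated(filters)
    # Columns: filters requested, threads reserved, threads left idle
    print(filters, total, total - filters)
```

With 40 filters, 24 of the 64 reserved threads sit idle; with 64 filters none do.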

PWM transistor heating - Raspberry

I have a Raspberry and an auxiliary PCB with transistors for driving some LED strips.
The strip's datasheet says 12 V, 13.3 W/m. I'll use 3 strips in parallel, 1.8 m each, so 13.3 * 1.8 * 3 = 71.82 W, which at 12 V is almost 6 A.
I'm using an 8A transistor, E13007-2.
In the project I have 5 channels of different LEDs: RGB and 2 types of white.
R, G, B, W1 and W2 are connected directly to the Pi's pins.
The LED strips are connected to 12 V, and to GND through the transistors via CN3 and CN4.
Transistor schematic.
I know that's a lot of current passing through the transistors, but is there a way to reduce the heating? I think they are reaching 70-100°C. I already had a problem with one Raspberry, and I think it's getting dangerous for the application. I have wide traces on the PCB, so that's not the problem.
Some thoughts:
1 - A resistor driving the base of the transistor. Maybe it won't reduce heating, but I think it's advisable for short-circuit protection. How can I calculate it?
2 - The PWM has a frequency of 100Hz. Would it make any difference if I reduced this frequency?
The BJT you're using has a current gain hFE of roughly 20. This means that the collector current is roughly 20 times the base current, so the base current needs to be 1/20 of the collector current, i.e. 6A/20=300mA.
A Raspberry Pi certainly can't supply 300mA from its IO pins, so the transistor is not driven hard enough to saturate and operates in its linear region, which causes it to dissipate a lot of heat.
Change your transistors to MOSFETs with low enough threshold voltage (like 2.0V to have enough conduction at 3.3V IO voltage) to keep it simple.
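The numbers in this thread can be checked with a short calculation. The sketch below uses only values quoted above (13.3 W/m, 1.8 m, 3 strips, 12 V, hFE of roughly 20); the function names are hypothetical, and the result is the base current needed for saturation under that approximate gain, not a measured figure.

```python
def collector_current(power_w: float, supply_v: float) -> float:
    """Load current drawn through the transistor, I = P / V."""
    return power_w / supply_v


def required_base_current(collector_a: float, hfe: float) -> float:
    """Base current needed to support the collector current, I_B = I_C / hFE."""
    return collector_a / hfe


power_w = 13.3 * 1.8 * 3                 # W/m * metres per strip * strips
i_c = collector_current(power_w, 12.0)   # close to 6 A
i_b = required_base_current(i_c, 20.0)   # close to 300 mA

print(f"power: {power_w:.2f} W")
print(f"collector current: {i_c:.2f} A")
print(f"required base current: {i_b * 1000:.0f} mA")
```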
An N-channel MOSFET will run much cooler if you give it enough gate voltage to fully enhance it. Since this is not a high-volume item, why not simply use a MOSFET gate driver chip? Then you can use a low-RDS(on) device. Another option is the Siemens BTS660 (S50085B BTS50085B TO-220). It is a high-side driver that you will need to drive with an open-collector or open-drain device. It will switch 5A at room temperature with no heat sink. It is rated for much more current and is available in a TO-220 type package. It is obsolete but still available, as is its replacement. MOSFETs are voltage controlled, while bipolar transistors are current controlled.

H2O - Not seeing much speed-up after moving to powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations and in each loop calls H2ODeepLearningEstimator() 4 times and associated predict() and model_performance(). I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: the training set has 80,000 rows with 122 features (all float), with 20% used for validation (10-fold CV). The test set has 20,000 rows. The task is binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be proportional to cores times core speed. Machine 2 has 20 * 2.0 = 40 and machine 1 has 4 * 3.5 = 14, so you might have expected a 40/14 = 2.86 speed-up (i.e. your 24hrs coming down to the 8-10 hour range).
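The estimate can be reproduced as a quick calculation. This is a sketch of the "cores times core speed" rule of thumb only; it ignores memory bandwidth, differences between CPU generations, and scaling losses, so the real speed-up is likely to be lower. The function name compute_capacity is hypothetical.

```python
def compute_capacity(cores: int, ghz: float) -> float:
    """Crude capacity estimate: number of cores times clock speed."""
    return cores * ghz


machine_1 = compute_capacity(4, 3.5)    # Windows 7 box from the question
machine_2 = compute_capacity(20, 2.0)   # CentOS 7 box from the question

speedup = machine_2 / machine_1
expected_hours = 24 / speedup           # machine 1 took about 24 hours

print(f"expected speed-up: {speedup:.2f}x")
print(f"expected run time: {expected_hours:.1f} hours")
```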
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
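The repeated 80/10/10 split with different seeds could look roughly like the sketch below. It uses only the standard library and works on row indices; the function name split_indices and the seed values are made up for the example, and in practice you would apply the resulting indices (or the equivalent splitting facility of your framework) to the actual data before training.

```python
import random


def split_indices(n_rows, seed, ratios=(0.8, 0.1, 0.1)):
    """Shuffle row indices with a seed and cut them into train/valid/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, reproducible per seed
    n_train = round(n_rows * ratios[0])
    n_valid = round(n_rows * ratios[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]    # whatever remains
    return train, valid, test


# Rebuild the split a few times with different (hypothetical) seeds
for seed in (1, 2, 3):
    train, valid, test = split_indices(80_000, seed)
    print(seed, len(train), len(valid), len(test))
```

If the metrics from models trained on these splits stay close to each other, that supports dropping cross-validation for this dataset.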
