Test Intel Extension for PyTorch (IPEX) with the multiple-choice example from huggingface/transformers - performance

I am trying out a huggingface example with the SWAG dataset:
https://github.com/huggingface/transformers/tree/master/examples/pytorch/multiple-choice
I would like to use Intel Extension for PyTorch in my code to increase the performance.
Here I am using the variant without the Trainer (run_swag_no_trainer).
In run_swag_no_trainer.py, I made some changes to use ipex.
# Code before the change:
device = accelerator.device
model.to(device)
# After adding ipex:
import intel_pytorch_extension as ipex
device = ipex.DEVICE
model.to(device)
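(For reference, newer releases ship as intel_extension_for_pytorch and optimize the model on the CPU rather than exposing a special device. A minimal hedged sketch, assuming a recent version of that package is installed and that model/optimizer are the objects created in the script above:)
# Hedged sketch of the newer intel_extension_for_pytorch API (an assumption,
# not part of the original question). `model` and `optimizer` come from
# run_swag_no_trainer.py.
import torch
import intel_extension_for_pytorch as ipex

model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)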
While running the command below, it's taking too much time.
export DATASET_NAME=swag
accelerate launch run_swag_no_trainer.py \
--model_name_or_path bert-base-cased \
--dataset_name $DATASET_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir /tmp/$DATASET_NAME/
Is there any other way to test the same thing with Intel IPEX?

First you have to understand which factors actually increase the running time. These are the main ones:
Large input size.
Input data with a shifted mean and no normalization.
Large network depth and/or width.
A large number of epochs.
A batch size that does not fit the physically available memory.
A very small or very high learning rate.
For faster runs, work on the factors above, for example:
Reduce the input size to the smallest dimensions that still preserve the important features.
Always preprocess the input to make it zero-mean, and normalize it by dividing by the standard deviation or by the max-min range (see the short sketch after this list).
Keep the network depth and width neither too high nor too low, or use a standard architecture that is already well proven.
Keep an eye on the number of epochs: if the error or accuracy no longer improves beyond a defined threshold, there is no need to run more epochs.
The batch size should be chosen based on the available memory and the number of CPUs/GPUs. If a batch cannot be loaded fully into memory, processing slows down because of constant paging between memory and the filesystem.
An appropriate learning rate should be determined by trying several values and keeping the one that reduces the error fastest with respect to the number of epochs.
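As an example of the normalization point above, here is a minimal NumPy sketch (the array and its statistics are illustrative, not taken from the question):
import numpy as np

# x: hypothetical raw inputs, shape (num_samples, num_features).
x = np.random.rand(1000, 128).astype(np.float32) * 255.0

# Zero-mean, unit-variance standardization (per-feature statistics).
mean = x.mean(axis=0)
std = x.std(axis=0) + 1e-8              # epsilon avoids division by zero
x_standardized = (x - mean) / std

# Alternative: scale by the max-min range instead of the standard deviation.
x_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0) + 1e-8)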

Related

Tensorflow-GPU performance decreases each epoch on Windows machine

I'm seeing decreasing performance with each subsequent epoch running Tensorflow-GPU with Keras on Windows.
I'm training a 16-layer CNN and load my data using tf.data. The first epoch performs as well as expected: ~1hr training time. The CPU and GPU CUDA load is at 85-90%. Temperatures are reasonable (CPU between 65-70C and GPU at 70C) -- no thermal throttling occurs.
But by the second epoch, GPU CUDA load inexplicably drops to 33-50%. CPU load appears reduced as well, to around 65-75%. I don't see any big change in disk I/O speeds (the data is being loaded from an NVMe SSD).
I don't think this is a hardware issue. I'm using an RTX 3090 and a 4-core i5-6500 CPU and, as mentioned, the performance on the first epoch is quite good, with enough thermal headroom to continue. My tf.data pipeline looks like this:
# Construct tf.data.Dataset
paths_ds = tf.data.Dataset.from_tensor_slices(image_paths)
if shuffle:
    paths_ds = paths_ds.shuffle(buffer_size=len(image_paths), reshuffle_each_iteration=True)  # reshuffle each epoch
dataset = paths_ds.map(
    lambda path: (self.load_image(path, scale_range, crop_size, augment),
                  self.get_onehot_label(path, tf_class_index_by_image, self.num_classes)),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
A map and a prefetch, both using AUTOTUNE. I suspect AUTOTUNE may be the problem, but I'm not sure how to debug this and determine how the number of threads and the prefetch buffer size evolve over time. There is no documentation on how AUTOTUNE works.
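One hedged way to rule AUTOTUNE in or out (not from the original post) is to temporarily pin both knobs to fixed values and check whether the per-epoch slowdown still appears; a sketch against the pipeline above:
# Debugging sketch: replace AUTOTUNE with fixed, illustrative values.
FIXED_PARALLEL_CALLS = 4    # e.g. one per physical core of the 4-core i5-6500
FIXED_PREFETCH_BUFFER = 2   # number of prepared batches to keep ahead

dataset = paths_ds.map(
    lambda path: (self.load_image(path, scale_range, crop_size, augment),
                  self.get_onehot_label(path, tf_class_index_by_image, self.num_classes)),
    num_parallel_calls=FIXED_PARALLEL_CALLS
)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(buffer_size=FIXED_PREFETCH_BUFFER)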
EDIT: I did some profiling and discovered that from Epoch 2 onwards, Iterator::MapAndBatch is consuming significant time (it's hard to even see in Epoch 1 because the data it collects is always ready). The cause appears to be some sort of perplexing slowness in my "load_image" function. Each operation it performs (read file, decode JPEG, convert to float, resize) is followed by a pause from Epoch 2 onwards, and this becomes slower over time.
For example, in the Epoch 1 trace (profiler screenshot not included here) everything functions correctly: each read is immediately followed by a decode, and so on. In the Epoch 2 trace (also not included), pauses appear between those same operations.
What could be causing this? Attempting to elevate the priority of the process using Process Explorer has no effect.

In a normal image classification task using CNNs, what should the number of units in the dense layer be?

I am just creating a normal image classifier for rock-paper-scissors. I am using my local GPU and it isn't a high-end GPU. When I began training the model it kept giving the error:
ResourceExhaustedError: OOM when allocating tensor with shape.
I googled this error and the suggestion was to decrease my batch size, which I did. That still did not solve anything. Later I changed my image size to 50*50 (initially it was 200*200) and then it started training with an accuracy of 99%.
Later I wanted to see if I could do it with 150*150 images, as I found a tutorial on the official TensorFlow channel on YouTube. I followed their exact code and it still did not work. I reduced the batch size, still no solution. Then I changed the number of units in the dense layer: initially it was 512, and when I decreased it to 200 it worked fine, but now the accuracy is pretty bad. I was just wondering: is there any way I could tune my model to my GPU without affecting the accuracy? And how does the number of units in the dense layer matter? It would really help me a lot.
# Missing imports added for completeness (assuming tf.keras):
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, MaxPool2D, Flatten, Dropout, Dense
from tensorflow.keras.models import Model

i = Input(shape=X_train[0].shape)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(i)
x = BatchNormalization()(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)
x = Flatten()(x)
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(3, activation='softmax')(x)
model = Model(i, x)
Okay, so when I run this with an image size of 150*150 it throws that error. If I change the image size to 50*50 and reduce the batch size to 8, it works and gives me an accuracy of 99%. But if I use 150*150 and reduce the number of units in the dense layer to 200 (chosen arbitrarily), it runs fine but the accuracy is very, very bad.
I am using a low-end NVIDIA GeForce MX230 GPU with 4 GB of VRAM.
For 200x200 images the output of the last MaxPool has a shape of (50, 50, 128), which is then flattened and serves as the input of the Dense layer, giving you a total of 50*50*128*512 = 163,840,000 weights in that layer. This is a lot.
To reduce the amount of parameters you can do one of the following:
- reduce the amount of filters in the last Conv2D layer
- do a MaxPool of more than 2x2
- reduce the size of the Dense layer
- reduce the size of the input images.
You have already tried the latter two options. Only trial and error will tell you which method ultimately gives the best accuracy; you were already at 99%, which is good. The first two options are sketched below.
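A hedged sketch of the first two options applied to the model above (150x150 RGB input; the filter count and pool size are illustrative, not a recommendation):
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, MaxPool2D, Flatten, Dropout, Dense
from tensorflow.keras.models import Model

i = Input(shape=(150, 150, 3))                                # assumed 150x150 RGB input
x = Conv2D(64, (3, 3), padding='same', activation='relu')(i)
x = BatchNormalization()(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)                                      # 150 -> 75
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)  # 64 filters instead of 128
x = BatchNormalization()(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((4, 4))(x)                                      # 75 -> 18, larger pool
x = Flatten()(x)                                              # 18*18*64 = 20,736 features
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)                          # ~10.6M weights here vs ~90M in the original at 150x150
x = Dropout(0.2)(x)
x = Dense(3, activation='softmax')(x)
model = Model(i, x)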
If you want a platform with more VRAM available, you can use Google Colab https://colab.research.google.com/

Why is the actual runtime for a larger search value sometimes smaller than for a lower search value in a sorted array?

I executed a linear search on an array containing all unique elements in the range [1, 10000], sorted in increasing order, with every search value from 1 to 10000, and plotted the runtime vs. search value graph (plot not included here).
Upon closely analysing a zoomed-in version of the plot (also not included), I found that the runtime for some larger search values is smaller than for lower search values, and vice versa.
My best guess for this phenomenon is that it is related to how the data is processed by the CPU using primary memory and cache, but I don't have a firm, quantifiable reason to explain it.
Any hint would be greatly appreciated.
PS: The code was written in C++ and executed on a Linux platform hosted on a virtual machine with 4 vCPUs on Google Cloud. The runtime was measured using the C++ chrono library.
CPU cache size depends on the CPU model and there are several cache levels, so your experiment should take all those factors into account. L1 cache is usually around 8 KiB, which is about 4 times smaller than your 10000-element array. But I don't think this is cache misses: L2 latency is about 100 ns, which is much smaller than the difference between the lowest line and the second line, which is about 5 usec. I suppose that second cloud of points comes from context switching: the longer the task, the more likely a context switch is to occur. This is why the cloud on the right side is thicker.
Now for the zoomed-in figure. As Linux is not a real-time OS, its time measurement is not very reliable. IIRC its minimal reporting unit is a microsecond. Now, if a certain task takes exactly 15.45 microseconds, then the reported time depends on when it started. If the task started exactly on a clock tick, the time reported would be 15 microseconds. If it started when the internal clock was 0.1 microseconds into a tick, then you will get 16 microseconds. What you see on the graph is the projection of an analogue straight line onto a discrete-valued axis. So the task duration you get is not the actual duration, but the real value plus the task's start offset within the microsecond (which is uniformly distributed, ~U[0, 1]), all rounded to the closest integer value.
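A small simulation of that rounding model (not from the original answer; all numbers are illustrative) shows the effect:
import numpy as np

# Rounding model from the paragraph above: reported = round(true duration +
# start offset within the 1 us reporting unit), with offset ~ U[0, 1).
rng = np.random.default_rng(0)
true_duration_us = 15.45                               # hypothetical true duration
start_offsets = rng.uniform(0.0, 1.0, size=100_000)
measured_us = np.round(true_duration_us + start_offsets)

# The same 15.45 us task is reported as 15 us or 16 us, split according to how
# much of the offset range falls on each side of the rounding boundary.
values, counts = np.unique(measured_us, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))     # roughly {15.0: ~5%, 16.0: ~95%}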

How to get detailed memory breakdown in the TensorFlow profiler?

I'm using the new TensorFlow profiler to profile memory usage in my neural net, which I'm running on a Titan X GPU with 12GB RAM. Here's some example output when I profile my main training loop:
==================Model Analysis Report======================
node name | requested bytes | ...
Conv2DBackpropInput 10227.69MB (100.00%, 35.34%), ...
Conv2D 9679.95MB (64.66%, 33.45%), ...
Conv2DBackpropFilter 8073.89MB (31.21%, 27.90%), ...
Obviously this adds up to more than 12GB, so some of these matrices must be in main memory while others are on the GPU. I'd love to see a detailed breakdown of what variables are where at a given step. Is it possible to get more detailed information on where various parameters are stored (main or GPU memory), either with the profiler or otherwise?
"Requested bytes" shows a sum over all memory allocations, but that memory can be allocated and de-allocated. So just because "requested bytes" exceeds GPU RAM doesn't necessarily mean that memory is being transferred to CPU.
In particular, for a feedforward neural network, TF will normally keep the forward activations around to make backprop efficient, but it doesn't need to keep the intermediate backprop activations, i.e. dL/dh at each layer, so it can just throw these intermediates away once it is done with them. So I think in this case what you care about is the memory used by Conv2D, which is less than 12 GB.
You can also use the timeline to verify that total memory usage never exceeds 12 GB.
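For reference, a hedged sketch of how the timeline can be dumped with the TF 1.x-era API this question is about (sess, train_op and the output path are assumptions, not from the original answer):
import tensorflow as tf
from tensorflow.python.client import timeline

# Collect per-step run metadata and write a Chrome trace that includes memory
# information. `sess` and `train_op` are assumed to exist in the training loop.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(train_op, options=run_options, run_metadata=run_metadata)

tl = timeline.Timeline(run_metadata.step_stats)
chrome_trace = tl.generate_chrome_trace_format(show_memory=True)
with open('/tmp/timeline_step.json', 'w') as f:   # load this file in chrome://tracing
    f.write(chrome_trace)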

Why is the execution time of the tf.nn.conv2d function different while the number of multiplications is the same?

I am using TensorFlow to build a CNN for an image classification experiment, and I found the following phenomenon:
operation 1: tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
The shape of x is [128,128,32], meaning a convolution with a 3x3 kernel on x where both the input and output channels are 32. The total number of multiplications is
3*3*32*32*128*128 = 150994944
operation 2: tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
The shape of x is [64,64,64], meaning a convolution with a 3x3 kernel on x where both the input and output channels are 64. The total number of multiplications is
3*3*64*64*64*64 = 150994944
Compared with operation 1, the feature map size of operation 2 is scaled down to 1/2 and the channel count is doubled. The number of multiplications is the same, so the running time should be the same. But in practice, the running time of operation 1 is longer than that of operation 2.
My measurement method is shown below:
Removing one convolution of operation 1 reduced the training time for one epoch by 23 seconds, which means the running time of operation 1 is 23 seconds.
Removing one convolution of operation 2 reduced the training time for one epoch by 13 seconds, which means the running time of operation 2 is 13 seconds.
The phenomenon is reproducible every time.
My GPU is an NVIDIA GTX 980 Ti and the OS is Ubuntu 16.04.
So here is the question: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels are stored next to each other, then the columns, then the rows. https://en.wikipedia.org/wiki/Row-major_order
Accessing closely packed memory is very important for performance, especially on a GPU, which has a wide memory bus and is optimized for aligned, in-order memory access. In the case with the larger image you have to skip around the image more, and the memory access is more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations; I bet that with a convolution, memory access is the bottleneck that limits performance.
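A quick way to see that layout (a NumPy illustration, not part of the original answer):
import numpy as np

# Byte strides of row-major (H, W, C) tensors: the channel axis is contiguous,
# stepping along W skips C elements, stepping along H skips W*C elements.
a = np.zeros((128, 128, 32), dtype=np.float32)
b = np.zeros((64, 64, 64), dtype=np.float32)
print(a.strides)   # (16384, 128, 4): case 1 has a 128-byte contiguous run per pixel
print(b.strides)   # (16384, 256, 4): case 2 has a 256-byte contiguous run per pixel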
chasep255's answer is good and probably correct.
Another possibility (or alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1 you have 9*1024 static filter elements; in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process finds more of its data where it expects it (i.e. where it was the last time it was asked for). Net result: fewer memory access stalls, more speed.
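A more direct way to compare the two operations than subtracting per-epoch training times would be a micro-benchmark; a hedged sketch in TF 2.x eager mode (batch size, repetition count and dtype are assumptions):
import time
import tensorflow as tf

def bench(input_shape, filter_shape, iters=100):
    # Random batch-of-one input and filter matching the shapes from the question.
    x = tf.random.normal([1, *input_shape])
    w = tf.random.normal(filter_shape)
    for _ in range(10):                              # warm-up: kernel selection, allocation
        y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
    _ = y.numpy()
    start = time.perf_counter()
    for _ in range(iters):
        y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
    _ = y.numpy()                                    # force the GPU work to finish
    return (time.perf_counter() - start) / iters

t1 = bench((128, 128, 32), (3, 3, 32, 32))   # operation 1: 128x128, 32 -> 32 channels
t2 = bench((64, 64, 64), (3, 3, 64, 64))     # operation 2: 64x64, 64 -> 64 channels
print(f"op1: {t1 * 1e3:.3f} ms, op2: {t2 * 1e3:.3f} ms")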
