TensorFlow conv2d slow compared to others - performance

I'm trying to use TF to do some filtering. I have 60 images of size 1740 x 2340 and a Gaussian filter of size 16 x 16. I ran a conv2d as follows:
strides = [1,1,1,1]
data_ph = tf.constant(data,tf.float32)
filt_ph = tf.constant(filt,tf.float32)
data_format = 'NCHW'
conv = tf.nn.conv2d(data_ph,filt_ph,strides,'SAME',data_format=data_format)
where
data_ph = <tf.Tensor 'Const:0' shape=(60, 1, 1740, 2340) dtype=float32>
filt_ph = <tf.Tensor 'Const_1:0' shape=(16, 16, 1, 1) dtype=float32>
I also tried placeholders instead of constants, and readers such as tf.FixedLengthRecordReader. I repeated the experiment a few times per run. The first run takes 12 s, and subsequent runs take 4 s with constants and 5 s with placeholders. The same experiment in MXNet always takes 1.6 s, and in MATLAB 1.5 s. In all cases I'm placing the computation on the GPU, an 8 GB Quadro K5200. Is this expected (some posts mention TF being slower than other frameworks), or am I doing something wrong?

Odoo14 - Questions about limit_memory soft and hard

I'm trying to figure out how memory limits work and how to choose the right values.
My test server (VM) has 16GB of RAM and 4 vCPUs, but it is a shared server, so I chose to use only 2 vCPUs and 2GB of RAM.
I looked at the official documentation and calculated how many workers and how much RAM I need (https://www.odoo.com/documentation/14.0/administration/install/deploy.html#worker-number-calculation).
W = Workers (workers)
2 workers for 1 CPU
CW = Cron Workers (max_cron_threads)
TW = W + CW
Worker number calculation
(#CPU * 2) + CW
(2 * 2) + 1 = 5 theoretical maximum workers
Memory size calculation
Needed RAM = W * ( (light_worker_ratio * light_worker_ram_estimation) + (heavy_worker_ratio * heavy_worker_ram_estimation) )
5 * ((0.8 * 150) + (0.2 * 1024)) = 1624 (~2GB of RAM).
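The two calculations above can be reproduced with a short Python sketch. The function and parameter names are hypothetical, chosen only for illustration. The ratios and per-worker RAM estimates (80% light workers at 150 MB, 20% heavy workers at 1024 MB) are the figures used in the calculation above.

```python
def max_workers(cpus, cron_workers=1, workers_per_cpu=2):
    """Theoretical maximum number of workers: (#CPU * 2) + CW."""
    return cpus * workers_per_cpu + cron_workers


def needed_ram_mb(workers, light_ratio=0.8, light_ram_mb=150,
                  heavy_ratio=0.2, heavy_ram_mb=1024):
    """Estimated RAM in MB for the given number of workers."""
    return workers * (light_ratio * light_ram_mb + heavy_ratio * heavy_ram_mb)


workers = max_workers(cpus=2)    # 2 vCPUs, 1 cron worker
ram_mb = needed_ram_mb(workers)  # same inputs as the calculation above

print(f"workers: {workers}, estimated RAM: {ram_mb:.0f} MB")
```

Changing `cpus` or the ratios shows how sensitive the RAM estimate is to the assumed share of heavy workers.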
Ok, now, I go to the "configuration sample" (https://www.odoo.com/documentation/14.0/administration/install/deploy.html#id5) and I see I need to estimate how many concurrent users I'll have.
Can you confirm that the number of concurrent users includes all website visitors and not only the connected users?
In the configuration sample, how do you calculate or estimate the values of the limits (limit_memory_hard, limit_memory_soft, limit_request, limit_time_cpu, limit_time_real)?
I've read a lot of documentation (official or not), but it never says how to calculate these values.
Examples:
https://github.com/DocCyblade/tkl-odoo/issues/49 (I really don't understand how DocCyblade finds its values with its formula)
https://github.com/DocCyblade/tkl-odoo/blob/master/overlay/etc/odoo/openerp-server.conf
https://linuxize.com/post/how-to-install-odoo-14-on-ubuntu-20-04/
https://www.rosehosting.com/blog/how-to-speed-up-odoo/ (2048 is the default value since Odoo 10, not 640). If I apply its formula, I get:
limit memory soft : 5 * 2147483648 = 10737418240
limit memory hard : 5 * 2684354560 = 13421772800
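To make the size of these numbers easier to judge, here is a small sketch that redoes the multiplication and converts the totals to GiB. It only reproduces the arithmetic above (5 workers, 2048 MiB and 2560 MiB per worker). It does not say anything about whether this formula is the right way to choose the limits, and the variable names are illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

workers = 5
soft_per_worker = 2048 * MIB  # 2147483648 bytes
hard_per_worker = 2560 * MIB  # 2684354560 bytes

soft_total = workers * soft_per_worker
hard_total = workers * hard_per_worker

print(f"soft: {soft_total} bytes = {soft_total / GIB} GiB")
print(f"hard: {hard_total} bytes = {hard_total / GIB} GiB")
```

Both totals (10 GiB and 12.5 GiB) are well above the 2GB budgeted for this server.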
Can you help me, please?
Thanks

MFCC feature extraction, Librosa

I want to extract MFCC features from an audio file sampled at 8000 Hz, with a frame size of 20 ms and an overlap of 10 ms. What should the parameters of the librosa.feature.mfcc() function be? Does the code below specify 20 ms chunks with 10 ms overlap?
import librosa as l
x, sr = l.load('/home/user/Data/Audio/Tracks/Dev/FS_P01_dev_001.wav', sr = 8000)
mfccs = l.feature.mfcc(y=x, sr=sr, n_mfcc = 24, hop_length = 160)
The audio file is 1800 seconds long. Does that mean I would get 24 MFCCs for each of the (1800/0.01)-1 chunks of the audio?
1800 seconds at 8000 Hz are obviously 1800 * 8000 = 14400000 samples.
If your hop length is 160, you get roughly 14400000 / 160 = 90000 MFCC frames with 24 coefficients each. That is clearly not (1800 / 0.01) - 1 = 179999; it is off by a factor of roughly 2.
Note that I used roughly in my calculation, because I only used the hop length and ignored the window length. Hop length is the number of samples the window is moved with each step. How many hops you can fit depends on whether you pad somehow or not. And if you decide not to pad, the number of frames also depends on your window size.
To get back to your question: ask yourself how many samples there are in 10 ms.
If 1 s contains 8000 samples (that's what 8000 Hz means), how many samples are in 0.01 s? That's 8000 * 0.01 = 80 samples.
This means you need a hop length of 80 samples and a window length of 160 samples (0.02 s, twice as long).
Now you should tell librosa to use this info, like this:
import librosa as l
x, sr = l.load('/home/user/Data/Audio/Tracks/Dev/FS_P01_dev_001.wav', sr = 8000)
n_fft = int(sr * 0.02) # window length: 0.02 s
hop_length = n_fft // 2 # usually one specifies the hop length as a fraction of the window length
# pass the signal as a keyword argument (y=...); recent librosa versions require it
mfccs = l.feature.mfcc(y=x, sr=sr, n_mfcc=24, hop_length=hop_length, n_fft=n_fft)
# check the dimensions
print(mfccs.shape)
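The frame-count arithmetic can be checked without loading any audio. The sketch below is pure Python with hypothetical helper names. It assumes the two usual framing conventions: with padding so that frames are centered (as far as I know, librosa's default), the count is 1 + n_samples // hop_length; without padding, it is 1 + (n_samples - window_length) // hop_length.

```python
def ms_to_samples(duration_ms, sr):
    """Convert a duration in milliseconds to a number of samples."""
    return round(sr * duration_ms / 1000)


def frames_padded(n_samples, hop_length):
    """Frame count when the signal is padded so frames are centered."""
    return 1 + n_samples // hop_length


def frames_unpadded(n_samples, window_length, hop_length):
    """Frame count when only complete windows are used (no padding)."""
    return 1 + (n_samples - window_length) // hop_length


sr = 8000
n_samples = 1800 * sr                  # 1800 s of audio
window_length = ms_to_samples(20, sr)  # 160 samples
hop_length = ms_to_samples(10, sr)     # 80 samples

print(window_length, hop_length)
print(frames_padded(n_samples, hop_length))
print(frames_unpadded(n_samples, window_length, hop_length))
```

Under the no-padding convention the count is 179999, which is the (1800/0.01)-1 from the question. With centered padding it is 180001, so the exact value of mfccs.shape[1] depends on the padding setting.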
Hope this helps.

Calculating speed-up time of an application (book exercise)

I've been reading Computer Organization and Design by Patterson and Hennessy and stumbled upon an exercise with three given solutions. I can't find which is the correct one. I tried calculating with the performance equation given in the book:
CPU Execution time = (Instruction count * CPI) / Clock rate
but it doesn't work. Here's the question:
A given application written in Java runs 15 seconds on a desktop processor.
A new Java compiler is released that requires only 0.6 as many instructions as the old compiler.
Unfortunately, it increases the CPI by 1.1.
How fast can we expect the application to run using this new compiler?
Pick the right answer from the three choices below:
a. (15 * 0.6) / 1.1 = 8.2 sec
b. 15 * 0.6 * 1.1 = 9.9 sec
c. (15 * 1.1) / 0.6 = 27.5 sec
Some insights on the correct answer and why it is obtained using that particular formula would be helpful. Thanks!
new instruction count = old instruction count * 0.6
new CPI = old CPI * 1.1
Now substitute and you will arrive at solution b.
A: 15 seconds = (InsA * CPIA) / ClockRate
ClockRate = (InsA * CPIA) / 15 seconds
B: TimeB = (0.6*InsA) * (1.1*CPIA) / ClockRate
TimeB = (0.6*InsA) * (1.1*CPIA) * 15 seconds / (InsA * CPIA)
TimeB = 0.6*1.1*15 seconds = 9.9 seconds
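The substitution can also be checked numerically. In this sketch the clock rate and the old instruction count are hypothetical values (the exercise does not give them), picked only so that the old run time comes out at 15 s. Any other choice gives the same ratio, because both quantities cancel.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU execution time = (instruction count * CPI) / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical baseline, chosen only so that the old time is 15 s.
clock_rate_hz = 2e9
old_instructions = 1e9
old_cpi = 15 * clock_rate_hz / old_instructions

old_time = cpu_time(old_instructions, old_cpi, clock_rate_hz)
# New compiler: 0.6 times the instructions, 1.1 times the CPI, same clock.
new_time = cpu_time(0.6 * old_instructions, 1.1 * old_cpi, clock_rate_hz)

print(f"old: {old_time:.1f} s, new: {new_time:.1f} s")
```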

Why average loss goes up when training using Vowpal Wabbit

I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it showed me weird results. Dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
Why is the number of features in the output (readable) model less than the number of actual features? I counted that the training data contains 78 features (plus the bias that's 79 as shown during the training). The number of feature bits is 24, which should be far more than enough to avoid collision.
Why does the average loss actually go up in the training as you can see in the above example?
(Minor) I tried to increase the number of feature bits to 32, and it output an empty model. Why?
EDIT:
I tried shuffling the input file, as well as using --holdout_off, as suggested. But the result is still almost the same: the average loss goes up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are all distinct, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of examples is too small compared to the number of features).
EDIT2:
Tried to print the average loss for every pass of examples, and see that it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Also another try without the --l1, --l2 and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it's normal for the average loss to go up during one pass, and that it's fine as long as multiple passes end up with the same loss?
On the model size: the model file stores only non-zero weights, so most likely the other weights were zeroed out, especially since you are using --l1.
On the rising loss: it may have many causes. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so that examples labeled -1 are in the first half and examples labeled 1 are in the second, your model will show very good convergence on the first half, but you'll see the average loss jump when it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses: these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
On -b 32: in the master branch, -b 32 is currently blocked altogether. You should use at most -b 31. In practice, -b 24-28 is usually enough even for tens of thousands of features.
I would recommend getting an up-to-date VW version from GitHub.
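When reading the progress tables in the question, it helps to see how the first two columns relate. Judging from the numbers in the first run, "since last" is the mean loss over the examples seen since the previous printed line, and "average loss" is the running mean over all examples so far. The sketch below rebuilds the "average loss" column from the "since last" column. It assumes every example has weight 1, and the helper name is hypothetical.

```python
def running_average(counters, since_last):
    """Rebuild the cumulative average loss from per-interval mean losses."""
    total_loss = 0.0
    previous_counter = 0
    averages = []
    for counter, interval_loss in zip(counters, since_last):
        # Each interval contributes its mean loss times its example count.
        total_loss += interval_loss * (counter - previous_counter)
        previous_counter = counter
        averages.append(total_loss / counter)
    return averages


# First seven rows of the first run in the question.
counters = [1, 2, 4, 8, 16, 32, 64]
since_last = [0.040000, 0.062310, 0.042056, 0.057715,
              0.077711, 0.056079, 0.213358]
reported = [0.040000, 0.051155, 0.046606, 0.052160,
            0.064936, 0.060507, 0.136933]

for counter, rebuilt, shown in zip(counters,
                                   running_average(counters, since_last),
                                   reported):
    print(f"{counter:3d}  rebuilt={rebuilt:.6f}  reported={shown:.6f}")
```

A single interval with a high loss (such as the row ending at example 2048) is therefore enough to pull the running average up, even if most other intervals have a low loss.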

Optimizing a program and calculating % of total execution time improved

So I was told to ask this on here instead of StackExchange:
If I have a program P, which runs on a 2GHz machine M in 30 seconds and is optimized by replacing all instances of 'raise to the power 4' with 3 instructions of multiplying x by. This optimized program will be P'. The CPI of multiplication is 2 and CPI of power is 12. If there are 10^9 such operations optimized, what is the percent of total execution time improved?
Here is what I've deduced so far.
For P, we have:
time (30s)
CPI: 12
Frequency (2GHz)
For P', we have:
CPI (6) [2*3]
Frequency (2GHz)
So I need to figure out how to calculate the time of P' in order to compare the times. But I have no idea how to do this. Could someone please help me out?
Program P, which runs on a 2GHz machine M in 30 seconds and is optimized by replacing all instances of 'raise to the power 4' with 3 instructions of multiplying x by. This optimized program will be P'. The CPI of multiplication is 2 and CPI of power is 12. If there are 10^9 such operations optimized,
From this information we can compute the time needed to execute all POWER4 ("raise to the power 4") instructions. We know the total count of such instructions (all POWER4 instructions were replaced, and the count is 10^9, or 1 G). Every POWER4 instruction needs 12 clock cycles (CPI = cycles per instruction), so all POWER4 instructions were executed in 1G * 12 = 12G cycles.
A 2GHz machine executes 2G cycles per second, and execution takes 30 seconds, so the total for program P is 2G * 30 = 60G cycles (60 * 10^9). We can conclude that P contains other instructions as well. We don't know which instructions they are, how many of them are executed, or their mean CPI. But we know that the other instructions take 60G - 12G = 48G cycles (total program cycles minus POWER4 cycles; this holds for simple processors). If there are X other executed instructions with mean CPI Y, then X*Y = 48G.
So, total cycles executed for the program P is
Freq * seconds = POWER4_count * POWER4_CPI + OTHER_count * OTHER_mean_CPI
2G * 30 = 1G * 12 + X*Y
Or total running time for P:
30s = (1G * 12 + X*Y) / 2GHz
what is the percent of total execution time improved?
After replacing 1G POWER4 operations with three times as many MUL (multiply) instructions, we have 3G MUL operations, and the cycles needed for them are CPI * count, where the MUL CPI is 2: 2 * 3G = 6G cycles. The X*Y part is unchanged in P', so we can solve the problem.
P' time in seconds = ( MUL_count * MUL_CPI + OTHER_count * OTHER_mean_CPI ) / Frequency
P' time = (3G*2 + X*Y) / 2GHz
The improvement is not as big as might be expected, because the POWER4 instructions account for only part of P's running time: 12G/60G. The optimization turned 12G cycles into 6G without changing the remaining 48G cycles. Halving only part of the running time does not halve the total.
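Plugging the numbers into the formulas above gives the final figures. This is a minimal sketch of that arithmetic with illustrative variable names. It relies on the same simplification as the reasoning above (total cycles are the sum of the per-instruction cycles).

```python
frequency_hz = 2e9  # 2 GHz
old_time_s = 30.0   # running time of P

power4_count = 1e9  # number of replaced 'power 4' operations
power4_cpi = 12
mul_per_power4 = 3  # each POWER4 becomes 3 multiplications
mul_cpi = 2

total_cycles = frequency_hz * old_time_s      # 60G cycles
power4_cycles = power4_count * power4_cpi     # 12G cycles
other_cycles = total_cycles - power4_cycles   # 48G cycles (X*Y)

mul_cycles = power4_count * mul_per_power4 * mul_cpi  # 6G cycles
new_time_s = (mul_cycles + other_cycles) / frequency_hz

improvement_pct = (old_time_s - new_time_s) / old_time_s * 100

print(f"P' time: {new_time_s:.1f} s, improvement: {improvement_pct:.1f} %")
```

So P' runs in about 27 s, a 10% reduction in total execution time, even though the optimized part itself became twice as fast.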
