Measuring algorithm runtime accurately with MATLAB

Measuring algorithm runtime accurately with MATLAB - algorithm

I am trying to measure the running time of the code below.
function out=load_database()
persistent loaded;
persistent w;
if(isempty(loaded))
A=zeros(10304,400);
for i=1:40
cd(strcat('s',num2str(i)));
for j=1:8
a=imread(strcat(num2str(j),'.pgm'));
A(:,(i-1)*8+j)=reshape(a,size(a,1)*size(a,2),1);
end
cd ..
end
w=A;
end
loaded=1;
out=w;
end
I tried using tic and toc but it does not work properly because the measured
time is like 2-3 seconds when in reality it takes almost 20 seconds until everything is done.
This code is for ORL database: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
and it loads only 8 out of 10 images. 8 for training, the other 2 for testing. Not the best way, but is ok for me.

Related

Why is the Computer Clock faster than the real clock

Using the conventional way of measuring run time, I've done something like the following:
start = time.time()
#example code
for _ in range(100000):
for __ in range(100000):
print(_)
end = time.time()
run_time = end-start
I noticed in some cases, the run_time reading would give me 9 seconds. But I was 100% sure it hasn't been 9 seconds long. So I measured with my phone stopwatch. Conservatively speaking, I got 2 seconds on my phone's stopwatch while the program says 4 seconds. (and I have done this multiple times to check it wasn't a false reading) So now I am wondering if the computer's clock works in a different way than our normal clock.

Matlab parallel computing running slower than expected

I am trying to use Matlab parallel computing toolbox. My PC has 6 'workers' or cores. Thus, I would expect my code to run roughly 6x as fast (ie ~600% increase in speed). However, when I actually time the opertations, I find I am only getting roughly a 40% increase in speed.
Is this normal, or am I doing something wrong?
Here is my code:
N=5000;
%%Parrelel
pp=parpool(6)
ts=tic;
parfor i=1:12
q=eye(N); q^-1;
end
disp(['Time for Parrelel Computation: ' num2str(toc(ts),3) 's']);
delete(pp);
%Serial
ts=tic;
for i=1:12
q=eye(N); q^-1;
end
disp(['Time for Serial Computation: ' num2str(toc(ts),3) 's']);
The readout is:
Time for Parrelel Computation: 24.6s
Time for Serial Computation: 35.9s
I wouldve expected the parrelel computation to be roughly 35/6~=6s, not 24s
Any advice?
Thanks
Roman

Parallel-ForkManager, DBI. Faster than before forking, but still too slow

I have a very simple task on updating database.
my $pm = new Parallel::ForkManager(15);
for my $line (#lines){
my $pid = $pm->start and next;
my $dbh2 = $dbh->clone();
my $sth2 = $dbh2->prepare("update db1 set field1=? where field2 =?");
my ($field1, $field2) = very_slow_subroutine();
$sth2->execute($field1,$field2);
$pm->finish;
}
$pm->wait_all_children;
I could just use $dbh2->do, but I doubt it a reason for a slowness.
What interesting, is that it seems it very fast starts these 15 processes (or whatever I specify) , but right after that slows drastically, still noticeable faster than without forking, but I would expect more...
Edit:
The very_slow_subroutine is sub which get an answer from a web service. The service can answer from fraction of second to several seconds on time out. I have to ask dozen thousands times... the reason I would like to make a fork.
And if this is matters -- I am on Linux.

Parallel::ForkManager doesn't magically make things faster, it just lets you do run your code multiple times and at the same time. In order to get the benefit out of it, you have to design your code for parallelism.
Think of it this way. It takes you 10 minutes to get to the store, shop, load your car, come back, and unload it. You need to get 5 loads. You alone can do it in 50 minutes. That is working in serial. 10 minutes * 5 trips one after the other = 50 minutes.
Let's say you get four friends to help. You all start off for the store at the same time. There's still 5 trips, and they still take 10 minutes, but because you did it in parallel the total time is only 10 minutes.
But it will never take less than 10 minutes, no matter how many trips you have to make or how many friends you get to help. That is why the process starts up fast, everybody gets into their cars and drives off to the store, but then nothing happens for a while because it still takes 10 minutes for everyone to do their job.
Same thing here. Your loop body takes X time to run. If you iterate through it Y times, it will take X * Y real world human time to run. If you run it in parallel Y times, ideally it will take just X time to run. Each parallel worker must still execute the full body of the loop taking X time.
In order to speed things up further, you have to break up the big bottleneck of very_slow_subroutine and make that work in parallel. Your SQL is so simple that is where you should focus your efforts at optimization and parallelism.
Let's say the store is really close, it's only a 1 minute drive (this is your SQL UPDATE), but shopping, loading and unloading takes 9 minutes (this is very_slow_subroutine). What if instead you have 5 cars and 15 friends. You load 3 people into each car. Driving to and from the store will take the same time, but now three people are working together to do the shopping, loading and unloading taking only 4 minutes. Now each trip takes 5 minutes instead of 10.
This represents redesigning very_slow_subroutine to do its work in parallel. If it's just a big loop, you can put more workers on that loop. If it's a series of slow operations, you will have to redesign it to take advantage of parallel execution.
If you use too many workers you can clog up the system, it depends on what the bottleneck is. If it's CPU bound and you have 2 CPU cores, you're probably see performance gains up to 3 to 5 workers ((cores * 2)+1 is a good rule of thumb) and after that performance will drop off as the CPU spends more time switching between processes than doing work. If the bottleneck is IO, or an external service as is often the case with database and network calls, you can see great efficiencies throwing many workers at the problem. While one process is waiting around for a disk or network operation, the others can be using your CPU.

Whether parallelism can help depends on where your bottleneck is. If your CPU with 4 cores is the bottleneck, forking 4 processes might cause things to complete in about 1/4th the under the best case scenario, but spawning 15 processes is not going to improve things much more.
If, more likely, your bottleneck is in I/O, starting 15 processes that compete for the same I/O is not going to help much, although in cases where you have tons of memory to use as file cache, some improvement might be possible.
To explore the limits on your system, consider the following program:
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;
run(#ARGV);
sub run {
my $count = #_ ? $_[0] : 2;
my $pm = Parallel::ForkManager->new($count);
for (1 .. 20) {
$pm->start and next;
sleep 1;
$pm->finish;
}
$pm->wait_all_children;
}
My ancient laptop has a single CPU with 2 cores. Let's see what I get:
TimeThis : Command Line : perl sleeper.pl 1
TimeThis : Elapsed Time : 00:00:20.735
TimeThis : Command Line : perl sleeper.pl 2
TimeThis : Elapsed Time : 00:00:06.578
TimeThis : Command Line : perl sleeper.pl 4
TimeThis : Elapsed Time : 00:00:04.578
TimeThis : Command Line : perl sleeper.pl 8
TimeThis : Elapsed Time : 00:00:03.546
TimeThis : Command Line : perl sleeper.pl 16
TimeThis : Elapsed Time : 00:00:02.562
TimeThis : Command Line : perl sleeper.pl 20
TimeThis : Elapsed Time : 00:00:02.563
So, running with max 20 processes gives me a total run time over 2.5 seconds for sleeping one second 20 times.
On the other hand, with just one process, sleeping one second 20 times took just over 20 seconds. That is a huge improvement, but it also indicates a management overhead of more than 150% when you have 20 processes each sleeping for one second.
This is in the nature of parallel programming. There are a lot of formal treatments out there on what you can expect, but Amdahl's Law is required reading.

Use Tic Toc to Control Loop Speed

I am writing a code that is outputting to a DAQ which controls a device. I want to have it send a signal out precisely every 1 second. Depending on the performance of my proccessor the code sometimes takes longer or shorter than 1 second. Is there any way to improve this bit of code?
Elapsed time is 1.000877 seconds.
Elapsed time is 0.992847 seconds.
Elapsed time is 0.996886 seconds.
for i= 1:100
tic
pause(.99)
toc
end

Using pause is known to be fairly imprecise (on the order of 10 ms). Matlab in recent versions has optimized tic toc to be low-overhead and as precise as possible (see here).
You can make use of tic toc to be more precise than pause using the following code:
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
for i= 1:ntimes
outer = tic;
while toc(outer) < time_dur
end
times(i) = toc(outer);
end
mean(times)
std(times)
Here is my outcome for 50 measurements: mean = 0.9900 with a std = 1.0503e-5, which is much more precise than using pause.
Using the original code with just pause, for 50 measurements I get: mean = 0.9981 with a std = 0.0037.

This is a inproved version of shimizu's answer. The main issue is a minimal clock drift. Each iteration the time stamp is taken and then then the timer is reset. The clock drifts by the execution time of these two commands.
A secondary minor improvement combines pause and the tic-toc technique to lower the cpu load.
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
t = tic;
for ix= 1:ntimes
pause((time_dur*ix-toc(t)-0.1))
while toc(t) < time_dur*ix
end
times(ix) = toc(t);
end
mean(diff(times))
std(diff(times))

If you want your DAQ to update exactly every second, use a DAQ with a FIFO buffer and a clock and configured to read a value from the FIFO exactly once per second.
Even if you got your MATLAB task iterations running exactly one second apart, the inconsistent delay in communication with the DAQ would mess up your timing.

does matlab cache solutions for eigs

I seem to be getting different performance results when using eigs. On the same matrix, calling
[c, v] = eigs(A, 2, 'sm');
somtimes takes 30 seconds and sometimes 2 seconds.
I need to know whether there's a speedup using some caching on subsequent calls for eigs on the same matrix since I need to report the times...

If so, this doesn't appear to be a generic feature. I ran this test from the command line
A = randn(10000);
B = randn(10000);
C = B;
tic; [c1,v1] = eigs(A,2,'sm'); toc;
tic; [c2,v2] = eigs(A,2,'sm'); toc;
tic; [c3,v3] = eigs(B,2,'sm'); toc;
tic; [c4,v4] = eigs(C,2,'sm'); toc
and got this result
Elapsed time is 32.373128 seconds.
Elapsed time is 28.412905 seconds.
Elapsed time is 32.752616 seconds.
Elapsed time is 29.024055 seconds.
I'm surprised, because usually MATLAB tries to outsmart you and will store results for reuse.

Under some circumstances, a large enough matrix might push things into virtual memory, or not, depending upon whether there is a large enough block of contiguous RAM available. Or, you may be doing something on the side.
You can verify what is happening by watching a process monitor as you do the test. Are there suddenly large amounts of disk accesses? If so, then virtual memory is being touched. Is there a different, unrelated process active that is hogging the CPU?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio