I am writing code that outputs to a DAQ which controls a device. I want it to send a signal out precisely every 1 second. Depending on the performance of my processor, each iteration sometimes takes longer or shorter than 1 second. Is there any way to improve this bit of code?
Elapsed time is 1.000877 seconds.
Elapsed time is 0.992847 seconds.
Elapsed time is 0.996886 seconds.
for i= 1:100
pause is known to be fairly imprecise (on the order of 10 ms). Recent versions of MATLAB have optimized tic/toc to be low-overhead and as precise as possible.
You can use tic/toc to get more precise timing than pause with the following code:
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
for i = 1:ntimes
    outer = tic;
    while toc(outer) < time_dur
        % busy-wait until time_dur has elapsed
    end
    times(i) = toc(outer);
end
Here is my outcome for 50 measurements: mean = 0.9900 with a std = 1.0503e-5, which is much more precise than using pause.
Using the original code with just pause, for 50 measurements I get: mean = 0.9981 with a std = 0.0037.
This is an improved version of shimizu's answer. The main issue it fixes is a small clock drift: in each iteration the time stamp is taken and then the timer is reset, so the clock drifts by the execution time of these two commands.
A secondary, minor improvement combines pause with the tic-toc technique to lower the CPU load.
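ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
t = tic; % started once and never reset, so errors do not accumulate
for ix = 1:ntimes
    while toc(t) < time_dur*ix
        % busy-wait until the ix-th absolute deadline
    end
    times(ix) = toc(t);
end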
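For comparison, here is a minimal sketch of the same idea in Python (not MATLAB): absolute deadlines measured from a single start time, a coarse sleep to keep the CPU load low, and a short spin for the final stretch. The function name run_periodic and the spin_margin value are assumptions made for illustration, not part of the answer above.

```python
import time

def run_periodic(period, n_ticks, spin_margin=0.002):
    """Return the times (seconds since start) at which each tick fired."""
    start = time.perf_counter()
    fired = []
    for k in range(1, n_ticks + 1):
        # Absolute deadline from a single start time, so errors do not accumulate.
        deadline = start + k * period
        # Coarse wait: sleep until shortly before the deadline (spin_margin is an assumed value).
        remaining = deadline - time.perf_counter() - spin_margin
        if remaining > 0:
            time.sleep(remaining)
        # Fine wait: spin for the last stretch.
        while time.perf_counter() < deadline:
            pass
        fired.append(time.perf_counter() - start)
    return fired

ticks = run_periodic(period=0.05, n_ticks=5)
for k, t in enumerate(ticks, start=1):
    print(f"tick {k}: {t:.6f} s (target {k * 0.05:.2f} s)")
```

A larger spin_margin tolerates a sloppier sleep at the cost of more CPU time spent spinning.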
If you want your DAQ to update exactly every second, use a DAQ that has a FIFO buffer and its own clock, configured to read a value from the FIFO exactly once per second.
Even if you got your MATLAB task iterations running exactly one second apart, the inconsistent delay in communication with the DAQ would mess up your timing.
Using the conventional way of measuring run time, I've done something like the following:
import time

start = time.time()
# example code
for _ in range(100000):
    for __ in range(100000):
        pass
end = time.time()
run_time = end - start
I noticed that in some cases run_time would read 9 seconds, but I was 100% sure it had not been 9 seconds. So I measured with my phone's stopwatch. Conservatively speaking, I got 2 seconds on the stopwatch while the program reported 4 seconds (and I repeated this multiple times to check it wasn't a false reading). So now I am wondering whether the computer's clock works differently from a normal clock.
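One thing that may be worth ruling out, offered as a sketch rather than a diagnosis: time.time() follows the system clock, which the operating system is allowed to adjust, whereas time.perf_counter() is intended for measuring intervals. The helper name timed below is hypothetical.

```python
import time

# Report which clocks the OS may adjust and what resolution they claim.
for name in ("time", "monotonic", "perf_counter"):
    info = time.get_clock_info(name)
    print(f"{name}: adjustable={info.adjustable}, resolution={info.resolution}")

def timed(func):
    """Run func() once and return the elapsed time in seconds (hypothetical helper)."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

elapsed = timed(lambda: sum(range(1_000_000)))
print(f"elapsed: {elapsed:.4f} s")
```

If both clocks agree with each other but not with the stopwatch, the discrepancy lies somewhere other than the choice of clock.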
I have a small problem and am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here
while(true) {
    async_enqueue_upload(input); // completes instantly, but transfer will take 10us
    async_enqueue_calculate(); // completes instantly, but computation will take 80us
    async_enqueue_download(output); // completes instantly, but transfer will take 10us
    sync_wait_finish(); // must wait until output is fully calculated and contains no junk
    input = process(output); // I cannot launch the next step without doing this on the host
}
I am asking about the wait_finish() part. I must wait for all devices to finish so that I can combine the results, process the data, and upload a new portion that is based on the previous computation step. I need to sync data between steps, so I can't parallelize the steps. I know this is not a very performant case. So let's proceed to the question.
I have two ways of checking completion within wait_finish(). The first is to put the thread to sleep until it is woken by the completion event:
while( !is_completed() )
This performs very poorly: the actual calculation takes, say, 100 us, while the minimal Windows scheduler timestep is 1 ms, so it gives an unacceptable 10x lower performance.
The second way is to check completion in an empty infinite loop:
while( !is_completed() )
{} // do_nothing();
This gives 10x better computation performance, but it is also unsuitable, because it fully utilizes a CPU core doing absolutely useless work. How can I make the CPU "sleep" for exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved, when the calculation time is too long for an active spin-wait but too short compared to the scheduler timestep? A related subquestion: how can I do that on Linux?
Fortunately, I have succeeded in finding an answer on my own. In short: I should use Linux for this.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but it can be loaded in the following manner:
static int(__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) = (int(__stdcall*)(BOOL, PLARGE_INTEGER)) GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows sleep intervals to be set in units of 100 ns. However, even that did not work well, as shown by the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc(); // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
    sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
    ++n; // count loop iterations
    auto s1 = qpc();
    auto passed = s1 - s0;
    if (passed >= hpf) { // about one second has elapsed
        std::cout << "freq=" << (n * hpf / passed) << " hz\n";
        s0 = s1;
        n = 0;
    }
}
That yields a loop rate of just under 2000 Hz, and the result varies from line to line. That led me to the Windows thread-switching scheduler, which is totally unsuited for real-time tasks, and its minimum interval of 0.5 ms (plus overhead). By the way, does anyone know how to tune that value?
Next was the Linux question: what can it do? I built a custom tiny 4.14 kernel using buildroot and ran the benchmark code there. I replaced qpc() so that it returns clock_gettime() data with the CLOCK_MONOTONIC clock, made qpf() return the number of nanoseconds in a second, and made sleep_precise() call clock_nanosleep(). I failed to find out what the difference between CLOCK_MONOTONIC and CLOCK_REALTIME is.
I was quite surprised to get a whopping 18.4 kHz frequency right out of the box, and it was quite stable. Testing several intervals, I found that I can set the loop to almost any frequency up to 18.4 kHz, but also that the actual measured wait time is about 1.6 times what I asked for. For example, if I ask to sleep for 100 us, it actually sleeps for ~160 us, giving a ~6.25 kHz frequency. Nothing else is running on the system, just the kernel, busybox and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even higher?
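The requested-versus-actual comparison described above can be reproduced on any system with a short script. This is a sketch in Python using time.sleep() rather than clock_nanosleep() directly, so the absolute numbers will differ from the C benchmark; the function name measure_sleep and the repeat count are assumptions.

```python
import time

def measure_sleep(requested_s, repeats=200):
    """Return the mean actual duration of time.sleep(requested_s), in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        time.sleep(requested_s)
    return (time.perf_counter() - start) / repeats

for requested in (100e-6, 500e-6, 1e-3):
    actual = measure_sleep(requested)
    print(f"requested {requested * 1e6:7.1f} us, "
          f"actual {actual * 1e6:7.1f} us, "
          f"ratio {actual / requested:.2f}, "
          f"rate {1.0 / actual:.0f} Hz")
```

The ratio column corresponds to the factor of about 1.6 reported above, and the rate column to the loop frequency.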
cputime() returns the CPU time used by an Octave session. Here is one way to measure the processing time of some code:
t = cputime;
...%your code
...%when you are done
printf('Total cpu time: %f seconds\n', cputime-t);
According to the Octave documentation, we can also use the function etime(), which returns the difference (in seconds) between two timestamps from clock, like this:
t0 = clock ();
# many computations later...
elapsed_time = etime (clock (), t0);
What is the difference between these two approaches? Does etime use my system's clock? And what is a good way to measure the processing time of some code?
Your first code snippet with cputime measures the time your code was executed on a CPU. The second snippet measures the wall clock time (by the way, use tic and toc to measure this). The wall clock time includes time during which the CPU was executing other threads or waiting for I/O.
It is up to you what you want to measure. I normally use wall clock with tic/toc if I want to benchmark my code.
In single-threaded apps the wall clock time is greater than or equal to the cputime (because it also includes the time the CPU spent processing other threads). In multithreaded apps running on multiple cores, cputime can be greater than the wall clock time.
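The distinction can be illustrated with a minimal sketch in Python (not Octave), where time.process_time() plays the role of cputime and time.perf_counter() the role of tic/toc. The helper name measure is hypothetical.

```python
import time

def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    func()
    return time.perf_counter() - wall0, time.process_time() - cpu0

# Sleeping consumes wall clock time but almost no CPU time.
wall_sleep, cpu_sleep = measure(lambda: time.sleep(0.2))
# A busy loop consumes both.
wall_busy, cpu_busy = measure(lambda: sum(i * i for i in range(2_000_000)))

print(f"sleep: wall={wall_sleep:.3f} s, cpu={cpu_sleep:.3f} s")
print(f"busy:  wall={wall_busy:.3f} s, cpu={cpu_busy:.3f} s")
```

For the sleeping case the CPU time stays near zero, which is why CPU time alone can badly underestimate how long a user actually waits.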
I am working with MATLAB and am new to the Parallel Computing Toolbox. I have a Core i3 processor, MATLAB R2011a, 2 GB of RAM, and a 320 hard disk.
To calculate the speedup, I wrote the following code and found that the parallel code takes longer than the sequential code.
1st code is taking 0.039763 seconds
2nd code is taking 0.379056 seconds.
1st code:
MM = magic(5);
MN = magic(5);
ML = magic(5);
MP = magic(5);
MK = magic(5);
2nd code:
matlabpool open local 4
spmd % runs on every worker in the pool
    MM = magic(5); % MM is a variable on each lab
end
matlabpool close
I want to learn parallel computing toolbox.
As mentioned by Dan in the comments, the problem is clearly too small for parallelization to be beneficial. For example, increasing the size of the magic matrices you create from 5 to 5000 already shows a clear improvement. That is, with the larger size the overhead of parallelization becomes (almost) negligible compared to the computation time for one matrix.
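The overhead effect can be seen in a rough sketch in Python, with a thread pool standing in for MATLAB workers. This only illustrates dispatch and start-up overhead for tiny tasks and says nothing about MATLAB's own timings; tiny_task and the input size are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task(x):
    """A task far too small to be worth dispatching to a worker."""
    return x * x

inputs = list(range(2000))

# Serial version: call the function directly.
start = time.perf_counter()
serial = [tiny_task(x) for x in inputs]
serial_time = time.perf_counter() - start

# Pooled version: the timing includes pool start-up, dispatch, and shutdown.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    pooled = list(pool.map(tiny_task, inputs))
pooled_time = time.perf_counter() - start

print(f"serial: {serial_time * 1e3:.3f} ms")
print(f"pooled: {pooled_time * 1e3:.3f} ms")
```

Both versions produce the same results; only the time spent on coordination differs, and it matters less as each task grows.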
I seem to be getting different performance results when using eigs. On the same matrix, calling
[c, v] = eigs(A, 2, 'sm');
sometimes takes 30 seconds and sometimes 2 seconds.
I need to know whether there's a speedup using some caching on subsequent calls for eigs on the same matrix since I need to report the times...
If so, this doesn't appear to be a generic feature. I ran this test from the command line
A = randn(10000);
B = randn(10000);
C = B;
tic; [c1,v1] = eigs(A,2,'sm'); toc;
tic; [c2,v2] = eigs(A,2,'sm'); toc;
tic; [c3,v3] = eigs(B,2,'sm'); toc;
tic; [c4,v4] = eigs(C,2,'sm'); toc
and got this result
Elapsed time is 32.373128 seconds.
Elapsed time is 28.412905 seconds.
Elapsed time is 32.752616 seconds.
Elapsed time is 29.024055 seconds.
I'm surprised, because usually MATLAB tries to outsmart you and will store results for reuse.
Under some circumstances, a large enough matrix might push things into virtual memory, or not, depending upon whether there is a large enough block of contiguous RAM available. Or, you may be doing something on the side.
You can verify what is happening by watching a process monitor as you do the test. Are there suddenly large amounts of disk accesses? If so, then virtual memory is being touched. Is there a different, unrelated process active that is hogging the CPU?