Let's assume I have the following requirements for writing monitor files from a simulation:
A large number of individual files have to be written, typically on the order of 10,000
The files must be human-readable, i.e. formatted I/O
Periodically, a new line is added to each file, typically every 50 seconds.
The new data has to be accessible almost instantly, so large manual write buffers are not an option
We are on a Lustre file system that appears to be optimized for just about the opposite: sequential writes to a small number of large files.
I did not formulate the requirements, so unfortunately there is no point in discussing them. I would just like to find the best possible solution under the above prerequisites.
I came up with a little working example to test a few implementations. Here is the best I could do so far:
!===============================================================!
! program to test some I/O implementations for many small files !
!===============================================================!
PROGRAM iotest
  use types
  use omp_lib
  implicit none
  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B), PARAMETER :: cachesize = 10
  INTEGER(I8B) :: counti, countf, count_rate, counti_global, countf_global
  REAL(DP) :: telapsed, telapsed_global
  REAL(DP), DIMENSION(:,:), ALLOCATABLE :: density, pressure, vel_x, vel_y, vel_z
  INTEGER(I4B) :: n, t, unitnumber, c, i, thread
  CHARACTER(LEN=100) :: dummy_char, number
  REAL(DP), DIMENSION(:,:,:), ALLOCATABLE :: writecache_real
  call system_clock(counti_global,count_rate)
  ! allocate cache
  allocate(writecache_real(5,cachesize,monitors))
  writecache_real = 0.0_dp
  ! fill values
  allocate(density(steps,monitors), pressure(steps,monitors), vel_x(steps,monitors), vel_y(steps,monitors), vel_z(steps,monitors))
  do n=1, monitors
    do t=1, steps
      call random_number(density(t,n))
      call random_number(pressure(t,n))
      call random_number(vel_x(t,n))
      call random_number(vel_y(t,n))
      call random_number(vel_z(t,n))
    end do
  end do
  ! create files
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(unit=20, file=trim(adjustl(dummy_char)), status='replace', action='write')
    close(20)
  end do
  call system_clock(counti)
  ! write data
  c = 0
  do t=1, steps
    c = c + 1
    do n=1, monitors
      writecache_real(1,c,n) = density(t,n)
      writecache_real(2,c,n) = pressure(t,n)
      writecache_real(3,c,n) = vel_x(t,n)
      writecache_real(4,c,n) = vel_y(t,n)
      writecache_real(5,c,n) = vel_z(t,n)
    end do
    ! flush the cache when it is full or at the last step
    if(c .EQ. cachesize .OR. t .EQ. steps) then
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(n,number,dummy_char,unitnumber,thread)
      thread = OMP_get_thread_num()
      unitnumber = thread + 20
      !$OMP DO
      do n=1, monitors
        write(number,'(I0.8)') n
        dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
        ! note: the buffered= specifier is an Intel Fortran extension, not standard Fortran
        open(unit=unitnumber, file=trim(adjustl(dummy_char)), status='old', action='write', position='append', buffered='yes')
        write(unitnumber,'(5ES25.15)') writecache_real(:,1:c,n)
        close(unitnumber)
      end do
      !$OMP END DO
      !$OMP END PARALLEL
      c = 0
    end if
  end do
  call system_clock(countf)
  call system_clock(countf_global)
  telapsed = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
  telapsed_global = real(countf_global-counti_global,kind=dp)/real(count_rate,kind=dp)
  write(*,*)
  write(*,'(A,F15.6,A)') ' elapsed wall time for I/O: ', telapsed, ' seconds'
  write(*,'(A,F15.6,A)') ' global elapsed wall time:  ', telapsed_global, ' seconds'
  write(*,*)
END PROGRAM iotest
The main features are OpenMP parallelization and a manual write buffer.
Here are some of the timings on the Lustre file system with 16 threads:
cachesize=5: elapsed wall time for I/O: 991.627404 seconds
cachesize=10: elapsed wall time for I/O: 415.456265 seconds
cachesize=20: elapsed wall time for I/O: 93.842964 seconds
cachesize=50: elapsed wall time for I/O: 79.859099 seconds
cachesize=100: elapsed wall time for I/O: 23.937832 seconds
cachesize=1000: elapsed wall time for I/O: 10.472421 seconds
For reference, the results on a local workstation HDD with deactivated HDD write cache, 16 threads:
cachesize=1: elapsed wall time for I/O: 5.543722 seconds
cachesize=2: elapsed wall time for I/O: 2.791811 seconds
cachesize=3: elapsed wall time for I/O: 1.752962 seconds
cachesize=4: elapsed wall time for I/O: 1.630385 seconds
cachesize=5: elapsed wall time for I/O: 1.174099 seconds
cachesize=10: elapsed wall time for I/O: 0.700624 seconds
cachesize=20: elapsed wall time for I/O: 0.433936 seconds
cachesize=50: elapsed wall time for I/O: 0.425782 seconds
cachesize=100: elapsed wall time for I/O: 0.227552 seconds
As you can see, the implementation is still embarrassingly slow on the Lustre file system compared to an ordinary HDD, and I would need huge buffer sizes to reduce the I/O overhead to a tolerable extent. This would mean that the output lags behind, which violates the requirements formulated earlier.
Another promising approach was leaving the units open between consecutive writes. Unfortunately, the number of simultaneously open units is typically limited to 1024-4096 without root privileges, so this is not an option because the number of files can exceed this limit.
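For what it's worth, the per-process descriptor limit can be inspected, and the soft limit raised as far as the administrator-set hard limit, without root privileges. A minimal sketch in Python (the same getrlimit/setrlimit calls exist in C and could be invoked from the Fortran code); if the hard limit is still below the number of monitor files, keeping all units open remains ruled out:

import resource

# query the per-process open-file limits (Unix only)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# without root privileges, the soft limit can only be raised
# up to the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))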
How could the I/O overhead be further reduced while still fulfilling the requirements?
Edit 1
From the discussion with Gilles I learned that the Lustre file system can be tweaked even with normal user privileges. So I tried setting the stripe count to 1 as suggested (this was already the default setting) and decreased the stripe size to the minimum supported value of 64k (the default value was 1M). However, this did not improve I/O performance with my test case. If anyone has additional hints on more suitable file system settings, please let me know.
For everyone suffering from small-file performance, the new Lustre release 2.11 allows storing small files directly on the MDT, which improves access times for them.
http://cdn.opensfs.org/wp-content/uploads/2018/04/Leers-Lustre-Data_on_MDT_An_Early_Look_DDN.pdf
lfs setstripe -E 1M -L mdt -E -1 fubar will store the first megabyte of all files in directory fubar on the MDT
The cputime() function returns the CPU time used by an Octave session. Here is one way to measure the processing time of code:
t = cputime;
...%your code
...%computing
...%when you are done
printf('Total cpu time: %f seconds\n', cputime-t);
According to the Octave documentation, we can use the etime() function, which returns the difference (in seconds) between two timestamps from clock(), like this:
t0 = clock ();
# many computations later...
elapsed_time = etime (clock (), t0);
What's the difference between the above two ways? Does etime use my system's clock? And what's a good way to measure the processing time of code?
Your first code snippet with cputime measures the time your code was executed on a CPU. The second snippet measures the wall clock time (by the way: use tic and toc to measure this). The wall clock includes time where the CPU was executing some other threads or waiting for I/O.
It is up to you what you want to measure. I normally use wall clock with tic/toc if I want to benchmark my code.
On single-threaded apps, the wall clock time is greater than or equal to the cputime (because it also includes time the CPU spent processing other threads). On multithreaded apps running on multiple cores, cputime can be greater than the wall clock time.
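To make the multithreaded case concrete, here is a sketch in Python rather than Octave (assuming CPython, where hashlib releases the GIL for large buffers, so several threads can burn CPU in parallel):

import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

data = b"x" * 10_000_000  # large buffer: hashlib releases the GIL while hashing it

def work():
    for _ in range(100):
        hashlib.sha256(data).digest()

wall0, cpu0 = time.perf_counter(), time.process_time()
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(work)
# the pool is joined when the with-block exits
print("wall time:", time.perf_counter() - wall0)
print("cpu time: ", time.process_time() - cpu0)  # can exceed wall time on multiple cores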
I am trying to measure the running time of the code below.
function out = load_database()
    persistent loaded;
    persistent w;
    if (isempty(loaded))
        A = zeros(10304,400);
        for i = 1:40
            cd(strcat('s',num2str(i)));
            for j = 1:8
                a = imread(strcat(num2str(j),'.pgm'));
                A(:,(i-1)*8+j) = reshape(a,size(a,1)*size(a,2),1);
            end
            cd ..
        end
        w = A;
    end
    loaded = 1;
    out = w;
end
I tried using tic and toc, but it does not work properly: the measured time is about 2-3 seconds, when in reality it takes almost 20 seconds until everything is done.
This code is for the ORL database: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
It loads only 8 out of 10 images: 8 for training, the other 2 for testing. Not the best way, but it is OK for me.
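Note that the persistent variables mean the imread loop only runs on the first call; any later call just returns the cached array, so a tic/toc around a warm call measures almost nothing. A minimal Python analog of this caching pattern (hypothetical names, with a sleep standing in for the image loading):

import time

_cache = None  # module-level state, analogous to MATLAB's persistent variables

def load_database():
    global _cache
    if _cache is None:
        time.sleep(2)  # stands in for the expensive imread/reshape loop
        _cache = list(range(10304))
    return _cache

for call in ("cold", "warm"):
    t0 = time.perf_counter()
    load_database()
    print(call, "call took", time.perf_counter() - t0, "seconds")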
I am writing code that outputs to a DAQ which controls a device. I want it to send a signal out precisely every 1 second. Depending on the performance of my processor, the code sometimes takes longer or shorter than 1 second. Is there any way to improve this bit of code?
Elapsed time is 1.000877 seconds.
Elapsed time is 0.992847 seconds.
Elapsed time is 0.996886 seconds.
for i = 1:100
    tic
    pause(.99)
    toc
end
Using pause is known to be fairly imprecise (on the order of 10 ms). Recent versions of MATLAB have optimized tic/toc to be low-overhead and as precise as possible (see here).
You can use tic/toc to achieve better precision than pause with the following code:
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
for i = 1:ntimes
    outer = tic;
    while toc(outer) < time_dur
    end
    times(i) = toc(outer);
end
mean(times)
std(times)
Here is my outcome for 50 measurements: mean = 0.9900 with a std = 1.0503e-5, which is much more precise than using pause.
Using the original code with just pause, for 50 measurements I get: mean = 0.9981 with a std = 0.0037.
This is an improved version of shimizu's answer. The main issue there is a small clock drift: each iteration, the time stamp is taken and then the timer is reset, so the clock drifts by the execution time of these two commands. A secondary minor improvement combines pause with the tic/toc technique to lower the CPU load.
ntimes = 100;
times = zeros(ntimes,1);
time_dur = 0.99;
t = tic;
for ix = 1:ntimes
    pause(time_dur*ix - toc(t) - 0.1)
    while toc(t) < time_dur*ix
    end
    times(ix) = toc(t);
end
mean(diff(times))
std(diff(times))
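The same drift-free idea translates directly to other languages: schedule each iteration against an absolute deadline derived from one fixed start time, instead of restarting a timer every pass. A sketch in Python under that assumption:

import time

ntimes = 100
period = 0.99
t0 = time.perf_counter()
stamps = []

for i in range(1, ntimes + 1):
    deadline = t0 + i * period  # absolute deadline: no cumulative drift
    slack = deadline - time.perf_counter() - 0.1
    if slack > 0:
        time.sleep(slack)  # coarse wait to spare the CPU
    while time.perf_counter() < deadline:
        pass  # fine busy-wait over the last stretch for precision
    stamps.append(time.perf_counter())

intervals = [b - a for a, b in zip(stamps, stamps[1:])]
print("mean interval:", sum(intervals) / len(intervals))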
If you want your DAQ to update exactly every second, use a DAQ with a FIFO buffer and a clock, configured to read a value from the FIFO exactly once per second.
Even if you got your MATLAB task iterations running exactly one second apart, the inconsistent delay in communication with the DAQ would mess up your timing.
Curious question on timing: when measuring wall clock time in any language, such as with Python's time.time(), does that time include the CPU/system time (time.clock()) as well?
In Python, time.time() gives you the elapsed time (also known as wall time). That includes CPU time inasmuch as CPU time is a subset of wall time, but you cannot extract the CPU time from time.time() itself.
For example, if your process runs for ten seconds but uses the CPU for only five of those seconds, the former includes the latter.
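If you want the CPU time, it comes from a separate call: time.process_time() in current Python (time.clock() was deprecated in 3.3 and removed in 3.8). A small demonstration of the two measures diverging:

import time

wall0 = time.time()           # wall clock
cpu0 = time.process_time()    # CPU time of this process only

time.sleep(2)                     # waiting: wall time advances, CPU time barely moves
sum(i * i for i in range(10**6))  # computing: both advance

print("wall time:", time.time() - wall0)          # roughly 2 s plus the computation
print("cpu time: ", time.process_time() - cpu0)   # essentially just the computation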