How do I parallelise a basic Fortran do loop using MPI? - parallel-processing

I am new to MPI. I have a Fortran 77 program which reads in a large data file (~1.7 GB) and then does some analysis on the data. Then it reads in the next data file and does the analysis again. This process repeats itself 'nstep' times (where for me nstep ~= 1000).
I attach some relevant sections of the code. The analysis itself is not time consuming; the reading in of the large data files is.
Note MN is a Massive Number: the files I'm reading in typically have between 4 and 4.7 million lines (particles), i.e. the integer 'i' changes at each step.
Currently, reading in 1000 data files takes several hours on 1 core. I would like to parallelize the program below (the do loop) so that each core can read in a smaller chunk of the data.
c *** DECLARATIONS ***
      integer MN
      parameter (MN=4700000)
      ....etc
c ********************
c *** SELECT INPUT DATA STEP ***
      open(10,file='../ini/ghofile.dat',status='old')
      do is=0,nstep-1
        read(10,*) step
        ...
c *** OPEN INPUT DATA FILE (THE DO LOOP BELOW IS TIME CONSUMING) ***
        open(20,file=filename,status='old')
        do i=0,MN
          read(20,121,end=21) x(i),y(i),z(i),vx(i),vy(i),
     &      vz(i),m(i), ...etc
  121     format(17(1pE13.5),2(I10),2(1pE13.5),I10)
        enddo
   21   ns=i
        write(6,*) 'No of particles =',ns
c       close the step file before the next one is opened on unit 20
        close(20)
      enddo
      stop
      end
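Since the analysis is cheap and the formatted reads dominate, the simplest MPI decomposition is usually to hand each rank a subset of the ~1000 step files rather than to split a single sequential formatted file, whose records have to be read in order anyway. Below is a minimal sketch (not the asker's actual code; the file-handling and analysis bodies are left as comments) of a round-robin distribution of steps over ranks:

c Minimal MPI sketch: each rank reads and analyses every nprocs-th
c step file, and per-rank results are combined at the end.
      program readpar
      implicit none
      include 'mpif.h'
      integer ierr, rank, nprocs, is, nstep
      parameter (nstep=1000)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     rank r handles steps r, r+nprocs, r+2*nprocs, ...
      do is = rank, nstep-1, nprocs
c        build the filename for step 'is', then open, read and
c        analyse it exactly as in the serial loop above
      enddo
c     combine per-rank results here, e.g. with MPI_REDUCE
      call MPI_FINALIZE(ierr)
      stop
      end

Compile with the MPI wrapper (e.g. mpif77 readpar.f) and run with mpirun -np 8 ./readpar. Note that if all ranks read from the same disk, the job can become I/O-bandwidth-bound, so the achievable speedup depends on the file system as much as on the number of cores.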

Related

Do CPUs with AVX2 or newer instruction sets support any form of caching on register renaming?

For example, here is a very simple piece of pseudocode in which many of the values taken are duplicates:
Data:
1 5 1 5 1 2 2 3 8 3 4 5 6 7 7 7
For all data elements:
get particle id from data array
idx = id/7
index = (idx << 8) | id
aabb = lookup[index]
test collision of aabb with a ray
so it will very probably recompute the same value (e.g. 1) for the same division followed by the same bitwise operation, with no loop-carried dependency.
Can new CPUs (with AVX2 or AVX512) remember the pattern (same data + same code path) and directly rename an old input register, returning the output quickly (like branch prediction, but predicting the renamed register for a temporary value instead)?
I'm currently developing a collision detection algorithm on an old CPU (Bulldozer v1), and no online C++ compiler gives predictable enough performance, because the CPU is shared by all visitors.
Removing duplicates with an unordered map takes about 15-30 nanoseconds per insert, or with a vectorized plain-array scan about 3-5 nanoseconds per insert. This is too slow to effectively filter out the unnecessary duplicates. Even a direct-mapped cache (which involves just a modulo operation and some assignments) fails (due to cache misses), performing even worse than the unordered map.
I'm not expecting a CPU with only hundred(s) of physical registers to actually cache many things, but it could help a lot in computing duplicate values quickly, just by remembering the "same value + same code path" combo from the last iteration of a loop. At least some physics simulations with collision checking could get a decent boost.
Processing a sorted array is faster, but only for branching code? What about branchless code, on the newest CPUs?
Is there any way of harnessing register renaming (zero latency?) as a simple cache for duplicated work?

Improve Fortran formatted I/O with a large number of small files

Let's assume I have the following requirements for writing monitor files from a simulation:
A large number of individual files has to be written, typically in the order of 10000
The files must be human-readable, i.e. formatted I/O
Periodically, a new line is added to each file. Typically every 50 seconds.
The new data has to be accessible almost instantly, so large manual write buffers are not an option
We are on a Lustre file system that appears to be optimized for just about the opposite: sequential writes to a small number of large files.
It was not me who formulated the requirements so unfortunately there is no point in discussing them. I would just like to find the best possible solution with above prerequisites.
I came up with a little working example to test a few implementations. Here is the best I could do so far:
!===============================================================!
! program to test some I/O implementations for many small files !
!===============================================================!
PROGRAM iotest
  use types
  use omp_lib
  implicit none

  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B), PARAMETER :: cachesize = 10
  INTEGER(I8B) :: counti, countf, count_rate, counti_global, countf_global
  REAL(DP) :: telapsed, telapsed_global
  REAL(DP), DIMENSION(:,:), ALLOCATABLE :: density, pressure, vel_x, vel_y, vel_z
  INTEGER(I4B) :: n, t, unitnumber, c, i, thread
  CHARACTER(LEN=100) :: dummy_char, number
  REAL(DP), DIMENSION(:,:,:), ALLOCATABLE :: writecache_real

  call system_clock(counti_global,count_rate)

  ! allocate cache
  allocate(writecache_real(5,cachesize,monitors))
  writecache_real = 0.0_dp

  ! fill values
  allocate(density(steps,monitors), pressure(steps,monitors), &
           vel_x(steps,monitors), vel_y(steps,monitors), vel_z(steps,monitors))
  do n=1, monitors
    do t=1, steps
      call random_number(density(t,n))
      call random_number(pressure(t,n))
      call random_number(vel_x(t,n))
      call random_number(vel_y(t,n))
      call random_number(vel_z(t,n))
    end do
  end do

  ! create files
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(unit=20, file=trim(adjustl(dummy_char)), status='replace', action='write')
    close(20)
  end do

  call system_clock(counti)

  ! write data
  c = 0
  do t=1, steps
    c = c + 1
    ! buffer the current step for all monitors
    do n=1, monitors
      writecache_real(1,c,n) = density(t,n)
      writecache_real(2,c,n) = pressure(t,n)
      writecache_real(3,c,n) = vel_x(t,n)
      writecache_real(4,c,n) = vel_y(t,n)
      writecache_real(5,c,n) = vel_z(t,n)
    end do
    ! flush the buffer every 'cachesize' steps (and at the last step)
    if(c .EQ. cachesize .OR. t .EQ. steps) then
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(n,number,dummy_char,unitnumber,thread)
      thread = OMP_get_thread_num()
      unitnumber = thread + 20
      !$OMP DO
      do n=1, monitors
        write(number,'(I0.8)') n
        dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
        ! note: buffered='yes' is an Intel Fortran extension
        open(unit=unitnumber, file=trim(adjustl(dummy_char)), status='old', &
             action='write', position='append', buffered='yes')
        write(unitnumber,'(5ES25.15)') writecache_real(:,1:c,n)
        close(unitnumber)
      end do
      !$OMP END DO
      !$OMP END PARALLEL
      c = 0
    end if
  end do

  call system_clock(countf)
  call system_clock(countf_global)
  telapsed=real(countf-counti,kind=dp)/real(count_rate,kind=dp)
  telapsed_global=real(countf_global-counti_global,kind=dp)/real(count_rate,kind=dp)
  write(*,*)
  write(*,'(A,F15.6,A)') ' elapsed wall time for I/O: ', telapsed, ' seconds'
  write(*,'(A,F15.6,A)') ' global elapsed wall time:  ', telapsed_global, ' seconds'
  write(*,*)
END PROGRAM iotest
The main features are: OpenMP parallelization and a manual write buffer.
Here are some of the timings on the Lustre file system with 16 threads:
cachesize=5: elapsed wall time for I/O: 991.627404 seconds
cachesize=10: elapsed wall time for I/O: 415.456265 seconds
cachesize=20: elapsed wall time for I/O: 93.842964 seconds
cachesize=50: elapsed wall time for I/O: 79.859099 seconds
cachesize=100: elapsed wall time for I/O: 23.937832 seconds
cachesize=1000: elapsed wall time for I/O: 10.472421 seconds
For reference, the results on a local workstation HDD with the HDD write cache deactivated, 16 threads:
cachesize=1: elapsed wall time for I/O: 5.543722 seconds
cachesize=2: elapsed wall time for I/O: 2.791811 seconds
cachesize=3: elapsed wall time for I/O: 1.752962 seconds
cachesize=4: elapsed wall time for I/O: 1.630385 seconds
cachesize=5: elapsed wall time for I/O: 1.174099 seconds
cachesize=10: elapsed wall time for I/O: 0.700624 seconds
cachesize=20: elapsed wall time for I/O: 0.433936 seconds
cachesize=50: elapsed wall time for I/O: 0.425782 seconds
cachesize=100: elapsed wall time for I/O: 0.227552 seconds
As you can see, the implementation is still embarrassingly slow on the Lustre file system compared to an ordinary HDD, and I would need huge buffer sizes to reduce the I/O overhead to a tolerable extent. That would mean the output lags behind, which is against the requirements formulated earlier.
Another promising approach was leaving the units open between consecutive writes. Unfortunately, the number of units open simultaneously is typically limited to 1024-4096 without root privileges. So this is not an option, because the number of files can exceed this limit.
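For illustration, here is a minimal self-contained sketch of that kept-open variant (using the Fortran 2008 newunit= specifier and hypothetical file names); it eliminates the per-flush open/close cost entirely, but only works while the monitor count stays below the descriptor limit:

! Sketch of the "keep units open" approach: every monitor file is
! opened once, each step appends one record, and all units are closed
! only at the end. Limited by the OS open-file limit (~1024-4096).
program keepopen
  implicit none
  integer, parameter :: monitors = 1000, steps = 10
  integer :: units(monitors), n, t
  real :: val
  character(len=32) :: name
  do n = 1, monitors
    write(name,'(A,I0.8,A)') 'monitor_', n, '.dat'
    open(newunit=units(n), file=name, status='replace', action='write')
  end do
  do t = 1, steps
    do n = 1, monitors
      call random_number(val)
      write(units(n),'(ES25.15)') val  ! one appended line per step
    end do
  end do
  do n = 1, monitors
    close(units(n))
  end do
end program keepopen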
How could the I/O overhead be further reduced while still fulfilling the requirements?
Edit 1
From the discussion with Gilles I learned that the Lustre file system can be tweaked even with normal user privileges. So I tried setting the stripe count to 1 as suggested (this was already the default setting) and decreased the stripe size to the minimum supported value of 64k (the default was 1M). However, this did not improve I/O performance with my test case. If anyone has additional hints on more suitable file system settings, please let me know.
For everyone suffering from small-file performance: the new Lustre release 2.11 allows storing small files directly on the MDT, which improves access times for those.
http://cdn.opensfs.org/wp-content/uploads/2018/04/Leers-Lustre-Data_on_MDT_An_Early_Look_DDN.pdf
lfs setstripe -E 1M -L mdt -E -1 fubar will store the first megabyte of all files in directory fubar on the MDT.

Why is this loop so fast?

The following loop in Fortran takes almost no time:
j=0
do i=1,1000000000000000000
j=j+1
end do
print*,j
But I just don't understand: our CPUs run at a few GHz, i.e. roughly 10^9 cycles per second, and the loop count above is far more than 10^9, so why does it take almost no time?
It seems the value is not computed at compile time. We can add an outer loop:
do m=1,1000000000
do i=1,1000000000000000000
j=j+1
end do
end do
print*,j
Now it takes about a second on my computer.
Edit
I am using Windows with Intel Parallel Studio 15 and no extra compilation options: simply ifort test.f90. The timing method is crude: I just watch how long it takes after I press Enter on the command line to execute the .exe.
I don't know Fortran, but if this were C, the compiler could optimize the above code by removing the loop altogether, since the value of j can be computed at compile time.
So the above code would be reduced to
print 1000000000000000000
Your logic about cycles and instructions is flawed. Modern CPUs parallelize code at the hardware level, even if the code is serial:
a CPU has several ALUs that can execute arithmetic instructions in parallel
instructions are executed in a pipeline, so at any one point different stages of consecutive instructions are executed in parallel
So "max of one instruction per cycle" doesn't hold.
Also, increment-by-one is one of the fastest instructions on a CPU.
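To take the crude command-line timing out of the picture, the loop can be timed in-program with the standard system_clock intrinsic. A minimal sketch (note that with optimization enabled the compiler may still fold the whole loop into a constant, so it is worth comparing -O0 against -O2 timings):

! Times the increment loop with system_clock; printing j keeps the
! result observable, but an optimizing compiler may still precompute it.
program timeloop
  implicit none
  integer(8) :: i, j, c0, c1, rate
  call system_clock(c0, rate)
  j = 0
  do i = 1, 1000000000_8
    j = j + 1
  end do
  call system_clock(c1)
  print *, 'j =', j
  print *, 'elapsed:', real(c1-c0)/real(rate), 'seconds'
end program timeloop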

Parallel text processing in Julia

I'm trying to write a simple function that reads a series of files, performs some regex search (or just a word count) on each of them, and then returns the number of matches. I'm trying to make this run in parallel to speed it up, but so far I have been unable to.
If I do a simple loop with a math operation, I do get significant performance increases from parallelization. However, a similar idea for the grep function doesn't provide any speed increase:
function open_count(file)
    fh = open(file)
    text = readall(fh)
    length(split(text))
end

tic()
total = 0
for name in files
    total += open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.474181026 seconds

tic()
total = 0
total = @parallel (+) for name in files
    open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.086511895 seconds
I tried different versions but also got no significant speed increases. Am I doing something wrong?
I've had similar problems with R and Python. As others pointed out in the comments, you should start with the profiler.
If the reads take up the majority of the time, then there's not much you can do. You can try moving the files to different hard drives and reading them in from there.
You can also try a RAMDisk kind of solution, which basically makes your RAM look like permanent storage (reducing available RAM), but then you get very fast reads and writes.
However, if the time is spent doing the regex, then consider the following:
Create a function that reads in one file as a whole and splits it into separate lines. That should be a single sequential read and hence as fast as possible. Then create a parallel version of your regex which processes each line in parallel. This way the whole file is in memory and your computing cores can munge the data at a faster rate. That way you might see some increase in performance.
This is a technique I used when trying to process large text files.

file paging when inserting 1 byte early in a file

What happens when I open a 100 MB file, insert 1 byte somewhere near the beginning, and then save it? Does the Linux kernel literally shift everything back 1 byte (thus altering every page) and then re-save every byte after the insertion? That seems highly inefficient!
Or I suppose the kernel could insert a 1-byte page just to hold this insertion, but I've never heard of that happening. I thought all pages had to be a standard size (e.g., 4 KB or 4 MB, but not 1 byte).
I have checked numerous Linux/OS books (Bovet/Cesati, Kerrisk, Tanenbaum), and have played around with the kernel code a bit, but can't seem to figure this out.
The answer is that OSes don't typically allow you to insert an arbitrary number of bytes at an arbitrary position within a file. Your analysis shows why: it just isn't an efficient operation with the typical implementation of a file.
Normally you can only add or remove bytes at the end of a file.
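For anyone wondering what the workaround looks like in practice, here is a minimal Fortran sketch (hypothetical file names, stream access) of what "inserting" a byte actually requires: reading the file and rewriting everything from the insertion point on, which is why the cost is proportional to the file size:

! "Inserts" one byte at offset pos by rewriting the whole tail of the
! file into a new file; there is no cheaper in-place insertion.
program insert_byte
  use iso_fortran_env, only: int8
  implicit none
  integer(int8), allocatable :: buf(:)
  integer :: n, u
  integer, parameter :: pos = 10  ! insertion offset in bytes
  open(newunit=u, file='input.bin', access='stream', &
       form='unformatted', status='old')
  inquire(unit=u, size=n)
  allocate(buf(n))
  read(u) buf
  close(u)
  open(newunit=u, file='output.bin', access='stream', &
       form='unformatted', status='replace')
  write(u) buf(1:pos), 65_int8, buf(pos+1:n)  ! tail shifts by one byte
  close(u)
end program insert_byte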
