Can improved performance be obtained from parallel reading in Fortran?

Can improved performance be obtained from parallel reading in Fortran? - parallel-processing

I have a fortran90 code that spends (by far) most of the time on I/O, because very large data files (at least 1GB and up from there), need to be read. Smaller, but still large, data files with the results of the calculations need to be written. Comparatively, some fast Fourier transforms and other calculations are done in no time. I have parallelized (OpenMP) some of these calculations but the overall gain in performance is minimal given the mentioned I/O issues.
My strategy at the moment is to read the whole file at once:
open(unit=10, file="data", status="old")
do i=1,verylargenumber
read(10,*) var1(i), var2(i), var3(i)
end do
close(10)
and then perform operations on var1, etc. My question is whether there is a suitable strategy using (preferably) OpenMP that would allow me to speed up the reading process, especially with the consideration in mind (if it makes any difference) that the data files are quite large.
I have the possibility to run these calculations on Lustre file systems, which in principle offer advantages for parallel I/O, although a general solution for regular file systems would be appreciated.
My intuition is that there is no work around this issue but I wanted to check for sure.

I'm not a Fortran guru, but it looks like you are reading the values from the file in very small chunks (3x integers at a time, at most a few dozen bytes). Reading the file in large chunks (multi-MB at a time) is going to provide a significant improvement in performance, since you will be reducing the number of underlying read() system calls (and corresponding locking overhead) by many orders of magnitude.
If your large files are written in Lustre with multiple stripes (e.g. in a directory with lfs setstripe -c 8 -S 4M <dir> to set a default stripe count of 8 with a stripe size of 4MB for all new files in that directory) then this may improve the aggregate read performance - assuming that you are reading only a single file at one time, and you are not limited by the client network bandwidth. If your program is running on multiple nodes and/or threads concurrently, and each of those threads is itself reading its own file, then you will already have parallelism above the file level. Even reading from a single file can do quite well (if the reads are large) because the Lustre client will do readahead in the background.
If you have multiple compute threads that are each working on different chunks of the file at one time (e.g. 4MB chunks) then you could read each of the 4MB chunks from a different thread, which may improve performance since you will have more IO requests in flight. However, there is still a limit on how fast a single client can read files over the network. Reading from a multi-striped file from multiple clients concurrently will allow you to aggregate the network and disk bandwidth from multiple clients and servers, which is where Lustre does best.

Related

Which is faster, Loading assets from a big package or from many files spread across the filesystem?

Many applications, mostly video-games (mainly ones with constant content streaming), takes the approach of having a single big package file containing many of the game assets, that's [arguably] used for security-and space efficiency reasons, but perhaps there's a performance reason?
Given an application that loads 3+ assets file on each 5 seconds using asynchronous I/O in a secondary streaming thread, would it be technically faster if the I/O was being performed on a single big file by seeking and reading to the various offsets of the assets when necessary or to read each file separately spread across the operating system file system?
There would probably be a different across HDDs, SDDs and other factors, what are those?
Assume the files aren't fragmented (i.e. reserved space for those was applied during installation before effectively writing the content), but specifying the affects on fragmentation on the result would be interesting too.
The files spread on the filesystem approach seems interesting for quick production and modding support, but if there's a performance penalty, care should be taken.
This question is purely theoretical and curiosity at this point.

Creating a file handle/descriptor requires security validation and file system metadata operations. So is arguably that the many small file operations will be slower than one operation one one big file. Whether is a measurable difference in the specific context of your code, that remains to be demonstrated.
BTW True asynchronous IO should no require 'secondary' threads. You likely describe synchronous IO performed on a different thread, a completely different beast.
Achieving high throughput IO is very platform specific. To illustrate, for Windows specific read Designing Applications for High Performance - Part 1, Part 2 and Part 3.

Opening a file requires string processing to verify the file name, finding the directory the block references are in and then seeking to the first block of the file to start reading.
The directory lookup can be cached but all the rest has a cost per file.
Also having one big file lets the OS know that it will be accessed in one go and read the next blocks of the file while you are processing the data; however if you only read some assets then this read-ahead will possibly read unnecessary blocks.
A better solution would be a hybrid approach: collect assets that are often loaded together in buckets and have a database that say per level which asset buckets need to be read.
You can also duplicate the data across several buckets; in extreme each level has a single bucket with all the assets that level needs. This takes up more space but will net you the greatest speedup as you only need to dump 1 file into memory.
This presentation talks about how you can create a good distribution of seeks (files in this context) vs. amount of assets within a storage budget.
When creating a game for PC you can let the use decide on install which side of the coin he wants, the quick loading of 1 file per level or the space saving of each asset stored only once.

improving concurrent file read with ssd and mmap

I have huge meteorological files. Too big for fitting in ram.
I need to perform a lot of concurrent random reads.
So, I think SSD + mmap could improve performance.
But what's about concurrent mmap reads ? How should they be organized ?

Is there a concurrency reason (contention for data structures and resources shared between threads) why you would want to open the same files independently in different threads? If no, then I can't see a reason for doing that. It will just make the kernel work a little harder by having to track a bunch of different memory mappings (one for each thread) that all ultimately map to the same object, consume more file descriptors (no big deal unless you have a very large number of files), and consume more address space when you mmap the same files multiple times.
If I understand that in your scenario the files are mostly open infrequently, read a lot, and then closed infrequently, I don't think you would have much contention between threads. So go with opening files globally for all threads.
Regardless of whether you have contention between threads for the housekeeping of the open files, there is one overriding reason in favor of mapping each file only once per process, and that is if your address spare is only 32 bits. If you are in 32 bit mode then address space is quite a limited resource if your files are large and you want to mmap significant portions of them. In that case you most certainly need to conserve address space by not wastefully mapping the same file twice in two different threads.

Multithreaded File Compare Performance

I just stumbled onto this SO question and was wondering if there would be any performance improvement if:
The file was compared in blocks no larger than the hard disk sector size (1/2KB, 2KB, or 4KB)
AND the comparison was done multithreaded (or maybe even with the .NET 4 parallel stuff)
I imagine there being 2 threads: one that reads from the beginning of the file and another that reads from the end until they meet in the middle.
I understand in this situation the disk IO is going to be the slowest part but if the reads never have to cross sector boundries (which in my twisted imagination somehow eliminates any possible fragmentation overhead) then it may potentially reduce head moves hence resulting in better performance (maybe?).
Of course other factors could play in as well, such as, single vs multiple processors/cores or SSD vs non-SSD, but with those aside; is the disk IO speed + potentially sharing processor time insurmountable? Or perhaps my concept of computer theory is completely off-base...

If you're comparing two files that are on the same drive, the only benefit you could receive from multi-threading is to have one thread reading--populating the next buffers--while another thread is comparing the previously-read buffers.
If the files you're comparing are on different physical drives, then you can have two asynchronous reads going concurrently--one on each drive.
But your idea of having one thread reading from the beginning and another reading from the end will make things slower because seek time is going to kill you. The disk drive heads will continually be seeking from one end of the file to the other. Think of it this way: do you think it would be faster to read a file sequentially from the start, or would it be faster to read 64K from the front, then read 64K from the end, then seek back to the start of the file to read the next 64K, etc?
Fragmentation is an issue, to be sure, but excessive fragmentation is the exception, not the rule. Most files are going to be unfragmented, or only partially fragmented. Reading alternately from either end of the file would be like reading a file that's pathologically fragmented.
Remember, a typical disk drive can only satisfy one I/O request at a time.
Making single-sector reads will probably slow things down. In my tests of .NET I/O speed, reading 32K at a time was significantly faster (between 10 and 20 percent) than reading 4K at a time. As I recall (it's been some time since I did this), on my machine at the time, the optimum buffer size for sequential reads was 256K. That will undoubtedly differ for each machine, based on processor speed, disk controller, hard drive, and operating system version.

What are the most efficient idioms for streaming data from disk with constant space usage?

Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much ram will be used.
In summary:
Needs to be constant space
Fast as possible
Assume very large files
Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file was small enough to fit in memory, then mmap on POSIX systems would be very fast, unfortunately that's not the case here. Is there any performance advantage to using mmap with a small buffer size to buffer successive chunks of the file? Would the system call overhead of moving the mmap buffer down the file dominate any advantages Or should I use a fixed buffer that I read into with fread?

I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).
Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.
You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!
In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.

Depending on the language you are using, a C-like fread() loop based on a file for which you declared a particular buffer size will require exactly this buffer size, no more no less.
We typically choose a buffer size of 4 to 128 kBytes, there is little gain if any with bigger buffers.
If performance was extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider using a two-thread implementation, whereby one thread reads the file in a set of two buffers, and the other thread perform calculations sequential fashion in one of the buffers at a time. In this fashion the disk access delays can be removed.

mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.

Why does multithreaded file transfer improve performance?

RichCopy, a better-than-robocopy-with-GUI tool from Microsoft, seems to be the current tool of choice for copying files. One of it's main features, hightlighted in the TechNet article presenting the tool, is that it copies multiple files in parallel. In its default setting, three files are copied simultaneously, which you can see nicely in the GUI: [Progress: xx% of file A, yy% of file B, ...]. There are a lot of blog entries around praising this tool and claiming that this speeds up the copying process.
My question is: Why does this technique improve performance? As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network. My assumption would be that copying multiple files at once makes the whole process slower, since the HDD needs to jump back and forth between different files rather than just sequentially streaming one file. Since RichCopy is faster, there must be some mistake in my assumptions...

The tool is making use improvements in hardware which can optimise multiple read and write requests much better.
When copying one file at a time the hardware isn't going to know that the block of data that currently is passing under the read head (or near by) will be needed of a subsquent read since the software hasn't queued that request yet.
A single file copy these days is not very taxing task for modern disk sub-systems. By giving these hardware systems more work to do at once the tool is leveraging its improved optimising features.

A naive "copy multiple files" application will copy one file, then wait for that to complete before copying the next one.
This will mean that an individual file CANNOT be copied faster than the network latency, even if it is empty (0 bytes). Because it probably does several file server calls, (open,write,close), this may be several x the latency.
To efficiently copy files, you want to have a server and client which use a sane protocol which has pipelining; that's to say - the client does NOT wait for the first file to be saved before sending the next, and indeed, several or many files may be "on the wire" at once.
Of course to do that would require a custom server not a SMB (or similar) file server. For example, rsync does this and is very good at copying large numbers of files despite being single threaded.
So my guess is that the multithreading helps because it is a work-around for the fact that the server doesn't support pipelining on a single session.
A single-threaded implementation which used a sensible protocol would be best in my opinion.

It's a network tool, so the bottleneck is the network, not the HDD. Up to a (low) point you can get more throughput out of a TCP link by using a few connections in parallel. This (a) parallelizes the TCP handshakes; (b) can make better use of the bandwidth-delay product if that is high; and (c) doesn't make one arbitrarily slow connection the critical path if for some reason it encounters a high RTT or failure rate.
Another way to do (b) is to use an enormous TCP socket receive buffer but that's not always convenient.
Several of the other answers about HDD are incorrect. Practically any HDD will do some read-ahead on the assumption of sequential access, and any intelligent OS cache will also do that.

My gues is that the hdd read write heads spend most of their time idle and wait for the correct memory block of the disk to apear under them, the more memory being copied means less time in idle and most modern disk schedulers should take care of the jumping (for a low number of files/fragments)

As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network.
I think those assumptions are overly simplistic.
First, while LANs run at 100Mb / 1Gbit. Long haul networks have a maximum data rate that is less than the max rate of the slowest link.
Second, the effective throughput of TCP/IP stream over the internet is often dominated by the time taken to round-trip messages and acknowledgments. For example, I have a 8+Mbit link, but my data rate on downloads is rarely above 1-2Mbits per second when I'm downloading from the USA. So if you can run multiple streams in parallel one stream can be waiting for an acknowledgment while another is pumping packets. (But if you try to send too much, you start getting congestion, timeouts, back-off and lower overall transfer rates.)
Finally, operating systems are good at doing a variety of I/O tasks in parallel with other work. If you are downloading 2 or more files in parallel, the O/S may be reading / processing network packets for one download and writing to disc for another one ... at the same time.

Over long distances, networks can write much faster than they can read. With multithreading, having additional "readers" means the data can be transmitted more efficiently and not bogged down in buffers.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio