improving concurrent file read with ssd and mmap - performance

I have huge meteorological files, too big to fit in RAM.
I need to perform a lot of concurrent random reads.
So I think SSD + mmap could improve performance.
But what about concurrent mmap reads? How should they be organized?

Is there a concurrency reason (contention for data structures and resources shared between threads) why you would want to open the same files independently in different threads? If not, I can't see a reason for doing that. It will just make the kernel work a little harder by having to track a bunch of different memory mappings (one for each thread) that all ultimately map to the same object, consume more file descriptors (no big deal unless you have a very large number of files), and consume more address space when you mmap the same files multiple times.
If I understand correctly, in your scenario the files are opened infrequently, read a lot, and then closed, so I don't think you would have much contention between threads. Go with opening the files globally for all threads.
Regardless of whether you have contention between threads for the housekeeping of the open files, there is one overriding reason in favor of mapping each file only once per process, and that is if your address space is only 32 bits. In 32-bit mode, address space is quite a limited resource if your files are large and you want to mmap significant portions of them. In that case you most certainly need to conserve address space by not wastefully mapping the same file twice in two different threads.
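For illustration, here is a minimal POSIX C sketch (file name, thread count, and iteration counts are just placeholders) of mapping a file once per process and letting several threads do random reads through the single shared mapping; read-only access through the mapping needs no per-thread locking:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const unsigned char *g_map;  /* one shared read-only mapping for all threads */
static size_t g_size;

static void *reader(void *arg)
{
    unsigned id = (unsigned)(size_t)arg;
    unsigned seed = id;
    unsigned long sum = 0;
    for (int i = 0; i < 100000; i++) {
        size_t off = (size_t)rand_r(&seed) % g_size;  /* random offset */
        sum += g_map[off];                            /* concurrent read, no locking needed */
    }
    printf("thread %u checksum %lu\n", id, sum);
    return NULL;
}

int main(void)
{
    int fd = open("huge_weather_data.bin", O_RDONLY);  /* hypothetical file name */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror("open/fstat"); return 1; }
    g_size = (size_t)st.st_size;

    g_map = mmap(NULL, g_size, PROT_READ, MAP_SHARED, fd, 0);
    if (g_map == MAP_FAILED) { perror("mmap"); return 1; }
    madvise((void *)g_map, g_size, MADV_RANDOM);  /* hint: random access pattern */

    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, reader, (void *)(size_t)(i + 1));
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);

    munmap((void *)g_map, g_size);
    close(fd);
    return 0;
}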

Related

Does GetWriteWatch work with Memory-Mapped Files?

I'm working with memory-mapped files (MMF) and very large datasets, where each file is ~50GB and around 40 files are open at the same time. Of course this depends on the input: I can also have smaller or larger files, so the system should scale accordingly.
The MMF is acting as a backing buffer, so as long as I have enough free memory, no paging should occur. The problem is that the Windows memory manager and my application are two autonomous processes. In good conditions everything works fine, but the memory manager is obviously too slow once I enter low-memory conditions: memory is full, the system starts to page (which is good), but I keep allocating because I don't get any information about the paging.
In the end I reach a state where the system stalls: the memory manager pages and I keep allocating.
So I came to the point where I need to advise the memory manager, check current memory conditions, and invoke the paging myself. For that reason I wanted to use GetWriteWatch to inspect the memory region I can flush.
Interestingly, GetWriteWatch does not work in my situation; it returns -1 without filling the structures. So my question is: does GetWriteWatch work with MMFs?
Does GetWriteWatch work with Memory-Mapped Files?
I don't think so.
GetWriteWatch only works on memory allocated via the VirtualAlloc function with the MEM_WRITE_WATCH flag.
File mappings are created with the MapViewOfFile* functions, which do not take this flag.
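A minimal sketch of how write-watching is normally set up (the region size and reset flag are just illustrative; the point is that the flag must be passed at allocation time, which a MapViewOfFile view never gets):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T size = 64 * 1024;  /* illustrative region size */

    /* Write-watching must be requested when the memory is allocated... */
    void *base = VirtualAlloc(NULL, size,
                              MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                              PAGE_READWRITE);
    if (!base) return 1;

    ((char *)base)[0] = 1;    /* dirty one page */

    PVOID pages[16];
    ULONG_PTR count = 16;
    DWORD granularity;
    /* Returns 0 on success and fills in the addresses of written pages. */
    UINT rc = GetWriteWatch(WRITE_WATCH_FLAG_RESET, base, size,
                            pages, &count, &granularity);
    printf("rc=%u, dirty pages=%lu\n", rc, (unsigned long)count);

    /* ...whereas a view created with MapViewOfFile has no way to pass
       MEM_WRITE_WATCH, so GetWriteWatch fails on it. */
    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}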

Can improved performance be obtained from parallel reading in Fortran?

I have a Fortran 90 code that spends (by far) most of its time on I/O, because very large data files (at least 1GB and up from there) need to be read. Smaller, but still large, data files with the results of the calculations need to be written. Comparatively, some fast Fourier transforms and other calculations are done in no time. I have parallelized (OpenMP) some of these calculations, but the overall gain in performance is minimal given the I/O issues mentioned.
My strategy at the moment is to read the whole file at once:
open(unit=10, file="data", status="old")
do i = 1, verylargenumber
   read(10,*) var1(i), var2(i), var3(i)
end do
close(10)
and then perform operations on var1, etc. My question is whether there is a suitable strategy using (preferably) OpenMP that would allow me to speed up the reading process, especially with the consideration in mind (if it makes any difference) that the data files are quite large.
I have the possibility to run these calculations on Lustre file systems, which in principle offer advantages for parallel I/O, although a general solution for regular file systems would be appreciated.
My intuition is that there is no work around this issue but I wanted to check for sure.
I'm not a Fortran guru, but it looks like you are reading the values from the file in very small chunks (3x integers at a time, at most a few dozen bytes). Reading the file in large chunks (multi-MB at a time) is going to provide a significant improvement in performance, since you will be reducing the number of underlying read() system calls (and corresponding locking overhead) by many orders of magnitude.
If your large files are written in Lustre with multiple stripes (e.g. in a directory with lfs setstripe -c 8 -S 4M <dir> to set a default stripe count of 8 with a stripe size of 4MB for all new files in that directory) then this may improve the aggregate read performance - assuming that you are reading only a single file at one time, and you are not limited by the client network bandwidth. If your program is running on multiple nodes and/or threads concurrently, and each of those threads is itself reading its own file, then you will already have parallelism above the file level. Even reading from a single file can do quite well (if the reads are large) because the Lustre client will do readahead in the background.
If you have multiple compute threads that are each working on different chunks of the file at one time (e.g. 4MB chunks) then you could read each of the 4MB chunks from a different thread, which may improve performance since you will have more IO requests in flight. However, there is still a limit on how fast a single client can read files over the network. Reading from a multi-striped file from multiple clients concurrently will allow you to aggregate the network and disk bandwidth from multiple clients and servers, which is where Lustre does best.
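Not Fortran, but as a language-agnostic sketch of the chunked-read idea in C with OpenMP (the chunk size, file name, and the placeholder parsing step are illustrative): each thread issues its own large pread() at a distinct offset, so several multi-MB requests are in flight at once instead of many tiny formatted reads.

#include <fcntl.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)   /* illustrative 4 MB chunks */

int main(void)
{
    int fd = open("data", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    off_t size = lseek(fd, 0, SEEK_END);
    long nchunks = (long)((size + CHUNK - 1) / CHUNK);

    /* Each thread reads a different chunk; parsing then happens in memory. */
    #pragma omp parallel for schedule(dynamic)
    for (long c = 0; c < nchunks; c++) {
        char *buf = malloc(CHUNK);
        ssize_t n = pread(fd, buf, CHUNK, (off_t)c * CHUNK);
        if (n > 0) {
            /* ... parse the n bytes of this chunk here ... */
        }
        free(buf);
    }
    close(fd);
    return 0;
}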

Multithreaded File Compare Performance

I just stumbled onto this SO question and was wondering if there would be any performance improvement if:
The file was compared in blocks no larger than the hard disk sector size (512 bytes, 2 KB, or 4 KB)
AND the comparison was done multithreaded (or maybe even with the .NET 4 parallel stuff)
I imagine there being 2 threads: one that reads from the beginning of the file and another that reads from the end until they meet in the middle.
I understand that in this situation the disk I/O is going to be the slowest part, but if the reads never have to cross sector boundaries (which in my twisted imagination somehow eliminates any possible fragmentation overhead) then it may potentially reduce head moves and hence result in better performance (maybe?).
Of course other factors could play in as well, such as, single vs multiple processors/cores or SSD vs non-SSD, but with those aside; is the disk IO speed + potentially sharing processor time insurmountable? Or perhaps my concept of computer theory is completely off-base...
If you're comparing two files that are on the same drive, the only benefit you could receive from multi-threading is to have one thread reading--populating the next buffers--while another thread is comparing the previously-read buffers.
If the files you're comparing are on different physical drives, then you can have two asynchronous reads going concurrently--one on each drive.
But your idea of having one thread reading from the beginning and another reading from the end will make things slower because seek time is going to kill you. The disk drive heads will continually be seeking from one end of the file to the other. Think of it this way: do you think it would be faster to read a file sequentially from the start, or would it be faster to read 64K from the front, then read 64K from the end, then seek back to the start of the file to read the next 64K, etc?
Fragmentation is an issue, to be sure, but excessive fragmentation is the exception, not the rule. Most files are going to be unfragmented, or only partially fragmented. Reading alternately from either end of the file would be like reading a file that's pathologically fragmented.
Remember, a typical disk drive can only satisfy one I/O request at a time.
Making single-sector reads will probably slow things down. In my tests of .NET I/O speed, reading 32K at a time was significantly faster (between 10 and 20 percent) than reading 4K at a time. As I recall (it's been some time since I did this), on my machine at the time, the optimum buffer size for sequential reads was 256K. That will undoubtedly differ for each machine, based on processor speed, disk controller, hard drive, and operating system version.
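As a point of reference, a plain single-threaded compare with large buffers is hard to beat when both files sit on the same drive. A minimal sketch in C (the 256 KB buffer is just the figure mentioned above; tune it for your machine):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF (256 * 1024)   /* large sequential reads; tune per machine */

/* Returns 1 if the files are byte-for-byte identical, 0 otherwise. */
int files_equal(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
    if (!fa || !fb) { if (fa) fclose(fa); if (fb) fclose(fb); return 0; }

    char *ba = malloc(BUF), *bb = malloc(BUF);
    int equal = 1;
    for (;;) {
        size_t na = fread(ba, 1, BUF, fa);
        size_t nb = fread(bb, 1, BUF, fb);
        if (na != nb || memcmp(ba, bb, na) != 0) { equal = 0; break; }
        if (na < BUF) break;   /* both files reached EOF together */
    }
    free(ba); free(bb);
    fclose(fa); fclose(fb);
    return equal;
}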

Reading in parallel from multiple hard drives

I am writing an application that deals with lots of data (gigabytes). I am considering splitting the data onto multiple hard drives and reading it in parallel. I am wondering what kind of limitations I will run into--for example, is it possible to read from 4 or 8 hard drives in parallel, and will I get approximately 4 or 8 times the performance if disk I/O is the limiting factor? What should I look out for? Pointers to relevant docs are also appreciated--Google didn't turn up much.
EDIT: I should point out that I've looked at RAID, but the performance wasn't as good as I was hoping for. I am planning on writing this myself in C/C++.
Splitting the data and reading from 4 to 8 drives in parallel will not multiply throughput by 4 to 8 times. There are other factors you need to consider.
If you are reading the data in the application, you will probably need separate threads to read from the different hard disks.
Windows provides overlapped and non-overlapped methods of reading and writing data to the disk; see whether using overlapped I/O increases throughput. *nix systems likewise have asynchronous read/write methods.
On a single core/processor, threads appear to run in parallel but execute sequentially underneath. With multiple cores, threads can genuinely read in parallel, but the OS ultimately decides what runs and when, so having too many reader threads can decrease performance rather than increase it.
If you check the spec sheet of any hard disk, you will see it quotes both random access time and sequential access time, so depending on your data you may want to check these parameters.
When you split data across different drives, keep in mind that your application will need to reassemble it into meaningful information, which requires synchronization; if you use threads, the threads themselves must also be kept in sync.
You may have a state-of-the-art hard disk with high read/write speeds, but other hardware may be the weak link: a low-end motherboard or slow RAM may keep you from getting the best of those speeds.
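A minimal POSIX sketch of the one-thread-per-drive approach (the file paths are placeholders for files living on different physical drives, and the byte counter stands in for whatever your consumer does with each buffer):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF (1 * 1024 * 1024)   /* 1 MB sequential reads */

static void *drain(void *path)
{
    int fd = open((const char *)path, O_RDONLY);
    if (fd < 0) { perror((const char *)path); return NULL; }
    char *buf = malloc(BUF);
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, BUF)) > 0)
        total += n;              /* hand the buffer to a consumer here */
    printf("%s: %lld bytes\n", (const char *)path, total);
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    /* Placeholder paths: one data file per physical drive. */
    const char *files[] = { "/mnt/disk1/part0", "/mnt/disk2/part1",
                            "/mnt/disk3/part2", "/mnt/disk4/part3" };
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, drain, (void *)files[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}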
If you're not going to use real RAID, you better at least use multiple hard drive controllers, otherwise you won't see much performance gain at all. One controller can't do lots of concurrent IO so it will quickly become the bottleneck.
It sounds like you are talking about data striping, which is commonly used in RAID implementations. You may want to look into one of the software RAID solutions available for most operating systems; a further advantage is that you can add parity (the ability to lose a drive without losing your data).
This would give you the benefits of RAID without having to implement it yourself. You could also do it at the database level with data files spread across the drives, but that adds complexity.
You will stream data faster. Drives are only so fast, and if your I/O channel can handle more, go for it. There are also seek times to take into account, though that's probably not a big deal based on your app description.
As you seem to be OK with reconfiguring the drives, how about SSDs?
They run rings around any mechanical drives (up around 200+ MB/s read, 150+ MB/s write).
Are you reading the data sequentially or randomly?
How many GB are you expecting?

What are the most efficient idioms for streaming data from disk with constant space usage?

Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much RAM will be used.
In summary:
Needs to be constant space
Fast as possible
Assume very large files
Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file were small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately that's not the case here. Is there any performance advantage to using mmap with a small window to map successive chunks of the file? Would the system-call overhead of moving the mmap window down the file dominate any advantages, or should I use a fixed buffer that I read into with fread?
I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).
Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.
You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!
In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.
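A minimal constant-space read loop of the kind suggested above (the 64 KB buffer and the placeholder processing step are illustrative; a real md5sum would feed each chunk into the hash update function):

#include <stdio.h>
#include <stdlib.h>

#define BUF (64 * 1024)   /* fixed buffer: memory use is constant regardless of file size */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char *buf = malloc(BUF);
    unsigned long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, BUF, f)) > 0) {
        /* process(buf, n): e.g. feed this chunk into an MD5 update function */
        total += n;
    }
    printf("processed %llu bytes using a %d KB buffer\n", total, BUF / 1024);
    free(buf);
    fclose(f);
    return 0;
}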
Depending on the language you are using, a C-like fread() loop with a buffer of a declared size will require exactly that much memory, no more, no less.
We typically choose a buffer size of 4 to 128 KB; there is little gain, if any, with bigger buffers.
If performance were extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider a two-thread implementation, whereby one thread reads the file into a pair of buffers and the other thread performs the calculations sequentially, one buffer at a time. In this fashion the disk access delays can be hidden.
mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.
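A sketch of that double-buffering idea with two POSIX threads (the buffer size, the placeholder checksum computation, and the semaphore handshake are all illustrative; the reader fills one buffer while the other is being crunched):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF (1 * 1024 * 1024)   /* two 1 MB buffers: constant space */

static unsigned char bufs[2][BUF];
static size_t lens[2];
static sem_t filled[2], empty[2];   /* ping-pong handshake per buffer */

static void *crunch(void *arg)
{
    (void)arg;
    unsigned long long checksum = 0;
    for (int i = 0; ; i ^= 1) {
        sem_wait(&filled[i]);                 /* wait until buffer i is full */
        size_t n = lens[i];
        if (n == 0) break;                    /* reader signalled EOF */
        for (size_t k = 0; k < n; k++)        /* placeholder computation */
            checksum += bufs[i][k];
        sem_post(&empty[i]);                  /* give buffer i back to the reader */
    }
    printf("checksum %llu\n", checksum);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    for (int i = 0; i < 2; i++) { sem_init(&filled[i], 0, 0); sem_init(&empty[i], 0, 1); }

    pthread_t t;
    pthread_create(&t, NULL, crunch, NULL);

    size_t n;
    int i = 0;
    do {
        sem_wait(&empty[i]);                  /* wait until buffer i is free */
        n = fread(bufs[i], 1, BUF, f);        /* read while the other buffer is crunched */
        lens[i] = n;
        sem_post(&filled[i]);
        i ^= 1;
    } while (n > 0);

    pthread_join(t, NULL);
    fclose(f);
    return 0;
}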
