As a part of my data processing pipeline I'm reading many hdf files on a network drive, potentially away from the physical machine. After profiling (using cProfile) my code which does basically the following:
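import h5py

def load_all(paths):
    data = []
    for path in paths:
        # Each iteration pays for one open (File.__init__) and one read (__getitem__)
        with h5py.File(path, 'r') as hdf:
            data.append(hdf['dataset'][()])
    return data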
I found that there are two main calls in this loop: h5py.File.__init__ (which dispatches to make_fid internally) and File.__getitem__ (which dispatches to method 'read' of h5py._selector.Blahblah). Now, make_fid takes almost as much time as __getitem__ itself when reading from a far away drive and drops to almost negligible when reading files that were moved to a local SSD, while __getitem__ runtime remains almost constant (in terms of time per call). I am no OS guy so I would like to ask what exactly should contribute to this slowing down: is it plain network transfer, some filesystem operations/synchronization, or something else entirely? Network would be the most likely culprit but I have two issues with that explanation:
Shouldn't it contribute to __getitem__ which executes read, rather than to instantiation of the File object?
Using the method from here to benchmark network transfer from my VM to two different non-local drives, I found an almost 3x difference in read throughput between them, but this translates to barely a ~20% speedup of one over the other when executing the code above.
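One way to check where the time goes is to time the open and the read separately for each file. The sketch below does that for plain files using only the standard library; the function name `time_open_and_read` is made up for illustration, and it is only an analogue of the h5py case (there, the equivalent split would be timing `h5py.File(path, 'r')` against the dataset read).

```python
import time


def time_open_and_read(paths):
    """Return (total_open_seconds, total_read_seconds) summed over all paths."""
    open_s = 0.0
    read_s = 0.0
    for path in paths:
        t0 = time.perf_counter()
        f = open(path, "rb")      # name lookup, permission checks, metadata
        t1 = time.perf_counter()
        try:
            f.read()              # bulk data transfer
            t2 = time.perf_counter()
        finally:
            f.close()
        open_s += t1 - t0
        read_s += t2 - t1
    return open_s, read_s
```

If the open time grows on the remote drive while the read time per byte stays flat, that points at per-file round trips (latency) rather than raw throughput, which would be consistent with a 3x throughput difference yielding only a small speedup.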
Related
Many applications, mostly video games (mainly ones with constant content streaming), take the approach of having a single big package file containing many of the game assets. That's [arguably] done for security and space-efficiency reasons, but perhaps there's a performance reason too?
Given an application that loads 3+ asset files every 5 seconds using asynchronous I/O in a secondary streaming thread, would it be technically faster to perform the I/O on a single big file, seeking to and reading from the offsets of the various assets when necessary, or to read each file separately, spread across the operating system's file system?
There would probably be a difference across HDDs, SSDs and other factors; what are those?
Assume the files aren't fragmented (i.e. space for them was reserved during installation before actually writing the content), but specifying the effects of fragmentation on the result would be interesting too.
The files-spread-across-the-filesystem approach seems interesting for quick production and modding support, but if there's a performance penalty, care should be taken.
This question is purely theoretical and asked out of curiosity at this point.
Creating a file handle/descriptor requires security validation and file system metadata operations. So it is arguable that many small file operations will be slower than one operation on one big file. Whether there is a measurable difference in the specific context of your code remains to be demonstrated.
BTW, true asynchronous IO should not require 'secondary' threads. You are likely describing synchronous IO performed on a different thread, a completely different beast.
Achieving high throughput IO is very platform specific. To illustrate, for Windows specifics read Designing Applications for High Performance - Part 1, Part 2 and Part 3.
Opening a file requires string processing to verify the file name, finding the directory entry that holds the block references, and then seeking to the first block of the file to start reading.
The directory lookup can be cached but all the rest has a cost per file.
Also having one big file lets the OS know that it will be accessed in one go and read the next blocks of the file while you are processing the data; however if you only read some assets then this read-ahead will possibly read unnecessary blocks.
A better solution would be a hybrid approach: collect assets that are often loaded together into buckets and have a database that says, per level, which asset buckets need to be read.
You can also duplicate the data across several buckets; in the extreme, each level has a single bucket with all the assets that level needs. This takes up more space but will net you the greatest speedup, as you only need to dump 1 file into memory.
This presentation talks about how you can create a good distribution of seeks (files in this context) vs. amount of assets within a storage budget.
When creating a game for PC you can let the user decide at install time which side of the coin they want: the quick loading of 1 file per level, or the space saving of each asset stored only once.
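To make the "one big file, seek to each asset" idea concrete, here is a minimal sketch of a pack file with an offset index. The layout (an 8-byte length prefix, a JSON index, then the payloads back to back) and the names `write_pack` and `PackReader` are assumptions invented for this example, not any real engine's format.

```python
import json
import struct


def write_pack(pack_path, assets):
    """Write assets (dict of name -> bytes) into a single pack file."""
    index = {}
    offset = 0
    for name, blob in assets.items():
        index[name] = (offset, len(blob))   # position relative to payload start
        offset += len(blob)
    index_bytes = json.dumps(index).encode("utf-8")
    with open(pack_path, "wb") as f:
        f.write(struct.pack("<Q", len(index_bytes)))  # 8-byte index length
        f.write(index_bytes)
        for blob in assets.values():
            f.write(blob)


class PackReader:
    """Opens the pack once; each asset read is then just a seek plus a read."""

    def __init__(self, pack_path):
        self._f = open(pack_path, "rb")
        (index_len,) = struct.unpack("<Q", self._f.read(8))
        self._index = json.loads(self._f.read(index_len).decode("utf-8"))
        self._data_start = 8 + index_len

    def read(self, name):
        offset, size = self._index[name]
        self._f.seek(self._data_start + offset)
        return self._f.read(size)

    def close(self):
        self._f.close()
```

The per-file open cost (name lookup, permission checks, metadata) is paid once for the pack instead of once per asset. A bucket-per-level scheme would simply be several such packs, possibly with duplicated assets.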
We have a program building a 3D model from three files hosted on a Linux file server: x.bin, y.bin and z.bin. It builds the models one z level at a time, and reads each file for every "slice".
On Linux machines running this program, the first slice takes around 45 seconds, and then ~2 seconds for every "slice" after that.
On Windows, the exact same program performing the exact same operation running the exact same script and code takes 5 minutes for the first slice, and around a minute and a half each slice after that.
Reading file over network slow due to extra reads
This thread seemed to have a guy with a similar problem, but the truth is that I'm still unclear on how NFS can be faster, as well as how I can suggest a change to the actual developers as to how to improve performance. The code is OS independent, I believe it's just using C's fread, fseek, etc to read the file information over the network.
How does NFS transfer/read data such that it can be 60x faster than Samba?
How can I get that performance on samba?
I'm not 100% sure, as I don't know much about Samba, but my guess is that NFS supports fseek and thus can just position itself at the next slice and return that data, while Samba probably doesn't and has to return the full file from the server and discard the "unused" content.
By the way, it's not the exact same program you're running; you probably recompiled it for each platform, right? So it has been translated into quite different system calls, with each platform having different pros and cons...
I just stumbled onto this SO question and was wondering if there would be any performance improvement if:
The file was compared in blocks no larger than the hard disk sector size (1/2KB, 2KB, or 4KB)
AND the comparison was done multithreaded (or maybe even with the .NET 4 parallel stuff)
I imagine there being 2 threads: one that reads from the beginning of the file and another that reads from the end until they meet in the middle.
I understand that in this situation the disk IO is going to be the slowest part, but if the reads never have to cross sector boundaries (which in my twisted imagination somehow eliminates any possible fragmentation overhead) then it may potentially reduce head moves, hence resulting in better performance (maybe?).
Of course other factors could play in as well, such as, single vs multiple processors/cores or SSD vs non-SSD, but with those aside; is the disk IO speed + potentially sharing processor time insurmountable? Or perhaps my concept of computer theory is completely off-base...
If you're comparing two files that are on the same drive, the only benefit you could receive from multi-threading is to have one thread reading--populating the next buffers--while another thread is comparing the previously-read buffers.
If the files you're comparing are on different physical drives, then you can have two asynchronous reads going concurrently--one on each drive.
But your idea of having one thread reading from the beginning and another reading from the end will make things slower because seek time is going to kill you. The disk drive heads will continually be seeking from one end of the file to the other. Think of it this way: do you think it would be faster to read a file sequentially from the start, or would it be faster to read 64K from the front, then read 64K from the end, then seek back to the start of the file to read the next 64K, etc?
Fragmentation is an issue, to be sure, but excessive fragmentation is the exception, not the rule. Most files are going to be unfragmented, or only partially fragmented. Reading alternately from either end of the file would be like reading a file that's pathologically fragmented.
Remember, a typical disk drive can only satisfy one I/O request at a time.
Making single-sector reads will probably slow things down. In my tests of .NET I/O speed, reading 32K at a time was significantly faster (between 10 and 20 percent) than reading 4K at a time. As I recall (it's been some time since I did this), on my machine at the time, the optimum buffer size for sequential reads was 256K. That will undoubtedly differ for each machine, based on processor speed, disk controller, hard drive, and operating system version.
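For illustration, a sequential comparison with a configurable buffer size might look like the sketch below. It is written in Python rather than .NET, the name `files_equal` is made up, and the 256K default simply mirrors the figure quoted above; the best value would have to be measured on the machine in question.

```python
import os


def files_equal(path_a, path_b, chunk_size=256 * 1024):
    """Compare two files sequentially, chunk_size bytes at a time."""
    # Different sizes means different content; no need to read anything.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:          # both reads hit end of file together
                return True
```

Both files are read front to back, so on a single spinning disk the heads only alternate between two sequential streams, instead of bouncing between the start and the end of the same file.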
This is not a pure programming question, however it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of the file. Supposing I have a file with 1MB data and then I insert something at the 512KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete lets say I want to insert 16KB of data.
I understand the answer varies depending on the filesystem, however I assume that the techniques used in common filesystems are quite similar and I just want to get the right notion of it.
(disclaimer: I want just to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different depending on whether the target position is inside or outside the current buffer. Also, the system services that the I/O library uses may vary a lot. E.g. on some systems the library makes extensive use of file memory mapping where possible.
2) As you said, different filesystems may behave in very different ways. In particular, I would expect that a transactional filesystem must do something very smart and perhaps expensive to be prepared for a possible rollback of an aborted write operation in the middle of a file.
3) Modern OSes have very aggressive caching algorithms. An "fseeked" file is likely to be already present in cache, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes becomes significant.
Any comments?
fseek(...) is a library call, not an OS system call. The run-time library takes care of the actual overhead involved in making a system call to the OS; technically speaking, fseek only indirectly makes a call to the system (which illustrates the distinction between a library call and a system call). fseek(...) is a standard input-output function regardless of the underlying system... however... and this is a big however...
The OS will more than likely have cached the file's layout in kernel memory, that is, the mapping from file offsets to the locations on disk where the 1's and 0's are stored. Some layer within the kernel, more than likely a top-most one, holds a snapshot of what the file is composed of, irrespective of what the data contains (it does not care either way, as long as the 'pointers' into the disk structure for a given offset are valid!)...
When fseek(...) occurs, there can be a lot of overhead, indirectly: the kernel is delegated the task of reading from the disk, and depending on how fragmented the file is, the data could theoretically be "all over the place". From a user-land perspective, i.e. the C code doing an fseek(...), the kernel may have to gather scattered blocks into "one contiguous view of the data". Hence inserting into the middle of a file (remember that at this stage the kernel has to adjust the locations/offsets on the actual disk platter for the data) would be deemed slower than appending to the end of the file.
The reason is quite simple: the kernel "knows" what the last offset was, so it simply wipes the EOF marker and inserts more data. Behind the scenes, the kernel has to allocate another block of memory for the disk buffer, with the adjusted offset to the location on the disk following the EOF marker, once the appending of data is completed.
Let us assume the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between an insert and an append. In both cases the file's inode and offset table must be read, the relevant disk sector mapped into memory, the data updated and at some later point the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file, since this will reduce the number of load/store combos.
As a previous answer says, you may be able to speed up both operations if you deal with data writes that are exact multiples of the FS block size; in this case you could skip the load stage and just insert the new blocks into the file's inode data structure. This would not be practical, as you would need low-level access to the FS driver, and using it would be very restrictive and not portable.
One observation I have made about fseek on Solaris, is that each call to it resets the read buffer of the FILE. The next read will then always read a full block (8K by default). So if you have a lot of random access with small reads it's a good idea to do it unbuffered (setvbuf with NULL buffer) or even use direct syscalls (lseek+read or even better pread which is only 1 syscall instead of 2). I suppose this behaviour will be similar on other OS.
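The same two approaches are reachable from Python through the `os` module, which may make the difference easier to see. This is a sketch for POSIX systems only (`os.pread` is not available on Windows); the helper names are made up for illustration.

```python
import os


def read_at_seek(fd, offset, size):
    """Two syscalls: lseek to position the descriptor, then read."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, size)


def read_at_pread(fd, offset, size):
    """One syscall: pread takes the offset directly and leaves the
    descriptor's current position untouched."""
    return os.pread(fd, size, offset)
```

Both work on a raw descriptor obtained from `os.open(path, os.O_RDONLY)`, so no user-space read buffer is involved and a small read asks the kernel for only `size` bytes.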
You can insert data into the middle of a file efficiently only if the data size is a multiple of the FS sector size, but OSes don't provide such functions, so you would have to use a low-level interface to the FS driver.
Inserting data in the middle of the file is less efficient than appending to the end because when inserting you would have to move the data after the insertion point to make room for the data being inserted. Moving these data would involve reading them from disk, writing the data to be inserted and then writing the old data after the inserted data. So you have at least one extra read and write when inserting.
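The extra read and write are easy to see when the insert is spelled out by hand. The sketch below is a naive version that holds the whole tail in memory, which is fine for the 1MB example in the question but not for large files; the function names are made up for illustration.

```python
def insert_bytes(path, offset, payload):
    """Insert payload at offset by rewriting everything after it."""
    with open(path, "r+b") as f:
        f.seek(offset)
        tail = f.read()       # extra read: everything after the insertion point
        f.seek(offset)
        f.write(payload)      # the new data
        f.write(tail)         # extra write: the old data, shifted back


def append_bytes(path, payload):
    """Append payload at the end; no existing data is touched."""
    with open(path, "ab") as f:
        f.write(payload)
```

For the question's numbers, inserting 16KB at the 512KB mark of a 1MB file means reading and rewriting about 512KB of existing data, whereas the append writes only the 16KB.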
RichCopy, a better-than-robocopy-with-GUI tool from Microsoft, seems to be the current tool of choice for copying files. One of its main features, highlighted in the TechNet article presenting the tool, is that it copies multiple files in parallel. In its default setting, three files are copied simultaneously, which you can see nicely in the GUI: [Progress: xx% of file A, yy% of file B, ...]. There are a lot of blog entries around praising this tool and claiming that this speeds up the copying process.
My question is: Why does this technique improve performance? As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network. My assumption would be that copying multiple files at once makes the whole process slower, since the HDD needs to jump back and forth between different files rather than just sequentially streaming one file. Since RichCopy is faster, there must be some mistake in my assumptions...
The tool is making use of improvements in hardware, which can optimise multiple read and write requests much better.
When copying one file at a time, the hardware isn't going to know that the block of data currently passing under the read head (or nearby) will be needed for a subsequent read, since the software hasn't queued that request yet.
A single file copy these days is not a very taxing task for modern disk sub-systems. By giving these hardware systems more work to do at once, the tool is leveraging their improved optimising features.
A naive "copy multiple files" application will copy one file, then wait for that to complete before copying the next one.
This will mean that an individual file CANNOT be copied faster than the network latency, even if it is empty (0 bytes). Because it probably does several file server calls, (open,write,close), this may be several x the latency.
To efficiently copy files, you want to have a server and client which use a sane protocol which has pipelining; that's to say - the client does NOT wait for the first file to be saved before sending the next, and indeed, several or many files may be "on the wire" at once.
Of course, doing that would require a custom server, not an SMB (or similar) file server. For example, rsync does this and is very good at copying large numbers of files despite being single threaded.
So my guess is that the multithreading helps because it is a work-around for the fact that the server doesn't support pipelining on a single session.
A single-threaded implementation which used a sensible protocol would be best in my opinion.
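As a rough illustration of the multithreaded work-around (not of how RichCopy itself is implemented, which I don't know), the sketch below copies several files at once with a small thread pool. The name `copy_many` is made up, and the default of three workers is only chosen to mirror the tool's default mentioned in the question.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_many(sources, dest_dir, workers=3):
    """Copy each source file into dest_dir, up to `workers` files at a time."""
    os.makedirs(dest_dir, exist_ok=True)

    def copy_one(src):
        dst = os.path.join(dest_dir, os.path.basename(src))
        shutil.copyfile(src, dst)   # open, write and close round trips happen here
        return dst

    # While one thread waits on a server round trip, the others keep going.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, sources))
```

With many small files on a high-latency link, the per-file round trips overlap instead of adding up, which is the effect a pipelined protocol would achieve on a single connection.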
It's a network tool, so the bottleneck is the network, not the HDD. Up to a (low) point you can get more throughput out of a TCP link by using a few connections in parallel. This (a) parallelizes the TCP handshakes; (b) can make better use of the bandwidth-delay product if that is high; and (c) doesn't make one arbitrarily slow connection the critical path if for some reason it encounters a high RTT or failure rate.
Another way to do (b) is to use an enormous TCP socket receive buffer but that's not always convenient.
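To put numbers on point (b): the bandwidth-delay product is the amount of data that must be in flight to keep a link full, and a single connection cannot go faster than its window divided by the round-trip time. The sketch below just does that arithmetic; the function names and the example figures are illustrative assumptions, not measurements.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_s):
    """Bytes that must be in flight to fill the link."""
    return bandwidth_bits_per_s * rtt_s / 8


def max_throughput_bits_per_s(window_bytes, rtt_s):
    """Upper bound on one connection's throughput for a given window."""
    return window_bytes * 8 / rtt_s


# Assumed example: a 1 Gbit/s path with a 100 ms round-trip time.
bdp = bandwidth_delay_product_bytes(1_000_000_000, 0.1)   # about 12.5 MB
# Assumed example: a 64 KiB window on the same path.
cap = max_throughput_bits_per_s(64 * 1024, 0.1)           # about 5.2 Mbit/s
```

Under these assumptions one connection with a 64 KiB window would use only a small fraction of the link, so either several connections in parallel or a much larger receive buffer is needed to fill it.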
Several of the other answers about HDD are incorrect. Practically any HDD will do some read-ahead on the assumption of sequential access, and any intelligent OS cache will also do that.
My guess is that the HDD read/write heads spend most of their time idle, waiting for the correct block of the disk to appear under them. More data being copied at once means less idle time, and most modern disk schedulers should take care of the jumping (for a low number of files/fragments).
As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network.
I think those assumptions are overly simplistic.
First, while LANs run at 100 Mbit/s or 1 Gbit/s, long-haul networks have a maximum data rate that is at most the rate of the slowest link.
Second, the effective throughput of a TCP/IP stream over the internet is often dominated by the time taken to round-trip messages and acknowledgments. For example, I have an 8+ Mbit link, but my data rate on downloads is rarely above 1-2 Mbits per second when I'm downloading from the USA. So if you can run multiple streams in parallel, one stream can be waiting for an acknowledgment while another is pumping packets. (But if you try to send too much, you start getting congestion, timeouts, back-off and lower overall transfer rates.)
Finally, operating systems are good at doing a variety of I/O tasks in parallel with other work. If you are downloading 2 or more files in parallel, the O/S may be reading / processing network packets for one download and writing to disc for another one ... at the same time.
Over long distances, networks can write much faster than they can read. With multithreading, having additional "readers" means the data can be transmitted more efficiently and not bogged down in buffers.