I need to speed up downloading a large file and I'm using multiple connections for that. I'm using a single goroutine with access to disk and it receives data from multiple goroutines using a channel, as I was advised here.
file, err := os.Create(filename)
if err != nil {
    log.Fatal(err)
}
down.destination = file
for info := range down.copyInfo {
    // Seek to this fragment's offset, then copy its bytes into place.
    down.destination.Seek(info.start, io.SeekStart)
    io.CopyN(down.destination, info.from, info.length)
}
}
The problem is that seeking, when used repeatedly on a large file, seems to make the operation slower. When info.length is larger, it has to seek fewer times, and it seems to do the job faster. But I need to make info.length smaller. Is there a way to make seeking faster? Or should I just download each part to a separate temp file and concatenate them at the end?
A seek by itself does not do any I/O; it just sets the position in the file for the next read or write. The number of seeks on its own thus likely doesn't matter. This can also easily be tested by adding dummy seeks without any following read or write operation.
The problem is likely not the number of seeks but the number of write operations. With many small fragments it will need more I/O operations to write the data than with a few large fragments. And each of these I/O operations has a significant overhead: there is the overhead of the system call itself; there may be overhead if the fragment is not aligned at the block boundaries of the underlying storage; and with a rotating disk there is the overhead of positioning the head over the actual sector.
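One way to test both claims, sketched in Go (the function name, fragment sizes, and the extraSeeks parameter are made up for illustration): vary the fragment size to change the number of writes, and vary extraSeeks to add seeks that are never followed by any I/O.

func timeWrites(f *os.File, total, fragment, extraSeeks int) (time.Duration, error) {
    buf := make([]byte, fragment)
    start := time.Now()
    for off := 0; off < total; off += fragment {
        // Dummy seeks: per the answer above, these only set the file
        // position, so adding them should cost very little.
        for i := 0; i < extraSeeks; i++ {
            if _, err := f.Seek(int64(off), io.SeekStart); err != nil {
                return 0, err
            }
        }
        // The real seek + write, mimicking the downloader's per-fragment pattern.
        if _, err := f.Seek(int64(off), io.SeekStart); err != nil {
            return 0, err
        }
        if _, err := f.Write(buf); err != nil {
            return 0, err
        }
    }
    return time.Since(start), nil
}

Running this with the same total but fragment sizes of, say, 4 KB versus 4 MB should show the per-write overhead dominating, while increasing extraSeeks should change little.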
Related
I'm currently writing an encoder and (obviously) want to make it fast.
I have a working system for doing the encoding (so every goroutine is doing the same thing), but I am struggling to find the right number of goroutines to run the code in. I basically want to decide on a maximum number of goroutines that keeps the CPU busy.
The following thoughts crossed my mind:
- If a file is only <1 kB, it's not useful to run the code in a lot of goroutines
- The number of goroutines should be influenced by the cores/threads available
  - running 16 goroutines on a 4x4 GHz CPU will not be a problem, but what about a 4x1 GHz CPU?
  - hard to determine reliably cross-platform
- The CPU should be busy, but not so busy that it keeps other programs from responding (~70-ish %?)
  - hard to decide beforehand due to clock speed and other parameters
Now I've tried to decide, based on these factors, how many goroutines to use, but I'm not quite sure how to do so reliably and cross-platform.
Attempts already made:
- using a linear function to determine the number of goroutines based on file size
  - requires different functions based on CPU
- parsing CPU specs from lscpu
  - not cross-platform
  - requires another function to determine based on frequency
Neither has been satisfactory.
You mention in a comment that
every goroutine is reading the file that is to be encoded
But of course the file—any file—is already encoded in some way: as plain text, perhaps, or UTF-8 (a stream of bytes), perhaps assembled into units of "lines". Or it might be an image stream, such as an MPEG file, consisting of some number of frames. Or it might be a database, consisting of records. Whatever its input form is, it contains some sort of basic unit that you could feed to your (re-)encoder.
That unit, whatever it may be, is a sensible place to divide work. (How sensible depends on what it is. See the idea of chunking below.)
Let's say the file consists of independent lines: then use scanner.Scan to read them, and pass each line to a channel that takes lines. Spin off N, for some N, readers that read the channel, one line at a time:
ch := make(chan string)
for i := 0; i < n; i++ {
go readAndEncode(ch)
}
// later, or immediately:
for s := bufio.NewScanner(os.Stdin); s.Scan(); {
ch <- s.Text()
}
close(ch)
If there are 100 lines, and 4 readers, the first four ch <- s.Text() operations go fast, and the fifth one pauses until one of the readers is done encoding and goes back to reading the channel.
If individual lines are too small a unit, perhaps you should read a "chunk" (e.g., 1 MB) at a time. If the chunk has a partial line at the end, back up, or read more, until you have a whole line. Then send the entire data chunk.
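For instance, a minimal chunking sketch (the names and the 1 MB threshold are illustrative): rather than backing up after a raw read, one simple way to get whole-line chunks is to batch scanned lines until roughly a megabyte has accumulated, then send the batch, so each channel operation carries a bigger unit of work.

const chunkSize = 1 << 20 // target chunk size, roughly 1 MB

func sendChunks(r io.Reader, ch chan<- []string) error {
    s := bufio.NewScanner(r)
    var chunk []string
    var size int
    for s.Scan() {
        line := s.Text()
        chunk = append(chunk, line)
        size += len(line)
        if size >= chunkSize {
            ch <- chunk // a whole-line chunk; receivers encode it as one unit
            chunk, size = nil, 0
        }
    }
    if len(chunk) > 0 {
        ch <- chunk // flush the final, smaller chunk
    }
    close(ch)
    return s.Err()
}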
Because channels copy the data, you may wish to send a reference to the chunk instead.¹ This would be true of any larger data unit. (Lines tend to be short, and the overhead of copying them is generally not very large compared to the overhead of using channels in the first place. If your lines have type string, well, see the footnote.)
If a line, or a chunk of lines, is not the correct unit of work here, figure out what is. Think of goroutines as people (or busy little gophers) who each get one job to do. They can depend on someone else—another person or gopher—to do a smaller job, whatever that might be; and having ten people, or gophers, working on sub-tasks allows a supervisor to manage them. If you need to do the same job N times, and N is not unbounded, you can spin off N goroutines. If N is potentially unbounded, spin off a fixed number (maybe based on #cpus) and feed them work through a channel, as sketched below.
¹ As Burak Serdar notes, some copies can be elided automatically: e.g., strings are in effect read-only slices. Slice types have three parts: a pointer (reference) to the underlying data, a length, and a capacity. Copying a slice copies these three parts, but not the underlying data. The same goes for strings: string headers omit the capacity, so sending a string through a channel copies only the two header words. Hence many of the obvious and easy-to-code ways of chunking data will already be pretty efficient.
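To make that last option concrete, here is a minimal worker-pool sketch (readAndEncode is the same hypothetical worker as in the code above; runtime.NumCPU() is just a common starting point for CPU-bound work, not a universal answer to "how many goroutines"):

n := runtime.NumCPU() // one worker per logical CPU is a reasonable default for CPU-bound encoding
ch := make(chan string)
var wg sync.WaitGroup
for i := 0; i < n; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        readAndEncode(ch) // returns once ch is closed and drained
    }()
}
// feed ch exactly as in the scanner loop above, then:
close(ch)
wg.Wait() // every unit of work has been encoded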
I have a Fortran 90 code that spends (by far) most of its time on I/O, because very large data files (at least 1 GB and up from there) need to be read. Smaller, but still large, data files with the results of the calculations need to be written. Comparatively, some fast Fourier transforms and other calculations are done in no time. I have parallelized (OpenMP) some of these calculations, but the overall gain in performance is minimal given the mentioned I/O issues.
My strategy at the moment is to read the whole file at once:
open(unit=10, file="data", status="old")
do i=1,verylargenumber
read(10,*) var1(i), var2(i), var3(i)
end do
close(10)
and then perform operations on var1, etc. My question is whether there is a suitable strategy using (preferably) OpenMP that would allow me to speed up the reading process, especially with the consideration in mind (if it makes any difference) that the data files are quite large.
I have the possibility to run these calculations on Lustre file systems, which in principle offer advantages for parallel I/O, although a general solution for regular file systems would be appreciated.
My intuition is that there is no way around this issue, but I wanted to check for sure.
I'm not a Fortran guru, but it looks like you are reading the values from the file in very small chunks (3x integers at a time, at most a few dozen bytes). Reading the file in large chunks (multi-MB at a time) is going to provide a significant improvement in performance, since you will be reducing the number of underlying read() system calls (and corresponding locking overhead) by many orders of magnitude.
If your large files are written in Lustre with multiple stripes (e.g. in a directory with lfs setstripe -c 8 -S 4M <dir> to set a default stripe count of 8 with a stripe size of 4MB for all new files in that directory) then this may improve the aggregate read performance - assuming that you are reading only a single file at one time, and you are not limited by the client network bandwidth. If your program is running on multiple nodes and/or threads concurrently, and each of those threads is itself reading its own file, then you will already have parallelism above the file level. Even reading from a single file can do quite well (if the reads are large) because the Lustre client will do readahead in the background.
If you have multiple compute threads that are each working on different chunks of the file at one time (e.g. 4MB chunks) then you could read each of the 4MB chunks from a different thread, which may improve performance since you will have more IO requests in flight. However, there is still a limit on how fast a single client can read files over the network. Reading from a multi-striped file from multiple clients concurrently will allow you to aggregate the network and disk bandwidth from multiple clients and servers, which is where Lustre does best.
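This thread is about Fortran, but the chunk-per-thread idea is perhaps easiest to sketch in Go, the language used elsewhere on this page; the 4 MB chunk size, the worker count, and the function name below are illustrative only. Each goroutine issues positioned reads with ReadAt, so the readers share no file offset and need no locking.

func readParallel(path string, workers int) error {
    const chunk = 4 << 20 // 4 MB per read, matching the stripe size above
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    st, err := f.Stat()
    if err != nil {
        return err
    }
    size := st.Size()

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func(w int) {
            defer wg.Done()
            buf := make([]byte, chunk)
            // Worker w handles chunks w, w+workers, w+2*workers, ...
            for off := int64(w) * chunk; off < size; off += int64(workers) * chunk {
                n, err := f.ReadAt(buf, off)
                if err != nil && err != io.EOF {
                    return
                }
                _ = buf[:n] // hand this chunk to the compute stage here
            }
        }(w)
    }
    wg.Wait()
    return nil
}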
I am working on a project that requires reading a file, making some manipulations on each line, and generating a new file. I am a bit concerned about performance. Which algorithm is more efficient? I wrote some pseudocode below.
Store everything to an array, close the file, manipulate each line and store new array to output file:
openInputFile()
lineArray[] = readInput()
closeInputFile()
for (i in lineArray)       // i: current line
    manipulate i
    newArray[] += i        // store the manipulated line in the new array
openOutputFile()
writeOutput(newArray)
closeOutput()
Get each line in a loop; after manipulating it, write the new line to the output:
openInputFile()
openOutputFile()
for (i in inputFile)       // i: current line
    manipulate i
    print manipulated line to output
closeInputFile()
closeOutputFile()
Which one should I choose?
It depends on how large the input file is:
If it is small, it doesn't matter which approach you use.
If it is large enough, then the overhead of holding the entire input file and the entire output file in memory at the same time can have significant performance impacts. (Increased paging load, etcetera.)
If it is really large, you will run out of memory and the application will fail.
If you cannot predict the number of lines there will be, then preallocating the line array is problematic.
Provided that you use buffered input and output streams, the second version will be more efficient, will use less memory, and won't break if the input file is too big.
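In Go terms (the function and parameter names below are illustrative, not from the question), the second approach with buffered streams looks roughly like this, and its memory use stays constant regardless of file size:

func transformFile(inPath, outPath string, manipulate func(string) string) error {
    in, err := os.Open(inPath)
    if err != nil {
        return err
    }
    defer in.Close()

    out, err := os.Create(outPath)
    if err != nil {
        return err
    }
    defer out.Close()

    w := bufio.NewWriter(out)
    s := bufio.NewScanner(in)
    for s.Scan() {
        // Manipulate one line at a time and write it straight to the output.
        if _, err := fmt.Fprintln(w, manipulate(s.Text())); err != nil {
            return err
        }
    }
    if err := s.Err(); err != nil {
        return err
    }
    return w.Flush() // push any buffered output to disk
}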
In both cases you read from each file once, and write to each file once. From that perspective, there isn't much difference in efficiency. Filesystems are good at buffering and serialising IO, and your disks are almost always the limiting factor in this sort of thing.
In an extreme case, you do sometimes gain a bit of efficiency with batching your write operations - a single large write is more efficient than lots of small ones. This is very rarely relevant on a modern operating system though, as they'll already be doing that behind the scenes.
So the key difference between the two approaches is memory use - in the former case, you have a much larger memory footprint, and gain no advantage from doing it. You should therefore go for the second choice*.
* Unless you actually need to reference elsewhere in the array, e.g. if you need to sort your data, because you then do need to pull your whole file into memory to manipulate it.
Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much ram will be used.
In summary:
Needs to be constant space
Fast as possible
Assume very large files
Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file was small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately that's not the case here. Is there any performance advantage to using mmap with a small buffer size to buffer successive chunks of the file? Would the system call overhead of moving the mmap window down the file dominate any advantages? Or should I use a fixed buffer that I read into with fread?
I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).
Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.
You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!
In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.
Depending on the language you are using, a C-like fread() loop with a buffer of a particular declared size will use exactly that buffer's worth of memory, no more, no less.
We typically choose a buffer size of 4 to 128 kBytes, there is little gain if any with bigger buffers.
If performance were extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider a two-thread implementation, where one thread reads the file into a pair of buffers while the other thread performs the calculations on one buffer at a time, in sequential fashion. Done this way, the disk access delays can be hidden behind the computation.
mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.
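A rough sketch of that double-buffer / overlapped-I/O idea in Go (illustrative only, not code from either answer; the md5 example comes from the question): one goroutine reads fixed-size chunks while the main goroutine hashes them, and two buffers are recycled through a channel so memory use stays constant.

func hashFile(path string) ([]byte, error) {
    const bufSize = 1 << 20 // 1 MB per read; at most two buffers in flight
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    filled := make(chan []byte, 2) // chunks ready to be hashed
    free := make(chan []byte, 2)   // buffers ready to be reused
    free <- make([]byte, bufSize)
    free <- make([]byte, bufSize)

    var readErr error
    go func() {
        defer close(filled)
        for buf := range free {
            n, err := f.Read(buf)
            if n > 0 {
                filled <- buf[:n]
            }
            if err != nil {
                if err != io.EOF {
                    readErr = err
                }
                return
            }
        }
    }()

    h := md5.New()
    for chunk := range filled {
        h.Write(chunk)             // crunch while the next read is in flight
        free <- chunk[:cap(chunk)] // hand the buffer back to the reader
    }
    return h.Sum(nil), readErr
}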
What is the difference between a sequential write and a random write in the case of:
1) Disk-based systems
2) SSD (flash device) based systems
When the application writes something and the information/data needs to be modified on disk, how do we know whether it is a sequential write or a random write? Up to that point a write cannot be distinguished as "sequential" or "random"; the write is just buffered and then applied to the disk when the buffer is flushed.
Please correct me if I am wrong.
When people talk about sequential vs random writes to a file, they're generally drawing a distinction between writing without intermediate seeks ("sequential"), vs. a pattern of seek-write-seek-write-seek-write, etc. ("random").
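To make the distinction concrete at the application layer, here is an illustrative Go fragment (the file name, sizes, and counts are made up): the first loop is what "sequential" looks like to the application, the second is "random".

func writePatterns() error {
    f, err := os.Create("out.bin")
    if err != nil {
        return err
    }
    defer f.Close()
    buf := make([]byte, 1<<20) // 1 MB per write

    // Sequential: each write continues where the previous one ended.
    for i := 0; i < 100; i++ {
        if _, err := f.Write(buf); err != nil {
            return err
        }
    }

    // Random: seek to a scattered offset before every write.
    for i := 0; i < 100; i++ {
        off := rand.Int63n(100) * int64(len(buf))
        if _, err := f.Seek(off, io.SeekStart); err != nil {
            return err
        }
        if _, err := f.Write(buf); err != nil {
            return err
        }
    }
    return nil
}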
The distinction is very important in traditional disk-based systems, where each disk seek will take around 10ms. Sequentially writing data to that same disk takes about 30ms per MB. So if you sequentially write 100MB of data to a disk, it will take around 3 seconds. But if you do 100 random writes of 1MB each, that will take a total of 4 seconds (3 seconds for the actual writing, and 10ms*100 == 1 second for all the seeking).
As each random write gets smaller, you pay more and more of a penalty for the disk seeks. In the extreme case where you perform 100 million random 1-byte writes, you'll still net 3 seconds for all the actual writes, but you'd now have 11.57 days worth of seeking to do! So clearly the degree to which your writes are sequential vs. random can really affect the time it takes to accomplish your task.
The situation is a bit different when it comes to flash. With flash, you don't have a physical disk head that you must move around. (This is where the 10ms seek cost comes from for a traditional disk). However, flash devices tend to have large page sizes (the smallest "typical" page size is around 512 bytes according to wikipedia, and 4K page sizes appear to be common as well). So if you're writing a small number of bytes, flash still has overhead in that you must read out an entire page, modify the bytes you're writing, and then write back the entire page. I don't know the characteristic numbers for flash off the top of my head. But the rule of thumb is that on flash if each of your writes is generally comparable in size to the device's page size, then you won't see much performance difference between random and sequential writes. If each of your writes is small compared to the device page size, then you'll see some overhead when doing random writes.
Now for all of the above, it's true that at the application layer much is hidden from you. There are layers in the kernel, disk/flash controller, etc. that could for example interject non-obvious seeks in the middle of your "sequential" writing. But in most cases, writing that "looks" sequential at the application layer (no seeks, lots of continuous I/O) will have sequential-write performance while writing that "looks" random at the application layer will have the (generally worse) random-write performance.