External sorting of lines. Сount of files to merge?

External sorting of lines. Сount of files to merge? - sorting

I need to sort the rows in the file in the minimum time (tens of GB) on a PC. I should use N-way merge sorting, right? How do choose the number N (the number of files to merge at a time)? Should I measure delays when reading or writing and tune N? Or get disk information from the system? If i have SSD, could I merge all sorted part at once (The program need to somehow determine is it an SSD)? What other optimizations can be?

After the initial pass to create a set of sorted sub-files, for hard drives, typically a 16-way merge using a min-heap is used, which is still fast enough to keep the process I/O bound. To reduce random access overhead, large read / writes are needed, like 100MB if you have enough ram (16 input blocks, 1 output block, 1.7GB of buffer space).
For SSDs faster transfer rate, a smaller than 16 k-way merge may be best. For the very fast SAS or NVMe SSDs with read rates at 2GB/S and write rates over 1GB/S, a 2 or 4 way merge without heap may be all that is possible while keeping the drives close to I/O bound. For SATA SSDs with read write rates a bit over 500MB/S, something from a 6 to 16 way merge may be best.

Related

Number of seeks() when reading a file in Hadoop?

I want to make sure I am getting this concept right:
In Hadoop the Definite Guide it is stated that: "the goal while designing a file system is always to reduce the number of seeks in comparison to the amount of data to be transferred." In this statement the author is referring to the "seeks()" of Hadoop logical blocks, right?
I am thinking that no matter how big the Hadoop block size is (64MB or 128MB or bigger) the number of seeks of the physical blocks (which are usually 4KB or 8KB) that the underlying filesystem (e.g. ext3/fat) will have to perform will be the same no matter the size of Hadoop block size.
Example: To keep numbers simple, assume underlying file system block size is 1MB. We want to read a file of size 128MB. If the Hadoop block size is 64MB, the file occupies 2 blocks. When reading there are 128 seeks. if the Hadoop block size is increased to 128MB, the number of seeks performed by the files system is still 128. In the second case, Hadoop will perform 1 seek instead of 2.
Is my understanding correct?
If I am correct, a substantial performance improvement by increasing block size will only be observed only for very large files, right? I am thinking that in the case of files that are in the 1~GB size range, reducing the number of seeks from ~20 seeks (64MB block size) to ~10 seek (128MB block size) shouldn't make much of a difference, right?

You are correct that increasing the file system block size will improve performance. Linux requires that the block size be less than or equal to the page size. The x86 page size is limited to 4K; therefore, the largest block size that you can use is 4K even if the file system can support larger block sizes. The performance benefits of a large block size and page size are significant: reduction in read/write system calls, reduction in rotational delays and seeks (don't begin to consider SSDs), fewer context switches, improved cache locality, fewer TLB misses, etc. This is all goodness.
I analytically modeled the benefits of various block sizes based on our disk usage pattern and in some cases predicted order of magnitude improvements from the disk subsystem. This would shift the performance bottleneck elsewhere.
You are correct that substantial performance gains are possible. Unfortunately, a certain engineer who controls such improvements sees no value in page sizes larger than 4K. He mocks enterprise users who need high perform from largely homogeneous workloads on big iron and focuses on heterogeneous workloads that are interactively run on desktop or laptop systems where high performance is unimportant.

Constant Write Speed to Disk

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.
Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.
Is there any way to write to disk in a pattern (or some other way) to ensure a constant write speed close to the average rate? Regardless of how much data there currently is on the disk.
EDIT:
Some notes on why I think this should be possible.
When usually writing to the disk, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half the data to the slow part (i.e. for 1 second it could write 50MB to the fast part and 50MB to the slow part), they should meet in the middle. I could possibly achieve a constant rate?
As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.

If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting 100MB/s average write speed over the entire disk that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.
The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not a RTOS, there are not guarantees as far as timing. Again this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.
If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacture so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.
To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.

Partition the disk into two equally sized partitions. Write a few seconds worth of data alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering the less you have to seek.
Use more partitions to increase the averaging effect.

I fear this may be more difficult than you realize:
If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
An even more important value is the drives actual measured performance. I would make a simple application similar to your use case that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get a more realistic average/minimum/maximum write speed value...it will likely be lower than you'd expect.
As you noted, even if your "real" average write speed is higher than 100 MB/s you run into issues if you run into slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.
A few specific things I would look at to solve your problem:
Measure the "real" write performance of the drive to see if its fast enough. This gives you an idea of how far behind you actually are.
Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
Get faster drives (either HDD or SDD). It is fine to use the manufacturer's write speeds as an initial guide but test them thoroughly as well.
Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.

You could implement the strategy of alternating writes bewteen the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\.\PhysicalDriveX" and control where it writes. For more info see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:
Create an empty, non-sparse file that's as large as the filesystem will fit.
Obtain a mapping from the logical file positions to disk blocks.
Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course some blocks are unavailable due to filesystem's own use.
At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonous to the physical cylinder number. IOW if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA to physical mapping order).
Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself. You need to fill the entire disk with data, by repeatedly writing a zero page to the disk, going block by block, and divide the total amount of data written by the amount it took. You need to be accessing the disk or the file in the direct mode. This should disable the OS buffering for the file data, and not for the filesystem metadata (if not using a raw disk).
At this point, all you need to do is to write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek taking 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, with 1 seek/s on average for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwith_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.
Alas, if only things where this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must also take a note of write speed over smaller sections of the disk, say every 1% of the disk. This way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at 0% and 99% positions on the disk, and the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally it can grow by itself if latencies you measure while your software runs turn out higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.

Another possible option is to destroke (or short stroke) a hard drive. If you start with a 4TB or larger drive and destroke it to 2TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be getting the software that issues vendor unique commands to a hard drive to destroke it.

Durably reordering values on disk

I have a large number (100s of millions) of fixed-size values stored in a random order on a disk. I have the same set of values stored in memory, in a different order. I need to store the values in the order they are in memory, on disk. The challenge is this: I need to keep at least one copy of each value on disk at any one time – i.e. it needs to be durable.
I have quite a bit of RAM to work with (the values take up only about 60%), a lot of ephemeral storage, but only a very small amount of space on the durable disk, enough for less than a million of the values.
Given a value on disk, I can find it in memory very very quickly. But the converse is not true, given a value in memory, it is very slow to find it on disk.
Given these limitations, what's the best algorithm to transfer the order of the values from memory, to disk, as fast as possible?

Sounds like you have a sorting problem, where your comparator is the order of elements in RAM (element x is 'bigger' than element y, if x appears after y in RAM).
It can be solved using an external sort.
Note that if you allow duplicates, some more processing needs to be done in order to make sure your comparator is valid (can be solved by enumerating the identical values, and assigning a 'dupe_id' to each duplicate - both in RAM and on disk)

five minutes rules - the price of one access of disc I/O

This is very interesting topic, they use following formula to compute access interval time:
BreakEvenIntervalinSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM).
It is derived using formulas for the cost of RAM to hold a page in the buffer pool and the cost of a (fractional) disk to perform I/O every time a page is needed, equating these two costs, and solving the equation for the interval between accesses.
so the cost of disc I/O per access is PricePerDiskDrive / AccessesPerSecondPerDisk, My question is why disc I/O cost per access is computed like this?

The underlying assumption is that the limit to the life of a disk is how many disk seeks there are, while RAM has a fixed cost for its size, and a fixed lifetime regardless of how often it is accessed. This is reasonable because seeking to disk causes physical wear and tear, and when the disk goes, you lose the whole disk. By contrast RAM has no physical moving parts, and so does not wear out with use.
With that assumption, the cost of keeping data on disk depends on the frequency of access and the cost of the disk. The cost of keeping data in RAM depends on how much RAM you're using. What they are trying to find is the break even point between where it is cheaper to keep data on disk or in RAM.
However the equation as given is incomplete. While that equation identifies relevant factors, there is an important constant of proportionality missing. How many accesses can the average hard drive sustain? How long does RAM last on average? Those enter into the costs for keeping data on hard drives and RAM, and without them you are comparing apples and oranges.
This is indicative of my impression of the whole paper. It says a lot at great length, about an important topic, but the analysis is sloppy. They are slopping and leave critical things out, and don't do enough to help people understand what they are thinking and when their analysis is appropriate what you are doing. For instance if you are trying to maintain a low latency system, you have to keep all of your data in RAM. Period. If you're processing large data sets and don't want to pay to keep it all in RAM, then you will be streaming data to/from disk. If you're keeping data in a redundant format, for instance RAID, you are doing more seeks per read than they admit.

Compression to Improve Hard Disk Write Performance

On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working with where a program serially generates and dumps around 1-2GB of text logging data to a raw text file on the hard disk and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.

Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it, time a huge fwrite or something.) Let's say 100mb/s. Now take your CPU speed in megahertz; let's say 3Ghz = 3000mhz. Divide the CPU speed by the disk write speed. That's the number of cycles that the CPU is spending idle, that you can spend per byte on compression. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could compress your data by 25% for an effective 125mb/s write speed, you would have 24 cycles per byte to run it in and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
We do this all the time when reading optical media.
If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.

Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression on lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.

For what it is worth Sun's filesystem ZFS has the ability to have on the fly compression enabled to decrease the amount of disk IO without a significant increase in overhead as an example of this in practice.

The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on the
used compression algorithm and settings,
the file workload and
the characteristics of your machine.
For example, in the measurements from the paper, using a textual workload and a server environment using lzop with low compression effort are faster than plain write, but bzip and gz aren't.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.

CPUs have grown faster at a faster rate than hard drive access. Even back in the 80's a many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That will not have changed.
Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.
As to the usefulness of a second core only counts if the system will be also doing a significant number of other things - and your program would have to be multi-threaded to take advantage of the additional CPU.

Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.

Windows already supports File Compression in NTFS, so all you have to do is to set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.

This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates in the 10's of MBytes/second this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely have to have easily compressible data and would just have to benchmark it with some test of reasonableness type experiments and find out.
Relative to a specific opinion (guess!?) to the point about additional cores. If you thread up the compression of the data and keep the core(s) fed - with the high compression ratio of text, it is likely such a technique would bear some fruit. But this is just a guess. In a single threaded application alternating between disk writes and compression operations, it seems much less likely to me.

If it's just text, then compression could definitely help. Just choose an compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2" and both have parameters that you can tweak to favor speed or compression ratio.

If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core, and a relatively fast 100 MB/s streaming hard drive,
halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data.
With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there is a a few cycles available (not much) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.)
then you have even more time for a more sophisticated compressor to ponder the data.
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.

If this particular machine is often IO bound,
another way to speed it up is to install a RAID array.
That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with same 4 total disks, gives all a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).

This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible,
However, as Shog9 commented:
Rules of thumb aren't going to help
you here. It's your disk, your CPU,
and your data. Set up a test case and
measure throughput and CPU load with
and without compression - see if it's
worth the tradeoff.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio