Algorithms for Optimization with Fast Disk Storage (SSDs)?

Given that Solid State Disks (SSDs) are decreasing in price and will soon become more prevalent as system drives, and given that their access rates are significantly higher than those of rotating magnetic media, what standard algorithms will gain in performance from the use of SSDs for local storage? For example, the high random read speed of SSDs makes a disk-based hashtable viable for large hashtables; 4GB of disk space is readily available, which makes hashing to the entire range of a 32-bit integer viable (more for lookup than population, though, which would still take a long time). While a hashtable of this size would be prohibitive to work with on rotating media due to the access speed, it shouldn't be as much of an issue with SSDs.
Are there any other areas where the impending transition to SSDs will provide potential gains in algorithmic performance? I'd rather see reasoning as to how one thing will work rather than opinion; I don't want this to turn contentious.

Your example of hashtables is indeed the key database structure that will benefit. Instead of having to load a whole 4GB or more file into memory to probe for values, the SSD can be probed directly. The SSD is still slower than RAM, by orders of magnitude, but it's quite reasonable to have a 50GB hash table on disk, but not in RAM unless you pay big money for big iron.
An example is chess position databases. I have over 50GB of hashed positions. There is complex code to try to group related positions near each other in the hash, so I can page in 10MB of the table at a time and hope to reuse some of it for multiple similar position queries. There's a ton of code and complexity to make this efficient.
After switching to an SSD, I was able to drop all the complexity of the clustering and just use really dumb randomized hashes. I also got an increase in performance, since I only fetch the data I need from the disk, not big 10MB chunks. The latency is indeed larger, but the net speedup is significant, and the super-clean code (20 lines, not 800+) is perhaps even nicer.
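To make the "dumb randomized hash" idea concrete, here is a minimal sketch of a disk-resident hash table that probes the file directly instead of paging in large chunks. It is not the code from the chess database; the slot layout, the table size, the use of key 0 as an "empty" marker, and the function names are all assumptions made for illustration.

```python
import hashlib
import struct

SLOT = struct.Struct("<QQ")   # assumed layout: 8-byte key, 8-byte value
NUM_SLOTS = 1 << 16           # assumed table size, fixed up front
EMPTY_KEY = 0                 # assumption: real keys are never 0


def slot_index(key: int) -> int:
    # Randomized hash with no attempt to keep related keys close together.
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % NUM_SLOTS


def create_table(path: str) -> None:
    # Pre-size the file with zeroed (empty) slots.
    with open(path, "wb") as f:
        f.write(bytes(NUM_SLOTS * SLOT.size))


def put(f, key: int, value: int) -> None:
    # Open addressing with linear probing; each probe is one small read.
    i = slot_index(key)
    for _attempt in range(NUM_SLOTS):
        f.seek(i * SLOT.size)
        stored_key, _old_value = SLOT.unpack(f.read(SLOT.size))
        if stored_key in (EMPTY_KEY, key):
            f.seek(i * SLOT.size)
            f.write(SLOT.pack(key, value))
            return
        i = (i + 1) % NUM_SLOTS
    raise RuntimeError("table full")


def get(f, key: int):
    # Returns the stored value, or None if the key is absent.
    i = slot_index(key)
    for _attempt in range(NUM_SLOTS):
        f.seek(i * SLOT.size)
        stored_key, value = SLOT.unpack(f.read(SLOT.size))
        if stored_key == key:
            return value
        if stored_key == EMPTY_KEY:
            return None
        i = (i + 1) % NUM_SLOTS
    return None
```

The file would be opened with `open(path, "r+b")`. On a rotating disk every probe costs a seek, which is why the clustering code existed in the first place; on an SSD the same probes are cheap enough that the simple version can win.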

SSDs are only significantly faster for random access. For sequential access they are only about twice as fast as mainstream rotational drives. Many SSDs also perform worse than rotational drives in some scenarios, as described here.
While SSDs do move the needle considerably, they are still much slower than CPU operations and physical memory. For your 4GB hash table example, you may be able to sustain 250+ MB/s off of an SSD for accessing random hash table buckets. For a rotational drive, you'd be lucky to get beyond single-digit MB/s. If you can keep this 4GB hash table in memory, you could access it on the order of gigabytes a second - much faster than even a very swift SSD.
The referenced article lists several changes MS made for Windows 7 when running on SSDs, which can give you an idea of the sort of changes you could consider making. First, SuperFetch for prefetching data off of disk is disabled - it's designed to get around the slow random access times of disks, which are alleviated by SSDs. Defrag is disabled, because having files scattered across the disk isn't a performance hit for SSDs.

In short: any algorithm you can think of which requires lots of random disk I/O (random being the key word, since it throws the principle of locality to the wind and thus eliminates the usefulness of a lot of the caching that goes on).
I could see certain database systems gaining from this though. MySQL, for instance using the MyISAM storage engine (where data records are basically glorified CSVs). However, I think very large hashtables are going to be your best bet for good examples.

SSDs are a lot faster for random reads, a bit faster for sequential reads, and probably slower for writes (random or not).
So a disk-based hashtable is probably not useful with an SSD, since updating it now takes significant time, although searching the disk becomes very cheap (compared to a normal HDD).

Don't kid yourself. SSDs are still a whole lot slower than system memory. Any algorithm that chooses to use system memory over the hard disk is still going to be much faster, all other things being equal.

Related

What are the trade-offs of larger cache memories? Could we use one to replace secondary memory?

What are the disadvantages of using larger cache memories? Could we just use a large enough cache memory so that a secondary memory wouldn't be needed at all? I understand that the most compelling arguments are related to its cost and the problem of its size. But if we assume that creating such a cache memory is possible, would it encounter any additional problems?

There would be many problems even if it were not expensive.
Size will degrade the performance
Cache is fast because it’s very small compared to the main memory, and hence it takes only a small amount of time to search it. If you build a large cache, it will not be able to perform at the same speed as its smaller counterpart.
Larger die area
Most DRAM chips only require a capacitor and a transistor to store a bit. SRAM, on the other hand, requires at least 6 transistors to make a single memory cell, which requires more area.
High power requirements
Because it has more transistors, SRAM requires more power to operate. That in turn generates more heat, so you will have to handle the cooling problem.
So as you can see, it’s not worth the effort, given that today’s computers already achieve a 90% hit ratio most of the time.

SAN Performance

I have a question regarding SAN performance, specifically on an EMC VNX SAN. I have a significant number of processes spread over a number of blade servers, running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one of 3KB and one of 30KB. There are about 20 million files to be processed. The processes are running on Windows Server on VMware. The way this was originally set up was 1TB LUNs on the SAN bundled into a single 15TB drive in VMware and then shared as a network share from one Windows instance to all the processes. The processes run concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through a Windows share at the same time, and the SAN is not handling it too well. I'm looking for suggestions to improve performance.

With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA drives and configuring them as RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive starts to slow down at about 75 IO operations per second (IOPS). If you've got big drives - 3TB for example - that's 25 IOPS per terabyte. As a rough rule of thumb, allow 200 IOPS per drive for FC/SAS and 1500 for SSD.
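The "sums on the performance" are simple enough to script. The sketch below uses only the rule-of-thumb figures above; the target of 400 IOPS is a made-up number for illustration, and the calculation deliberately ignores RAID write penalties and array caching.

```python
import math


def iops_per_terabyte(iops_per_drive: float, drive_size_tb: float) -> float:
    # How much random IO each terabyte of capacity can actually deliver.
    return iops_per_drive / drive_size_tb


def drives_needed(target_iops: float, iops_per_drive: float) -> int:
    # Spindle count needed to sustain a random-IO target (no RAID penalty).
    return math.ceil(target_iops / iops_per_drive)


print(iops_per_terabyte(75, 3))    # 3TB SATA drive
print(drives_needed(400, 75))      # hypothetical 400 IOPS target on SATA
print(drives_needed(400, 200))     # same target on FC/SAS
```

Sizing by IOPS rather than by terabytes is the point: a handful of big SATA drives can hold the data and still be unable to serve it.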
Are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically SAN is accessed using a Fiber Channel network. There's a whole bunch of technical differences with 'real' networks, but they don't matter to you - but contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
What sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? It sounds like you have a single VM with 15TB presented, and that VM is handling the IO. That's a bottleneck right there. How many IOPS is the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass-through LUN, or a VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
Fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it. For CIFS shares a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll go a LOT faster. You may find that if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client).
Basically the 'lots of small random IO operations' especially on slow disks is really the worst case for storage, because none of the clever tricks for optimization work.
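As a rough illustration of the "archive into large chunks" suggestion, the sketch below concatenates many small files into one chunk file and keeps an index of offsets, so a client reads one large sequential object instead of issuing thousands of tiny random reads. The function names and the in-memory index are assumptions for the example; a real setup might equally use tar or zip archives.

```python
import os


def pack_chunk(paths, chunk_path):
    """Concatenate small files into one chunk; return {name: (offset, length)}."""
    index = {}
    with open(chunk_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                data = src.read()
            index[os.path.basename(p)] = (out.tell(), len(data))
            out.write(data)
    return index


def read_member(chunk_path, index, name):
    """Fetch one original file's bytes back out of the chunk."""
    offset, length = index[name]
    with open(chunk_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

This only helps if files that are processed together end up in the same chunk, which is the "predictable fetch pattern" condition mentioned above.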

Constant Write Speed to Disk

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.
Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.
Is there any way to write to disk in a pattern (or some other way) to ensure a constant write speed close to the average rate? Regardless of how much data there currently is on the disk.
EDIT:
Some notes on why I think this should be possible.
Normally, when writing to the disk, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half the data to the slow part (i.e. for 1 second it could write 50MB to the fast part and 50MB to the slow part), they should meet in the middle. Could I possibly achieve a constant rate that way?
As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.

If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting 100MB/s average write speed over the entire disk that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.
The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
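The buffer-size tradeoff can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, assuming the disk drops to a constant lower rate during the slowdown; the 80 MB/s figure and the buffer sizes in the example calls are illustrative.

```python
def buffer_needed_mb(incoming_mb_s, disk_mb_s, slowdown_seconds):
    # RAM needed to absorb a slowdown of the given length without dropping data.
    deficit = max(0.0, incoming_mb_s - disk_mb_s)
    return deficit * slowdown_seconds


def seconds_until_overflow(buffer_mb, incoming_mb_s, disk_mb_s):
    # How long a buffer of the given size survives while the disk is too slow.
    deficit = incoming_mb_s - disk_mb_s
    return float("inf") if deficit <= 0 else buffer_mb / deficit


print(buffer_needed_mb(100, 80, 60))          # one minute at 80 MB/s
print(seconds_until_overflow(2048, 100, 80))  # 2GB buffer at 80 MB/s
```

A buffer only covers temporary dips. If the slow region lasts for the final hundreds of gigabytes, no realistic amount of RAM closes the gap, which is the point made above about the device simply not being fast enough.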
Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, and write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not an RTOS; there are no guarantees as far as timing goes. Again, this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.
If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacturer, so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.
To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.

Partition the disk into two equally sized partitions. Write a few seconds worth of data alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering the less you have to seek.
Use more partitions to increase the averaging effect.
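A minimal sketch of the alternating scheme, using two ordinary file objects as stand-ins for the two partitions. The function names are hypothetical, and the reader assumes every chunk except possibly the last has the same size, so the stream can be reassembled without an index.

```python
def usable_fraction(seek_ms, seeks_per_second):
    # Share of each second left for writing after paying for seeks.
    return 1.0 - (seek_ms * seeks_per_second) / 1000.0


def write_alternating(chunks, fast_file, slow_file):
    # Even-numbered chunks go to the fast region, odd-numbered to the slow one.
    for i, chunk in enumerate(chunks):
        target = fast_file if i % 2 == 0 else slow_file
        target.write(chunk)


def read_alternating(fast_file, slow_file, chunk_size):
    # Yield chunks back in their original order.
    files = (fast_file, slow_file)
    i = 0
    while True:
        chunk = files[i % 2].read(chunk_size)
        if not chunk:
            return
        yield chunk
        i += 1
```

With a 10ms seek once per second, `usable_fraction(10, 1)` comes out at roughly 0.99, matching the ~1% loss estimated above. Larger chunks mean fewer seeks but need more RAM to hold the data while the head is busy elsewhere.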

I fear this may be more difficult than you realize:
If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
An even more important value is the drive's actual measured performance. I would make a simple application similar to your use case that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get a more realistic average/minimum/maximum write speed value... it will likely be lower than you'd expect.
As you noted, even if your "real" average write speed is higher than 100 MB/s you run into issues if you run into slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.
A few specific things I would look at to solve your problem:
Measure the "real" write performance of the drive to see if it's fast enough. This gives you an idea of how far behind you actually are.
Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
Get faster drives (either HDD or SSD). It is fine to use the manufacturer's write speeds as an initial guide but test them thoroughly as well.
Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.
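For the "measure the real write performance" step, a simple benchmark along the following lines could be a starting point. It is only a sketch: it writes through the filesystem to a path you supply, forces data out with `os.fsync` at the end of each segment, and reports MB/s per segment so that the slow regions show up. The function name, block size and segment count are arbitrary choices; a real run would use a size close to the drive's capacity.

```python
import os
import time


def benchmark_writes(path, total_bytes, block_size=1 << 20, segments=10):
    """Write total_bytes to path; return a list of MB/s figures, one per segment.

    Blocks left over when the block count is not divisible by `segments`
    are written but not reported.
    """
    block = bytes(block_size)
    blocks_total = total_bytes // block_size
    per_segment = max(1, blocks_total // segments)
    rates = []
    with open(path, "wb") as f:
        written_in_segment = 0
        start = time.perf_counter()
        for _ in range(blocks_total):
            f.write(block)
            written_in_segment += 1
            if written_in_segment == per_segment:
                f.flush()
                os.fsync(f.fileno())  # make sure the data actually reached the disk
                elapsed = max(time.perf_counter() - start, 1e-9)
                rates.append(per_segment * block_size / elapsed / 1e6)
                written_in_segment = 0
                start = time.perf_counter()
    return rates
```

The minimum of the returned list is the number to compare against the 100 MB/s requirement, not the average.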

You could implement the strategy of alternating writes between the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\\.\PhysicalDriveX" and control where it writes. For more info see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:
Create an empty, non-sparse file that's as large as the filesystem will fit.
Obtain a mapping from the logical file positions to disk blocks.
Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course some blocks are unavailable due to filesystem's own use.
At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonic with respect to the physical cylinder number. In other words, if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA to physical mapping order).
Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself. You need to fill the entire disk with data, by repeatedly writing a zero page to the disk, going block by block, and divide the total amount of data written by the time it took. You need to be accessing the disk or the file in the direct mode. This should disable the OS buffering for the file data, but not for the filesystem metadata (if not using a raw disk).
At this point, all you need to do is to write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek taking 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, with 1 seek/s on average for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwidth_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.
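The seek-budget arithmetic in the previous paragraph, written out as a small sketch. The 15ms seek, 5% budget and 1 seek/s for the drive's own housekeeping are the figures used above; the 100 MB/s source rate in the example call is an illustrative value, and the function names are hypothetical.

```python
import math


def seeks_allowed_per_second(budget_fraction, seek_ms, drive_own_seeks_per_s):
    # Each seek costs seek_ms out of every 1000ms of write time.
    per_seek_fraction = seek_ms / 1000.0
    total = math.floor(budget_fraction / per_seek_fraction + 1e-9)
    return max(0, total - drive_own_seeks_per_s)


def block_size_mb(source_mb_s, seeks_per_s):
    # Data written between two consecutive seeks.
    return source_mb_s / seeks_per_s


seeks = seeks_allowed_per_second(0.05, 15, 1)
print(seeks, block_size_mb(100, seeks))
```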
Alas, if only things were this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must also take a note of write speed over smaller sections of the disk, say every 1% of the disk. This way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at 0% and 99% positions on the disk, and the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
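The proportional split described in the previous paragraph can be computed per section from the benchmark figures. This sketch just applies that formula as stated; the function name is hypothetical. Splitting in proportion to bandwidth means each of the two positions is written for the same amount of time.

```python
def split_fractions(low_bandwidth, high_bandwidth):
    # Share of the incoming data sent to the low and high positions,
    # proportional to each position's measured bandwidth.
    total = low_bandwidth + high_bandwidth
    return low_bandwidth / total, high_bandwidth / total


low_share, high_share = split_fractions(1.5, 0.8)
print(round(low_share, 3), round(high_share, 3))
```

The inputs can be in any unit (multiples of the mean, or MB/s from the per-1% benchmark table), as long as both use the same one.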
The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally it can grow by itself if latencies you measure while your software runs turn out higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.

Another possible option is to destroke (or short stroke) a hard drive. If you start with a 4TB or larger drive and destroke it to 2TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be getting the software that issues vendor unique commands to a hard drive to destroke it.

Five-minute rule - the price of one disk I/O access

This is a very interesting topic. They use the following formula to compute the break-even access interval:
BreakEvenIntervalinSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM).
It is derived using formulas for the cost of RAM to hold a page in the buffer pool and the cost of a (fractional) disk to perform I/O every time a page is needed, equating these two costs, and solving the equation for the interval between accesses.
So the cost of disk I/O per access is PricePerDiskDrive / AccessesPerSecondPerDisk. My question is: why is the disk I/O cost per access computed like this?
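Writing the two costs out side by side may help show where the formula comes from. In this sketch a page that is accessed once every T seconds uses 1/T of an access per second, i.e. a fraction 1/(T × AccessesPerSecondPerDisk) of a disk. The input values in the example are illustrative assumptions, not numbers quoted in this question.

```python
def ram_cost_per_page(price_per_mb_ram, pages_per_mb_ram):
    # Cost of the RAM needed to keep one page in the buffer pool.
    return price_per_mb_ram / pages_per_mb_ram


def disk_cost_per_page(price_per_disk, accesses_per_s_per_disk, interval_s):
    # Cost of the fraction of a disk needed to re-read the page every interval_s.
    return price_per_disk / (accesses_per_s_per_disk * interval_s)


def break_even_interval_s(pages_per_mb_ram, accesses_per_s_per_disk,
                          price_per_disk, price_per_mb_ram):
    # Interval at which the two costs above are equal.
    return (pages_per_mb_ram / accesses_per_s_per_disk) * (price_per_disk / price_per_mb_ram)


# Illustrative inputs: 128 pages/MB, 64 accesses/s/disk, 2000 per disk, 15 per MB of RAM.
t = break_even_interval_s(128, 64, 2000, 15)
print(t, ram_cost_per_page(15, 128), disk_cost_per_page(2000, 64, t))
```

Pages accessed more often than the break-even interval are cheaper to keep in RAM; pages accessed less often are cheaper to re-read from disk.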

The underlying assumption is that the limit to the life of a disk is how many disk seeks there are, while RAM has a fixed cost for its size, and a fixed lifetime regardless of how often it is accessed. This is reasonable because seeking to disk causes physical wear and tear, and when the disk goes, you lose the whole disk. By contrast RAM has no physical moving parts, and so does not wear out with use.
With that assumption, the cost of keeping data on disk depends on the frequency of access and the cost of the disk. The cost of keeping data in RAM depends on how much RAM you're using. What they are trying to find is the break even point between where it is cheaper to keep data on disk or in RAM.
However the equation as given is incomplete. While that equation identifies relevant factors, there is an important constant of proportionality missing. How many accesses can the average hard drive sustain? How long does RAM last on average? Those enter into the costs for keeping data on hard drives and RAM, and without them you are comparing apples and oranges.
This is indicative of my impression of the whole paper. It says a lot, at great length, about an important topic, but the analysis is sloppy. They leave critical things out, and don't do enough to help people understand what they are thinking and when their analysis is appropriate to what you are doing. For instance, if you are trying to maintain a low-latency system, you have to keep all of your data in RAM. Period. If you're processing large data sets and don't want to pay to keep it all in RAM, then you will be streaming data to/from disk. If you're keeping data in a redundant format, for instance RAID, you are doing more seeks per read than they admit.

Given that disk is slow and there are multiple cores, does on-the-fly decompression make sense for performance?

It used to be that disk compression was used to increase storage space at the expense of efficiency but we were all on single processor systems back then.
These days there are extra cores around to potentially do the decompression work in parallel with processing the data.
For I/O bound applications (particularly read heavy sequential data processing) it might be possible to increase throughput by only reading and writing compressed data to disk.
Does anyone have any experience to support or reject this conjecture?

Take care not to confuse disk seek times and disk read rates. It takes millions of CPU cycles (5–10 milliseconds or 5–10 million nanoseconds) to seek to the right track on a hard drive (HDD). Once you're there, you can read tens of megabytes of data per second, assuming low fragmentation. For solid-state drives (SSDs), seek times are lower (35,000–100,000ns) than for HDDs.
Whether or not the data is compressed on the disk, you still have to seek. The question becomes, is (disk read time for compressed data + the decompression time) < (disk read time for uncompressed data). Decompression is relatively fast, since it amounts to replacing a short token with a longer one. In the end, it probably boils down to how well the data was compressed and how big it was in the first place. If you're reading a 2KB compressed file instead of a 5KB original, it's probably not worth it. If you're reading a 2MB compressed file instead of a 25MB original, it likely is.
Measure with a reasonable workload.
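Before measuring, the inequality above can be turned into a rough model to see which regime you are in. This is only a sketch: the seek time, read rate and decompression rate in the example calls are assumed values, and real numbers should come from measurement.

```python
def read_time_s(size_bytes, seek_s, read_bytes_per_s):
    # One seek plus a sequential read of size_bytes.
    return seek_s + size_bytes / read_bytes_per_s


def time_saved_s(original_bytes, compressed_bytes, seek_s,
                 read_bytes_per_s, decompress_bytes_per_s):
    # Positive result: reading compressed data and decompressing it is faster.
    plain = read_time_s(original_bytes, seek_s, read_bytes_per_s)
    packed = (read_time_s(compressed_bytes, seek_s, read_bytes_per_s)
              + original_bytes / decompress_bytes_per_s)
    return plain - packed


# Assumed: 8ms seek, 50 MB/s read, 500 MB/s decompression output rate.
print(time_saved_s(5_000, 2_000, 0.008, 50e6, 500e6))            # 5KB vs 2KB
print(time_saved_s(25_000_000, 2_000_000, 0.008, 50e6, 500e6))   # 25MB vs 2MB
```

Under these assumptions the small file saves a few tens of microseconds because the seek dominates, while the large file saves a large fraction of a second, which matches the intuition in the answer.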

Yes! In fact, processors are so ridiculously fast now that it even makes sense for memory. (IBM does this, I believe.) I believe some of the current big iron machines even do compression on the CPU cache.

Yes, this makes perfect sense. On NT-based Windows OSes it's widely accepted that enabling NTFS compression can sometimes be faster than disabling it, for precisely this reason. This has been true for years and multicore should only make it more true.

I think it also depends on how aggressive your compression is vs how IO bound you are.
For example, DB2's row compression feature is targeted at IO-bound applications: data warehouses, reporting systems, etc. It uses a dictionary-based algorithm and isn't very aggressive - resulting in 50-80% compression of data (tables and indexes, in storage as well as in memory). However, it also tends to speed queries up by around 10%.
They could have gone with much more aggressive compression, but then would have taken a performance hit.
