Performance testing: how to make HDD operation performance stable

I am trying to simulate application load to measure application performance. Dozens of clients send requests to a server, and a significant part of request processing is reading random data from the HDD (a random file, at a random offset).
I use 15 GB spread across 400 files.
The HDD does its best to cache read operations, so overall performance is very unstable from run to run (±5-10%).
In order to minimize the effect of the HDD's internal optimizations, I am thinking of putting the data on a dedicated physical HDD, creating the random files before every test run, using the same random file access sequence (the same sequence of files and offsets), running the test, and then formatting the HDD at the end. I assume this will clear all internal HDD caches and file access predictions.
What should I do to minimize the dispersion of the performance results? Is there a simpler (or maybe more appropriate) way to get stable performance results?
Thank you in advance!

Essentially all modern hard drives do include caching. It seems to me that results without a cache might be more uniform, but would be uniformly meaningless.
In any case, there are commands to disable caching on most drives (but, if memory serves, they're probably extensions, not part of the standard, so you'd have to implement them specifically for a particular target drive).
OTOH, given that you want to simulate something that isn't how a real hard drive (normally) works, I'd consider writing it as a complete software simulation -- e.g., have some sort of hard-drive class that keeps a "current track", with commands to read and write data, seek to another track, etc. The class would keep track of things like the amount of (virtual) time consumed for each operation.
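A minimal sketch of such a simulation class, assuming a simplified seek + rotation + transfer cost model (every constant below is invented for illustration):

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>

    // Toy hard-drive model: charges virtual time for seeks, rotational
    // latency and transfer. All constants are illustrative, not measured.
    class HardDriveSim {
    public:
        HardDriveSim(int tracks, double seekPerTrackMs, double rotLatencyMs,
                     double transferMsPerKiB)
            : tracks_(tracks), seekPerTrackMs_(seekPerTrackMs),
              rotLatencyMs_(rotLatencyMs), transferMsPerKiB_(transferMsPerKiB) {}

        // Move the head; cost is proportional to the distance travelled.
        void seek(int track) {
            track = std::min(track, tracks_ - 1);
            elapsedMs_ += std::abs(track - currentTrack_) * seekPerTrackMs_;
            currentTrack_ = track;
        }

        // Read `kib` KiB at `track`: seek + average half-rotation + transfer.
        void read(int track, int kib) {
            seek(track);
            elapsedMs_ += rotLatencyMs_ / 2.0 + kib * transferMsPerKiB_;
        }

        double elapsedMs() const { return elapsedMs_; }

    private:
        int tracks_;
        int currentTrack_ = 0;
        double seekPerTrackMs_, rotLatencyMs_, transferMsPerKiB_;
        double elapsedMs_ = 0.0;
    };

    int main() {
        HardDriveSim hdd(100000, 0.0001, 8.3, 0.01);
        std::srand(42);                          // fixed seed => repeatable runs
        for (int i = 0; i < 1000; ++i)
            hdd.read(std::rand() % 100000, 64);  // 1000 random 64 KiB reads
        std::printf("virtual time: %.1f ms\n", hdd.elapsedMs());
    }

Because each run replays the same access sequence against the same cost model, the reported virtual time is exactly repeatable, which is the property the question is after.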

Related

SAN Performance

Have a question regarding SAN performance, specifically an EMC VNX SAN. I have a significant number of processes spread over a number of blade servers running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one 3KB and one 30KB. There are millions of files to be processed (around 20 million). The processes are running on Windows Server on VMware. The way this was originally set up was 1TB LUNs on the SAN bundled into a single 15TB drive in VMware and then shared as a network share from one Windows instance to all the processes. The processes run concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through the Windows share at the same time, and the SAN is not handling it too well. I'm looking for suggestions to improve performance.
With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA drives and configuring them as RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive tops out at about 75 IO operations per second. If you've got big drives - 3TB for example - that's 25 IOPs per terabyte. As a rough rule of thumb, 200 per drive for FC/SAS and 1500 for SSD.
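To make that arithmetic concrete, a throwaway sketch using the rule-of-thumb figures above (the workload target is invented):

    #include <cstdio>

    int main() {
        const double sataIops = 75, fcSasIops = 200, ssdIops = 1500;
        const double capacityTb = 3.0;       // per-drive capacity
        const double targetIops = 10000;     // hypothetical workload target

        std::printf("SATA: %.0f IOPs/TB, need %.0f drives for %.0f IOPs\n",
                    sataIops / capacityTb, targetIops / sataIops, targetIops);
        std::printf("FC/SAS: need %.0f drives\n", targetIops / fcSasIops);
        std::printf("SSD: need %.0f drives\n", targetIops / ssdIops);
    }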
Are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically a SAN is accessed over a Fibre Channel network. There's a whole bunch of technical differences from 'real' networks, but they don't matter to you - contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
What sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? It sounds like you've got a single VM with 15TB presented, and that one VM is handling all the IO. That's a bottleneck right there. How many IOPs is the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass-through LUN, or a VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
Fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it; for CIFS shares, a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll go a LOT faster. You may find that if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client). A sketch of this idea follows below.
Basically, 'lots of small random IO operations', especially on slow disks, is the worst case for storage, because none of the clever optimization tricks work.
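To illustrate the 'archive into chunks' suggestion above, a minimal packing sketch with a length-prefixed layout (directory and file names are hypothetical):

    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Pack every regular file in `dir` into one chunk file, so a client can
    // fetch the whole chunk with a single large sequential read.
    // Layout per entry: u32 name length, name bytes, u64 data length, data.
    int main() {
        const fs::path dir = "small_files";       // hypothetical input directory
        std::ofstream chunk("archive.chunk", std::ios::binary);

        for (const auto& e : fs::directory_iterator(dir)) {
            if (!e.is_regular_file()) continue;
            std::ifstream in(e.path(), std::ios::binary);
            std::vector<char> data((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());
            std::string name = e.path().filename().string();
            std::uint32_t nlen = static_cast<std::uint32_t>(name.size());
            std::uint64_t dlen = data.size();
            chunk.write(reinterpret_cast<const char*>(&nlen), sizeof nlen);
            chunk.write(name.data(), static_cast<std::streamsize>(nlen));
            chunk.write(reinterpret_cast<const char*>(&dlen), sizeof dlen);
            chunk.write(data.data(), static_cast<std::streamsize>(dlen));
        }
        std::cout << "wrote archive.chunk\n";
    }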

Constant Write Speed to Disk

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.
Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.
Is there any way to write to the disk in a pattern (or some other way) that ensures a constant write speed close to the average rate, regardless of how much data is currently on the disk?
EDIT:
Some notes on why I think this should be possible.
When writing to the disk normally, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half to the slow part (i.e. for 1 second it could write 50MB to the fast part and 50MB to the slow part), the two write positions should meet in the middle, and I could possibly achieve a constant rate.
As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.
If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting a 100MB/s average write speed over the entire disk from a drive that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.
The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, and write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not an RTOS; there are no guarantees as far as timing. Again, this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.
If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacturer, so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the other end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.
To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.
Partition the disk into two equally sized partitions. Write a few seconds' worth of data, alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms, which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering, the less you have to seek.
Use more partitions to increase the averaging effect.
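A bare-bones sketch of that alternating idea, using two write cursors within one file instead of real partitions (sizes and the file name are made up; a real implementation would write to the raw partitions with much larger buffers):

    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main() {
        const std::streamoff blockBytes = 4 * 1024 * 1024; // 4 MiB per write
        const std::streamoff halfBlocks = 256;             // 1 GiB per half
        std::vector<char> block(static_cast<std::size_t>(blockBytes), 'x');

        std::ofstream out("target.bin", std::ios::binary);
        std::streamoff lo = 0;                  // cursor in the fast front half
        std::streamoff hi = 2 * halfBlocks - 1; // cursor in the slow back half

        // Alternate one block at the front with one at the back, so the fast
        // (outer) and slow (inner) disk regions average each other out.
        while (lo < halfBlocks) {
            out.seekp(lo++ * blockBytes);
            out.write(block.data(), blockBytes);
            out.seekp(hi-- * blockBytes);
            out.write(block.data(), blockBytes);
        }
        std::printf("done\n");
    }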
I fear this may be more difficult than you realize:
If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
An even more important value is the drive's actual measured performance. I would write a simple application, similar to your use case, that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get realistic average/minimum/maximum write speed values... they will likely be lower than you'd expect.
As you noted, even if your "real" average write speed is higher than 100 MB/s, you will still have issues if you hit slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.
A few specific things I would look at to solve your problem:
Measure the "real" write performance of the drive to see if its fast enough. This gives you an idea of how far behind you actually are.
Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
Get faster drives (either HDD or SDD). It is fine to use the manufacturer's write speeds as an initial guide but test them thoroughly as well.
Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.
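A minimal version of that fill-the-drive benchmark might look like the sketch below (the target path and chunk size are placeholders; for honest numbers you'd also bypass the OS cache with FILE_FLAG_NO_BUFFERING on Windows or O_DIRECT on Linux, which this sketch only approximates by flushing):

    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main() {
        const char* target = "D:/bench.bin";   // hypothetical dedicated drive
        const std::size_t chunkMiB = 256;      // report a speed every 256 MiB
        std::vector<char> buf(1024 * 1024, 'x');

        std::ofstream out(target, std::ios::binary);
        for (int chunk = 0; out; ++chunk) {
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < chunkMiB && out; ++i)
                out.write(buf.data(), static_cast<std::streamsize>(buf.size()));
            out.flush();
            std::chrono::duration<double> dt =
                std::chrono::steady_clock::now() - t0;
            // When the drive finally fills up, the stream errors out and the
            // loop ends; the final (partial) chunk's figure will be off.
            std::printf("chunk %d: %.1f MB/s\n", chunk,
                        static_cast<double>(chunkMiB) / dt.count());
        }
    }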
You could implement the strategy of alternating writes between the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\\.\PhysicalDriveX" and control where it writes. For more info see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
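A hedged sketch of opening a physical drive that way (the drive number and offset are placeholders; with FILE_FLAG_NO_BUFFERING both the buffer and the file offset must be sector-aligned, the program must run elevated, and writing to the wrong drive destroys data):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // CAUTION: raw writes bypass the filesystem entirely.
        // Triple-check the drive number before running anything like this.
        HANDLE h = CreateFileA("\\\\.\\PhysicalDrive1",  // hypothetical data disk
                               GENERIC_WRITE,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                               OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) { std::printf("open failed\n"); return 1; }

        const DWORD sector = 512;    // query IOCTL_DISK_GET_DRIVE_GEOMETRY
                                     // for the real sector size in practice
        // VirtualAlloc returns page-aligned memory, which satisfies the
        // sector alignment that FILE_FLAG_NO_BUFFERING demands.
        void* buf = VirtualAlloc(nullptr, sector, MEM_COMMIT | MEM_RESERVE,
                                 PAGE_READWRITE);

        LARGE_INTEGER pos;
        pos.QuadPart = 1024LL * 1024 * 1024;   // write at the 1 GiB mark
        SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);

        DWORD written = 0;
        WriteFile(h, buf, sector, &written, nullptr);
        std::printf("wrote %lu bytes\n", written);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
    }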
First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:
Create an empty, non-sparse file that's as large as the filesystem will fit.
Obtain a mapping from the logical file positions to disk blocks.
Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course, some blocks will be unavailable due to the filesystem's own use.
At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonic with respect to the physical cylinder number. In other words, if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA-to-physical mapping order).
Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself: fill the entire disk with data by repeatedly writing a zero page to it, going block by block, and divide the total amount of data written by the time it took. You need to access the disk or the file in direct mode. This disables OS buffering for the file data, but not for the filesystem metadata (if you're not using a raw disk).
At this point, all you need to do is write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek takes 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, with 1 seek/s on average for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwidth_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.
Alas, if only things were this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must therefore also take note of the write speed over smaller sections of the disk, say every 1% of the disk. That way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at the 0% and 99% positions on the disk, the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
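As a tiny sketch, computing that split from the measured per-section bandwidths (the numbers mirror the example above and are purely illustrative):

    #include <cstdio>

    int main() {
        const double mean = 120.0;        // desired mean bandwidth, MB/s
        const double bwLow = mean * 1.5;  // measured at the fast (low) position
        const double bwHigh = mean * 0.8; // measured at the slow (high) position

        // Split incoming data proportionally to each region's bandwidth, so
        // both write positions stay busy and the combined rate equals the sum.
        const double fracLow = bwLow / (bwLow + bwHigh);
        std::printf("low region: %.1f%%, high region: %.1f%%\n",
                    fracLow * 100.0, (1.0 - fracLow) * 100.0);
        // With these numbers: ~65.2% to the fast region, ~34.8% to the slow one.
    }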
The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally, it can grow by itself if the latencies you measure while your software runs turn out to be higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.
Another possible option is to destroke (or short-stroke) a hard drive. If you start with a 4TB or larger drive and destroke it to 2TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be obtaining software that issues the vendor-unique commands needed to destroke the drive.

Read files by device/inode order?

I'm interested in an efficient way to read a large number of files from disk. I want to know whether sorting the files by device and then by inode will give me some speed improvement over reading them in their natural order.
There are vast speed improvements to be had from reading files in physical order from rotating storage. Operating system I/O scheduling mechanisms only do any real work if there are several processes or threads contending for I/O, because they have no information about what files you plan to read in the future. Hence, other than simple read-ahead, they usually don't help you at all.
Furthermore, Linux worsens your access patterns during directory scans by returning directory entries to user space in hash table order rather than physical order. Luckily, Linux also provides system calls to determine the physical location of a file, and whether or not a file is stored on a rotational device, so you can recover some of the losses. See for example this patch I submitted to dpkg a few years ago:
http://lists.debian.org/debian-dpkg/2009/11/msg00002.html
This patch does not incorporate a test for rotational devices, because this feature was not added to Linux until 2012:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ef00f59c95fe6e002e7c6e3663cdea65e253f4cc
I also used to run a patched version of mutt that would scan Maildirs in physical order, usually giving a 5x-10x speed improvement.
Note that inodes are small, heavily prefetched and cached, so opening files to get their physical location before reading is well worth the cost. It's true that common tools like tar, rsync, cp and PostgreSQL do not use these techniques, and the simple truth is that this makes them unnecessarily slow.
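For illustration, a rough sketch of the sorting step using the Linux FIBMAP ioctl, which returns the physical block backing a logical block of a file (FIBMAP usually requires root; FIEMAP is the newer alternative, and a real version would compute each file's key once instead of inside the comparator):

    #include <algorithm>
    #include <cstdio>
    #include <fcntl.h>
    #include <linux/fs.h>     // FIBMAP
    #include <string>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <vector>

    // Return the physical block number of the file's first logical block,
    // or 0 if the lookup fails (sparse start, no permission, etc).
    static long firstPhysicalBlock(const std::string& path) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) return 0;
        int block = 0;                    // logical block 0 in, physical out
        long phys = (ioctl(fd, FIBMAP, &block) == 0) ? block : 0;
        close(fd);
        return phys;
    }

    int main(int argc, char** argv) {
        std::vector<std::string> files(argv + 1, argv + argc);

        // Sort by on-disk position so a reader sweeps the platter once.
        std::sort(files.begin(), files.end(),
                  [](const std::string& a, const std::string& b) {
                      return firstPhysicalBlock(a) < firstPhysicalBlock(b);
                  });

        for (const auto& f : files) std::printf("%s\n", f.c_str());
    }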
Back in the 1970s I proposed to our computer center that reading/writing from/to disk would be faster overall if they organized the queue of disk reads and/or writes so as to minimize seek time. I was told that, based on their own experiments and information from IBM, many studies had been made of several techniques, and the overall throughput of JOBS (not just a single job) was best when disk reads/writes were done in first-come, first-served order. This was an IBM batch system.
In general, optimisation techniques for file access are too tied to the architecture of your storage subsystem for them to be something as simple as a sorting algorithm.
1) You can effectively multiply the read data rate if your files are spread across multiple physical drives (not just partitions) and you read two or more files in parallel from different drives. This is probably the only method that is easy to implement.
2) Sorting the files by name or inode number does not really change anything in the general case. What you'd want is to sort the files by the physical location of their blocks on the disk, so that they can be read with minimal seeking. There are quite a few obstacles however:
Most filesystems do not provide such information to userspace applications, unless it's for debugging reasons.
The blocks of each file can themselves be spread all over the disk, especially on a mostly full filesystem. There is no way to read multiple files sequentially without seeking back and forth.
You are assuming that your process is the only one accessing the storage subsystem. Once there is at least someone else doing the same, every optimisation you come up with goes out of the window.
You are trying to be smarter than the operating system and its own caching and I/O scheduling mechanisms. It's very likely that by trying to second-guess the kernel, i.e. the only one that really knows your system and your usage patterns, you will make things worse.
Don't you think that e.g. PostgreSQL or Oracle would have used a similar technique if they could? When the DB is installed on a proper filesystem, they let the kernel do its thing and don't try to second-guess its decisions. Only when the DB is on a raw device do the specialised optimisation algorithms that take physical blocks into account come into play.
You should also take the specific properties of your storage devices into account. Modern SSDs, for example, make traditional seek-time optimisations obsolete.

Reading in parallel from multiple hard drives

I am writing an application that deals with lots of data (gigabytes). I am considering splitting the data onto multiple hard drives and reading it in parallel. I am wondering what kind of limitations I will run into--for example, is it possible to read from 4 or 8 hard drives in parallel, and will I get approximately 4 or 8 times the performance if disk I/O is the limiting factor? What should I look out for? Pointers to relevant docs are also appreciated--Google didn't turn up much.
EDIT: I should point out that I've looked at RAID, but the performance wasn't as good as I was hoping for. I am planning on writing this myself in C/C++.
Splitting the data and reading from 4 to 8 drives in parallel will not simply multiply throughput by 4 to 8 times. There are other factors you need to consider.
If you are reading the data in the application, you will likely need one thread per hard disk to read from the different disks concurrently (a sketch follows below).
Windows provides overlapped and non-overlapped methods of reading and writing data to the HDD; see if using overlapped I/O increases the throughput. *nix systems likewise offer asynchronous read/write methods.
On a single core/processor, threads appear to run in parallel but actually run sequentially underneath. With multiple cores, multiple threads can read in parallel, but the OS generally decides what to run and when, so having too many reader threads can decrease performance rather than increase it.
If you check the specs of any hard disk, you will see that it lists both a random access time and a sequential access time, so depending on your data layout you will want to check these parameters.
When you split data across different drives, keep in mind that your application will need to reassemble the pieces into meaningful information; if you use threads, they will additionally need to be synchronized.
You may get state-of-the-art hard disks with high read/write speeds, but your other hardware may be the weak link. A low-end motherboard or slow RAM may prevent you from getting the best of those speeds.
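As a sketch of the one-thread-per-disk approach mentioned above (the four paths are placeholders for files that live on different physical drives):

    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <thread>
    #include <vector>

    // Read one file to the end, counting bytes. Each file is assumed to live
    // on a different physical drive so the reads don't contend for one spindle.
    static void drain(const char* path, std::size_t* bytesOut) {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf(1 << 20);                 // 1 MiB read buffer
        std::size_t total = 0;
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
            total += static_cast<std::size_t>(in.gcount());
        *bytesOut = total;
    }

    int main() {
        const char* paths[] = {"/mnt/disk1/part.dat", "/mnt/disk2/part.dat",
                               "/mnt/disk3/part.dat", "/mnt/disk4/part.dat"};
        std::size_t bytes[4] = {};
        std::vector<std::thread> workers;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 4; ++i)
            workers.emplace_back(drain, paths[i], &bytes[i]);
        for (auto& w : workers) w.join();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;

        std::size_t total = bytes[0] + bytes[1] + bytes[2] + bytes[3];
        std::printf("%zu bytes in %.2fs (%.1f MB/s)\n", total, dt.count(),
                    total / dt.count() / 1e6);
    }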
If you're not going to use real RAID, you better at least use multiple hard drive controllers, otherwise you won't see much performance gain at all. One controller can't do lots of concurrent IO so it will quickly become the bottleneck.
It sounds like you are talking about the concept of data striping, which is commonly used in RAID implementations. You may want to look into one of the software RAID solutions available for most operating systems. An added benefit is that you can use parity (the ability to lose a drive and not your data).
This would give you the benefits of RAID without having to try to deal with it yourself. You could do it on a database level as well with data files spread across the drives, but this adds complexity.
You will stream data faster. Drives are only so fast, and if your I/O channel can handle more, go for it. There are also seek times to take into account... probably not a big deal based on your app description.
As you seem to be OK with looking at reconfiguring the drives, how about SSDs?
They run rings around any mechanical drive (up around 200+ MB/s read, 150+ MB/s write).
Are you sequentially reading the data, or randomly?
How many GB are you expecting?

Optimizing locations of on-disk data for sequential access

I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time - the user switches task, the screen saver kicks in, the battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory-mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2GB minus the memory used by the application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency (see the sketch after this list).
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
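As a sketch of the first option, here are fixed-size records kept sorted by key on disk and searched with seeks rather than by loading the whole file; the 1024-byte record layout and the file name are invented for illustration:

    #include <cstdint>
    #include <cstdio>
    #include <fstream>

    // Hypothetical fixed-size record: an 8-byte key followed by 1016 bytes of
    // data, 1024 bytes total, stored in the file sorted ascending by key.
    constexpr std::size_t kRecordSize = 1024;

    static std::uint64_t keyAt(std::ifstream& f, std::size_t idx) {
        std::uint64_t key = 0;
        f.seekg(static_cast<std::streamoff>(idx) * kRecordSize);
        f.read(reinterpret_cast<char*>(&key), sizeof key);
        return key;
    }

    // Classic binary search over the file: O(log n) seeks instead of a scan.
    static bool findRecord(std::ifstream& f, std::size_t count,
                           std::uint64_t want, char out[kRecordSize]) {
        std::size_t lo = 0, hi = count;
        while (lo < hi) {
            std::size_t mid = lo + (hi - lo) / 2;
            std::uint64_t k = keyAt(f, mid);
            if (k == want) {
                f.seekg(static_cast<std::streamoff>(mid) * kRecordSize);
                f.read(out, kRecordSize);
                return true;
            }
            if (k < want) lo = mid + 1; else hi = mid;
        }
        return false;
    }

    int main() {
        std::ifstream f("records.db", std::ios::binary | std::ios::ate);
        std::size_t count = static_cast<std::size_t>(f.tellg()) / kRecordSize;
        char rec[kRecordSize];
        if (findRecord(f, count, 42, rec))
            std::printf("found key 42\n");
    }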
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
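A minimal POSIX sketch of that pattern (the file name and block size are placeholders; Windows would use CreateFileMapping/MapViewOfFile instead):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("blocks.dat", O_RDWR);      // hypothetical data file
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        // Map the whole file; the OS pages blocks in and out and caches them.
        char* base = static_cast<char*>(mmap(nullptr,
                                             static_cast<std::size_t>(st.st_size),
                                             PROT_READ | PROT_WRITE,
                                             MAP_SHARED, fd, 0));
        if (base == MAP_FAILED) { std::perror("mmap"); return 1; }

        // Access block #5 (1 KiB blocks) as plain memory -- a cached block
        // costs no disk seek at all, and dirty pages are written back lazily.
        const std::size_t kBlock = 1024;
        char firstByte = base[5 * kBlock];
        base[5 * kBlock] = firstByte;             // touch to mark the page dirty

        munmap(base, static_cast<std::size_t>(st.st_size));
        close(fd);
    }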
The simplest way to solve this is to use an OS which solves it for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects and it will try to keep as many of them in the cache as possible, reducing the load time to 0. Recent server versions of Windows might work, too (some of them didn't for me, which is why I'm mentioning this).
If this is a no go, try this algorithm:
Create a very big file on the hard disk. It is very important that you write this in one go so the OS will allocate contiguous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk, or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them and do the reorder over night when there is little load on your machine. Or you can do the reorder on a completely different machine and swap the file (and the offset table) when that's done.
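A condensed sketch of that overnight reorder, assuming equal-sized chunks, an in-memory access-count table, and a data file that already contains the chunks (all sizes and names are hypothetical):

    #include <algorithm>
    #include <cstdio>
    #include <fstream>
    #include <numeric>
    #include <vector>

    constexpr std::size_t kChunk = 4096;          // hypothetical chunk size

    // Swap two equal-sized chunks inside the data file.
    static void swapChunks(std::fstream& f, std::size_t a, std::size_t b) {
        std::vector<char> ba(kChunk), bb(kChunk);
        f.seekg(static_cast<std::streamoff>(a * kChunk)); f.read(ba.data(), kChunk);
        f.seekg(static_cast<std::streamoff>(b * kChunk)); f.read(bb.data(), kChunk);
        f.seekp(static_cast<std::streamoff>(a * kChunk)); f.write(bb.data(), kChunk);
        f.seekp(static_cast<std::streamoff>(b * kChunk)); f.write(ba.data(), kChunk);
    }

    int main() {
        // accessCount[i] = how often chunk i was read, collected at runtime.
        std::vector<long> accessCount = {3, 90, 1, 57, 12, 7};  // sample data

        std::fstream f("objects.dat",
                       std::ios::binary | std::ios::in | std::ios::out);

        // Rank chunks by access count: order[pos] = chunk that belongs at pos.
        std::vector<std::size_t> order(accessCount.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) {
                      return accessCount[a] > accessCount[b];
                  });

        // Apply the permutation with swaps, tracking current locations, so
        // the working space stays at two chunks regardless of file size.
        std::vector<std::size_t> loc(order.size());   // loc[chunk] = position
        std::iota(loc.begin(), loc.end(), 0);
        std::vector<std::size_t> at(order.size());    // at[position] = chunk
        std::iota(at.begin(), at.end(), 0);
        for (std::size_t pos = 0; pos < order.size(); ++pos) {
            std::size_t want = order[pos];
            if (at[pos] == want) continue;
            std::size_t from = loc[want];
            swapChunks(f, pos, from);
            std::swap(at[pos], at[from]);
            loc[at[from]] = from;
            loc[at[pos]] = pos;
        }
        std::printf("reordered %zu chunks\n", order.size());
    }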
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, seeking to lesser-used data will be shorter on average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.
