Hadoop, hardware and bioinformatics - hadoop

We're about to buy new hardware to run our analyses and are wondering if we're making the right decisions.
The setting:
We're a bioinformatics lab that will be handling DNA sequencing data. The biggest issue that our field has is the amount of data, rather than the compute. A single experiment will quickly go into the 10s-100s of Gb, and we would typically run different experiments at the same time. Obviously, mapreduce approaches are interesting (see also http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html), but not all our software use that paradigm. Also, some software uses ascii files as in/output while other software works with binary files.
What we might be buying:
The machine that we might be buying would be a server with 32 cores and 192Gb of RAM, linked to NAS storage (>20Tb). This seems a very interesting setup for us for many of our (non-mapreduce) applications, but will such configuration prevent us from implementing hadoop/mapreduce/hdfs in a meaningful way?
Many thanks,
jan.

You have an interesting configuration. What would be the Disk IO for the NAS storage used by you?
Make your decision based on the following:
Map Reduce paradigm is used to solve the problem of handling large amount of data. Basically, RAM is more expensive than the Disk storage. You cannot hold all the data in the RAM. Disk storage allows you to store large amounts of data at cheaper costs. But, the speed at which you can read data from the disks is not very high. How does Map Reduce solve this problem? Map Reduce solves this problem by distributing the data over multiple machines. Now, the speed at which you can read data in parallel is greater than you could have done with a single storage disk. Suppose the Disk IO speed is 100 Mbps. With 100 machines you can read the data at 100*100 Mbps = 10Gbps.
Typically processor speed is not the bottleneck. Rather, the Disk IOs are the big bottlenecks while processing large amount of data.
I have a feeling that it may not be very efficient.

Related

SAN Performance

Have a question regarding SAN performance specifically EMC VNX SAN. I have a significant number of processes spread over number of blade servers running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one 3KB one 30KB. There are millions (20) of files to be processed. The processes are running on Windows Server on VMWare. The way this was originally setup was 1TB LUNs on the SAN bundled into a single 15TB drive in VMWare and then shared as a network share from one Windows instance to all the processes. The processes running concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through Windows share at the same time and the SAN is not handling it too well. I'm looking for suggestions to improve performance.
With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA, configuring it RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive starts to slow down at about 75 IO operations per second. If you've got big drives - 3TB for example - that's 25 IOPs per terabytes. As a rough rule of thumb, 200 per drive for FC/SAS and 1500 for SSD.
are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically SAN is accessed using a Fiber Channel network. There's a whole bunch of technical differences with 'real' networks, but they don't matter to you - but contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
what sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? Sounds like you've a single VM with 15TB presented, and that VM is handling the IO. That's a bottleneck right there. How many IOPs are the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass through LUN, or VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it. For CIFS shares a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll got a LOT faster. You may find if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client).
Basically the 'lots of small random IO operations' especially on slow disks is really the worst case for storage, because none of the clever tricks for optimization work.

Reading in parallel from multiple hard drives

I am writing an application that deals with lots of data (gigabytes). I am considering splitting the data onto multiple hard drives and reading it in parallel. I am wondering what kind of limitations I will run into--for example, is it possible to read from 4 or 8 hard drives in parallel, and will I get approximately 4 or 8 times the performance if disk I/O is the limiting factor? What should I look out for? Pointers to relevant docs are also appreciated--Google didn't turn up much.
EDIT: I should point out that I've looked at RAID, but the performance wasn't as good as I was hoping for. I am planning on writing this myself in C/C++.
Well splitting data and reading from 4 to 8 drives in parallel would not jump the throughput by 4 to 8 times. There are other factors which you need to consider.
If you reading data in the application, then threads might be required to read data from different harddisks.
Windows provide overlapped and non-overlapped method of reading and writing data to hdd. See if using that increases the throughput. Same way *nux would also have read/write methods.
On a single core/processor threads appear to run in parallel but its sequentially underlying. With multicore multiple threads can be read in parallel but generally OS decides what to run and when to run. So having so many threads to read might decrease performance than increase.
If you check specs of any harddisk, you would see it gives random access time and sequential access time. So based on you data you may want to check these parameters.
When you spliting data into different drives you need to keep in mind that your application would require syncronization of how to populate data into meaningful information. If you using threads, additionally threads should be in sync.
You may get state of the art harddisk with high data read/write speeds but you other hardware may be the weak link. So you may be using a low-end motherboard or RAM which may not let you get the best of the speeds.
If you're not going to use real RAID, you better at least use multiple hard drive controllers, otherwise you won't see much performance gain at all. One controller can't do lots of concurrent IO so it will quickly become the bottleneck.
It sounds like you are talking about the concept of data striping. This is commonly used for RAID implementations. You may want to look into one of the software RAID solutions available for most operating systems. An advantage is if you can use raid to your advantage and add parity (ability to lose a drive and not your data)
This would give you the benefits of RAID without having to try to deal with it yourself. You could do it on a database level as well with data files spread across the drives, but this adds complexity.
You will stream data faster. Drives are only so fast and if your I/O channel can handle more go for it. There's also seek times to take into account... Probably not a big deal based on your app description.
As you seem to be OK with looking at reconfiguring the drives, how about SSDs?
They run rings around any mechanical drives (up around 200+GB/sec read, 150+GB/sec write).
Are you sequentially reading the data, or randomly?
How many GB are you expecting?

Estimation of commodity hardware for an application

Suppose, I wanted to develop stack overflow website. How do I estimate the amount of commodity hardware required to support this website assuming 1 million requests per day. Are there any case studies that explains the performance improvements possible in this situation?
I know I/O bottleneck is the major bottleneck in most systems. What are the possible options to improve I/O performance? Few of them I know are
caching
replication
You can improve I/O performance in several ways depending upon what you use for your storage setup:
Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
Use fast disks (Performance Wise: SSD > FC > SATA).
Segregate workloads at different times of day. e.g. Backup during night, normal app I/O during day.
Turn off atime updates in your filesystem.
Cache NFS file handles a.k.a. Haystack (Facebook), if storing data on NFS server.
Combine small files into larger chunks, a.k.a BigTable, HBase.
Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
Use a clustered storage system (yeah not exactly commodity hardware).
Optimize/design your application for sequential disk accesses whenever possible.
Use memcached. :)
You may want to look at "Lessons Learned" section of StackOverflow Architecture.
check out this handy tool:
http://www.sizinglounge.com/
and another guide from dell:
http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555
if you want your own stackoverflow-like community, you can sign up with StackExchange.
you can read some case studies here:
High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study
1 million requests per day is 12/second. Stack overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in RAM of a 64 GByte Dell PowerEdge 2970. I'm not sure where caching and replication should play a role.
If you have a problem thinking enough about normalization, a PowerEdge R900 with 256GB is available.
If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K/second should not be a problem for a main-memory system.
The best way to avoid the I/O bottleneck is to not do I/O (as much as possible). That means a prevayler-like architecture with batched writes (no problem to lose a few seconds of data), basically a log file, and for replication also write them out to a socket.

Algorithms for Optimization with Fast Disk Storage (SSDs)?

Given that Solid State Disks (SSDs) are decreasing in price and soon will become more prevalent as system drives, and given that their access rates are significantly higher than rotating magnetic media, what standard algorithms will gain in performance from the use of SSDs for local storage? For example, the high random read speed of SSDs makes something like a disk-based hashtable a viability for large hashstables; 4GB of disk space is readily available, which makes hashing to the entire range of a 32-bit integer viable (more for lookup than population, though, which would still take a long time); while this size of a hashtable would be prohibitive to work with with rotating media due to the access speed, it shouldn't be as much of an issue with SSDs.
Are there any other areas where the impending transition to SSDs will provide potential gains in algorithmic performance? I'd rather see reasoning as to how one thing will work rather than opinion; I don't want this to turn contentious.
Your example of hashtables is indeed the key database structure that will benefit. Instead of having to load a whole 4GB or more file into memory to probe for values, the SSD can be probed directly. The SSD is still slower than RAM, by orders of magnitude, but it's quite reasonable to have a 50GB hash table on disk, but not in RAM unless you pay big money for big iron.
An example is chess position databases. I have over 50GB of hashed positions. There is complex code to try to group related positions near each other in the hash, so I can page in 10MB of the table at a time and hope to reuse some of it for multiple similar position queries. There's a ton of code and complexity to make this efficient.
Replaced with an SSD, I was able to drop all the complexity of the clustering and just use really dumb randomized hashes. I also got an increase in performance since I only fetch the data I need from the disk, not big 10MB chunks. The latency is indeed larger, but the net speedup is significant.. and the super-clean code (20 lines, not 800+), is perhaps even nicer.
SSDs are only significantly faster for random access. Sequential access to disk they are only twice as performant as mainstream rotational drives. Many SSDs have poorer performance in many scenarios causing them to perform worse, as described here.
While SSDs do move the needle considerably, they are still much slower than CPU operations and physical memory. For your 4GB hash table example, you may be able to sustain 250+ MB/s off of an SSD for accessing random hash table buckets. For a rotational drive, you'd be lucky to break the single digit MB/s. If you can keep this 4 GB hash table in memory, you could access it on the order of gigabytes a second - much faster than even a very swift SSD.
The referenced article lists several changes MS made for Windows 7 when running on SSD's, which can give you an idea of the sort of changes you could consider making. First, SuperFetch for prefetching data off of disk is disabled - it's designed to get around slow random access times for disk which are alleviated by SSDs. Defrag is disabled, because having files scattered across the disk aren't a performance hit for SSDs.
Ipso facto, any algorithm you can think of which requires lots of random disk I/O (random being the key word, which helps to throw the principle of locality to the birds, thus eliminating the usefulness of a lot of caching that goes on).
I could see certain database systems gaining from this though. MySQL, for instance using the MyISAM storage engine (where data records are basically glorified CSVs). However, I think very large hashtables are going to be your best bet for good examples.
SSD are a lot faster for random reads, a bit for sequential reads and properly slower for writes (random or not).
So a diskbased hashtable is properly not useful with an SSD, since it now takes significantly time to update it, but searching the disk becomes (compared to a normal hdd) very cheap.
Don't kid yourself. SSDs are still a whole lot slower than system memory. Any algorithm that chooses to use system memory over the hard disk is still going to be much faster, all other things being equal.

Compression to Improve Hard Disk Write Performance

On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working with where a program serially generates and dumps around 1-2GB of text logging data to a raw text file on the hard disk and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.
Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it, time a huge fwrite or something.) Let's say 100mb/s. Now take your CPU speed in megahertz; let's say 3Ghz = 3000mhz. Divide the CPU speed by the disk write speed. That's the number of cycles that the CPU is spending idle, that you can spend per byte on compression. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could compress your data by 25% for an effective 125mb/s write speed, you would have 24 cycles per byte to run it in and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
We do this all the time when reading optical media.
If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression on lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.
For what it is worth Sun's filesystem ZFS has the ability to have on the fly compression enabled to decrease the amount of disk IO without a significant increase in overhead as an example of this in practice.
The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on the
used compression algorithm and settings,
the file workload and
the characteristics of your machine.
For example, in the measurements from the paper, using a textual workload and a server environment using lzop with low compression effort are faster than plain write, but bzip and gz aren't.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.
CPUs have grown faster at a faster rate than hard drive access. Even back in the 80's a many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That will not have changed.
Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.
As to the usefulness of a second core only counts if the system will be also doing a significant number of other things - and your program would have to be multi-threaded to take advantage of the additional CPU.
Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
Windows already supports File Compression in NTFS, so all you have to do is to set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.
This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates in the 10's of MBytes/second this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely have to have easily compressible data and would just have to benchmark it with some test of reasonableness type experiments and find out.
Relative to a specific opinion (guess!?) to the point about additional cores. If you thread up the compression of the data and keep the core(s) fed - with the high compression ratio of text, it is likely such a technique would bear some fruit. But this is just a guess. In a single threaded application alternating between disk writes and compression operations, it seems much less likely to me.
If it's just text, then compression could definitely help. Just choose an compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2" and both have parameters that you can tweak to favor speed or compression ratio.
If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core, and a relatively fast 100 MB/s streaming hard drive,
halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data.
With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there is a a few cycles available (not much) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.)
then you have even more time for a more sophisticated compressor to ponder the data.
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.
If this particular machine is often IO bound,
another way to speed it up is to install a RAID array.
That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with same 4 total disks, gives all a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).
This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible,
However, as Shog9 commented:
Rules of thumb aren't going to help
you here. It's your disk, your CPU,
and your data. Set up a test case and
measure throughput and CPU load with
and without compression - see if it's
worth the tradeoff.

Resources