Estimation of commodity hardware for an application - performance

Suppose, I wanted to develop stack overflow website. How do I estimate the amount of commodity hardware required to support this website assuming 1 million requests per day. Are there any case studies that explains the performance improvements possible in this situation?
I know I/O bottleneck is the major bottleneck in most systems. What are the possible options to improve I/O performance? Few of them I know are
caching
replication

You can improve I/O performance in several ways depending upon what you use for your storage setup:
Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
Use fast disks (Performance Wise: SSD > FC > SATA).
Segregate workloads at different times of day. e.g. Backup during night, normal app I/O during day.
Turn off atime updates in your filesystem.
Cache NFS file handles a.k.a. Haystack (Facebook), if storing data on NFS server.
Combine small files into larger chunks, a.k.a BigTable, HBase.
Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
Use a clustered storage system (yeah not exactly commodity hardware).
Optimize/design your application for sequential disk accesses whenever possible.
Use memcached. :)
You may want to look at "Lessons Learned" section of StackOverflow Architecture.

check out this handy tool:
http://www.sizinglounge.com/
and another guide from dell:
http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555
if you want your own stackoverflow-like community, you can sign up with StackExchange.
you can read some case studies here:
High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study

1 million requests per day is 12/second. Stack overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in RAM of a 64 GByte Dell PowerEdge 2970. I'm not sure where caching and replication should play a role.
If you have a problem thinking enough about normalization, a PowerEdge R900 with 256GB is available.
If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K/second should not be a problem for a main-memory system.
The best way to avoid the I/O bottleneck is to not do I/O (as much as possible). That means a prevayler-like architecture with batched writes (no problem to lose a few seconds of data), basically a log file, and for replication also write them out to a socket.

Related

SAN Performance

Have a question regarding SAN performance specifically EMC VNX SAN. I have a significant number of processes spread over number of blade servers running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one 3KB one 30KB. There are millions (20) of files to be processed. The processes are running on Windows Server on VMWare. The way this was originally setup was 1TB LUNs on the SAN bundled into a single 15TB drive in VMWare and then shared as a network share from one Windows instance to all the processes. The processes running concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through Windows share at the same time and the SAN is not handling it too well. I'm looking for suggestions to improve performance.
With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA, configuring it RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive starts to slow down at about 75 IO operations per second. If you've got big drives - 3TB for example - that's 25 IOPs per terabytes. As a rough rule of thumb, 200 per drive for FC/SAS and 1500 for SSD.
are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically SAN is accessed using a Fiber Channel network. There's a whole bunch of technical differences with 'real' networks, but they don't matter to you - but contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
what sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? Sounds like you've a single VM with 15TB presented, and that VM is handling the IO. That's a bottleneck right there. How many IOPs are the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass through LUN, or VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it. For CIFS shares a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll got a LOT faster. You may find if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client).
Basically the 'lots of small random IO operations' especially on slow disks is really the worst case for storage, because none of the clever tricks for optimization work.

Read files by device/inode order?

I'm interested in an efficient way to read a large number of files on the disk. I want to know if I sort files by device and then by inode I'll got some speed improvement against natural file reading.
There are vast speed improvements to be had from reading files in physical order from rotating storage. Operating system I/O scheduling mechanisms only do any real work if there are several processes or threads contending for I/O, because they have no information about what files you plan to read in the future. Hence, other than simple read-ahead, they usually don't help you at all.
Furthermore, Linux worsens your access patterns during directory scans by returning directory entries to user space in hash table order rather than physical order. Luckily, Linux also provides system calls to determine the physical location of a file, and whether or not a file is stored on a rotational device, so you can recover some of the losses. See for example this patch I submitted to dpkg a few years ago:
http://lists.debian.org/debian-dpkg/2009/11/msg00002.html
This patch does not incorporate a test for rotational devices, because this feature was not added to Linux until 2012:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ef00f59c95fe6e002e7c6e3663cdea65e253f4cc
I also used to run a patched version of mutt that would scan Maildirs in physical order, usually giving a 5x-10x speed improvement.
Note that inodes are small, heavily prefetched and cached, so opening files to get their physical location before reading is well worth the cost. It's true that common tools like tar, rsync, cp and PostgreSQL do not use these techniques, and the simple truth is that this makes them unnecessarily slow.
Back in the 1970s I proposed to our computer center that reading/writing from/to disk would be faster overall if they organized the queue of disk reads and/or writes in such a way as to minimize the seek time and I was told by the computer center that their experiments and information from IBM that many studies had been made of several techniques and that the overall throughput of JOBS (not just a single job) was most optimal if disk reads/writes were done in first come first serve order. This was an IBM batch system.
In general, optimisation techniques for file access are too tied to the architecture of your storage subsystem for them to be something as simple as a sorting algorithm.
1) You can effectively multiply the read data rate if your files are spread into multiple physical drives (not just partitions) and you read two or more files in parallel from different drives. This one is probably the only method that is easy to implement.
2) Sorting the files by name or inode number does not really change anything in the general case. What you'd want is to sort the files by the physical location of their blocks on the disk, so that they can be read with minimal seeking. There are quite a few obstacles however:
Most filesystems do not provide such information to userspace applications, unless it's for debugging reasons.
The blocks themselves of each file can be spread all over the disk, especially on a mostly full filesystem. There is no way to read multiple files sequentially without seeking back and forth.
You are assuming that your process is the only one accessing the storage subsystem. Once there is at least someone else doing the same, every optimisation you come up with goes out of the window.
You are trying to be smarter than the operating system and its own caching and I/O scheduling mechanisms. It's very likely that by trying to second-guess the kernel, i.e. the only one that really knows your system and your usage patterns, you will make things worse.
Don't you think e.g. PostreSQL pr Oracle would have used a similar technique if they could? When the DB is installed on a proper filesystem they let the kernel do its thing and don't try to second-guess its decisions. Only when the DB is on a raw device do the specialised optimisation algorithms that take physical blocks into account come into play.
You should also take the specific properties of your storage devices into account. Modern SSDs, for example, make traditional seek-time optimisations obsolete.

Hadoop, hardware and bioinformatics

We're about to buy new hardware to run our analyses and are wondering if we're making the right decisions.
The setting:
We're a bioinformatics lab that will be handling DNA sequencing data. The biggest issue that our field has is the amount of data, rather than the compute. A single experiment will quickly go into the 10s-100s of Gb, and we would typically run different experiments at the same time. Obviously, mapreduce approaches are interesting (see also http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html), but not all our software use that paradigm. Also, some software uses ascii files as in/output while other software works with binary files.
What we might be buying:
The machine that we might be buying would be a server with 32 cores and 192Gb of RAM, linked to NAS storage (>20Tb). This seems a very interesting setup for us for many of our (non-mapreduce) applications, but will such configuration prevent us from implementing hadoop/mapreduce/hdfs in a meaningful way?
Many thanks,
jan.
You have an interesting configuration. What would be the Disk IO for the NAS storage used by you?
Make your decision based on the following:
Map Reduce paradigm is used to solve the problem of handling large amount of data. Basically, RAM is more expensive than the Disk storage. You cannot hold all the data in the RAM. Disk storage allows you to store large amounts of data at cheaper costs. But, the speed at which you can read data from the disks is not very high. How does Map Reduce solve this problem? Map Reduce solves this problem by distributing the data over multiple machines. Now, the speed at which you can read data in parallel is greater than you could have done with a single storage disk. Suppose the Disk IO speed is 100 Mbps. With 100 machines you can read the data at 100*100 Mbps = 10Gbps.
Typically processor speed is not the bottleneck. Rather, the Disk IOs are the big bottlenecks while processing large amount of data.
I have a feeling that it may not be very efficient.

Reading in parallel from multiple hard drives

I am writing an application that deals with lots of data (gigabytes). I am considering splitting the data onto multiple hard drives and reading it in parallel. I am wondering what kind of limitations I will run into--for example, is it possible to read from 4 or 8 hard drives in parallel, and will I get approximately 4 or 8 times the performance if disk I/O is the limiting factor? What should I look out for? Pointers to relevant docs are also appreciated--Google didn't turn up much.
EDIT: I should point out that I've looked at RAID, but the performance wasn't as good as I was hoping for. I am planning on writing this myself in C/C++.
Well splitting data and reading from 4 to 8 drives in parallel would not jump the throughput by 4 to 8 times. There are other factors which you need to consider.
If you reading data in the application, then threads might be required to read data from different harddisks.
Windows provide overlapped and non-overlapped method of reading and writing data to hdd. See if using that increases the throughput. Same way *nux would also have read/write methods.
On a single core/processor threads appear to run in parallel but its sequentially underlying. With multicore multiple threads can be read in parallel but generally OS decides what to run and when to run. So having so many threads to read might decrease performance than increase.
If you check specs of any harddisk, you would see it gives random access time and sequential access time. So based on you data you may want to check these parameters.
When you spliting data into different drives you need to keep in mind that your application would require syncronization of how to populate data into meaningful information. If you using threads, additionally threads should be in sync.
You may get state of the art harddisk with high data read/write speeds but you other hardware may be the weak link. So you may be using a low-end motherboard or RAM which may not let you get the best of the speeds.
If you're not going to use real RAID, you better at least use multiple hard drive controllers, otherwise you won't see much performance gain at all. One controller can't do lots of concurrent IO so it will quickly become the bottleneck.
It sounds like you are talking about the concept of data striping. This is commonly used for RAID implementations. You may want to look into one of the software RAID solutions available for most operating systems. An advantage is if you can use raid to your advantage and add parity (ability to lose a drive and not your data)
This would give you the benefits of RAID without having to try to deal with it yourself. You could do it on a database level as well with data files spread across the drives, but this adds complexity.
You will stream data faster. Drives are only so fast and if your I/O channel can handle more go for it. There's also seek times to take into account... Probably not a big deal based on your app description.
As you seem to be OK with looking at reconfiguring the drives, how about SSDs?
They run rings around any mechanical drives (up around 200+GB/sec read, 150+GB/sec write).
Are you sequentially reading the data, or randomly?
How many GB are you expecting?

What are the theoretical performance limits on web servers?

In a currently deployed web server, what are the typical limits on its performance?
I believe a meaningful answer would be one of 100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today? Which was true 5 years ago? Which might we expect in 5 years? (ie, how do trends in bandwidth, disk performance, CPU performance, etc. impact the answer)
If it is material, the fact that HTTP over TCP is the access protocol should be considered. OS, server language, and filesystem effects should be assumed to be best-of-breed.
Assume that the disk contains many small unique files that are statically served. I'm intending to eliminate the effect of memory caches, and that CPU time is mainly used to assemble the network/protocol information. These assumptions are intended to bias the answer towards 'worst case' estimates where a request requires some bandwidth, some cpu time and a disk access.
I'm only looking for something accurate to an order of magnitude or so.
Read http://www.kegel.com/c10k.html. You might also read StackOverflow questions tagged 'c10k'. C10K stands for 10'000 simultaneous clients.
Long story short -- principally, the limit is neither bandwidth, nor CPU. It's concurrency.
Six years ago, I saw an 8-proc Windows Server 2003 box serve 100,000 requests per second for static content. That box had 8 Gigabit Ethernet cards, each on a separate subnet. The limiting factor there was network bandwidth. There's no way you could serve that much content over the Internet, even with a truly enormous pipe.
In practice, for purely static content, even a modest box can saturate a network connection.
For dynamic content, there's no easy answer. It could be CPU utilization, disk I/O, backend database latency, not enough worker threads, too much context switching, ...
You have to measure your application to find out where your bottlenecks lie. It might be in the framework, it might be in your application logic. It probably changes as your workload changes.
I think it really depends on what you are serving.
If you're serving web applications that dynamically render html, CPU is what is consumed most.
If you are serving up a relatively small number of static items lots and lots of times, you'll probably run into bandwidth issues (since the static files themselves will probably find themselves in memory)
If you're serving up a large number of static items, you may run into disk limits first (seeking and reading files)
If you are not able to cache your files in memory, then disk seek times will likely be the limiting factor and limit your performance to less than 1000 requests/second. This might improve when using solid state disks.
100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today?
This test was done on a modest i3 laptop, but it reviewed Varnish, ATS (Apache Traffic Server), Nginx, Lighttpd, etc.
http://nbonvin.wordpress.com/2011/03/24/serving-small-static-files-which-server-to-use/
The interesting point is that using a high-end 8-core server gives a very little boost to most of them (Apache, Cherokee, Litespeed, Lighttpd, Nginx, G-WAN):
http://www.rootusers.com/web-server-performance-benchmark/
As the tests were done on localhost to avoid hitting the network as a bottleneck, the problem is in the kernel which does not scale - unless you tune its options.
So, to answer your question, the progress margin is in the way servers process IO.
They will have to use better data structures (wait-free).
I think there are too many variables here to answer your question.
What processor, what speed, what cache, what chipset, what disk interface, what spindle speed, what network card, how configured, the list is huge. I think you need to approach the problem from the other side...
"This is what I want to do and achieve, what do I need to do it?"
OS, server language, and filesystem effects are the variables here. If you take them out, then you're left with a no-overhead TCP socket.
At that point it's not really a question of performance of the server, but of the network. With a no-overhead TCP socket your limit that you will hit will most likely be at the firewall or your network switches with how many connections can be handled concurrently.
In any web application that uses a database you also open up a whole new range of optimisation needs.
indexes, query optimisation etc
For static files, does your application cache them in memory?
etc, etc, etc
This will depend what is your CPU core
What speed are your disks
What is a 'fat' 'medium' sized hosting companies pipe.
What is the web server?
The question is too general
Deploy you server test it using tools like http://jmeter.apache.org/ and see how you get on.

Resources