Related
Have a question regarding SAN performance specifically EMC VNX SAN. I have a significant number of processes spread over number of blade servers running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one 3KB one 30KB. There are millions (20) of files to be processed. The processes are running on Windows Server on VMWare. The way this was originally setup was 1TB LUNs on the SAN bundled into a single 15TB drive in VMWare and then shared as a network share from one Windows instance to all the processes. The processes running concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through Windows share at the same time and the SAN is not handling it too well. I'm looking for suggestions to improve performance.
With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA, configuring it RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive starts to slow down at about 75 IO operations per second. If you've got big drives - 3TB for example - that's 25 IOPs per terabytes. As a rough rule of thumb, 200 per drive for FC/SAS and 1500 for SSD.
are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically SAN is accessed using a Fiber Channel network. There's a whole bunch of technical differences with 'real' networks, but they don't matter to you - but contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
what sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? Sounds like you've a single VM with 15TB presented, and that VM is handling the IO. That's a bottleneck right there. How many IOPs are the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass through LUN, or VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it. For CIFS shares a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll got a LOT faster. You may find if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client).
Basically the 'lots of small random IO operations' especially on slow disks is really the worst case for storage, because none of the clever tricks for optimization work.
I know the title of my question is rather vague, so I'll try to clarify as much as I can. Please feel free to moderate this question to make it more useful for the community.
Given a standard LAMP stack with more or less default settings (a bit of tuning is allowed, client-side and server-side caching turned on), running on modern hardware (16Gb RAM, 8-core CPU, unlimited disk space, etc), deploying a reasonably complicated CMS service (a Drupal or Wordpress project for arguments sake) - what amounts of traffic, SQL queries, user requests can I resonably expect to accommodate before I have to start thinking about performance?
NOTE: I know that specifics will greatly depend on the details of the project, i.e. optimizing MySQL queries, indexing stuff, minimizing filesystem hits - assuming web developers did a professional job - I'm really looking for a very rough figure in terms of visits per day, traffic during peak visiting times, how many records before (transactional) MySQL fumbles, so on.
I know the only way to really answer my question is to run load testing on a real project, and I'm concerned that my question may be treated as partly off-top.
I would like to get a set of figures from people with first-hand experience, e.g. "we ran such and such set-up and it handled at least this much load [problems started surfacing after such and such]". I'm also greatly interested in any condenced (I'm short on time atm) reading I can do to get a better understanding of the matter.
P.S. I'm meeting a client tomorrow to talk about his project, and I want to be prepared to reason about performance if his project turns out to be akin FourSquare.
Very tricky to answer without specifics as you have noted. If I was tasked with what you have to do, I would take each component in turn ( network interface, CPU/memory, physical IO load, SMP locking etc) and get the maximum capacity available, divide by rough estimate of use per request.
For example, network io. You might have 1x 1Gb card, which might achieve maybe 100Mbytes/sec. ( I tend to use 80% of theoretical max). How big will a typical 'hit' be? Perhaps 3kbytes average, for HTML, images etc. that means you can achieve 33k requests per second before you bottleneck at the physical level. These numbers are absolute maximums, depending on tools and skills you might not get anywhere near them, but nobody can exceed these maximums.
Repeat the above for every component, perhaps varying your numbers a little, and you will build a quick picture of what is likely to be a concern. Then, consider how you can quickly get more capacity in each component, can you just chuck $$ and gain more performance (eg use SSD drives instead of HD)? Or will you hit a limit that cannot be moved without rearchitecting? Also take into account what resources you have available, do you have lots of skilled programmer time, DBAs, or wads of cash? If you have lots of a resource, you can tend to reduce those constraints easier and quicker as you move along the experience curve.
Do not forget external components too, firewalls may have limits that are lower than expected for sustained traffic.
Sorry I cannot give you real numbers, our workloads are using custom servers, high memory caching and other tricks, and not using all the products you list. However, I would concentrate most on IO/SQL queries and possibly network IO, as these tend to be more hard limits, than CPU/memory, although I'm sure others will have a different opinion.
Obviously, the question is such that does not have a "proper" answer, but I'd like to close it and give some feedback. The client meeting has taken place, performance was indeed a biggie, their hosting platform turned out to be on the Amazon cloud :)
From research I've done independently:
Memcache is a must;
MySQL (or whatever persistent storage instance you're running) is usually the first to go. Solutions include running multiple virtual instances and replicate data between them, distributing the load;
http://highscalability.com/ is a good read :)
I would like to try testing NLP tools against dumps of the web and other corpora, sometimes larger than 4 TB.
If I run this on a mac it's very slow. What is the best way to speed up this process?
deploying to EC2/Heroku and scaling up servers
buying hardware and creating a local setup
Just want to know how this is usually done (processing terabytes in a matter of minutes/seconds), if it's cheaper/better to experiment with this in the cloud, or do I need my own hardware setup?
Regardless of the brand of your cloud, the whole idea of cloud computing is to be able to scale-up and scale down in a flexible way.
In a corporate environment you might have a scenario in which you will consistently need the same amount of computing resources, so if you already have them, it is rather a difficult case to use the cloud because you just don't need the flexibility provided.
On the other hand if your processing tasks are not quite predictable, your best solution is the cloud because you will be able to pay more when you use more computing power, and then pay less when you don't need as much power.
Take into account though, that not all cloud-solutions are the same, for instance, a Web role is a highly web-dedicated node whose main purpose is to serve web requests, the more requests are served, the more you pay.
Whereas in a virtual role, is almost like you are given the exclusivity of a computer system that you can use for anything you want, either a linux or a windows OS, the system keeps running even though you are not using it at its best.
Overall, the costs depend on your own scenario and how well it fits to your needs.
I suppose it depends quite a bit on what kind of experimenting you are wanting to do, for what purpose and for how long.
If you're looking into buying the hardware and running your own cluster then you probably want something like Hadoop or Storm to manage the compute nodes. I don't know how feasible it is to go through 4TB of data in a matter of seconds but again that really depends on the kind of processing you want to do. Counting the frequency of words in the 4TB corpus should be pretty easy (even or your mac), but building SVMs or doing something like LDA on the lot won't be. One issue you'll run into is that you won't have enough memory to fit all of that, so you'll want a library that can run the methods off disk.
If you don't know exactly what your requirements are then I would use EC2 to setup a test rig to gain a better understanding what it is that you want to do and how much grunt/memory that needs to get done in the amount of time you require.
We recently bought two compute nodes 128 cores each with 256Gb of memory and a few terabytes of disk space for I think it was around £20k or so. These are AMD interlagos machines. That said the compute cluster already had infiniband storage so we just had to hook up to that and just buy to two compute nodes, not the whole infrastructure.
The obvious thing to do here is to start off with a smaller data set, say a few gigabytes. That'll get you started on your mac, you can experiment with the data and different methods to get an idea of what works and what doesn't, and then move your pipeline to the cloud, and run it with more data. If you don't want to start the experimentation with a single sample, you can always take multiple samples from different parts of the full corpus, just keep the sample sizes down to something you can manage on your own workstation to start off with.
As an aside, I highly recommend the scikit-learn project on GitHub for machine learning. It's written in Python, but most of the matrix operations are done in Fortran or C libraries so it's pretty fast. The developer community is also extremely active on the project. Another good library that perhaps a bit more approachable (depending on your level of expertise) is NLTK. It's nowhere near as fast but makes a bit more sense if you're not familiar with thinking about everything as a matrix.
UPDATE
One thing I forgot to mention is the time your project will be running. Or to put it another way, how long will you get some use out of your specialty hardware. If it's a project that is supposed to serve the EU parliament for the next 10 years, then you should definitely buy the hardware. If it's a project for you to get familiar with NLP, then forking out the money might be a bit redundant, unless you're also planning on starting you own cloud computing rental service :).
That said, I don't know what the real world costs of using EC2 are for something like this. I've never had to use them.
In a currently deployed web server, what are the typical limits on its performance?
I believe a meaningful answer would be one of 100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today? Which was true 5 years ago? Which might we expect in 5 years? (ie, how do trends in bandwidth, disk performance, CPU performance, etc. impact the answer)
If it is material, the fact that HTTP over TCP is the access protocol should be considered. OS, server language, and filesystem effects should be assumed to be best-of-breed.
Assume that the disk contains many small unique files that are statically served. I'm intending to eliminate the effect of memory caches, and that CPU time is mainly used to assemble the network/protocol information. These assumptions are intended to bias the answer towards 'worst case' estimates where a request requires some bandwidth, some cpu time and a disk access.
I'm only looking for something accurate to an order of magnitude or so.
Read http://www.kegel.com/c10k.html. You might also read StackOverflow questions tagged 'c10k'. C10K stands for 10'000 simultaneous clients.
Long story short -- principally, the limit is neither bandwidth, nor CPU. It's concurrency.
Six years ago, I saw an 8-proc Windows Server 2003 box serve 100,000 requests per second for static content. That box had 8 Gigabit Ethernet cards, each on a separate subnet. The limiting factor there was network bandwidth. There's no way you could serve that much content over the Internet, even with a truly enormous pipe.
In practice, for purely static content, even a modest box can saturate a network connection.
For dynamic content, there's no easy answer. It could be CPU utilization, disk I/O, backend database latency, not enough worker threads, too much context switching, ...
You have to measure your application to find out where your bottlenecks lie. It might be in the framework, it might be in your application logic. It probably changes as your workload changes.
I think it really depends on what you are serving.
If you're serving web applications that dynamically render html, CPU is what is consumed most.
If you are serving up a relatively small number of static items lots and lots of times, you'll probably run into bandwidth issues (since the static files themselves will probably find themselves in memory)
If you're serving up a large number of static items, you may run into disk limits first (seeking and reading files)
If you are not able to cache your files in memory, then disk seek times will likely be the limiting factor and limit your performance to less than 1000 requests/second. This might improve when using solid state disks.
100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today?
This test was done on a modest i3 laptop, but it reviewed Varnish, ATS (Apache Traffic Server), Nginx, Lighttpd, etc.
http://nbonvin.wordpress.com/2011/03/24/serving-small-static-files-which-server-to-use/
The interesting point is that using a high-end 8-core server gives a very little boost to most of them (Apache, Cherokee, Litespeed, Lighttpd, Nginx, G-WAN):
http://www.rootusers.com/web-server-performance-benchmark/
As the tests were done on localhost to avoid hitting the network as a bottleneck, the problem is in the kernel which does not scale - unless you tune its options.
So, to answer your question, the progress margin is in the way servers process IO.
They will have to use better data structures (wait-free).
I think there are too many variables here to answer your question.
What processor, what speed, what cache, what chipset, what disk interface, what spindle speed, what network card, how configured, the list is huge. I think you need to approach the problem from the other side...
"This is what I want to do and achieve, what do I need to do it?"
OS, server language, and filesystem effects are the variables here. If you take them out, then you're left with a no-overhead TCP socket.
At that point it's not really a question of performance of the server, but of the network. With a no-overhead TCP socket your limit that you will hit will most likely be at the firewall or your network switches with how many connections can be handled concurrently.
In any web application that uses a database you also open up a whole new range of optimisation needs.
indexes, query optimisation etc
For static files, does your application cache them in memory?
etc, etc, etc
This will depend what is your CPU core
What speed are your disks
What is a 'fat' 'medium' sized hosting companies pipe.
What is the web server?
The question is too general
Deploy you server test it using tools like http://jmeter.apache.org/ and see how you get on.
Amazon measures their CPU allotment in terms of virtual cores and EC2 Compute Units. EC2 Compute Units are defined as:
The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor referenced in our original documentation.
My question is, say I have a "Large Instance" which comes with "4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)". Does this mean I essentially have 4 cores in a logical sense? Would I want to spawn 4 CPU-bound threads? Or are the compute units simply a measure of power, and I have 2 cores?
Also, given the scalability of the servers, would it be better to double the computing power of a single box and host the database and server on the same box? Or should I have 2 seperate, weaker boxes?
nicholaides is correct, the small instances are the equivalent of one core, the large two cores. The remainder of the measurement is expressed as Compute Units, which are defined as follows:
One EC2 Compute Unit (ECU) provides
the equivalent CPU capacity of a
1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
I run my small website on a single small instance, with both web server and database hosted on the one virtual machine. I've been impressed with the performance, but again don't have a tremendous amount of load on it.
If all you're caring for is bang for your buck, I'd try your setup with both servers running on a single small instance (1 core, 1 EC2 unit at $0.10 / hour) and see how that stacks up. The next step up would be a high-CPU medium instance (2 cores, 5 total EC2 units at $0.20 / hour). Unless you're really hammering your servers, I have to believe you'll be able to run them on that single medium instance. For only twice the price of the small instance, you get five times the performance, which is much better than running two small instances.
One thing to be careful of is that the small and high-CPU medium instances are 32-bit, where all others (large, extra large, and high-CPU extra large) are 64-bit. You cannot run a 32-bit Amazon Machine Image on a 64-bit instance, and vice versa. If you're working with a stock AMI, this isn't a problem because you'll usually be able to find both versions of it, but for a custom image it might make you do a little extra work.
"4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)" simply means you get 2 virtual cpu's, each of which is twice as fast as the basic Small instance.
In total, you get 4 times the power of the Small instance, but since you only get 2 cores, it makes sense to start only two threads.
As for your second question, I think Brad Larson answers it pretty well. The Medium instance has a lot of power for the money. We run our db en web servers on the same host, and it's surprising how many db-heavy sites you can run on a single machine. However, since it depends on your own application your best bet is to benchmark it to see how much load it can handle.
If you must scale up I would suggest separating the two services into different servers, instead of running a larger server, simply because it is easier to optimize each host for the specific service.
As I recall, "Compute Units" are not measuring cores but simple a measure of "power."
Also, given the scalability of the servers, would it be better to double the computing power of a single box and host the database and server on the same box? Or should I have 2 seperate, weaker boxes?
It really depends on the application. Trying it out and getting hard data might be your best bet.