Performance of CPU - performance

While going through Computer organisation by Patterson,I encountered a question where I am completely stuck. Question is:
Suppose we know that an application that uses both a desktop client and a remote server is limited by network performance. For the following changes state whether only the throughput improves, both response time and throughput improve, or neither improves.
And the changes made are:
More memory is added to the computer
If we add more memory ,shouldn't the throughput and execution time will improve?
To be clear ,the definition of throughput and response time is explained in the book as:
Throughput: The amount of work done in a given time.
Response Time: time required to complete a task ,tasks are i/o device activities, Operating System overhead, disk access, memory access.

Assume the desktop client is your internet browser. And the server is the internet, for example the stackoverflow website. If you're having network performance problems, adding more RAM to your computer won't make browsing the internet faster.
More memory helps only when the application needs more memory. For any other limitation, the additional memory will simply remain unused.

You have to think like a text book here. If your only given constraint is network performance, then you have to assume that there are no other constraints.
Therefore, the question boils down to: how does increasing the memory affect network performance?
If you throw in other constraints such as the system is low on memory and actively paging, then maybe response time improves with more memory and less paging. But the only constraint given is network performance.

It wont make a difference as you are already bound by the network performance. Imagine you have a large tank of water and tiny pipe coming out it. Suppose you want to get more water within given amount of time (throughput). Does it make sense to add more water to the tank to achieve that? Its not, as we are bound by the width of the pipe. Either you add more pipes or you widen the pipe you have.
Going back to your question, if the whole system is bound by network performance you need to add more bandwidth, to see any improvement. Doing anything else is pointless.

Related

What methodology would you use to measure the load capacity of a software server application?

I have a high-performance software server application that is expected to get increased traffic in the next few months.
I was wondering what approach or methodology is good to use in order to gauge if the server still has the capacity to handle this increased load?
I think you're looking for Stress Testing and the scenario would be something like:
Create a load test simulating current real application usage
Start with current number of users and gradually increase the load until
you reach the "increased traffic" amount
or errors start occurring
or you start observing performance degradation
whatever comes the first
Depending on the outcome you either can state that your server can handle the increased load without any issues or you will come up with the saturation point and the first bottleneck
You might also want to execute a Soak Test - leave the system under high prolonged load for several hours or days, this way you can detect memory leaks or other capacity problems.
More information: Why ‘Normal’ Load Testing Isn’t Enough
Test the product with one-tenth the data and traffic. Be sure the activity is 'realistic'.
Then consider what will happen as traffic grows -- with the RAM, disk, cpu, network, etc, grow linearly or not?
While you are doing that, look for "hot spots". Optimize them.
Will you be using web pages? Databases? Etc. Each of these things scales differently. (In other words, you have not provided enough details in your question.)
Most canned benchmarks focus on one small aspect of computing; applying the results to a specific application is iffy.
I would start by collecting base line data on critical resources - typically, CPU, memory usage, disk usage, network usage - and track them over time. If any of those resources show regular spikes where they remain at 100% capacity for more than a fraction of a second, under current usage, you have a bottleneck somewhere. In this case, you cannot accept additional load without likely outages.
Next, I'd start figuring out what the bottleneck resource for your application is - it varies between applications, but in most cases it's the bottleneck resource that stops you from scaling further. Your CPU might be almost idle, but you're thrashing the disk I/O, for instance. That's a tricky process - load and stress testing are the way to go.
If you can resolve the bottleneck by buying better hardware, do so - it's much cheaper than rewriting the software. If you can't buy better hardware, look at load balancing. If you can't load balance, you've got to look at application architecture and implementation and see if there are ways to move the bottleneck.
It's quite typical for the bottleneck to move from one resource to the next - you've got CPU to behave, but now when you increase traffic, you're spiking disk I/O; once you resolve that, you may get another CPU challenge.

SAN Performance

Have a question regarding SAN performance specifically EMC VNX SAN. I have a significant number of processes spread over number of blade servers running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one 3KB one 30KB. There are millions (20) of files to be processed. The processes are running on Windows Server on VMWare. The way this was originally setup was 1TB LUNs on the SAN bundled into a single 15TB drive in VMWare and then shared as a network share from one Windows instance to all the processes. The processes running concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through Windows share at the same time and the SAN is not handling it too well. I'm looking for suggestions to improve performance.
With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA, configuring it RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive starts to slow down at about 75 IO operations per second. If you've got big drives - 3TB for example - that's 25 IOPs per terabytes. As a rough rule of thumb, 200 per drive for FC/SAS and 1500 for SSD.
are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically SAN is accessed using a Fiber Channel network. There's a whole bunch of technical differences with 'real' networks, but they don't matter to you - but contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
what sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? Sounds like you've a single VM with 15TB presented, and that VM is handling the IO. That's a bottleneck right there. How many IOPs are the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass through LUN, or VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it. For CIFS shares a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll got a LOT faster. You may find if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client).
Basically the 'lots of small random IO operations' especially on slow disks is really the worst case for storage, because none of the clever tricks for optimization work.

Is there a technique to predict performance impact of application

A customer is running a clustered web application server under considerable load. He wants to know if the upcoming application, which is not implemented yet, will still be manageable by his current setup.
Is there a established method to predict the performance impact of application in concept state, based on an existing requirement specification (or maybe a functional design specification).
First priority would be to predict the impact on CPU resource.
Is it possible to get fairly exact results at all?
I'd say the canonical answer is no. You always have to benchmark the actual application being deployed on its target architecture.
Why? Software and software development are not predictable. And systems are even more unpredictable.
Even if you know the requirements now and have done deep analysis what happens if:
The program has a performance bug (or two...) - which might even be a bug in a third-party library
New requirements are added or requirements change
The analysis and design don't spot all the hidden inter-relationships between components
There are non-linear effects of adding load and the new load might take the hardware over a critical threshold (a threshold that is not obvious now).
These concerns are not theoretical. If they were, SW development would be trivial and projects would always be delivered on time and to budget.
However there are some heuristics I personally used that you can apply. First you need a really good understanding of the current system:
Break the existing system's functions down into small, medium and large and benchmark those on your hardware
Perform a load test of these individual functions and capture thoughput in transactions/sec, CPU cost, network traffic and disk I/O figures for as many of these transactions as possible, making sure you have representation of small, medium and large. This load test should take the system up to the point where additional load will decrease transactions/sec
Get the figures for the max transactions/sec of the current system
Understand the rate of growth of this application and plan accordingly
Perform the analysis to get an 'average' small, medium and large 'cost' in terms of CPU, RAM, disk and network. This would be of the form:
Small transaction
CPU utilization: 10ms
RAM overhead 5MB (cache)
RAM working: 100kb (eg 10 concurrent threads = 1MB, 100 threads = 10MB)
Disk I/O: 5kb (database)
Network app<->DB: 10kb
Network app<->browser: 40kb
From this analysis you should understand how much headroom you have - CPU certainly, but check that there is sufficient RAM, network and disk capacity. Eg, the CPU required for small transactions is number of small transactions per second multiplied by the CPU cost of a small transaction. Add in the CPU cost of medium transactions and large ones, and you have your CPU budget.
Make sure the DBAs are involved. They need to do the same on the DB.
Now you need to analyse your upcoming application:
Assign each features into the same small, medium and large buckets, ensuring a like-for-like matching as far as possible
Ask deep, probing questions about how many transactions/sec each feature will experience at peak
Talk about the expected rate of growth of the application
Don't forget that the system may slow as the size of the database increases
On a personal note, you are being asked to predict the unpredictable - putting your name and reputation on the line. If you say it can fit, you are owning the risk for a large software development project. If you are being pressured to say yes, you need to ensure that there are many other people's names involved along with yours - and those names should all be visible on the go/no-go decision. Not only is this more likely to ensure that all factors are considered, and that the analysis is sound, but it will also ensure that the project has many involved individuals personally aligned to its success.

How to diagnose performance issues inside mongodb by using mongostats

I've been using mongostats to diagnose overall activity inside my mongodb instance. How can I use it to also diagnose performance issues / degradation?
One field I'm really interested in learning more about is locked % and expected behavior based on the results from all the other fields.
I feels this feature is kinda vague and needs to be flushed out a bit more.
The locked % is the % of time the global write lock (remember, mongo has a process wide write lock) is taken per sample. This percentage will increase when you increase the number of writes (inserts, updates, removes, db.eval(), etc.). A high value means the database is spending a lot of time being locked waiting for writes to finish and as a result as no queries can complete until the lock is released. As such the overall query throughput will be reduced (sometimes dramatically).
"faults" means that mongo is trying to hit data that is mapped to the virtual memory space but not in physical memory. Basically it means it's hitting disk rather than memory and is an indication you do not have enough RAM (or, for example, your index isn't right balanced). If memory serves this is available on Linux only though. This should be as close to 0 as you can get it.
"qr:qw" are the read and write query queues and if these are not zero it means the server is receiving more queries than it is able to process. This isn't necessarily a problem unless this number is consistently high or growing. Indicates overal system performance is not high enough to support your query throughput.
Most other fields are pretty self explanatory or not that useful. netIn/Out is useful if you expect to be io bound. This will happen if these values approach your maximum network bandwith.

How can you insure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you insure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced due to specific processor architecture are nowhere near the magnitude of cache.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter - carefully design the software and data as if in the old day of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to x86 processor family that is not built with real-time systems in mind, so there is no real guarantee for constant time execution (CPU may reorder micro-instructions, than there is branch prediction and instruction prefetch queue which is flushed each time when CPU wrongly predicts conditional jumps...)
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
If you make all the function calls in the critical code 'inline', and minimize the number of variables you have, so that you can let them have the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way since compilers these days tend to disregard your 'register' tags)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instuctions that could change your running code.
If an interrupt happens in your code execution it WILL take longer time. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.

Resources