High load, low CPU, low iowait - why? - performance

Look at those peaks in the first graph - which factor can cause this?
[graph: CPU usage over time]

There's a lot of stuff going on in any general-purpose computer. When I performance-profiled apps in a former life, I saw this all the time and factored it out.
It's caused by a whole host of sources: the processor dealing with interrupts, disk maintenance routines, file-system cleanup, completely useless background apps that were installed without your knowledge as automatically launched services, etc.
Your plot of idle time is a little disconcerting - it is awfully low. What apps do you have running that take up all that processing? Also, if memory is low, say because you have 20 or 30 browser tabs/windows open, your CPU load will go through the roof due to all the paging and context switching.

Related

What methodology would you use to measure the load capacity of a software server application?

I have a high-performance software server application that is expected to get increased traffic in the next few months.
I was wondering what approach or methodology is good to use in order to gauge if the server still has the capacity to handle this increased load?
I think you're looking for Stress Testing, and the scenario would be something like:
- Create a load test simulating current real application usage.
- Start with the current number of users and gradually increase the load until:
  - you reach the "increased traffic" amount,
  - errors start occurring, or
  - you start observing performance degradation,
  whichever comes first.
Depending on the outcome, you can either state that your server can handle the increased load without any issues, or you will have found the saturation point and the first bottleneck.
You might also want to execute a Soak Test - leave the system under prolonged high load for several hours or days; this way you can detect memory leaks or other capacity problems.
More information: Why ‘Normal’ Load Testing Isn’t Enough
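To make the ramp-up idea concrete, here is a minimal C++ sketch of a stepped load generator. It is only an illustration: sendRequest() is a hypothetical placeholder for one realistic user transaction against your server, and in practice you would normally drive this from a proper load-testing tool rather than rolling your own.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> ok{0}, failed{0};

// Placeholder: perform one realistic user transaction and report success.
bool sendRequest() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return true;
}

void user(std::atomic<bool>& stop) {
    while (!stop) {
        if (sendRequest()) ++ok; else ++failed;
    }
}

int main() {
    const int startUsers = 10, maxUsers = 200, step = 10;
    std::atomic<bool> stop{false};
    std::vector<std::thread> users;

    for (int target = startUsers; target <= maxUsers; target += step) {
        while ((int)users.size() < target)
            users.emplace_back(user, std::ref(stop));
        long okBefore = ok, failedBefore = failed;
        std::this_thread::sleep_for(std::chrono::seconds(60));   // one load step
        std::cout << users.size() << " users: " << (ok - okBefore) << " ok, "
                  << (failed - failedBefore) << " errors in this step\n";
        // Stop ramping when errors appear or throughput stops scaling.
    }
    stop = true;
    for (auto& t : users) t.join();
}

Plot throughput and error counts per step; the step where errors appear or throughput stops growing is your saturation point.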
Test the product with one-tenth the data and traffic. Be sure the activity is 'realistic'.
Then consider what will happen as traffic grows -- will the RAM, disk, CPU, network, etc. usage grow linearly or not?
While you are doing that, look for "hot spots". Optimize them.
Will you be using web pages? Databases? Etc. Each of these things scales differently. (In other words, you have not provided enough details in your question.)
Most canned benchmarks focus on one small aspect of computing; applying the results to a specific application is iffy.
I would start by collecting base line data on critical resources - typically, CPU, memory usage, disk usage, network usage - and track them over time. If any of those resources show regular spikes where they remain at 100% capacity for more than a fraction of a second, under current usage, you have a bottleneck somewhere. In this case, you cannot accept additional load without likely outages.
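As a rough illustration of that baseline collection, here is a minimal CPU sampler. It assumes a Linux host where aggregate CPU time is exposed in /proc/stat; on Windows you would read the equivalent Performance Monitor counters instead, and memory, disk, and network would be tracked the same way from their own counters.

// Rough CPU-utilisation sampler (Linux only: reads /proc/stat).
// Flags any sampling interval that is effectively 100% busy.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

struct CpuTimes { unsigned long long idle = 0, total = 0; };

CpuTimes readCpu() {
    std::ifstream stat("/proc/stat");
    std::string cpu;
    unsigned long long user, nice, system, idle, iowait, irq, softirq;
    stat >> cpu >> user >> nice >> system >> idle >> iowait >> irq >> softirq;
    CpuTimes t;
    t.idle = idle + iowait;
    t.total = user + nice + system + idle + iowait + irq + softirq;
    return t;
}

int main() {
    CpuTimes prev = readCpu();
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        CpuTimes cur = readCpu();
        double busy = 1.0 - double(cur.idle - prev.idle) /
                            double(cur.total - prev.total);
        std::cout << "cpu busy: " << busy * 100 << "%"
                  << (busy > 0.99 ? "  <-- saturated interval" : "") << "\n";
        prev = cur;
    }
}

Logging samples like these over days or weeks gives you the baseline to compare against once the extra traffic arrives.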
Next, I'd start figuring out what the bottleneck resource for your application is - it varies between applications, but in most cases it's the bottleneck resource that stops you from scaling further. Your CPU might be almost idle, but you're thrashing the disk I/O, for instance. That's a tricky process - load and stress testing are the way to go.
If you can resolve the bottleneck by buying better hardware, do so - it's much cheaper than rewriting the software. If you can't buy better hardware, look at load balancing. If you can't load balance, you've got to look at application architecture and implementation and see if there are ways to move the bottleneck.
It's quite typical for the bottleneck to move from one resource to the next - you've got the CPU behaving, but then when you increase traffic you're spiking disk I/O; once you resolve that, you may face another CPU challenge.

windowed OpenGL first frame delay after idle

I have a windowed WinAPI/OpenGL app. The scene is drawn rarely (compared to games) in WM_PAINT, mostly triggered by user input - WM_MOUSEMOVE, clicks, etc.
I noticed that when the scene has not been moved by the mouse for a while (the application is "idle") and the user then starts some mouse action, the first frame is drawn with an unpleasant delay - around 300 ms. The following frames are fast again.
I implemented a 100 ms timer that only calls InvalidateRect, which is later followed by WM_PAINT and a scene draw. This "fixed" the problem, but I don't like this solution.
I'd like to know why this is happening, and also some tips on how to tackle it.
Does the OpenGL render context free up resources when not used? Or could this be caused by some system behaviour, like processor underclocking/energy saving, etc.? (Although I noticed that the processor runs underclocked even when the app is under "load".)
This sounds like the Windows virtual memory system at work. The sum of all the memory use of all active programs is usually greater than the amount of physical memory installed on your system. So Windows swaps out idle processes to disc, according to whatever rules it follows, such as the relative priority of each process and the amount of time it has been idle.
You are preventing the swap out (and delay) by artificially making the program active every 100ms.
If a swapped out process is reactivated, it takes a little time to retrieve the memory content from disc and restart the process.
It's unlikely that OpenGL is responsible for this delay.
You can improve the situation by starting your program with a higher priority.
https://superuser.com/questions/699651/start-process-in-high-priority
You can also use the VirtualLock function to prevent Windows from swapping out part of the memory, but this is not advisable unless you REALLY know what you are doing!
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx
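A Windows-only sketch of those two suggestions (raising the process priority and locking a latency-critical buffer into RAM). The 16 MB buffer and the working-set sizes are arbitrary illustrations, not recommendations:

#include <windows.h>
#include <cstdio>

int main() {
    // 1) Run at a higher priority so the scheduler favours this process.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());

    // 2) Pin a latency-critical buffer into physical RAM so it cannot be
    //    paged out while the app is idle. VirtualLock is limited by the
    //    process working-set size, so enlarge that first if needed.
    const SIZE_T size = 16 * 1024 * 1024;                 // example: 16 MB
    SetProcessWorkingSetSize(GetCurrentProcess(), size * 2, size * 4);
    void* buffer = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE,
                                PAGE_READWRITE);
    if (buffer && !VirtualLock(buffer, size))
        std::printf("VirtualLock failed: %lu\n", GetLastError());

    // ... keep the data you must touch on the first frame in this buffer ...
    return 0;
}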
EDIT: You can certainly improve things by adding more memory, and 4 GB does sound low for a modern PC, especially if you run Chrome with multiple tabs open.
If you want to be scientific before spending any hard-earned cash :-), open Performance Monitor and look at Cache Faults/sec. This will show the swap activity on your machine. (I have 16 GB on my PC, so this number is mostly very low.) To make sure you learn something, I would check Cache Faults/sec before and after the memory upgrade - so you can quantify the difference!
Finally, there is nothing wrong with the solution you already found - kick-starting the graphics app every 100 ms or so.
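For reference, the 100 ms keep-alive workaround the asker describes is only a few lines in the window procedure; a sketch (the timer id and the 100 ms interval are arbitrary):

// Relevant part of the window procedure for the keep-alive workaround (Win32).
#include <windows.h>

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp) {
    switch (msg) {
    case WM_CREATE:
        SetTimer(hwnd, 1 /*timer id*/, 100 /*ms*/, nullptr);
        return 0;
    case WM_TIMER:
        // Forces a WM_PAINT even while the user is idle, which keeps the
        // process and GPU driver "warm" and avoids the first-frame stall.
        InvalidateRect(hwnd, nullptr, FALSE);
        return 0;
    case WM_PAINT:
        // ... draw the OpenGL scene, then SwapBuffers ...
        break;
    case WM_DESTROY:
        KillTimer(hwnd, 1);
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProc(hwnd, msg, wp, lp);
}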
The problem was in the NVIDIA driver's global 3D setting "Power management mode".
The options "Optimal Power" and "Adaptive" save power and cause the problem.
Only "Prefer Maximum Performance" does the right thing.

maxed CPU performance - stack all tasks or aim for less than 100%?

I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed out core (such as overhead, swapping across tasks) or is it faster to queue everything?
Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU wastes time at every timeslice by going into the kernel, saving all the registers, swapping the memory mappings, and starting the next task. Then the new task has to warm the CPU caches back up, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.
Yes - it's called queueing theory (https://en.wikipedia.org/wiki/Queueing_theory). There are many different models (https://en.wikipedia.org/wiki/Category:Queueing_theory) for a range of different problems. I'd suggest you scan them and pick the one most applicable to your workload, then go and read up on how to avoid the worst outcomes for that model, or pick a different, better model for dispatching your workload.
Although the graph at this link (https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png) applies to traffic, it will give you an idea of what happens to response times as your CPU utilisation increases. It shows that you'll reach an inflection point after which things get slower and slower.
More work is arriving than can be processed, with subsequent work waiting longer and longer until it can be dispatched.
The more cores you have, the further to the right you push that inflection point, but the faster things go bad once you reach it.
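To put a number on that inflection point, the simplest textbook model (M/M/1: one queue feeding one server) gives a mean response time of R = S / (1 - rho), where S is the mean service time and rho the utilisation. A quick sketch, with an illustrative 10 ms service time:

#include <cstdio>

int main() {
    // Mean response time for an M/M/1 queue: R = S / (1 - rho),
    // where S is the mean service time and rho the utilisation (0..1).
    const double serviceTimeMs = 10.0;          // illustrative: 10 ms per request
    for (int pct = 10; pct <= 95; pct += 5) {
        double rho = pct / 100.0;
        std::printf("utilisation %2d%%  ->  mean response %6.1f ms\n",
                    pct, serviceTimeMs / (1.0 - rho));
    }
    return 0;
}

At 50% utilisation the mean response is 20 ms; at 95% it is already 200 ms - that is the knee the graph shows.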
I would also note that unless you've got some really serious cooling in place, you are going to cook your CPU. Depending on its design, it will either slow itself down, making your problem worse, or you'll trigger its thermal overload protection.
So a simplistic design for 8 cores would be: 1 thread to manage things and add tasks to the work queue, and 7 threads pulling tasks from the work queue (see the sketch below). If the tasks need to be performed within a certain time, you can add a TimeToLive value so that stale tasks can be discarded rather than executed needlessly. As you are almost certainly running your application on an OS that uses a pre-emptive threading model, consider things like using processor affinity where possible, because as @Zan-Lynx says, task/context switching hurts. Be careful not to try to rebuild your OS's thread management, as you'll probably wind up in conflict with it.
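Here is a rough, portable C++ sketch of that "1 manager + N workers" layout with a TimeToLive check. The affinity calls are OS-specific and omitted, and the task payload is just a placeholder:

#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Task {
    std::function<void()> work;
    Clock::time_point deadline;          // "TimeToLive": discard if missed
};

std::queue<Task> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void worker() {
    for (;;) {
        Task t;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return done || !q.empty(); });
            if (q.empty()) return;       // done and queue drained
            t = std::move(q.front());
            q.pop();
        }
        if (Clock::now() > t.deadline) continue;   // too late, drop it
        t.work();
    }
}

int main() {
    const unsigned workers = 7;          // leave one core for the manager/OS
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) pool.emplace_back(worker);

    // Manager thread (here: main) feeds the queue.
    for (int i = 0; i < 100; ++i) {
        std::lock_guard<std::mutex> lock(m);
        q.push({[i] { std::cout << "task " << i << "\n"; },
                Clock::now() + std::chrono::seconds(5)});
        cv.notify_one();
    }

    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    for (auto& t : pool) t.join();
    return 0;
}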
tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At the app level, they each process a thousand customer records or whatever. That is fixed; it is a constant no matter what happens on the hardware.
At the language level, again it is fixed: C++, Java, or Python will execute a fixed number of app instructions or bytecodes. (We'll gloss over GC overhead here, and page-fault and scheduling details.)
At the assembly level, again it is fixed: some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions, you only care about how long it takes to execute those instructions. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components to implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it be a miss in last-level cache so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make elapsed time quite noticeably longer.
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.
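As a back-of-envelope illustration of the 0.2*N versus 0.3*N point above, using the standard average-memory-access-time formula AMAT = hit time + miss rate * miss penalty (the 1-cycle hit and 100-cycle miss penalty are illustrative assumptions, not measurements):

#include <cstdio>

int main() {
    const double hitCycles = 1.0, missPenaltyCycles = 100.0;
    for (double missRate : {0.2, 0.3}) {
        double amat = hitCycles + missRate * missPenaltyCycles;
        std::printf("miss rate %.0f%% -> ~%.0f cycles per memory access\n",
                    missRate * 100, amat);
    }
    // 21 vs 31 cycles: a 10-point rise in miss rate makes memory-bound code
    // roughly 50% slower, even though the instruction count is unchanged.
    return 0;
}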

Optimal CPU utilization thresholds

I have built software that I deploy on Windows 2003 server. The software runs as a service continuously and it's the only application on the Windows box of importance to me. Part of the time, it's retrieving data from the Internet, and part of the time it's doing some computations on that data. It's multi-threaded -- I use thread pools of roughly 4-20 threads.
I won't bore you with all those details, but suffice it to say that as I enable more threads in the pool, more concurrent work occurs and CPU use rises (as does demand for other resources, like bandwidth, although that's of no concern to me -- I have plenty).
My question is this: should I simply try to max out the CPU to get the best bang for my buck? Intuitively, I don't think it makes sense to run at 100% CPU; even 95% CPU seems high, almost like I'm not giving the OS much room to do what it needs to do. I don't know the right way to identify the best balance. I'm guessing I could measure and measure and would probably find that the best throughput is achieved at an average CPU utilization of 90% or 91%, etc., but...
I'm just wondering if there's a good rule of thumb about this??? I don't want to assume that my testing will take into account all kinds of variations of workloads. I'd rather play it a bit safe, but not too safe (or else I'm underusing my hardware).
What do you recommend? What is a smart, performance minded rule of utilization for a multi-threaded, mixed load (some I/O, some CPU) application on Windows?
Yep, I'd suggest 100% is thrashing, so I wouldn't want to see processes running like that all the time. I've always aimed for 80% to get a balance between utilization and room for spikes / ad-hoc processes.
An approach I've used in the past is to crank up the pool size slowly and measure the impact (both on CPU and on other constraints such as IO) - you never know, you might find that IO suddenly becomes the bottleneck.
CPU utilization shouldn't matter much for this I/O-intensive workload - you care about throughput. So try using a hill-climbing approach: programmatically inject and remove worker threads and track completion progress...
If you add a thread and it helps, add another one. If you try a thread and it hurts, remove it.
Eventually this will stabilize.
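A rough sketch of such a hill-climbing controller: doOneWorkItem() stands in for one unit of your real (I/O-bound) work, and the measurement interval and iteration count are arbitrary.

#include <algorithm>
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> completed{0};
std::atomic<int> activeWorkers{0};           // -1 = shut everything down

void doOneWorkItem() {                       // placeholder for one real work item
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}

void worker(int id) {
    while (activeWorkers.load() != -1) {
        if (id < activeWorkers.load()) {     // this worker is currently active
            doOneWorkItem();
            ++completed;
        } else {                             // parked: removed by the controller
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
    }
}

int main() {
    std::vector<std::thread> pool;
    int threads = 2, step = +1;
    long lastThroughput = 0;

    for (int round = 0; round < 20; ++round) {
        activeWorkers = threads;
        while ((int)pool.size() < threads)
            pool.emplace_back(worker, (int)pool.size());

        long before = completed;
        std::this_thread::sleep_for(std::chrono::seconds(2));
        long throughput = completed - before;
        std::cout << threads << " threads -> " << throughput << " items\n";

        // Keep moving in the same direction while throughput improves,
        // reverse direction when it drops. This settles near the optimum.
        if (throughput < lastThroughput) step = -step;
        threads = std::max(1, threads + step);
        lastThroughput = throughput;
    }
    activeWorkers = -1;
    for (auto& t : pool) t.join();
    return 0;
}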
If this is a .NET based app, hill climbing was added to the .NET 4 threadpool.
UPDATE:
Hill climbing is a control-theory-based approach to maximizing throughput; you can call it trial and error if you want, but it is a sound approach. In general, there isn't a good 'rule of thumb' to follow here because the overheads and latencies vary so much that it's not really possible to generalize. The focus should be on throughput and task/thread completion, not CPU utilization. For example, it's pretty easy to peg the cores with coarse- or fine-grained synchronization without actually making a difference in throughput.
Also, regarding .NET 4: if you can reframe your problem as a Parallel.For or Parallel.ForEach, then the thread pool will adjust the number of threads to maximize throughput, so you don't have to worry about this.
-Rick
Assuming nothing else of importance but the OS runs on the machine:
If your load is constant, you should aim for 100% CPU utilization; everything else is a waste of CPU. Remember that the OS handles the threads, so it is indeed able to run - it's hard to starve the OS with a well-behaved program.
But if your load is variable and you expect peaks you should take into consideration, I'd say 80% CPU is a good threshold to use - unless you know exactly how that load will vary and how much CPU it will demand, in which case you can aim for the exact number.
If you simply give your threads a low priority, the OS will do the rest and take cycles as it needs to do its own work. Server 2003 (and most server OSes) are very good at this; there's no need to try to manage it yourself.
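A minimal Windows sketch of that suggestion (the actual work loop is elided): the worker threads run flat out, but because they sit below normal priority, the scheduler lets everything else on the box jump ahead of them.

#include <windows.h>

DWORD WINAPI workerMain(LPVOID) {
    // Lower just this thread's priority.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
    // ... pull work items and process them at full speed ...
    return 0;
}

int main() {
    // Or drop the whole process instead of individual threads:
    SetPriorityClass(GetCurrentProcess(), BELOW_NORMAL_PRIORITY_CLASS);
    HANDLE h = CreateThread(nullptr, 0, workerMain, nullptr, 0, nullptr);
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}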
I have also used 80% as a general rule-of-thumb for target CPU utilization. As some others have mentioned, this leaves some headroom for sporadic spikes in activity and will help avoid thrashing on the CPU.
Here is a little (older but still relevant) advice from the Weblogic crew on this issue: http://docs.oracle.com/cd/E13222_01/wls/docs92/perform/basics.html#wp1132942
If you feel your load is very even and predictable you could push that target a little higher, but unless your user base is exceptionally tolerant of periodic slow responses and your project budget is incredibly tight, I'd recommend adding more resources to your system (adding a CPU, using a CPU with more cores, etc.) over making a risky move to try to squeeze out another 10% CPU utilization out of your existing platform.

Why do my processes only consume 5% of the processor's power?

Not sure if this is the appropriate place for this question, but it seems related to threading and system resources and all that.
Why does my Task Manager show that the System Idle Process is using 90%+ of the CPU power when I have 3 different processes going!?!?
Is it because of I/O bottlenecks?
For example, if I do an SVN checkout, and empty my recycle bin, and browse the web at the same time, why is the System Idle process at 97%, and the other processes around 1% each? None of them seem to actually be going very fast.
Mostly the processes are waiting for disk or network operations to complete, or waiting for user input.
You might think you have a fast disk or network connection, but compared to memory/CPU it's like walking to the nearest library, looking a book up in the catalog and finding it on the shelf, versus already having the book in your hand.
This is why you pay thousands of dollars for 10,000 and 15,000 rpm SCSI drives (or even more for SANs) on high-performance servers.
I can't say for sure. But I'd say I/O bottlenecks would be a large part of it. In fact, I wouldn't imagine any of the tasks you described would be very CPU intensive.
Now, try to re-encode a raw AVI file into DIVX format while simultaneously rendering a 3D animation in Maya, and your CPU should be plenty busy.
It really depends on what your processes are doing. If they are IO bound, then there's a good chance that they're sitting and waiting most of the time.
If they're winforms apps that wait for user input, then they sit there, doing nothing, and waiting for input.
SVN checkout and recycle bin emptying are both very disk-heavy activities with not much demand on the CPU, and web browsing is very spiky in terms of CPU usage (it spikes when rendering pages, for example, but costs very little once that is done).
If you want to see your CPU sustain high utilization, do something that is almost purely CPU/memory based, like Folding@home.
