Intentionally high CPU usage, GCD, QOS_CLASS_BACKGROUND, and spindump - xcode

I am developing a program that happens to use a lot of CPU cycles to do its job. I have noticed that it, and other CPU intensive tasks, like iMovie import/export or Grapher Examples, will trigger a spin dump report, logged in Console:
1/21/16 12:37:30.000 PM kernel[0]: process iMovie[697] thread 22740 caught burning CPU! It used more than 50% CPU (Actual recent usage: 77%) over 180 seconds. thread lifetime cpu usage 91.400140 seconds, (87.318264 user, 4.081876 system) ledger info: balance: 90006145252 credit: 90006145252 debit: 0 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 116147448571
1/21/16 12:37:30.881 PM com.apple.xpc.launchd[1]: (com.apple.ReportCrash[705]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
1/21/16 12:37:30.883 PM ReportCrash[705]: Invoking spindump for pid=697 thread=22740 percent_cpu=77 duration=117 because of excessive cpu utilization
1/21/16 12:37:35.199 PM spindump[423]: Saved cpu_resource.diag report for iMovie version 9.0.4 (1634) to /Library/Logs/DiagnosticReports/iMovie_2016-01-21-123735_cudrnaks-MacBook-Pro.cpu_resource.diag
I understand that high CPU usage may be associated with software errors, but some operations simply require high CPU usage. It seems a waste of resources to watch-dog and report processes/threads that are expected to use a lot of CPU.
In my program, I use four serial GCD dispatch queues, one for each core of the i7 processor. I have tried using QOS_CLASS_BACKGROUND, and spin dump recognizes this:
Primary state: 31 samples Non-Frontmost App, Non-Suppressed, Kernel mode, Thread QoS Background
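For reference, a serial queue with background QoS can be created roughly like this. This is a minimal sketch using the libdispatch C API; the queue label, the do_heavy_work function, and the hard-coded count of four queues are illustrative placeholders, not the actual project code:

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    /* Placeholder for the CPU-intensive work run on each queue. */
    static void do_heavy_work(size_t n) {
        volatile double x = 0;
        for (long i = 0; i < 100000000L; i++) x += i * 0.5;
        printf("queue %zu done (%f)\n", n, x);
    }

    int main(void) {
        /* One serial queue per core, all created with background QoS. */
        dispatch_queue_attr_t attr =
            dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL,
                                                    QOS_CLASS_BACKGROUND, 0);
        dispatch_group_t group = dispatch_group_create();
        for (size_t i = 0; i < 4; i++) {
            dispatch_queue_t q = dispatch_queue_create("com.example.worker", attr);
            dispatch_group_async(group, q, ^{ do_heavy_work(i); });
        }
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        return 0;
    }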
The fan spins much more slowly when using QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED, and the program takes about 2x longer to complete. As a side issue, Activity Monitor still reports the same % CPU usage and even longer total CPU Time for the same task.
Based on Apple's Energy Efficiency documentation, QOS_CLASS_BACKGROUND seems to be the proper choice for something that takes a long time to complete:
Work takes significant time, such as minutes or hours.
So why then does it still complain about using a lot of CPU time? I've read about methods to disable spindump, but these methods disable it for all processes. Is there a programmatic way to tell the system that this process/thread is expected to use a lot of CPU, so don't bother watch-dogging it?

Related

Flame graph(perf record) cannot display accurate CPU idle usage

When the CPU usage is 60%, flame graphs (perf record) are used to capture the CPU usage. Why is the 40% of idle-related stack usage not displayed in the flame graphs? The idle stack's share is often less than 5%.
For flame graphs, the point is normally to measure where a process spends CPU time while it's running, not which blocking functions it calls that make it sleep, or where it gets scheduled out and sleeps when it doesn't want to.
I capture performance for one CPU (processor), not one process. According to the operating system design, if there is no active task on the CPU, the CPU calls an idle-waiting function; for example, Linux often calls schedule_idle until it is interrupted by a new task. Therefore, it is expected that schedule_idle can be found in the flame graph and accounts for 40% of the CPU usage.
Perf events like cycles don't increment when the clock is halted (e.g. cycles is cpu_clk_unhalted.thread_p or similar). If you really wanted to see time spent idle, you might be able to disable idle power saving to get Linux to just spin in a loop instead of using x86 monitor/mwait or even basic hlt to put the CPU into a C-state where the clock doesn't tick.
Or run your code pinned to one logical core, and on the other logical core, pin a task that runs the pause instruction in a loop. So the physical core's clock keeps ticking for the core you're counting events for.
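A minimal sketch of such a pause-loop task, assuming Linux on x86; the core number 3 is a placeholder, pick the hyperthread sibling of the core you are profiling (e.g. from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* Pin this process to logical core 3 (assumed sibling of the profiled core). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* Spin on the pause instruction so the physical core's clock keeps
           ticking without competing much for execution resources. */
        for (;;) _mm_pause();
    }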
You should still get counts for cpu_clk_unhalted.thread_any ([Core cycles when at least one thread on the physical core is not in halt state]) when recording that event on the logical core with your task, even when that logical core is asleep.
And you can also record counts for cpu_clk_unhalted.thread to count cycles when this (hardware) thread aka logical core isn't halted, to know how much CPU time you actually used. (Or use the software event task-clock for that.)
Use perf list to see events available on your CPU, and read their descriptions carefully.

Disk latency causing CPU spikes on EC2 instance

We are having an interesting issue where we are seeing a CPU spike on our EC2 instance and at the same time we are seeing a spike in disk latency. Here is the pattern for CPU spike
CPU spike from 50% to 100% within 30 seconds
It stays at 100% utilization for two minutes
CPU utilization drops from 100% to almost 0 in 10 seconds. At the same time, disk latency is almost back to normal
This issue has happened on different AWS EC2 instances a couple of times over a week and is still happening. In all cases we see the CPU spike along with the disk latency spike, with the CPU spike following a pattern similar to the one above.
We put process-monitoring tools in place to check whether any particular process was occupying the CPU. They revealed that every process on the EC2 instance starts taking roughly twice the CPU. For example, our app server's CPU utilization increases from 0.75% to 1.5%, with similar observations for Nginx and other processes. No single process was occupying more than 8% CPU. We studied our traffic pattern and there is nothing unusual that could cause this. So the question is:
Can an increase in disk latency cause a CPU spike pattern like the one above, or, in general, can disk latency result in a CPU spike?
Here is my bet: you are running t2/t3 machines, which are burstable instances. You can use 30% of the CPU all the time, and a credit system creates a fair, predictable usage model for the remaining 70%. You earn credits by running the instance and you spend credits by going over 30% CPU usage.
You are running out of credits, and AWS then reduces your access to the CPU. The system runs smoothly again once credits are added back to your balance.
t2 and t3 don't have the same credit system; you can find the details here: CPU Credits and baseline
You have two solutions:
Take a bigger instance, so you will have more credits per hour and a better baseline, or move to another family like c5, m5, r5, etc.
Enable the unlimited mode option for your t3 instances
I would suggest faster storage. CPU time always adds up to 100%, so load that seems to appear for an "unknown" reason is really time being attributed to one of these categories:
idle time (notice here: this is what you consider FREE CPU, which is why I say it adds up to 100%)
user time (normal usage)
system time (system usage)
iowait (your case: the CPU waiting for the HDD/SSD to answer)
nice time (low priority processes that were not included in user time)
interrupt time (external device "talk" time - could be your case if you have many USB devices etc. - rather unlikely)
softirq (queued work from a processed interrupt - see above)
steal time (the case that Clement is describing)
I would suggest checking which one applies in your case;
you can run the following to get the breakdown:
$ sudo apt-get install sysstat
$ mpstat -P ALL 1
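If you want to compute the same breakdown yourself, the counters mpstat reads come from /proc/stat. Here is a minimal C sketch parsing the aggregate "cpu" line (field order as documented in proc(5)); note the values are cumulative since boot, so a real monitor would sample twice and take the difference:

    #include <stdio.h>

    int main(void) {
        const char *names[] = {"user", "nice", "system", "idle", "iowait",
                               "irq", "softirq", "steal"};
        unsigned long long v[8] = {0};
        FILE *f = fopen("/proc/stat", "r");
        if (!f) { perror("/proc/stat"); return 1; }
        if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]) != 8) {
            fclose(f);
            return 1;
        }
        fclose(f);
        unsigned long long total = 0;
        for (int i = 0; i < 8; i++) total += v[i];
        for (int i = 0; i < 8; i++)
            printf("%-8s %5.1f%%\n", names[i], 100.0 * v[i] / total);  /* share since boot */
        return 0;
    }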
From here there are two options for you :)
EBS allows you to run an IO-optimized volume type called "io1" (mid price - mid speed)
Change the machine and use one built on the "Nitro System" (provides bare-metal capabilities - that is, as if you had an actual NVMe drive connected directly - maximum possible speed)
Instance       vCPU   ECU   Memory    Instance Storage     Price
m5.2xlarge     8      37    32 GiB    EBS Only             $0.384 per Hour
m5d.2xlarge    8      37    32 GiB    1 x 300 NVMe SSD     $0.452 per Hour
Source: Instances built on the Nitro System

Set CPU affinity for profiling

I am working on a calculation-intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time a particular algorithm takes varies. For example, sometimes running the algorithm 100 times takes about 1100 ms, and another time the same 100 runs take much longer, like 2000 or even 3000 ms. It may vary even within the same run. So it is impossible to measure improvement when I optimize a piece of code; it's just unreliable.
So basically I want to make sure one CPU core is dedicated to my app. The PC has an old dual-core Intel E5300 CPU running Windows 7 32-bit, so I can't just set process affinity and give up one core forever; that would make the computer very slow for daily tasks. I need the other apps to use a specific core while I'm profiling, and when I'm done, the CPU affinities should go back to normal. Having a bat file to do the task would be a fantastic solution.
My question is: is it possible to have a bat file that sets the process affinity for every process on Windows 7?
PS: The algorithm is correct and every time runs the same code path. I created some object pool so after first run, zero memory is allocated. I also profiled memory allocation with dottrace and it showed no allocation after first run. So I don't believe GC is triggered when the algorithm is working. Physical memory is available and system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It produced more uniform results, especially when background apps like Visual Studio and dotTrace were running. Further investigation into the divergent execution times revealed that the root cause of the unpredictability was CPU overheating: the CPU overheat alarm was off while the temperature was over 100°C! After fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is c#, so... I see the System.Diagnostics.Process object has a ProcessorAffinity member that should perform the same function.
I am just not sure this will stabilize the CPU times in the way you expect. A single busy task that is not doing IO should stay scheduled on the same core anyway unless another thread interrupts it, so I think your variable times are due more to other threads/processes interrupting your algorithm than to the OS randomly shunting your thread to a different core. Unless you set the affinity of every other thread in the system to exclude your preferred core, I can't see this helping.
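For what it's worth, here is a minimal sketch of the Win32 call itself; the mask value 0x2 ("second logical core only") is a placeholder, and the C# Process.ProcessorAffinity property wraps the same mechanism:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Bit mask of allowed logical processors: 0x2 = core 1 only. */
        DWORD_PTR mask = 0x2;
        if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
            fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }
        printf("Pinned to core 1; run the benchmark from this process.\n");
        /* ... run the algorithm being profiled ... */
        return 0;
    }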

What exactly is CPU load if instructions are executed one at a time?

I know this question has been asked many times in many different manners, but it's still not clear for me what the CPU load % means.
I'll start explaining how I perceive the concepts now (of course, I might, and sure will, be wrong):
A CPU core can only execute one instruction at a time. It will not execute the next instruction until it finishes executing the current one.
Suppose your box has one single CPU with one single core. Parallel computing is hence not possible. Your OS's scheduler will pick up a process, set the IP to the entry point, and send that instruction to the CPU. It won't move to the next instruction until the CPU finishes executing the current instruction. After a certain amount of time it will switch to another process, and so on. But it will never switch to another process if the CPU is currently executing an instruction. It will wait until the CPU becomes free to switch to another process. Since you only have one single core, you can't have two processes executing simultaneously.
I/O is expensive. Whenever a process wants to read a file from the disk, it has to wait until the disk accomplishes its task, and the current process can't execute its next instruction until then. The CPU is not doing anything while the disk is working, and so our OS will switch to another process until the disk finishes its job in order not to waste time.
Following these principles, I've come myself to the conclusion that CPU load at a given time can only be one of the following two values:
0% - Idle. CPU is doing nothing at all.
100% - Busy. CPU is currently executing an instruction.
This is obviously false, as taskmgr reports CPU usage values of 1%, 12%, 15%, 50%, etc.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core (as reported by taskmgr)? While that given process is executing, what happens with the 99%?
What does it mean that the overall CPU usage is 19% (as reported by Rainmeter at the moment)?
If you look in the task manager on Windows there is the Idle process, which does exactly that: it just shows the amount of cycles spent not doing anything useful. Yes, the CPU is always busy, but it might just be running in a loop waiting for useful things to come.
Since you only have one single core, you can't have two processes executing simultaneously.
This is not really true. Yes, true parallelism is not possible with a single core, but you can create the illusion of it with preemptive multitasking. Yes, it is impossible to interrupt an instruction, but that is not a problem because most instructions take a tiny amount of time to finish. The OS divides time into time slices, which are significantly longer than the execution time of a single instruction.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core
Most of the time applications are not doing anything useful. Think of an application that waits for the user to click a button before it starts processing something. This app doesn't need the CPU, so it sleeps most of the time, or every time it gets a time slice it just goes back to sleep (see the event loop in Windows; GetMessage is blocking, so the thread sleeps until a message arrives). So what does CPU load really mean? When the app receives events or data to process, it does work instead of sleeping, and if it utilizes X% of the CPU it means that over the sampling period the app used X% of the CPU time. CPU usage is an average metric.
PS: To summarize the concept of CPU load, think of speed (in the physics sense). There are instantaneous and average speeds, and likewise CPU load has instantaneous and average measurements. The instantaneous value is always either 0% or 100%, because at any point in time a process either uses the CPU or it doesn't. If a process used 100% of the CPU for 250 ms and then didn't use it for the next 750 ms, we can say that the process loaded the CPU at 25% with a sampling period of 1 second (an average measurement only makes sense with respect to some sampling period).
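To make that concrete, here is a minimal sketch of a process that burns the CPU for roughly 250 ms out of every second; a task manager sampling over one-second periods should report it at roughly 25% of one core (assumes a POSIX clock API; the 250/750 split is arbitrary):

    #define _POSIX_C_SOURCE 200809L
    #include <time.h>

    /* Monotonic timestamp in milliseconds. */
    static long long now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    int main(void) {
        for (;;) {
            long long start = now_ms();
            while (now_ms() - start < 250) { /* busy-wait: ~250 ms at 100% */ }
            struct timespec rest = { 0, 750L * 1000000L };
            nanosleep(&rest, NULL);          /* sleep: ~750 ms at 0% */
        }
    }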
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator ... sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they're in for delays.
This is basically what CPU load is. "Cars" are processes using a slice of CPU time ("crossing the bridge") or queued up to use the CPU. Unix refers to this as the run-queue length: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.
Also see: http://en.wikipedia.org/wiki/Load_(computing)
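On Unix-like systems you can also read those run-queue averages programmatically; here is a minimal sketch using getloadavg (available on Linux and macOS):

    #include <stdlib.h>
    #include <stdio.h>

    int main(void) {
        /* 1-, 5- and 15-minute load averages, i.e. the average run-queue length. */
        double avg[3];
        if (getloadavg(avg, 3) != 3) return 1;
        printf("load averages: %.2f %.2f %.2f\n", avg[0], avg[1], avg[2]);
        return 0;
    }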

Will the workload (usage) of a CPU core be 100% if there are persistent cache misses?

That is, if the core spends most of its time waiting for data from RAM or the L3 cache because of cache misses, but the system is real-time (real-time thread priority) and the thread is pinned (affinity) to the core and runs without thread/context switching, what load (usage) should the CPU core show on modern x86_64?
In other words, is CPU usage only shown as decreasing when time is logged to Idle?
And if anyone knows, is the behavior different in this case for other processors: ARM, Power[PC], SPARC?
Clarification: CPU usage as shown in the standard Task Manager on Windows.
A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
Morphcore / other on-the-fly changing between hyperthreading and a more powerful single core could make there be a difference between a thread that keeps many execution units busy, vs. a thread that is blocked on cache misses a lot of the time.
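To see this in practice, here is a minimal C sketch (the buffer size and iteration count are arbitrary): it chases randomly shuffled pointers through a buffer much larger than L3, so the core spends most of its cycles stalled on cache misses, yet top or Task Manager will still report the thread at roughly 100% of one core.

    #include <stdlib.h>
    #include <stdio.h>

    #define N (1u << 24)   /* 16M entries = 128 MiB of size_t on 64-bit, well beyond L3 */

    int main(void) {
        size_t *next = malloc(N * sizeof(size_t));
        if (!next) return 1;
        for (size_t i = 0; i < N; i++) next[i] = i;
        /* Sattolo's algorithm: shuffle into a single N-element cycle so the
           chase below visits every entry in a cache-unfriendly order. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        size_t p = 0;
        for (long long k = 0; k < 100000000LL; k++)
            p = next[p];        /* dependent loads: latency-bound, mostly cache misses */
        printf("%zu\n", p);     /* keep the loop from being optimized away */
        free(next);
        return 0;
    }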
I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to the OS very often: when they need input from the user or when they are done with the signal/message. GUI programs are basically just big loops, and at each iteration control is given to the OS until the next message.
When the OS has control it schedules other threads/tasks, and if no other actions are needed it just enters the idle process (long ago a tight loop, now a sleep state) until the next interrupt. This is the Idle Time.
Time spent on an ISR processing user input is considered idle time by any OS.
A cache miss there would still be considered idle time.
A heavy program takes more time to complete the work for a given message, thereby returning control to the OS, say, 2 times per second instead of 20.
If the OS measures that in the last second it got control for only 20 ms, then the CPU usage is (1000-20)/1000 = 98%.
This has nothing to do with the optimal use of the CPU architecture: as said, stalls can occur in the OS code and still be part of the Idle time statistic.
CPU utilization at the pipeline level is not what is measured, and it is orthogonal to the OS statistics.
CPU usage is meant to be used by sysadmins; it is a measure of the load you put on a system, not a measure of how efficiently a program's assembly was generated.
Sysadmins can't help with that, but measuring how often the OS got control back (without preempting) is a measure of how much load a program is putting on the system.
And sysadmins can definitely terminate heavy programs.

Resources