Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
We are trying to understand how the Windows CPU scheduler works in order to optimize our applications and achieve the best possible infrastructure-to-real-work ratio. There are some things in xperf that we don't understand, and we would like to ask the community to shed some light on what is really happening.
We initially started to investigate these issues when we got reports that some servers were "slow" or "unresponsive".
Background information
We have a Windows Server 2012 R2 machine that runs our middleware infrastructure with the following specs.
We found it concerning that 30% of CPU time is wasted in the kernel, so we started to dig deeper.
The server above runs ~500 "host" processes (as Windows services). Each of these "host" processes has an inner while loop with a ~250 ms delay (yuck!), and each may have ~1..2 "child" processes that execute the actual work.
While the infinite loop runs with a 250 ms delay between iterations, actual useful work for the "host" application may appear only every 10..15 seconds, so a lot of cycles are wasted on unnecessary looping.
We are aware that the design of the "host" application is sub-optimal, to say the least, for our scenario. The application is being changed to an event-based model that will not require the loop, and we therefore expect a significant reduction in "kernel" time on the CPU utilization graph.
However, while investigating this problem, we did some xperf analysis, which raised several general questions about the Windows CPU scheduler for which we could not find any clear, concise explanation.
What we don't understand
Below is the screenshot from one of xperf sessions.
You can see from the "CPU Usage (Precise)" that
There are 15 ms time slices, the majority of which are under-utilized, at ~35-40%. I assume this in turn means the CPU is utilized only ~35-40% of the time, yet the system's performance (say, as observed through casual tinkering around the system) is really sluggish.
On top of that, we have the "mysterious" 30% kernel-time cost, judging by the Task Manager CPU utilization graph.
Some CPU's are obviously utilized for the whole 15 ms slice and beyond.
Questions
As far as Windows CPU Scheduling on multiprocessor systems is concerned:
What causes the 30% kernel cost? Context switching? Something else? What considerations should be made when writing applications to reduce this cost, or even to achieve full utilization with minimal infrastructure cost (on multiprocessor systems where the number of processes is higher than the number of cores)?
What are these 15 ms slices?
Why does CPU utilization have gaps in these slices?
To diagnose the CPU usage issues, you should use Event Tracing for Windows (ETW) to capture CPU sampling data (not the Precise data, which is useful for detecting hangs).
To capture the data, install the Windows Performance Toolkit, which is part of the Windows SDK.
Now run WPRUI.exe, select First Level, under Resource select CPU usage, and click Start.
Now capture one minute of CPU usage. After the minute, click Save.
Now analyze the generated ETL file with the Windows Performance Analyzer by dragging and dropping the CPU Usage (Sampled) graph onto the analysis pane and ordering the columns as you see in the picture:
Inside WPA, load the debug symbols and expand the stack of the SYSTEM process. In this demo, the CPU usage comes from the nVIDIA driver.
I'm using multi-threaded software (PFC3D, developed by Itasca Consulting) to run some simulations. After moving to a powerful computer, an Intel Xeon Gold 5120T CPU @ 2.2 GHz (2 processors, 28 physical cores, 56 logical cores) running Windows 10, to get faster calculations, the software seems to use only a limited number of cores. Normally the software detects 56 cores and automatically takes the maximum number of cores.
I'm quite sure the problem is in the system, not in my software, because I run the same code on an Intel Core i9-9880H (16 logical cores) and it uses all the cores, with even more efficiency than the Xeon Gold.
The software is using only 22 to 30; 28 cores / 56 threads are displayed on Task Manager's CPU page. I have Windows 10 Pro.
I very much appreciate your help.
Thank you,
Youssef
It's hard to say, because I do not have the code and you provide so little information.
You seem to have no I/O, because you said you use 100% of the CPU on the i9. That should simplify things a little bit, but...
There could be many reasons.
My feeling is that you have thread synchronization (like a critical section) around one or more shared resources. The resource seems to be lightly contended: each thread needs it only briefly, which lets 16 threads access it with few (or very few) collisions, so threads rarely have to wait for it. But adding more threads significantly increases the number of collisions (the shared resource being locked by another thread), so threads end up waiting for the resource. It really sounds like something like that, but it is only a guess.
A quick thing to try that could potentially improve performance (because I have the feeling the shared resource requires very quick access) is to use a SpinLock instead of a regular critical section. But that is totally a guess based on very little, and SpinLock is available in C# but perhaps not in your language.
As for the number of CPUs used, it could be normal to use only half, depending on how the program is made. Sometimes it is better not to use hyperthreading, and perhaps your program decides this itself. There could also be a bug, either in the program itself, in C#, or in the BIOS, that tells the app there are only 28 CPUs instead of 56 (usually due to hyperthreading). It is still a guess.
There could be some additional information that could potentially help you in this stack overflow question.
Good luck.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I understand that goroutines are very lightweight and we can spawn thousands of them, but I want to know if there is some scenario where we should spawn a process instead of a goroutine (like hitting some kind of process boundary in terms of resources, or something else). Can spawning a new process in some scenario be beneficial in terms of resource utilization or some other dimension?
To get things started, here are three reasons. I'm sure there are more.
Reason #1
In a perfect world, CPUs would be busy doing the most important work they can (and not wasted doing the less important work while more important work waits).
To do this, whatever controls what work a CPU does (the scheduler) has to know how important each piece of work is. This is normally done with (e.g.) thread priorities. When there are 2 or more processes that are isolated from each other, whatever controls what work a CPU does can't be part of either process. Otherwise you get a situation where one process is consuming CPU time doing unimportant work because it can't know that there's a different process that wants the CPU for more important work.
This is why things like "goroutines" are broken (inferior to plain old threads). They simply can't do the right thing (unless there's never more than one process that wants CPU time).
Processes (combined with "process priorities") can fix that problem (while adding multiple other problems).
Reason #2
In a perfect world, software would never crash. The reality is that sometimes processes do crash (and sometimes the reason has nothing to do with software - e.g. a hardware flaw). Specifically, when one process crashes often there's no sane way to tell how much damage was done within that process, so the entire process typically gets terminated. To deal with this problem people use some form of redundancy (multiple redundant processes).
Reason #3
In a perfect world, all CPUs and all memory would be equal. In reality things don't scale up like that, so you get things like ccNUMA where a CPU can access memory in the same NUMA domain quickly, but the same CPU can't access memory in a different NUMA domain as quickly. To cope with that, ideally (when allocating memory) you'd want to tell the OS "this memory needs low latency more than bandwidth" (and OS would allocate memory for the fastest/closest NUMA domain only) or you'd tell the OS "this memory needs high bandwidth more than low latency" (and the OS would allocate memory from all NUMA domains). Sadly every language I've ever seen has "retro joke memory management" (without any kind of "bandwidth vs. latency vs. security" hints); which means that the only control you get is the choice between "one process spread across all NUMA domains vs. one process for each NUMA domain".
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 5 years ago.
Can an idle server (Unix or Windows), with at least a Chef client and a Zabbix agent running on it, have absolutely zero CPU utilization, or is it always using at least, say, 0.001%?
To achieve zero CPU utilization you need to shut the machine down.
Here are the ways:
Manually click Shut down (on Windows).
Create a batch file that does the same.
Pull the power cable out (*not recommended).
Most CPUs have a Halt instruction, or its moral equivalent. When the Halt instruction is reached, the CPU will stop executing its normal, Fetch-Decode-Execute loop (or moral equivalent).
The only thing that can bring the CPU out of this state is the delivery of an external interrupt (from some other piece of hardware), which will cause the CPU to perform its normal interrupt processing. What it does after that is up to the operating system.
As an example, a system may arrange for its network cards to cause interrupts, and then Halt. Most modern, everyday OSes will not do this, since they have plenty of background tasks they can schedule when the CPU would otherwise be idle.
Depends on the definition of CPU utilization.
At the CPU level, utilization can't be zero unless the computer is off, which is intuitive of course: in any sane architecture, some process is always running.
Normally, though, at the OS level, CPU utilization is calculated by the OS, and each OS probably has its own implementation of the calculation algorithm.
There is a certain set of tasks, known as idle tasks, which run when there are no non-idle processes to run. On Linux, the idle task is process 0, started at boot.
Utilization over a period of time is the percentage of that time spent running non-idle processes. Over 10 ms, if the CPU was idle for 5 ms, utilization is 50%.
Simply put, 0% utilization means the CPU is running but is just waiting for tasks to be assigned, executing only the default idle tasks. It is also possible for some minor non-idle tasks to be running at something like 0.0001% utilization, which gets rounded off to zero.
I am working on a calculation intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time it takes for a particular algorithm varies. For example sometimes running the algorithm 100 times takes about 1100 ms and another time running 100 times takes much more time like 2000 or even 3000 ms. It may vary even in the same run. So it is impossible to measure improvement when I optimize a piece of code. It's just unreliable.
Here is another run:
So basically I want to make sure one CPU is dedicated to my app. The PC has an old dual-core Intel E5300 CPU running Windows 7 32-bit, so I can't just set process affinity and forget about one core forever; that would make the computer very slow for daily tasks. I need other apps to use a specific core when I desire, and then, when I'm done profiling, the CPU affinities should go back to normal. Having a bat file to do the task would be a fantastic solution.
My question is: is it possible to have a bat file that sets process affinity for every process on Windows 7?
PS: The algorithm is correct and every time runs the same code path. I created some object pool so after first run, zero memory is allocated. I also profiled memory allocation with dottrace and it showed no allocation after first run. So I don't believe GC is triggered when the algorithm is working. Physical memory is available and system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It produced more uniform results, especially when background apps like Visual Studio and dotTrace were running. Further investigation into the divergent execution times revealed that the root cause of the unpredictability was CPU overheating: the overheat alarm was off while the temperature was over 100 °C! After fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is C#, so... the System.Diagnostics.Process object has a ProcessorAffinity property that should perform the same function.
I am just not sure this will stabilize the CPU times quite the way you expect. A single busy task that is not doing I/O should remain scheduled on the same core anyway unless another thread interrupts it, so I think your variable times are due more to other threads/processes interrupting your algorithm than to the OS randomly shunting your thread to a different core. Unless you set the affinity of all other processes in the system to exclude your preferred core, I can't see this helping.
I'm trying to speed up the time taken to compile my application and one thing I'm investigating is to check what resources, if any, I can add to the build machine to speed things up. To this end, how do I figure out if I should invest in more CPU, more RAM, a better hard disk or whether the process is being bound by some other resource? I already saw this (How to check if app is cpu-bound or memory-bound?) and am looking for more tips and pointers.
What I've tried so far:
Time the process on the build machine vs. on my local machine. I found that the build machine takes twice as long as my machine.
Run "Resource Monitor" and look at the CPU, memory, and disk usage while the process is running. While doing this, I have trouble interpreting the numbers, mainly because I don't understand what each column means, how that translates to a virtual machine vs. a physical box, and what it means on multi-CPU boxes.
Start > Run > perfmon.exe
Performance Monitor can graph many system metrics that you can use to deduce where the bottlenecks are including cpu load, io operations, pagefile hits and so on.
Additionally, the Platform SDK now includes a tool called XPerf that can provide information more relevant to developers.
Random pausing will tell you your percentage split between CPU and I/O time.
Basically, if you grab 10 random stackshots, and if 80% (for example) of the time is spent in I/O, then on 8 +/- 1.3 of those samples the stack will reach into the system routine that reads or writes a buffer.
If you want higher precision, take more samples.