WinDbg Process Uptime is not equal to Kernel Time + User Time - windows

I was analyzing a mini-dump of one of my processes using WinDbg. I used the .time command to see the process times and got the result below. I was expecting Process Uptime = Kernel Time + User Time, which was not the case. Does anybody know why, or is my interpretation wrong?
0:035> .time
Debug session time: Tue May 5 14:30:24.000 2020 (UTC - 7:00)
System Uptime: not available
Process Uptime: 3 days 5:29:22.000
Kernel time: 0 days 9:06:26.000
User time: 11 days 18:50:47.000

The kernel & user times match the CPU / Kernel & User Times displayed in Process Explorer under the Performance tab, and are likely related to the times returned by GetProcessTimes. They add up to the Total Time displayed in Process Explorer, or the CPU Time displayed in Task Manager for the same process.
This "CPU time" is the total time across all CPUs, and does not include time the process spent sleeping, waiting, or otherwise sitting idle. Because of that it can be either (a) smaller than the process "uptime" which is simply the time difference between the start and end times, in the case of mostly idle processes, or (b) larger than the process uptime in the case of heavy usage across multiple CPUs.

Related

WinDbg !runaway command output explained

I have a production CPU issue: after days of regular activity, the CPU suddenly starts to peak. I saved a dump file and ran the !runaway command to get the list of the threads consuming the most CPU time. The output is below:
User Mode Time
Thread Time
21:110 0 days 10:51:39.781
19:f84 0 days 10:41:59.671
5:cc4 0 days 0:53:25.343
48:74 0 days 0:34:20.140
47:1670 0 days 0:34:09.812
13:460 0 days 0:32:57.640
8:14d4 0 days 0:19:30.546
7:d90 0 days 0:03:15.000
23:1520 0 days 0:02:21.984
22:ca0 0 days 0:02:08.375
24:72c 0 days 0:02:01.640
29:10ac 0 days 0:01:58.671
27:1088 0 days 0:01:44.390
As you can see, the output shows two threads, 21 and 19, that consume more than 20 hours of CPU time combined. I was able to get the call stack of one of those threads like so:
~21s
!CLRStack
The output doesn't matter at the moment; let's call it the "X call stack".
What I would like is an explanation of the !runaway command output. From what I understand, a dump file is a snapshot of the current state of the application, so my questions are:
How can the !runaway command show a value of 10:51 hours for thread 21 when the dumping process only took a few seconds?
Does it mean that the specific "instance" of the X call stack I found with the !CLRStack command has been hanging for more than 10 hours? Or is it the total time thread 21 spent across all of its executions of the X call stack? If so, it seems strange that thread 21 is responsible for so many executions of the X call stack, since as far as I know the origin is a web request (the runtime should assign an arbitrary thread to each call).
I have a speculation that may answer these two questions:
Maybe WinDbg calculates the time by taking the actual time of the thread's call stack and dividing it by the duration of the dumping process. So if, for example, the specific execution of the X call stack took 1 second and the whole dumping process took 3 seconds (33%), while the process had been running for a total of 24 hours, the output would show:
8 hours (33% of 24 hours)
Am I right, or have I got it completely wrong?
This answer is intended to be comprehensible for the OP. It's not intended to be correct down to every last bit and byte.
[...] and dividing it by the scope of the dumping process [...]
This understanding is probably the root of all evil: dumping a process only gives you the state of the process at a certain point in time. The duration of dumping the process is 0.0 seconds, since all threads are suspended during the operation. (So in time relative to your process, nothing has changed and time is standing still; of course the wall clock keeps moving.)
You are thinking of dumping a process as monitoring it over a longer period of time, which is not the case. Dumping a process just takes time because it involves disk activity etc.
So no, there is no "scope", and thus you cannot (or at least it's really hard to) measure performance issues with crash dumps.
How can the runaway command shows 10:51 hours value for thread 21, [...]
How can your C# program know how long the program is running if you only have a timer event that fires every second? The answer is: it uses a variable and increases the value.
That's roughly how Windows does it. Windows is responsible for thread scheduling and each time it re-schedules threads, it updates a variable that contains the thread time.
When the crash dump is written, that information, which the OS has already been collecting all along, is included in the crash dump.
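Conceptually, the bookkeeping looks roughly like the sketch below (a deliberately simplified model, not the real Windows scheduler): every time a thread is switched out, the quantum it just consumed is added to a per-thread counter, and that accumulated counter is what ends up in the dump and in !runaway.

package main

import (
    "fmt"
    "time"
)

// thread models only the accounting side: an ID plus accumulated CPU time.
type thread struct {
    id      int
    cpuTime time.Duration
}

// contextSwitch charges the quantum that just elapsed to the outgoing thread.
func contextSwitch(outgoing *thread, ranFor time.Duration) {
    outgoing.cpuTime += ranFor
}

func main() {
    t21 := &thread{id: 21}
    // Pretend thread 21 got scheduled three times for various quanta.
    contextSwitch(t21, 15*time.Millisecond)
    contextSwitch(t21, 30*time.Millisecond)
    contextSwitch(t21, 15*time.Millisecond)
    // A crash dump taken now would simply contain this accumulated value.
    fmt.Printf("thread %d has used %v of CPU so far\n", t21.id, t21.cpuTime)
}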
[...] when the dumping process only took a few seconds?
Since the crash dump is taken by a thread of WinDbg, the time for that is accounted to that thread. You would need to debug WinDbg and run !runaway on a WinDbg thread to see how much CPU time that took. Potentially a nice exercise, and the .dbgdbg (debug the debugger) command may be new to you; other than that, this particular case is not really helpful.
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours?
No. It means that at the point in time when you created the crash dump, that specific method was executed. Not more, not less.
This information is unrelated to !runaway, because the thread may have been doing something totally different for a long time, but that ended just a moment ago.
or it's the total time the 21 thread executed his whole X callstacks executions?
No. A crash dump does not contain such detailed performance data. You need a performance profiler like JetBrains dotTrace to get that information. A profiler will look at call stacks very often, then aggregate identical call stacks and derive CPU time per call stack.
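To make the contrast concrete, a sampling profiler does something roughly like the sketch below (grossly simplified, with made-up stack names; real profilers such as dotTrace are far more elaborate): it captures call stacks at a high frequency and attributes CPU time in proportion to how often each stack was seen.

package main

import "fmt"

func main() {
    // Each string stands for one captured call stack; a real profiler
    // collects these hundreds or thousands of times per second.
    samples := []string{
        "Main>HandleRequest>Parse",
        "Main>HandleRequest>Parse",
        "Main>Log",
        "Main>HandleRequest>Parse",
    }

    counts := map[string]int{}
    for _, s := range samples {
        counts[s]++
    }

    // A stack seen in 75% of the samples is charged ~75% of the CPU time.
    for stack, n := range counts {
        fmt.Printf("%-28s %.0f%% of samples\n", stack, 100*float64(n)/float64(len(samples)))
    }
}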

Parallel::ForkManager, DBI. Faster than before forking, but still too slow

I have a very simple task of updating a database.
my $pm = Parallel::ForkManager->new(15);
for my $line (@lines) {
    my $pid  = $pm->start and next;
    my $dbh2 = $dbh->clone();
    my $sth2 = $dbh2->prepare("update db1 set field1=? where field2=?");
    my ($field1, $field2) = very_slow_subroutine();
    $sth2->execute($field1, $field2);
    $pm->finish;
}
$pm->wait_all_children;
I could just use $dbh2->do, but I doubt that is the reason for the slowness.
What is interesting is that it starts these 15 processes (or however many I specify) very quickly, but right after that it slows down drastically. It is still noticeably faster than without forking, but I would expect more...
Edit:
very_slow_subroutine is a sub which gets an answer from a web service. The service can take anywhere from a fraction of a second to several seconds (on a timeout) to answer, and I have to ask it tens of thousands of times. That is the reason I would like to fork.
And if this matters: I am on Linux.
Parallel::ForkManager doesn't magically make things faster, it just lets you run your code multiple times and at the same time. In order to get the benefit out of it, you have to design your code for parallelism.
Think of it this way. It takes you 10 minutes to get to the store, shop, load your car, come back, and unload it. You need to get 5 loads. You alone can do it in 50 minutes. That is working in serial. 10 minutes * 5 trips one after the other = 50 minutes.
Let's say you get four friends to help. You all start off for the store at the same time. There's still 5 trips, and they still take 10 minutes, but because you did it in parallel the total time is only 10 minutes.
But it will never take less than 10 minutes, no matter how many trips you have to make or how many friends you get to help. That is why the process starts up fast, everybody gets into their cars and drives off to the store, but then nothing happens for a while because it still takes 10 minutes for everyone to do their job.
Same thing here. Your loop body takes X time to run. If you iterate through it Y times, it will take X * Y real world human time to run. If you run it in parallel Y times, ideally it will take just X time to run. Each parallel worker must still execute the full body of the loop taking X time.
In order to speed things up further, you have to break up the big bottleneck of very_slow_subroutine and make that work in parallel. Your SQL is so simple that very_slow_subroutine is where you should focus your optimization and parallelism efforts.
Let's say the store is really close, it's only a 1 minute drive (this is your SQL UPDATE), but shopping, loading and unloading takes 9 minutes (this is very_slow_subroutine). What if instead you have 5 cars and 15 friends. You load 3 people into each car. Driving to and from the store will take the same time, but now three people are working together to do the shopping, loading and unloading taking only 4 minutes. Now each trip takes 5 minutes instead of 10.
This represents redesigning very_slow_subroutine to do its work in parallel. If it's just a big loop, you can put more workers on that loop. If it's a series of slow operations, you will have to redesign it to take advantage of parallel execution.
If you use too many workers you can clog up the system; it depends on what the bottleneck is. If it's CPU bound and you have 2 CPU cores, you'll probably see performance gains up to 3 to 5 workers ((cores * 2) + 1 is a good rule of thumb), and after that performance will drop off as the CPU spends more time switching between processes than doing work. If the bottleneck is IO, or an external service as is often the case with database and network calls, you can see great efficiencies throwing many workers at the problem. While one process is waiting around for a disk or network operation, the others can be using your CPU.
Whether parallelism can help depends on where your bottleneck is. If your CPU with 4 cores is the bottleneck, forking 4 processes might cause things to complete in about a quarter of the time under the best-case scenario, but spawning 15 processes is not going to improve things much more.
If, more likely, your bottleneck is in I/O, starting 15 processes that compete for the same I/O is not going to help much, although in cases where you have tons of memory to use as file cache, some improvement might be possible.
To explore the limits on your system, consider the following program:
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

run(@ARGV);

sub run {
    my $count = @_ ? $_[0] : 2;
    my $pm = Parallel::ForkManager->new($count);
    for (1 .. 20) {
        $pm->start and next;
        sleep 1;
        $pm->finish;
    }
    $pm->wait_all_children;
}
My ancient laptop has a single CPU with 2 cores. Let's see what I get:
TimeThis : Command Line : perl sleeper.pl 1
TimeThis : Elapsed Time : 00:00:20.735
TimeThis : Command Line : perl sleeper.pl 2
TimeThis : Elapsed Time : 00:00:06.578
TimeThis : Command Line : perl sleeper.pl 4
TimeThis : Elapsed Time : 00:00:04.578
TimeThis : Command Line : perl sleeper.pl 8
TimeThis : Elapsed Time : 00:00:03.546
TimeThis : Command Line : perl sleeper.pl 16
TimeThis : Elapsed Time : 00:00:02.562
TimeThis : Command Line : perl sleeper.pl 20
TimeThis : Elapsed Time : 00:00:02.563
So, running with a maximum of 20 processes gives me a total run time of just over 2.5 seconds for sleeping one second 20 times.
On the other hand, with just one process, sleeping one second 20 times took just over 20 seconds. Going from 20 seconds down to 2.5 is a huge improvement, but it also indicates a management overhead of more than 150%: with 20 processes each sleeping for one second, the ideal run time would be just one second.
This is in the nature of parallel programming. There are a lot of formal treatments out there on what you can expect, but Amdahl's Law is required reading.
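For a rough feel of what Amdahl's Law predicts, here is a small sketch (the 95% parallel fraction is just an illustrative guess for a workload like sleeper.pl, not a measured value):

package main

import "fmt"

// amdahl returns the theoretical speedup for a workload whose
// parallelizable fraction is p when run on n workers.
func amdahl(p, n float64) float64 {
    return 1 / ((1 - p) + p/n)
}

func main() {
    for _, n := range []float64{1, 2, 4, 8, 16, 20} {
        // Assume 95% of the work parallelizes; the remaining 5%
        // (forking, reaping children, serial setup) does not.
        fmt.Printf("%2.0f workers -> %.1fx speedup\n", n, amdahl(0.95, n))
    }
}

The diminishing returns this prints mirror the timings above: doubling the worker count stops paying off once the serial part dominates.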

How is CPU time measured on Windows?

I am currently creating a program which identifies processes that are hung or out of control and are using an entire CPU core. The program then terminates them, so the CPU usage can be kept under control.
However, I have run into a problem: When I execute the 'tasklist' command on Windows, it outputs this:
Image Name: Blockland.exe
PID: 4880
Session Name: Console
Session#: 6
Mem Usage: 127,544 K
Status: Running
User Name: [removed]\[removed]
CPU Time: 0:00:22
Window Title: C:\HammerHost\Blockland\Blockland.exe
So I know that the line which says "CPU Time" is an indication of the total time, in seconds, used by the program ever since it started.
But let's suppose there are 4 CPU cores on the system. Does this mean that it used up 22 seconds of one core, and therefore used 5.5 seconds on the entire CPU in total? Or does this mean that the process used up 22 seconds on the entire CPU?
It's the total CPU time across all cores. So, if the task used 10 seconds on one core and then later used 15 seconds on a different core, it would report 25 seconds. If it used 5 seconds on all four cores simultaneously, it would report 20 seconds.
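If you would rather read that counter programmatically than parse tasklist output, it comes from the kernel and user times that GetProcessTimes reports. Below is a minimal Go sketch for the current process; for a watchdog like yours you would open the target process by PID with OpenProcess instead, which requires appropriate access rights.

//go:build windows

package main

import (
    "fmt"
    "syscall"
    "time"
)

func main() {
    // Pseudo-handle for the current process; no CloseHandle needed.
    h, err := syscall.GetCurrentProcess()
    if err != nil {
        panic(err)
    }
    var creation, exit, kernel, user syscall.Filetime
    if err := syscall.GetProcessTimes(h, &creation, &exit, &kernel, &user); err != nil {
        panic(err)
    }
    k := time.Duration(kernel.Nanoseconds())
    u := time.Duration(user.Nanoseconds())
    // This sum is the "CPU Time" that tasklist shows, summed over all cores.
    fmt.Printf("kernel: %v  user: %v  total CPU time: %v\n", k, u, k+u)
}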

What's a good name for "total wall clock time of all cpus?"

There are only two hard things in Computer Science: cache invalidation
and naming things.
-- Phil Karlton
My app is reporting CPU time, and people reasonably want to know how much time this is out of, so they can compute % CPU utilized. My question is, what's the name for the wall clock time times the number of CPUs?
If you add up the total user, system, idle, etc. time for a system, you get the total wall clock time, times the number of CPUs. What's a good name for that? According to Wikipedia, CPU time is:
CPU time (or CPU usage, process time) is the amount of time for which
a central processing unit (CPU) was used for processing instructions
of a computer program, as opposed to, for example, waiting for
input/output (I/O) operations.
"total time" suggests just wall clock time, and doesn't connote that over a 10 second span, a four-cpu system would have 40 seconds of "total time."
Total Wall Clock Time of all CPUs
Naming things is hard, why waste a good 'un once you've found it?
Aggregate time: 15 of 40 seconds.

Getting CPU usage and calculating % used

I need to calculate the CPU usage and aggregate it from the proc file in Linux.
/proc/stat gives me data, but how would I work out the % of CPU used at a given time?
stat gives me the count of processes running on the cores at any time, which does not give me any idea of the % use of the CPU.
And I am coding this in Golang and have to do it without scripts.
Thanks in advance!
/proc/stat does not only give you the count of processes on each core. man proc will tell you the exact format of that file. Copied from it, here is the part you should be interested in:
/proc/stat
cpu 3357 0 4313 1362393
The amount of time, measured in units of USER_HZ
(1/100ths of a second on most architectures, use
sysconf(_SC_CLK_TCK) to obtain the right value), that the
system spent in user mode, user mode with low priority
(nice), system mode, and the idle task, respectively.
The last value should be USER_HZ times the second entry
in the uptime pseudo-file.
It is then easy to take the difference of the idle field between two measurements, which will give you the time this CPU spent doing nothing. The other value that you can extract is the time spent doing something, which is the difference between two measurements of:
time in user mode + time spent in user mode with low priority + time spent in system mode
You will then have two values: one, A, expressing the time spent doing nothing, and the other, B, the time actually doing something. B / (A + B) will give you the percentage of time the CPU was busy.
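Since you mention Go, here is a minimal sketch of that calculation. It reads only the aggregate "cpu" line and folds just user, nice and system into the busy side; the remaining fields (iowait, irq, softirq, etc.) are ignored here, and you may want to account for them depending on your definition of "busy".

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// readCPU returns the busy (user+nice+system) and idle jiffies from the
// aggregate "cpu" line of /proc/stat.
func readCPU() (busy, idle uint64, err error) {
    data, err := os.ReadFile("/proc/stat")
    if err != nil {
        return 0, 0, err
    }
    for _, line := range strings.Split(string(data), "\n") {
        fields := strings.Fields(line)
        if len(fields) < 5 || fields[0] != "cpu" {
            continue
        }
        // Field order: cpu user nice system idle ...
        var v [4]uint64
        for i := range v {
            v[i], err = strconv.ParseUint(fields[i+1], 10, 64)
            if err != nil {
                return 0, 0, err
            }
        }
        return v[0] + v[1] + v[2], v[3], nil
    }
    return 0, 0, fmt.Errorf("no aggregate cpu line found in /proc/stat")
}

func main() {
    b1, i1, err := readCPU()
    if err != nil {
        panic(err)
    }
    time.Sleep(time.Second)
    b2, i2, err := readCPU()
    if err != nil {
        panic(err)
    }
    // B / (A + B) from the explanation above, over a one-second window.
    busy := float64(b2 - b1)
    idle := float64(i2 - i1)
    fmt.Printf("CPU busy: %.1f%%\n", 100*busy/(busy+idle))
}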

Resources