WinDbg runaway command output explained - performance

I have a production CPU issue, after days of regular activity suddenly the CPU starts to peak. I've saved the dump file and run the !runaway command to get the list of highest CPU time consuming threads. the output is below:
User Mode Time
Thread Time
21:110 0 days 10:51:39.781
19:f84 0 days 10:41:59.671
5:cc4 0 days 0:53:25.343
48:74 0 days 0:34:20.140
47:1670 0 days 0:34:09.812
13:460 0 days 0:32:57.640
8:14d4 0 days 0:19:30.546
7:d90 0 days 0:03:15.000
23:1520 0 days 0:02:21.984
22:ca0 0 days 0:02:08.375
24:72c 0 days 0:02:01.640
29:10ac 0 days 0:01:58.671
27:1088 0 days 0:01:44.390
As you can see, the output shows I've 2 threads: 21 & 19, that consumes more than 20 hours of CPU time combined ,I was able to track the callstack of 1 of those threads like so:
~21s
!CLRStack
the output doesn't matter at the moment, let's call it the "X callstack"
What I would like, is an explanation about the !runaway command output. from what I understand, a dump file is a snapshot of the current state of the application. so my questions are:
How can the runaway command shows 10:51 hours value for thread 21, when the dumping process only took a few seconds?
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours? or it's the total time the 21 thread executed his whole X callstacks executions? If so, it seems strange that the 21 thread responsible for so many executions of the X callstacks. As I know the origin is a web request (the runtime should assign a random thread for each call)
I've a speculation that may answer those 2 questions:
Maybe the windbg calculate the time by taking the thread callstack actual time and dividing it by the scope of the dumping process, so if for example the specific execution of the X callstack took 1 second and the whole dumping process took 3 seconds (33%), while the process was running for total of 24 hours the output will show:
8 hours (33% of 24 hours)
Am I right, or completely got it wrong?

This answer is intended to be comprehensible for the OP. It's not intended to be correct into all bits and bytes.
[...] and dividing it by the scope of the dumping process [...]
This understanding is probably the root of all evil: dumping a process only gives you the state of the process at a certain point in time. The duration of dumping the process is 0.0 seconds, since all threads are suspended during the operation. (so, relative time for your process, nothing has changed and time is standing still; of course wall clock time changes)
You are thinking of dumping a process as monitoring it over a longer period of time, which is not the case. Dumping a process just takes time because it involves disk activity etc.
So no, there is no "scope" and thus you cannot (it's really hard) measure performance issues with crash dumps.
How can the runaway command shows 10:51 hours value for thread 21, [...]
How can your C# program know how long the program is running if you only have a timer event that fires every second? The answer is: it uses a variable and increases the value.
That's roughly how Windows does it. Windows is responsible for thread scheduling and each time it re-schedules threads, it updates a variable that contains the thread time.
When writing the crash dump, the information that was collected by the OS long time ago already, is included in the crash dump.
[...] when the dumping process only took a few seconds?
Since the crash dump is taken by a thread of WinDbg, the time for that is accounted on that thread. You would need to debug WinDbg and do !runaway on a WinDbg thread to see how much CPU time that took. Potentially a nice exercise and the .dbgdbg (debug the debugger) command may be new to you; other than that, this particular case is not really helpful.
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours?
No. It means that at the point in time when you created the crash dump, that specific method was executed. Not more, not less.
This information is unrelated to !runaway, because the thread may have been doing something totally different for a long time, but that ended just a moment ago.
or it's the total time the 21 thread executed his whole X callstacks executions?
No. A crash dump does not contain such detailed performance data. You need a performance profiler like JetBrains dotTrace do get that information. A profiler will look at callstacks very often, then aggregate identical call stacks and derive CPU time per call stack.

Related

Windbg ProcessUptime is not equal to Kernel Time + User Time

I was analyzing mini-dump of one of my processes using Windbg. I used .time command to see the process time and I got the result as below. I was expecting (Process Uptime = Kernel Time + User Time), which was not the case. Does any body know why or my interpretation is wrong?
0:035> .time
Debug session time: Tue May 5 14:30:24.000 2020 (UTC - 7:00)
System Uptime: not available
Process Uptime: 3 days 5:29:22.000
Kernel time: 0 days 9:06:26.000
User time: 11 days 18:50:47.000
The kernel & user times match the CPU / Kernel & User Times displayed in Process Explorer under the Performance tab, and are likely related to the times returned by GetProcessTimes. They add up to the Total Time displayed in Process Explorer, or the CPU Time displayed in Task Manager for the same process.
This "CPU time" is the total time across all CPUs, and does not include time the process spent sleeping, waiting, or otherwise sitting idle. Because of that it can be either (a) smaller than the process "uptime" which is simply the time difference between the start and end times, in the case of mostly idle processes, or (b) larger than the process uptime in the case of heavy usage across multiple CPUs.

Golang - What is the meaning of the seconds in CPU profiling graph?

For example, the data in the figure runtime.scanobject:
13.42s
runtime.scanobject 9.69s(4.51%) of 18.30s(8.52%).
5.33s
what is the meaning of the seconds and percent?
Thanks.
When CPU profiling is enabled, the Go program stops about 100 times per second and records a sample consisting of the program counters on the currently executing goroutine's stack.
That time and percentange is in reference to the sample.
Here is a nice reference for you to read more about it: https://blog.golang.org/profiling-go-programs

Parallel-ForkManager, DBI. Faster than before forking, but still too slow

I have a very simple task on updating database.
my $pm = new Parallel::ForkManager(15);
for my $line (#lines){
my $pid = $pm->start and next;
my $dbh2 = $dbh->clone();
my $sth2 = $dbh2->prepare("update db1 set field1=? where field2 =?");
my ($field1, $field2) = very_slow_subroutine();
$sth2->execute($field1,$field2);
$pm->finish;
}
$pm->wait_all_children;
I could just use $dbh2->do, but I doubt it a reason for a slowness.
What interesting, is that it seems it very fast starts these 15 processes (or whatever I specify) , but right after that slows drastically, still noticeable faster than without forking, but I would expect more...
Edit:
The very_slow_subroutine is sub which get an answer from a web service. The service can answer from fraction of second to several seconds on time out. I have to ask dozen thousands times... the reason I would like to make a fork.
And if this is matters -- I am on Linux.
Parallel::ForkManager doesn't magically make things faster, it just lets you do run your code multiple times and at the same time. In order to get the benefit out of it, you have to design your code for parallelism.
Think of it this way. It takes you 10 minutes to get to the store, shop, load your car, come back, and unload it. You need to get 5 loads. You alone can do it in 50 minutes. That is working in serial. 10 minutes * 5 trips one after the other = 50 minutes.
Let's say you get four friends to help. You all start off for the store at the same time. There's still 5 trips, and they still take 10 minutes, but because you did it in parallel the total time is only 10 minutes.
But it will never take less than 10 minutes, no matter how many trips you have to make or how many friends you get to help. That is why the process starts up fast, everybody gets into their cars and drives off to the store, but then nothing happens for a while because it still takes 10 minutes for everyone to do their job.
Same thing here. Your loop body takes X time to run. If you iterate through it Y times, it will take X * Y real world human time to run. If you run it in parallel Y times, ideally it will take just X time to run. Each parallel worker must still execute the full body of the loop taking X time.
In order to speed things up further, you have to break up the big bottleneck of very_slow_subroutine and make that work in parallel. Your SQL is so simple that is where you should focus your efforts at optimization and parallelism.
Let's say the store is really close, it's only a 1 minute drive (this is your SQL UPDATE), but shopping, loading and unloading takes 9 minutes (this is very_slow_subroutine). What if instead you have 5 cars and 15 friends. You load 3 people into each car. Driving to and from the store will take the same time, but now three people are working together to do the shopping, loading and unloading taking only 4 minutes. Now each trip takes 5 minutes instead of 10.
This represents redesigning very_slow_subroutine to do its work in parallel. If it's just a big loop, you can put more workers on that loop. If it's a series of slow operations, you will have to redesign it to take advantage of parallel execution.
If you use too many workers you can clog up the system, it depends on what the bottleneck is. If it's CPU bound and you have 2 CPU cores, you're probably see performance gains up to 3 to 5 workers ((cores * 2)+1 is a good rule of thumb) and after that performance will drop off as the CPU spends more time switching between processes than doing work. If the bottleneck is IO, or an external service as is often the case with database and network calls, you can see great efficiencies throwing many workers at the problem. While one process is waiting around for a disk or network operation, the others can be using your CPU.
Whether parallelism can help depends on where your bottleneck is. If your CPU with 4 cores is the bottleneck, forking 4 processes might cause things to complete in about 1/4th the under the best case scenario, but spawning 15 processes is not going to improve things much more.
If, more likely, your bottleneck is in I/O, starting 15 processes that compete for the same I/O is not going to help much, although in cases where you have tons of memory to use as file cache, some improvement might be possible.
To explore the limits on your system, consider the following program:
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;
run(#ARGV);
sub run {
my $count = #_ ? $_[0] : 2;
my $pm = Parallel::ForkManager->new($count);
for (1 .. 20) {
$pm->start and next;
sleep 1;
$pm->finish;
}
$pm->wait_all_children;
}
My ancient laptop has a single CPU with 2 cores. Let's see what I get:
TimeThis : Command Line : perl sleeper.pl 1
TimeThis : Elapsed Time : 00:00:20.735
TimeThis : Command Line : perl sleeper.pl 2
TimeThis : Elapsed Time : 00:00:06.578
TimeThis : Command Line : perl sleeper.pl 4
TimeThis : Elapsed Time : 00:00:04.578
TimeThis : Command Line : perl sleeper.pl 8
TimeThis : Elapsed Time : 00:00:03.546
TimeThis : Command Line : perl sleeper.pl 16
TimeThis : Elapsed Time : 00:00:02.562
TimeThis : Command Line : perl sleeper.pl 20
TimeThis : Elapsed Time : 00:00:02.563
So, running with max 20 processes gives me a total run time over 2.5 seconds for sleeping one second 20 times.
On the other hand, with just one process, sleeping one second 20 times took just over 20 seconds. That is a huge improvement, but it also indicates a management overhead of more than 150% when you have 20 processes each sleeping for one second.
This is in the nature of parallel programming. There are a lot of formal treatments out there on what you can expect, but Amdahl's Law is required reading.

Solaris prstat - definition of "recent" time used in percentages

The man page for prstat (on Solaris 10 in my case) notes that that CPU % output is the "percentage of recent CPU time". I am trying to understand in more depth what "recent" means in this context - is it a defined amount of time prior to the sample, does it relate to the sampling interval, etc? Appreciate any insights, particularly with references to supporting documentation. I've searched but haven't been able to find a good answer. Thanks!
Adrian
The kernel maintains data that you see at the bottom - those three numbers.
For each process.
uptime shows you what those numbers are. Those are the 'recent' times for load average - the line at the bottom of prstat. 1 minute, 5 minutes, and 15 minutes.
Recent == 1 minute worth of sampling (last 60 seconds). Those numbers are averages, which is why when you first start prstat the number and processes usually change.
On the first pass you may see processes like nscd that have lots of cpu but have been up for a long time. The first display iteration is completely historical. After that the numbers reflect recent == last one minute average.
You should consider enabling sar sampling to get a much better picture.
Want a reference - try :
http://www.amazon.com/Solaris-Internals-OpenSolaris-Architecture-Edition/dp/0131482092

How to handle mass database manipulation every second - threading?

I have a very hard problem:
I have round about 20-50 objects, which I MUST (that is given for the problem, please don't spend time in thinking around it) put througt a logic EVERY SECOND.
The logic itself need round about 200-600 milliseconds (90% it is 200ms - 10% it is 600ms).
I try to find any solution how I can make is smaller, but there isn't. I must get an object from DB, I must have a lot of if-else and I must actual it. - Even if I reduce it to 50ms or smaller, to veriable rate of the object up to 50 will break my neck with the 1 second timer, because 50 x 50mx =2,5 second. So a tick needs longer then the tickrate should be.
So, my only, not very smart I think, idea is to open for every object an own thread and lead a mainthread for handling. So the mainthread opens x other thread. So only this opening must take unter 1 second. After it logic is used, the thread can kill itself and we all are happy, aren't we?
By given the last answers, I will explain my problem:
I try to build an auctioneer site. So I have up to 50 auctions running at the same moment - nothing special. So I need to: every single second look to the auctionlist, see if the time is 00:00:01 and if it is, bid automaticly (it's a feature, that user can create).
So: get 50 objects in a list, iterate through, check if a automatic bid is need, do it.
With 50 objects and the processing time you've given on average you are doing 12 seconds worth of processing every second. Assuming you have 4 cores, you can get this down to an execution time of 4 seconds via threading. Every second. This means that you're going to start off behind and slip further behind as time goes on.
I know you said you tried to think of a way to make it more efficient, but couldn't, but I fear you're going to have to. The problem as stated now is computationally intractable. You're either going to have to process the objects in a rotating window (so each object gets hit once every 4th cycle or so), or you need to make your processing run faster.
First: Profile, if you haven't already. Figure out what section of your code are taking time, etc. I'd go after that database - how long is the I/O of the objects from the database taking? Can you cache that I/O? (If you're manipulating the same 50 objects, don't load them every second.)
Let's address your threads idea: If you want multiple threads, don't create and destroy them every second. Create your X threads, and leave them be -- creating & destroying them are going to be expensive operations. You might find that less threads will work better - such as 1 or 2 per core, as you might be able to reduce time doing context switches.
To expand on Jonathan Leffler's comment on the question, as the OP requested: (This answer is a wiki)
Say you have these three things being auctioned, ending at the times indicated:
10 Apples - ends at 1:05:00 PM
20 Blueberries - ends at 2:00:00 PM
15 Pears - ends at 3:50:00 PM
If the current time is 1:00:00 PM, then sleep for 4 minutes, 58 seconds (since the closest item ends in 5 minutes). We use the 2 seconds then for processing - adjust that threshold as needed. Once we're done with the apples, we'll sleep for (2 PM - now() - 2s), for the blueberries.
Note that when we wake up at 1:04:58 PM to process the apples auction, we do not touch the blueberries or the pears -- we know that they're still way out in the future, so we don't care.

Resources