Is the QueryPerformanceCounter counter process-specific? - winapi

https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/ms644904(VS.85).aspx
Imagine that I measure some part of my code, which runs for 20 ms.
A context switch happens, and my thread is displaced by another thread that runs for 20 ms.
Then my thread gets a quantum back from the scheduler and performs some calculations for 1 ms.
If I calculate the elapsed time, what will I get: 41 ms or 21 ms?

If I calculate the elapsed time, what will I get: 41 ms or 21 ms?
QueryPerformanceCounter reports wall clock time, so the answer is 41 ms. The counter is system-wide, not process-specific: it keeps ticking no matter which thread is scheduled.
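A minimal sketch of the usual pattern (C++, assuming Windows), illustrating that the delta covers everything that happened between the two calls, including intervals where the thread was switched out:

#include <windows.h>
#include <stdio.h>

int main() {
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);  // counter ticks per second, fixed at boot
    QueryPerformanceCounter(&start);

    Sleep(20);  // stand-in for the measured work; the thread is not running here

    QueryPerformanceCounter(&stop);
    // Wall clock elapsed time: includes any time this thread was preempted
    // or sleeping, not just the time it actually spent on a CPU.
    double ms = (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    printf("elapsed: %.3f ms\n", ms);
    return 0;
}

If you want only the 21 ms of CPU time your own thread consumed, GetThreadTimes is the documented API for that.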

Related

How to get equivalent CPU time-difference of GPU rendering time

How do I convert the time difference given by the GPU timer while rendering into the equivalent CPU timing?
Let's say glGetQueryObjectuiv(query, GL_QUERY_RESULT, &elapsed_time) returns the elapsed time for that query; I presume this elapsed time corresponds to the GPU frequency.
How do I get the corresponding CPU time, equivalent to the GPU elapsed time?
It's a timer query: it returns a time in nanoseconds. Time doesn't change with clock frequency, so there is nothing to convert; a nanosecond on the GPU is a nanosecond on the CPU.
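A minimal sketch (assuming an initialized OpenGL 3.3+ context and a loader such as GLEW); the query result is already in nanoseconds, so the "conversion" is just a unit change:

#include <GL/glew.h>
#include <cstdio>

void measureDrawCalls() {
    GLuint query;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    // ... issue the GL commands you want to time here ...
    glEndQuery(GL_TIME_ELAPSED);

    // This blocks until the result is available; poll with
    // GL_QUERY_RESULT_AVAILABLE instead if you cannot afford a stall.
    GLuint64 elapsed_ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns);

    // A nanosecond is a nanosecond regardless of GPU clock frequency;
    // divide to express the same duration in milliseconds.
    printf("GPU time: %.3f ms\n", elapsed_ns / 1.0e6);

    glDeleteQueries(1, &query);
}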

WinDbg runaway command output explained

I have a production CPU issue: after days of regular activity, the CPU suddenly starts to spike. I saved a dump file and ran the !runaway command to get the list of threads that consumed the most CPU time. The output is below:
User Mode Time
Thread Time
21:110 0 days 10:51:39.781
19:f84 0 days 10:41:59.671
5:cc4 0 days 0:53:25.343
48:74 0 days 0:34:20.140
47:1670 0 days 0:34:09.812
13:460 0 days 0:32:57.640
8:14d4 0 days 0:19:30.546
7:d90 0 days 0:03:15.000
23:1520 0 days 0:02:21.984
22:ca0 0 days 0:02:08.375
24:72c 0 days 0:02:01.640
29:10ac 0 days 0:01:58.671
27:1088 0 days 0:01:44.390
As you can see, the output shows two threads, 21 and 19, that consumed more than 20 hours of CPU time combined. I was able to get the call stack of one of those threads like so:
~21s
!CLRStack
The output doesn't matter at the moment; let's call it the "X call stack".
What I would like is an explanation of the !runaway command output. From what I understand, a dump file is a snapshot of the current state of the application, so my questions are:
How can the !runaway command show a value of 10:51 hours for thread 21 when the dumping process took only a few seconds?
Does it mean that the specific "instance" of the X call stack I found with the !CLRStack command has been hanging for more than 10 hours? Or is it the total time thread 21 spent across all of its executions of the X call stack? If so, it seems strange that thread 21 is responsible for so many executions of the X call stack, since the origin is a web request (the runtime should assign a random thread to each call).
I have a speculation that may answer these two questions: maybe WinDbg calculates the time by taking the actual time of the thread's call stack and dividing it by the duration of the dumping process. So if, for example, the specific execution of the X call stack took 1 second and the whole dumping process took 3 seconds (33%), while the process had been running for a total of 24 hours, the output would show:
8 hours (33% of 24 hours)
Am I right, or have I got it completely wrong?
This answer is intended to be comprehensible for the OP. It's not intended to be correct in all bits and bytes.
[...] and dividing it by the scope of the dumping process [...]
This understanding is probably the root of all evil: dumping a process only gives you the state of the process at a certain point in time. From the process's perspective, the duration of dumping it is 0.0 seconds, since all threads are suspended during the operation (relative to your process nothing changes and time stands still; wall clock time, of course, keeps running).
You are thinking of dumping a process as monitoring it over a longer period of time, which is not the case. Dumping a process just takes time because it involves disk activity etc.
So no, there is no "scope", and thus you cannot (or at least it's really hard to) measure performance issues with crash dumps.
How can the !runaway command show a value of 10:51 hours for thread 21, [...]
How can your C# program know how long it has been running if all you have is a timer event that fires every second? The answer: it keeps a variable and increments it.
That's roughly how Windows does it. Windows is responsible for thread scheduling, and each time it re-schedules a thread, it updates a variable that holds that thread's accumulated time.
When the crash dump is written, this information, which the OS has been collecting all along, is included in the dump.
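As an illustration (a C++ sketch, not what WinDbg itself runs), the same per-thread bookkeeping that !runaway prints is exposed through the documented GetThreadTimes API:

#include <windows.h>
#include <stdio.h>

// Convert a FILETIME duration (100-nanosecond units) to seconds.
static double FiletimeToSeconds(const FILETIME& ft) {
    ULARGE_INTEGER u;
    u.LowPart  = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 1e7;  // 10,000,000 ticks of 100 ns per second
}

int main() {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    // Read the scheduler's accumulated times for the current thread.
    if (GetThreadTimes(GetCurrentThread(), &creationTime, &exitTime,
                       &kernelTime, &userTime)) {
        printf("user: %.3f s, kernel: %.3f s\n",
               FiletimeToSeconds(userTime), FiletimeToSeconds(kernelTime));
    }
    return 0;
}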
[...] when the dumping process took only a few seconds?
Since the crash dump is taken by a thread of WinDbg, the time for that is accounted to that thread. You would need to debug WinDbg and run !runaway on a WinDbg thread to see how much CPU time that took. Potentially a nice exercise, and the .dbgdbg (debug the debugger) command may be new to you; other than that, this particular case is not really helpful.
Does it mean that the specific "instance" of the X call stack I found with the !CLRStack command has been hanging for more than 10 hours?
No. It means that at the point in time when you created the crash dump, that specific method was executing. Not more, not less.
This information is unrelated to !runaway, because the thread may have been doing something totally different for a long time that ended just a moment ago.
Or is it the total time thread 21 spent across all of its executions of the X call stack?
No. A crash dump does not contain such detailed performance data. You need a performance profiler like JetBrains dotTrace to get that information. A profiler looks at call stacks very often, then aggregates identical call stacks and derives CPU time per call stack.

About evaluating the processing time of code in Octave

cputime () returns the CPU time used by an Octave session. Here is one way to measure the processing time of a piece of code:
t = cputime;
...%your code
...%computing
...%when you are done
printf('Total cpu time: %f seconds\n', cputime-t);
The Octave documentation also describes the etime () function, which returns the difference (in seconds) between two timestamps taken with clock (), like this:
t0 = clock ();
# many computations later...
elapsed_time = etime (clock (), t0);
What's the difference between the two approaches above? Does etime use my system's wall clock? And what's a good way to measure the processing time of code?
Your first code snippet, with cputime, measures the time your code was executed on a CPU. The second snippet measures wall clock time (by the way: use tic and toc for this). Wall clock time includes periods where the CPU was executing other threads or waiting for I/O.
It is up to you what you want to measure. I normally use the wall clock with tic/toc when I want to benchmark my code.
In single-threaded apps, wall clock time is greater than or equal to cputime, because it also includes the time the CPU spent on other threads. In multithreaded apps running on multiple cores, cputime can exceed wall clock time.
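A small sketch to see the difference yourself (the pause length and matrix size are arbitrary illustrations):

t0 = cputime;
tic;
pause (2);                      # sleeping: the wall clock advances, the CPU is mostly idle
B = rand (2000) * rand (2000);  # CPU-bound work: both clocks advance
printf ("cpu: %.2f s, wall: %.2f s\n", cputime - t0, toc);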

CPU time and wall clock time

A curious question on timing: when measuring wall clock time in a language such as Python with time.time(), does that time also include the CPU/system time reported by time.clock()?
In Python, time.time() gives you the elapsed time (also known as wall time). That includes CPU time insofar as CPU time is a subset of wall time, but you cannot extract the CPU time from time.time() itself.
For example, if your process runs for ten seconds but uses the CPU for only five of those seconds, the former includes the latter.
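A short sketch of that ten-versus-five idea, scaled down (time.process_time() is the CPU-time counterpart; time.clock() was deprecated and then removed in Python 3.8):

import time

wall_start = time.time()
cpu_start = time.process_time()

time.sleep(1)                     # wall time passes, almost no CPU time is used
sum(i * i for i in range(10**6))  # CPU-bound work: both clocks advance

print(f"wall: {time.time() - wall_start:.2f} s")
print(f"cpu:  {time.process_time() - cpu_start:.2f} s")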

Performance of index creation

In trying to choose which indexing method to recommend, I tried to measure the performance. However, the measurements confused me a lot. I ran this multiple times in different orders, and the measurements remained consistent.
Here is how I measured the performance:
for N = [10000 15000 100000 150000]
    x = round(rand(N,1)*5)-2;
    idx1 = x~=0;
    idx2 = abs(x)>0;
    tic
    for t = 1:5000
        idx1 = x~=0;
    end
    toc
    tic
    for t = 1:5000
        idx2 = abs(x)>0;
    end
    toc
end
And this is the result:
Elapsed time is 0.203504 seconds.
Elapsed time is 0.230439 seconds.
Elapsed time is 0.319840 seconds.
Elapsed time is 0.352562 seconds.
Elapsed time is 2.118108 seconds. % This is the strange part
Elapsed time is 0.434818 seconds.
Elapsed time is 0.508882 seconds.
Elapsed time is 0.550144 seconds.
I checked, and this also happens for values around 100000; the strange measurements occur even at 50000.
So my question is: Does anyone else experience this for a certain range, and what causes this? (Is it a bug?)
I think this has something to do with the JIT (results below are from 2011b). Depending on the system, the version of MATLAB, the size of the variables, and exactly what is in the loop(s), using the JIT is not always faster. This is related to the "warm-up" effect, where sometimes, if you run an m-file more than once in a session, it gets quicker after the first run, because the accelerator only has to compile some parts of the code once.
JIT on (feature accel on)
Elapsed time is 0.176765 seconds.
Elapsed time is 0.185301 seconds.
Elapsed time is 0.252631 seconds.
Elapsed time is 0.284415 seconds.
Elapsed time is 1.782446 seconds.
Elapsed time is 0.693508 seconds.
Elapsed time is 0.855005 seconds.
Elapsed time is 1.004955 seconds.
JIT off (feature accel off)
Elapsed time is 0.143924 seconds.
Elapsed time is 0.184360 seconds.
Elapsed time is 0.206405 seconds.
Elapsed time is 0.306424 seconds.
Elapsed time is 1.416654 seconds.
Elapsed time is 2.718846 seconds.
Elapsed time is 2.110420 seconds.
Elapsed time is 4.027782 seconds.
Edited to add: it's kind of interesting to see what happens if you use integers instead of doubles:
JIT on, same code but converted x using int8
Elapsed time is 0.202201 seconds.
Elapsed time is 0.192103 seconds.
Elapsed time is 0.294974 seconds.
Elapsed time is 0.296191 seconds.
Elapsed time is 2.001245 seconds.
Elapsed time is 2.038713 seconds.
Elapsed time is 0.870500 seconds.
Elapsed time is 0.898301 seconds.
JIT off, using int8
Elapsed time is 0.198611 seconds.
Elapsed time is 0.187589 seconds.
Elapsed time is 0.282775 seconds.
Elapsed time is 0.282938 seconds.
Elapsed time is 1.837561 seconds.
Elapsed time is 1.846766 seconds.
Elapsed time is 2.746034 seconds.
Elapsed time is 2.760067 seconds.
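If the goal is just a fair comparison of the two indexing expressions, a hedged alternative (assuming R2013b or later, where timeit ships with MATLAB) sidesteps most of this warm-up noise, because timeit runs the expression several times and discards the warm-up runs:

N = 100000;
x = round(rand(N,1)*5) - 2;
t1 = timeit(@() x ~= 0);      % typical time per call, warm-up excluded
t2 = timeit(@() abs(x) > 0);
fprintf('x~=0: %.2e s, abs(x)>0: %.2e s\n', t1, t2);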
This may be due to an automatic optimization MATLAB applies through its Basic Linear Algebra Subprograms (BLAS).
Just like yours, my configuration (OS X 10.8.4, R2012a with default settings) takes longer to compute idx1 = x~=0 for x with 10e5 elements than for x with 11e5 elements. The left panel of the figure shows the processing time (y-axis) measured for different vector sizes (x-axis): you will see a lower processing time for N > 103000. In this panel I also display the number of cores that were active during the calculation. You will see that there is no drop for the one-core configuration. This means that MATLAB does not optimize the execution of ~= when only one core is active (no parallelization is possible). MATLAB enables some optimization routines only when two conditions are met: multiple cores and a vector of sufficient size.
The right panel displays the results when feature('accel','on'/'off') is set to off (doc). Here, only one core is active (the 1-core and 4-core curves are identical), and therefore no such optimization is possible.
Finally, the function I used for activating/deactivating the cores is maxNumCompThreads. According to Loren Shure, maxNumCompThreads controls both the JIT and the BLAS. Since feature('JIT','on'/'off') did not play a role in the performance, the BLAS is the last option remaining.
I will leave the final sentence to Loren: "The main message here is that you should not generally need to use this function [maxNumCompThreads] at all! Why? Because we'd like to make MATLAB do the best job possible for you."
accel = {'on';'off'};
figure('Color','w');
N = 100000:1000:105000;
for ind_accel = 2:-1:1
    feature('accel', accel{ind_accel});  % no eval needed for this call
    tElapsed = zeros(4,length(N));
    for ind_core = 1:4
        maxNumCompThreads(ind_core);
        n_core = maxNumCompThreads;
        for ii = 1:length(N)
            fprintf('cores asked: %d (true: %d) - N: %d\n', ind_core, n_core, N(ii));
            x = round(rand(N(ii),1)*5)-2;
            idx1 = x~=0;
            tStart = tic;
            for t = 1:5000
                idx1 = x~=0;
            end
            tElapsed(ind_core,ii) = toc(tStart);
        end
    end
    subplot(1,2,ind_accel);
    plot(N, tElapsed, '-o', 'MarkerSize', 10);
    legend({'1','2','3','4'});  % one curve per core count
    xlabel('Vector size','FontSize',14);
    ylabel('Processing time','FontSize',14);
    set(gca,'FontSize',14,'YLim',[0.2 0.7]);
    title(['accel ' accel{ind_accel}]);
end
