Erlang program faster with tracing - performance

I am carrying out some evaluation measurements on a program to test the execution time and how a tracer that checks the execution of the original system effects the performance of the original system. The tracing program does not interfere with the system and no communication between them takes place except from receiving the trace messages.
The results I have so far are an average of 953.14 microseconds for the program without the tracing switched on compared to 937 microseconds with the tracing switched on. The timing is calculated using statistics(wall_clock) function.
My idea was that since I have an extra process(es) from the tracer and the tracing mechanism would require their own processing power, it would slow the system down rather than speeding it up. Is there any known reason why this could happen?

Did you run the measure once or several time? As you use the wall_clock, you are including in this measure potential perturbation of the environment, wait for messages... , and you are not evaluation the CPU time. the example below shows it:
1> F = fun() -> receive _ -> ok after 5000 -> timout end end.
#Fun<erl_eval.20.111823515>
2> F().
timout
3> statistics(wall_clock),F(),statistics(wall_clock).
{965829,5016}
4>
The function F obviously does not need 5 seconds of CPU time, but is only waiting for 5 second.
This means that you should make the measure several time, generally do not use the first execution time that may include the necessary time to load the module code, and take care of the environment - what are the other process running on the same machine, and be sure that the time measured is not the result of waiting states.
If you use runtime instead of wall_clock you should see the CPU time needed and therefore an increased time when tracing your code. Beware that this increase of time may be hidden by the usage of multi core.

Related

Runtime of GPU-based simulation unexplainable?

I am developing a GPU-based simulation using OpenGL and GLSL-Shaders and i found that performance increases when I add additional (unnecessary) GL-commands.
The simulation runs entirely on GPU without any transfers and basically consists of a loop performing 2500 algorithmically identical time steps. I carefully implemented caching of GLSL-uniform locations and removed any GL-state requests (glGet* etc) to maximize speed. To measure wall clock time i've put a glFinish after the main loop and take the elapsed time afterwards.
CASE A:
Normal total runtime for all iterations is 490ms.
CASE B:
Now, if i add a single additional glGetUniformLocation(...) command at the end of EACH time step, it requires only 475ms in total, which is 3 percent faster. (Please note that this is relevant to me since later i will perform a lot more timesteps)
I've looked at a timeline captured with Nvidia nsight and found that, in case A, all opengl commands are issued within the first 140ms and the glFinish takes 348ms until completion of all GPU-work. In case B the issuing of opengl commands is spread out over a significantly longer time (410ms) and the glFinish only takes 64ms yielding the faster 475ms in total.
I also noticed, that hardware command queue is much more full of work packets most of the time in case B, whereas in case A there is only one item waiting most of the time (however, there are no visible idle times).
So my questions are:
Why is B faster?
Why are the command packages issued more uniformly to the hardware queue over time in case A?
How can speed be enhanced without adding additional commands?
I am using Visual c++, VS2008 on Win7 x64.
IMHO this question can not be answered definitely. For what it's worth I experimentally determined, that glFinish (and …SwapBuffers for that matter) have weird runtime time behavior. I'm currently developing my own VR rendering library and prior to that I spend some significant time profiling the timelines of OpenGL commands and their interaction with the graphics system. And what I found out was, that the only thing that's consistent is, that glFinish + …SwapBuffers have a very inconsistent timing behavior.
What could happen is, that this glGetUniformLocation call pulls the OpenGL driver into a "busy" state. If you call glFinish immediately afterwards it may use a different method for waiting (for example it may spin in a while loop waiting for a flag) for the GPU than if you just call glFinish (it may for example wait for a signal or a condition variable and is thus subject to the kernels scheduling behavior).

How do you limit a process' CPU usage on Windows? (need code, not an app)

There is programs that is able to limit the CPU usage of processes in Windows. For example BES and ThreadMaster. I need to write my own program that does the same thing as these programs but with different configuration capabilities. Does anybody know how the CPU throttling of a process is done (code)? I'm not talking about setting the priority of a process, but rather how to limit it's CPU usage to for example 15% even if there is no other processes competing for CPU time.
Update: I need to be able to throttle any processes that is already running and that I have no source code access to.
You probably want to run the process(es) in a job object, and set the maximum CPU usage for the job object with SetInformationJobObject, with JOBOBJECT_CPU_RATE_CONTROL_INFORMATION.
Very simplified, it could work somehow like this:
Create a periodic waitable timer with some reasonable small wait time (maybe 100ms). Get a "last" value for each relevant process by calling GetProcessTimes once.
Loop forever, blocking on the timer.
Each time you wake up:
if GetProcessAffinityMask returns 0, call SetProcessAffinityMask(old_value). This means we've suspended that process in our last iteration, we're now giving it a chance to run again.
else call GetProcessTimes to get the "current" value
call GetSystemTimeAsFileTime
calculate delta by subtracting last from current
cpu_usage = (deltaKernelTime + deltaUserTime) / (deltaTime)
if that's more than you want call old_value = GetProcessAffinityMask followed by SetProcessAffinityMask(0) which will take the process offline.
This is basically a very primitive version of the scheduler that runs in the kernel, implemented in userland. It puts a process "to sleep" for a small amount of time if it has used more CPU time than what you deem right. A more sophisticated measurement maybe going over a second or 5 seconds would be possible (and probably desirable).
You might be tempted to suspend all threads in the process instead. However, it is important not to fiddle with priorities and not to use SuspendThread unless you know exactly what a program is doing, as this can easily lead to deadlocks and other nasty side effects. Think for example of suspending a thread holding a critical section while another thread is still running and trying to acquire the same object. Or imagine your process gets swapped out in the middle of suspending a dozen threads, leaving half of them running and the other half dead.
Setting the affinity mask to zero on the other hand simply means that from now on no single thread in the process gets any more time slices on any processor. Resetting the affinity gives -- atomically, at the same time -- all threads the possibility to run again.
Unluckily, SetProcessAffinityMask does not return the old mask as SetThreadAffinityMask does, at least according to the documentation. Therefore an extra Get... call is necessary.
CPU usage is fairly simple to estimate using QueryProcessCycleTime. The machine's processor speed can be obtained from HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\\~MHz (where is the processor number, one entry for each processor present). With these values, you can estimate your process' CPU usage and yield the CPU as necessary using Sleep() to keep your usage in bounds.

Is 16 milliseconds an unusually long length of time for an unblocked thread running on Windows to be waiting for execution?

Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code to, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There's, however, a starvation prevention mechanism. There's a so-called Balance Set Manager that wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds, and if there's one, it'll boost its priority to 15 and give it a double the normal quantum. It does this for not more than 10 threads at a time (per second) and scans not more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops to its base value. You can find out more in the Windows Internals book(s).
So, it's a pretty normal behavior what you observe, threads may be not run for seconds.
You may need to elevate priorities or otherwise consider other threads that are competing for the CPU time.
sounds like normal windows behaviour with respect to timer resolution unless you explicitly go for some of the high precision timers. Some details in this msdn link
First of all, I am not sure if Delphi's Now is a good choice for millisecond precision measurements. GetTickCount and QueryPerformanceCoutner API would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast, however if you are trying to enter critical section which is currently locked on another thread, eventually you hit a wait operation on an internal kernel object (mutex or event), which involves yielding control on the thread and waiting for scheduler to give control back later.
The "later" above would depend on a few things, including priorities mentioned above, and there is one important things you omitted in your test - what is the overall CPU load at the time of your testing. The more is the load, the less chances to get the thread continue execution soon. 16 ms time looks perhaps a bit still within reasonable tolerance, and all in all it might depends on your actual implementation.

How do I get repeatable CPU-bound benchmark runtimes on Windows?

We sometimes have to run some CPU-bound tests where we want to measure runtime. The tests last in the order of a minute. The problem is that from run to run the runtime varies by quite a lot (+/- 5%). We suspect that the variation is caused by activity from other applications/services on the system, eg:
Applications doing housekeeping in their idle time (e.g. Visual Studio updating IntelliSense)
Filesystem indexers
etc..
What tips are there to make our benchmark timings more stable?
Currently we minimize all other applications, run the tests at "Above Normal" priority, and not touch the machine while it runs the test.
The usual approach is to perform lots of repetitions and then discard outliers. So, if the distractions such as the disk indexer only crops up once every hour or so, and you do 5 minutes runs repeated for 24 hours, you'll have plenty of results where nothing got in the way. It is a good idea to plot the probability density function to make sure you are understand what is going on. Also, if you are not interested in startup effects such as getting everything into the processor caches then make sure the experiment runs long enough to make them insignificant.
First of all, if it's just about benchmarking the application itself, you should use CPU time, not wallclock time as a measure. That's then (almost) free from influences of what the other processes or the system do. Secondly, as Dickon Reed pointed out, more repetitions increase confidence.
Quote from VC++ team blog, how they do performance tests:
To reduce noise on the benchmarking machines, we take several steps:
Stop as many services and processes as possible.
Disable network driver: this will turn off the interrupts from NIC caused by >broadcast packets.
Set the test’s processor affinity to run on one processor/core only.
Set the run to high priority which will decrease the number of context switches.
Run the test for several iterations.
I do the following:
Call the method x times and measure the time
Do this n times and calculate the mean and standard deviation of those measurements
Try to get the x to a point where you're at a >1 second per measurement. This will reduce the noise a bit.
The mean will tell you the average performance of your test and the standard deviation the stability of your test/measurements.
I also set my application at a very high priority, and when I test a single-thread algorithm I associate it with one cpu core to make sure there is not scheduling overhead.
This code demonstrates how to do this in .NET:
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
if (Environment.ProcessorCount > 1)
{
Process.GetCurrentProcess().ProcessorAffinity =
new IntPtr(1 << (Environment.ProcessorCount - 1));
}

How do I obtain CPU cycle count in Win32?

In Win32, is there any way to get a unique cpu cycle count or something similar that would be uniform for multiple processes/languages/systems/etc.
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to #Greg that put me on the track to QueryPerformanceCounter, which did the trick.
Heres an interesting article! says not to use RDTSC, but to instead use QueryPerformanceCounter.
Conclusion:
Using regular old timeGetTime() to do
timing is not reliable on many
Windows-based operating systems
because the granularity of the system
timer can be as high as 10-15
milliseconds, meaning that
timeGetTime() is only accurate to
10-15 milliseconds. [Note that the
high granularities occur on NT-based
operation systems like Windows NT,
2000, and XP. Windows 95 and 98 tend
to have much better granularity,
around 1-5 ms.]
However, if you call
timeBeginPeriod(1) at the beginning of
your program (and timeEndPeriod(1) at
the end), timeGetTime() will usually
become accurate to 1-2 milliseconds,
and will provide you with extremely
accurate timing information.
Sleep() behaves similarly; the length
of time that Sleep() actually sleeps
for goes hand-in-hand with the
granularity of timeGetTime(), so after
calling timeBeginPeriod(1) once,
Sleep(1) will actually sleep for 1-2
milliseconds,Sleep(2) for 2-3, and so
on (instead of sleeping in increments
as high as 10-15 ms).
For higher precision timing
(sub-millisecond accuracy), you'll
probably want to avoid using the
assembly mnemonic RDTSC because it is
hard to calibrate; instead, use
QueryPerformanceFrequency and
QueryPerformanceCounter, which are
accurate to less than 10 microseconds
(0.00001 seconds).
For simple timing, both timeGetTime
and QueryPerformanceCounter work well,
and QueryPerformanceCounter is
obviously more accurate. However, if
you need to do any kind of "timed
pauses" (such as those necessary for
framerate limiting), you need to be
careful of sitting in a loop calling
QueryPerformanceCounter, waiting for
it to reach a certain value; this will
eat up 100% of your processor.
Instead, consider a hybrid scheme,
where you call Sleep(1) (don't forget
timeBeginPeriod(1) first!) whenever
you need to pass more than 1 ms of
time, and then only enter the
QueryPerformanceCounter 100%-busy loop
to finish off the last < 1/1000th of a
second of the delay you need. This
will give you ultra-accurate delays
(accurate to 10 microseconds), with
very minimal CPU usage. See the code
above.
You can use the RDTSC CPU instruction (assuming x86). This instruction gives the CPU cycle counter, but be aware that it will increase very quickly to its maximum value, and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.
System.Diagnostics.Stopwatch.GetTimestamp() return the number of CPU cycle since a time origin (maybe when the computer start, but I'm not sure) and I've never seen it not increased between 2 calls.
The CPU Cycles will be specific for each computer so you can't use it to merge log file between 2 computers.
RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.
Use the GetTickCount and add another counter as you merge the log files. Won't give you perfect sequence between the different log files, but it will at least keep all logs from each file in the correct order.

Resources