Why does the execution time of Goroutines differ significantly?

I'm just measuring the execution time of a set of goroutines. That means:
I start measuring, then start 20 goroutines and stop measuring as soon as they finish. I repeat that process 4 times and then compare the 4 execution times.
Sometimes, these execution times differ significantly:
1st run of the 20 goroutines: 1.2 ms
2nd run of the 20 goroutines: 1.9 ms
3rd run of the 20 goroutines: 1.4 ms
4th run of the 20 goroutines: 17.0 ms!
Why does it sometimes differ so significantly? Is there any way to avoid it?

Why does it sometimes differ so significantly?
Execution time will always be unpredictable to some extent, as mentioned in the comments to your question (CPU load, disk, memory, etc.).
Is there any way to avoid it?
There is a way to make your measurements more useful. Go has a built-in benchmark tool (here is a guide on how to use it properly). This tool runs your code just enough times to determine a somewhat deterministic execution time.
In addition to showing average execution time for your code, it can also show useful memory information.
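Go's testing package does this for you when you write a benchmark function. A minimal sketch (the package and function names here are placeholders, not from the question) would live in a _test.go file and be run with go test -bench=. -benchmem:

package work

import (
	"sync"
	"testing"
)

// runWorkers stands in for "start 20 goroutines and wait for them to finish".
func runWorkers() {
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// the actual work of each goroutine goes here
		}()
	}
	wg.Wait()
}

func BenchmarkRunWorkers(b *testing.B) {
	for i := 0; i < b.N; i++ { // the tool chooses b.N so the timing stabilises
		runWorkers()
	}
}

The -benchmem flag adds allocation counts to the output, which is the memory information mentioned above.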

Related

Faster Than 1ms But Slower Than a Loop

I need to execute some lines of code multiple times (say, 300 times) in rapid succession, each time incrementing some variable and then using it to do a task (let's assume it's a task that requires negligible time to complete).
I tried doing it with a timer set to 1 ms, but it runs too slowly. I then tried doing it with a While loop, but that was much too fast. I could use Threading.Sleep, but I really hate using that, not to mention it can only sleep as short as 1 ms anyway. I also thought of using Environment.TickCount, but I believe that counts in milliseconds as well.
While this program isn't important to me, it got me wondering if such a thing was possible. A loop that could run with "faster than 1 ms intervals," but slower than "as fast as the program can execute it."
One thing that comes to mind is calculating the wait for every iteration yourself with a high-precision clock like Java's System.nanoTime().
The call itself is relatively costly, though, and will not let you do nanosecond-precision waiting. But for waits shorter than 1 ms and longer than, say, 1 ns, this might help.
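For illustration, here is a rough Go sketch of that idea (a busy-wait on a high-resolution clock, not the asker's original code); the 100 µs interval is just an assumed example value:

package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 100 * time.Microsecond // shorter than 1 ms, longer than a clock read
	counter := 0
	for i := 0; i < 300; i++ {
		start := time.Now()
		counter++ // the (negligible) task that uses the incremented variable
		for time.Since(start) < interval {
			// spin until the interval has elapsed
		}
	}
	fmt.Println("iterations:", counter)
}

Spinning burns a core for the duration of the wait, which is the price of sub-millisecond accuracy without a high-resolution sleep.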

Erlang program faster with tracing

I am carrying out some evaluation measurements on a program to test its execution time and how a tracer that checks the execution of the original system affects the performance of that system. The tracing program does not interfere with the system, and no communication between them takes place apart from receiving the trace messages.
The results I have so far are an average of 953.14 microseconds for the program without the tracing switched on, compared to 937 microseconds with the tracing switched on. The timing is calculated using the statistics(wall_clock) function.
My idea was that since I have extra processes from the tracer, and the tracing mechanism requires its own processing power, it would slow the system down rather than speed it up. Is there any known reason why this could happen?
Did you run the measurement once or several times? As you use wall_clock, you are including in this measurement potential perturbation from the environment, waiting for messages, and so on, and you are not evaluating the CPU time. The example below shows it:
1> F = fun() -> receive _ -> ok after 5000 -> timout end end.
#Fun<erl_eval.20.111823515>
2> F().
timout
3> statistics(wall_clock),F(),statistics(wall_clock).
{965829,5016}
4>
The function F obviously does not need 5 seconds of CPU time; it is only waiting for 5 seconds.
This means that you should take the measurement several times, generally discard the first execution time (which may include the time needed to load the module code), and take care of the environment: check what other processes are running on the same machine, and make sure that the measured time is not the result of waiting states.
If you use runtime instead of wall_clock you should see the CPU time needed, and therefore an increased time when tracing your code. Beware that this increase may be hidden by the use of multiple cores.
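The same point expressed in Go terms (a sketch for illustration, not from the question): wall-clock time includes time spent waiting, so a process that mostly waits looks slow even though it uses almost no CPU time.

package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()
	time.Sleep(5 * time.Second)                            // waiting, like F() blocking on receive ... after 5000
	fmt.Println("wall clock elapsed:", time.Since(start)) // about 5 s, with near-zero CPU work
}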

Why isn't my application's performance scaling with more CPUs?

I am running a piece of software that is very parallel. There are about 400 commands I need to run that don't depend on each other at all, so I just fork them off and hope that having more CPUs means more processes executed per unit of time.
Code:
foreach cmd ($CMD_LIST)
    $cmd &    # fork it off
end
Very simple. Here are my testing results:
On 1 CPU, this takes 1006 seconds, or 16 mins 46 seconds.
With 10 CPUs, this took 600s, or 10 minutes!
Why wouldn't the time taken divide (roughly) by 10? I feel cheated here =(
Edit: of course I'm willing to provide any additional details you'd want to know; I'm just not sure what's relevant, because in the simplest terms this is what I'm doing.
You are assuming your processes are 100% CPU-bound.
If your processes do any disk or network I/O, the bottleneck will be on those operations, which cannot be parallelised (e.g. one process will download a file at 100k/s, two processes at 50k/s each, so you would not see any improvement at all; furthermore, you could experience a degradation in performance because of the overheads).
See Amdahl's law: it allows you to estimate the improvement in performance when parallelising tasks, knowing the proportion between the parallelisable and the non-parallelisable parts.
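As a rough illustration of Amdahl's law applied to the numbers in the question (a sketch; the implied parallel fraction is only what those two measurements suggest, not a measured fact):

package main

import "fmt"

func main() {
	// Amdahl's law: speedup = 1 / ((1-p) + p/n), where p is the parallelisable
	// fraction of the work and n is the number of CPUs.
	t1, t10 := 1006.0, 600.0
	observed := t1 / t10 // about 1.68x instead of the hoped-for 10x

	// Solving for p with n = 10: p = (1 - 1/speedup) / (1 - 1/n)
	n := 10.0
	p := (1 - 1/observed) / (1 - 1/n)
	fmt.Printf("observed speedup: %.2fx, implied parallel fraction: %.0f%%\n", observed, p*100)
}

If only about 45% of the total work is actually parallelisable (the rest being I/O or other serial overhead), a roughly 1.7x speedup on 10 CPUs is exactly what you would expect.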

Does printing some values continuously lowers the performance in parallel processing

I have heard that continuous input/output operations reduce performance in parallel processing. I have been continuously printing values so that I can check how many iterations have passed. Does it really affect the speed of the process?
Yes, and the more threads, the bigger the impact. If you have 10 threads generating 10,000 numbers at a time for 30 seconds' worth of numbers, they will all generate and then wait for the I/O operation. You are better off keeping a count in each additional thread and then displaying them at the end. Display I/O isn't as bad as disk I/O, but the problem still exists.
ex: thread 1 did 30,000 passes, thread 2 did 36,000 passes, etc.
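A minimal Go sketch of that approach (an illustration, not the asker's code): each goroutine keeps its own counter, and the totals are printed once at the end instead of inside the hot loop.

package main

import (
	"fmt"
	"sync"
)

func main() {
	const workers = 10
	counts := make([]int, workers) // one slot per goroutine, no sharing
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 30000; i++ {
				// do the real work here; no printing in the hot loop
				counts[id]++
			}
		}(w)
	}
	wg.Wait()

	for id, n := range counts {
		fmt.Printf("thread %d did %d passes\n", id+1, n)
	}
}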

Is 16 milliseconds an unusually long length of time for an unblocked thread running on Windows to be waiting for execution?

Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for the Windows thread management code, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation prevention mechanism: the so-called Balance Set Manager wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds, and if it finds one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior; threads may not be run for seconds.
You may need to elevate priorities, or otherwise look at the other threads that are competing for CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for one of the high-precision timers. Some details are in this MSDN link.
First of all, I am not sure that Delphi's Now is a good choice for millisecond-precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast; however, if you are trying to enter a critical section that is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (a mutex or an event), which involves yielding control of the thread and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted from your test: what the overall CPU load was at the time of your testing. The higher the load, the lower the chance that the thread continues execution soon. A 16 ms delay still looks within reasonable tolerance, and all in all it depends on your actual implementation.
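For comparison, here is a small sketch of the same time-out idea in Go, whose time.Now() uses a monotonic, high-resolution clock (this is only an analogue of the Delphi critical-section wrapper described in the question, not its actual code):

package main

import (
	"fmt"
	"sync"
	"time"
)

// timedSection measures the Acquire()..Release() span and reports when it
// exceeds a limit, mirroring the time-out feature from the question.
func timedSection(mu *sync.Mutex, limit time.Duration, work func()) {
	start := time.Now()
	mu.Lock()
	work()
	mu.Unlock()
	if d := time.Since(start); d > limit {
		fmt.Printf("critical section took %v (limit %v)\n", d, limit)
	}
}

func main() {
	var mu sync.Mutex
	for i := 0; i < 1000; i++ {
		timedSection(&mu, 10*time.Millisecond, func() {
			// trivial protected work, e.g. clearing a flag
		})
	}
}

Under contention and a loaded scheduler, occasional spans well above the limit are expected for the reasons given above.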
