Golang benchmarking b.StopTimer() hangs -- is it me? - go

I've run into a problem benchmarking my Go programs, and it looks to me like a Go bug. Please help me understand why it's not.
In a nutshell: benchmarking hangs when I call b.StopTimer() even if I then call b.StartTimer() afterwards. AFAICT I am using the functions as intended.
This is on Mac OSX 10.11.5 "El Capitan," with go version go1.6.1 darwin/amd64; everything else seems to be fine. I will try to check this on Linux later today and update the question.
Here's the code. Run with go test -bench=. as usual.
package bmtimer
import (
"testing"
"time"
)
// These appear to not matter; they can be very short.
var COSTLY_PREP_DELAY = time.Nanosecond
var RESET_STATE_DELAY = time.Nanosecond
// This however -- the measured function delay -- must not be too short.
// The shorter it is, the more iterations will be timed. If it's infinitely
// short then you will hang -- or perhaps *appear* to hang?
var FUNCTION_OF_INTEREST_DELAY = time.Microsecond
func CostlyPrep() { time.Sleep(COSTLY_PREP_DELAY) }
func ResetState() { time.Sleep(RESET_STATE_DELAY) }
func FunctionOfInterest() { time.Sleep(FUNCTION_OF_INTEREST_DELAY) }
func TestPlaceHolder(t *testing.T) {}
func BenchmarkSomething(b *testing.B) {
CostlyPrep()
b.ResetTimer()
for n := 0; n < b.N; n++ {
FunctionOfInterest()
b.StopTimer()
ResetState()
b.StartTimer()
}
}
Thanks in advance for any tips! I've come up short on Google as well as here.
Edit: It seems this only happens of FunctionOfInterest() is really fast; and yes, I specifically want to do this inside my benchmark loop. I've updated the sample code to make that more clear. And I think I have it figured out now.

TL;DR: it's not a bug, it's just a little tricky.
I'm pretty sure I know what's going on. The use of StopTimer and StartTimer inside the benchmarking loop is not the problem; the problem is that the function I do want to test is too fast for its own good. Or for my good anyway.
By adjusting the sleep time used by that function, we can pretty easily observe the benchmarking differences. At least for something this simple, Go is running more iterations the faster the measured code returns. This is known behavior and I should have thought of it. One finds output like this:
PASS
BenchmarkSomething-4 1000000 2079 ns/op
ok bmtimer 36.587s
When the FunctionOfInterest returns very fast -- or of course "infinitely" fast, as in func tooFast() {} -- then Go keeps running more iterations of it, trying to get a reliable benchmark. You might run out of patience. Or, your computer might:
*** Test killed with quit: ran too long (10m0s).
FAIL bmtimer 600.037s
(That was ten minutes at 100% CPU on one core.)
The good news of course is that it's not a bug; the bad news is it's still pretty annoying that it doesn't just stop at, say, a million iterations and tell you its best guess.
But then again, the other good news is that encountering this problem indicates your function is fast enough and probably doesn't need benchmarking.
The remaining mystery is: Why does this only happen when we stop and start the timer?
Without fiddling with the timer, I can call my noop function and it benchmarks it just fine, in a mere two billion iterations:
BenchmarkSomething-4 2000000000 0.33 ns/op
ok bmtimer 0.706s
My best guess is based on the comment from #JimB -- stopping the world takes at least a little time, and when you're running two billion iterations that can easily run over ten minutes. Like the man said: a billion here, a billion there, and pretty soon you're talking about real money.
(That's my guess anyway. Smarter folks than I may yet correct me.)
Thus the moral of our story is: yes by all means do take advantage of b.StopTimer() and b.StartTimer() in order to handle state if you have no other way to deal with it. But keep in mind that for very fast functions, it may be too much for the benchmarker.
I do wish I knew a better way to isolate the specific function benchmark for fast functions. My initial hack that I used before trying StopTimer was to have another benchmark function that benchmarks just the extra code, so you'd get two numbers: ns/op with everything, and ns/op without the function of interest. That works, and it's probably even more accurate than stopping the timer, but it does require advance knowledge of how to interpret the benchmark output.

Related

How does goroutines behave on a multi-core processor

I am a newbie in Go language, so please excuse me if my question is very basic. I have written a very simple code:
func main(){
var count int // Default 0
cptr := &count
go incr(cptr)
time.Sleep(100)
fmt.Println(*cptr)
}
// Increments the value of count through pointer var
func incr(cptr *int) {
for i := 0; i < 1000; i++ {
go func() {
fmt.Println(*cptr)
*cptr = *cptr + 1
}()
}
}
The value of count should increment by one the number of times the loop runs. Consider the cases:
Loop runs for 100 times--> value of count is 100 (Which is correct as the loop runs 100 times).
Loop runs for >510 times --> Value of count is either 508 OR 510. This happens even if it is 100000.
I am running this on an 8 core processor machine.
First of all: prior to Go 1.5 it runs on a single processor, only using multiple threads for blocking system calls. Unless you tell the runtime to use more processors by using GOMAXPROCS.
As of Go 1.5 GOMAXPROCS is set to the number of CPUS. See 6, 7 .
Also, the operation *cptr = *cptr + 1 is not guaranteed to be atomic. If you look carefully, it can be split up into 3 operations: fetch old value by dereferencing pointer, increment value, save value into pointer address.
The fact that you're getting 508/510 is due to some magic in the runtime and not defined to stay that way. More information on the behaviour of operations with concurrency can be found in the Go memory model.
You're probably getting the correct values for <510 started goroutines because any number below these are not (yet) getting interrupted.
Generally, what you're trying to do is neither recommendable in any language, nor the "Go" way to do concurrency. A very good example of using channels to synchronize is this code walk: Share Memory By Communicating (rather than communicating by sharing memory)
Here is a little example to show you what I mean: use a channel with a buffer of 1 to store the current number, fetch it from the channel when you need it, change it at will, then put it back for others to use.
You code is racy: You write to the same memory location from different, unsynchronized goroutines without any locking. The result is basically undefined. You must either a) make sure that all the goroutine writes after each other in a nice, ordered way, or b) protect each write by e.g. e mutex or c) use atomic operations.
If you write such code: Always try it under the race detector like $ go run -race main.go and fix all races.
A nice alternative to using channels in this case might be the sync/atomic package, which contains specifically functions for atomically incrementing/decrementing numbers.
You are spawning 500 or 1000 routines without synchronization among routines. This is creating a race condition, which makes the result un-predictable.
Imagine You are working in an office to account for expense balance for your boss.
Your boss was in a meeting with 1000 of his subordinates and in this meeting he said, "I would pay you all a bonus, so we will have to increment the record of our expense". He issued following command:
i) Go to Nerve and ask him/ her what is the current expense balance
ii) Call my secretary to ask how much bonus you would receive
iii) Add your bonus as an expense to the additional expense balance. Do the math yourself.
iv) Ask Nerve to write the new balance of expense on my authority.
All of the 1000 eager participants rushed to record their bonus and created a race condition.
Say 50 of the eager gophers hit Nerve at the same time (almost), they ask,
i) "What is the current expense balance?
-- Nerve says $1000 to all of those 50 gophers, as they asked at the same question at the same time(almost) when the balance was $1000.
ii) The gophers then called secretary, how much bonus should be paid to me?
Secretary answers, "just $1 to be fair"
iii) Boos said do the math, they all calculates $1000+ $1 = $1001 should be the new cost balance for the company
iv) They all will ask Nerve to put back $1001 to the balance.
You see the problem in the method of Eager gophers?
There are $50 units of computation done, every time the added $1 to the existing balance, but the cost didn't increase by $50; only increased by $1.
Hope that clarifies the problem. Now for the solutions, other contributors gave very good solutions that would be sufficient I believe.
All of that approaches failed to me, noobie here. But i have found a better way http://play.golang.org/p/OcMsuUpv2g
I'm using sync package to solve that problem and wait for all goroutines to finish, without Sleep or Channel.
And don't forget to take a look at that awesome post http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/

Do speeds of if statements in a repetitive loop affect overall performance?

If I have code that will take a while to execute, printing out results every iteration will slow down the program a lot. To still receive occasional output to check on the progress of the code, I might have:
if (i % 10000 == 0) {
# print progress here
}
Does the if statement checking every time slow it down at all? Should I just not put output and just wait, will that make it noticeably faster at all?
Also, is it faster to do: (i % 10000 == 0) or (i == 10000)?
Is checking equality or modulus faster?
In general case, it won't matter at all.
A slightly longer answer: It won't matter unless the loop is run millions of times and the other statement in it is actually less demanding than an if statement (for example, a simple multiplication etc.). In that case, you might see a slight performance drop.
Regarding (i % 10000 == 0) vs. (i == 10000), the latter is obviously faster, because it only compares, whereas the former possibility does a (fairly costly) modulus and a comparison.
That said, both an if statement and a modulus count won't make any difference if your loop doesn't take up 90 % of the program's running time. Which usually is the case only at school :). You probably spent a lot more time by asking this question than you would have saved by not printing anything. For development and debugging, this is not a bad way to go.
The golden rule for this kind of decisions:
Write the most readable and explicit code you can imagine to do the
thing you want it to do. If you have a performance problem, look at
wrong data structures and algorithmic choices first. If you have done
all those and need a really quick program, profile it to see which
part takes most time. After all those, you're allowed to do this kind
of low-level guesses.

Replicate() versus a for loop?

Does anyone know how the replicate() function works in R and how efficient it is relative to using a for loop?
For example, is there any efficiency difference between...
means <- replicate(100000, mean(rnorm(50)))
And...
means <- c()
for(i in 1:100000) {
means <- c(means, mean(rnorm(50)))
}
(I may have typed something slightly off above, but you get the idea.)
You can just benchmark the code and get your answer empirically. Note that I also added a second for loop flavor which circumvents the growing vector problem by preallocating the vector.
repl_function = function(no_rep) means <- replicate(no_rep, mean(rnorm(50)))
for_loop = function(no_rep) {
means <- c()
for(i in 1:no_rep) {
means <- c(means, mean(rnorm(50)))
}
means
}
for_loop_prealloc = function(no_rep) {
means <- vector(mode = "numeric", length = no_rep)
for(i in 1:no_rep) {
means[i] <- mean(rnorm(50))
}
means
}
no_loops = 50e3
benchmark(repl_function(no_loops),
for_loop(no_loops),
for_loop_prealloc(no_loops),
replications = 3)
test replications elapsed relative user.self sys.self
2 for_loop(no_loops) 3 18.886 6.274 17.803 0.894
3 for_loop_prealloc(no_loops) 3 3.209 1.066 3.189 0.000
1 repl_function(no_loops) 3 3.010 1.000 2.997 0.000
user.child sys.child
2 0 0
3 0 0
1 0 0
Looking at the relative column, the un-preallocated for loop is 6.2 times slower. However, the preallocated for loop is just as fast as replicate.
replicate is a wrapper for sapply, which itself is a wrapper for lapply. lapply is ultimately an .Internal function that is written in C and performs the looping in an optimised way, rather than through the interpreter. It's main advantages are efficient memory management, especially compared to the highly inefficient vector growing method you present above.
I have a very different experience with replicate which also confuses me. It often happens that my R crashes and my laptop hangs when I use replicate compared to for and this surprises me, as for the reasons mentioned above, I also expected a C-written function to outperform the for loop. For example, if you execute the functions below, you'll see that for loop is faster than replicate
system.time(for (i in 1:10) runif(1e7))
# user system elapsed
# 3.340 0.218 3.558
system.time(replicate(10, runif(1e7)))
# user system elapsed
# 4.622 0.484 5.109
so with 10 replicates, the for loop is clearly faster. If you repeat it for 100 replicates you get similar results. So I wonder if anyone can come with an example that shows its practical privileges compared to for.
PS I also created a function for the runif(1e7) and that made no difference in the comparison. Basically I failed to come with any example that shows the advantage of replicate.
Vectorization is the key difference between them. I will tray to explain this point. R is an high-level-interpreted computer language. It takes care of many basic computer tasks for you. When you write
x <- 2.0
you don’t have to tell your computer that
“2.0” is a floating-point number;
“x” should store numeric-type data;
it has to find a place in memory to put “5”;
it has to register “x” as a pointer to a certain place in memory.
R figures these things by itself.
But, for such comfortable issue, there is a price: it is slower than low level languages.
In C or FORTRAN, much of this "test if" would be accomplished during the compilation step, not during the program execution. They are translated into binary computer language (0/1) after they are written, BUT before they are run. This allows the compiler to organize the binary machine code in an optimal way for the computer to interpret.
What does this have to do with vectorization in R? Well, many R functions are actually written in a a compiled language, such as C, C++, and FORTRAN, and have a small R “wrapper”. This is the difference between yours approach. for loops add further test if operations that the machine has to do on data, making it slower

Speedup problems with go

I wrote a very simple program in go to test performances of a parallel program. I wrote a very simple program that factorizes a big semiprime number by division trials. Since no communications are involved, I expected an almost perfect speedup. However, the program seems to scale very badly.
I timed the program with 1, 2, 4, and 8 processes, running on a 8 (real, not HT) cores computer, using the system timecommand. The number I factorized is "28808539627864609". Here are my results:
cores time (sec) speedup
1 60.0153 1
2 47.358 1.27
4 34.459 1.75
8 28.686 2.10
How to explain such bad speedups? Is it a bug in my program, or is it a problem with go runtime? How could I get better performances? I'm not talking about the algorithm by itself (I know there are better algorithms to factorize semiprime numbers), but about the way I parallelized it.
Here is the source code of my program:
package main
import (
"big"
"flag"
"fmt"
"runtime"
)
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
var m big.Int
i := big.NewInt(int64(start))
s := big.NewInt(int64(step))
z := big.NewInt(0)
for {
m.Mod(n, i)
if m.Cmp(z) == 0{
c <- i
}
i.Add(i, s)
}
}
func main() {
var np *int = flag.Int("n", 1, "Number of processes")
flag.Parse()
runtime.GOMAXPROCS(*np)
var n big.Int
n.SetString(flag.Arg(0), 10) // Uses number given on command line
c := make(chan *big.Int)
for i:=0; i<*np; i++ {
go factorize(&n, 2+i, *np, c)
}
fmt.Println(<-c)
}
EDIT
Problem really seems to be related to Mod function. Replacing it by Rem gives better but still imperfect performances and speedups. Replacing it by QuoRem gives 3 times faster performances, and perfect speedup. Conclusion: it seems memory allocation kills parallel performances in Go. Why? Do you have any references about this?
Big.Int methods generally have to allocate memory, usually to hold the result of the computation. The problem is that there is just one heap and all memory operations are serialized. In this program, the numbers are fairly small and the (parallelizable) computation time needed for things like Mod and Add is small compared to the non-parallelizable operations of repeatedly allocating all the tiny little bits of memory.
As far as speeding it up, there is the obvious answer of don't use big.Ints if you don't have to. Your example number happens to fit in 64 bits. If you plan on working with really big big numbers though, the problem will kind of go away on its own. You will spend much more time doing computations, and the time spent in the heap will be relatively much less.
There is a bug in your program, by the way, although it's not related to performance. When you find a factor and return the result on the channel, you send a pointer to the local variable i. This is fine, except that you don't break out of the loop then. The loop in the goroutine continues incrementing i and by the time the main goroutine gets around to fishing the pointer out of the channel and following it, the value is almost certain to be wrong.
After sending i through the channel, i should be replaced with a newly allocated big.Int:
if m.Cmp(z) == 0 {
c <- i
i = new(big.Int).Set(i)
}
This is necessary because there is no guarantee when fmt.Println will process the integer received on line fmt.Println(<-c). It isn't usual for fmt.Println to cause goroutine switching, so if i isn't replaced with a newly allocated big.Int and the run-time switches back to executing the for-loop in function factorize then the for-loop will overwrite i before it is printed - in which case the program won't print out the 1st integer sent through the channel.
The fact that fmt.Println can cause goroutine switching means that the for-loop in function factorize may potentially consume a lot of CPU time between the moment when the main goroutine receives from channel c and the moment when the main goroutine terminates. Something like this:
run factorize()
<-c in main()
call fmt.Println()
continue running factorize() // Unnecessary CPU time consumed
return from fmt.Println()
return from main() and terminate program
Another reason for the small multi-core speedup is memory allocation. The function (*Int).Mod is internally using (*Int).QuoRem and will create a new big.Int each time it is called. To avoid the memory allocation, use QuoRem directly:
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
var q, r big.Int
i := big.NewInt(int64(start))
s := big.NewInt(int64(step))
z := big.NewInt(0)
for {
q.QuoRem(n, i, &r)
if r.Cmp(z) == 0 {
c <- i
i = new(big.Int).Set(i)
}
i.Add(i, s)
}
}
Unfortunately, the goroutine scheduler in Go release r60.3 contains a bug which prevents this code to use all CPU cores. When the program is started with -n=2 (GOMAXPROCS=2), the run-time will utilize only 1 thread.
Go weekly release has a better run-time and can utilize 2 threads if n=2 is passed to the program. This gives a speedup of approximately 1.9 on my machine.
Another potential contributing factor to multi-core slowdown has been mentioned in the answer by user "High Performance Mark". If the program is splitting the work into multiple sub-tasks and the result comes only from 1 sub-task, it means that the other sub-tasks may do some "extra work". Running the program with n>=2 may in total consume more CPU time than running the program with n=1.
The learn how much extra work is being done, you may want to (somehow) print out values of all i's in all goroutines at the moment the program is exiting the function main().
I don't read go so this is probably the answer to a question which is not what you asked. If so, downvote or delete as you wish.
If you were to make a plot of 'time to factorise integer n' against 'n' you would get a plot that goes up and down somewhat randomly. For any n you choose there will be an integer in the range 1..n that takes longest to factorise on one processor. If your parallelisation strategy is to distribute the n integers across p processors one of those processors will take at least the time to factorise the hardest integer, then the time to factorise the rest of its load.
Perhaps you've done something similar ?

How to explain to a developer that adding extra if - else if conditions is not a good way to "improve" readability?

Recently I've bumped into the following C++ code:
if (a)
{
f();
}
else if (b)
{
f();
}
else if (c)
{
f();
}
Where a, b and c are all different conditions, and they are not very short.
I tried to change the code to:
if (a || b || c)
{
f();
}
But the author opposed saying that my change will decrease readability of the code. I had two arguments:
1) You should not increase readability by replacing one branching statement with three (though I really doubt that it's possible to make code more readable by using else if instead of ||).
2) It's not the fastest code, and no compiler will optimize this.
But my arguments did not convince him.
What would you tell a programmer writing such a code?
Do you think complex condition is an excuse for using else if instead of OR?
This code is redundant. It is prone to error.
If you were to replace f(); someday with something else, there is the danger you miss one out.
There may though be a motivation behind that these three condition bodies could one day become different and you sort of prepare for this situation. If there is a strong possibility it will happen, it may be okay to do something of the sort. But I'd advice to follow the YAGNI principle (You Ain't Gonna Need It). Can't say how much bloated code has been written not because of the real need but just in anticipation of it becoming needed tomorrow. Practice shows this does not bring any value during the entire lifetime of an application but heavily increases maintenance overhead.
As to how to approach explaining it to your colleague, it has been discussed numerous times. Look here:
How do you tell someone they’re writing bad code?
How to justify to your colleagues that they produce crappy code?
How do you handle poor quality code from team members?
“Mentor” a senior programmer or colleague without insulting
Replace the three complex conditions with one function, making it obvious why f() should be executed.
bool ShouldExecute; { return a||b||c};
...
if ShouldExecute {f();};
Since the conditions are long, have him do this:
if ( (aaaaaaaaaaaaaaaaaaaaaaaaaaaa)
|| (bbbbbbbbbbbbbbbbbbbbbbbbbbbb)
|| (cccccccccccccccccccccccccccc) )
{
f();
}
A good compiler might turn all of these into the same code anyway, but the above is a common construct for this type of thing. 3 calls to the same function is ugly.
In general I think you are right in that if (a || b || c) { f(); } is easier to read. He could also make good use of whitespace to help separate the three blocks.
That said, I would be interested to see what a, b, c, and f look like. If f is just a single function call and each block is large, I can sort of see his point, although I cringe at violating the DRY principle by calling f three different times.
Performance is not an issue here.
Many people wrap themselves in the flag of "readability" when it's really just a matter of individual taste.
Sometimes I do it one way, sometimes the other. What I'm thinking about is -
"Which way will make it easier to edit the kinds of changes that might have to be made in the future?"
Don't sweat the small stuff.
I think that both of your arguments (as well as Developer Art's point about maintainability) are valid, but apparently your discussion partner is not open for a discussion.
I get the feeling that you are having this discussion with someone who is ranked as more senior. If that's the case, you have a war to fight and this is just one small battle, which is not important for you to win. Instead of spending time arguing about this thing, try to make your results (which will be far better than your discussion partner's if he's writing that kind of kode) speak for themselves. Just make sure that you get credit for your work, not the whole team or someone else.
This is probably not the kind of answer you expected to the question, but I got a feeling that there's something more to it than just this small argument...
I very much doubt there will be any performance gains of this, except at least in a very specific scenario. In this scenario you change a, b, and c, and thus which of the three that triggers the code changes, but the code executes anyhow, then reducing the code to one if-statement might improve, since the CPU might have the code in the branch cache when it gets to it next time. If you triple the code, so that it occupies 3 times the space in the branch cache, there is a higher chance one or more of the paths will be pushed out, and thus you won't have the most performant execution.
This is very low-level, so again, I doubt this will make much of an impact.
As for readability, which one is easier to read:
if something, do this
if something else, do this
if yet another something else, do this
"this" is the same in all three cases
or this:
if something, or something else, or yet another something else, then do this
Place some more code in there, other than just a simple function call, and it starts getting hard to identify that this is actually three identical pieces of code.
Maintainability goes down with the 3 if-statement organization because of this.
You also have duplicated code, almost always a sign of bad structure and design.
Basically, I would tell the programmer that if he has problems reading the 1 if-statement way of writing it, maybe C++ is not what he should be doing.
Now, let's assume that the "a", "b", and "c" parts are really big, so that the OR's in there gets lost in lots of noise with parenthesis or what not.
I would still reorganize the code so that it only called the function (or executed the code in there) in one place, so perhaps this is a compromise?
bool doExecute = false;
if (a) doExecute = true;
if (b) doExecute = true;
if (c) doExecute = true;
if (doExecute)
{
f();
}
or, even better, this way to take advantage of boolean logic short circuiting to avoid evaluating things unnecessary:
bool doExecute = a;
doExecute = doExecute || b;
doExecute = doExecute || c;
if (doExecute)
{
f();
}
Performance shouldn't really even come into question
Maybe later he wont call f() in all 3 conditons
Repeating code doesn't make things clearer, your (a||b||c) is much clearer although maybe one could refactor it even more (since its C++) e.g.
x.f()

Resources