How to time something in Go, which takes less than a nanosecond? - go

How would I proceed if I want to compare the time of two functions, but the functions take less than a nanosecond?
t := time.Now()
_ = fmt.Sprint("Hello, World!")
d := time.Since(t)
d.Round(0)
fmt.Println(d.Nanoseconds()) // Prints 0
I could run the function a couple of times and divide the time by the number of executions, but I would rather like a way to time a single execution. Is there a way to do this?

If timing a function is generating a time less than a nanosecond, it usually means that the code was optimized out.
Compilers can detect that some code has no side effects, and decide "why would I do that".
1/(1 ns) is 1 Ghz. Modern desktop computers cap out at about 5 Ghz, give or take. So for it to be less than 1 ns, the operation plus the overhead of getting the time has to take fewer than 5 CPU cycles. At that level of resolution, the CPU isn't doing one thing at a time. The start of an instruction and the end can be multiple cycles apart, with the operation being pipelined.
So you have to look at the machine code and understand the architecture to determine what the real cost is, including what parts of the CPU's resources are used, and how it would conflict with other operations you might want to do nearby.
So even "before" and "after" stop having reasonable meaning at sub-1-ns time resolutions on most CPUs fast enough to run above 1 Ghz.

Something you could do, is answer the question "how many runs before this thing
takes at least one nanosecond":
package main
import (
"fmt"
"time"
)
func main() {
d := 1
for {
t := time.Now()
for e := d; e > 0; e-- {
fmt.Sprint("Hello, World!")
}
if time.Since(t) > 0 { break }
d *= 2
}
println(d)
}

Related

How to use go concurrency for a better runtime in Go [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 10 months ago.
Improve this question
I am trying concurrency in Go with simple problem example which is the nth prime palindrome numbers, eg: 1st to 9th palindrome sequence is 2, 3, 5, 7, 11, 101, 131, 151. I am stuck and have no idea what to do with it.
My current code is like this:
n := 9999999
count := 0
primePalindrome := 0
for i := 0; count < n; i++ {
go PrimeNumber(i)
go PalindromeNumber(i)
if PrimeNumber(i) && PalindromeNumber(i) {
primePalindrome = i
count++
}
}
How do I know that function PrimeNumber(i) and PalindromeNumber(i) already executed and done by go routine so that I can give IF CONDITION to get the number?
Here's my solution: https://go.dev/play/p/KShMctUK9yg
The question is, what if I want to find the 9999999th palindrome with faster runtime using Go concurrency, how to apply concurrency in PrimeNumber() and PalindromeNumber() function?
Similar to Torek, I add concurrency by creating multiple workers that can each check a number, thereby checking several numbers in parallel.
Algorithmically, the program works like this. One goroutine generates possible prime palindromes. A pool of multiple goroutines all check candidates. The main goroutine collects results. When we have at least enough results to provide the nth prime palindrome, we sort the list and then return the answer.
Look closely at the use of channels and a wait group to communicate between the goroutines:
multiple worker goroutines. Their job is to run palindrome and then prime checks for candidate numbers. They listen on the candidates channel; when the channel is closed, they end, communicating that they're done to the wg sync.WaitGroup.
a single candidate generator. This goroutine sends each candidate to one of the workers by sending to the candidates channel. If it finds the done channel to be closed, it ends.
the main goroutine also functions as the collector of results. It reads from the resc (results) channel and adds the results to a list. When the list is at least the required length, it closes done, signaling the generator to stop generating candidates.
The done channel may seem redundant, but it's important because the main results collector knows when we are done generating candidates, but isn't the one sending to candidates. IF we closed candidates in main, there's a good chance the generator would attempt to write to it and that would crash the program. The generator is the one writing candidates; it is the only goroutine that "knows" no more candidates will be written.
Note that this implementation generates at least n prime palindromes. Since it generates the prime palindromes in parallel, there's no guarantee that we have them in order. We might generate up to prime palindrome n+m where m is the number of workers minus one, in the worst case.
There's a lot of room for improvement here. I'm pretty sure the generator and collector roles could be combined in one select loop on candidate and result channels. The program also seems to have a very hard time if n is as big as 9999999 when I run it on my windows machine - see if your results vary.
Edit: Performance enhancements
If you're looking to improve performance, here's a few things I found and noticed last night.
No need to check even numbers. for i := start; ; i += 1 + (i % 2) skips to the next odd number, then adds 2 every other time to keep on the odd numbers.
all palindromes of even numbered decimal length are divisble by 11. So whole sets of numbers of even length can be skipped. I do this by adding math.Pow10(len(str)) to even numbered decimal representations to add another digit. This is what caused the program to stop outputting numbers for large amounts of time - every even numbered set of numbers cannot produce prime palindromes.
if the number's decimal notation starts with an even number it can't be a prime palindrome unless it's only one digit long. Same is true of 5. In the code below I add math.Pow10(len(str)-1) to the number to move to the next odd numbered sequence. If the number starts with 5, I double that to move to the next odd numbered sequence.
These tricks make the code a lot more performant, but it's still a brute force at the end of the day and I still haven't gotten even close to 9999999.
// send candidates
go func() {
// send candidates out
for i := start; ; i += 1 + (i % 2) {
str := strconv.FormatInt(i, 10)
if len(str) % 2 == 0 && i != 11 {
newi := int64(math.Pow10(len(str)))+1
log.Printf("%d => %d", i, newi)
i = newi
continue
}
if first := (str[0]-'0'); first % 2 == 0 || first == 5 {
if i < 10 {
continue
}
nexti := int64(math.Pow10(len(str)-1))
if first == 5 {
nexti *= 2
}
newi := i + nexti
log.Printf("%d -> %d", i, newi)
i = newi-2
continue
}
select {
case _, ok := <-done:
if !ok {
close(candidates)
return
}
case candidates <- i:
continue
}
}
}()
There are multiple issues to solve here:
we want to spin off "is prime" and "is palindrome" tests
we want to sequentially order the numbers that pass the tests
and of course, we have to express the "spin off" and "wait for result" parts of the problem in our programming language (in this case, Go).
(Besides this, we might want to optimize our primality testing, perhaps with a Sieve of Eratothenese algorithm or similar, which may also involve parallelism.)
The middle problem here is perhaps the hardest one. There is a fairly obvious way to do it, though: we can observe that if we assign an ascending-order number to each number tested, the answers that come back (n is/isnot suitable), even if they come back in the wrong order, are easily re-shuffled into order.
Since your overall loop increments by 1 (which is kind of a mistake1), the numbers tested are themselves suitable for this purpose. So we should create a Go channel whose type is a pair of results: Here is the number I tested, and here is my answer:
type result struct {
tested int // the number tested
passed bool // pass/fail result
}
testC := make(chan int)
resultC := make(chan result)
Next, we'll use a typical "pool of workers". Then we run our loop of things-to-test. Here is your existing loop:
for i := 0; count < n; i++ {
go PrimeNumber(i)
go PalindromeNumber(i)
if PrimeNumber(i) && PalindromeNumber(i) {
primePalindrome = i
count++
}
}
We'll restructure this as:
count := 0
busy := 0
results := []result{}
for toTest := 0;; toTest += 2 {
// if all workers are busy, wait for one result
if busy >= numWorkers {
result := <-resultC // get one result
busy--
results := addResult(results, result)
if result.passed {
count++ // passed: increment count
if count >= n {
break
}
}
}
// still working, so test this number
testC <- toTest
busy++
}
close(testC) // tell workers to stop working
// collect remaining results
for result := range resultC {
results := addResult(results, result)
}
(The "busy" test is a bit klunky; you could use a select to send or receive, whichever you can do first, instead, but if you do that, the optimizations outlined below get a little more complicated.)
This does mean our standard worker pool pattern needs to close the result channel resultC, which means we'll add a sync.WaitGroup when we spawn off the numWorkers workers:
var wg sync.WaitGroup
wg.Add(numWorkers)
for i := 0; i < numWorkers; i++ {
go worker(&wg, testC, resultC)
}
go func() {
wg.Wait()
close(resultC)
}()
This makes our for result := range resultC loop work; the workers all stop (and return and call wg.Done() via their defers, which are not shown here) when we close testC so resultC is closed once the last worker exits.
Now we have one more problem, which is: the results come back in semi-random order. That's why we have a slice of results. The addResult function needs to expand the slice and insert the result in the proper position, using the value-tested. When the main loop reaches the break statement, the number in toTest is at least the n'th palindromic prime, but it may be great than the n'th. So we need to collect the remaining results and look backwards to see if some earlier number was in fact the n'th.
There are a number of optimizations to make at this point: in particular, if we've tested numbers through k and they're all known to have passed or failed and k + numWorkers < n, we no longer need any of these results (whether they passed or failed) so we can shrink the results slice. Or, if we're interested in building a table of palindromic primes, we can do that, or anything else we might choose. The above is not meant to be the best solution, just a solution.
Note again that we "overshoot": whatever numWorkers is, we may test up to numWorkers-1 values that we didn't need to test at all. That, too, might be optimizable by having each worker quit early (using some sort of quit indicator, whether that's a "done" channel or just a sync/atomic variable) if they're working on a number that's higher than the now-known-to-be-at-least-n'th value.
1We can cut the problem in half by starting with answers pre-loaded with 2, or 1 and 2 if you choose to consider 1 prime—see also http://ncatlab.org/nlab/show/too+simple+to+be+simple. Then we run our loop from 3 upwards by 2 each time, so that we don't even bother testing even numbers, since 2 is the only even prime number.
The answer depends of many aspects of your problem
For instance, if each step is independent, you can use the example below. If the next step depends on the previous you need to find another solution.
For instance: a palindrome number does not depends of the previous number. You can paralelize the palindrome detection. While the prime is a sequential code.
Perhaps check if a palindrome is prime by checking the divisors until the square root of that number. You need to benchmark it
numbers := make(chan int, 1000) // add some buffer
var wg sync.WaitGroup
for … { // number of consumers
wg.Add(1)
go consume(&wg, numbers)
}
for i … {
numbers <- i
}
close(numbers)
wg.Wait()
// here all consumers ended properly
…
func consume(wg *sync.WaitGroup, numbers chan int){
defer wg.Done()
for i := range numbers {
// use i
}
}

Concurrency vs parallelism when executing a goroutine

A fairly naive go question. I was going through go-concurrency tutorial and I came across this https://tour.golang.org/concurrency/4.
I modified the code to add a print statement in the fibonacci function. So the code looks something like
package main
import (
"fmt"
)
func fibonacci(n int, c chan int) {
x, y := 0, 1
for i := 0; i < n; i++ {
c <- x
x, y = y, x+y
fmt.Println("here")
}
close(c)
}
func main() {
c := make(chan int, 10)
go fibonacci(cap(c), c)
for i := range c {
fmt.Println(i)
}
}
And I got this as an output
here
here
here
here
here
here
here
here
here
here
0
1
1
2
3
5
8
13
21
34
I was expecting here and the numbers to be interleaved. (Since the routine gets executed concurrently)
I think I am missing something basic about go-routines. Not quite sure what though.
A few things here.
You have 2 goroutines, one running main(), and one running fibonacci(). Because this is a small program, there isn't a good reason for the go scheduler not to run them one after another on the same thread, so that's what happens consistently, though it isn't guaranteed. Because the goroutine in main() is waiting for the chan, the fibonacci() routine is scheduled first. It's important to remember that goroutines aren't threads, they're routines that the go scheduler runs on threads according to its liking.
Because you're passing the length of the buffered channel to fibonacci() there will almost certainly (never rely on this behavior) be cap(c) heres printed after which the channel is filled, the for loop finishes, a close is sent to the chan, and the goroutine finishes. Then the main() goroutine is scheduled and cap(c) fibonacci's will be printed. If the buffered chan had filled up, then main() would have been rescheduled:
https://play.golang.org/p/_IgFIO1K-Dc
By sleeping you can tell the go scheduler to give up control. But in practice never do this. Restructure in some way or, if you must, use a Waitgroup. See: https://play.golang.org/p/Ln06-NYhQDj
I think you're trying to do this: https://play.golang.org/p/8Xo7iCJ8Gj6
I think what you are observing is that Go has its own scheduler, and at the same time there is a distinction between "concurrency" and "parallelism". In the words of Rob Pike: Concurrency is not Parallelism
Goroutines are much more lightweight than OS threads and they are managed in "userland" (within the Go process) as opposed to the operating system. Some programs have many thousands (even tens of thousands) of goroutines running, whilst there would certainly be far fewer operating system threads allocated. (This is one of Go's major strengths in asynchronous programs with many routines)
Because your program is so simple, and the channel buffered, it does not block on writing to the channel:
c <- x
The fibonacci goroutine isn't getting preempted before it completes the short loop.
Even the fmt.Println("here") doesn't deterministically introduce preemption - I learned something myself there in writing this answer. It is buffered, like the analagous printf and scanf from C.
(see the source code https://github.com/golang/go/blob/master/src/fmt/print.go)
For interest, if you wanted to artificially control the number of OS threads, you can set the GOMAXPROCS environment variable on the command line:
~$ GOMAXPROCS=1 go run main.go
However, with your simple program there probably would be no discernable difference, because the Go runtime is still perfectly capable of scheduling many goroutines against 1 OS thread.
For example, here is a minor variation to your program. By making the channel buffer smaller (5), but still iterating 10 times, we introduce a point at which the fibonacci go routine can (but won't necessarily) be preempted, where it could block at least once on writing to the channel:
package main
import (
"fmt"
)
func fibonacci(n int, c chan int) {
x, y := 0, 1
for i := 0; i < n; i++ {
c <- x
x, y = y, x+y
fmt.Println("here")
}
close(c)
}
func main() {
c := make(chan int, 5)
go fibonacci(cap(c)*2, c)
for i := range c {
fmt.Println(i)
}
}
~$ GOMAXPROCS=1 go run main.go
here
here
here
here
here
here
0
1
1
2
3
5
8
here
here
here
here
13
21
34
Long explanation here, short explanation is that there are a multitude of reasons that a go routine can temporarily block and those are ideal opportunities for the go scheduler to schedule execution of another go routine.
If you add this after the fmt.Println in the fibonacci loop, you will see the results interleaved the way you would expect:
time.Sleep(1 * time.Second)
This gives the Go scheduler a reason to block the execution of the fibonacci() goroutine long enough to allow the main() goroutine to read from the channel.

Efficient way to skip code every X iterations?

I'm using GameMaker Studio and you can think of it as a giant loop.
I use a counter variable step to keep track of what frame it is.
I'd like to run some code only every Xth step for efficiency.
if step mod 60 {
}
Would run that block every 60 steps (or 1 second at 60 fps).
My understanding is modulus is a heavy operation though and with thousands of steps I imagine the computation can get out of hand. Is there a more efficient way to do this?
Perhaps involving bitwise operator?
I know this can work for every other frame:
// Declare
counter = 0
// Step
counter = (counter + 1) & 1
if counter {
}
Or is the performance impact of modulus negligible at 60FPS even with large numbers?
In essence:
i := 0
WHILE i < n/4
do rest of stuff × 4
do stuff that you want to do one time in four
Increment i
Do rest of stuff i%4 times
The variant of this that takes the modulus and switches based on that is called Duff’s Device. Which is faster will depend on your architecture: on many RISC chips, mod executes in a single clock cycle, but on other CPUs, integer division might not even be a native instruction.
If you don’t have a loop counter per se, because it’s an event loop for example, you can always make one and reset it every four times in the if block where you execute your code:
i := 1
WHILE you loop
do other stuff
if i == 4
do stuff
i := 1
else
i := i + 1
Here’s an example of doing some stuff one time in two and stuff one time in three:
WHILE looping
do stuff
do stuff a second time
do stuff B
do stuff a third time
do stuff C
do stuff a fourth time
do stuff B
do stuff a fifth time
do stuff a sixth time
do stuff B
do stiff C
Note that the stuff you do can include calling an event loop once.
Since this can get unwieldy, you can use template metaprogramming to write these loops for you in C++, something like:
constexpr unsigned a = 5, b = 7, LCM_A_B = 35;
template<unsigned N>
inline void do_stuff(void)
{
do_stuff_always();
if (N%a)
do_stuff_a(); // Since N is a compile-time constant, the compiler does not have to check this at runtime!
if (N%b)
do_stuff_b();
do_stuff<N-1>();
}
template<>
inline void do_stuff<1U>(void)
{
do_stuff_always();
}
while (sentinel)
do_stuff<LCM_A_B>();
In general, though, if you want to know whether your optimizations are helping, profile.
The most important part of the answer: that test probably takes so little time, in context, that it isn't worth the ions moving around your brain to think about it.
If it only costs 1% it's almost certain there are bigger speedups you should be thinking about.
However, if the loop is fast, you could put in something like this:
if (--count < 0){
count = 59;
// do your thing
}
In some hardware, that test comes down to a single instruction decrement-and-branch-if-negative.

Why are Scala for loops so much faster than while loops for processing a sequence of distinct objects?

I recently read a post about for loops over a range of integers being slower than the corresponding while loops, which is true, but wanted to see if the same held up for iterating over existing sequences and was surprised to find the complete opposite by a large margin.
First and foremost, I'm using the following function for timing:
def time[A](f: => A) = {
val s = System.nanoTime
val ret = f
println("time: " + (System.nanoTime - s) / 1e6 + "ms")
ret
}
and I'm using a simple sequence of Integers:
val seq = List.range(0, 10000)
(I also tried creating this sequence a few other ways in case the way this sequence was accessed affected the run time. Using the Range type certainly did. This should ensure that each item in the sequence is an independent object.)
I ran the following:
time {
for(item <- seq) {
println(item)
}
}
and
time {
var i = 0
while(i < seq.size) {
println(seq(i))
i += 1
}
}
I printed the results so to ensure that we're actually accessing the values in both loops. The first code snippet runs in an average of 33 ms on my machine. The second takes an average of 305 ms.
I tried adding the mutable variable i to the for loop, but it only adds a few milliseconds. The map function gets similar performance to a for loop, as expected. For whatever reason, this doesn't seem to occur if I use an array (converting the above defined seq with seq.toArray). In such a case, the for loop takes 90 ms and the while loop takes 40 ms.
What is the reason for this major performance difference?
The reason is: complexity. seq(i) is Θ(i) for List, which means your whole loop takes quadratic time. The foreach method, however, is linear.
If you compile with -optimize, the for loop version will likely be even faster, because List.foreach should be inlined, eliminating the cost of the lambda.

Why are loops slow in R?

I know that loops are slow in R and that I should try to do things in a vectorised manner instead.
But, why? Why are loops slow and apply is fast? apply calls several sub-functions -- that doesn't seem fast.
Update: I'm sorry, the question was ill-posed. I was confusing vectorisation with apply. My question should have been,
"Why is vectorisation faster?"
It's not always the case that loops are slow and apply is fast. There's a nice discussion of this in the May, 2008, issue of R News:
Uwe Ligges and John Fox. R Help Desk: How can I avoid this loop or
make it faster? R News, 8(1):46-50, May 2008.
In the section "Loops!" (starting on pg 48), they say:
Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory.
They further suggest:
Initialize new objects to full length before the loop, rather
than increasing their size within the loop. Do not do things in a
loop that can be done outside the loop. Do not avoid loops simply
for the sake of avoiding loops.
They have a simple example where a for loop takes 1.3 sec but apply runs out of memory.
Loops in R are slow for the same reason any interpreted language is slow: every
operation carries around a lot of extra baggage.
Look at R_execClosure in eval.c (this is the function called to call a
user-defined function). It's nearly 100 lines long and performs all sorts of
operations -- creating an environment for execution, assigning arguments into
the environment, etc.
Think how much less happens when you call a function in C (push args on to
stack, jump, pop args).
So that is why you get timings like these (as joran pointed out in the comment,
it's not actually apply that's being fast; it's the internal C loop in mean
that's being fast. apply is just regular old R code):
A = matrix(as.numeric(1:100000))
Using a loop: 0.342 seconds:
system.time({
Sum = 0
for (i in seq_along(A)) {
Sum = Sum + A[[i]]
}
Sum
})
Using sum: unmeasurably small:
sum(A)
It's a little disconcerting because, asymptotically, the loop is just as good
as sum; there's no practical reason it should be slow; it's just doing more
extra work each iteration.
So consider:
# 0.370 seconds
system.time({
I = 0
while (I < 100000) {
10
I = I + 1
}
})
# 0.743 seconds -- double the time just adding parentheses
system.time({
I = 0
while (I < 100000) {
((((((((((10))))))))))
I = I + 1
}
})
(That example was discovered by Radford Neal)
Because ( in R is an operator, and actually requires a name lookup every time you use it:
> `(` = function(x) 2
> (3)
[1] 2
Or, in general, interpreted operations (in any language) have more steps. Of course, those steps provide benefits as well: you couldn't do that ( trick in C.
The only Answer to the Question posed is; loops are not slow if what you need to do is iterate over a set of data performing some function and that function or the operation is not vectorized. A for() loop will be as quick, in general, as apply(), but possibly a little bit slower than an lapply() call. The last point is well covered on SO, for example in this Answer, and applies if the code involved in setting up and operating the loop is a significant part of the overall computational burden of the loop.
Why many people think for() loops are slow is because they, the user, are writing bad code. In general (though there are several exceptions), if you need to expand/grow an object, that too will involve copying so you have both the overhead of copying and growing the object. This is not just restricted to loops, but if you copy/grow at each iteration of a loop, of course, the loop is going to be slow because you are incurring many copy/grow operations.
The general idiom for using for() loops in R is that you allocate the storage you require before the loop starts, and then fill in the object thus allocated. If you follow that idiom, loops will not be slow. This is what apply() manages for you, but it is just hidden from view.
Of course, if a vectorised function exists for the operation you are implementing with the for() loop, don't do that. Likewise, don't use apply() etc if a vectorised function exists (e.g. apply(foo, 2, mean) is better performed via colMeans(foo)).
Just as a comparison (don't read too much into it!): I ran a (very) simple for loop in R and in JavaScript in Chrome and IE 8.
Note that Chrome does compilation to native code, and R with the compiler package compiles to bytecode.
# In R 2.13.1, this took 500 ms
f <- function() { sum<-0.5; for(i in 1:1000000) sum<-sum+i; sum }
system.time( f() )
# And the compiled version took 130 ms
library(compiler)
g <- cmpfun(f)
system.time( g() )
#Gavin Simpson: Btw, it took 1162 ms in S-Plus...
And the "same" code as JavaScript:
// In IE8, this took 282 ms
// In Chrome 14.0, this took 4 ms
function f() {
var sum = 0.5;
for(i=1; i<=1000000; ++i) sum = sum + i;
return sum;
}
var start = new Date().getTime();
f();
time = new Date().getTime() - start;

Resources