Timing an operation without being affected by clock changes - time

I'm looking for:
1: startTime := time.Now()
2: // run something here that takes a while (measured in milliseconds)
3: duration := time.Since(startTime)
However, I need something that is immune to clock time changes. If the time is adjusted between lines 1 and 3, the duration will be inaccurate.
What are some common approaches for solving this problem, and what Go libraries could be relevant?
Thanks :)

That lack of monotonic clock was detailed in issue 12914 (2015)
Since then, in August 2017 and Go 1.9, you now have a transparent Monotonic Time support:
The time package now transparently tracks monotonic time in each Time value, making computing durations between two Time values a safe operation in the presence of wall clock adjustments.
See the package docs and design document for details.

For Linux (AMD64) go uses clock_gettime with CLOCK_REALTIME.
See the time·now implementation.
You would want a monotonic clock (CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW), which
is a clock that does not go back in time. In Linux the man page explicitly tells you that CLOCK_MONOTONIC does not guarantee to not leap forward:
This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator
manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.
So, under Linux, the best choice is probably CLOCK_MONOTONIC_RAW. You may use the
clock package mentioned by #MatrixFrog for that. Example:
import (
"fmt"
"github.com/davecheney/junk/clock"
"time"
)
func main() {
start := clock.Monotonic.Now()
// work
end := clock.Monotonic.Now()
duration := end.Sub(start)
fmt.Println("Elapsed:", duration)
}
Further reading:
Discussion about monotonic clocks in go
Outlook to get a clock interface in the go stdlib

Related

Golang time.Ticker to tick on clock times

I am working on a Go program and it requires me to run certain function at (fairly) exact clock times (for example, every 5 minutes, but then specifically at 3:00, 3:05, 3:10, etc, not just every 5 minutes after the start of the program).
Before coming here and requesting your help, I tried implementing a ticker does that, and even though it seems to work ok-ish, it feels a little dirty/hacky and it's not super exact (it's only fractions of milliseconds off, but I'm wondering if there's reason to believe that discrepancy increases over time).
My current implementation is below, and what I'm really asking is, is there a better solution to achieve what I'm trying to achieve (and that I can have a little more confidence in)?
type ScheduledTicker struct {
C chan time.Time
}
// NewScheduledTicker returns a ticker that ticks on defined intervals after the hour
// For example, a ticker with an interval of 5 minutes and an offset of 0 will tick at 0:00:00, 0:05:00 ... 23:55:00
// Using the same interval, but an offset of 2 minutes will tick at 0:02:00, 0:07:00 ... 23:57
func NewScheduledTicker(interval time.Duration, offset time.Duration) *ScheduledTicker {
s := &ScheduledTicker{
C: make(chan time.Time),
}
go func() {
now := time.Now()
// Figure out when the first tick should happen
firstTick := now.Truncate(interval).Add(interval).Add(offset)
// Block until the first tick
<-time.After(firstTick.Sub(now))
t := time.NewTicker(interval)
// Send initial tick
s.C <- firstTick
for {
// Forward ticks from the native time.Ticker to the ScheduledTicker channel
s.C <- <-t.C
}
}()
return s
}
Most timer apis across all platforms work in terms of system time instead of wall clock time. What you are expressing to is have a wall clock interval.
As the other answer expressed, there are open source packages available. A quick Google search for "Golang Wall Clock ticker" yields interesting results.
Another thing to consider. On Windows there are "scheduled tasks" and on Linux there are "cronjobs" that will do the wall clock wakeup interval for you. Consider using that if all your program is going to do is sleep/tick between needed intervals before doing needed work.
But if you build it yourself...
Trying to get things done on wall clock intervals is complicated by desktop PCs going to sleep when laptop lids close (suspending system time) and clock skew between system and wall clocks. And sometimes users like to change their PC's clocks - you could wake up and poll time.Now and discover you're at yesterday! This is unlikely to happen on servers running in the cloud, but a real thing on personal devices.
On my product team, when we really need want clock time or need to do something on intervals that span more than an hour, we'll wake up at a more frequent interval to see if "it's time yet". For example, if there's something we want to execute every 12 hours, we might wake up and poll the time every hour. (We use C++ where I work instead of Go).
Otherwise, my general algorithm for a 5 minute interval would be to sleep (or set a system timer) for 1 minute or shorter. After every return from time.Sleep, check the current time (time.Now()) to see if the current time is at or after the next expected interval time. Post your channel event and then recompute the next wake up time. You can even change the granularity of your sleep time if you work up spuriously early.
But be careful! the golang Time object contains both a wall clock and system clock time. This includes the result returned by Time.Now(). Time.Add(), Time.Sub(), and some of the other Time comparison functions work on the monolithic time first before falling over to wall clock time. You can strip the monolithic time out of the Time.Now result by doing this:
func GetWallclockNow() time.Time {
var t time.Time = time.Now()
return time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), t.Minute(), t.Second(), t.Nanosecond(), t.Location())
}
Then subsequent operations like Add and After will be in wall clock space.

What’s the meaning of `Duration: 30.18s, Total samples = 26.26s (87.00%)` in go pprof?

As my understanding, pprof stops and samples go program at every 10ms. So a 30s program should got 3000 samples, but what’s the meaning of the 26.26s? How can the samples count be shown as time duration?
What’s more, I even ever got such output shows that the sample time is bigger than wall time, how could it be such result?
Duration: 5.13s, Total samples = 5.57s (108.58%)
That confusing wording was reported in google/pprof issue 128
The "Total samples" part is confusing.
Milliseconds are continuous, but samples are discrete — they're individual points, so how can you sum them up into a quantity of milliseconds?
The "sum" is the sum of a discrete number (a quantity of samples), not a continuous range (a time interval).
Reporting the sums makes perfect sense, but reporting discrete numbers using continuous units is just plain confusing.
Please update the formatting of the Duration line to give a clearer indication of what a quantity of samples reported in milliseconds actually means.
Raul Silvera's answer:
Each callstack in a profile is associated to a set of values. What is reported here is the sum of these values for all the callstacks in the profile, and is useful to understand the weight of individual frames over the full profile.
We're reporting the sum using the unit described in the profile.
Would you have a concrete suggestion for this example?
Maybe just a rewording would help, like:
Duration: 1.60s, Samples account for 14.50ms (0.9%)
There are still pprof improvements discussed for pprof in golang/go issue 36821
pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs).
The issue is with the use of OS timers used for sampling; OS timers are coarse-grained and have a high skid.
I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters
It includes examples where Total samples exceeds duration.
Dependence on the number of cores and length of test execution:
The results of goroutine.go test depend on the number of CPU cores available.
On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds.
However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem.
One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
Since Go 1.17 (and improved in Go 1.18), you can add pprof labels to know more:
A cool feature of Go's CPU profiler is that you can attach arbitrary key value pairs to a goroutine. These labels will be inherited by any goroutine spawned from that goroutine and show up in the resulting profile.
Let's consider the example below that does some CPU work() on behalf of a user.
By using the pprof.Labels() and pprof.Do() API, we can associate the user with the goroutine that is executing the work() function.
Additionally the labels are automatically inherited by any goroutine spawned within the same code block, for example the backgroundWork() goroutine.
func work(ctx context.Context, user string) {
labels := pprof.Labels("user", user)
pprof.Do(ctx, labels, func(_ context.Context) {
go backgroundWork()
directWork()
})
}
How you use these labels is up to you.
You might include things such as user ids, request ids, http endpoints, subscription plan or other data that can allow you to get a better understanding of what types of requests are causing high CPU utilization, even when they are being processed by the same code paths.
That being said, using labels will increase the size of your pprof files. So you should probably start with low cardinality labels such as endpoints before moving on to high cardinality labels once you feel confident that they don't impact the performance of your application.

Speed up without a serial fraction

I ran a set of experiments on a parallel package, say superlu-dist, with different processor numbers e.g.: 4, 16, 32, 64
I got the wall clock time for each experiment, say: 53.17s, 32.65s, 24.30s, 16.03s
The formula of speedup is :
serial time
Speedup = ----------------------
parallel time
But there is no information about the serial fraction.
Can I simply take the reciprocal of the wall clock time?
Can I simply take the reciprocal of the wall clock time ?
No,true Speedup figures require comparing Apples to Apples :
This means, that an original, pure-[SERIAL] process-scheduling ought be compared with any other scenario, where parts may get modified, so as to use some sort of parallelism ( the parallel fraction may get re-organised, so as to run on N CPUs / computing-resources, whereas the serial fraction is left as was ).
This obviously means, that the original [SERIAL]-code was extended ( both in code ( #pragma-decorators, OpenCL-modifications, CUDA-{ host_to_dev | dev_to_host }-tooling etc.), and in time( to execute these added functionalities, that were not present in the original [SERIAL]-code, to benchmark against ), so as to add some new sections, where the ( possible [PARALLEL] ) other part of the processing will take place.
This comes at cost -- add-on overhead costs ( to setup and to terminate and to communicate data from [SERIAL]-part there, to the [PARALLEL]-part and back ) -- which all adds additional [SERIAL]-part workload ( and execution time + latency ).
For more details, feel free to read section Criticism in article on re-formulated Amdahl's Law.
The [PARALLEL]-portion seems interesting, yet the Speedup principal ceiling is in the [SERIAL]-portion duration ( s = 1 - p ) in the original,
but to which add-on durations and added latency costs need to get added as accumulated alongside the "organisation" of work from an original, pure-[SERIAL], to the wished-to-have [PARALLEL]-code execution process scheduling, if realistic evaluation is to be achieved
run the test on a single processor and set that as the serial time, ...,
as #VictorSong has proposed sounds easy, but benchmarks an incoherent system ( not the pure-[SERIAL] original) and records a skewed yardstick to compare against.
This is the reason, why fair methods ought be engineered. The pure-[SERIAL] original code-execution can be time-stamped, so as to show the real durations of unchanged-parts, but the add-on overhead times have to get incorporated into the add-on extensions of the serial part of the now parallelised tests.
The re-articulated Amdahl's Law of Diminishing Returns explains this, altogether with impacts from add-on overheads and also from atomicity-of-processing, that will not allow further fictions of speedup growth, given more computing resources are added, but a parallel-fraction of the processing does not permit further split of task workloads, due to some form of its internal atomicity-of-processing, that cannot be further divided in spite of having free processors available.
The simplified of the two, re-formulated expressions stands like this :
1
S = __________________________; where s, ( 1 - s ), N were defined above
( 1 - s ) pSO:= [PAR]-Setup-Overhead add-on
s + pSO + _________ + pTO pTO:= [PAR]-Terminate-Overhead add-on
N
Some interactive GUI-tools for further visualisations of the add-on overhead-costs are available for interactive parametric simulations here - just move the p-slider towards the actual value of the ( 1 - s ) ~ having a non-zero fraction of the very [SERIAL]-part of the original code :
What do you mean when you say "serial fraction"? According to a Google search apparently superlu-dist is C, so I guess you could just use ctime or chrono and take the time the usual way, it works for me with both manual std::threads and omp.
I'd just run the test on a single processor and set that as the serial time, then do the test again with more processors (just like you said).

Testing Erlang function performance with timer

I'm testing the performance of a function in a tight loop (say 5000 iterations) using timer:tc/3:
{Duration_us, _Result} = timer:tc(M, F, [A])
This returns both the duration (in microseconds) and the result of the function. For argument's sake the duration is N microseconds.
I then perform a simple average calculation on the results of the iterations.
If I place a timer:sleep(1) function call before the timer:tc/3 call, the average duration for all the iterations is always > the average without the sleep:
timer:sleep(1),
timer:tc(M, F, [A]).
This doesn't make much sense to me as the timer:tc/3 function should be atomic and not care about anything that happened before it.
Can anyone explain this strange functionality? Is it somehow related to scheduling and reductions?
Do you mean like this:
4> foo:foo(10000).
Where:
-module(foo).
-export([foo/1, baz/1]).
foo(N) -> TL = bar(N), {TL,sum(TL)/N} .
bar(0) -> [];
bar(N) ->
timer:sleep(1),
{D,_} = timer:tc(?MODULE, baz, [1000]),
[D|bar(N-1)]
.
baz(0) -> ok;
baz(N) -> baz(N-1).
sum([]) -> 0;
sum([H|T]) -> H + sum(T).
I tried this, and it's interesting. With the sleep statement the mean time returned by timer:tc/3 is 19 to 22 microseconds, and with the sleep commented out, the average drops to 4 to 6 microseconds. Quite dramatic!
I notice there are artefacts in the timings, so events like this (these numbers being the individual microsecond timings returned by timer:tc/3) are not uncommon:
---- snip ----
5,5,5,6,5,5,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,5,4,5,5,5,5,6,5,5,
5,6,5,5,5,5,5,6,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,5,5,5,4,5,
5,5,5,6,5,5,5,6,5,5,7,8,7,8,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,
14,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,
5,5,4,5,4,5,5,4,4,5,5,4,5,5,4,4,4,4,4,5,4,5,5,4,5,5,5,4,5,5,
4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,4,5,4,5,5,4,4,4,4,5,4,
5,5,54,22,26,21,22,22,24,24,32,31,36,31,33,27,25,21,22,21,
24,21,22,22,24,21,22,21,24,21,22,22,24,21,22,21,24,21,22,21,
23,27,22,21,24,21,22,21,24,22,22,21,23,22,22,21,24,22,22,21,
24,21,22,22,24,22,22,21,24,22,22,22,24,22,22,22,24,22,22,22,
24,22,22,22,24,22,22,21,24,22,22,21,24,21,22,22,24,22,22,21,
24,21,23,21,24,22,23,21,24,21,22,22,24,21,22,22,24,21,22,22,
24,22,23,21,24,21,23,21,23,21,21,21,23,21,25,22,24,21,22,21,
24,21,22,21,24,22,21,24,22,22,21,24,22,23,21,23,21,22,21,23,
21,22,21,23,21,23,21,24,22,22,22,24,22,22,41,36,30,33,30,35,
21,23,21,25,21,23,21,24,22,22,21,23,21,22,21,24,22,22,22,24,
22,22,21,24,22,22,22,24,22,22,21,24,22,22,21,24,22,22,21,24,
22,22,21,24,21,22,22,27,22,23,21,23,21,21,21,23,21,21,21,24,
21,22,21,24,21,22,22,24,22,22,22,24,21,22,22,24,21,22,21,24,
21,23,21,23,21,22,21,23,21,23,22,24,22,22,21,24,21,22,22,24,
21,23,21,24,21,22,22,24,21,22,22,24,21,22,21,24,21,22,22,24,
22,22,22,24,22,22,21,24,22,21,21,24,21,22,22,24,21,22,22,24,
24,23,21,24,21,22,24,21,22,21,23,21,22,21,24,21,22,21,32,31,
32,21,25,21,22,22,24,46,5,5,5,5,5,4,5,5,5,5,6,5,5,5,5,5,5,4,
6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,
5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,6,4,6,5,5,5,5,5,5,4,6,5,5,5,
5,4,5,5,5,5,5,5,6,5,5,5,5,4,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,
5,5,5,4,5,5,6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,
6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,4,5,4,5,5,5,5,5,6,5,5,
5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,
---- snip ----
I assume this is the effect you are referring to, though when you say always > N, is it always, or just mostly? Not always for me anyway.
The above results extract was without the sleep. Typically when using sleep timer:tc/3 returns low times like 4 or 5 most of the time without the sleep, but sometimes big times like 22, and with the sleep in place it's usually big times like 22, with occasional batches of low times.
It's certainly not obvious why this would happen, since sleep really just means yield. I wonder if all this is not down to the CPU cache. After all, especially on a machine that's not busy, one might expect the case without the sleep to execute most of the code all in one go without it getting moved to another core, without doing so much else with the core, thus making the most out of the caches... but when you sleep, and thus yield, and come back later, the chances of cache hits might be considerably less.
Measuring performance is a complex task especially on new HW and in modern OS. There are many things which can fiddle with your result. First thing, you are not alone. It is when you measure on your desktop or notebook, there can be other processes which can interfere with your measurement including system ones. Second thing, there is HW itself. Moder CPUs have many cool features which control performance and power consumption. They can boost performance for a short time before overheat, they can boost performance when there is not work on other CPUs on the same chip or other hyper thread on the same CPU. On another hand, they can enter power saving mode when there is not enough work and CPU doesn't react fast enough to the sudden change. It is hard to tell if it is your case, but it is naive to thing previous work or lack of it can't affect your measurement. You should always take care to measure in steady state for long enough time (seconds at least) and remove as much as possible other things which could affect your measurement. (And do not forget GC in Erlang as well.)

Go-lang parallel segment runs slower than series segment

I have built an epidemic mathematics model which is fairly computationally intense in Go. I'm trying now to build a set of systems to test my model, where I change an input and expect a different output. I built a version in series to slowly increase HIV prevalence and see effects on HIV deaths. It takes ~200 milliseconds to run.
for q = 0.0; q < 1000; q++ {
inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
results := costAnalysisHandler(inputs)
fmt.Println(results.HivDeaths[20])
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small changes are important as we will be running millions of runs with different inputs, so would like to make it as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
results := costAnalysisHandler(inputs)
fmt.Println(results.HivDeaths[20])
ch <- ChData{int(q), results.HivDeaths[20]}
}(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
theResults := <-ch
fmt.Println(theResults)
}
Any thoughts are very much appreciated.
There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs equals the cost of communication if the program was taking 200ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time--e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: And as #Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if .2 seconds of waiting is all that performance work can save you here, it's not worth it.
Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPU's you have before running your code. Or call runtime.GOMAXRPROCS(runtime.NumCPU()) before you start the parallell computations.
I see two issues related to parallel performance,
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one cpu/core. Typically one would set it for the number of processors in the machine but the ideal setting can vary.
The second problem is a bit trickier, which is that your code doesn't appear to be parallelizing very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably be using some kind of worker pool, running a limited number of simultaneous computations(a good starting number would be to set it the same as GOMAXPROCS) rather than trying to do 1000 at once.
See: http://golang.org/doc/faq#Why_no_multi_CPU

Resources