Golang time.Ticker to tick on clock times

I am working on a Go program that needs to run a certain function at (fairly) exact clock times (for example, every 5 minutes, but specifically at 3:00, 3:05, 3:10, etc., not just every 5 minutes after the start of the program).
Before coming here and requesting your help, I tried implementing a ticker that does that, and even though it seems to work OK-ish, it feels a little dirty/hacky and it's not super exact (it's only fractions of milliseconds off, but I'm wondering if there's reason to believe that discrepancy increases over time).
My current implementation is below. What I'm really asking is: is there a better solution to achieve this, one that I can have a little more confidence in?
type ScheduledTicker struct {
    C chan time.Time
}

// NewScheduledTicker returns a ticker that ticks on defined intervals after the hour.
// For example, a ticker with an interval of 5 minutes and an offset of 0 will tick at 0:00:00, 0:05:00 ... 23:55:00.
// Using the same interval, but an offset of 2 minutes, will tick at 0:02:00, 0:07:00 ... 23:57:00.
func NewScheduledTicker(interval time.Duration, offset time.Duration) *ScheduledTicker {
    s := &ScheduledTicker{
        C: make(chan time.Time),
    }
    go func() {
        now := time.Now()
        // Figure out when the first tick should happen
        firstTick := now.Truncate(interval).Add(interval).Add(offset)
        // Block until the first tick
        <-time.After(firstTick.Sub(now))
        t := time.NewTicker(interval)
        // Send initial tick
        s.C <- firstTick
        for {
            // Forward ticks from the native time.Ticker to the ScheduledTicker channel
            s.C <- <-t.C
        }
    }()
    return s
}

Most timer APIs across all platforms work in terms of system time instead of wall-clock time. What you are expressing is that you want a wall-clock interval.
As the other answer expressed, there are open source packages available. A quick Google search for "Golang Wall Clock ticker" yields interesting results.
Another thing to consider. On Windows there are "scheduled tasks" and on Linux there are "cronjobs" that will do the wall clock wakeup interval for you. Consider using that if all your program is going to do is sleep/tick between needed intervals before doing needed work.
But if you build it yourself...
Trying to get things done on wall clock intervals is complicated by desktop PCs going to sleep when laptop lids close (suspending system time) and clock skew between system and wall clocks. And sometimes users like to change their PC's clocks - you could wake up and poll time.Now and discover you're at yesterday! This is unlikely to happen on servers running in the cloud, but a real thing on personal devices.
On my product team, when we really want clock time or need to do something on intervals that span more than an hour, we'll wake up at a more frequent interval to see if "it's time yet". For example, if there's something we want to execute every 12 hours, we might wake up and poll the time every hour. (We use C++ where I work instead of Go.)
Otherwise, my general algorithm for a 5-minute interval would be to sleep (or set a system timer) for 1 minute or shorter. After every return from time.Sleep, check the current time (time.Now()) to see if it is at or after the next expected interval time. Post your channel event and then recompute the next wake-up time. You can even change the granularity of your sleep time if you wake up spuriously early. A sketch of this approach follows below.
But be careful! The Go time.Time value contains both a wall-clock and a monotonic clock reading. This includes the result returned by time.Now(). Time.Add(), Time.Sub(), and some of the other Time comparison functions operate on the monotonic reading first before falling back to wall-clock time. You can strip the monotonic reading out of the time.Now result by doing this:
func GetWallclockNow() time.Time {
    var t time.Time = time.Now()
    return time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), t.Minute(), t.Second(), t.Nanosecond(), t.Location())
}
Then subsequent operations like Add and After will be in wall clock space.
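Putting the pieces together, here is a minimal sketch of the poll-based approach described above, using the GetWallclockNow helper (the function name, the 1-second poll granularity, and the callback signature are illustrative assumptions of mine, not part of the original code):
func runOnWallClockInterval(interval time.Duration, f func(time.Time)) {
    // Compute the first scheduled tick in wall-clock space.
    next := GetWallclockNow().Truncate(interval).Add(interval)
    for {
        time.Sleep(time.Second) // poll granularity; tune as needed
        now := GetWallclockNow()
        if !now.Before(next) {
            f(now)
            // Recompute from the current wall clock so a clock jump
            // (forward or backward) resynchronizes the schedule.
            next = now.Truncate(interval).Add(interval)
        }
    }
}
Because both sides of the comparison were built with GetWallclockNow, the Before check happens purely in wall-clock space, which is the point of the stripping trick above.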

Related

What’s the meaning of `Duration: 30.18s, Total samples = 26.26s (87.00%)` in go pprof?

As I understand it, pprof stops and samples a Go program every 10ms. So a 30s program should yield 3,000 samples, but what is the meaning of the 26.26s? How can the sample count be shown as a time duration?
What's more, I have even gotten output showing a sample time bigger than the wall time; how can that be?
Duration: 5.13s, Total samples = 5.57s (108.58%)
That confusing wording was reported in google/pprof issue 128
The "Total samples" part is confusing.
Milliseconds are continuous, but samples are discrete — they're individual points, so how can you sum them up into a quantity of milliseconds?
The "sum" is the sum of a discrete number (a quantity of samples), not a continuous range (a time interval).
Reporting the sums makes perfect sense, but reporting discrete numbers using continuous units is just plain confusing.
Please update the formatting of the Duration line to give a clearer indication of what a quantity of samples reported in milliseconds actually means.
Raul Silvera's answer:
Each callstack in a profile is associated to a set of values. What is reported here is the sum of these values for all the callstacks in the profile, and is useful to understand the weight of individual frames over the full profile.
We're reporting the sum using the unit described in the profile.
Would you have a concrete suggestion for this example?
Maybe just a rewording would help, like:
Duration: 1.60s, Samples account for 14.50ms (0.9%)
There are still pprof improvements being discussed in golang/go issue 36821:
pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs).
The issue is with the use of OS timers used for sampling; OS timers are coarse-grained and have a high skid.
I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters
It includes examples where Total samples exceeds duration.
Dependence on the number of cores and length of test execution:
The results of goroutine.go test depend on the number of CPU cores available.
On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds.
However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem.
One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
Since Go 1.17 (and improved in Go 1.18), you can add pprof labels to know more:
A cool feature of Go's CPU profiler is that you can attach arbitrary key value pairs to a goroutine. These labels will be inherited by any goroutine spawned from that goroutine and show up in the resulting profile.
Let's consider the example below that does some CPU work() on behalf of a user.
By using the pprof.Labels() and pprof.Do() API, we can associate the user with the goroutine that is executing the work() function.
Additionally the labels are automatically inherited by any goroutine spawned within the same code block, for example the backgroundWork() goroutine.
func work(ctx context.Context, user string) {
    labels := pprof.Labels("user", user)
    pprof.Do(ctx, labels, func(_ context.Context) {
        go backgroundWork()
        directWork()
    })
}
How you use these labels is up to you.
You might include things such as user ids, request ids, http endpoints, subscription plan or other data that can allow you to get a better understanding of what types of requests are causing high CPU utilization, even when they are being processed by the same code paths.
That being said, using labels will increase the size of your pprof files. So you should probably start with low cardinality labels such as endpoints before moving on to high cardinality labels once you feel confident that they don't impact the performance of your application.
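For context, here is a minimal, self-contained sketch of capturing a labeled CPU profile (cpuHog, the user names, and the output file name are stand-ins of mine, not from the original):
package main

import (
    "context"
    "os"
    "runtime/pprof"
)

// cpuHog burns CPU so the profiler has samples to attribute.
func cpuHog() {
    n := 0
    for i := 0; i < 1e8; i++ {
        n += i
    }
    _ = n
}

func work(ctx context.Context, user string) {
    // Samples taken while this function runs carry the "user" label.
    pprof.Do(ctx, pprof.Labels("user", user), func(ctx context.Context) {
        cpuHog()
    })
}

func main() {
    f, err := os.Create("cpu.pprof")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()

    work(context.Background(), "alice")
    work(context.Background(), "bob")
}
From there, go tool pprof cpu.pprof can break the profile down by label (recent pprof versions expose a tags view in the interactive shell).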

How to properly implement waiting of async computations?

I have a little trouble and am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here
while (true) {
    async_enqueue_upload(input);    // completes instantly, but transfer will take 10us
    async_enqueue_calculate();      // completes instantly, but computation will take 80us
    async_enqueue_download(output); // completes instantly, but transfer will take 10us
    sync_wait_finish();             // must wait while output is fully calculated, and there is no junk
    input = process(output);        // I cannot launch the next step without doing it on the host.
}
I am asking about the sync_wait_finish() part. I must wait for all devices to finish, combine all results, process the data somehow, and upload a new portion that is based on the previous computation step. I need to sync data between each step, so I can't parallelize the steps. I know this is not a very performant case. So let's proceed to the question.
I have two ways of checking completion within wait_finish(). The first is to put the thread to sleep until it is woken by a completion event:
while (!is_completed())
    Sleep(1);
This has very low performance, because the actual calculation takes, say, 100us, while the minimal Windows scheduler timestep is 1ms, so it gives an unacceptable 10x performance loss.
The second way is to check completion in an empty infinite loop:
while (!is_completed())
    {} // do_nothing();
This gives 10x better computation performance. But it is also an unsuitable solution, because it keeps a full CPU core busy with absolutely useless work. How do I make the CPU "sleep" for exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved when the calculation time is too big for an active spin-wait, but too small compared to the scheduler timestep? Also, a related subquestion: how do you do that on Linux?
Fortunately, I succeeded in finding an answer on my own. In short: I should use Linux for that.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but it can be loaded in the following manner:
static int (__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) =
    (int (__stdcall *)(BOOL, PLARGE_INTEGER))
        GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows setting sleep intervals in 100ns units. However, even that did not work well, as shown in the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires Admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc();  // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
    sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
    auto s1 = qpc();
    n++;
    auto passed = s1 - s0;
    if (passed >= hpf) {
        std::cout << "freq=" << (n * hpf / passed) << " hz\n";
        s0 = s1;
        n = 0;
    }
}
That yields a loop rate of something less than 2000 Hz, and the result varies from one output line to the next. That led me to the Windows thread-switching scheduler, which is totally unsuited for real-time tasks, with its minimum interval of 0.5ms (plus overhead). By the way, does anyone know how to tune that value?
Next came the Linux question: what can it do? I built a tiny custom 4.14 kernel with Buildroot and ran the benchmark code there. I replaced qpc() with a function returning clock_gettime() data using the CLOCK_MONOTONIC clock, qpf() just returns the number of nanoseconds in a second, and sleep_precise() just calls clock_nanosleep(). I failed to find out what the difference is between CLOCK_MONOTONIC and CLOCK_REALTIME.
I was quite surprised to get a whopping 18.4kHz frequency just out of the box, and it was quite stable. Testing several intervals, I found I could set the loop to almost any frequency up to 18.4kHz, but also that the actual measured wait time differs from what I asked for by a factor of about 1.6. For example, if I ask to sleep 100us, it actually sleeps for ~160us, giving a ~6.25kHz frequency. Nothing else is running on the system, just the kernel, busybox, and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even higher?

Testing Erlang function performance with timer

I'm testing the performance of a function in a tight loop (say 5000 iterations) using timer:tc/3:
{Duration_us, _Result} = timer:tc(M, F, [A])
This returns both the duration (in microseconds) and the result of the function. For argument's sake the duration is N microseconds.
I then perform a simple average calculation on the results of the iterations.
If I place a timer:sleep(1) function call before the timer:tc/3 call, the average duration for all the iterations is always > the average without the sleep:
timer:sleep(1),
timer:tc(M, F, [A]).
This doesn't make much sense to me as the timer:tc/3 function should be atomic and not care about anything that happened before it.
Can anyone explain this strange functionality? Is it somehow related to scheduling and reductions?
Do you mean like this:
4> foo:foo(10000).
Where:
-module(foo).
-export([foo/1, baz/1]).

foo(N) -> TL = bar(N), {TL, sum(TL)/N}.

bar(0) -> [];
bar(N) ->
    timer:sleep(1),
    {D, _} = timer:tc(?MODULE, baz, [1000]),
    [D | bar(N-1)].

baz(0) -> ok;
baz(N) -> baz(N-1).

sum([]) -> 0;
sum([H|T]) -> H + sum(T).
I tried this, and it's interesting. With the sleep statement the mean time returned by timer:tc/3 is 19 to 22 microseconds, and with the sleep commented out, the average drops to 4 to 6 microseconds. Quite dramatic!
I notice there are artefacts in the timings, so events like this (these numbers being the individual microsecond timings returned by timer:tc/3) are not uncommon:
---- snip ----
5,5,5,6,5,5,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,5,4,5,5,5,5,6,5,5,
5,6,5,5,5,5,5,6,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,5,5,5,4,5,
5,5,5,6,5,5,5,6,5,5,7,8,7,8,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,
14,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,
5,5,4,5,4,5,5,4,4,5,5,4,5,5,4,4,4,4,4,5,4,5,5,4,5,5,5,4,5,5,
4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,4,5,4,5,5,4,4,4,4,5,4,
5,5,54,22,26,21,22,22,24,24,32,31,36,31,33,27,25,21,22,21,
24,21,22,22,24,21,22,21,24,21,22,22,24,21,22,21,24,21,22,21,
23,27,22,21,24,21,22,21,24,22,22,21,23,22,22,21,24,22,22,21,
24,21,22,22,24,22,22,21,24,22,22,22,24,22,22,22,24,22,22,22,
24,22,22,22,24,22,22,21,24,22,22,21,24,21,22,22,24,22,22,21,
24,21,23,21,24,22,23,21,24,21,22,22,24,21,22,22,24,21,22,22,
24,22,23,21,24,21,23,21,23,21,21,21,23,21,25,22,24,21,22,21,
24,21,22,21,24,22,21,24,22,22,21,24,22,23,21,23,21,22,21,23,
21,22,21,23,21,23,21,24,22,22,22,24,22,22,41,36,30,33,30,35,
21,23,21,25,21,23,21,24,22,22,21,23,21,22,21,24,22,22,22,24,
22,22,21,24,22,22,22,24,22,22,21,24,22,22,21,24,22,22,21,24,
22,22,21,24,21,22,22,27,22,23,21,23,21,21,21,23,21,21,21,24,
21,22,21,24,21,22,22,24,22,22,22,24,21,22,22,24,21,22,21,24,
21,23,21,23,21,22,21,23,21,23,22,24,22,22,21,24,21,22,22,24,
21,23,21,24,21,22,22,24,21,22,22,24,21,22,21,24,21,22,22,24,
22,22,22,24,22,22,21,24,22,21,21,24,21,22,22,24,21,22,22,24,
24,23,21,24,21,22,24,21,22,21,23,21,22,21,24,21,22,21,32,31,
32,21,25,21,22,22,24,46,5,5,5,5,5,4,5,5,5,5,6,5,5,5,5,5,5,4,
6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,
5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,6,4,6,5,5,5,5,5,5,4,6,5,5,5,
5,4,5,5,5,5,5,5,6,5,5,5,5,4,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,
5,5,5,4,5,5,6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,
6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,4,5,4,5,5,5,5,5,6,5,5,
5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,
---- snip ----
I assume this is the effect you are referring to, though when you say always > N, is it always, or just mostly? It's not always for me, anyway.
The results extract above was without the sleep. Typically, without the sleep, timer:tc/3 returns low times like 4 or 5 most of the time, but sometimes big times like 22; with the sleep in place it's usually big times like 22, with occasional batches of low times.
It's certainly not obvious why this would happen, since sleep really just means yield. I wonder if all this is down to the CPU cache. After all, especially on a machine that's not busy, one might expect the case without the sleep to execute most of the code in one go without getting moved to another core, and without doing much else with the core, thus making the most of the caches... but when you sleep, and thus yield, and come back later, the chances of cache hits might be considerably lower.
Measuring performance is a complex task, especially on modern hardware and modern operating systems. There are many things which can fiddle with your results. First, you are not alone: when you measure on your desktop or notebook, other processes, including system ones, can interfere with your measurements. Second, there is the hardware itself. Modern CPUs have many features which control performance and power consumption. They can boost performance for a short time before overheating, or when there is no work on the other CPUs on the same chip or on the other hyper-thread of the same core. On the other hand, they can enter a power-saving mode when there is not enough work, and then fail to react quickly to a sudden change. It is hard to tell whether this is your case, but it is naive to think that previous work, or the lack of it, can't affect your measurement. You should always take care to measure in a steady state for long enough (seconds at least) and remove as many other influences as possible. (And do not forget GC in Erlang as well.)

Go-lang parallel segment runs slower than series segment

I have built an epidemic mathematics model in Go which is fairly computationally intense. I'm now trying to build a set of systems to test my model, where I change an input and expect a different output. I built a version in series that slowly increases HIV prevalence and checks the effect on HIV deaths. It takes ~200 milliseconds to run.
for q = 0.0; q < 1000; q++ {
    inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
    results := costAnalysisHandler(inputs)
    fmt.Println(results.HivDeaths[20])
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small changes are important as we will be running millions of runs with different inputs, so would like to make it as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
    go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
        inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
        results := costAnalysisHandler(inputs)
        fmt.Println(results.HivDeaths[20])
        ch <- ChData{int(q), results.HivDeaths[20]}
    }(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
    theResults := <-ch
    fmt.Println(theResults)
}
Any thoughts are very much appreciated.
There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs the cost of communication if the program was taking 200ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time, e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: and as @Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
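Combining those two suggestions, a rough sketch of chunked goroutines, each working on its own copy of the inputs, might look like this (copyInputs is a hypothetical deep-copy helper, and the chunk size of 10 is arbitrary):
const chunkSize = 10

ch := make(chan ChData)
for start := 0; start < 1000; start += chunkSize {
    go func(start int) {
        for q := start; q < start+chunkSize; q++ {
            in := copyInputs(inputs) // each goroutine works on private data, avoiding the race
            in.CountryProfile.HivPrevalenceAdultsByGroup[0] *= float32(math.Pow(1.00001, float64(q)))
            ch <- ChData{q, costAnalysisHandler(in).HivDeaths[20]}
        }
    }(start)
}
for i := 0; i < 1000; i++ {
    fmt.Println(<-ch)
}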
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if .2 seconds of waiting is all that performance work can save you here, it's not worth it.
Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPUs you have before running your code, or call runtime.GOMAXPROCS(runtime.NumCPU()) before you start the parallel computations.
I see two issues related to parallel performance:
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one cpu/core. Typically one would set it for the number of processors in the machine but the ideal setting can vary.
The second problem is a bit trickier: your code doesn't appear to parallelize very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably use some kind of worker pool, running a limited number of simultaneous computations (a good starting number would be the same as GOMAXPROCS) rather than trying to do 1000 at once.
See: http://golang.org/doc/faq#Why_no_multi_CPU
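For illustration, here is a minimal worker-pool sketch along those lines (costAnalysisHandler, ChData, and the Inputs type come from the question; copyInputs is the same hypothetical deep-copy helper as in the earlier sketch):
func runPool(workers int, base *costanalysis.Inputs) {
    jobs := make(chan float64)
    results := make(chan ChData)

    // A fixed number of workers pulls q values off a shared queue.
    for w := 0; w < workers; w++ {
        go func() {
            for q := range jobs {
                in := copyInputs(base) // fresh Inputs per analysis; avoids the data race
                in.CountryProfile.HivPrevalenceAdultsByGroup[0] *= float32(math.Pow(1.00001, q))
                results <- ChData{int(q), costAnalysisHandler(in).HivDeaths[20]}
            }
        }()
    }

    // Feed the queue, then close it so the workers exit.
    go func() {
        for q := 0.0; q < 1000; q++ {
            jobs <- q
        }
        close(jobs)
    }()

    for i := 0; i < 1000; i++ {
        fmt.Println(<-results)
    }
}
A sensible starting value for workers is runtime.GOMAXPROCS(0) or runtime.NumCPU(), per the answers above.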

Timing an operation without being affected by clock changes

I'm looking for:
1: startTime := time.Now()
2: // run something here that takes a while (measured in milliseconds)
3: duration := time.Since(startTime)
However, I need something that is immune to clock time changes. If the time is adjusted between lines 1 and 3, the duration will be inaccurate.
What are some common approaches for solving this problem, and what Go libraries could be relevant?
Thanks :)
The lack of a monotonic clock was detailed in golang/go issue 12914 (2015).
Since then, as of August 2017 and Go 1.9, you have transparent monotonic time support:
The time package now transparently tracks monotonic time in each Time value, making computing durations between two Time values a safe operation in the presence of wall clock adjustments.
See the package docs and design document for details.
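With Go 1.9+, the pattern from the question is therefore safe as written. A minimal self-contained sketch:
package main

import (
    "fmt"
    "time"
)

func main() {
    start := time.Now() // carries a monotonic clock reading
    time.Sleep(50 * time.Millisecond)
    // Since subtracts using the monotonic readings, so a wall-clock
    // adjustment between the two calls cannot skew the result.
    fmt.Println("elapsed:", time.Since(start))
}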
For Linux (AMD64), Go uses clock_gettime with CLOCK_REALTIME.
See the time·now implementation.
You would want a monotonic clock (CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW), which is a clock that does not go back in time. On Linux, the man page explicitly tells you that CLOCK_MONOTONIC is not guaranteed not to leap forward:
This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.
So, under Linux, the best choice is probably CLOCK_MONOTONIC_RAW. You may use the clock package mentioned by @MatrixFrog for that. Example:
import (
    "fmt"

    "github.com/davecheney/junk/clock"
)

func main() {
    start := clock.Monotonic.Now()
    // work
    end := clock.Monotonic.Now()
    duration := end.Sub(start)
    fmt.Println("Elapsed:", duration)
}
Further reading:
Discussion about monotonic clocks in go
Outlook to get a clock interface in the go stdlib
