Run a benchmark in parallel, i.e. simulate simultaneous requests - go

We're testing a database procedure invoked from an API. When it runs sequentially, it completes consistently within ~3s. However, we've noticed that when several requests come in at the same time, it can take much longer, causing timeouts. I am trying to reproduce the "several requests at one time" case as a Go test.
I tried the -parallel 10 go test flag, but the timings were the same at ~28s.
Is there something wrong with my benchmark function?
func Benchmark_RealCreate(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		name := randomdata.SillyName()
		r := gofight.New()
		u := []unit{{MefeUnitID: name, MefeCreatorUserID: "user", BzfeCreatorUserID: 55, ClassificationID: 2, UnitName: name, UnitDescriptionDetails: "Up on the hills and testing"}}
		uJSON, _ := json.Marshal(u)
		r.POST("/create").
			SetBody(string(uJSON)).
			Run(h.BasicEngine(), func(r gofight.HTTPResponse, rq gofight.HTTPRequest) {
				assert.Contains(b, r.Body.String(), name)
				assert.Equal(b, http.StatusOK, r.Code)
			})
	}
}
Otherwise, how can I achieve what I am after?

The -parallel flag is not for running the same test or benchmark in parallel, in multiple instances.
Quoting from Command go: Testing flags:
-parallel n
Allow parallel execution of test functions that call t.Parallel.
The value of this flag is the maximum number of tests to run
simultaneously; by default, it is set to the value of GOMAXPROCS.
Note that -parallel only applies within a single test binary.
The 'go test' command may run tests for different packages
in parallel as well, according to the setting of the -p flag
(see 'go help build').
So basically, if your tests allow it, you can use -parallel to run multiple distinct test or benchmark functions in parallel, but not the same one in multiple instances.
In general, running multiple benchmark functions in parallel defeats the purpose of benchmarking a function, because running it in parallel in multiple instances usually distorts the benchmarking.
However, in your case code efficiency is not what you want to measure; you want to measure an external service. So Go's built-in testing and benchmarking facilities are not really suitable.
Of course we could still use the convenience of having this "benchmark" run automatically along with our other tests and benchmarks, but you should not force this into the conventional benchmarking framework.
The first thing that comes to mind is to use a for loop to launch n goroutines which all attempt to call the testable service. One problem with this is that it only ensures n concurrent goroutines at the start, because as the calls start to complete, there will be less and less concurrency for the remaining ones.
To overcome this and truly test n concurrent calls, you should have a worker pool with n workers, and continuously feed jobs to this worker pool, making sure there will be n concurrent service calls at all times. For a worker pool implementation, see Is this an idiomatic worker thread pool in Go?
So all in all, fire up a worker pool with n workers, have a goroutine send jobs to it for an arbitrary duration (e.g. for 30 seconds or 1 minute), and measure (count) the completed jobs. The benchmark result will be a simple division.
Also note that for solely testing purposes, a worker pool might not even be needed. You can just use a loop to launch n goroutines, but make sure each started goroutine keeps calling the service and does not return after a single call.
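For illustration, here is a minimal sketch of that simpler variant, assuming a hypothetical callService() standing in for a single request to the service under test: launch n goroutines that keep calling the service for a fixed duration, then report completed calls per second:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// callService is a placeholder for a single request to the service under test.
func callService() {
	time.Sleep(100 * time.Millisecond) // simulate a ~100ms call
}

// measure keeps n concurrent callers busy for duration d and reports throughput.
func measure(n int, d time.Duration) {
	var completed int64
	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for {
				select {
				case <-done:
					return
				default:
					callService()
					atomic.AddInt64(&completed, 1)
				}
			}
		}()
	}
	time.Sleep(d)
	close(done)
	wg.Wait()
	fmt.Printf("%d callers: %d calls in %v (%.1f calls/sec)\n",
		n, completed, d, float64(completed)/d.Seconds())
}

func main() {
	measure(10, 30*time.Second)
}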

I'm new to Go, but why don't you try making a function and running it using the standard parallel benchmark helper?
func Benchmark_YourFunc(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			YourFunc(stuff...)
		}
	})
}
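One caveat: RunParallel uses GOMAXPROCS goroutines by default, which suits CPU-bound benchmarks. Since the requests here are I/O-bound, you would likely want more concurrent callers; b.SetParallelism (which must be called before RunParallel) multiplies that count:

b.SetParallelism(10) // RunParallel will use 10*GOMAXPROCS goroutines
b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		YourFunc(stuff...)
	}
})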

Your example code mixes several things. Why are you using assert there? This is not a test, it is a benchmark. If the assert methods are slow, your benchmark will be too.
You also moved the parallel execution out of your code and into the test command. You should try to make parallel requests by using concurrency. Here is just one possibility for how to start:
func executeRoutines(routines int) {
	wg := &sync.WaitGroup{}
	wg.Add(routines)
	starter := make(chan struct{})
	for i := 0; i < routines; i++ {
		go func() {
			<-starter
			// your request here
			wg.Done()
		}()
	}
	close(starter)
	wg.Wait()
}
https://play.golang.org/p/ZFjUodniDHr
We start some goroutines here, which all wait until starter is closed. So you can put your request directly after that line. To make the function wait until all the requests are done, we use a WaitGroup.
BUT IMPORTANT: Go only guarantees concurrency. So if your system does not have 10 cores, the 10 goroutines will not all run in parallel. Make sure you have enough cores available.
With this starting point you can play around a bit. You could call this function inside your benchmark, and you could also experiment with the number of goroutines.
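For example, a minimal sketch (with a hypothetical Benchmark_ConcurrentCreate name, and the request code filled in above) that times batches of 10 simultaneous requests per benchmark iteration:

func Benchmark_ConcurrentCreate(b *testing.B) {
	for n := 0; n < b.N; n++ {
		executeRoutines(10) // each iteration fires 10 requests at once
	}
}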

As the documentation indicates, the -parallel flag is there to allow multiple different tests to be run in parallel. You generally do not want to run benchmarks in parallel, because that would run different benchmarks at the same time, throwing off the results for all of them. If you want to benchmark parallel traffic, you need to write parallel traffic generation into your test. You also need to decide how this should work with b.N, which is your work factor; I would probably use it as the total request count, and write a benchmark or multiple benchmarks testing different concurrent load levels, e.g.:
func Benchmark_RealCreate(b *testing.B) {
	concurrencyLevels := []int{5, 10, 20, 50}
	for _, clients := range concurrencyLevels {
		b.Run(fmt.Sprintf("%d_clients", clients), func(b *testing.B) {
			sem := make(chan struct{}, clients)
			wg := sync.WaitGroup{}
			for n := 0; n < b.N; n++ {
				wg.Add(1)
				go func() {
					name := randomdata.SillyName()
					r := gofight.New()
					u := []unit{{MefeUnitID: name, MefeCreatorUserID: "user", BzfeCreatorUserID: 55, ClassificationID: 2, UnitName: name, UnitDescriptionDetails: "Up on the hills and testing"}}
					uJSON, _ := json.Marshal(u)
					sem <- struct{}{}
					r.POST("/create").
						SetBody(string(uJSON)).
						Run(h.BasicEngine(), func(r gofight.HTTPResponse, rq gofight.HTTPRequest) {})
					<-sem
					wg.Done()
				}()
			}
			wg.Wait()
		})
	}
}
Note here I removed the initial ResetTimer; the timer doesn't start until your benchmark function is called, so calling it as the first op in your function is pointless. It's intended for cases where you have time-consuming setup prior to the benchmark loop that you don't want included in the benchmark results. I've also removed the assertions, because this is a benchmark, not a test; assertions are for validity checking in tests and only serve to throw off timing results in benchmarks.
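You can then run a single load level by selecting the sub-benchmark by name, e.g.:

go test -bench 'Benchmark_RealCreate/10_clients'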

Benchmarking (measuring the time code takes to run) is one thing; load/stress testing is another.
The -parallel flag, as stated above, is there to allow a set of tests to execute in parallel, making the test set run faster, not to execute some test N times in parallel.
But it is simple to achieve what you want (executing the same test N times). Below is a very simple (really quick and dirty) example, just to clarify/demonstrate the important points, that gets this very specific situation done:
You define a test and mark it to be executed in parallel: TestAverage, with a call to t.Parallel.
You then define another test and use t.Run to execute the number of instances of the test (TestAverage) you want.
The code to test:
package math

import (
	"fmt"
	"time"
)

func Average(xs []float64) float64 {
	total := float64(0)
	for _, x := range xs {
		total += x
	}
	fmt.Printf("Current Unix Time: %v\n", time.Now().Unix())
	time.Sleep(10 * time.Second)
	fmt.Printf("Current Unix Time: %v\n", time.Now().Unix())
	return total / float64(len(xs))
}
The testing funcs:
package math

import "testing"

func TestAverage(t *testing.T) {
	t.Parallel()
	var v float64
	v = Average([]float64{1, 2})
	if v != 1.5 {
		t.Error("Expected 1.5, got ", v)
	}
}

func TestTeardownParallel(t *testing.T) {
	// This Run will not return until the parallel tests finish.
	t.Run("group", func(t *testing.T) {
		t.Run("Test1", TestAverage)
		t.Run("Test2", TestAverage)
		t.Run("Test3", TestAverage)
	})
	// <tear-down code>
}
Then just run go test and you should see:
X:\>go test
Current Unix Time: 1556717363
Current Unix Time: 1556717363
Current Unix Time: 1556717363
And 10 secs after that
...
Current Unix Time: 1556717373
Current Unix Time: 1556717373
Current Unix Time: 1556717373
Current Unix Time: 1556717373
Current Unix Time: 1556717383
PASS
ok _/X_/y 20.259s
The two extra lines at the end are because TestAverage itself is also executed.
The interesting point here: if you remove t.Parallel() from TestAverage, it will all be executed sequentially:
X:\>go test
Current Unix Time: 1556717564
Current Unix Time: 1556717574
Current Unix Time: 1556717574
Current Unix Time: 1556717584
Current Unix Time: 1556717584
Current Unix Time: 1556717594
Current Unix Time: 1556717594
Current Unix Time: 1556717604
PASS
ok _/X_/y 40.270s
This can of course be made more complex and extensible...
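For example, a small sketch of one such extension (my own, not part of the original answer; it needs fmt added to the test file's imports) that generates the parallel instances in a loop instead of listing them by hand:

func TestRunParallelN(t *testing.T) {
	const n = 10
	// This Run will not return until all n parallel instances finish.
	t.Run("group", func(t *testing.T) {
		for i := 0; i < n; i++ {
			t.Run(fmt.Sprintf("Test%d", i), TestAverage)
		}
	})
}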

Related

It's concurrent, but what makes it run in parallel?

So I'm trying to understand how parallel computing works while also learning Go. I understand the difference between concurrency and parallelism; however, what I'm a little stuck on is how Go (or the OS) determines that something should be executed in parallel...
Is there something I have to do when writing my code, or is it all handled by the schedulers?
In the example below, I have two functions that are run in separate goroutines using the go keyword. Because the default GOMAXPROCS is the number of processors available on your machine (and I'm also explicitly setting it), I would expect these two functions to run at the same time, and thus the output to be a mix of numbers in no particular order; furthermore, each time it is run the output would be different. However, this is not the case. Instead, they run one after the other, and to make matters more confusing, function two runs before function one.
Code:
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(6)
	var wg sync.WaitGroup
	wg.Add(2)
	fmt.Println("Starting")
	go func() {
		defer wg.Done()
		for smallNum := 0; smallNum < 20; smallNum++ {
			fmt.Printf("%v ", smallNum)
		}
	}()
	go func() {
		defer wg.Done()
		for bigNum := 100; bigNum > 80; bigNum-- {
			fmt.Printf("%v ", bigNum)
		}
	}()
	fmt.Println("Waiting to finish")
	wg.Wait()
	fmt.Println("\nFinished, Now terminating")
}
Output:
go run main.go
Starting
Waiting to finish
100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Finished, Now terminating
I am following this article, although just about every example I've come across does something similar.
Concurrency, Goroutines and GOMAXPROCS
Is this working the way it should and I'm not understanding something correctly, or is my code not right?
Is there something I have to do when writing my code,
No.
or is it all handled by the schedulers?
Yes.
In the example below, I have two functions that are run in separate goroutines using the go keyword. Because the default GOMAXPROCS is the number of processors available on your machine (and I'm also explicitly setting it), I would expect these two functions to run at the same time
They might or might not, you have no control here.
and thus the output to be a mix of numbers in no particular order; furthermore, each time it is run the output would be different. However, this is not the case. Instead, they run one after the other, and to make matters more confusing, function two runs before function one.
Yes. Again you cannot force parallel computation.
Your test is flawed: you just don't do much in each goroutine. In your example, goroutine 2 might be scheduled to run, start running, and complete before goroutine 1 even starts running. "Starting" a goroutine with go doesn't force it to start executing right away; all that is done is creating a new goroutine which can run. From all goroutines which can run, some are scheduled onto your processors. All this scheduling cannot be controlled; it is fully automatic. As you seem to know, this is the difference between concurrent and parallel. You have control over concurrency in Go, but not (much) over what is actually done in parallel on two or more cores.
More realistic examples with actual, long-running goroutines which do actual work will show interleaved output.
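For illustration, a toy sketch (my own, not part of the original answer) where each goroutine does real computation between prints, so their outputs interleave:

package main

import (
	"fmt"
	"sync"
)

// busyWork burns CPU time so each goroutine stays runnable for a while.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 10; i++ {
			busyWork(10_000_000)
			fmt.Printf("A%d ", i)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 10; i++ {
			busyWork(10_000_000)
			fmt.Printf("B%d ", i)
		}
	}()
	wg.Wait()
	fmt.Println()
}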
It's all handled by the scheduler.
With only two loops of 20 short instructions, you will be hard pressed to see the effects of concurrency or parallelism.
Here is another toy example : https://play.golang.org/p/xPKITzKACZp
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

const (
	ConstMaxProcs  = 2
	ConstRunners   = 4
	ConstLoopcount = 1_000_000
)

func runner(id int, wg *sync.WaitGroup, cptr *int64) {
	var times int
	for i := 0; i < ConstLoopcount; i++ {
		val := atomic.AddInt64(cptr, 1)
		if val > 1 {
			times++
		}
		atomic.AddInt64(cptr, -1)
	}
	fmt.Printf("[runner %d] cptr was > 1 on %d occasions\n", id, times)
	wg.Done()
}

func main() {
	runtime.GOMAXPROCS(ConstMaxProcs)
	var cptr int64
	wg := &sync.WaitGroup{}
	wg.Add(ConstRunners)
	start := time.Now()
	for id := 1; id <= ConstRunners; id++ {
		go runner(id, wg, &cptr)
	}
	wg.Wait()
	fmt.Printf("completed in %s\n", time.Now().Sub(start))
}
As with your example: you don't have control over the scheduler; this example just has more "surface" on which to witness some effects of concurrency.
It's hard to witness the actual difference between concurrency and parallelism from within the program; you can view your processor's activity while it runs, or check the global execution time.
The playground does not give sub-second precision on its clock; if you want to see the actual timing, copy/paste the code into a local file and tune the constants to see various effects.
Note that some other effects (probably branch prediction on the if val > 1 {...} check and/or memory invalidation around the shared cptr variable) make the execution time very volatile on my machine, so don't expect a straight "running with ConstMaxProcs = 4 is 4 times quicker than with ConstMaxProcs = 1".

Multiple goroutines reading pcap files cannot improve performance?

I need to read about 600 pcap files, each file is about 100MB.
I use gopacket to load pcap file, and check it.
Case1: uses 1 routine to check.
Case2: uses 40 routines to check.
And I found that the time consumed by case1 and case2 is similar.
The difference is CPU usage: case1 only reaches 200%, while case2 can reach 3000%.
My question is: why can't multiple goroutines improve performance?
There are some comments in the code; I hope they will help.
package main

import (
	"flag"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"strings"
	"sync"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	var wg sync.WaitGroup
	var dir = flag.String("dir", "../pcap", "input dir")
	var threadNum = flag.Int("threads", 40, "input thread number")
	flag.Parse()
	fmt.Printf("dir=%s, threadNum=%d\n", *dir, *threadNum)
	pcapFileList, err := ioutil.ReadDir(*dir)
	if err != nil {
		panic(err)
	}
	log.Printf("start. file number=%d.", len(pcapFileList))
	fileNumPerRoutine := len(pcapFileList) / *threadNum
	lastFileNum := len(pcapFileList) % *threadNum
	// Split the files across the routines;
	// each routine only processes the files which belong to it.
	if fileNumPerRoutine > 0 {
		for i := 0; i < *threadNum; i++ {
			start := fileNumPerRoutine * i
			end := fileNumPerRoutine * (i + 1)
			if lastFileNum > 0 && i == (*threadNum-1) {
				end = len(pcapFileList)
			}
			// fmt.Printf("start=%d, end=%d\n", start, end)
			wg.Add(1)
			go checkPcapRoutine(i, &wg, dir, pcapFileList[start:end])
		}
	}
	wg.Wait()
	log.Printf("end.")
}

func checkPcapRoutine(id int, wg *sync.WaitGroup, dir *string, pcapFileList []os.FileInfo) {
	defer wg.Done()
	for _, p := range pcapFileList {
		if !strings.HasSuffix(p.Name(), "pcap") {
			continue
		}
		pcapFile := *dir + "/" + p.Name()
		log.Printf("checkPcapRoutine(%d): process %s.", id, pcapFile)
		handle, err := pcap.OpenOffline(pcapFile)
		if err != nil {
			log.Printf("error=%s.", err)
			return
		}
		packetSource := gopacket.NewPacketSource(handle, handle.LinkType())
		// Per my test, if I don't parse packets it is very fast, even with only 1 routine, so IO should not be the bottleneck.
		// What puzzles me is that every routine has its own packets and each routine is independent, but it still seems to be processed serially.
		// This is the first time I use gopacket; maybe I used a wrong parameter?
		idx := 0
		for packet := range packetSource.Packets() {
			idx++
			gtpLayer := packet.Layer(layers.LayerTypeGTPv1U)
			lays := packet.Layers()
			outerIPLayer := lays[1]
			outerIP := outerIPLayer.(*layers.IPv4)
			if gtpLayer == nil && (outerIP.Flags&layers.IPv4MoreFragments != 0) && outerIP.Length < 56 {
				log.Panicf("file:%s, idx=%d may leak.", pcapFile, idx)
			}
		}
		// Close each handle as we finish with it, instead of deferring inside the loop.
		handle.Close()
	}
}
To run two or more tasks in parallel, the operations required to carry out those tasks must not depend on each other or on external resources shared by those tasks.
In the real world, tasks which are truly and completely independent are rare (so rare that there is a dedicated name for the class of such tasks: they are said to be embarrassingly parallel), but when the amount of dependency between tasks and contention over shared resources is below some threshold, adding more "workers" (goroutines) may improve the total time it takes to complete a set of tasks.
Notice the "may" here: for instance, your storage device, the file system on it, and the kernel data structures and code which work with both together form a shared medium all your goroutines have to access. This medium has a certain limit on both its throughput and latency; basically, you can only read, say, M bytes per second from that medium, and whether you have a single reader fully utilizing this bandwidth, or N readers each utilizing some amount around M/N of it, does not matter: you physically cannot read faster than that limit of M BPS.
Moreover, resources in the real world tend to degrade in performance when contended for: say, if a resource has to be locked to be accessed, then the more accessors actively wanting to take the lock, the more CPU time is spent in the lock management code (and when the resource is more complicated, such as the conglomerate of intricate stuff that "an FS on a storage device, all managed by the kernel" is, the analysis of how it degrades under concurrent access becomes far more complicated).
TL;DR
I can make an educated guess that your task is simply I/O-bound as the goroutines have to read the files.
You can verify that by modifying the code to first fetch all the files into memory and then handing the buffers to the parsing goroutines.
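A rough sketch of that verification (my own adaptation, not part of the original answer; it assumes the pcapgo package, whose pcapgo.NewReader parses pcap data from any io.Reader, and needs bytes and github.com/google/gopacket/pcapgo added to the imports). The point is only to separate disk I/O from parsing:

// Read every file into memory first (sequential, I/O-bound)...
buffers := make(map[string][]byte)
for _, p := range pcapFileList {
	data, err := ioutil.ReadFile(*dir + "/" + p.Name())
	if err != nil {
		log.Fatal(err)
	}
	buffers[p.Name()] = data
}
// ...then hand the buffers to parsing goroutines (CPU-bound).
var wg sync.WaitGroup
for name, data := range buffers {
	wg.Add(1)
	go func(name string, data []byte) {
		defer wg.Done()
		r, err := pcapgo.NewReader(bytes.NewReader(data))
		if err != nil {
			log.Printf("%s: %v", name, err)
			return
		}
		source := gopacket.NewPacketSource(r, r.LinkType())
		for packet := range source.Packets() {
			_ = packet // run the same per-packet checks as before here
		}
	}(name, data)
}
wg.Wait()

If the parsing phase alone now scales with the number of goroutines, the original bottleneck was the reading, not the parsing.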
The drastic CPU usage you're observing in your case is a red herring: contemporary systems take 100% CPU utilization to mean "full utilization of a single hardware processing thread", so if you have, say, 4 CPU cores with HyperThreading™ (or whatever AMD calls its equivalent) enabled, the full capacity of your system is 4×2=8, or 800%.
The fact that you may be seeing more than the theoretical capacity (which we do not know) may be explained by your system exhibiting so-called "starvation": you have many software threads wanting to be executed but waiting for their CPU time, and the system showing that as insane CPU utilization.

goroutines affect each other when printing timecost

I'm new to Go. I understand a goroutine as an abstraction over CPU and memory for running a piece of code.
So when I run some computing funcs (like sorts) in goroutines, I'm hoping they run in parallel. But the printed result seems weird: the "parallel" code prints nearly the same time cost.
Why? Is there something I missed about goroutines, or is it because of the func printTime()?
code: https://play.golang.org/p/n9DLn57ftM
P.S. The code should be copied to a local go file and run; code run in play.golang has some limitations.
the result is:
MaxProcs: 8
Source : 2.0001ms
Quick sort : 3.0002ms
Merge sort : 8.0004ms
Insertion sort : 929.0532ms
Goroutine num: 1
Source : 2.0001ms
Goroutine num: 4
Insertion sort : 927.0531ms
Quick sort : 930.0532ms
Merge sort : 934.0535ms
You should measure the total time cost instead of the individual time cost required by each sorting algorithm. The individual time cost might be longer when the task is distributed across several goroutines, since additional time is needed to set up the goroutines. Depending on the nature of your program, additional time might also be needed for communication between goroutines and/or with the main process. There are some resources related to goroutines, e.g.:
Is a Go goroutine a coroutine?
http://divan.github.io/posts/go_concurrency_visualize/
If you change main function to:
func main() {
	fmt.Println("MaxProcs:", runtime.GOMAXPROCS(0)) // 8
	start := time.Now()
	sequentialTest()
	seq := time.Now()
	concurrentTest()
	con := time.Now()
	fmt.Printf("\n\nCalculation time, sequential: %v, concurrent: %v\n",
		seq.Sub(start), con.Sub(seq))
}
the output will look like:
MaxProcs: 4
Source : 3.0037ms
Quick sort : 5.0034ms
Merge sort : 13.0069ms
Insertion sort : 1.2590941s
Goroutine num: 1
Source : 3.0015ms
Goroutine num: 4
Insertion sort : 1.2399076s
Quick sort : 1.2459121s
Merge sort : 1.2519156s
Calculation time, sequential: 1.2831112s, concurrent: 1.2559194s
After removing printTime, it looks like:
MaxProcs: 4
Goroutine num: 1
Goroutine num: 4
Calculation time, sequential: 1.3154314s, concurrent: 1.244112s
The time cost values might change slightly, but most of the time the result will be sequential > concurrent. In summary, distributing the task across several goroutines may improve the overall time cost, but not that of each individual task.
Sorry, I don't understand what you want to test. At first, your code doesn't work for quickSort, because you run quickSort with go quickSort(...).
wg.Add(1)
go func() {
	go quickSort(s1, nil)
	wg.Done()
}()
This goroutine will quit immediately.
Now, I tested your code. And I noticed that you must measure the total time between the start and when all goroutines finish.
https://play.golang.org/p/O8Gj-OYIdR
wg.Add(1)
go func() {
	go quickSort(s1, nil)
	wg.Done()
}()
You don't need that second go
It should be like
wg.Add(1)
go func() {
	quickSort(s1, nil)
	wg.Done()
}()
What happened is that go func() started one goroutine, and go quickSort(s1, nil) started another. As a result, the anonymous goroutine (and, with it, wg.Done()) finished practically right away, without waiting for quickSort.

Distribute the same keyword to multiple goroutines

I have something like this mock (code below) which distributes the same keyword out to multiple goroutines. The goroutines all take different amounts of time doing things with the keyword, but can operate independently of each other, so they don't need any synchronization. The solution given below clearly synchronizes the goroutines during distribution.
I just want to toss this idea out there to see how other people would deal with this type of distribution, as I assume it is fairly common and someone else has thought about it before.
Here are some other solutions I have thought up, and why they seem kinda meh to me:
One goroutine for each keyword: each time a new keyword comes in, spawn a goroutine to handle the distribution.
Give the keyword a bitmask or something for each goroutine to update: this way, once all of the workers have touched the keyword, it can be deleted and we can move on.
Give each worker its own stack to work off of: this seems kinda appealing; just give each worker a stack to add each keyword to. But we would eventually run into a problem of a ton of memory being taken up, since it is planned to run for so long.
The problem with all of these is that my code is supposed to run for a long time, unwatched, and that would lead to either a huge build-up of keywords or goroutines due to the lazy worker taking longer than the others. It almost seems like it'd be nice to give each worker its own Amazon SQS queue, or implement something similar to that myself.
EDIT:
Store the keyword outside the program: I just thought of doing it this way instead; I could perhaps store the keyword outside the program until all workers grab it, and then delete it once it has been used up. This actually sits OK with me; I don't have a problem with using up disk space.
Anyway here is an example of the approach that waits for all to finish:
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"os"
	"os/signal"
	"strconv"
	"time"
)

var (
	shutdown chan struct{}
	count    = flag.Int("count", 5, "number to run")
)

type sleepingWorker struct {
	name  string
	sleep time.Duration
	ch    chan int
}

func NewQuicky(n string) sleepingWorker {
	var rq sleepingWorker
	rq.name = n
	rq.ch = make(chan int)
	rq.sleep = time.Duration(rand.Intn(5)) * time.Second
	return rq
}

func (r sleepingWorker) Work() {
	for {
		fmt.Println(r.name, "is about to sleep, number:", <-r.ch)
		time.Sleep(r.sleep)
	}
}

func NewLazy() sleepingWorker {
	var rq sleepingWorker
	rq.name = "Lazy slow worker"
	rq.ch = make(chan int)
	rq.sleep = 20 * time.Second
	return rq
}

func distribute(gen chan int, workers ...sleepingWorker) {
	for kw := range gen {
		for _, w := range workers {
			fmt.Println("sending keyword to:", w.name)
			select {
			case <-shutdown:
				return
			case w.ch <- kw:
				fmt.Println("keyword sent to:", w.name)
			}
		}
	}
}

func main() {
	flag.Parse()
	shutdown = make(chan struct{})
	go func() {
		c := make(chan os.Signal, 1)
		signal.Notify(c, os.Interrupt)
		<-c
		close(shutdown)
	}()
	x := make([]sleepingWorker, *count)
	for i := 0; i < (*count)-1; i++ {
		x[i] = NewQuicky(strconv.Itoa(i))
		go x[i].Work()
	}
	x[(*count)-1] = NewLazy()
	go x[(*count)-1].Work()
	gen := make(chan int)
	go distribute(gen, x...)
	go func() {
		i := 0
		for {
			i++
			select {
			case <-shutdown:
				return
			case gen <- i:
			}
		}
	}()
	<-shutdown
	os.Exit(0)
}
Let's assume I understand the problem correctly:
There's not much you can do about it, I'm afraid. You have limited resources (assuming all resources are limited), so if data arrives at your input faster than you process it, some synchronisation will be needed. In the end, the whole process will run only as quickly as the slowest worker anyway.
If you really need data from the workers available as soon as possible, the best you can do is add some kind of buffering. But the buffer must be limited in size (even if you run in the cloud, it would be limited by your wallet), so assuming a never-ending torrent of input, it will only postpone the choke until some point in the future where you will start seeing "synchronisation" again.
All the ideas you presented in your question are based on buffering the data. Even if you run a routine for every keyword-worker pair, this will buffer one element in every routine and, unless you implement a limit on the total number of routines, you'll run out of memory. And even if you always leave some room for the quickest worker to spawn a new routine, the input queue won't be able to deliver new items, as it would be choked on the slowest worker.
Buffering would solve your problem if, on average, your input is slower than the processing time but you have occasional spikes. If your buffer is big enough, you can then accommodate the increase in throughput and maybe your quickest worker won't notice a thing.
Solution?
As Go comes with buffered channels, this is the easiest to implement (also suggested by icza in the comments). Just give each worker a buffer. If you know which worker is the slowest, you can give it a bigger buffer. In this scenario you're limited by the memory of your machine.
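A minimal sketch of that idea applied to the mock above (my adaptation, not part of the original answer): make each worker's channel buffered, so the quick workers can keep receiving keywords while the lazy one falls behind, until its buffer fills up:

// In NewQuicky and NewLazy, create a buffered channel instead:
rq.ch = make(chan int, 100) // this worker can now fall up to 100 keywords behind

// distribute stays unchanged: w.ch <- kw now succeeds immediately
// while there is buffer space, and only blocks once a worker's
// buffer is full.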
If you're not happy with the single-machine memory limit then yes, per one of your ideas, you can "simply" store the buffer (queue) for each worker on the hard drive. But this is also limited, and just postpones the blocking scenario until later. This is essentially the same as your Amazon SQS proposal (you could keep the buffer in the cloud, but you need to either limit it reasonably or prepare for the bill).
The final note: depending on the system you're building, it might not be a good idea to buffer items on such a massive scale, allowing the backlog for the slower workers to build up; it's often not desirable to have a worker hours, days, or weeks behind the input flow, and this is what would happen with an infinite buffer. The real answer then would be: improve your slowest worker to process things faster. (And add some buffering to improve the experience.)

How do I find out which of 2 methods is faster?

I have 2 methods to trim the domain suffix from a subdomain and I'd like to find out which one is faster. How do I do that?
2 string trimming methods
You can use the built-in benchmark capabilities of go test.
For example (on play):
package main

import (
	"strings"
	"testing"
)

func BenchmarkStrip1(b *testing.B) {
	for br := 0; br < b.N; br++ {
		host := "subdomain.domain.tld"
		s := strings.Index(host, ".")
		_ = host[:s]
	}
}

func BenchmarkStrip2(b *testing.B) {
	for br := 0; br < b.N; br++ {
		host := "subdomain.domain.tld"
		strings.TrimSuffix(host, ".domain.tld")
	}
}
Store this code in somename_test.go and run go test -test.bench='.*'. For me this gives
the following output:
% go test -test.bench='.*'
testing: warning: no tests to run
PASS
BenchmarkStrip1 100000000 12.9 ns/op
BenchmarkStrip2 100000000 16.1 ns/op
ok 21614966 2.935s
The benchmark utility will attempt to do a certain number of runs until a meaningful time is measured, which is reflected in the output by the number 100000000. The code was run 100000000 times, and each operation in the loop took 12.9 ns and 16.1 ns respectively.
So you can conclude that the code in BenchmarkStrip1 performed better.
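If allocations matter for your comparison, you can additionally pass the benchmem flag to report allocations per operation alongside the timings:

% go test -test.bench='.*' -test.benchmem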
Regardless of the outcome, it is often better to profile your program to see where the real bottleneck is, instead of wasting your time with micro benchmarks like these.
I would also not recommend writing your own benchmarking code, as there are some factors you might not consider, such as the garbage collector and running your samples long enough.
