Running thousands of goroutines concurrently - go

I'm developing a simulator model for historical trading data. I have the below pseudo code where I'm using many goroutines.
start_day := "2015-01-01 09:15:00 "
end_day := "2015-02-01 09:15:00"
ch = make(chan float64)
var value bool
// pair_map have some int pairs like {1,2},{10,11} like approx 20k pairs which are taken
// dynamically from a range of input
for start_day.Before(end_day) {
// a true value in ch is set at end of each func call
go get_output(ch, start_day, pair_map)
value <- ch
start_day = start_date.Add(time.Hour * 24)
}
Now the problem here is that each get_output() call runs over approx. 20k combinations for a single day (which takes approx. 1 minute). Here I'm trying to cover a one-month span using goroutines (on a single-core machine), but it takes almost the same amount of time as running sequentially (approx. 30 min).
Is there anything wrong with my approach, or is it just because I'm using a single core?

I'm trying to do it for a month time span using go routines (on single core machine), but it is taking almost same amount of time as running sequentially
What makes you think it should perform any better? Running multiple goroutines has its own overhead: goroutines have to be managed and scheduled, and results must be communicated and gathered. If you have a single core, using multiple goroutines cannot yield a performance improvement, only a detriment.
Using goroutines may give a performance boost if you have multiple cores and the tasks you distribute are "large" enough that the goroutine-management overhead is smaller than the gain from parallel execution.
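For completeness, here is how the loop could be restructured on a multi-core machine so that all per-day goroutines are launched before any result is received (receiving from ch inside the launch loop waits for each day to finish before starting the next one). On a single core this still won't beat the sequential version, for the reasons above. This is only a minimal sketch: get_output and pair_map stand in for the question's code, and the map literal is just a placeholder for the real ~20k pairs.

package main

import "time"

// get_output stands in for the question's function; it sends true on ch
// when it has finished one day's work.
func get_output(ch chan<- bool, day time.Time, pairMap map[int]int) {
    // ... heavy per-day computation over ~20k pairs ...
    ch <- true
}

func main() {
    layout := "2006-01-02 15:04:05"
    start, _ := time.Parse(layout, "2015-01-01 09:15:00")
    end, _ := time.Parse(layout, "2015-02-01 09:15:00")

    pairMap := map[int]int{1: 2, 10: 11} // stand-in for the real pairs
    ch := make(chan bool)

    // Launch one goroutine per day first...
    days := 0
    for day := start; day.Before(end); day = day.Add(24 * time.Hour) {
        go get_output(ch, day, pairMap)
        days++
    }

    // ...then wait for all of them to signal completion.
    for i := 0; i < days; i++ {
        <-ch
    }
}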
See related: Is golang good to use in multithreaded application?

Related

Golang: How to tell whether producer or consumer is slower when communicating via buffered channels?

I have an app in Golang with a pipeline setup: each component performs some work, then passes its results to the next component via a buffered channel; that component performs some work on its input and passes its results to yet another component via another buffered channel, and so on. For example:
C1 -> C2 -> C3 -> ...
where C1, C2, C3 are components in the pipeline and each "->" is a buffered channel.
In Golang buffered channels are great because they force a fast producer to slow down to match its downstream consumer (or a fast consumer to slow down to match its upstream producer). So, like an assembly line, my pipeline moves along only as fast as the slowest component in that pipeline.
The problem is I want to figure out which component in my pipeline is the slowest one so I can focus on improving that component in order to make the whole pipeline faster.
The way that Golang forces a fast producer or a fast consumer to slow down is by blocking the producer when it tries to send to a buffered channel that is full, or by blocking the consumer when it tries to consume from a channel that is empty. Like this:
outputChan <- result // producer would block here when sending to full channel
input := <- inputChan // consumer would block here when consuming from empty channel
This makes it hard to tell which one, the producer or the consumer, is blocking the most, and thus which is the slowest component in the pipeline, because I cannot tell how long each one is blocked for. The one that is blocked for the most time is the fastest component, and the one that is blocked the least (or not blocked at all) is the slowest component.
I can add code like this just before the read or write to channel to tell whether it would block:
// for producer
if len(outputChan) == cap(outputChan) {
    producerBlockingCount++
}
outputChan <- result

// for consumer
if len(inputChan) == 0 {
    consumerBlockingCount++
}
input := <-inputChan
However, that would only tell me the number of times it would block, not the total amount of time it is blocked. Not to mention the TOCTOU issue: the check reflects a single point in time, and the state could change immediately after the check, rendering it incorrect/misleading.
Anybody who has ever been to a casino knows that it's not the number of times you win or lose that matters, it's the total amount of money you win or lose. I can lose 10 hands at $10 each (for a total loss of $100), then win a single hand of $150, and I still come out ahead.
Likewise, it's not the number of times a producer or consumer is blocked that's meaningful. It's the total amount of time a producer or consumer is blocked that determines whether it's the slowest component or not.
But I cannot think of any way to determine the total amount of time something is blocked when writing to / reading from a buffered channel. Or my google-fu isn't good enough. Anyone have any bright ideas?
There are several solutions that spring to mind.
1. stopwatch
The least invasive and most obvious option is to just note the time before and after each read or write. Log it, sum it, and report the total I/O delay; similarly, report the elapsed processing time. (A sketch of this approach follows option 3 below.)
2. benchmark
Do a synthetic benchmark, where you have each stage operate on a million identical inputs, producing a million identical outputs. Or do a "system test" where you wiretap the messages that flowed through production, write them to log files, and replay the relevant log messages to each of your various pipeline stages, measuring elapsed times. Because of the replay, there will be no I/O throttling.
3. pub/sub
Re-architect to use a higher-overhead comms infrastructure, such as Kafka / 0mq / RabbitMQ. Change the number of nodes participating in stage-1 processing, stage-2, and so on. The idea is to overwhelm the stage currently under study -- no idle cycles -- so you can measure its transactions-per-second throughput when saturated. Alternatively, just distribute each stage to its own node and measure {user, sys, idle} times during normal system behavior.
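For the stopwatch option, a minimal sketch of what the per-channel timing could look like for a stage with one input and one output channel. The stage, its "work", and the counter names are made up for illustration; the accumulated durations tell you whether a stage spends its time waiting on its upstream, waiting on its downstream, or actually working.

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// Cumulative nanoseconds, updated by the stage goroutine and read in main.
var recvBlockedNs, sendBlockedNs, workNs int64

// stage reads from in, does some work, and writes to out, recording how
// long each receive, each send, and the work itself take.
func stage(in <-chan int, out chan<- int) {
    for {
        t0 := time.Now()
        v, ok := <-in // blocks while the upstream producer is slower
        atomic.AddInt64(&recvBlockedNs, int64(time.Since(t0)))
        if !ok {
            close(out)
            return
        }

        t1 := time.Now()
        result := v * 2 // placeholder for the stage's real work
        atomic.AddInt64(&workNs, int64(time.Since(t1)))

        t2 := time.Now()
        out <- result // blocks while the downstream consumer is slower
        atomic.AddInt64(&sendBlockedNs, int64(time.Since(t2)))
    }
}

func main() {
    in := make(chan int, 8)
    out := make(chan int, 8)

    go stage(in, out)
    go func() {
        for i := 0; i < 100; i++ {
            in <- i
        }
        close(in)
    }()
    for range out {
        // a downstream consumer would do its work here
    }

    fmt.Println("recv blocked:", time.Duration(atomic.LoadInt64(&recvBlockedNs)))
    fmt.Println("send blocked:", time.Duration(atomic.LoadInt64(&sendBlockedNs)))
    fmt.Println("work:        ", time.Duration(atomic.LoadInt64(&workNs)))
}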

Schedule sending messages to consumers at different rate

I'm looking for the best algorithm for message scheduling. What I mean by message scheduling is a way to send messages on the bus when we have many consumers at different rates.
Example:
Suppose that we have data D1 to Dn:
. D1 is sent to many consumers: C1 every 5 ms, C2 every 19 ms, C3 every 30 ms, Cn every Rn ms
. Dn is sent to C1 every 10 ms, C2 every 31 ms, Cn every 50 ms
What is the best algorithm to schedule these actions with the best performance (CPU, memory, IO)?
Regards
I can think of quite a few options, each with their own costs and benefits. It really comes down to exactly what your needs are -- what really defines "best" for you. I've pseudocoded a couple possibilities below to hopefully help you get started.
Option 1: Execute the following every time unit (in your example, millisecond)
func callEachMs
    time = getCurrentTime()
    for each datum
        for each customer
            if time % datum.customer.rate == 0
                sendMsg()
This has the advantage of requiring no persistent stored state -- you just check at each time unit whether you should be sending a message. It can also deal with messages that weren't first sent at time == 0 -- just store the time the message was initially sent modulo the rate, and replace the conditional with if time % datum.customer.rate == datum.customer.firstMsgTimeMod.
A downside to this method is it is completely reliant on always being called at a rate of 1 ms. If there's lag caused by another process on a CPU and it misses a cycle, you may miss sending a message altogether (as opposed to sending it a little late).
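A minimal Go sketch of option 1, using a 1 ms ticker and checking every (datum, consumer) rate on each tick. The Datum/Consumer types and sendMsg are made up for illustration.

package main

import (
    "fmt"
    "time"
)

// Consumer holds how often (in ticks) a given datum should be sent to it.
type Consumer struct {
    Name     string
    RateTick int // e.g. 5 means "every 5 ms" with a 1 ms tick
}

// Datum is one piece of data with its per-consumer rates.
type Datum struct {
    Name      string
    Consumers []Consumer
}

func sendMsg(datum, consumer string) {
    fmt.Printf("send %s -> %s\n", datum, consumer)
}

func main() {
    data := []Datum{
        {Name: "D1", Consumers: []Consumer{{"C1", 5}, {"C2", 19}, {"C3", 30}}},
        {Name: "Dn", Consumers: []Consumer{{"C1", 10}, {"C2", 31}, {"Cn", 50}}},
    }

    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()

    for tick := 0; ; tick++ {
        <-ticker.C
        for _, d := range data {
            for _, c := range d.Consumers {
                if tick%c.RateTick == 0 {
                    sendMsg(d.Name, c.Name)
                }
            }
        }
    }
}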
Option 2: Maintain a list of lists of tuples, where each entry represents the tasks that need to be done at that millisecond. Make your list at least as long as the longest rate divided by the time unit (if your longest rate is 50 ms and you're going by ms, your list must be at least 50 entries long). When you start your program, place the first time each message will be sent into the list. Then, each time you send a message, update the next time you'll send it in that list.
func buildList(&list)
    for each datum
        for each customer
            if list.size < datum.customer.rate
                list.resize(datum.customer.rate+1)
            list[customer.rate].push_back(tuple(datum.name, customer.name))

func callEachMs(&list)
    for each (datum.name, customer.name) in list[0]
        sendMsg()
        list[customer.rate].push_back((datum.name, customer.name))
    list.pop_front()
    list.push_back(empty list)
This has the advantage of avoiding the many unnecessary modulus calculations option 1 required. However, that comes with the cost of increased memory usage. This implementation would also not be efficient if there's a large disparity in the rates of your various messages (although you could modify it to handle longer rates more efficiently). And it still has to be called every millisecond.
Finally, you'll have to think very carefully about what data structure you use, as this will make a huge difference in its efficiency. Because you pop from the front and push to the back at every iteration, and the list is a fixed size, you may want to implement a circular buffer to avoid unneeded moving of values. For the lists of tuples, since they're only ever iterated over (random access isn't needed) and there are frequent additions, a singly-linked list may be your best solution.
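A rough Go sketch of option 2's circular-buffer idea, using a wheel of per-millisecond task lists. The task type, the wheel sizing, and the example rates are assumptions for illustration.

package main

import (
    "fmt"
    "time"
)

// task says which datum goes to which consumer, and at what rate (in ticks).
type task struct {
    datum    string
    consumer string
    rate     int
}

func main() {
    tasks := []task{
        {"D1", "C1", 5}, {"D1", "C2", 19}, {"D1", "C3", 30},
        {"Dn", "C1", 10}, {"Dn", "C2", 31}, {"Dn", "Cn", 50},
    }

    // Size the wheel to the longest rate so every task fits.
    size := 0
    for _, t := range tasks {
        if t.rate > size {
            size = t.rate
        }
    }
    wheel := make([][]task, size)

    // Schedule every task for its first send at tick 0.
    wheel[0] = append(wheel[0], tasks...)

    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()

    for tick := 0; ; tick++ {
        <-ticker.C
        slot := tick % size
        due := wheel[slot]
        wheel[slot] = nil // clear the slot before re-scheduling into the wheel
        for _, t := range due {
            fmt.Printf("send %s -> %s\n", t.datum, t.consumer)
            // Re-schedule the task rate ticks from now.
            next := (slot + t.rate) % size
            wheel[next] = append(wheel[next], t)
        }
    }
}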
Obviously, there are many more ways you could do this, but hopefully these ideas get you started. Also keep in mind that the nature of the system you're running this on can have a strong effect on which method works better, or whether you want to do something else entirely. For example, both methods require that they can be reliably called at a certain rate. I also haven't described parallelized implementations, which may be the best option if your application supports them.
Like Helium_1s2 described, there is a second way, based on what I call a schedule table; this is what I use now, but this solution has its limits.
Suppose that we have one data to send and two consumer C1 and C2 :
As you can see, we must extract our schedule table and identify the repeating transmission cycle and the value of the IDLE MINIMUM PERIOD. One could loop on the smallest unit of time, e.g. 1 ms or 1 ns or 1 min or 1 h (depending on the case), but that is not always the best period, and we can optimize this loop as follows.
For example (C1 every 6 and C2 every 9), we notice that there is a cycle which repeats from 0 to 18, with a minimal difference between two consecutive send events equal to 3.
so :
HCF(6,9) = 3 = IDLE MINIMUM PERIOD
LCM(6,9) = 18 = transmission cycle length
LCM/HCF = 6 = size of our schedule table
And the schedule table for this example is:
slot 0 (t = 0)  : send to C1 and C2
slot 1 (t = 3)  : idle
slot 2 (t = 6)  : send to C1
slot 3 (t = 9)  : send to C2
slot 4 (t = 12) : send to C1
slot 5 (t = 15) : idle
and the sending loop looks like:
i = 0; // index into the schedule table
while (1) {
    sleep(IDLE_MINIMUM_PERIOD);        // free the CPU for the idle minimum period
    send(ScheduleTable[i]);            // send whatever this slot schedules (possibly nothing)
    i = (i + 1) % SCHEDULE_TABLE_SIZE; // wrap around at the end of the transmission cycle
}
The problem with this method is that the table grows as the LCM grows, which happens with unfortunate rate combinations (e.g. rates that are prime or relatively prime to each other).
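A small Go sketch of how such a schedule table could be built from the consumer rates. The GCD/LCM helpers and the table layout are assumptions based on the description above, using the C1-every-6 / C2-every-9 example.

package main

import "fmt"

func gcd(a, b int) int {
    for b != 0 {
        a, b = b, a%b
    }
    return a
}

func lcm(a, b int) int { return a / gcd(a, b) * b }

func main() {
    rates := map[string]int{"C1": 6, "C2": 9} // consumer -> send period

    // IDLE MINIMUM PERIOD = GCD of all rates, cycle length = LCM of all rates.
    idle, cycle := 0, 1
    for _, r := range rates {
        idle = gcd(idle, r) // gcd(0, r) == r, so this seeds correctly
        cycle = lcm(cycle, r)
    }
    size := cycle / idle // number of slots in the schedule table

    // Slot i covers time t = i * idle; a consumer is due when t % rate == 0.
    table := make([][]string, size)
    for i := 0; i < size; i++ {
        t := i * idle
        for c, r := range rates {
            if t%r == 0 {
                table[i] = append(table[i], c)
            }
        }
    }

    fmt.Println("idle minimum period:", idle, "cycle length:", cycle)
    for i, slot := range table {
        fmt.Printf("slot %d (t=%d): %v\n", i, i*idle, slot)
    }
}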

Optimal size of worker pool

I'm building a Go app which uses a "worker pool" of goroutines; initially I start the pool by creating a number of workers. I was wondering what the optimal number of workers would be on a multi-core processor, for example a CPU with 4 cores. I'm currently using the following approach:
// init pool
numCPUs := runtime.NumCPU()
runtime.GOMAXPROCS(numCPUs + 1) // numCPUs hot threads + one for async tasks.
maxWorkers := numCPUs * 4

// A channel that we can send work requests on.
jobQueue := make(chan job.Job)

module := Module{
    Dispatcher: job.NewWorkerPool(maxWorkers),
    JobQueue:   jobQueue,
    Router:     router,
}
module.Dispatcher.Run(jobQueue)
The complete implementation is under
job.NewWorkerPool(maxWorkers)
and
module.Dispatcher.Run(jobQueue)
My use case for the worker pool: I have a service which accepts requests, calls multiple external APIs, and aggregates their results into a single response. Each call can be done independently of the others, as the order of results doesn't matter. I dispatch the calls to the worker pool, where each call is done in an available goroutine in an asynchronous way. My "request" thread keeps listening on the return channels, fetching and aggregating results as soon as a worker thread is done. When all are done, the final aggregated result is returned as the response. Since each external API call may have a variable response time, some calls complete earlier than others. As per my understanding, doing this in parallel should perform better than calling each external API synchronously, one after another.
The comments in your sample code suggest you may be conflating the two concepts of GOMAXPROCS and a worker pool. These two concepts are completely distinct in Go.
GOMAXPROCS sets the maximum number of CPU threads the Go runtime will use. This defaults to the number of CPU cores found on the system, and should almost never be changed. The only time I can think of to change this would be if you wanted to explicitly limit a Go program to use fewer than the available CPUs for some reason, then you might set this to 1, for example, even when running on a 4-core CPU. This should only ever matter in rare situations.
TL;DR; Never set runtime.GOMAXPROCS manually.
Worker pools in Go are a set of goroutines, which handle jobs as they arrive. There are different ways of handling worker pools in Go.
What number of workers should you use? There is no objective answer. Probably the only way to know is to benchmark various configurations until you find one that meets your requirements.
As a simple case, suppose your worker pool is doing something very CPU-intensive. In this case, you probably want one worker per CPU.
As a more likely example, though, let's say your workers are doing something more I/O bound--such as reading HTTP requests, or sending email via SMTP. In this case, you may reasonably handle dozens or even thousands of workers per CPU.
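To illustrate the sizing knob, here is a minimal, generic worker pool where the worker count is just a parameter you benchmark. This is a sketch, not the job.NewWorkerPool implementation from the question; runPool, its arguments, and the toy workload are all made up.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// runPool processes every job with the given number of workers and returns
// the results; numWorkers is the value you would benchmark and tune.
func runPool(numWorkers int, jobs []int, work func(int) int) []int {
    jobCh := make(chan int)
    resCh := make(chan int)

    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobCh {
                resCh <- work(j)
            }
        }()
    }

    // Feed the jobs, then close the job channel so workers can exit.
    go func() {
        for _, j := range jobs {
            jobCh <- j
        }
        close(jobCh)
    }()

    // Close the result channel once every worker has finished.
    go func() {
        wg.Wait()
        close(resCh)
    }()

    results := make([]int, 0, len(jobs))
    for r := range resCh {
        results = append(results, r)
    }
    return results
}

func main() {
    jobs := []int{1, 2, 3, 4, 5}
    // CPU-bound work: start from one worker per core.
    // I/O-bound work: try much larger counts and measure.
    out := runPool(runtime.NumCPU(), jobs, func(n int) int { return n * n })
    fmt.Println(out)
}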
And then there's also the question of if you even should use a worker pool. Most problems in Go do not require worker pools at all. I've worked on dozens of production Go programs, and never once used a worker pool in any of them. I've also written many times more one-time-use Go tools, and only used a worker pool maybe once.
And finally, the only way in which GOMAXPROCS and worker pools relate is the same way goroutines in general relate to GOMAXPROCS. From the docs:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
From this simple description, it's easy to see that there could be many more (potentially hundreds of thousands... or more) goroutines than GOMAXPROCS--GOMAXPROCS only limits how many "operating system threads that can execute user-level Go code simultaneously"--goroutines which aren't executing user-level Go code at the moment don't count. And I/O-bound goroutines (such as those waiting for a network response) aren't executing code. So the theoretical maximum number of goroutines is limited only by your system's available memory.

How to determine which side of go channel is waiting?

How do I determine which side of a go channel is waiting on the other side?
I'd like to know this so I can figure out where my processing is being limited and respond by allocating more resources.
Some options
The 2 methods I thought of both require something to do a moving average of recorded values so measurements are not too noisy, but that's not a big problem.
Use a timer to check % of time waiting in consumer
In the case of a single consumer, I can start a timer before consuming from the channel, stopping the timer after I get a record. I can keep track of the % of the time I spend waiting and respond accordingly during each fetch cycle.
Sample length of buffered channel
If the channel is regularly at length 0, that means we are consuming faster than we are sending. Similarly, if the buffer is full, we're sending faster than we can receive. We can sample the length of our channel over time to determine which side is running slow.
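For example, a sampling goroutine along these lines could watch the buffer level. The channel name, element type, and sampling interval are arbitrary here; this is only a sketch of the idea.

package main

import (
    "log"
    "time"
)

// monitorFill periodically logs how full a buffered channel is. A length
// that is usually 0 suggests the consumer keeps up; a length that is
// usually at capacity suggests the producer outpaces the consumer.
func monitorFill[T any](name string, ch chan T, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for range ticker.C {
        log.Printf("%s fill: %d/%d", name, len(ch), cap(ch))
    }
}

func main() {
    records := make(chan string, 100) // stand-in for the worker -> processor channel

    go monitorFill("records", records, 100*time.Millisecond)

    // ... workers send on records, the processor receives from it ...
    time.Sleep(time.Second)
}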
Is there a good reason to prefer one of these to the other, for performance reasons or otherwise? Is there a simpler solution to this problem?
Example
I have a service that is performing N HTTP requests to grab content in up to W goroutines at the same time and sending all that content on a channel to a processor running in a single goroutine, which in turn feeds data back to a client.
Each worker task will result in a large number of messages sent on the channel. Each worker's task can take several minutes to complete.
The following diagram summarizes the flow of data with 3 concurrent workers (W=3).
[worker: task 1] -
\
[worker: task 2] - | --- [ channel ] --- [ processor ] -> [ client ]
/
[worker: task 3] -
I want to know whether I should run more workers (increase W) or fewer workers (decrease W) during a request. This can vary a lot per request since clients work over connections of very different speeds.
One way to reach your goal is to use "bounded send" and "bounded receive" operations—if you're able to come up with reasonable polling timeouts.
When any one of your workers attempts to send a completed result over the channel, don't let it block "forever" (until there's space in the channel's buffer); instead, only allow it to block some maximum amount of time. If the timeout occurs before there was space in the channel's buffer, you can react to that condition: count how many times it occurs, adjust future deadlines, throttle or reduce the worker count, and so on.
Similarly, for your "processor" receiving results from the workers, you can limit the amount of time it blocks. If the timeout occurs before there was a value available, the processor is starved. Create more workers to feed it more quickly (assuming the workers will benefit from such parallelism).
The downside to this approach is the overhead in creating timers for each send or receive operation.
Sketching, with these declarations accessible to each of your workers:
const minWorkers = 3
var workers uint32
In each worker goroutine:
atomic.AddUint32(&workers, 1)
for {
    result, ok := produce()
    if !ok {
        break
    }
    // Detect when channel "p"'s buffer is full.
    select {
    case p <- result:
    case <-time.After(500 * time.Millisecond):
        // Hand over the pending result, no matter how long it takes.
        p <- result
        // Reduce the worker count if we're above the minimum.
        if current := atomic.LoadUint32(&workers); current > minWorkers &&
            atomic.CompareAndSwapUint32(&workers, current, current-1) {
            return
        }
        // Consider whether to try decrementing the worker count again
        // if we're still above the minimum. It's possible another one
        // of the workers also exited voluntarily, changing the count.
    }
}
atomic.AddUint32(&workers, ^uint32(0)) // decrement; AddUint32 can't take -1 directly
Note that as written above, you could achieve the same effect by timing how long it takes for the send to channel p to complete, and reacting to it having taken too long, as opposed to doing one bounded send followed by a potential blocking send. However, I sketched it that way because I suspect that such code would mature to include logging and instrumentation counter bumps when the timeout expires.
Similarly, in your processor goroutine, you could limit the amount of time you block receiving a value from the workers:
for {
    select {
    case result := <-p:
        consume(result)
    case <-time.After(500 * time.Millisecond):
        maybeStartAnotherWorker()
    }
}
Obviously, there are many knobs you can attach to this contraption. You wind up coupling the scheduling of the producers to both the consumer and the producers themselves. Introducing an opaque "listener" to which the producers and consumer "complain" about delays allows you to break this circular relationship and more easily vary the policy that governs how you react to congestion.

Performance problem: CPU intensive work performs better with more concurrency in Erlang

tl;dr
I'm getting better performance with my erlang program when I perform my CPU intensive tasks at higher concurrency (e.g. 10K at once vs. 4). Why?
I'm writing a map reduce framework using erlang, and I'm doing performance tests.
My map function is highly CPU intensive (mostly pure calculation). It also needs access to some static data, so I have a few persistent (lingering, i.e. living through the app's life cycle) worker processes on my machine, each holding a part of this data in memory and awaiting map requests. The output of map is sent to the manager process (which sent out the map requests to the workers), where the reduce (very lightweight) is performed.
Anyways, I noticed that I'm getting better throughput when I immediately spawn a new process for each map request that a worker receives, rather than letting the worker process synchronously perform the map requests itself one by one (thus leaving a bunch of map requests in its process queue, because I'm firing the map requests all at once).
Code snippet:
%% When I remove the comment, I get significant performance boost (90% -> 96%)
%% spawn_link(fun()->
%% One invocation uses around 250ms of CPU time
do_map(Map, AssignedSet, Emit, Data),
Manager ! {finished, JobId, self(), AssignedSet, normal},
%% end),
Compared to when I perform the same calculation in a tight loop, I get 96% throughput (efficiency) using the "immediately spawning" method (e.g. 10000 map reduce jobs running completely in parallel). When I use the "worker performs one-by-one" method, I get only around 90%.
I understand Erlang is supposed to be good at concurrent stuff, and I'm impressed that efficiency doesn't change even if I perform 10K map reduce requests at once as opposed to 100 etc! However, since I have only 4 CPU cores, I'd expect to get better throughput if I use lower concurrency like 4 or maybe 5.
Weirdly, my CPU usage looks very similar in the 2 different implementation (almost completely pegged at 100% on all cores). The performance difference is quite stable. I.e. even when I just do 100 map reduce jobs, I still get around 96% efficiency with the "immediately spawn" method, and around 90% when I use "one-by-one" method. Likewise when I test with 200, 500, 1000, 10K jobs.
I first suspected that queuing at the worker process was the culprit, but even when I should only have something like 25 messages in the worker's process queue, I still see the lower performance. 25 messages seems too small to cause a clog (I am doing selective message matching, but not in a way that would force the process to put messages back in the queue).
I'm not sure how I should proceed from here. Am I doing something wrong, or am I completely missing something??
UPDATE
I did some more tests and found out that the performance difference can disappear depending on conditions (particularly into how many worker process I divide the static data). Looks like I have much more to learn!
Assuming 1 worker process with 3 map actions, we have the first variant:
_______ _______ _______
| m | | m | | m |
| | | | | |
_| |_| |_| |_
a a a r
Where a is administrative tasks (reading from the message queue, dispatching the map etc.) m is the actual map and r is sending back the result. The second variant where a process is spawned for every map:
_________________._
| m r
| ___________________._
| | m r
| | _____________________._
_|_|_| m r
a a a
As you can see, administrative tasks (a) go on at the same time as maps (m) and at the same time as sending back results (r).
This keeps the CPU busy with map (i.e. calculation-intensive) work all the time, as opposed to having short dips every now and then. This is most likely the source of the small gain you see in throughput.
Since you have quite high concurrency from the beginning, you only see a relatively small gain in throughput. Compare this to theoretically running only one worker process (as in the first variant), where you would see much bigger gains.
First, let me remark that this is a very interesting question. I'd like to give you some hints:
Task switching occurs per run queue ([rq:x] in the shell) based on reductions: when an Erlang process calls a BIF or user-defined function, it increases its reduction counter. When running CPU-intensive code in one process, the reduction counter increases very often, and when it reaches a certain threshold a process switch occurs. (So one process with a longer lifetime has the same overhead as multiple processes with shorter lifetimes: they have the "same" total reduction count and trigger a switch when it reaches the threshold, e.g. one process: 50,000 reductions, versus more processes: 5 * 10,000 reductions = 50,000 reductions.) (Runtime reasons)
Running on 4 cores vs. 1 core makes a difference; however, timing is the difference. The reason your cores are at 100% is that one or more cores are doing the mapping while the others are effectively "filling" your message queue. When you spawn the mapping, there is less time spent filling the message queue and more time doing the mapping. Apparently, mapping is a more costly operation than filling the queue, so giving it more cores increases performance. (Timing/tuning reasons)
You'll get higher throughput when you increase concurrency levels if processes spend time waiting (receiving, calling OTP servers, etc.). For instance, requesting data from your static persistent workers takes some time. (Language reasons)

Resources