Are goroutines appropriate for large, parallel, compute-bound problems?

Are goroutines preemptively multitasked for numerical problems?
I am very intrigued by the lean design of Go and its speed, but most of all by the fact that channels are first-class objects. I hope the last point may enable a whole new class of deep-analysis algorithms for big data, via the complex interconnection patterns they should allow.
My problem domain requires real-time, compute-bound analysis of streaming incoming data. The data can be partitioned into 100-1000 "problems", each of which will take between 10 and 1000 seconds to compute (i.e., their granularity is highly variable). All results must be available before the output makes sense: say 500 problems come in; all 500 must be solved before I can use any of them. The application must be able to scale, potentially to thousands (though unlikely hundreds of thousands) of problems.
Given that I am less worried about numerical library support (most of this stuff is custom), Go seems ideal, as I can map each problem to a goroutine. Before I invest in learning Go rather than, say, Julia, Rust, or a functional language (none of which, as far as I can see, have first-class channels, so for me they are at an immediate disadvantage), I need to know whether goroutines are properly preemptively multitasked. That is, if I run 500 compute-bound goroutines on a powerful multicore computer, can I expect reasonable load balancing across all the "problems", or will I have to cooperatively "yield" all the time, 1995-style? This issue is particularly important given the variable granularity of the problem and the fact that, during a computation, I usually will not know how much longer it will take.
If another language would serve me better, I am happy to hear about it, but I require that the threads (or go/coroutines) of execution be lightweight. Python's multiprocessing module, for example, is far too resource-intensive for my scaling ambitions. Just to pre-empt: I do understand the difference between parallelism and concurrency.

The Go runtime multiplexes goroutines onto multiple OS threads automatically. No goroutine is bound to a particular thread; the scheduler may (and will) schedule goroutines onto the next available thread. The number of threads a Go program uses is taken from the GOMAXPROCS environment variable and can be overridden with runtime.GOMAXPROCS(). This is a simplified description, but it is sufficient for understanding.
Goroutines may yield in the following cases:
On any operation that might block, i.e. any operation that cannot return a result on the spot, because it is either a (possibly) blocking system call like io.Read(), or an operation that might require waiting for other goroutines, like acquiring a mutex or sending to or receiving from a channel
On various runtime operations
On a function call, if the scheduler detects that the goroutine being preempted has used a lot of CPU time (new in Go 1.2)
On call to runtime.Gosched()
On panic()
As of Go 1.14, tight loops can be preempted by the runtime. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is not supported on all platforms - be sure to review the release notes. Also see issue #36365 for future plans in this area.
On various other occasions
The following things prevent a goroutine from yielding:
Executing C code. A goroutine can't yield while it is executing C code via cgo.
Calling runtime.LockOSThread(), until runtime.UnlockOSThread() has been called.
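To see what this means for compute-bound work, here is a minimal sketch (the burn function and the work sizes are made-up stand-ins for your "problems") that runs hundreds of compute-bound goroutines and lets the scheduler balance them; on Go 1.14+ no explicit yielding is needed:
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// burn is a hypothetical stand-in for a compute-bound "problem"
// of variable granularity.
func burn(n int) float64 {
    s := 0.0
    for i := 1; i <= n; i++ {
        s += 1.0 / float64(i) // tight numeric loop; preemptible since Go 1.14
    }
    return s
}

func main() {
    fmt.Println("threads:", runtime.GOMAXPROCS(0)) // defaults to NumCPU since Go 1.5

    var wg sync.WaitGroup
    for p := 0; p < 500; p++ {
        wg.Add(1)
        go func(p int) {
            defer wg.Done()
            _ = burn(1000000 * (p%10 + 1)) // highly variable amount of work
        }(p)
    }
    wg.Wait() // all 500 problems finish before any result is used
}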

Not sure I fully understand you; however, you can set runtime.GOMAXPROCS to scale to all processors, then use channels (or locks) to synchronize the data. Example:
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

const N = 100

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU()) // scale to all processors (the default since Go 1.5)
    var stuff [N]bool
    var wg sync.WaitGroup
    ch := make(chan int, runtime.NumCPU())
    done := make(chan struct{}, runtime.NumCPU())
    collected := make(chan struct{}) // closed once the collector has drained ch
    go func() {
        for i := range ch {
            stuff[i] = true
        }
        close(collected)
    }()
    wg.Add(N)
    for i := range stuff {
        go func(i int) {
            for { // cpu-bound loop
                select {
                case <-done:
                    fmt.Println(i, "is done")
                    ch <- i
                    wg.Done()
                    return
                default:
                }
            }
        }(i)
    }
    go func() {
        for range stuff {
            time.Sleep(time.Microsecond)
            done <- struct{}{}
        }
        close(done)
    }()
    wg.Wait()
    close(ch)
    <-collected // wait for the collector, so reading stuff below is race-free
    for i, v := range stuff {
        if !v {
            panic(fmt.Sprintf("%d != true", i))
        }
    }
    fmt.Println("All done")
}
EDIT: Information about the scheduler: http://tip.golang.org/src/pkg/runtime/proc.c
Goroutine scheduler
The scheduler's job is to distribute ready-to-run goroutines over worker threads.
The main concepts are:
G - goroutine.
M - worker thread, or machine.
P - processor, a resource that is required to execute Go code. An M must have an associated P to execute Go code; however, it can be blocked or in a syscall without an associated P.
Design doc at http://golang.org/s/go11sched.

Related

will mutex lock guarantee mem sync?

in goroutine A, I do this:
mu.Lock()
n++
mu.Unlock()
and in goroutine B:
fmt.Println(n)
Can I see the changed n in goroutine B immediately? In other words, will the mutex guarantee memory sync?
If "immediately" means after the mu.Unlock(), then yes, but keep in mind that before reading the value you should also acquire the lock. For instance, on a multiprocessor (multicore) system each core has its own cache, and there may be situations where a variable's value differs between cores' caches.
For this specific situation of a single integer variable, using the "sync/atomic" package is perhaps more beneficial: if n is declared as an int64, update it with atomic.AddInt64(&n, 1) and read it with atomic.LoadInt64(&n).
If the structure of interest is a map, you could use sync.Map, or guard a plain map with a sync.RWMutex. RWMutex exposes separate locks for updates (m.Lock()) and reads (m.RLock()), so you can have safe parallel reads and, after m.RUnlock(), sequential updates.
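A minimal sketch of the mutex and atomic approaches described above (the variable names are illustrative):
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    // Mutex version: writer and reader both take the lock.
    var mu sync.Mutex
    var n int
    var wg sync.WaitGroup

    wg.Add(1)
    go func() { // goroutine A
        defer wg.Done()
        mu.Lock()
        n++
        mu.Unlock()
    }()
    wg.Wait()

    mu.Lock() // goroutine B must also lock before reading
    fmt.Println(n)
    mu.Unlock()

    // Atomic version: no explicit lock needed for a single integer.
    var m int64
    atomic.AddInt64(&m, 1)
    fmt.Println(atomic.LoadInt64(&m))
}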

Golang assignment safety with single reader and single writer

Say I have two go routines:
var sequence int64
// writer
for i := sequence; i < max; i++ {
    doSomethingWithSequence(i)
    sequence = i
}
// reader
for {
    doSomeOtherThingWithSequence(sequence)
}
So can I get by without atomic?
Some potential risks I can think of:
reorder (for the writer, updating sequence happens before doSomething) could happen, but I can live with that.
sequence is not properly aligned in memory, so the reader might observe a partially updated i. Running on Linux (recent kernel) with x86_64, can we rule that out?
go compiler 'cleverly optimizes' the reader, so the access to i never goes to memory but cached in a register. Is that possible in go?
Anything else?
Go's motto: "Do not communicate by sharing memory; instead, share memory by communicating." That is an effective best practice most of the time.
If you care about ordering, you care about synchronizing the two goroutines.
I don't think they are possible. Anyway, those are not things you should worry about if you properly design the synchronization.
The same as above.
Luckily, Go has a data race detector integrated. Try to run your example with go run -race. You will probably see the race condition happening on the sequence variable.
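If you do want the single-writer/single-reader pattern without a full mutex, a hedged sketch using sync/atomic (assuming sequence is an int64 as above; max and the loop bounds are made up for illustration):
package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

func main() {
    var sequence int64
    const max = 1000

    // writer: publish each value with an atomic store
    go func() {
        for i := int64(0); i < max; i++ {
            atomic.StoreInt64(&sequence, i)
        }
    }()

    // reader: load atomically, so no torn or register-cached reads
    for n := 0; n < 5; n++ {
        fmt.Println(atomic.LoadInt64(&sequence))
        time.Sleep(time.Millisecond)
    }
}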

Possibility to NOT hook all available CPU power?

I know most Go beginners ask how to make goroutines/concurrency performant; I got past that point a few weeks ago. :-)
I have a real fast trans-coder that uses every cycle available of my 4+4 (i7 HT) CPU. It reads a file into a slice of pointers to structs, does calculations on these and writes the result back to disk. I am using bufio. I am coming from VB so the performance of Go is unbelievable.
I tried to add minimal sleeps (via time.Sleep()), but that drastically decreased performance.
While my trans-coder is working, the whole system lags. I must change the Go process's priority to low or idle to be able to work again.
How could I implement something that keeps the system responsive?
Right now I start thousands of goroutines (looping over a slice of pointers). Should I limit the number of goroutines?
Lowering the process priority is arguably the correct way to do this. Use your OS's scheduler. That's what it's for. Per this question you can start your process with a specified priority like so:
start "MyApp" /low "C:\myapp.exe"
You may also be able to set process priority from within the application per this question:
err := syscall.Setpriority(syscall.PRIO_PROCESS, syscall.Getpid(), 19) // Unix; 19 is the lowest priority
Lastly, you can use GOMAXPROCS to configure how many CPUs the process is allowed to use. You can pass it in as an environment variable at runtime, or call runtime.GOMAXPROCS() within your code to override it.
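For example, a minimal sketch that leaves one core free for the rest of the system:
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Leave one CPU for the OS and other applications.
    n := runtime.NumCPU() - 1
    if n < 1 {
        n = 1
    }
    runtime.GOMAXPROCS(n)
    fmt.Println("using", n, "of", runtime.NumCPU(), "CPUs")
}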
Probably the simplest solution is to limit the number of concurrent goroutines using a semaphore, for example:
var sem = make(chan int, 10) // allow at most 10 concurrent goroutines

for {
    sem <- 1 // blocks while 10 goroutines are already running
    go doThis()
}

func doThis() {
    // do this
    <-sem // free a slot when finished
}
The "sem <- 1" inside the for loop blocks until a goroutine "slot" is freed up by doThis extracting something from the sem channel/

How can Go's race detector be aware of lock?

Adding a lock to a program with a race condition can solve the race condition and make the race detector quiet. How can Go's race detector be aware of the lock?
Someone points out that "the race detector can only detect race conditions if and when they actually occur".
Consider the following program:
package main

import (
    "sync"
    "time"
)

func main() {
    var a int
    var wg sync.WaitGroup
    workers := 2
    wg.Add(workers)
    for i := 1; i <= workers; i++ {
        go func(sleep int) {
            time.Sleep(time.Duration(sleep) * time.Second)
            a = 1
            wg.Done()
        }(i * 5)
    }
    wg.Wait()
}
One goroutine sleeps for 5 seconds, the other sleeps for 10 seconds; they don't write to a at the same time in most cases, but the race detector prints the race condition warning every time. Why?
The race detector does not analyze the source code, it doesn't know that you added locks in the source code.
The race detector works at runtime:
When the -race command-line flag is set, the compiler instruments all memory accesses with code that records when and how the memory was accessed, while the runtime library watches for unsynchronized accesses to shared variables.
Because of this design, the race detector can only detect race conditions if and when they actually occur. So when you add proper locking / synchronization, the race condition will not happen (the if condition will not be met), and so no warning will be printed.
See this blog post for more details: Introducing the Go Race Detector
And this article: Data Race Detector
Edit for your example:
It may be that your 2 goroutines never reach the point where they write to the shared variable a at the same physical time (because the code runs so fast and the sleep times are relatively huge), but they run concurrently, in different goroutines, without explicit synchronization (a synchronization point may be channel communication, mutex lock/unlock, etc.).
A race condition does not require that accesses to a shared variable (at least one of which is a write) happen at the same physical time. The condition is also met if the accesses happen concurrently (from multiple goroutines) without synchronization. This can be detected at runtime by the race detector, thanks to the instrumented memory-access code.
Code generated by the compiler is allowed to use cached instances of the a variable in multiple goroutines; the runtime only has to guarantee that the cached instances are "refreshed" or disposed of when a synchronization point is reached. For details, see The Go Memory Model.
Also note that time.Sleep() does not guarantee that execution will continue right after the specified duration, only that execution will be suspended for at least the specified duration (so the execution may continue at a later time):
Sleep pauses the current goroutine for at least the duration d.
The data race detector does not do static analysis; it is not aware of your lock as written in the source. It is empirical: it simply observes that when you run your code with the lock, two goroutines never write to the same value at the same time (or one writing while another reads).
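To make the detector quiet, it suffices to synchronize the accesses; a minimal sketch adding a mutex to the example above:
package main

import (
    "sync"
    "time"
)

func main() {
    var a int
    var mu sync.Mutex
    var wg sync.WaitGroup
    workers := 2
    wg.Add(workers)
    for i := 1; i <= workers; i++ {
        go func(sleep int) {
            time.Sleep(time.Duration(sleep) * time.Second)
            mu.Lock()
            a = 1 // the Lock/Unlock pair establishes a happens-before edge
            mu.Unlock()
            wg.Done()
        }(i * 5)
    }
    wg.Wait()
}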

Simple concurrent queue

Could someone please mention the flaws and performance drawbacks in the Queue like implementation?
type Queue struct {
    sync.Mutex
    Items []interface{}
}

func (q *Queue) Push(item interface{}) {
    q.Lock()
    defer q.Unlock()
    q.Items = append(q.Items, item)
}

func (q *Queue) Pop() interface{} {
    q.Lock()
    defer q.Unlock()
    if len(q.Items) == 0 {
        return nil
    }
    item := q.Items[0]
    q.Items = q.Items[1:]
    return item
}
I also have methods like PopMany and PushMany, and what I am concerned about is: Is too much re-slicing that bad?
You could simply use a buffered channel.
var queue = make(chan interface{}, 100)
The size of the buffer could be determined empirically to be large enough for the high-water mark of the rate of pushes versus the rate of pops. It should ideally not be much larger than this, to avoid wasting memory.
Indeed, a smaller buffer size will also work, provided the interacting goroutines don't deadlock for other reasons. With a smaller buffer size, you are effectively getting queueing via the run queue of the Go runtime's scheduler. (Quite possibly, a buffer size of zero could work in many circumstances.)
Channels allow many reader goroutines and many writer goroutines. The concurrency of their access is handled automatically by the Go runtime. All writes into the channel are interleaved so as to be a sequential stream. All the reads are also interleaved to extract values sequentially in the same order they were enqueued. Here's further discussion on this topic.
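Usage is then just channel sends and receives; a minimal sketch:
package main

import "fmt"

func main() {
    queue := make(chan interface{}, 100) // buffered channel as the queue

    // push
    queue <- "job-1"
    queue <- "job-2"

    // pop (blocks while the queue is empty)
    fmt.Println(<-queue)
    fmt.Println(<-queue)
}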
The re-slicing is not an issue here. It will also make no difference whether you have a thread-safe or unsafe version as this is pretty much how the re-sizing is meant to be done.
You can alleviate some of the re-sizing overhead by initializing the queue with a capacity:
func NewQueue(capacity int) *Queue {
    return &Queue{
        Items: make([]interface{}, 0, capacity),
    }
}
This will initialize the queue. It can still grow beyond the capacity, but you will not be having any unnecessary copying/re-allocation until that capacity is reached.
What may potentially cause problems with many concurrent accesses is the mutex lock. At some point you will be spending more time waiting for locks to be released than actually doing work. This is a general problem with lock contention and can be solved by implementing the queue as a lock-free data structure.
There are a few third-party packages out there which provide lock free implementations of basic data structures.
Whether this will actually be useful to you can only be determined with some benchmarking. Lock-free structures can have a higher base cost, but they scale much better when you get many concurrent users. There is a cutoff point at which mutex locks become more expensive than the lock-free approach.
I think the best way to approach this is to use a linked list; there is already one available in the standard library's container/list package.
The answer marked correct says re-slicing is not an issue. That is not correct: it is an issue, because the slice's backing array keeps the popped element reachable. What Dave is suggesting is right; we should set that element to nil before re-slicing, as sketched below.
Refer more about slices here: https://go.dev/blog/slices-intro
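A sketch of Pop with that fix applied (using the Queue type defined above):
// Pop removes the first item, nil-ing out its slot so the garbage
// collector can reclaim the popped value once the caller drops it.
func (q *Queue) Pop() interface{} {
    q.Lock()
    defer q.Unlock()
    if len(q.Items) == 0 {
        return nil
    }
    item := q.Items[0]
    q.Items[0] = nil // drop the reference held by the backing array
    q.Items = q.Items[1:]
    return item
}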
