Possibility to NOT use all available CPU power? (Windows)

I know most Go beginners ask how to make their goroutines/concurrency performant; I got past that point a few weeks ago. :-)
I have a really fast transcoder that uses every available cycle of my 4+4 (i7 with Hyper-Threading) CPU. It reads a file into a slice of pointers to structs, does calculations on these, and writes the result back to disk. I am using bufio. Coming from VB, the performance of Go is unbelievable.
I tried adding minimal sleeps (via time.Sleep()), but that drastically decreased performance.
While my transcoder is working, the whole system lags. I have to change the Go task's priority to low or idle to be able to work again.
How could I implement something that keeps the system responsive?
Right now I start thousands of goroutines (a loop over a slice of pointers). Should I limit the number of goroutines?

Lowering the process priority is arguably the correct way to do this. Use your OS's scheduler. That's what it's for. Per this question you can start your process with a specified priority like so:
start "MyApp" /low "C:\myapp.exe"
You may also be able to set the process priority from within the application (on Unix systems), per this question:
err := syscall.Setpriority(syscall.PRIO_PROCESS, syscall.Getpid(), 10) // a higher nice value means lower priority
Lastly, you can use GOMAXPROCS to configure how many CPUs the process is allowed to use. You can set it as an environment variable when launching the process, or call runtime.GOMAXPROCS() within your code to override it.
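For example, a minimal sketch that leaves one logical CPU free for the rest of the system (the leave-one-core-idle policy is an assumption, not something the original answer prescribes):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Leave one logical CPU free so the rest of the system stays responsive.
	n := runtime.NumCPU() - 1
	if n < 1 {
		n = 1
	}
	runtime.GOMAXPROCS(n)
	fmt.Println("using", runtime.GOMAXPROCS(0), "of", runtime.NumCPU(), "logical CPUs")
}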

Probably the simplest solution is to limit the number of concurrent goroutines using a semaphore, for example:
sem := make(chan int, 10) // allow at most 10 concurrent goroutines

for {
	sem <- 1 // blocks while 10 goroutines are already running
	go doThis()
}

func doThis() {
	// do the work
	<-sem // free a slot for the next goroutine
}
The "sem <- 1" inside the for loop blocks until a goroutine "slot" is freed up by doThis extracting something from the sem channel/

Related

Golang assignment safety with single reader and single writer

Say I have two goroutines:
var sequence int64

// writer
for i := sequence; i < max; i++ {
	doSomethingWithSequence(i)
	sequence = i
}

// reader
for {
	doSomeOtherThingWithSequence(sequence)
}
So can I get by without atomic?
Some potential risks I can think of:
Reordering (for the writer, updating sequence happens before doSomething) could happen, but I can live with that.
sequence is not properly aligned in memory, so the reader might observe a partially updated value. Running on Linux (recent kernel) with x86_64, can we rule that out?
The Go compiler 'cleverly optimizes' the reader, so the access to sequence never goes to memory but is cached in a register. Is that possible in Go?
Anything else?
Go's motto: "Do not communicate by sharing memory; instead, share memory by communicating." That is an effective best practice most of the time.
If you care about ordering, you care about synchronizing the two goroutines.
I don't think they are possible. Anyway, those are not things you should worry about if you properly design the synchronization.
The same as above.
Luckily, Go has an integrated data race detector. Try running your example with go run -race. You will probably see the race condition happening on the sequence variable.
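If you do want to keep the shared variable rather than switch to a channel, here is a minimal sketch using the standard sync/atomic package (the loop bound and the done channel are invented for illustration):
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var sequence int64
	done := make(chan struct{})

	// writer
	go func() {
		for i := int64(0); i < 1000; i++ {
			atomic.StoreInt64(&sequence, i) // safely publish the new value
		}
		close(done)
	}()

	// reader
	for {
		select {
		case <-done:
			fmt.Println("final:", atomic.LoadInt64(&sequence))
			return
		default:
			_ = atomic.LoadInt64(&sequence) // safely observe the latest value
		}
	}
}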

Golang: why runtime.GOMAXPROCS is limited to 256?

I was playing with Go 1.7.3 on a MacBook and on Ubuntu and found that runtime.GOMAXPROCS is limited to 256. Does anyone know where this limit comes from? Is it documented anywhere, and why would there be a limit? Is this an implementation optimization?
The only reference to 256 I could find is on the page that describes Go's runtime package: https://golang.org/pkg/runtime/. The runtime.MemStats struct has a couple of stat arrays of size 256:
type MemStats struct {
	...
	PauseNs  [256]uint64 // circular buffer of recent GC pause durations, most recent at [(NumGC+255)%256]
	PauseEnd [256]uint64 // circular buffer of recent GC pause end times
	...
}
Here's the example Go code I used:
func main() {
	runtime.GOMAXPROCS(1000)
	log.Printf("GOMAXPROCS %d\n", runtime.GOMAXPROCS(-1))
}
Prints
GOMAXPROCS 256
P.S.
Also, can someone point me to documentation on how GOMAXPROCS relates to the OS thread count used by the Go scheduler (if at all)? Should we observe Go-compiled code running GOMAXPROCS OS threads?
EDIT: Thanks @twotwotwo for pointing out how GOMAXPROCS relates to OS threads. Still, it's interesting that the documentation does not mention this 256 limit (other than in the MemStats struct, which may or may not be related).
I wonder if anyone is aware of the true reason for this 256 number.
Note that, starting with Go 1.10 (Q1 2018), GOMAXPROCS will be limited by... nothing.
The runtime no longer artificially limits GOMAXPROCS (previously it was limited to 1024).
See commit ee55000 by Austin Clements (aclements), which fixes issue 15131.
Now that allp is dynamically allocated, there's no need for a hard cap
on GOMAXPROCS.
allp is defined here.
See also commit e900e27:
runtime: clean up loops over allp
allp now has length gomaxprocs, which means none of allp[i] are nil or in state _Pdead. This lets us replace several different styles of loops over allp with normal range loops:
for i := 0; i < gomaxprocs; i++ { ... } loops can simply range over allp. Likewise, range loops over allp[:gomaxprocs] can just range over allp.
Loops that check for p == nil || p.state == _Pdead don't need to check this any more.
Loops that check for p == nil don't have to check this if dead Ps don't affect them. I checked that all such loops are, in fact, unaffected by dead Ps. One loop was potentially affected, which this fixes by zeroing p.gcAssistTime in procresize.
The package runtime docs clarify how GOMAXPROCS relates to OS threads:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
So you could see more than GOMAXPROCS OS threads (because some are blocked in system calls, and there's no limit to how many), or fewer (because GOMAXPROCS is only documented to limit the number of threads, not prescribe it exactly).
I think capping GOMAXPROCS is consistent with the spirit of that documentation: you specified you were OK with 1000 OS threads running Go code, but the runtime decided to 'only' run 256. That doesn't limit the number of active goroutines, because they're multiplexed onto OS threads; when one goroutine blocks (waiting for a network read to complete, say), Go's internal scheduler starts other work on the same OS thread.
The Go team might have made this choice to minimize the chance that Go programs end up running many times more OS threads than most machines today have cores; that would cause more OS context switches, which can be slower than the user-mode goroutine switches that occur when GOMAXPROCS is kept down to the number of CPU cores present. Or it might just have been convenient for the design of Go's internal scheduler to have an upper bound on GOMAXPROCS.
The article Goroutines vs Threads is not perfect (e.g., goroutines no longer have segmented stacks), but it may help you understand what's going on here under the hood.
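For reference, a small snippet (not from the answers above) to inspect both values on your own machine; since Go 1.5, GOMAXPROCS defaults to runtime.NumCPU():
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 queries the value without changing it
}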

Simple concurrent queue

Could someone please point out the flaws and performance drawbacks of this Queue-like implementation?
type Queue struct {
	sync.Mutex
	Items []interface{}
}

func (q *Queue) Push(item interface{}) {
	q.Lock()
	defer q.Unlock()
	q.Items = append(q.Items, item)
}

func (q *Queue) Pop() interface{} {
	q.Lock()
	defer q.Unlock()
	if len(q.Items) == 0 {
		return nil
	}
	item := q.Items[0]
	q.Items = q.Items[1:]
	return item
}
I also have methods like PopMany and PushMany, and what I am concerned about is: Is too much re-slicing that bad?
You could simply use a buffered channel.
var queue = make(chan interface{}, 100)
The size of the buffer could be determined empirically to be large enough for the high-water mark of the rate of pushes vs. the rate of pops. It should ideally not be much larger than this, to avoid wasting memory.
Indeed, a smaller buffer size will also work, provided the interacting goroutines don't deadlock for other reasons. If you use a smaller buffer size, you are effectively getting queueing via the run queue of the goroutine scheduler, part of the Go runtime. (Quite possibly, a buffer size of zero could work in many circumstances.)
Channels allow many reader goroutines and many writer goroutines. The concurrency of their access is handled automatically by the Go runtime. All writes into the channel are interleaved so as to be a sequential stream. All the reads are also interleaved to extract values sequentially in the same order they were enqueued. Here's further discussion on this topic.
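For illustration, a minimal sketch of the channel-as-queue idea, where pushes are sends and pops are receives (the buffer size and item type are arbitrary):
package main

import "fmt"

func main() {
	queue := make(chan interface{}, 100) // buffered channel acting as the queue

	// producer: pushes are just sends
	go func() {
		for i := 0; i < 10; i++ {
			queue <- i
		}
		close(queue) // signal that no more items are coming
	}()

	// consumer: pops are just receives; range drains until the channel is closed
	for item := range queue {
		fmt.Println("popped:", item)
	}
}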
The re-slicing is not an issue here. It will also make no difference whether you have a thread-safe or unsafe version, as this is pretty much how the re-sizing is meant to be done.
You can alleviate some of the re-sizing overhead by initializing the queue with a capacity:
func NewQueue(capacity int) *Queue {
	return &Queue{
		Items: make([]interface{}, 0, capacity),
	}
}
This will initialize the queue. It can still grow beyond the capacity, but you will not incur any unnecessary copying/re-allocation until that capacity is reached.
What may potentially cause problems with many concurrent accesses, is the mutex lock. At some point, you will be spending more time waiting for locks to be released than you are actually doing work. This is a general problem with lock contention and can be solved by implementing the queue as a lock-free data structure.
There are a few third-party packages out there which provide lock free implementations of basic data structures.
Whether this will actually be useful to you can only be determined with some benchmarking. Lock-free structures can have a higher base cost, but they scale much better when you get many concurrent users. There is a cutoff point at which mutex locks become more expensive than the lock-free approach.
I think the best way to approach this is to use a linked list; one is already available in the standard library's container/list package.
The answer marked correct says re-slicing is not an issue. That is not correct; it is an issue. What Dave is suggesting is right: we should mark that element as nil.
Refer more about slices here: https://go.dev/blog/slices-intro
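For illustration, a sketch of what that fix looks like in the Pop method from the question (clearing the slot before re-slicing is the only change):
func (q *Queue) Pop() interface{} {
	q.Lock()
	defer q.Unlock()
	if len(q.Items) == 0 {
		return nil
	}
	item := q.Items[0]
	q.Items[0] = nil // drop the reference so the GC can reclaim the popped element
	q.Items = q.Items[1:]
	return item
}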

Are goroutines appropriate for large, parallel, compute-bound problems?

Are go-routines pre-emptively multitasked for numerical problems?
I am very intrigued by Go's lean design and speed, but most of all by the fact that channels are first-class objects. I hope the last point may enable a whole new class of deep-analysis algorithms for big data, via the complex interconnection patterns they should allow.
My problem domain requires real-time, compute-bound analysis of streaming incoming data. The data can be partitioned into 100-1000 "problems", each of which will take between 10 and 1000 seconds to compute (i.e., their granularity is highly variable). Results must, however, all be available before the output makes sense; say 500 problems come in: all 500 must be solved before I can use any of them. The application must be able to scale, potentially to thousands (though unlikely hundreds of thousands) of problems.
Given that I am less worried about numerical library support (most of this stuff is custom), Go seems ideal, as I can map each problem to a goroutine. Before I invest in learning Go rather than, say, Julia, Rust, or a functional language (none of which, as far as I can see, have first-class channels, so for me they are at an immediate disadvantage), I need to know whether goroutines are properly pre-emptively multitasked. That is, if I run 500 compute-bound goroutines on a powerful multicore computer, can I expect reasonable load balancing across all the "problems", or will I have to cooperatively "yield" all the time, 1995-style? This issue is particularly important given the variable granularity of the problem and the fact that, during compute, I usually will not know how much longer a problem will take.
If another language would serve me better, I am happy to hear about it, but I require that threads (or go/coroutines) of execution be lightweight. Python's multiprocessing module, for example, is far too resource-intensive for my scaling ambitions. Just to pre-empt: I do understand the difference between parallelism and concurrency.
The Go runtime has a model where multiple goroutines are mapped onto multiple threads automatically. No goroutine is bound to a particular thread; the scheduler may (and will) schedule goroutines on the next available thread. The number of threads a Go program uses is taken from the GOMAXPROCS environment variable and can be overridden with runtime.GOMAXPROCS(). This is a simplified description, but it is sufficient for understanding.
Goroutines may yield in the following cases:
On any operation that might block, i.e. any operation that cannot return a result on the spot because it is either a (possibly) blocking system call, like io.Read(), or an operation that might require waiting for other goroutines, like acquiring a mutex or sending to or receiving from a channel
On various runtime operations
On function calls, if the scheduler detects that the preempted goroutine took a lot of CPU time (this is new in Go 1.2)
On calls to runtime.Gosched()
On panic()
As of Go 1.14, tight loops can be preempted by the runtime. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is not supported on all platforms; be sure to review the release notes. Also see issue #36365 for future plans in this area.
On various other occasions
The following things prevent a goroutine from yielding:
Executing C code: a goroutine can't yield while it's executing C code via cgo.
Calling runtime.LockOSThread(), until runtime.UnlockOSThread() has been called.
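To illustrate the cooperative-yield point above, here is a minimal sketch that explicitly yields inside a tight numeric loop with runtime.Gosched(); this matters mainly on Go versions before 1.14, and the workload is invented:
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // share one OS thread so yielding matters
	done := make(chan struct{})

	go func() {
		sum := 0.0
		for i := 0; i < 100000000; i++ {
			sum += float64(i)
			if i%1000000 == 0 {
				runtime.Gosched() // explicitly yield the thread now and then
			}
		}
		fmt.Println("sum:", sum)
		close(done)
	}()

	fmt.Println("main keeps getting scheduled while the loop runs")
	<-done
}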
Not sure I fully understand you; however, you can set runtime.GOMAXPROCS to scale to all processors, then use channels (or locks) to synchronize the data. Example:
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

const N = 100

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU()) // scale to all processors
	var stuff [N]bool
	var wg sync.WaitGroup
	ch := make(chan int, runtime.NumCPU())
	done := make(chan struct{}, runtime.NumCPU())
	go func() {
		for i := range ch {
			stuff[i] = true
		}
	}()
	wg.Add(N)
	for i := range stuff {
		go func(i int) {
			for { // cpu-bound loop
				select {
				case <-done:
					fmt.Println(i, "is done")
					ch <- i
					wg.Done()
					return
				default:
				}
			}
		}(i)
	}
	go func() {
		for range stuff {
			time.Sleep(time.Microsecond)
			done <- struct{}{}
		}
		close(done)
	}()
	wg.Wait()
	close(ch)
	for i, v := range stuff { // false-positive data race
		if !v {
			panic(fmt.Sprintf("%d != true", i))
		}
	}
	fmt.Println("All done")
}
EDIT: Information about the scheduler is at http://tip.golang.org/src/pkg/runtime/proc.c
Goroutine scheduler
The scheduler's job is to distribute ready-to-run goroutines over worker threads.
The main concepts are:
G - goroutine.
M - worker thread, or machine.
P - processor, a resource that is required to execute Go code. M must have an associated P to execute Go code, however it can be blocked or in a syscall w/o an associated P.
Design doc at http://golang.org/s/go11sched.

When to use a buffered channel?

What are the use cases for buffered channels? If I want multiple parallel actions I could just use the default, synchronous channel, e.g.:
package main

import (
	"fmt"
	"time"
)

func longLastingProcess(c chan string) {
	time.Sleep(2000 * time.Millisecond)
	c <- "tadaa"
}

func main() {
	c := make(chan string)
	go longLastingProcess(c)
	go longLastingProcess(c)
	go longLastingProcess(c)
	fmt.Println(<-c)
}
What would be the practical cases for increasing the buffer size ?
To give a single, slightly-more-concrete use case:
Suppose you want your channel to represent a task queue, so that a task scheduler can send jobs into the queue, and a worker thread can consume a job by receiving it from the channel.
Suppose further that, though in general you expect each job to be handled in a timely fashion, it takes longer for a worker to complete a task than it does for the scheduler to schedule it.
Having a buffer allows the scheduler to deposit jobs in the queue and still remain responsive to user input (or network traffic, or whatever) because it does not have to sleep until the worker is ready each time it schedules a task. Instead, it goes about its business, and trusts the workers to catch up during a quieter period.
If you want an EVEN MORE CONCRETE example dealing with a specific piece of software then I'll see what I can do, but I hope this meets your needs.
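Here is a minimal sketch of that scheduler/worker shape; the job type, buffer size, and timings are invented for illustration:
package main

import (
	"fmt"
	"time"
)

func main() {
	jobs := make(chan int, 8) // buffer lets the scheduler run ahead of the worker

	// worker: slower than the scheduler
	go func() {
		for j := range jobs {
			time.Sleep(10 * time.Millisecond) // simulate slow task handling
			fmt.Println("finished job", j)
		}
	}()

	// scheduler: deposits jobs without blocking until the buffer is full
	for j := 0; j < 8; j++ {
		jobs <- j
		fmt.Println("scheduled job", j) // stays responsive in between
	}
	close(jobs)
	time.Sleep(200 * time.Millisecond) // crude wait so the worker can drain (demo only)
}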
Generally, buffering in channels is beneficial for performance reasons.
If a program is designed using an event-flow or data-flow approach, channels provide the means for the events to pass between one process and another (I use the term process in the same sense as in Tony Hoare's Communicating Sequential Processes (CSP), ie. effectively synonymous with the goroutine).
There are times when a program needs its components to remain in lock-step synchrony. In this case, unbuffered channels are required.
Otherwise, it is typically beneficial to add buffering to the channels. This should be seen as an optimisation step (deadlock may still be possible if not designed out).
There are novel throttle structures made possible by using channels with small buffers (example).
There are special overwriting or lossy forms of channels used in occam and jcsp for fixing the special case of a cycle (or loop) of processes that would otherwise probably deadlock. This is also possible in Go by writing an overwriting goroutine buffer (example).
You should never add buffering merely to fix a deadlock. If your program deadlocks, it's far easier to fix by starting with zero buffering and thinking through the dependencies. Then add buffering when you know it won't deadlock.
You can construct goroutines compositionally - that is, a goroutine may itself contain goroutines. This is a feature of CSP and benefits scalability greatly. The internal channels between a group of goroutines are not of interest when designing the external use of the group as a self-contained component. This principle can be applied repeatedly at increasingly-larger scales.
If the receiver of a channel is always slower than the sender, a buffer of any size will eventually be consumed. That will leave you with a channel that pauses your goroutine as often as an unbuffered channel, so you might as well use an unbuffered channel.
If the receiver is typically faster than the sender, except for an occasional burst, a buffered channel may be helpful, and the buffer should be set to the size of the typical burst, which you can arrive at by measuring at runtime.
As an alternative to a buffered channel, it may be better to just send an array (or a struct containing an array) over the channel to deal with bursts/batches, as sketched below.
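A minimal sketch of that batching alternative (the batch size of 32 is arbitrary):
package main

import "fmt"

func main() {
	batches := make(chan []int) // unbuffered: each send hands over a whole burst

	go func() {
		batch := make([]int, 0, 32)
		for i := 0; i < 100; i++ {
			batch = append(batch, i)
			if len(batch) == cap(batch) { // flush when the batch is full
				batches <- batch
				batch = make([]int, 0, 32)
			}
		}
		if len(batch) > 0 {
			batches <- batch // flush the remainder
		}
		close(batches)
	}()

	for b := range batches {
		fmt.Println("received batch of", len(b))
	}
}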
Buffered channels are non-blocking for the sender as long as there's still room. This can increase responsiveness and throughput.
Sending several items on one buffered channel makes sure they are processed in the order in which they are sent.
From Effective Go (with example): "A buffered channel can be used like a semaphore, for instance to limit throughput."
In general, there are many use cases and patterns of channel usage, so this is not an exhaustive answer.
It's a hard question, because the program is incorrect: it exits after receiving a value from one goroutine, but three were started. Buffering the channel would make no difference to that.
EDIT: For example, here is a bit of general discussion about channel buffers. And some exercise. And a book chapter about the same.
