Does a goroutine necessarily run on a different CPU? - go

The following excerpt is from https://go.dev/doc/effective_go#parallel.
We launch the pieces independently in a loop, one per CPU. They can complete in any order but it doesn't matter; we just count the completion signals by draining the channel after launching all the goroutines.
const numCPU = 4 // number of CPU cores
func (v Vector) DoAll(u Vector) {
c := make(chan int, numCPU) // Buffering optional but sensible.
for i := 0; i < numCPU; i++ {
go v.DoSome(i*len(v)/numCPU, (i+1)*len(v)/numCPU, u, c)
}
// Drain the channel.
for i := 0; i < numCPU; i++ {
<-c // wait for one task to complete
}
// All done.
}
Why does the article specify "one per CPU"? Multiple goroutines need not be executed on different CPUs. In fact, the last paragraph in the sub-section reminds the reader that concurrency is not parallelism:
Be sure not to confuse the ideas of concurrency—structuring a program as independently executing components—and parallelism—executing calculations in parallel for efficiency on multiple CPUs.

Does a goroutine necessarily run on a different CPU?
No, but it might.
Nothing to see here.
why the article specifies "one per CPU"
It could have said 5 or 2. Really there is nothing of importance hidden here. This is just an example, not the specification of goroutine scheduling.

Why does the article specify "one per CPU"? Multiple goroutines need
not be executed on different CPUs. In fact, the last paragraph in the
sub-section reminds the reader that concurrency is not parallelism:
Goroutines are multiplexed onto OS threads (M:N scheduling), and each goroutine runs on at most one thread at a time. You cannot predict whether the runtime will use one CPU to execute all 4 (GOMAXPROCS) goroutines, or 2, 3, or all 4 CPUs. That depends on many factors, and all of this complexity is hidden by the Go runtime.
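The excerpt hardcodes numCPU = 4. As a minimal sketch of the same "one piece per CPU" idea, the code below sizes the number of pieces from runtime.GOMAXPROCS(0) instead. The names sumChunk and parallelSum are hypothetical, and summing a slice is only a stand-in for the real work in DoSome. Launching one goroutine per CPU makes parallel execution possible; it does not pin any goroutine to a core.

```go
package main

import (
	"fmt"
	"runtime"
)

// sumChunk is a hypothetical stand-in for DoSome: it sums v[lo:hi]
// and reports the partial result on c.
func sumChunk(v []int, lo, hi int, c chan<- int) {
	total := 0
	for _, x := range v[lo:hi] {
		total += x
	}
	c <- total
}

// parallelSum splits v into one piece per usable CPU. The runtime still
// decides which OS thread and core each goroutine actually runs on.
func parallelSum(v []int) int {
	n := runtime.GOMAXPROCS(0) // query the current setting without changing it
	c := make(chan int, n)     // buffered so no sender has to wait
	for i := 0; i < n; i++ {
		go sumChunk(v, i*len(v)/n, (i+1)*len(v)/n, c)
	}
	total := 0
	for i := 0; i < n; i++ {
		total += <-c // partial sums arrive in any order
	}
	return total
}

func main() {
	v := make([]int, 1000)
	for i := range v {
		v[i] = i + 1
	}
	// The CPU count varies by machine; the sum of 1..1000 is 500500.
	fmt.Println(runtime.NumCPU(), parallelSum(v))
}
```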

Related

How many write operation can be blocked in chan

I use chan for goroutines to write/read, if the chan is full, the writing goroutines will be blocked until another goroutine read from the chan.
I know there is a recvq and sendq double linked list in chan to record blocked goroutines. My question is how many goroutines totally can be blocked if chan is not read? Does this depend on memory size?
TLDR: as long as your app can fit into memory and can run, you won't have any problems with channel waiting queues.
The language spec does not limit the number of waiting goroutines for a channel, so there's no practical limit.
The runtime implementation might limit the waiting goroutines to an insignificant high value (e.g. due to pointer size, integer counter size or the likes), but to reach such an implementation limit, you would run out of memory much-much sooner.
Goroutines are lightweight threads, but they do require a small memory. They start with a small stack which is around a KB, so even if you estimate it to 1 KB, if you have a million goroutines, that's already 1 GB memory at least, and if you have a billion goroutines, that's 1 TB. And a billion is nowhere near to the max value of an int64 for example.
Your CPU and Go runtime would have trouble managing billions of goroutines earlier than running into implementation specific waiting queue limits.
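To make this concrete, here is a small sketch (the function name blockedSenders and the count of 10000 are arbitrary choices for illustration) in which many goroutines send on the same unbuffered channel. Nothing in the channel itself caps how many of them may be waiting; the program only needs enough memory for their stacks.

```go
package main

import (
	"fmt"
	"sync"
)

// blockedSenders starts n goroutines that all send on one unbuffered
// channel, then receives every value. Any sender that reaches the send
// before a receiver is ready is parked in the channel's send queue.
func blockedSenders(n int) int {
	ch := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ch <- 1 // blocks until a receiver takes the value
		}()
	}
	received := 0
	for i := 0; i < n; i++ {
		received += <-ch // each receive releases one waiting sender
	}
	wg.Wait()
	return received
}

func main() {
	fmt.Println(blockedSenders(10000)) // prints 10000
}
```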
Yes, it depends on the memory. It depends on the len of channel as mentioned in the docs, a buffered channel is blocked once the chan is full, and gets unblocked, when another value is added to the chan. channels
Code snippet from docs:
var sem = make(chan int, MaxOutstanding)
func handle(r *Request) {
sem <- 1 // Wait for active queue to drain.
process(r) // May take a long time.
<-sem // Done; enable next request to run.
}
func Serve(queue chan *Request) {
for {
req := <-queue
go handle(req) // Don't wait for handle to finish.
}
}
Once MaxOutstanding handlers are executing process, any more will block trying to send into the filled channel buffer, until one of the existing handlers finishes and receives from the buffer.
My question is how many goroutines totally can be blocked if chan is not read?
All of them.

Most efficient number of goroutines on this machine

So I do have a concurrent quicksort implementation written by me. It looks like this:
func Partition(A []int, p int, r int) int {
index := MedianOf3(A, p, r)
swapArray(A, index, r)
x := A[r]
j := p - 1
i := p
for i < r {
if A[i] <= x {
j++
tmp := A[j]
A[j] = A[i]
A[i] = tmp
}
i++
}
swapArray(A, j+1, r)
return j + 1
}
func ConcurrentQuicksort(A []int, p int, r int) {
wg := sync.WaitGroup{}
if p < r {
q := Partition(A, p, r)
select {
case sem <- true:
wg.Add(1)
go func() {
ConcurrentQuicksort(A, p, q-1)
<-sem
wg.Done()
}()
default:
Quicksort(A, p, q-1)
}
select {
case sem <- true:
wg.Add(1)
go func() {
ConcurrentQuicksort(A, q+1, r)
<-sem
wg.Done()
}()
default:
Quicksort(A, q+1, r)
}
}
wg.Wait()
}
func Quicksort(A []int, p int, r int) {
if p < r {
q := Partition(A, p, r)
Quicksort(A, p, q-1)
Quicksort(A, q+1, r)
}
}
I have a buffered channel, sem, which I use to limit the number of goroutines running (if it reaches that number, I don't start another goroutine, I just do the normal quicksort on the subarray). First I started with 100, then I changed to 50, then 20. The benchmarks got slightly better. But after switching to 10, the times started to get bigger again. So there is some number, at least for my hardware, that makes the algorithm run most efficiently.
When I was implementing this, I actually saw some SO question about the best number of goroutines, and now I cannot find it (stupid Chrome history doesn't save all visited sites). Do you know how to calculate such a thing? And it would be best if I didn't have to hardcode it and could just let the program work it out itself.
P.S I have nonconcurrent Quicksort, which runs about 1.7x slower than this. As you can see in my code, I do Quicksort, when the number of running goroutines exceeds the number I've set up earlier. I thought what about using a ConcurrentQuicksort, but not calling it with go keyword, just simply calling it, and maybe if other goroutines finish their job, the ConcurrentQuicksort which I called would start to launch up goroutines, speeding up the process (cuz as you can see Quicksort would only launch recursive quicksorts, without goroutines). I did that, and actually the time was like 10% slower than the regular Quicksort. Do you know why would that happen?
You have to experiment a bit with this stuff, but I don't think the main concern is the number of goroutines running at once. As the answer @reticentroot linked to says, it's not necessarily a problem to run a lot of simultaneous goroutines.
I think your main concern should be total number of goroutine launches. The current implementation could theoretically start a goroutine to sort just a few items, and that goroutine would spend a lot more time on startup/coordination than actual sorting.
The ideal is you only start as many goroutines as you need to get good utilization of all your CPUs. If your work items are ~equal size and your cores are ~equally busy, then starting one task per core is perfect.
Here, tasks aren't evenly sized, so you might split the sort into somewhat more tasks than you have CPUs and distribute them. (In production you would typically use a worker pool to distribute work without starting a new goroutine for every task, but I think we can get away with skipping that here.)
To get a workable number of tasks--enough to keep all cores busy, but not so many that you create lots of overhead--you can set a minimum size (initial array size/100 or whatever), and only split off sorts of arrays larger than that.
In slightly more detail, there is a bit of cost every time you send a task off to the background. For starters:
Each goroutine launch spends a little time setting up the stack and doing scheduler bookkeeping
Each task switch spends some time in the scheduler and may incur cache misses when the two goroutines are looking at different code or data
Your own coordination code (channel sends and sync ops) takes time
Other things can prevent ideal speedups from happening: you could hit a systemwide limit on e.g. memory bandwidth as Volker pointed out, some sync costs can increase as you add cores, and you can run into various trickier issues sometimes. But the setup, switching, and coordination costs are a good place to start.
The benefit that can outweigh the coordination costs is, of course, other CPUs getting work done when they'd otherwise sit idle.
I think, but haven't tested, that your problems at 50 goroutines are 1) you already reached nearly-full utilization long ago, so adding more tasks adds more coordination work without making things go faster, and 2) you're creating goroutines for tiny sorts, which may spend more of their time setting up and coordinating than they actually do sorting. And at 10 goroutines your problem might be that you're no longer achieving full CPU utilization.
If you wanted, you could test those theories by counting the number of total goroutine launches at various goroutine limits (in an atomic global counter) and measuring CPU utilization at various limits (e.g. by running your program under the Linux/UNIX time utility).
The approach I'd suggest for a divide-and-conquer problem like this is only fork off a goroutine for large enough subproblems (for quicksort, that means large enough subarrays). You can try different limits: maybe you only start goroutines for pieces that are more than 1/64th of the original array, or pieces above some static threshold like 1000 items.
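A rough sketch of that threshold idea follows. It is not the asker's code: cutoff is an assumed tuning value to experiment with, the partition is a plain Lomuto partition without the median-of-3 step, and the name thresholdQuicksort is made up for this example.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"sync"
)

// cutoff is an assumed tuning knob: subarrays at or below this size are
// sorted in the current goroutine instead of a new one.
const cutoff = 1000

// partition is a plain Lomuto partition around the last element; the
// question's median-of-3 pivot selection is omitted for brevity.
func partition(a []int) int {
	last := len(a) - 1
	pivot := a[last]
	j := 0
	for i := 0; i < last; i++ {
		if a[i] <= pivot {
			a[i], a[j] = a[j], a[i]
			j++
		}
	}
	a[j], a[last] = a[last], a[j]
	return j
}

// thresholdQuicksort forks a goroutine only when the left half is larger
// than cutoff, and always sorts the right half itself.
func thresholdQuicksort(a []int) {
	if len(a) < 2 {
		return
	}
	q := partition(a)
	left, right := a[:q], a[q+1:] // disjoint, so safe to sort concurrently
	if len(left) > cutoff {
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			thresholdQuicksort(left)
		}()
		thresholdQuicksort(right)
		wg.Wait()
		return
	}
	thresholdQuicksort(left)
	thresholdQuicksort(right)
}

func main() {
	a := rand.New(rand.NewSource(1)).Perm(200000) // fixed seed, repeatable input
	thresholdQuicksort(a)
	fmt.Println(sort.IntsAreSorted(a)) // prints true
}
```

Benchmarking a few values of cutoff (a fixed size, or a fraction of the original length) is the practical way to pick one for a given machine.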
And you meant this sort routine as an exercise, I suspect, but there are various things you can do to make your sorts faster or more robust against weird inputs. The standard library sort falls back to insertion sort for small subarrays and uses heapsort for the unusual data patterns that cause quicksort problems.
You can also look at other algorithms like radix sort for all or part of the sorting, which I played with. That sorting library is also parallel. I wound up using a minimum cutoff of 127 items before I'd hand a subarray off for other goroutines to sort, and I used an arrangement with a fixed pool of goroutines and a buffered chan to pass tasks between them. That produced decent practical speedups at the time, though it was likely not the best approach at the time and I'm almost sure it's not on today's Go scheduler. Experimentation is fun!
If the operation is CPU-bound, my experiments show that the optimal number of goroutines is the number of CPUs.

Is lock necessary when GOMAXPROCS is 1

The GOMAXPROCS variable limits the number of operating system
threads that can execute user-level Go code simultaneously.
So, if GOMAXPROCS is 1, no matter how many goroutines I have, it is safe to access variable (like map) from different goroutines without any lock. Correct?
The short answer is, "no" it is not safe. The long answer is really too long to explain in enough detail here, but I'll give a short summary and some links to articles which should help you put the pieces together.
Let's differentiate between "concurrent" and "parallel" first. Consider two functions. Running in parallel they can both be executing at the same instant on separate processors. Running concurrently either, or both, or neither may be executing but both are able to execute. If they are concurrent but not parallel then they are switching—and without channels or locks we cannot guarantee the sequence in terms of which gets where first.
It may be weird to think about "concurrent but not parallel" but consider that the opposite is quite unremarkable, parallel but not concurrent; my text editor, terminal and browser are all running in parallel but are most definitely not concurrent.
So if two (or 20,000) functions have access to the same memory, say one writes and one reads, and they are running concurrently, then perhaps the write happens first, perhaps the read happens first. There are no guarantees unless we take responsibility for the scheduling/sequencing, hence locks and channels.
Setting GOMAXPROCS to greater than 1 makes it possible for a concurrent program to run in parallel, but it may not: all concurrent goroutines may be on one CPU thread, or they may be on multiple. Thus setting GOMAXPROCS to 1 is no guarantee that concurrent processes are safe without locks or channels to orchestrate their execution.
Threads are [typically] scheduled by the operating system. See Wikipedia or your favorite repository of human knowledge. Goroutines are scheduled by Go.
Next consider this:
Even with [a] single logical processor and operating system thread, hundreds of thousands of goroutines can be scheduled to run concurrently with amazing efficiency and performance.
and this:
The problem with building concurrency into our applications is eventually our goroutines are going to attempt to access the same resources, possibly at the same time. Read and write operations against a shared resource must always be atomic. In other words reads and writes must happen by one goroutine at a time or else we create race conditions in our programs.
from this article, which explains the difference really well and references some other material you may want to look up (the article is somewhat outdated, as GOMAXPROCS no longer defaults to 1, but the general theory is still accurate).
And finally, Effective Go can be daunting when you're starting out but is a must-read. Here is the explanation of concurrency in Go.
Yes, locks are still needed, even if you are running your program on one processor. Concurrency and Parallelism are different things, you'll find a very good explanation here.
Just a little example here:
type test struct {
Num int // shared counter, accessed without synchronization
}
func main() {
runtime.GOMAXPROCS(1)
t := &test{}
go func() {
for i := 0; i < 100; i++ {
// Some computation prior to using t.Num
time.Sleep(300 * time.Microsecond)
num := t.Num
// Some computation using num
time.Sleep(300 * time.Microsecond)
t.Num = num + 1
}
}()
go func() {
for i := 0; i < 100; i++ {
num := t.Num
// Some computation using num
time.Sleep(300 * time.Microsecond)
t.Num = num + 1
}
}()
time.Sleep(1 * time.Second) // Wait goroutines to finish
fmt.Println(t.Num)
}
The sleeping time is there to represent some computation which takes some time. I wanted to keep the example runnable and simple, that's why I used it.
When running this, even on a single processor, the output is not the 200 we want. So yes, locks are necessary when accessing variables concurrently, or you'll run into problems.
Changing shared state without synchronization fails even with runtime.GOMAXPROCS(1) and just two goroutines:
func main() {
runtime.GOMAXPROCS(1)
start := make(chan struct{})
wg := &sync.WaitGroup{}
N := 2 //10, 1000, 10000, ... fails with even 2 go-routines
for i := 0; i < N; i++ {
wg.Add(1)
go func() {
defer wg.Done()
<-start
//processing state
initialState := globalState
//give another goroutine a chance, by halting this one
//and lend some processing cycles
//(also simulating "concurrent" processing of initialState)
runtime.Gosched()
if globalState != initialState {
panic(fmt.Sprintf("oops! %d != %d", initialState, globalState))
}
globalState = initialState + 1
}()
}
close(start)
wg.Wait()
log.Println(`global state:`, globalState)
}
var (
globalState int
)
Other answers went more into details - good for studying different aspects of concurrent programming.
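For contrast, here is a minimal sketch of one way to make the read-modify-write safe, using a sync.Mutex. The counter type and the function names are invented for this example. It keeps GOMAXPROCS at 1 and still yields in the middle of the update, as the examples above do, but the lock is held across the yield.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// counter guards num with a mutex so that each read-modify-write is one
// critical section.
type counter struct {
	mu  sync.Mutex
	num int
}

// incrementWithYield reads the value, yields to the scheduler, then
// writes back, all while holding the lock.
func (c *counter) incrementWithYield() {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := c.num
	runtime.Gosched() // another goroutine may run here, but it cannot take the lock
	c.num = n + 1
}

// run starts the given number of goroutines, each incrementing perG times.
func run(goroutines, perG int) int {
	c := &counter{}
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perG; i++ {
				c.incrementWithYield()
			}
		}()
	}
	wg.Wait() // replaces the fixed one-second sleep
	return c.num
}

func main() {
	runtime.GOMAXPROCS(1)    // a single P, as in the question
	fmt.Println(run(2, 100)) // prints 200
}
```

Running the unsynchronized versions with the race detector (go run -race) is a quick way to have the toolchain report such conflicts.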

OpenMP output for "for" loop

I am new to OpenMP and I just tried to write a small program with the parallel for construct. I have trouble understanding the output of my program. I don't understand why thread number 3 prints the output before 1 and 2. Could someone offer me an explanation?
So, the program is:
#pragma omp parallel for
for (i = 0; i < 7; i++) {
printf("We are in thread number %d and are printing %d\n",
omp_get_thread_num(), i);
}
and the output is:
We are in thread number 0 and are printing 0
We are in thread number 0 and are printing 1
We are in thread number 3 and are printing 6
We are in thread number 1 and are printing 2
We are in thread number 1 and are printing 3
We are in thread number 2 and are printing 4
We are in thread number 2 and are printing 5
My processor is an Intel(R) Core(TM) i5-2410M CPU with 4 cores.
Thank you!
OpenMP makes no guarantees of the relative ordering, in time, of the execution of statements by different threads. OpenMP leaves it to the programmer to impose such ordering if it is required. In general it is not required, in many cases not even desirable, which is why OpenMP's default behaviour is as it is. The cost, in time, of imposing such an ordering is likely to be significant.
I suggest you run much larger tests several times. You should observe that the cross-thread sequencing of events is, essentially, random.
If you want to print in order then you can use the ordered construct
#pragma omp parallel for ordered
for (i = 0; i < 7; i++) {
#pragma omp ordered
printf("We are in thread number %d and are printing %d\n",
omp_get_thread_num(), i);
}
I assume this requires threads from larger iterations to wait for the ones with lower iteration so it will have an effect on performance. You can see it used here http://bisqwit.iki.fi/story/howto/openmp/#ExampleCalculatingTheMandelbrotFractalInParallel
That draws the Mandelbrot set as characters using ordered. A much faster solution than using ordered is to fill an array of the characters in parallel and then draw them serially (try the code). Since one uses OpenMP for performance I have never found a good reason to use ordered, but I'm sure it has its use somewhere.

Go-lang parallel segment runs slower than series segment

I have built an epidemic mathematics model which is fairly computationally intense in Go. I'm trying now to build a set of systems to test my model, where I change an input and expect a different output. I built a version in series to slowly increase HIV prevalence and see effects on HIV deaths. It takes ~200 milliseconds to run.
for q = 0.0; q < 1000; q++ {
inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
results := costAnalysisHandler(inputs)
fmt.Println(results.HivDeaths[20])
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small changes are important as we will be running millions of runs with different inputs, so would like to make it as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
results := costAnalysisHandler(inputs)
fmt.Println(results.HivDeaths[20])
ch <- ChData{int(q), results.HivDeaths[20]}
}(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
theResults := <-ch
fmt.Println(theResults)
}
Any thoughts are very much appreciated.
There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs the cost of communication if the program was taking 200ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time--e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: And as @Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
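A minimal sketch of the "brand new Inputs per analysis" option is shown below. Inputs, clone, analyze, and runAll are hypothetical stand-ins, since the real costanalysis types are not shown in the question. The sketch also assumes every run should start from the same base values, which differs from the serial loop in the question, where the multiplication compounds from one iteration to the next.

```go
package main

import (
	"fmt"
	"sync"
)

// Inputs is a hypothetical stand-in for the question's struct.
type Inputs struct {
	Prevalence []float32
}

// clone returns a deep copy, so the slice's backing array is not shared.
func (in Inputs) clone() Inputs {
	out := Inputs{Prevalence: make([]float32, len(in.Prevalence))}
	copy(out.Prevalence, in.Prevalence)
	return out
}

// analyze is a placeholder for costAnalysisHandler: it mutates its own
// copy and returns a value derived from it.
func analyze(in Inputs, scale float32) float32 {
	in.Prevalence[0] *= scale
	return in.Prevalence[0]
}

// runAll gives each goroutine a private clone and a private result slot,
// so no two goroutines touch the same memory.
func runAll(base Inputs, scales []float32) []float32 {
	results := make([]float32, len(scales))
	var wg sync.WaitGroup
	for i, s := range scales {
		wg.Add(1)
		go func(i int, s float32, in Inputs) {
			defer wg.Done()
			results[i] = analyze(in, s)
		}(i, s, base.clone())
	}
	wg.Wait()
	return results
}

func main() {
	base := Inputs{Prevalence: []float32{0.5, 0.25}}
	// base is left unchanged because every goroutine worked on a clone.
	fmt.Println(runAll(base, []float32{1, 2, 4}), base.Prevalence[0])
}
```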
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if .2 seconds of waiting is all that performance work can save you here, it's not worth it.
Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPUs you have before running your code. Or call runtime.GOMAXPROCS(runtime.NumCPU()) before you start the parallel computations.
I see two issues related to parallel performance,
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one cpu/core. Typically one would set it for the number of processors in the machine but the ideal setting can vary.
The second problem is a bit trickier, which is that your code doesn't appear to be parallelizing very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably be using some kind of worker pool, running a limited number of simultaneous computations (a good starting number would be the same as GOMAXPROCS) rather than trying to do 1000 at once.
See: http://golang.org/doc/faq#Why_no_multi_CPU
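The worker pool suggested in the answers can be sketched roughly as follows. The job and result types, simulate, and runPool are all hypothetical; simulate only squares its input as a placeholder for the real analysis, and the worker count is taken from runtime.GOMAXPROCS(0) as a starting point rather than a tuned value.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// job and result are hypothetical types for this sketch.
type job struct {
	id int
	q  float64
}

type result struct {
	id    int
	value float64
}

// simulate is a placeholder for an expensive analysis with no shared state.
func simulate(q float64) float64 {
	return q * q
}

// runPool processes qs with a fixed number of worker goroutines and
// returns the results indexed by job id. workers must be at least 1.
func runPool(qs []float64, workers int) []float64 {
	jobs := make(chan job)
	results := make(chan result, len(qs)) // large enough that workers never block

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs { // each worker pulls jobs until the channel closes
				results <- result{j.id, simulate(j.q)}
			}
		}()
	}

	for i, q := range qs {
		jobs <- job{i, q}
	}
	close(jobs) // no more work; workers exit their loops
	wg.Wait()
	close(results)

	out := make([]float64, len(qs))
	for r := range results {
		out[r.id] = r.value // restore input order regardless of completion order
	}
	return out
}

func main() {
	qs := []float64{1, 2, 3, 4}
	fmt.Println(runPool(qs, runtime.GOMAXPROCS(0))) // prints [1 4 9 16]
}
```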
