Is a lock necessary when GOMAXPROCS is 1?

The GOMAXPROCS variable limits the number of operating system
threads that can execute user-level Go code simultaneously.
So, if GOMAXPROCS is 1, then no matter how many goroutines I have, it is safe to access a variable (like a map) from different goroutines without any lock. Correct?

The short answer is "no", it is not safe. The long answer is really too long to explain in enough detail here, but I'll give a short summary and some links to articles which should help you put the pieces together.
Let's differentiate between "concurrent" and "parallel" first. Consider two functions. Running in parallel, they can both be executing at the same instant on separate processors. Running concurrently, either, both, or neither may be executing at a given instant, but both are able to execute. If they are concurrent but not parallel, then the runtime is switching between them, and without channels or locks we cannot guarantee which one gets where first.
It may be weird to think about "concurrent but not parallel", but consider that the opposite, parallel but not concurrent, is quite unremarkable: my text editor, terminal, and browser are all running in parallel but are most definitely not concurrent.
So if two (or 20,000) functions have access to the same memory, say one writes and one reads, and they are running concurrently, then perhaps the write happens first, perhaps the read happens first. There are no guarantees unless we take responsibility for the scheduling/sequencing, hence locks and channels.
Setting GOMAXPROCS to greater than 1 makes it possible for a concurrent program to run in parallel, but it may not: all concurrent goroutines may be on one CPU thread, or they may be on multiple. Thus setting GOMAXPROCS to 1 is no guarantee that concurrent goroutines are safe without locks or channels to orchestrate their execution.
Threads are [typically] scheduled by the operating system. See Wikipedia or your favorite repository of human knowledge. Goroutines are scheduled by Go.
Next consider this:
Even with [a] single logical processor and operating system thread, hundreds of thousands of goroutines can be scheduled to run concurrently with amazing efficiency and performance.
and this:
The problem with building concurrency into our applications is that eventually our goroutines are going to attempt to access the same resources, possibly at the same time. Read and write operations against a shared resource must always be atomic. In other words, reads and writes must happen by one goroutine at a time or else we create race conditions in our programs.
from this article, which explains the difference really well and references some other material you may want to look up (the article is somewhat outdated, as GOMAXPROCS no longer defaults to 1, but the general theory is still accurate).
And finally, Effective Go can be daunting when you're starting out but is a must-read. Here is the explanation of concurrency in Go.

Yes, locks are still needed, even if you are running your program on one processor. Concurrency and parallelism are different things; you'll find a very good explanation here.
Just a little example here:
package main

import (
    "fmt"
    "runtime"
    "time"
)

// test is the shared struct that both goroutines read and write.
type test struct {
    Num int
}

func main() {
    runtime.GOMAXPROCS(1)
    t := &test{}
    go func() {
        for i := 0; i < 100; i++ {
            // Some computation prior to using t.Num
            time.Sleep(300 * time.Microsecond)
            num := t.Num
            // Some computation using num
            time.Sleep(300 * time.Microsecond)
            t.Num = num + 1
        }
    }()
    go func() {
        for i := 0; i < 100; i++ {
            num := t.Num
            // Some computation using num
            time.Sleep(300 * time.Microsecond)
            t.Num = num + 1
        }
    }()
    time.Sleep(1 * time.Second) // Wait for the goroutines to finish
    fmt.Println(t.Num)
}
The sleeping time is there to represent some computation that takes time; I wanted to keep the example runnable and simple, which is why I used it.
When running this, even on a single processor, the output is not the 200 we want. So yes, locks are necessary when accessing variables concurrently, or you'll run into problems.
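For contrast, here is a minimal sketch of the same idea with the read-modify-write protected by a sync.Mutex (the mutex field added to the hypothetical test struct is my addition, and a sync.WaitGroup replaces the final Sleep); it reliably prints 200, even with GOMAXPROCS(1):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

type test struct {
    mu  sync.Mutex
    Num int
}

func main() {
    runtime.GOMAXPROCS(1)
    t := &test{}
    var wg sync.WaitGroup
    for g := 0; g < 2; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 100; i++ {
                t.mu.Lock()
                t.Num++ // the read-modify-write is now atomic with respect to the other goroutine
                t.mu.Unlock()
            }
        }()
    }
    wg.Wait()
    fmt.Println(t.Num) // always 200
}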

Changing shared state, with runtime.GOMAXPROCS(1) assumed, will fail even with just two goroutines:
package main

import (
    "fmt"
    "log"
    "runtime"
    "sync"
)

var globalState int

func main() {
    runtime.GOMAXPROCS(1)
    start := make(chan struct{})
    wg := &sync.WaitGroup{}
    N := 2 // 10, 1000, 10000, ... fails with even 2 goroutines
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            <-start
            // Processing state.
            initialState := globalState
            // Give another goroutine a chance by halting this one
            // and lending it some processing cycles
            // (also simulating "concurrent" processing of initialState).
            runtime.Gosched()
            if globalState != initialState {
                panic(fmt.Sprintf("oops! %d != %d", initialState, globalState))
            }
            globalState = initialState + 1
        }()
    }
    close(start)
    wg.Wait()
    log.Println(`global state:`, globalState)
}
Other answers went into more detail, and they are good for studying different aspects of concurrent programming.
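For a single integer like globalState, one hedged alternative to a mutex is the sync/atomic package. This sketch drops the panic check and instead makes the increment itself indivisible, so the Gosched-induced interleaving can no longer tear the read-modify-write:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "sync/atomic"
)

var globalState int64

func main() {
    runtime.GOMAXPROCS(1)
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            runtime.Gosched()                // yield, as in the example above
            atomic.AddInt64(&globalState, 1) // one indivisible read-modify-write
        }()
    }
    wg.Wait()
    fmt.Println("global state:", atomic.LoadInt64(&globalState)) // always 1000
}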

Does a goroutine necessarily run on a different CPU?

The following excerpt is from https://go.dev/doc/effective_go#parallel.
We launch the pieces independently in a loop, one per CPU. They can complete in any order but it doesn't matter; we just count the completion signals by draining the channel after launching all the goroutines.
const numCPU = 4 // number of CPU cores

func (v Vector) DoAll(u Vector) {
    c := make(chan int, numCPU) // Buffering optional but sensible.
    for i := 0; i < numCPU; i++ {
        go v.DoSome(i*len(v)/numCPU, (i+1)*len(v)/numCPU, u, c)
    }
    // Drain the channel.
    for i := 0; i < numCPU; i++ {
        <-c // wait for one task to complete
    }
    // All done.
}
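Effective Go goes on to note that you don't have to hard-code the core count: the runtime can report it, and passing 0 to runtime.GOMAXPROCS just queries the current setting without changing it. Roughly:

import "runtime"

// Instead of the constant above, ask the runtime at startup:
var numCPU = runtime.NumCPU()

// runtime.GOMAXPROCS(0) reports the current setting without changing it.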
Why does the article specify "one per CPU"? Multiple goroutines need not be executed on different CPUs. In fact, the last paragraph in the sub-section reminds the reader that concurrency is not parallelism:
Be sure not to confuse the ideas of concurrency—structuring a program as independently executing components—and parallelism—executing calculations in parallel for efficiency on multiple CPUs.
Does a goroutine necessarily run on a different CPU?
No, but it might. Nothing to see here.
why the article specifies "one per CPU"
It could have said 5 or 2. There is really nothing of importance hidden here. This is just an example, not a specification of goroutine scheduling.
Why does the article specify "one per CPU"? Multiple goroutines need not be executed on different CPUs. In fact, the last paragraph in the sub-section reminds the reader that concurrency is not parallelism.
Goroutines are multiplexed onto OS threads (M:N scheduling), and each goroutine can use at most one thread at a time, but you cannot predict whether the runtime uses one CPU to execute all four (GOMAXPROCS) goroutines, or spreads them over two, three, or four CPUs. That depends on many factors, and all of this complexity is hidden by the Go runtime.

How many write operations can be blocked in a chan

I use a chan for goroutines to write to and read from; if the chan is full, the writing goroutines are blocked until another goroutine reads from the chan.
I know there is a recvq and a sendq doubly linked list in a chan to record blocked goroutines. My question is: how many goroutines in total can be blocked if the chan is never read? Does this depend on memory size?
TL;DR: as long as your app fits into memory and can run, you won't have any problems with channel waiting queues.
The language spec does not limit the number of waiting goroutines for a channel, so there's no practical limit.
The runtime implementation might cap the number of waiting goroutines at some very high value (e.g. due to pointer size or integer counter size), but to reach such an implementation limit you would run out of memory much, much sooner.
Goroutines are lightweight threads, but they do require some memory. They start with a small stack on the order of a kilobyte, so even if you estimate it at 1 KB, a million goroutines already need at least 1 GB of memory, and a billion goroutines need 1 TB. And a billion is nowhere near the max value of an int64, for example.
Your CPU and the Go runtime would have trouble managing billions of goroutines long before you ran into implementation-specific waiting-queue limits.
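To see the queue in action, here is a minimal sketch (the count is arbitrary) that parks 100,000 senders on a single unbuffered channel and then drains them; the only real cost is the goroutines' memory:

package main

import "fmt"

func main() {
    ch := make(chan int)
    const n = 100000
    for i := 0; i < n; i++ {
        go func(i int) {
            ch <- i // each sender blocks here until a receiver arrives
        }(i)
    }
    sum := 0
    for i := 0; i < n; i++ {
        sum += <-ch // draining unblocks one sender at a time
    }
    fmt.Println("received", n, "values, sum:", sum)
}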
Yes, it depends on the memory. Note also that whether a send blocks depends on the channel's capacity: as described in the channels section of the docs, a send on a full buffered channel blocks, and it unblocks when another goroutine receives a value from the channel.
Code snippet from docs:
var sem = make(chan int, MaxOutstanding)

func handle(r *Request) {
    sem <- 1   // Wait for active queue to drain.
    process(r) // May take a long time.
    <-sem      // Done; enable next request to run.
}

func Serve(queue chan *Request) {
    for {
        req := <-queue
        go handle(req) // Don't wait for handle to finish.
    }
}
Once MaxOutstanding handlers are executing process, any more will block trying to send into the filled channel buffer, until one of the existing handlers finishes and receives from the buffer.
My question is how many goroutines totally can be blocked if chan is not read?
All of them.

Most efficient number of goroutines on this machine

So I have a concurrent quicksort implementation that I wrote. It looks like this:
// MedianOf3 and swapArray are helpers defined elsewhere in my code;
// sem is the buffered channel described below, which limits the number
// of concurrently running goroutines.

func Partition(A []int, p int, r int) int {
    index := MedianOf3(A, p, r)
    swapArray(A, index, r)
    x := A[r]
    j := p - 1
    i := p
    for i < r {
        if A[i] <= x {
            j++
            tmp := A[j]
            A[j] = A[i]
            A[i] = tmp
        }
        i++
    }
    swapArray(A, j+1, r)
    return j + 1
}

func ConcurrentQuicksort(A []int, p int, r int) {
    wg := sync.WaitGroup{}
    if p < r {
        q := Partition(A, p, r)
        select {
        case sem <- true: // a slot is free: sort the left part in a new goroutine
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, p, q-1)
                <-sem
                wg.Done()
            }()
        default: // no slot free: sort the left part sequentially
            Quicksort(A, p, q-1)
        }
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, q+1, r)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, q+1, r)
        }
    }
    wg.Wait()
}

func Quicksort(A []int, p int, r int) {
    if p < r {
        q := Partition(A, p, r)
        Quicksort(A, p, q-1)
        Quicksort(A, q+1, r)
    }
}
I have a sem buffered channel, which I use to limit the number of running goroutines (if the limit is reached, I don't set up another goroutine; I just run the normal quicksort on the subarray). First I started with 100, then changed it to 50, then 20, and the benchmarks got slightly better. But after switching to 10, the times started to get bigger again. So there is some arbitrary number, at least for my hardware, that makes the algorithm run most efficiently.
When I was implementing this, I actually saw an SO question about the best number of goroutines, and now I cannot find it (apparently Chrome's history doesn't save every visited site). Do you know how to calculate such a thing? It would be best if I didn't have to hardcode it and the program could just work it out itself.
P.S. I have a nonconcurrent Quicksort, which runs about 1.7x slower than this. As you can see in my code, I fall back to Quicksort when the number of running goroutines exceeds the limit I set up earlier. I thought about using ConcurrentQuicksort there instead, but calling it without the go keyword, simply calling it, so that if other goroutines finished their work, this call could start launching goroutines again and speed up the process (because, as you can see, Quicksort only launches recursive quicksorts, without goroutines). I did that, and it was actually about 10% slower than the regular Quicksort. Do you know why that would happen?
You have to experiment a bit with this stuff, but I don't think the main concern is how many goroutines run at once. As the answer @reticentroot linked to says, it's not necessarily a problem to run a lot of simultaneous goroutines.
I think your main concern should be total number of goroutine launches. The current implementation could theoretically start a goroutine to sort just a few items, and that goroutine would spend a lot more time on startup/coordination than actual sorting.
The ideal is you only start as many goroutines as you need to get good utilization of all your CPUs. If your work items are ~equal size and your cores are ~equally busy, then starting one task per core is perfect.
Here, tasks aren't evenly sized, so you might split the sort into somewhat more tasks than you have CPUs and distribute them. (In production you would typically use a worker pool to distribute work without starting a new goroutine for every task, but I think we can get away with skipping that here.)
To get a workable number of tasks (enough to keep all cores busy, but not so many that you create lots of overhead), you can set a minimum size (initial array size/100 or whatever) and only split off sorts of arrays larger than that.
In slightly more detail, there is a bit of cost every time you send a task off to the background. For starters:
Each goroutine launch spends a little time setting up the stack and doing scheduler bookkeeping
Each task switch spends some time in the scheduler and may incur cache misses when the two goroutines are looking at different code or data
Your own coordination code (channel sends and sync ops) takes time
Other things can prevent ideal speedups from happening: you could hit a systemwide limit on e.g. memory bandwidth as Volker pointed out, some sync costs can increase as you add cores, and you can run into various trickier issues sometimes. But the setup, switching, and coordination costs are a good place to start.
The benefit that can outweigh the coordination costs is, of course, other CPUs getting work done when they'd otherwise sit idle.
I think, but haven't tested, that your problems at 50 goroutines are 1) you already reached nearly-full utilization long ago, so adding more tasks adds more coordination work without making things go faster, and 2) you're creating goroutines for tiny sorts, which may spend more of their time setting up and coordinating than they actually do sorting. And at 10 goroutines your problem might be that you're no longer achieving full CPU utilization.
If you wanted, you could test those theories by counting the number of total goroutine launches at various goroutine limits (in an atomic global counter) and measuring CPU utilization at various limits (e.g. by running your program under the Linux/UNIX time utility).
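For the counting half, a hedged sketch: a package-level atomic counter (the name launches is hypothetical, and you need the sync/atomic import), bumped at every go site in the question's code:

var launches int64 // package-level counter; import "sync/atomic"

// At every site that starts a goroutine:
atomic.AddInt64(&launches, 1)
go func() {
    ConcurrentQuicksort(A, p, q-1)
    <-sem
    wg.Done()
}()

// After the top-level sort returns:
fmt.Println("goroutine launches:", atomic.LoadInt64(&launches))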
The approach I'd suggest for a divide-and-conquer problem like this is only fork off a goroutine for large enough subproblems (for quicksort, that means large enough subarrays). You can try different limits: maybe you only start goroutines for pieces that are more than 1/64th of the original array, or pieces above some static threshold like 1000 items.
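A minimal sketch of that idea, assuming the Partition and Quicksort functions from the question and a hypothetical threshold value; only subarrays above the threshold are handed to a new goroutine, and the right half is handled by iteration rather than recursion:

const forkThreshold = 1000 // tune experimentally; hypothetical value

func ThresholdQuicksort(A []int, p, r int, wg *sync.WaitGroup) {
    for p < r {
        q := Partition(A, p, r)
        if q-p > forkThreshold { // left part is large: sort it in a goroutine
            wg.Add(1)
            go func(lo, hi int) {
                defer wg.Done()
                ThresholdQuicksort(A, lo, hi, wg)
            }(p, q-1)
        } else { // left part is small: sort inline, no goroutine overhead
            Quicksort(A, p, q-1)
        }
        p = q + 1 // iterate on the right part instead of recursing
    }
}

// Usage sketch:
//     var wg sync.WaitGroup
//     ThresholdQuicksort(A, 0, len(A)-1, &wg)
//     wg.Wait()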
And you meant this sort routine as an exercise, I suspect, but there are various things you can do to make your sorts faster or more robust against weird inputs. The standard library sort falls back to insertion sort for small subarrays and uses heapsort for the unusual data patterns that cause quicksort problems.
You can also look at other algorithms like radix sort for all or part of the sorting, which I played with. That sorting library is also parallel. I wound up using a minimum cutoff of 127 items before I'd hand a subarray off for other goroutines to sort, and I used an arrangement with a fixed pool of goroutines and a buffered chan to pass tasks between them. That produced decent practical speedups at the time, though it was likely not the best approach at the time and I'm almost sure it's not on today's Go scheduler. Experimentation is fun!
If the operation is CPU bound, my experiments show that the optimum is the number of CPUs.

Write operation cost

I have a Go program which writes strings into a file. I have a loop which iterates 20000 times, and in each iteration I write around 20-30 strings into the file. I just want to know the best way to write them to the file.
Approach 1: Keep the file open from the start of the program and write every string to it as it is produced. That makes 20000*30 write operations.
Approach 2: Use a bytes.Buffer, store everything in the buffer, and write it to the file once at the end. Also, in this case, does it matter whether the file is opened at the beginning of the program or at the end?
I am assuming approach 2 should work better. Can someone confirm this with a reason? How is writing everything at once better than writing periodically, given that the file is open either way?
I am using f.WriteString(<string>) and buffer.WriteString(<some string>), where buffer is of type bytes.Buffer and f is the open file.
The bufio package was created exactly for this kind of task. Instead of making a syscall for each Write call, bufio.Writer buffers up to a fixed number of bytes in internal memory before making a syscall. After a syscall the internal buffer is reused for the next portion of data.
Compared to your second approach, bufio.Writer:
- makes more syscalls (N/S instead of 1)
- uses less memory (S bytes instead of N bytes)
where S is the buffer size (which can be specified via bufio.NewWriterSize) and N is the total size of the data that needs to be written.
Example usage (https://play.golang.org/p/AvBE1d6wpT):
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    f, err := os.Create("file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    w := bufio.NewWriter(f)
    fmt.Fprint(w, "Hello, ")
    fmt.Fprint(w, "world!")
    if err := w.Flush(); err != nil { // Don't forget to flush!
        log.Fatal(err)
    }
}
The operations that take time when writing to files are the syscalls and the disk I/O. The fact that the file is open doesn't cost you anything, so naively we could say that the second method is best.
Now, as you may know, your OS doesn't write directly into files; it uses an internal in-memory cache for written files and does the real I/O later. I don't know the exact details of that, and generally speaking I don't need to.
What I would advise is a middle-ground solution: fill a buffer on each loop iteration and write it out, for N writes in total. That cuts out a big part of the syscalls and (potentially) the disk writes, without the buffer consuming too much memory (depending on the size of your strings, that may be a point to take into account).
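A minimal sketch of that middle ground, assuming a hypothetical stringsForIteration helper that returns the 20-30 strings of one iteration; the buffer is reused, and the file sees one Write per iteration instead of ~30:

f, err := os.Create("file.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

var buf bytes.Buffer
for i := 0; i < 20000; i++ {
    buf.Reset() // reuse the same allocation every iteration
    for _, s := range stringsForIteration(i) { // hypothetical helper
        buf.WriteString(s)
    }
    if _, err := buf.WriteTo(f); err != nil { // one write per iteration
        log.Fatal(err)
    }
}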
I would suggest benchmarking to find the best solution, but due to the caching done by the system, benchmarking disk I/O is a real nightmare.
Syscalls are not cheap, so the second approach is better.
You can use the lat_syscall tool from lmbench to measure how long a single write call takes:
$ ./lat_syscall write
Simple write: 0.1522 microseconds
So, on my system it will take approximately 20000 * 0.15μs = 3ms extra time just to call write for every string.
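If you would rather measure from Go itself, here is a hedged benchmark sketch (placed in a _test.go file and run with go test -bench .) that writes to os.DevNull to take the disk mostly out of the picture:

package main

import (
    "bufio"
    "os"
    "testing"
)

func BenchmarkDirectWrites(b *testing.B) {
    f, err := os.Create(os.DevNull) // writes are still syscalls, but no disk noise
    if err != nil {
        b.Fatal(err)
    }
    defer f.Close()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        f.WriteString("some string\n") // one write syscall per string
    }
}

func BenchmarkBufferedWrites(b *testing.B) {
    f, err := os.Create(os.DevNull)
    if err != nil {
        b.Fatal(err)
    }
    defer f.Close()
    w := bufio.NewWriter(f) // 4 KB buffer by default
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        w.WriteString("some string\n") // buffered; a syscall only when the buffer fills
    }
    w.Flush()
}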

Go-lang parallel segment runs slower than series segment

I have built an epidemic mathematical model in Go which is fairly computationally intense. I'm now trying to build a set of systems to test my model, where I change an input and expect a different output. I built a version in series that slowly increases HIV prevalence and observes the effect on HIV deaths. It takes ~200 milliseconds to run.
for q = 0.0; q < 1000; q++ {
    inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
    results := costAnalysisHandler(inputs)
    fmt.Println(results.HivDeaths[20])
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds. These small differences matter because we will be running millions of runs with different inputs, so I would like to make this as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
    go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
        inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
        results := costAnalysisHandler(inputs)
        fmt.Println(results.HivDeaths[20])
        ch <- ChData{int(q), results.HivDeaths[20]}
    }(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
    theResults := <-ch
    fmt.Println(theResults)
}
Any thoughts are very much appreciated.
There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs the cost of communication given the program takes 200 ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time, e.g. make each goroutine do analyses for a range of 10 q values instead of just one, as in the sketch below. (Edit: and as @Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
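A hedged sketch of that chunking, reusing the names from the question (cloneInputs is a hypothetical deep-copy helper, added here to avoid the race discussed next; the exact update arithmetic should be made to match the serial version):

const chunkSize = 10 // each goroutine handles 10 q values

ch := make(chan ChData)
for start := 0; start < 1000; start += chunkSize {
    go func(start int) {
        in := cloneInputs(inputs) // hypothetical deep copy: no shared state
        for q := start; q < start+chunkSize; q++ {
            in.CountryProfile.HivPrevalenceAdultsByGroup[0] *= float32(math.Pow(1.00001, float64(q)))
            results := costAnalysisHandler(in)
            ch <- ChData{q, results.HivDeaths[20]}
        }
    }(start)
}
for i := 0; i < 1000; i++ {
    fmt.Println(<-ch)
}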
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if 0.2 seconds of waiting is all that performance work can save you here, it's not worth it.
Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPUs you have before running your code, or call runtime.GOMAXPROCS(runtime.NumCPU()) before you start the parallel computations.
I see two issues related to parallel performance,
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one CPU/core. Typically one would set it to the number of processors in the machine, but the ideal setting can vary.
The second problem is a bit trickier: your code doesn't appear to parallelize very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably use some kind of worker pool, running a limited number of simultaneous computations (a good starting number would be the same as GOMAXPROCS) rather than trying to do 1000 at once; see the sketch below.
See: http://golang.org/doc/faq#Why_no_multi_CPU
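For illustration, a minimal worker-pool sketch under those assumptions; the job payload (an int standing in for the real model inputs) and the squaring work are placeholders:

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    jobs := make(chan int)
    results := make(chan int)
    workers := runtime.NumCPU() // one worker per core is a good starting point

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for q := range jobs {
                results <- q * q // stand-in for the real computation
            }
        }()
    }
    go func() { // close results once all workers are done
        wg.Wait()
        close(results)
    }()
    go func() { // feed the queue
        for q := 0; q < 1000; q++ {
            jobs <- q
        }
        close(jobs)
    }()
    for r := range results {
        fmt.Println(r)
    }
}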
