Limiting the number of concurrent tasks running - go

So I come across this issue with Go a lot. Let's say I have a text file with 100,000 lines of text. Now I wanna save all these lines to a db. So I would do something like this:
file, _ := ioutil.ReadFile("file.txt")
fileLines := strings.Split(string(file), "\n")
Now I would loop over all the lines in the file:
for _, l := range fileLines{
saveToDB(l)
}
Now I wanna run this saveToDB func concurrently:
var wg sync.WaitGroup
for _, l := range fileLines{
wg.Add(1)
go saveToDB(l, &wg)
}
wg.Wait()
I don't know if this is a problem or not, but that would run 100,000 concurrent functions. Is there any way of saying "run 100 concurrent functions, wait for all of those to finish, then run 100 more"?
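for i := 0; i < len(fileLines); i += 100 {
end := i + 100
if end > len(fileLines) {
end = len(fileLines)
}
for _, l := range fileLines[i:end] {
wg.Add(1)
go saveToDB(l, &wg)
}
wg.Wait() // wait for this batch of up to 100 before starting the next
}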
Do I need to do something like that or is there a cleaner way to go about this? Or is me running the 100,000 concurrent tasks not an issue?

I think the best approach for this would be to keep a pool of worker goroutines, dispatch the work to them over a channel, and then close the channel so they exit.
Something like this:
// create a channel for work "tasks"
ch := make(chan string)
wg := sync.WaitGroup{}
// start the workers
for t := 0; t < 100; t++{
wg.Add(1)
go saveToDB(ch, &wg)
}
// push the lines to the queue channel for processing
for _, line := range fileLines {
ch <- line
}
// this will cause the workers to stop and exit their receive loop
close(ch)
// make sure they all exit
wg.Wait()
The saveToDB function then looks like this:
func saveToDB(ch chan string, wg *sync.WaitGroup) {
// consume lines until the channel is closed
for line := range ch {
// do work
actuallySaveToDB(line)
}
// we've exited the loop because the dispatcher closed the channel,
// so now we can just signal the WaitGroup that we're done
wg.Done()
}
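For reference, here is a minimal, self-contained sketch of the same worker-pool idea as a complete program. The function name saveAll, the save callback (a stand-in for the real database call), and the counter are assumptions made for illustration and are not part of the answer above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// saveAll sends every line to a fixed pool of workers and returns how many
// lines were handled. save is a hypothetical stand-in for the database call.
func saveAll(lines []string, workers int, save func(string)) int64 {
	ch := make(chan string)
	var wg sync.WaitGroup
	var saved int64

	// Start the workers; each one drains the channel until it is closed.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range ch {
				save(line)
				atomic.AddInt64(&saved, 1)
			}
		}()
	}

	// Dispatch the work, then close the channel so the workers exit.
	for _, line := range lines {
		ch <- line
	}
	close(ch)
	wg.Wait()
	return atomic.LoadInt64(&saved)
}

func main() {
	lines := make([]string, 1000)
	for i := range lines {
		lines[i] = fmt.Sprintf("line %d", i)
	}
	n := saveAll(lines, 100, func(string) {})
	fmt.Println("saved", n)
}
```

At most `workers` calls to save are in flight at any time, however many lines there are.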

Related

Deadlock on All GoRoutines When Using Channels and WaitGroup

I am new to Go and am currently attempting to run a function that creates a file and returns its filename, and to have this run concurrently.
I've decided to try and accomplish this with goroutines and a WaitGroup. When I use this approach, I end up with a list that is a couple hundred files shorter than the input size. E.g. for 5,000 files I get around 4,700 files created.
I believe this is due to some race conditions:
wg := sync.WaitGroup{}
filenames := make([]string, 0)
for i := 0; i < totalFiles; i++ {
wg.Add(1)
go func() {
defer wg.Done()
filenames = append(filenames, createFile())
}()
}
wg.Wait()
return filenames, nil
Don't communicate by sharing memory; share memory by communicating.
I tried using channels to "share memory by communicating". Whenever I do this, there appears to be a deadlock, and I can't wrap my head around why. Would anyone be able to point me in the right direction for using channels and WaitGroups together properly in order to save all of the created files to a shared data structure?
This is the code that produces the deadlock for me (fatal error: all goroutines are asleep - deadlock!):
wg := sync.WaitGroup{}
filenames := make([]string, 0)
ch := make(chan string)
for i := 0; i < totalFiles; i++ {
wg.Add(1)
go func() {
defer wg.Done()
ch <- createFile()
}()
}
wg.Wait()
for i := range ch {
filenames = append(filenames, i)
}
return filenames, nil
Thanks!
The first one has a data race. You have to protect access to filenames:
var mu sync.Mutex
for i := 0; i < totalFiles; i++ {
wg.Add(1)
go func() {
defer wg.Done()
name := createFile() // do the slow work outside the lock
mu.Lock()
filenames = append(filenames, name)
mu.Unlock()
}()
}
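For the second case, you are waiting for the goroutines to finish, but the goroutines can only finish once something reads from the channel, hence the deadlock. You can fix it by reading from the channel in a separate goroutine and waiting for that reader to finish before returning:
done := make(chan struct{})
go func() {
defer close(done)
for name := range ch {
filenames = append(filenames, name)
}
}()
wg.Wait()
close(ch) // Required, so the reader goroutine can terminate
<-done // wait until the reader has appended every filename
return filenames, nil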
There is a lock-free version, if the number of files is fixed:
filenames := make([]string, totalFiles)
for i := 0; i < totalFiles; i++ {
wg.Add(1)
go func(index int) {
defer wg.Done()
filenames[index] = createFile()
}(i)
}
wg.Wait()
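As a rough, self-contained sketch of the channel-based approach, the program below collects results in the calling goroutine while a helper goroutine closes the channel once all producers are done. The function name collect and the create callback (a stand-in for createFile) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// collect runs n producers concurrently and gathers their results over a
// channel. create is a hypothetical stand-in for createFile.
func collect(n int, create func(i int) string) []string {
	ch := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ch <- create(i)
		}(i)
	}

	// Close the channel once every producer has sent its value.
	go func() {
		wg.Wait()
		close(ch)
	}()

	// Only this goroutine touches the slice, so no mutex is needed.
	names := make([]string, 0, n)
	for name := range ch {
		names = append(names, name)
	}
	return names
}

func main() {
	names := collect(5000, func(i int) string {
		return fmt.Sprintf("file-%d.txt", i)
	})
	fmt.Println(len(names), "files created")
}
```

Because the receive loop runs in the caller, the function cannot return before every value has been appended.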

Why is this golang script giving me a deadlock ? + a few questions

I got this code from someone on GitHub and I am trying to play around with it to understand concurrency.
package main
import (
"bufio"
"fmt"
"os"
"sync"
"time"
)
var wg sync.WaitGroup
func sad(url string) string {
fmt.Printf("gonna sleep a bit\n")
time.Sleep(2 * time.Second)
return url + " added stuff"
}
func main() {
sc := bufio.NewScanner(os.Stdin)
urls := make(chan string)
results := make(chan string)
for i := 0; i < 20; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for url := range urls {
n := sad(url)
results <- n
}
}()
}
for sc.Scan() {
url := sc.Text()
urls <- url
}
for result := range results {
fmt.Printf("%s arrived\n", result)
}
wg.Wait()
close(urls)
close(results)
}
I have a few questions:
Why does this code give me a deadlock?
How can that for loop come before the code that reads input from the user? Do the goroutines wait until something is passed into the urls channel and then start doing work? I don't get this because it isn't sequential: why would reading the input from the user, putting every input in the urls channel, and then starting the goroutines be considered wrong?
Inside the for loop I have another loop that iterates over the urls channel. Does each goroutine deal with exactly one line of input, or does one goroutine handle multiple lines? How does any of this work?
Am I gathering the output correctly here?
You're mostly doing things correctly, but a few steps are out of order. The for sc.Scan() loop runs until the Scanner is done, so the for result := range results loop after it does not start in time, and no goroutine ('main' in this case) is able to receive from results. When running your example, I started the for result := range results loop before for sc.Scan() and also in its own goroutine; otherwise for sc.Scan() would never be reached.
go func() {
for result := range results {
fmt.Printf("%s arrived\n", result)
}
}()
for sc.Scan() {
url := sc.Text()
urls <- url
}
Also, because you run wg.Wait() before close(urls), the main goroutine is left blocked waiting for the 20 sad() go routines to finish. But they can't finish until close(urls) is called. So just close that channel before waiting for the waitgroup.
close(urls)
wg.Wait()
close(results)
The for-loop creates 20 goroutines, all waiting for input from the urls channel. When someone writes into this channel, one of the goroutines will pick it up and work on it. This is a typical worker-pool implementation.
Then the scanner reads input line by line and sends it to the urls channel, where one of the goroutines will pick it up and write the response to the results channel. At this point, no goroutine is reading from the results channel, so this send will block.
As the scanner reads URLs, all other goroutines will pick them up and block. So if the scanner reads more than 20 URLs, it will deadlock because all goroutines will be waiting for a reader.
If there are fewer than 20 URLs, the scanner for-loop will end, and the results will be read. However that will eventually deadlock as well, because the for-loop will terminate when the channel is closed, and there is no one there to close the channel.
To fix this, first, close the urls channel right after you finish reading. That will release all the for-loops in the goroutines. Then you should put the for-loop reading from the results channel into a goroutine, so you can call wg.Wait while results are being processed. After wg.Wait, you can close the results channel.
This does not guarantee that all items in the results channel will be read. The program may terminate before all messages are processed, so use a third channel which you close at the end of the goroutine that reads from the results channel. That is:
done := make(chan struct{})
go func() {
defer close(done)
for result := range results {
fmt.Printf("%s arrived\n", result)
}
}()
wg.Wait()
close(results)
<-done
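Putting the ordering together, the following is a minimal runnable sketch of that sequence: close urls, wait for the workers, close results, then wait on done. The function name run and the use of strings.ToUpper as placeholder work are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// run feeds inputs to a pool of workers and returns the results gathered
// by a separate consumer goroutine.
func run(inputs []string, workers int) []string {
	urls := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				results <- strings.ToUpper(u) // placeholder for real work
			}
		}()
	}

	// The consumer runs concurrently so that workers never block forever.
	var got []string
	done := make(chan struct{})
	go func() {
		defer close(done)
		for r := range results {
			got = append(got, r)
		}
	}()

	for _, u := range inputs {
		urls <- u
	}
	close(urls)    // lets the workers' range loops end
	wg.Wait()      // every worker has sent its last result
	close(results) // lets the consumer's range loop end
	<-done         // the consumer has finished, so got is safe to read
	return got
}

func main() {
	out := run([]string{"a", "b", "c"}, 20)
	fmt.Println(len(out), "results")
}
```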
I am not super happy with the previous answers, so here is a solution based on the documented behavior in the Go tour, the Go docs, and the specification.
package main
import (
"bufio"
"fmt"
"strings"
"sync"
"time"
)
var wg sync.WaitGroup
func sad(url string) string {
fmt.Printf("gonna sleep a bit\n")
time.Sleep(2 * time.Millisecond)
return url + " added stuff"
}
func main() {
// sc := bufio.NewScanner(os.Stdin)
sc := bufio.NewScanner(strings.NewReader(strings.Repeat("blah blah\n", 15)))
urls := make(chan string)
results := make(chan string)
for i := 0; i < 20; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for url := range urls {
n := sad(url)
results <- n
}
}()
}
// results is written to by many goroutines, so
// we must wait for all of them to finish before closing results,
// but we don't want to block here, so do that in a goroutine.
go func() {
wg.Wait()
close(results)
}()
go func() {
for sc.Scan() {
url := sc.Text()
urls <- url
}
close(urls) // done sending on this channel, so close it right away.
}()
for result := range results {
fmt.Printf("%s arrived\n", result)
} // the program will finish when it gets out of this loop.
// It will get out of this loop because you have made sure the results channel is closed.
}

Deadlock using channels as queues

I'm learning Go and I am trying to implement a job queue.
What I'm trying to do is:
Have the main goroutine feed lines through a channel to multiple parser workers (that parse a line into a struct), and have each parser send the struct to a channel of structs that other workers (goroutines) will process (send to database, etc).
The code looks like this:
lineParseQ := make(chan string, 5)
jobProcessQ := make(chan myStruct, 5)
doneQ := make(chan myStruct, 5)
fileName := "myfile.csv"
file, err := os.Open(fileName)
if err != nil {
log.Fatal(err)
}
defer file.Close()
reader := bufio.NewReader(file)
// Start line parsing workers and send to jobProcessQ
for i := 1; i <= 2; i++ {
go lineToStructWorker(i, lineParseQ, jobProcessQ)
}
// Process myStruct from jobProcessQ
for i := 1; i <= 5; i++ {
go WorkerProcessStruct(i, jobProcessQ, doneQ)
}
lineCount := 0
countSend := 0
for {
line, err := reader.ReadString('\n')
if err != nil && err != io.EOF {
log.Fatal(err)
}
if err == io.EOF {
break
}
lineCount++
if lineCount > 1 {
countSend++
lineParseQ <- line[:len(line)-1] // Avoid last char '\n'
}
}
for i := 0; i < countSend; i++ {
fmt.Printf("Received %+v.\n", <-doneQ)
}
close(doneQ)
close(jobProcessQ)
close(lineParseQ)
Here's a simplified playground: https://play.golang.org/p/yz84g6CJraa
the workers look like this:
func lineToStructWorker(workerID int, lineQ <-chan string, strQ chan<- myStruct ) {
for j := range lineQ {
strQ <- lineToStruct(j) // just parses the csv to a struct...
}
}
func WorkerProcessStruct(workerID int, strQ <-chan myStruct, done chan<- myStruct) {
for a := range strQ {
time.Sleep(time.Millisecond * 500) // fake long operation...
done <- a
}
}
I know the problem is related to the "done" channel because if I don't use it, there's no error, but I can't figure out how to fix it.
You don't start reading from doneQ until you've finished sending all the lines to lineParseQ, which is more lines than there is buffer space. So once the doneQ buffer is full, that send blocks, which starts filling the lineParseQ buffer, and once that's full, it deadlocks. Move either the loop sending to lineParseQ, the loop reading from doneQ, or both, to separate goroutine(s), e.g.:
go func() {
for _, line := range lines {
countSend++
lineParseQ <- line
}
close(lineParseQ)
}()
This will still deadlock at the end, because you've got a range over a channel and the close after it in the same goroutine: range continues until the channel is closed, and the close only runs after the range finishes. You need to put the closes in the appropriate places: either in the sending goroutine, or, if there are multiple senders for a given channel, in a goroutine blocked on a WaitGroup that tracks those senders.
// Start line parsing workers and send to jobProcessQ
wg := new(sync.WaitGroup)
for i := 1; i <= 2; i++ {
wg.Add(1)
go lineToStructWorker(i, lineParseQ, jobProcessQ, wg)
}
// Process myStruct from jobProcessQ
wgProcess := new(sync.WaitGroup)
for i := 1; i <= 5; i++ {
wgProcess.Add(1)
go WorkerProcessStruct(i, jobProcessQ, doneQ, wgProcess)
}
countSend := 0
go func() {
for _, line := range lines {
countSend++
lineParseQ <- line
}
close(lineParseQ)
}()
go func() {
wg.Wait()
close(jobProcessQ)
}()
go func() {
wgProcess.Wait()
close(doneQ) // closed exactly once, after every processing worker has exited
}()
for a := range doneQ {
fmt.Printf("Received %v.\n", a)
}
// ...
func lineToStructWorker(workerID int, lineQ <-chan string, strQ chan<- myStruct, wg *sync.WaitGroup) {
defer wg.Done()
for j := range lineQ {
strQ <- lineToStruct(j) // just parses the csv to a struct...
}
}
func WorkerProcessStruct(workerID int, strQ <-chan myStruct, done chan<- myStruct, wg *sync.WaitGroup) {
defer wg.Done()
for a := range strQ {
time.Sleep(time.Millisecond * 500) // fake long operation...
done <- a
}
}
Full working example here: https://play.golang.org/p/XsnewSZeb2X
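As a compact illustration of the close-after-WaitGroup rule for every stage, here is a self-contained sketch of a two-stage pipeline. The record type (a stand-in for myStruct), the function name runPipeline, and the comma-splitting parse step are assumptions for illustration and are not taken from the linked playground.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// record is a hypothetical stand-in for myStruct.
type record struct{ fields []string }

// runPipeline parses lines with 2 workers, processes them with 5 workers,
// and returns how many records came out of the far end.
func runPipeline(lines []string) int {
	lineQ := make(chan string)
	jobQ := make(chan record)
	doneQ := make(chan record)

	// Stage 1: parsers. jobQ is closed only after all parsers have exited.
	var parseWG sync.WaitGroup
	for i := 0; i < 2; i++ {
		parseWG.Add(1)
		go func() {
			defer parseWG.Done()
			for l := range lineQ {
				jobQ <- record{fields: strings.Split(l, ",")}
			}
		}()
	}
	go func() { parseWG.Wait(); close(jobQ) }()

	// Stage 2: processors. doneQ is closed only after all of them have exited.
	var procWG sync.WaitGroup
	for i := 0; i < 5; i++ {
		procWG.Add(1)
		go func() {
			defer procWG.Done()
			for r := range jobQ {
				doneQ <- r // real processing would happen here
			}
		}()
	}
	go func() { procWG.Wait(); close(doneQ) }()

	// Feeder: the single sender on lineQ closes it when finished.
	go func() {
		defer close(lineQ)
		for _, l := range lines {
			lineQ <- l
		}
	}()

	// Draining in the calling goroutine keeps us from returning early.
	count := 0
	for range doneQ {
		count++
	}
	return count
}

func main() {
	fmt.Println(runPipeline([]string{"a,b", "c,d", "e,f"}), "records processed")
}
```

All three channels are unbuffered here; nothing deadlocks because every sender and receiver runs in its own goroutine.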
Break the pipeline into stages and coordinate each stage with a sync.WaitGroup. When you know one piece of the pipeline is complete (and no one is writing to a particular channel any more), close that channel to instruct all downstream "workers" to exit, e.g.:
var wg sync.WaitGroup
for i := 1; i <= 5; i++ {
i := i
wg.Add(1)
go func() {
Worker(i)
wg.Done()
}()
}
// wg.Wait() signals the above have completed
Buffered channels are handy to handle burst workloads, but sometimes they are used to avoid deadlocks in poor designs. If you want to avoid running certain parts of your pipeline in a goroutine you can buffer some channels (matching the number of workers typically) to avoid a blockage in your main goroutine.
If you have dependent pieces that read and write and you want to avoid deadlock, ensure they are in separate goroutines. Putting each part of the pipeline in its own goroutine even removes the need for buffered channels:
// putting all channel work into separate goroutines
// removes the need for buffered channels
lineParseQ := make(chan string, 0)
jobProcessQ := make(chan myStruct, 0)
doneQ := make(chan myStruct, 0)
It's a tradeoff, of course: a goroutine costs about 2K of memory to start, whereas a small buffered channel costs much less. As with most designs, it depends on how it is used.
Also, don't get caught by the notorious Go for-loop gotcha (before Go 1.22, the loop variable was shared across iterations); shadow the variable inside the loop to avoid it:
for i := 1; i <= 5; i++ {
i := i // new i (not the i above)
go func() {
myfunc(i) // otherwise all goroutines will most likely see the final value of i (6)
}()
}
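An alternative to shadowing is to pass the loop variable as an argument, which gives each goroutine its own copy regardless of Go version. The sketch below assumes a hypothetical helper, collectIDs, that simply records the value each goroutine saw.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectIDs starts n goroutines and records the loop index each one saw.
func collectIDs(n int) []int {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		ids []int
	)
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(id int) { // id is a per-goroutine copy of i
			defer wg.Done()
			mu.Lock()
			ids = append(ids, id)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	sort.Ints(ids) // goroutines finish in arbitrary order
	return ids
}

func main() {
	fmt.Println(collectIDs(5)) // prints [1 2 3 4 5]
}
```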
Finally ensure you wait for all results to be processed before exiting.
It's a common mistake to return from a channel based function and believe all results have been processed. In a service this will eventually be true. But in a standalone executable the processing loop may still be working on results.
go func() {
wgW.Wait() // waiting on worker goroutines to finish
close(doneQ) // safe to close results channel now
}()
// ensure we don't return until all results have been processed
for a := range doneQ {
fmt.Printf("Received %v.\n", a)
}
By processing the results in the main goroutine, we ensure we don't return prematurely without having processed everything.
Pulling it all together:
https://play.golang.org/p/MjLpQ5xglP3

Race condition between close and send to channel

I'm trying to build a generic pipeline library using worker pools. I created an interface for a source, pipe, and sink. You see, the pipe's job is to receive data from an input channel, process it, and output the result onto a channel. Here is its intended behavior:
Receive data from an input channel.
Delegate the data to an available worker.
The worker sends the result to the output channel.
Close the output channel once all workers are finished.
func (p *pipe) Process(in chan interface{}) (out chan interface{}) {
var wg sync.WaitGroup
out = make(chan interface{}, 100)
go func() {
for i := 1; i <= 100; i++ {
go p.work(in, out, &wg)
}
wg.Wait()
close(out)
}()
return
}
func (p *pipe) work(jobs <-chan interface{}, out chan<- interface{}, wg *sync.WaitGroup) {
for j := range jobs {
func(j Job) {
defer wg.Done()
wg.Add(1)
res := doSomethingWith(j)
out <- res
}(j)
}
}
However, running it may either exit without processing all of the inputs or panic with a send on closed channel message. Building the source with the -race flag gives out a data race warning between close(out) and out <- res.
Here's what I think might happen. Once a number of workers have finished their jobs, there's a split second where wg's counter reaches zero. Hence, wg.Wait() returns and the program proceeds to close(out). Meanwhile, the job channel isn't finished producing data, meaning some workers are still running in another goroutine. Since the out channel is already closed, it results in a panic.
Should the wait group be placed somewhere else? Or is there a better way to wait for all workers to finish?
It's not clear why you want one worker per job, but if you do, you can restructure your outer loop setup (see untested code below). This kind of obviates the need for worker pools in the first place.
Always, though, do a wg.Add before spinning off any worker. Right here, you are spinning off exactly 100 workers:
var wg sync.WaitGroup
out = make(chan interface{}, 100)
go func() {
for i := 1; i <= 100; i++ {
go p.work(in, out, &wg)
}
wg.Wait()
close(out)
}()
You could therefore do this:
var wg sync.WaitGroup
out = make(chan interface{}, 100)
go func() {
wg.Add(100) // ADDED - count the 100 workers
for i := 1; i <= 100; i++ {
go p.work(in, out, &wg)
}
wg.Wait()
close(out)
}()
Note that you can now move wg itself down into the goroutine that spins off the workers. This can make things cleaner, if you give up on the notion of having each worker spin off jobs as new goroutines. But if each worker is going to spin off another goroutine, that worker itself must also use wg.Add, like this:
for j := range jobs {
wg.Add(1) // ADDED - count the spun-off goroutines
func(j Job) {
res := doSomethingWith(j)
out <- res
wg.Done() // MOVED (for illustration only, can defer as before)
}(j)
}
wg.Done() // ADDED - our work in `p.work` is now done
That is, each anonymous function is another user of the channel, so increment the users-of-channel count (wg.Add(1)) before spinning off a new goroutine. When you have finished reading the input channel jobs, call wg.Done() (perhaps via an earlier defer, but I showed it at the end here).
The key to thinking about this is that wg counts the number of active goroutines that could, at this point, write to the channel. It only goes to zero when no goroutines intend to write any more. That makes it safe to close the channel.
Consider using the rather simpler (but untested):
func (p *pipe) Process(in chan interface{}) (out chan interface{}) {
out = make(chan interface{})
var wg sync.WaitGroup
go func() {
defer close(out)
for j := range in {
wg.Add(1)
go func(j Job) {
res := doSomethingWith(j)
out <- res
wg.Done()
}(j)
}
wg.Wait()
}()
return out
}
You now have one goroutine that is reading the in channel as fast as it can, spinning off jobs as it goes. You'll get one goroutine per incoming job, except when they finish their work early. There is no pool, just one worker per job (same as your code except that we knock out the pools that aren't doing anything useful).
Or, since there are only some number of CPUs available, spin off some number of goroutines as you did before at the start, but have each one run one job to completion, and deliver its result, then go back to reading the next job:
func (p *pipe) Process(in chan interface{}) (out chan interface{}) {
out = make(chan interface{})
go func() {
defer close(out)
var wg sync.WaitGroup
ncpu := runtime.NumCPU() // or something fancier if you like
wg.Add(ncpu)
for i := 0; i < ncpu; i++ {
go func() {
defer wg.Done()
for j := range in {
out <- doSomethingWith(j)
}
}()
}
wg.Wait()
}()
return out
}
By using runtime.NumCPU() we get only as many workers reading jobs as there are CPUs to run jobs. Those are the pools and they only do one job at a time.
There's generally no need to buffer the output channel, if the output-channel readers are well-structured (i.e., don't cause the pipeline to constipate). If they're not, the depth of buffering here limits how many jobs you can "work ahead" of whoever is consuming the results. Set it based on how useful it is to do this "working ahead"—not necessarily the number of CPUs, or the number of expected jobs, or whatever.
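To make the fixed-size pool concrete, here is a small runnable sketch using int jobs instead of interface{}. The function names process and sumSquares, and the squaring step standing in for doSomethingWith, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// process squares every value read from in, using one worker per CPU,
// and closes the returned channel once all workers have exited.
func process(in <-chan int) <-chan int {
	out := make(chan int) // unbuffered: the reader sets the pace
	go func() {
		defer close(out)
		var wg sync.WaitGroup
		n := runtime.NumCPU()
		wg.Add(n) // count the workers before starting any of them
		for i := 0; i < n; i++ {
			go func() {
				defer wg.Done()
				for j := range in {
					out <- j * j // placeholder for doSomethingWith(j)
				}
			}()
		}
		wg.Wait()
	}()
	return out
}

// sumSquares feeds 1..n through process and adds up the results.
func sumSquares(n int) int {
	in := make(chan int)
	go func() {
		defer close(in)
		for i := 1; i <= n; i++ {
			in <- i
		}
	}()
	sum := 0
	for v := range process(in) {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumSquares(10)) // prints 385
}
```

The WaitGroup counts workers rather than jobs, so its counter cannot reach zero while any goroutine might still send on out.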
It's possible that the jobs are being completed just as fast as they're being sent. In this case the WaitGroup will be floating near zero even while there's many more items to process.
One fix for this is to add one before sending jobs and decrement it after sending them all, effectively treating the sender as one of the 'jobs'. In this case, it's better if we do the wg.Add in the sender:
func (p *pipe) Process(in chan interface{}) (out chan interface{}) {
var wg sync.WaitGroup
out = make(chan interface{}, 100)
go func() {
for i := 1; i <= 100; i++ {
wg.Add(1)
go p.work(in, out, &wg)
}
wg.Wait()
close(out)
}()
return
}
func (p *pipe) work(jobs <-chan interface{}, out chan<- interface{}, wg *sync.WaitGroup) {
defer wg.Done() // one Done per worker, matching the one Add per worker in Process
for j := range jobs {
out <- doSomethingWith(j)
}
}
One thing I notice in the code is that a goroutine is started for each job. At the same time, each worker processes the jobs channel in a loop until it is empty/closed. It doesn't seem necessary to do both.

Implementing a job worker pool in Go

Since Go does not have generics, all the premade solutions use type casting, which I do not like very much. I also want to implement it on my own, so I tried the following code. However, sometimes it does not wait for all goroutines. Am I closing the jobs channel prematurely? I do not have anything to fetch from the workers. I could have used a pseudo output channel and waited to receive the exact number of results from it, but I believe the following code should work too. What am I missing?
func jobWorker(id int, jobs <-chan string, wg sync.WaitGroup) {
wg.Add(1)
defer wg.Done()
for job := range jobs {
item := ParseItem(job)
item.SaveItem()
MarkJobCompleted(item.ID)
log.Println("Saved", item.Title)
}
}
// ProcessJobs processes the jobs from the list and deletes them
func ProcessJobs() {
jobs := make(chan string)
list := GetJobs()
// Start workers
var wg sync.WaitGroup
for w := 0; w < 10; w++ {
go jobWorker(w, jobs, wg)
}
for _, url := range list {
jobs <- url
}
close(jobs)
wg.Wait()
}
Call wg.Add outside of the goroutine and pass a pointer to the wait group.
If Add is called from inside the goroutine, it's possible for the main goroutine to call Wait before the goroutines get a chance to run. If Add has not been called, then Wait will return immediately.
Pass a pointer to the goroutine. Otherwise, the goroutines use their own copy of the wait group.
func jobWorker(id int, jobs <-chan string, wg *sync.WaitGroup) {
defer wg.Done()
for job := range jobs {
item := ParseItem(job)
item.SaveItem()
MarkJobCompleted(item.ID)
log.Println("Saved", item.Title)
}
}
// ProcessJobs processes the jobs from the list and deletes them
func ProcessJobs() {
jobs := make(chan string)
list := GetJobs()
// Start workers
var wg sync.WaitGroup
for w := 0; w < 10; w++ {
wg.Add(1)
go jobWorker(w, jobs, &wg)
}
for _, url := range list {
jobs <- url
}
close(jobs)
wg.Wait()
}
You need to pass a pointer to the WaitGroup, or else every worker receives its own copy. Also call wg.Add before starting each goroutine, so that Wait cannot run before the counter has been incremented.
func jobWorker(id int, jobs <-chan string, wg *sync.WaitGroup) {
defer wg.Done()
for job := range jobs {
item := ParseItem(job)
item.SaveItem()
MarkJobCompleted(item.ID)
log.Println("Saved", item.Title)
}
}
// ProcessJobs processes the jobs from the list and deletes them
func ProcessJobs() {
jobs := make(chan string)
list := GetJobs()
// Start workers
var wg sync.WaitGroup
for w := 0; w < 10; w++ {
wg.Add(1)
go jobWorker(w, jobs, &wg)
}
for _, url := range list {
jobs <- url
}
close(jobs)
wg.Wait()
}
See the difference here: without pointer, with pointer.

Resources