I'm working through The Go Programming Language and learning about goroutines, and I came across the following issue. In this example, the function below is meant to take a channel of filenames and process each of them:
func makeThumbnails5(filenames <-chan string) int64 {
sizes := make(chan int64)
var wg sync.WaitGroup
for f := range filenames {
wg.Add(1)
// worker
go func(f string) {
defer wg.Done()
thumb, err := thumbnail.ImageFile(f)
if err != nil {
log.Println(err)
return
}
info, _ := os.Stat(thumb)
sizes <- info.Size()
}(f)
}
// closer
go func() {
wg.Wait()
close(sizes)
}()
var total int64
for size := range sizes {
total += size
}
wg.Wait()
return total
}
I've tried to use this function in the following way:
func main() {
thumbnails := os.Args[1:] /* Get a list of all the images from the CLI */
ch := make(chan string, len(thumbnails))
for _, val := range thumbnails {
ch <- val
}
makeThumbnails5(ch)
}
However, when I run this program, I get the following error:
fatal error: all goroutines are asleep - deadlock!
It doesn't appear that the closer goroutine is running. Could someone help me understand what is going wrong here, and what I can do to run this function correctly?
As I commented, it deadlocks because the filenames chan is never closed, so the for f := range filenames loop never completes. However, just closing the input chan means that all goroutines launched in the loop would be stuck at the line sizes <- info.Size() until that loop ends. Not a problem in this case, but if the input can be huge it could be (and then you'd probably want to limit the number of concurrent workers too). So it makes sense to have the main loop in a goroutine as well, so that the for size := range sizes loop can start consuming. The following should work:
func makeThumbnails5(filenames <-chan string) int64 {
sizes := make(chan int64)
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
for f := range filenames {
wg.Add(1)
// worker
go func(f string) {
defer wg.Done()
thumb, err := thumbnail.ImageFile(f)
if err != nil {
log.Println(err)
return
}
info, _ := os.Stat(thumb)
sizes <- info.Size()
}(f)
}
}()
// closer
go func() {
wg.Wait()
close(sizes)
}()
var total int64
for size := range sizes {
total += size
}
return total
}
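If the input really can be huge, the point above about limiting concurrency can be taken further: start a fixed pool of workers that all range over filenames, instead of one goroutine per file. A minimal sketch, reusing the same thumbnail.ImageFile helper (numWorkers is an arbitrary illustrative value, not something from the book):
func makeThumbnailsBounded(filenames <-chan string) int64 {
    const numWorkers = 8 // illustrative; tune for your workload
    sizes := make(chan int64)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for f := range filenames { // each worker drains the input directly
                thumb, err := thumbnail.ImageFile(f)
                if err != nil {
                    log.Println(err)
                    continue
                }
                info, _ := os.Stat(thumb)
                sizes <- info.Size()
            }
        }()
    }
    // closer
    go func() {
        wg.Wait()
        close(sizes)
    }()
    var total int64
    for size := range sizes {
        total += size
    }
    return total
}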
The implementation of main has a similar problem: if the input is huge, you essentially load it all into memory (the buffered chan) before passing it on to be processed. Perhaps something like the following is better:
func main() {
ch := make(chan string)
go func(thumbnails []string) {
defer close(ch)
for _, val := range thumbnails {
ch <- val
}
}(os.Args[1:])
makeThumbnails5(ch)
}
I'm learning Go and I am trying to implement a job queue.
What I'm trying to do is:
Have the main goroutine feed lines through a channel to multiple parser workers (that parse a line into a struct), and have each parser send the struct to a channel of structs that other workers (goroutines) will process (send to a database, etc.).
The code looks like this:
lineParseQ := make(chan string, 5)
jobProcessQ := make(chan myStruct, 5)
doneQ := make(chan myStruct, 5)
fileName := "myfile.csv"
file, err := os.Open(fileName)
if err != nil {
log.Fatal(err)
}
defer file.Close()
reader := bufio.NewReader(file)
// Start line parsing workers and send to jobProcessQ
for i := 1; i <= 2; i++ {
go lineToStructWorker(i, lineParseQ, jobProcessQ)
}
// Process myStruct from jobProcessQ
for i := 1; i <= 5; i++ {
go WorkerProcessStruct(i, jobProcessQ, doneQ)
}
lineCount := 0
countSend := 0
for {
line, err := reader.ReadString('\n')
if err != nil && err != io.EOF {
log.Fatal(err)
}
if err == io.EOF {
break
}
lineCount++
if lineCount > 1 {
countSend++
lineParseQ <- line[:len(line)-1] // Avoid last char '\n'
}
}
for i := 0; i < countSend; i++ {
fmt.Printf("Received %+v.\n", <-doneQ)
}
close(doneQ)
close(jobProcessQ)
close(lineParseQ)
Here's a simplified playground: https://play.golang.org/p/yz84g6CJraa
The workers look like this:
func lineToStructWorker(workerID int, lineQ <-chan string, strQ chan<- myStruct) {
for j := range lineQ {
strQ <- lineToStruct(j) // just parses the csv to a struct...
}
}
func WorkerProcessStruct(workerID int, strQ <-chan myStruct, done chan<- myStruct) {
for a := range strQ {
time.Sleep(time.Millisecond * 500) // fake long operation...
done <- a
}
}
I know the problem is related to the "done" channel because if I don't use it, there's no error, but I can't figure out how to fix it.
You don't start reading from doneQ until you've finished sending all the lines to lineParseQ, which is more lines than there is buffer space. So once the doneQ buffer is full, that send blocks, which starts filling the lineParseQ buffer, and once that's full, it deadlocks. Move either the loop sending to lineParseQ, the loop reading from doneQ, or both, to separate goroutine(s), e.g.:
go func() {
for _, line := range lines {
countSend++
lineParseQ <- line
}
close(lineParseQ)
}()
This will still deadlock at the end, because you've got a range over a channel and the close of that channel after it in the same goroutine; since range continues until the channel is closed, and the close only comes after the range finishes, you still have a deadlock. You need to put each close in an appropriate place: either in the sending goroutine, or, if a channel has multiple senders, behind a WaitGroup that waits for all of them.
// Start line parsing workers and send to jobProcessQ
wg := new(sync.WaitGroup)
for i := 1; i <= 2; i++ {
    wg.Add(1)
    go lineToStructWorker(i, lineParseQ, jobProcessQ, wg)
}
// Process myStruct from jobProcessQ; doneQ also has multiple senders,
// so it gets its own WaitGroup and its own closer goroutine.
wgWorkers := new(sync.WaitGroup)
for i := 1; i <= 5; i++ {
    wgWorkers.Add(1)
    go WorkerProcessStruct(i, jobProcessQ, doneQ, wgWorkers)
}
countSend := 0
go func() {
    for _, line := range lines {
        countSend++
        lineParseQ <- line
    }
    close(lineParseQ)
}()
go func() {
    wg.Wait()
    close(jobProcessQ)
}()
go func() {
    wgWorkers.Wait()
    close(doneQ)
}()
for a := range doneQ {
    fmt.Printf("Received %v.\n", a)
}
// ...
func lineToStructWorker(workerID int, lineQ <-chan string, strQ chan<- myStruct, wg *sync.WaitGroup) {
    defer wg.Done()
    for j := range lineQ {
        strQ <- lineToStruct(j) // just parses the csv to a struct...
    }
}
func WorkerProcessStruct(workerID int, strQ <-chan myStruct, done chan<- myStruct, wg *sync.WaitGroup) {
    defer wg.Done()
    for a := range strQ {
        time.Sleep(time.Millisecond * 500) // fake long operation...
        done <- a
    }
}
Full working example here: https://play.golang.org/p/XsnewSZeb2X
Coordinate the pipeline with sync.WaitGroup, breaking the work into stages. When you know one stage of the pipeline is complete (and nothing is writing to a particular channel any more), close the channel to instruct all "workers" reading from it to exit, e.g.
var wg sync.WaitGroup
for i := 1; i <= 5; i++ {
i := i
wg.Add(1)
go func() {
Worker(i)
wg.Done()
}()
}
// wg.Wait() signals the above have completed
Buffered channels are handy for burst workloads, but sometimes they are used to paper over deadlocks in poor designs. If you want to avoid running certain parts of your pipeline in a goroutine, you can buffer some channels (typically matching the number of workers) to avoid blocking your main goroutine.
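For example, sizing a stage's buffer to the number of workers lets the producer run a few items ahead before it has to block. A small self-contained sketch (the names and counts here are illustrative, not taken from the code above):
package main

import (
    "fmt"
    "sync"
)

func main() {
    const numWorkers = 5 // illustrative
    jobs := make(chan int, numWorkers) // buffer matched to the worker count

    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := range jobs {
                fmt.Println("worker", id, "got job", j)
            }
        }(i)
    }

    for j := 0; j < 20; j++ {
        jobs <- j // blocks only when every worker is busy and the buffer is full
    }
    close(jobs)
    wg.Wait()
}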
If you have dependent pieces that read and write, and you want to avoid deadlock, ensure they run in separate goroutines. Having each part of the pipeline in its own goroutine even removes the need for buffered channels:
// putting all channel work into separate goroutines
// removes the need for buffered channels
lineParseQ := make(chan string, 0)
jobProcessQ := make(chan myStruct, 0)
doneQ := make(chan myStruct, 0)
It's a tradeoff, of course: a goroutine costs about 2K in resources, versus a buffered channel slot, which costs much less. As with most designs, it depends on how it is used.
Also, don't get caught by the notorious Go for-loop gotcha; make a per-iteration copy of the loop variable to avoid it:
for i := 1; i <= 5; i++ {
i := i // new i (not the i above)
go func() {
myfunc(i) // otherwise all goroutines will most likely get '5'
}()
}
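Passing the loop variable as an argument to the goroutine's function works just as well; that's the style used in the worker snippets elsewhere on this page:
for i := 1; i <= 5; i++ {
    go func(i int) {
        myfunc(i) // each goroutine receives its own copy of i
    }(i)
}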
Finally ensure you wait for all results to be processed before exiting.
It's a common mistake to return from a channel based function and believe all results have been processed. In a service this will eventually be true. But in a standalone executable the processing loop may still be working on results.
go func() {
wgW.Wait() // waiting on worker goroutines to finish
close(doneQ) // safe to close results channel now
}()
// ensure we don't return until all results have been processed
for a := range doneQ {
fmt.Printf("Received %v.\n", a)
}
By processing the results in the main goroutine, we ensure we don't return prematurely without having processed everything.
Pulling it all together:
https://play.golang.org/p/MjLpQ5xglP3
I am prototyping a series of goroutines for a pipeline that each perform a transformation. The goroutines are terminating before all the data has passed through.
I have checked the Donovan and Kernighan book and Googled for solutions.
Here is my code:
package main
import (
"fmt"
"sync"
)
func main() {
a1 := []string{"apple", "apricot"}
chan1 := make(chan string)
chan2 := make(chan string)
chan3 := make(chan string)
var wg sync.WaitGroup
go Pipe1(chan2, chan1, &wg)
go Pipe2(chan3, chan2, &wg)
go Pipe3(chan3, &wg)
func (data []string) {
defer wg.Done()
for _, s := range data {
wg.Add(1)
chan1 <- s
}
go func() {
wg.Wait()
close(chan1)
}()
}(a1)
}
func Pipe1(out chan<- string, in <-chan string, wg *sync.WaitGroup) {
defer wg.Done()
for s := range in {
wg.Add(1)
out <- s + "s are"
}
}
func Pipe2(out chan<- string, in <-chan string, wg *sync.WaitGroup) {
defer wg.Done()
for s := range in {
wg.Add(1)
out <- s + " good for you"
}
}
func Pipe3(in <-chan string, wg *sync.WaitGroup) {
defer wg.Done()
for s := range in {
wg.Add(1)
fmt.Println(s)
}
}
My expected output is:
apples are good for you
apricots are good for you
The results of running main are inconsistent. Sometimes I get both lines. Sometimes I just get the apples. Sometimes nothing is output.
As Adrian already pointed out, your WaitGroup.Add and WaitGroup.Done calls are mismatched. However, in cases like this the "I am done" signal is typically given by closing the output channel. WaitGroups are only necessary if work is shared between several goroutines (i.e. several goroutines consume the same channel), which isn't the case here.
package main
import (
"fmt"
)
func main() {
a1 := []string{"apple", "apricot"}
chan1 := make(chan string)
chan2 := make(chan string)
chan3 := make(chan string)
go func() {
for _, s := range a1 {
chan1 <- s
}
close(chan1)
}()
go Pipe1(chan2, chan1)
go Pipe2(chan3, chan2)
// This range loop terminates when chan3 is closed, which Pipe2 does after
// chan2 is closed, which Pipe1 does after chan1 is closed, which the
// anonymous goroutine above does after it sent all values.
for s := range chan3 {
fmt.Println(s)
}
}
func Pipe1(out chan<- string, in <-chan string) {
for s := range in {
out <- s + "s are"
}
close(out) // let caller know that we're done
}
func Pipe2(out chan<- string, in <-chan string) {
for s := range in {
out <- s + " good for you"
}
close(out) // let caller know that we're done
}
Try it on the playground: https://play.golang.org/p/d2J4APjs_lL
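If a stage did fan out to several goroutines (several senders on one channel), the close would instead sit behind a WaitGroup, as noted above. A sketch of what Pipe1 would look like with n workers (illustrative only, and it additionally needs the sync import):
func fanOutPipe1(out chan<- string, in <-chan string, n int) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for s := range in {
                out <- s + "s are"
            }
        }()
    }
    go func() {
        wg.Wait()  // all senders are done
        close(out) // now exactly one goroutine closes the channel
    }()
}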
You're calling wg.Wait in a goroutine, so main is allowed to return (and therefore your program exits) before the other goroutines have finished. This would cause the behavior you see, but taking it out of a goroutine alone isn't enough.
You're also misusing the WaitGroup in general; your Add and Done calls don't relate to one another, and you don't have as many Dones as you have Adds, so the WaitGroup will never finish. If you're calling Add in a loop, then every loop iteration must also result in a Done call; as you have it now, you defer wg.Done() before each of your loops, then call Add inside the loop, resulting in one Done and many Adds. This code would need to be significantly revised to work as intended.
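For reference, the usual pairing is one Add immediately before each goroutine starts and one deferred Done inside it; items and work below are placeholders, not names from the question:
var wg sync.WaitGroup
for _, s := range items {
    wg.Add(1) // one Add per goroutine, before it starts
    go func(s string) {
        defer wg.Done() // exactly one Done per Add
        work(s)
    }(s)
}
wg.Wait() // returns once every Done has run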
This is in reference to the following code in The Go Programming Language, Chapter 8, p. 238, copied below from this link:
// makeThumbnails6 makes thumbnails for each file received from the channel.
// It returns the number of bytes occupied by the files it creates.
func makeThumbnails6(filenames <-chan string) int64 {
sizes := make(chan int64)
var wg sync.WaitGroup // number of working goroutines
for f := range filenames {
wg.Add(1)
// worker
go func(f string) {
defer wg.Done()
thumb, err := thumbnail.ImageFile(f)
if err != nil {
log.Println(err)
return
}
info, _ := os.Stat(thumb) // OK to ignore error
fmt.Println(info.Size())
sizes <- info.Size()
}(f)
}
// closer
go func() {
wg.Wait()
close(sizes)
}()
var total int64
for size := range sizes {
total += size
}
return total
}
Why do we need to put the closer in a goroutine? Why can't the following work?
// closer
// go func() {
fmt.Println("waiting for reset")
wg.Wait()
fmt.Println("closing sizes")
close(sizes)
// }()
If I try running above code it gives:
waiting for reset
3547
2793
fatal error: all goroutines are asleep - deadlock!
Why is there a deadlock in the above? FYI, in the function that calls makeThumbnails6 I do close the filenames channel.
Your channel is unbuffered (you didn't specify any buffer size when make()ing the channel). This means that a write to the channel blocks until the value written is read. And you read from the channel after your call to wg.Wait(), so nothing ever gets read and all your goroutines get stuck on the blocking write.
That said, you do not need a WaitGroup here. WaitGroups are good when you don't know when your goroutine is done, but here you are sending results back, so you know. Here is sample code that does a similar thing to what you are trying to do (with a fake worker payload).
package main
import (
"fmt"
"time"
)
func main() {
var procs int = 0
filenames := []string{"file1", "file2", "file3", "file4"}
mychan := make(chan string)
for _, f := range filenames {
procs += 1
// worker
go func(f string) {
fmt.Printf("Worker processing %v\n", f)
time.Sleep(time.Second)
mychan <- f
}(f)
}
for i := 0; i < procs; i++ {
select {
case msg := <-mychan:
fmt.Printf("got %v from worker channel\n", msg)
}
}
}
Test it in the playground here https://play.golang.org/p/RtMkYbAqtGO
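Applied to the thumbnail code in the question, the same counting idea looks roughly like this. It's only a sketch, it assumes the number of files is known up front (a slice rather than a channel), and note that the error path must still send a value so the receive count matches:
func makeThumbnailsCounted(filenames []string) int64 {
    sizes := make(chan int64)
    for _, f := range filenames {
        go func(f string) {
            thumb, err := thumbnail.ImageFile(f)
            if err != nil {
                log.Println(err)
                sizes <- 0 // still send, so the receive count below matches
                return
            }
            info, _ := os.Stat(thumb)
            sizes <- info.Size()
        }(f)
    }
    var total int64
    for range filenames { // receive exactly one value per file
        total += <-sizes
    }
    return total
}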
Although it's been a while since the question was raised, I encountered the same issue. Initially my main looked like the following:
func main() {
filenames := make(chan string, len(os.Args))
for _, f := range os.Args[1:] {
filenames <- f
}
sizes := makeThumbnails6(filenames)
close(filenames)
log.Println("Total size: ", sizes)
}
This version deadlocks because the range filenames loop inside makeThumbnails6 is synchronous, so close(filenames) in main is never reached. The sizes channel in makeThumbnails6 is unbuffered, so the worker goroutines block when trying to send back the size.
The solution was to move close(filenames) to before the call to makeThumbnails6 in main.
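In other words, something like this (a sketch of the corrected main; makeThumbnails6 itself is unchanged):
func main() {
    filenames := make(chan string, len(os.Args))
    for _, f := range os.Args[1:] {
        filenames <- f
    }
    close(filenames) // closed before the call, so range filenames can terminate
    sizes := makeThumbnails6(filenames)
    log.Println("Total size: ", sizes)
}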
The code is wrong. In short, the channel sizes is unbuffered. To fix it, we need to create sizes as a buffered channel with enough capacity. A one-line fix is enough, as shown below. Here I just made the simple assumption that 1024 is big enough.
func makeThumbnails6(filenames chan string) int64 {
sizes := make(chan int64, 1024) // CHANGE
var wg sync.WaitGroup // number of working goroutines
for f := range filenames {
wg.Add(1)
// worker
go func(f string) {
defer wg.Done()
thumb, err := thumbnail.ImageFile(f)
if err != nil {
log.Println(err)
return
}
info, _ := os.Stat(thumb) // OK to ignore error
fmt.Println(info.Size())
sizes <- info.Size()
}(f)
}
// closer
go func() {
wg.Wait()
close(sizes)
}()
var total int64
for size := range sizes {
total += size
}
return total
}
I'm currently staring at a beefed up version of the following code:
func embarrassing(data []string) []string {
resultChan := make(chan string)
var waitGroup sync.WaitGroup
for _, item := range data {
waitGroup.Add(1)
go func(item string) {
defer waitGroup.Done()
resultChan <- doWork(item)
}(item)
}
go func() {
waitGroup.Wait()
close(resultChan)
}()
var results []string
for result := range resultChan {
results = append(results, result)
}
return results
}
This is just blowing my mind. All of this can be expressed in other languages as
results = parallelMap(data, doWork)
Even if it can't be done quite this easily in Go, isn't there still a better way than the above?
If you need all the results, you don't need the channel (and the extra goroutine to close it) to communicate the results, you can write directly into the results slice:
func cleaner(data []string) []string {
results := make([]string, len(data))
wg := &sync.WaitGroup{}
wg.Add(len(data))
for i, item := range data {
go func(i int, item string) {
defer wg.Done()
results[i] = doWork(item)
}(i, item)
}
wg.Wait()
return results
}
This is possible because slice elements act as distinct variables, and thus can be written individually without synchronization. For details, see Can I concurrently write different slice elements. You also get the results in the same order as your input for free.
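For instance, with a trivial doWork stand-in (purely illustrative), calling cleaner keeps the results in input order:
func doWork(item string) string { return "done:" + item }

func main() {
    fmt.Println(cleaner([]string{"a", "b", "c"})) // [done:a done:b done:c]
}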
Another variation: if doWork() did not return the result but instead received the address where the result should be "placed", plus a sync.WaitGroup to signal completion, that doWork() function could be launched "directly" as a new goroutine.
We can create a reusable wrapper for doWork():
func doWork2(item string, result *string, wg *sync.WaitGroup) {
defer wg.Done()
*result = doWork(item)
}
If you have the processing logic in such format, this is how it can be executed concurrently:
func cleanest(data []string) []string {
results := make([]string, len(data))
wg := &sync.WaitGroup{}
wg.Add(len(data))
for i, item := range data {
go doWork2(item, &results[i], wg)
}
wg.Wait()
return results
}
Yet another variation could be to pass a channel to doWork() on which it is supposed to deliver the result. This solution doesn't even require a sync.WaitGroup, as we know how many elements we want to receive from the channel:
func cleanest2(data []string) []string {
ch := make(chan string)
for _, item := range data {
go doWork3(item, ch)
}
results := make([]string, len(data))
for i := range results {
results[i] = <-ch
}
return results
}
func doWork3(item string, res chan<- string) {
res <- "done:" + item
}
"Weakness" of this last solution is that it may collect the result "out-of-order" (which may or may not be a problem). This approach can be improved to retain order by letting doWork() receive and return the index of the item. For details and examples, see How to collect values from N goroutines executed in a specific order?
You can also use reflection to achieve something similar.
In this example it distributes the handler function calls over 4 goroutines and returns the results in a new instance of the given source slice type.
package main
import (
"fmt"
"reflect"
"strings"
"sync"
)
func parralelMap(some interface{}, handle interface{}) interface{} {
rSlice := reflect.ValueOf(some)
rFn := reflect.ValueOf(handle)
dChan := make(chan reflect.Value, 4)
rChan := make(chan []reflect.Value, 4)
var waitGroup sync.WaitGroup
for i := 0; i < 4; i++ {
waitGroup.Add(1)
go func() {
defer waitGroup.Done()
for v := range dChan {
rChan <- rFn.Call([]reflect.Value{v})
}
}()
}
nSlice := reflect.MakeSlice(rSlice.Type(), rSlice.Len(), rSlice.Cap())
go func() {
    // Feed the input from a separate goroutine so the result-collection
    // loop below can drain rChan; otherwise a large input slice would
    // fill the buffers and deadlock.
    for i := 0; i < rSlice.Len(); i++ {
        dChan <- rSlice.Index(i)
    }
    close(dChan)
}()
go func() {
waitGroup.Wait()
close(rChan)
}()
i := 0
for v := range rChan {
nSlice.Index(i).Set(v[0])
i++
}
return nSlice.Interface()
}
func main() {
fmt.Println(
parralelMap([]string{"what", "ever"}, strings.ToUpper),
)
}
Test here https://play.golang.org/p/iUPHqswx8iS
I have 5 huge log files (4 million rows each) that I currently process in Perl, and I thought I might try to implement the same in Go with its concurrency features. So, being very inexperienced in Go, I was thinking of doing it as below. Any comments on the approach will be greatly appreciated.
Some rough pseudocode:
var wg1 sync.WaitGroup
var wg2 sync.WaitGroup
func processRow (r Row) {
wg2.Add(1)
defer wg2.Done()
res = <process r>
return res
}
func processFile(f File) {
wg1.Add(1)
open(newfile File)
defer wg1.Done()
line = <row from f>
result = go processRow(line)
newFile.Println(result) // Write new processed line to newFile
wg2.Wait()
newFile.Close()
}
func main() {
for each f logfile {
go processFile(f)
}
wg1.Wait()
}
So the idea is that I process these 5 files concurrently, and then all rows of each file will in turn also be processed concurrently.
Will that work?
You should definitely use channels to manage your processed rows. Alternatively you could also write another goroutine to handle your output.
var numGoWriters = 10
func processRow(r Row, ch chan<- string) {
res := process(r)
ch <- res
}
func writeRow(f *os.File, ch <-chan string) {
    w := bufio.NewWriter(f)
    defer w.Flush() // make sure buffered output reaches the file
    for s := range ch {
        if _, err := w.WriteString(s + "\n"); err != nil {
            // handle it
        }
    }
}
func processFile(f *os.File) {
    outFile, err := os.Create("/path/to/file.out")
    if err != nil {
        // handle it
    }
    defer outFile.Close()
    var wg sync.WaitGroup
    ch := make(chan string, 10) // play with this number for performance
    fScanner := bufio.NewScanner(f)
    for fScanner.Scan() {
        wg.Add(1)
        line := fScanner.Text() // capture the line; Text() is only valid until the next Scan
        go func(line string) {
            processRow(line, ch)
            wg.Done()
        }(line)
    }
    var wgWriters sync.WaitGroup
    for i := 0; i < numGoWriters; i++ {
        wgWriters.Add(1)
        go func() {
            writeRow(outFile, ch)
            wgWriters.Done()
        }()
    }
    wg.Wait()        // all rows have been processed and sent
    close(ch)        // so the writer goroutines' range loops can exit
    wgWriters.Wait() // wait for the writers before the deferred outFile.Close runs
}
Here we have processRow doing all the processing (I assumed the rows to be strings), writeRow doing all the output I/O, and processFile tying each file together. Then all main has to do is hand off the files, spawn the goroutines, et voilà.
func main() {
    var wg sync.WaitGroup
    filenames := [...]string{"here", "are", "some", "log", "paths"}
    for _, fname := range filenames {
        inFile, err := os.Open(fname)
        if err != nil {
            // handle it
        }
        defer inFile.Close()
        wg.Add(1)
        go func(f *os.File) {
            processFile(f)
            wg.Done()
        }(inFile)
    }
    wg.Wait()
}