Goroutine deadlock while walking folders - go

I have this code based on the pipelines example. walkFiles takes one or more than one folder (as specified in the folders variable) and "visits" the files in all folders given as a parameter. It also takes a done channel to allow for cancellation, but I don't think it matters for this problem.
The code works as expected when passed only one folder to walk. But when given two it gives me the infamous fatal error: all goroutines are asleep - deadlock! error. It even looks like it's doing the right thing by processing the files of the two folders, but it doesn't end well. What is the (probably obvious) error I'm making in the concurrency of this function?
Here's the code:
type result struct {
path string
checksum []byte
err error
}
type FileData struct {
Hash []byte
}
// walkFiles starts a goroutine to walk the directory tree at root and send the
// path of each regular file on the string channel. It sends the result of the
// walk on the error channel. If done is closed, walkFiles abandons its work.
func (p Processor) walkFiles(done <-chan struct{}, folders []string) (<-chan string, <-chan error) {
paths := make(chan string)
errc := make(chan error, 1)
visit := func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if !info.Mode().IsRegular() {
return nil
}
select {
case paths <- path:
case <-done:
return errors.New("walk canceled")
}
return nil
}
var wg sync.WaitGroup
for i, folder := range folders {
wg.Add(1)
go func(f string, i int) {
defer wg.Done()
// No select needed for this send, since errc is buffered.
errc <- filepath.Walk(f, visit)
}(folder, i)
}
go func() {
wg.Wait()
close(paths)
}()
return paths, errc
}
func closeFile(f *os.File) {
err := f.Close()
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
os.Exit(1)
}
}
// processor reads path names from paths and sends digests of the corresponding
// files on c until either paths or done is closed.
func (p Processor) process(done <-chan struct{}, files <-chan string, c chan<- result, loc *locator.Locator) {
for f := range files {
func() {
file, err := os.Open(f.path)
if err != nil {
fmt.Println(err)
return
}
defer closeFile(file)
// Hashing file, producing `checksum` variable, and an `err`
select {
case c <- result{f.path, checksum, err}:
case <-done:
return
}
}()
}
}
// MD5All reads all the files in the file tree rooted at root and returns a map
// from file path to the MD5 sum of the file's contents. If the directory walk
// fails or any read operation fails, MD5All returns an error. In that case,
// MD5All does not wait for inflight read operations to complete.
func (p Processor) MD5All(folders []string) (map[string]FileData, error) {
// MD5All closes the done channel when it returns; it may do so before
// receiving all the values from c and errc.
done := make(chan struct{})
defer close(done)
paths, errc := p.walkFiles(done, folders)
c := make(chan result)
var wg sync.WaitGroup
wg.Add(NUM_DIGESTERS)
for i := 0; i < NUM_DIGESTERS; i++ {
go func() {
p.process(done, paths, c, loc)
wg.Done()
}()
}
go func() {
wg.Wait()
close(c)
}()
// End of pipeline. OMIT
m := make(map[string]FileData)
for r := range c {
if r.err != nil {
return nil, r.err
}
m[r.path] = FileData{r.checksum}
}
if err := <-errc; err != nil {
return nil, err
}
return m, nil
}
func (p Processor) Start() map[string]FileData {
m, err := p.MD5All(p.folders)
if err != nil {
log.Fatal(err)
}
return m
}

The problem is here:
if err := <-errc; err != nil {
return nil, err
}
You're reading from the errc only once, but all groutines are writing to it. Once the errc is read for the first completing goroutine, all others are stuck waiting to write to it.
Read using a for-loop.

Related

Concurrency issues with crawler

I try to build concurrent crawler based on Tour and some others SO answers regarding that. What I have currently is below but I think I have here two subtle issues.
Sometimes I get 16 urls in response and sometimes 17 (debug print in main). I know it because when I even change WriteToSlice to Read then in Read sometimes 'Read: end, counter = ' is never reached and it's always when I get 16 urls.
I have troubles with err channel, I get no messages in this channel, even when I run my main Crawl method with address like www.golang.org so without valid schema error should be send via err channel
Concurrency is really difficult topic, help and advice will be appreciated
package main
import (
"fmt"
"net/http"
"sync"
"golang.org/x/net/html"
)
type urlCache struct {
urls map[string]struct{}
sync.Mutex
}
func (v *urlCache) Set(url string) bool {
v.Lock()
defer v.Unlock()
_, exist := v.urls[url]
v.urls[url] = struct{}{}
return !exist
}
func newURLCache() *urlCache {
return &urlCache{
urls: make(map[string]struct{}),
}
}
type results struct {
data chan string
err chan error
}
func newResults() *results {
return &results{
data: make(chan string, 1),
err: make(chan error, 1),
}
}
func (r *results) close() {
close(r.data)
close(r.err)
}
func (r *results) WriteToSlice(s *[]string) {
for {
select {
case data := <-r.data:
*s = append(*s, data)
case err := <-r.err:
fmt.Println("e ", err)
}
}
}
func (r *results) Read() {
fmt.Println("Read: start")
counter := 0
for c := range r.data {
fmt.Println(c)
counter++
}
fmt.Println("Read: end, counter = ", counter)
}
func crawl(url string, depth int, wg *sync.WaitGroup, cache *urlCache, res *results) {
defer wg.Done()
if depth == 0 || !cache.Set(url) {
return
}
response, err := http.Get(url)
if err != nil {
res.err <- err
return
}
defer response.Body.Close()
node, err := html.Parse(response.Body)
if err != nil {
res.err <- err
return
}
urls := grablUrls(response, node)
res.data <- url
for _, url := range urls {
wg.Add(1)
go crawl(url, depth-1, wg, cache, res)
}
}
func grablUrls(resp *http.Response, node *html.Node) []string {
var f func(*html.Node) []string
var results []string
f = func(n *html.Node) []string {
if n.Type == html.ElementNode && n.Data == "a" {
for _, a := range n.Attr {
if a.Key != "href" {
continue
}
link, err := resp.Request.URL.Parse(a.Val)
if err != nil {
continue
}
results = append(results, link.String())
}
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
return results
}
res := f(node)
return res
}
// Crawl ...
func Crawl(url string, depth int) []string {
wg := &sync.WaitGroup{}
output := &[]string{}
visited := newURLCache()
results := newResults()
defer results.close()
wg.Add(1)
go crawl(url, depth, wg, visited, results)
go results.WriteToSlice(output)
// go results.Read()
wg.Wait()
return *output
}
func main() {
r := Crawl("https://www.golang.org", 2)
// r := Crawl("www.golang.org", 2) // no schema, error should be generated and send via err
fmt.Println(len(r))
}
Both your questions 1 and 2 are a result of the same bug.
In Crawl() you are not waiting for this go routine to finish: go results.WriteToSlice(output). On the last crawl() function, the wait group is released, the output is returned and printed before the WriteToSlice function finishes with the data and err channel. So what has happened is this:
crawl() finishes, placing data in results.data and results.err.
Waitgroup wait() unblocks, causing main() to print the length of the result []string
WriteToSlice adds the last data (or err) item to the channel
You need to return from Crawl() not only when the data is done being written to the channel, but also when the channel is done being read in it's entirety (including the buffer). A good way to do this is close channels when you are sure that you are done with them. By organizing your code this way, you can block on the go routine that is draining the channels, and instead of using the wait group to release to main, you wait until the channels are 100% done.
You can see this gobyexample https://gobyexample.com/closing-channels. Remember that when you close a channel, the channel can still be used until the last item is taken. So you can close a buffered channel, and the reader will still get all the items that were queued in the channel.
There is some code structure that can change to make this cleaner, but here is a quick way to fix your program. Change Crawl to block on WriteToSlice. Close the data channel when the crawl function finishes, and wait for WriteToSlice to finish.
// Crawl ...
func Crawl(url string, depth int) []string {
wg := &sync.WaitGroup{}
output := &[]string{}
visited := newURLCache()
results := newResults()
go func() {
wg.Add(1)
go crawl(url, depth, wg, visited, results)
wg.Wait()
// All data is written, this makes `WriteToSlice()` unblock
close(results.data)
}()
// This will block until results.data is closed
results.WriteToSlice(output)
close(results.err)
return *output
}
Then on write to slice, you have to check for the closed channel to exit the for loop:
func (r *results) WriteToSlice(s *[]string) {
for {
select {
case data, open := <-r.data:
if !open {
return // All data done
}
*s = append(*s, data)
case err := <-r.err:
fmt.Println("e ", err)
}
}
}
Here is the full code: https://play.golang.org/p/GBpGk-lzrhd (it won't work in the playground)

Optimize writing to CSV in Go

The following snippet validates a phone number and write the details to CSV.
func Parse(phone Input, output *PhoneNumber) error {
var n PhoneNumber
num, _ := phonenumbers.Parse(phone.Number, phone.Prefix)
n.PhoneNumber = phonenumbers.Format(num, phonenumbers.E164)
n.CountryCode = num.GetCountryCode()
n.PhoneType = phonenumbers.GetNumberType(num)
n.NetworkName, _ = phonenumbers.GetCarrierForNumber(num, "EN")
n.Region = phonenumbers.GetRegionCodeForNumber(num)
*output = n
return nil
}
func createFile(path string) {
// detect if file exists
var _, err = os.Stat(path)
// create file if not exists
if os.IsNotExist(err) {
var file, err = os.Create(path)
if err != nil {
return
}
defer file.Close()
}
}
func worker(ctx context.Context, dst chan string, src chan []string) {
for {
select {
case dataArray, ok := <-src: // you must check for readable state of the channel.
if !ok {
return
}
go processNumber(dataArray[0])
case <-ctx.Done(): // if the context is cancelled, quit.
return
}
}
}
func processNumber(number string) {
num, e := phonenumbers.Parse(number, "")
if e != nil {
return
}
region := phonenumbers.GetRegionCodeForNumber(num)
carrier, _ := phonenumbers.GetCarrierForNumber(num, "EN")
path := "sample_all.csv"
createFile(path)
var csvFile, _ = os.OpenFile(path, os.O_APPEND|os.O_WRONLY, os.ModeAppend)
csvwriter := csv.NewWriter(csvFile)
_ = csvwriter.Write([]string{phonenumbers.Format(num, phonenumbers.E164), fmt.Sprintf("%v", num.GetCountryCode()), fmt.Sprintf("%v", phonenumbers.GetNumberType(num)), carrier, region})
defer csvFile.Close()
csvwriter.Flush()
}
func ParseFile(phone Input, output *PhoneNumber) error {
// create a context
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// that cancels at ctrl+C
go onSignal(os.Interrupt, cancel)
numberOfWorkers := 2
start := time.Now()
csvfile, err := os.Open(phone.File)
if err != nil {
log.Fatal(err)
}
defer csvfile.Close()
reader := csv.NewReader(csvfile)
// create the pair of input/output channels for the controller=>workers com.
src := make(chan []string)
out := make(chan string)
// use a waitgroup to manage synchronization
var wg sync.WaitGroup
// declare the workers
for i := 0; i < numberOfWorkers; i++ {
wg.Add(1)
go func() {
defer wg.Done()
worker(ctx, out, src)
}()
}
// read the csv and write it to src
go func() {
for {
record, err := reader.Read()
if err == io.EOF {
break
} else if err != nil {
log.Fatal(err)
}
src <- record // you might select on ctx.Done().
}
close(src) // close src to signal workers that no more job are incoming.
}()
// wait for worker group to finish and close out
go func() {
wg.Wait() // wait for writers to quit.
close(out) // when you close(out) it breaks the below loop.
}()
// drain the output
for res := range out {
fmt.Println(res)
}
fmt.Printf("\n%2fs", time.Since(start).Seconds())
return nil
}
In processNumber function, if I skip writing to CSV, the process of verifying number completes 6 seconds but writing one record at a time on CSV stretch the time consumption to 15s.
How can I optimize the code?
Can I chunk the records and write them in chunks instead of writing one row at a time?
Do work directly in worker goroutine instead of firing off goroutine per task.
Open file output file once. Flush output file once.
func worker(ctx context.Context, dst chan []string, src chan []string) {
for {
select {
case dataArray, ok := <-src: // you must check for readable state of the channel.
if !ok {
return
}
dst <- processNumber(dataArray[0])
case <-ctx.Done(): // if the context is cancelled, quit.
return
}
}
}
func processNumber(number string) []string {
num, e := phonenumbers.Parse(number, "")
if e != nil {
return
}
region := phonenumbers.GetRegionCodeForNumber(num)
carrier, _ := phonenumbers.GetCarrierForNumber(num, "EN")
return []string{phonenumbers.Format(num, phonenumbers.E164), fmt.Sprintf("%v", num.GetCountryCode()), fmt.Sprintf("%v", phonenumbers.GetNumberType(num)), carrier, region}
}
func ParseFile(phone Input, output *PhoneNumber) error {
// create a context
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// that cancels at ctrl+C
go onSignal(os.Interrupt, cancel)
numberOfWorkers := 2
start := time.Now()
csvfile, err := os.Open(phone.File)
if err != nil {
log.Fatal(err)
}
defer csvfile.Close()
reader := csv.NewReader(csvfile)
// create the pair of input/output channels for the controller=>workers com.
src := make(chan []string)
out := make(chan string)
// use a waitgroup to manage synchronization
var wg sync.WaitGroup
// declare the workers
for i := 0; i < numberOfWorkers; i++ {
wg.Add(1)
go func() {
defer wg.Done()
worker(ctx, out, src)
}()
}
// read the csv and write it to src
go func() {
for {
record, err := reader.Read()
if err == io.EOF {
break
} else if err != nil {
log.Fatal(err)
}
src <- record // you might select on ctx.Done().
}
close(src) // close src to signal workers that no more job are incoming.
}()
// wait for worker group to finish and close out
go func() {
wg.Wait() // wait for writers to quit.
close(out) // when you close(out) it breaks the below loop.
}()
path := "sample_all.csv"
file, err := os.Create(path)
if err != nil {
return err
}
defer file.Close()
csvwriter := csv.NewWriter(csvFile)
// drain the output
for res := range out {
csvwriter.Write(res)
}
csvwriter.Flush()
fmt.Printf("\n%2fs", time.Since(start).Seconds())
return nil
}

What happens to a value never consumed from a Buffered channel

I have the following code
func f() {
...
chan := make(chan error, 1)
go func() {
...
chan <- err
}()
err := other_method()
if err != nil {
log(err)
return
}
err <- chan
if err != nil {
log(err)
}
}
What will happen to the value written in the buffered channel if it is never ever read, because the func exited before reading from it? Is this a resource leak I need to care about?

Go channel infinite loop

I am trying to catch errors from a group of goroutines using a channel, but the channel enters an infinite loop, starts consuming CPU.
func UnzipFile(f *bytes.Buffer, location string) error {
zipReader, err := zip.NewReader(bytes.NewReader(f.Bytes()), int64(f.Len()))
if err != nil {
return err
}
if err := os.MkdirAll(location, os.ModePerm); err != nil {
return err
}
errorChannel := make(chan error)
errorList := []error{}
go errorChannelWatch(errorChannel, errorList)
fileWaitGroup := &sync.WaitGroup{}
for _, file := range zipReader.File {
fileWaitGroup.Add(1)
go writeZipFileToLocal(file, location, errorChannel, fileWaitGroup)
}
fileWaitGroup.Wait()
close(errorChannel)
log.Println(errorList)
return nil
}
func errorChannelWatch(ch chan error, list []error) {
for {
select {
case err := <- ch:
list = append(list, err)
}
}
}
func writeZipFileToLocal(file *zip.File, location string, ch chan error, wg *sync.WaitGroup) {
defer wg.Done()
zipFilehandle, err := file.Open()
if err != nil {
ch <- err
return
}
defer zipFilehandle.Close()
if file.FileInfo().IsDir() {
if err := os.MkdirAll(filepath.Join(location, file.Name), os.ModePerm); err != nil {
ch <- err
}
return
}
localFileHandle, err := os.OpenFile(filepath.Join(location, file.Name), os.O_WRONLY|os.O_CREATE|os.O_TRUNC, file.Mode())
if err != nil {
ch <- err
return
}
defer localFileHandle.Close()
if _, err := io.Copy(localFileHandle, zipFilehandle); err != nil {
ch <- err
return
}
ch <- fmt.Errorf("Test error")
}
So I am looping a slice of files and writing them to my disk, when there is an error I report back to the errorChannel to save that error into a slice.
I use a sync.WaitGroup to wait for all goroutines and when they are done I want to print errorList and check if there was any error during the execution.
The list is always empty, even if I add ch <- fmt.Errorf("test") at the end of writeZipFileToLocal and the channel always hangs up.
I am not sure what I am missing here.
1. For the first point, the infinite loop:
Citing from golang language spec:
A receive operation on a closed channel can always proceed
immediately, yielding the element type's zero value after any
previously sent values have been received.
So in this function
func errorChannelWatch(ch chan error, list []error) {
for {
select {
case err := <- ch:
list = append(list, err)
}
}
}
after ch gets closed this turns into an infinite loop adding nil values to list.
Try this instead:
func errorChannelWatch(ch chan error, list []error) {
for err := range ch {
list = append(list, err)
}
}
2. For the second point, why you don't see anything in your error list:
The problem is this call:
errorChannel := make(chan error)
errorList := []error{}
go errorChannelWatch(errorChannel, errorList)
Here you hand errorChannelWatch the errorList as a value. So the slice errorList will not be changed by the function. What is changed, is the underlying array, as long as the append calls don't need to allocate a new one.
To remedy the situation, either hand a slice pointer to errorChannelWatch or rewrite it as a call to a closure, capturing
errorList.
For the first proposed solution, change errorChannelWatch to
func errorChannelWatch(ch chan error, list *[]error) {
for err := range ch {
*list = append(*list, err)
}
}
and the call to
errorChannel := make(chan error)
errorList := []error{}
go errorChannelWatch(errorChannel, &errorList)
For the second proposed solution, just change the call to
errorChannel := make(chan error)
errorList := []error{}
go func() {
for err := range errorChannel {
errorList = append(errorList, err)
}
} ()
3. A minor remark:
One could think, that there is a synchronisation problem here:
fileWaitGroup.Wait()
close(errorChannel)
log.Println(errorList)
How can you be sure, that errorList isn't modified, after the call to close? One could reason, that you can't know, how many values the goroutine errorChannelWatch still has to process.
Your synchronisation seems correct to me, as you do the wg.Done()
after the send to the error channel and so all error values will
be sent, when fileWaitGroup.Wait() returns.
But that can change, if someone later adds a buffering to the error
channel or alters the code.
So I would advise to at least explain the synchronisation in a comment.

Aggregating multiple results from go routines into a single array

I have the following function which spins off a given amount of go routines
func (r *Runner) Execute() {
var wg sync.WaitGroup
wg.Add(len(r.pipelines))
for _, p := range r.pipelines {
go executePipeline(p, &wg)
}
wg.Wait()
errs := ....//contains list of errors reported by any/all go routines
}
I was thinking there might be some way with channels, but I can't seem to figure it out.
One way to do this is using mutexes if you can make executePipeline retuen errors:
// ...
for _, p := range r.pipelines {
go func(p pipelineType) {
if err := executePipeline(p, &wg); err != nil {
mu.Lock()
errs = append(errs, err)
mu.UnLock()
}
}(p)
}
To use a channel, you can have a separate goroutine listning for errors:
errCh := make(chan error)
go func() {
for e := range errCh {
errs = append(errs, e)
}
}
and in the Execute function, make the following changes:
// ...
wg.Add(len(r.pipelines))
for _, p := range r.pipelines {
go func(p pipelineType) {
if err := executePipeline(p, &wg); err != nil {
errCh <- err
}
}(p)
}
wg.Wait()
close(errCh)
You can always use #zerkms method listed above if the number of goroutines is not high.
instead of returning error from executePipleline and using a anonymous function wrapper, you can always make above changes within the function itself.
You can use channels as #Kaveh Shahbazian suggested:
func (r *Runner) Execute() {
pipelineChan := makePipeline(r.pipelines)
for cnt := 0; cnt < len(r.pipelines); cnt++{
//recieve from channel
p := <- pipelineChan
//do something with the result
}
}
func makePipeline(pipelines []pipelineType) <-chan pipelineType{
pipelineChan := make(chan pipelineType)
go func(){
for _, p := range pipelines {
go func(p pipelineType){
pipelineChan <- executePipeline(p)
}(p)
}
}()
return pipelineChan
}
Please see this example: https://gist.github.com/steven-ferrer/9b2eeac3eed3f7667e8976f399d0b8ad

Resources