Golang - Having some trouble with goroutines and channels

I'm kind of new to Golang and I'm trying to develop a program that uploads images asynchronously to imgur. However, I'm having some difficulties with my code.
So this is my task:
func uploadTask(url string, c chan string, d chan string) {
    subtask := upload(url)
    var status string
    var resultURL string
    if subtask != "" {
        status = "Success!"
        resultURL = subtask
    } else {
        status = "Failed!"
        resultURL = subtask
    }
    c <- resultURL
    d <- status
}
And here is my POST request loop for async uploading:
c := make(chan string, len(js.Urls))
d := make(chan string, len(js.Urls))
wg := sync.WaitGroup{}
for i := range js.Urls {
    wg.Add(1)
    go uploadTask(js.Urls[i], c, d)
    // The commented-out code below was slowing down the routine, so I commented it out.
    // It needs to work as well; it does work if I put it in the task instead. I think I'm kind of confused by this one too.
    //pol = append(pol, retro{Url: <-c, Status: <-d})
}
<-c
<-d
wg.Done()
FinishedTime := time.Now().UTC().Format(time.RFC3339)
qwe = append(qwe, outputURLs{
    jobID:        jobID,
    retro:        pol,
    CreateTime:   CreateTime,
    FinishedTime: FinishedTime,
})
fmt.Println(jobID)
So I think my channels and goroutines do not work: it prints out jobID before the upload tasks finish, and the uploads also seem too slow for async uploading.
I know the code is kind of a mess, sorry about that. Any help is highly appreciated! Thanks in advance!

You're actually not using WaitGroup correctly. Every time you call wg.Done(), it subtracts 1 from the counter incremented by wg.Add() to mark a given task as complete. Finally, you need a wg.Wait() to synchronously wait for all tasks. WaitGroups are typically used to fan out multiple tasks running in parallel.
The simplest fix based on your code example is to pass the wg into your task, uploadTask, and call wg.Done() inside the task. Note that you'll also want to pass a pointer rather than the struct value.
The next implementation detail is to call wg.Wait() outside of the loop, because you want to block until all the tasks are complete; all your tasks are run with go, which makes them async. If you don't wg.Wait(), it will log the jobID immediately, like you said. Let me know if that is clear.
As a boilerplate, it should look something like this:
package main

import (
    "fmt"
    "sync"
)

func task(wg *sync.WaitGroup) {
    defer wg.Done() // mark this task as complete
}

func main() {
    wg := &sync.WaitGroup{}
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go task(wg)
    }
    wg.Wait()
    // do something after the tasks are done
    fmt.Println("done")
}
The other thing I want to note is that in your current code example, you're using channels but not doing anything with the values you push into them, so you could technically remove them.
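For example, a minimal sketch of those fixes applied to your snippets (upload, js.Urls, retro, and pol are assumed from the question):

func uploadTask(url string, c chan string, d chan string, wg *sync.WaitGroup) {
    defer wg.Done() // mark this upload as complete
    result := upload(url)
    if result != "" {
        d <- "Success!"
    } else {
        d <- "Failed!"
    }
    c <- result
}

c := make(chan string, len(js.Urls))
d := make(chan string, len(js.Urls))
wg := &sync.WaitGroup{}
for i := range js.Urls {
    wg.Add(1)
    go uploadTask(js.Urls[i], c, d, wg)
}
wg.Wait() // block until every upload has sent its result

close(c)
close(d)
// Note: with two separate channels, the url/status pairs are not guaranteed
// to come from the same upload; a single channel carrying a small struct
// would avoid that.
for url := range c {
    pol = append(pol, retro{Url: url, Status: <-d})
}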

Your code is kind of confusing. But if I understand correctly what you are trying to do: you are processing a list of requests, and you want to return the URL and status of each request and the time each request completed. And you want to process these in parallel.
You don't need to use WaitGroups at all. WaitGroups are good when you just want to run a bunch of tasks without bothering with the results and only want to know when everything is done. But if you are returning results, channels are sufficient.
Here is an example that does what I think you are trying to do:
package main

import (
    "fmt"
    "time"
)

type Result struct {
    URL      string
    Status   string
    Finished string
}

func task(url string, c chan string, d chan string) {
    time.Sleep(time.Second)
    c <- url
    d <- "Success"
}

func main() {
    var results []Result
    urls := []string{"url1", "url2", "url3", "url4", "url5"}

    c := make(chan string, len(urls))
    d := make(chan string, len(urls))

    for _, u := range urls {
        go task(u, c, d)
    }

    for i := 0; i < len(urls); i++ {
        res := Result{}
        res.URL = <-c
        res.Status = <-d
        res.Finished = time.Now().UTC().Format(time.RFC3339)
        results = append(results, res)
    }

    fmt.Println(results)
}
You can try it in the playground https://play.golang.org/p/N3oeA7MyZ8L
That said, this is a bit fragile. You are making the channels the same size as your URL list. This works fine for a few URLs, but if you have a list of a million URLs you will make a rather large channel. You might want to fix the channel buffer size at some reasonable value and check whether the channel is ready for processing before sending your request. This way you also avoid making a million requests all at once.
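For instance, a rough sketch of that bounded version (names here are illustrative; task stands in for the real upload call): a small fixed pool of workers reads from an unbuffered work channel, so at most three requests are in flight no matter how long the list grows.

package main

import (
    "fmt"
    "sync"
)

type Result struct {
    URL    string
    Status string
}

// task stands in for the actual upload request.
func task(url string) Result {
    return Result{URL: url, Status: "Success"}
}

func main() {
    urls := []string{"url1", "url2", "url3", "url4", "url5"}

    work := make(chan string)   // unbuffered: the feeder blocks until a worker is free
    out := make(chan Result, 8) // small fixed buffer, independent of len(urls)

    var wg sync.WaitGroup
    for i := 0; i < 3; i++ { // at most 3 concurrent requests
        wg.Add(1)
        go func() {
            defer wg.Done()
            for u := range work {
                out <- task(u)
            }
        }()
    }

    // feed the work, then shut everything down in order
    go func() {
        for _, u := range urls {
            work <- u
        }
        close(work)
        wg.Wait()
        close(out)
    }()

    var results []Result
    for r := range out {
        results = append(results, r)
    }
    fmt.Println(results)
}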

Related

How to deal with files in batches using goroutines?

Assuming I have a bunch of files to deal with (say 1000 or more): first they should be processed by function A(), which will generate a file; then this file will be processed by B().
If we do it one by one, that's too slow, so I'm thinking of processing 5 files at a time using goroutines (we cannot process too many at a time because the CPU can't bear it).
I'm a newbie in golang and I'm not sure if my thinking is correct. I think function A() is a producer and function B() is a consumer; B() will deal with the file produced by A(). I wrote some code below; forgive me, I really don't know how to write this code. Can anyone give me some help? Thank you in advance!
package main

import "fmt"

var Box = make(chan string, 1024)

func A(file string) {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1"
    Box <- fileGenByA
}

func B(file string) {
    fmt.Println(file, "is processing in func B()...")
}

func main() {
    // assuming that this is the file list read from a directory
    fileList := []string{
        "/path/to/file1",
        "/path/to/file2",
        "/path/to/file3",
    }

    // it seems I can't do this, because fileList may have 1000 or more files
    for _, v := range fileList {
        go A(v)
    }

    // can I do this?
    for file := range Box {
        go B(file)
    }
}
Update:
Sorry, maybe I haven't made myself clear. Actually, the file generated by function A() is stored on the hard disk (it is generated by a command-line tool that I simply execute using exec.Command()), not in a variable (in memory), so it doesn't have to be passed to function B() immediately.
I think there are two approaches (originally illustrated with two attached sketches): approach1 and approach2.
Actually I prefer approach2. As you can see, the first B() doesn't have to process file1GenByA; it's the same for B() to process any file in the box, because file1GenByA may be generated after file2GenByA (maybe that file is larger, so it takes more time).
You could spawn 5 goroutines that read from a work channel. That way you have 5 goroutines running at all times and don't need to batch them, so you don't have to wait until 5 are finished before starting the next 5.
package main

import (
    "fmt"
    "sync"
)

// Stubs standing in for the question's A and B: A generates a file
// from the input, B processes the file generated by A.
func A(s string) string { return s + "-gen" }
func B(s string) string { return s + "-processed" }

func main() {
    stack := []string{"a", "b", "c", "d", "e", "f", "g", "h"}

    work := make(chan string)
    results := make(chan string)

    // create 5 worker goroutines
    wg := sync.WaitGroup{}
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for s := range work {
                results <- B(A(s))
            }
        }()
    }

    // send the work to the workers;
    // this happens in a goroutine in order
    // to not block the main function once
    // all 5 workers are busy
    go func() {
        for _, s := range stack {
            // could read the file from disk
            // here and pass a pointer to the file
            work <- s
        }
        // close the work channel after
        // all the work has been sent
        close(work)
        // wait for the workers to finish,
        // then close the results channel
        wg.Wait()
        close(results)
    }()

    // collect the results;
    // the iteration stops once the results
    // channel is closed and the last value
    // has been received
    for result := range results {
        // could write the file to disk
        fmt.Println(result)
    }
}
https://play.golang.com/p/K-KVX4LEEoK
You're halfway there. There are a few things you need to fix:
Your program deadlocks because nothing closes Box, so the main function can never finish ranging over it.
You aren't waiting for your goroutines to finish, and you're starting more than 5 goroutines. (The solutions to these are too intertwined to describe separately.)
1. Deadlock
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
main.main()
When you range over a channel, you read each value from the channel until it is both closed and empty. Since you never close the channel, the range over that channel can never complete, and the program can never finish.
This is a fairly easy problem to solve in your case: we just need to close the channel when we know there will be no more writes to the channel.
for _, v := range fileList {
    go A(v)
}
close(Box)
Keep in mind that closing a channel doesn't stop it from being read, only written. Now consumers can distinguish between an empty channel that may receive more data in the future and an empty channel that will never receive more data.
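A tiny sketch of that distinction, using the comma-ok form of receive:

package main

import "fmt"

func main() {
    ch := make(chan string, 1)
    ch <- "one"
    close(ch)

    v, ok := <-ch
    fmt.Println(v, ok) // "one true": a buffered value is still readable after close
    v, ok = <-ch
    fmt.Println(v, ok) // " false": closed and drained, so we get the zero value
}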
Once you add the close(Box), the program doesn't deadlock anymore, but it still doesn't work.
2. Too Many Goroutines and not waiting for them to complete
To run a certain maximum number of concurrent executions, instead of creating a goroutine for each input, create the goroutines in a "worker pool":
Create a channel to pass the workers their work.
Create a channel for the goroutines to return their results, if any.
Start the number of goroutines you want.
Start at least one additional goroutine to either dispatch work or collect the results, so you don't have to try doing both from the main goroutine.
Use a sync.WaitGroup to wait for all data to be processed.
Close the channels to signal to the workers and the results collector that their channels are done being filled.
Before we get into the implementation, let's talk about how A and B interact.
first they should be processed by function A(), function A() will generate a file, then this file will be processed by B().
A() and B() must, then, execute serially. They can still pass their data through a channel, but since their execution must be serial, it gains you nothing. It's simpler to run them sequentially in the workers. For that, we'll need to change A() to either call B() itself, or to return the path so the worker can call B(). I chose the latter.
func A(file string) string {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1"
    return fileGenByA
}
Before we write our worker function, we must also consider the result of B. Currently, B returns nothing. In the real world, unless B() cannot fail, you would want it to at least return an error, or panic. I'll skip over collecting results for now.
Now we can write our worker function.
func worker(wg *sync.WaitGroup, incoming <-chan string) {
    defer wg.Done()
    for file := range incoming {
        B(A(file))
    }
}
Now all we have to do is start 5 such workers, write the incoming files to the channel, close it, and wg.Wait() for the workers to complete.
incoming_work := make(chan string)
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
    wg.Add(1)
    go worker(&wg, incoming_work)
}
for _, v := range fileList {
    incoming_work <- v
}
close(incoming_work)
wg.Wait()
Full example at https://go.dev/play/p/A1H4ArD2LD8
Returning Results.
It's all well and good to be able to kick off goroutines and wait for them to complete. But what if you need results back from your goroutines? In all but the simplest of cases, you would at least want to know if files failed to process so you could investigate the errors.
We have only 5 workers, but we have many files, so we have many results. Each worker will have to return several results. So, another channel. It's usually worth defining a struct for your return:
type result struct {
    file string
    err  error
}
This tells us not just whether there was an error, but also clearly identifies the file from which the error resulted.
How will we test an error case in our current code? In your example, B always gets the same value from A. If we add A's incoming file name to the path it passes to B, we can mock an error based on a substring. My mocked error will be that file3 fails.
func A(file string) string {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1/" + file
    return fileGenByA
}

func B(file string) (r result) {
    r.file = file
    fmt.Println(file, "is processing in func B()...")
    if strings.Contains(file, "file3") {
        r.err = fmt.Errorf("Test error")
    }
    return
}
Our workers will be sending results, but we need to collect them somewhere. main() is busy dispatching work to the workers, blocking on its write to incoming_work whenever the workers are all busy. So the simplest place to collect the results is another goroutine. Our results collector has to read from a results channel, print out errors for debugging, and then return the total number of failures so our program can exit with a final status indicating overall success or failure.
failures_chan := make(chan int)
go func() {
    var failures int
    for result := range results {
        if result.err != nil {
            failures++
            fmt.Printf("File %s failed: %s", result.file, result.err.Error())
        }
    }
    failures_chan <- failures
}()
Now we have another channel to close, and it's important we close it after all workers are done. So we close(results) after we wg.Wait() for the workers.
close(incoming_work)
wg.Wait()
close(results)

if failures := <-failures_chan; failures > 0 {
    os.Exit(1)
}
Putting all that together, we end up with this code:
package main

import (
    "fmt"
    "os"
    "strings"
    "sync"
)

func A(file string) string {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1/" + file
    return fileGenByA
}

func B(file string) (r result) {
    r.file = file
    fmt.Println(file, "is processing in func B()...")
    if strings.Contains(file, "file3") {
        r.err = fmt.Errorf("Test error")
    }
    return
}

func worker(wg *sync.WaitGroup, incoming <-chan string, results chan<- result) {
    defer wg.Done()
    for file := range incoming {
        results <- B(A(file))
    }
}

type result struct {
    file string
    err  error
}

func main() {
    // assuming that this is the file list read from a directory
    fileList := []string{
        "/path/to/file1",
        "/path/to/file2",
        "/path/to/file3",
    }

    incoming_work := make(chan string)
    results := make(chan result)

    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go worker(&wg, incoming_work, results)
    }

    failures_chan := make(chan int)
    go func() {
        var failures int
        for result := range results {
            if result.err != nil {
                failures++
                fmt.Printf("File %s failed: %s", result.file, result.err.Error())
            }
        }
        failures_chan <- failures
    }()

    for _, v := range fileList {
        incoming_work <- v
    }
    close(incoming_work)
    wg.Wait()
    close(results)

    if failures := <-failures_chan; failures > 0 {
        os.Exit(1)
    }
}
And when we run it, we get:
/path/to/file1 is processing in func A()...
/path/to/fileGenByA1//path/to/file1 is processing in func B()...
/path/to/file2 is processing in func A()...
/path/to/fileGenByA1//path/to/file2 is processing in func B()...
/path/to/file3 is processing in func A()...
/path/to/fileGenByA1//path/to/file3 is processing in func B()...
File /path/to/fileGenByA1//path/to/file3 failed: Test error
Program exited.
A final thought: buffered channels.
There is nothing wrong with buffered channels. Especially if you know the overall size of incoming work and results, buffered channels can obviate the results collector goroutine because you can allocate a buffered channel big enough to hold all results. However, I think it's more straightforward to understand this pattern if the channels are unbuffered. The key takeaway is that you don't need to know the number of incoming or outgoing results, which could indeed be different numbers or based on something that can't be predetermined.
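For completeness, a sketch of that buffered variant, reusing fileList, worker, and result from the full example above: size both channels to len(fileList) and drop the collector goroutine entirely.

incoming_work := make(chan string, len(fileList))
results := make(chan result, len(fileList))

var wg sync.WaitGroup
for i := 0; i < 5; i++ {
    wg.Add(1)
    go worker(&wg, incoming_work, results)
}

// With room for every file in the buffer, these sends never block.
for _, v := range fileList {
    incoming_work <- v
}
close(incoming_work)

wg.Wait()
close(results)

// Drain the buffered results directly in main; no collector goroutine needed.
failures := 0
for r := range results {
    if r.err != nil {
        failures++
        fmt.Printf("File %s failed: %s\n", r.file, r.err.Error())
    }
}
if failures > 0 {
    os.Exit(1)
}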

Closing channel when all workers have finished

I am implementing a web crawler and I have a Parse function that takes a link as input and should return all the links contained in the page.
I would like to make the most of goroutines to make it as fast as possible. To do so, I want to create a pool of workers.
I set up a channel of strings representing the links, links := make(chan string), and pass it as an argument to the Parse function. I want the workers to communicate through this single channel. When the function starts, it takes a link from links, parses it, and for each valid link found in the page, adds the link to links.
func Parse(links chan string) {
    l := <-links
    // If link already parsed, return.
    // ... parse l and collect newUrlFounds ...
    for _, url := range newUrlFounds {
        links <- url
    }
}
However, the main issue here is indicating when no more links have been found. One way I thought of doing it was to wait until all workers have completed, but I don't know how to do that in Go.
As Tim already commented, don't use the same channel for reading and writing in a worker. This will deadlock eventually (even if buffered, because Murphy).
A far simpler design is to simply launch one goroutine per URL. A buffered channel can serve as a simple semaphore to limit the number of concurrent parsers (goroutines that don't do anything because they are blocked are usually negligible). Use a sync.WaitGroup to wait until all the work is done.
package main

import (
    "sync"
)

func main() {
    sem := make(chan struct{}, 10) // allow ten concurrent parsers
    wg := &sync.WaitGroup{}

    wg.Add(1)
    Parse("http://example.com", sem, wg)

    wg.Wait()
    // all done
}

func Parse(u string, sem chan struct{}, wg *sync.WaitGroup) {
    defer wg.Done()

    sem <- struct{}{}        // grab a semaphore slot
    defer func() { <-sem }() // release it

    // If URL already parsed, return.

    var newURLs []string
    // ... fetch the page and fill newURLs ...

    for _, u := range newURLs {
        wg.Add(1)
        go Parse(u, sem, wg)
    }
}

How to allow multiple objects to get data from a single goroutine

I have a case where I want to spin up a goroutine that will fetch some data from a source periodically. If a call fails, it stores the error until the next call succeeds. Now, there are several places in the code where an instance would access this data pulled by the goroutine. How can I implement something like that?
UPDATE
I have had some sleep and coffee in me, and I think I need to rephrase the problem more coherently using Java-ish semantics.
I have come up with a basic singleton pattern that returns an interface implementation which internally runs a goroutine in a forever loop (let's put the cardinal sin of forever loops aside for a moment). The problem is that this interface implementation is accessed by multiple threads to get the data collected by the goroutine. Essentially, the data is pulled every 10 minutes by the goroutine and then requested an infinite number of times. How can I implement something like that?
Here's a very basic example of how you can periodically fetch and collect data.
Have in mind: running this code as-is will do nothing, as main will return before anything really happens, and how you handle that depends on your specific use case. This code is really bare-bones and needs improvements. It is a sketch of a possible solution to a part of your problem :)
I didn't handle errors here, but you could handle them the same way the fetched data is handled (so, one more chan for errors and one more goroutine to read from it); there is a sketch of that after the code.
func main() {
    period := time.Second
    respChan := make(chan string)
    cancelChan := make(chan struct{})
    var dataCollection []string

    // periodically fetch data and send it to respChan
    go func(period time.Duration, respChan chan string, cancelChan chan struct{}) {
        ticker := time.NewTicker(period)
        for {
            select {
            case <-ticker.C:
                go fetchData(respChan)
            case <-cancelChan:
                // close respChan to stop the reading goroutine
                close(respChan)
                return
            }
        }
    }(period, respChan, cancelChan)

    // read from respChan and write to dataCollection
    go func(respChan chan string) {
        for data := range respChan {
            dataCollection = append(dataCollection, data)
        }
    }(respChan)

    // close cancelChan to gracefully stop the app
    // close(cancelChan)
}

func fetchData(respChan chan string) {
    data := "fetched data"
    respChan <- data
}
You can use a channel for that, but then you would push data rather than pull. I guess that wouldn't be a problem.
var channelXY = make(chan struct{}, 5000) // adjust the buffer to your needs: if pushes are much faster than pulls, you need to size the buffer accordingly

wg.Add(1)
go func(channelXY <-chan struct{}) {
    defer wg.Done()
    for range channelXY {
        // DO STUFF
    }
}(channelXY)

go func() {
    channelXY <- struct{}{}
}()
Remember to manage all routines with a WaitGroup (wg above); otherwise your program may end before all routines are done.
EDIT: Close the channel to stop the channel-read goroutine:
close(channelXY)

Why is my code causing a stall or race condition?

For some reason, once I started adding strings through a channel in my goroutine, the code stalls when I run it. I thought it was a scope/closure issue, so I moved all the code directly into the function, to no avail. I have looked through Golang's documentation and all the examples look similar to mine, so I am kind of clueless as to what is going wrong.
func getPage(url string, c chan<- string, swg sizedwaitgroup.SizedWaitGroup) {
    defer swg.Done()
    doc, err := goquery.NewDocument(url)
    if err != nil {
        fmt.Println(err)
    }
    nodes := doc.Find(".v-card .info")
    for i := range nodes.Nodes {
        el := nodes.Eq(i)
        var name string
        if el.Find("h3.n span").Size() != 0 {
            name = el.Find("h3.n span").Text()
        } else if el.Find("h3.n").Size() != 0 {
            name = el.Find("h3.n").Text()
        }
        address := el.Find(".adr").Text()
        phoneNumber := el.Find(".phone.primary").Text()
        website, _ := el.Find(".track-visit-website").Attr("href")
        //c <- map[string]string{"name": name, "address": address, "Phone Number": phoneNumber, "website": website}
        c <- fmt.Sprint("%s%s%s%s", name, address, phoneNumber, website)
        fmt.Println([]string{name, address, phoneNumber, website})
    }
}

func getNumPages(url string) int {
    doc, err := goquery.NewDocument(url)
    if err != nil {
        fmt.Println(err)
    }
    pagination := strings.Split(doc.Find(".pagination p").Contents().Eq(1).Text(), " ")
    numItems, _ := strconv.Atoi(pagination[len(pagination)-1])
    return int(math.Ceil(float64(numItems) / 30))
}

func main() {
    arrChan := make(chan string)
    swg := sizedwaitgroup.New(8)
    zips := []string{"78705", "78710", "78715"}
    for _, item := range zips {
        swg.Add()
        go getPage(fmt.Sprintf(base_url, item, 1), arrChan, swg)
    }
    swg.Wait()
}
Edit: So I fixed it by passing the sizedwaitgroup by reference, but when I remove the buffer it doesn't work. Does that mean I need to know how many elements will be sent to the channel in advance?
Issue
Building off of Colin Stewart's answer: from the code you have posted, as far as I can tell, your issue is actually with reading your arrChan. You write into it, but there's no place in your code where you read from it.
From the documentation:
If the channel is unbuffered, the sender blocks until the receiver has received the value. If the channel has a buffer, the sender blocks only until the value has been copied to the buffer; if the buffer is full, this means waiting until some receiver has retrieved a value.
By making the channel buffered, what's happening is that your code no longer blocks on the channel write operations, i.e. the line that looks like:
c <- fmt.Sprint("%s%s%s%s", name, address, phoneNumber, website)
My guess is that if you're still hanging even when the channel has a size of 5000, it's because you have more than 5000 values returned across all of your loops over nodes.Nodes. Once your buffered channel is full, the write operations block until the channel has space, just like writing to an unbuffered channel.
Fix
Here's a minimal example showing how you would fix something like this (basically, just add a reader):
package main

import "sync"

func getPage(item string, c chan<- string) {
    c <- item
}

func readChannel(c <-chan string) {
    for {
        <-c
    }
}

func main() {
    arrChan := make(chan string)
    wg := sync.WaitGroup{}
    zips := []string{"78705", "78710", "78715"}
    for _, item := range zips {
        wg.Add(1)
        go func(item string) {
            defer wg.Done()
            getPage(item, arrChan)
        }(item) // pass item as an argument to avoid capturing the loop variable
    }
    go readChannel(arrChan) // comment this out and you'll deadlock
    wg.Wait()
}
Your channel has no buffer, so writes will block until the value can be read, and at least in the code you have posted, there are no readers.
You don't need to know the size to make it work. But you might need it in order to exit cleanly, which can be a bit tricky to observe at times, because your program exits once your main function exits, and all goroutines still running are killed immediately, finished or not.
As a warm-up example, change readChannel in photoionized's response to this:
func readChannel(c <-chan string) {
    for {
        url := <-c
        fmt.Println(url)
    }
}
It only adds printing to the original code, but now you'll see better what is actually happening. Notice how it usually prints only two strings when the code actually writes 3. This is because the code exits once all writing goroutines finish, and the reading goroutine is aborted midway as a result. You can "fix" it by removing the go before readChannel (which is the same as reading the channel in the main function). Then you'll see 3 strings printed, but the program crashes with a deadlock, as readChannel is still reading from the channel but nobody writes into it anymore. You can fix that too by reading exactly 3 strings in readChannel(), but that requires knowing how many strings you expect to receive.
Here is my minimal working example (I'll use it to illustrate the rest):
package main

import (
    "fmt"
    "sync"
)

func getPage(url string, c chan<- string, wg *sync.WaitGroup) {
    defer wg.Done()
    c <- fmt.Sprintf("Got page for %s\n", url)
}

func readChannel(c chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    var url string
    ok := true
    for ok {
        url, ok = <-c
        if ok {
            fmt.Printf("Received: %s\n", url)
        } else {
            fmt.Println("Exiting readChannel")
        }
    }
}

func main() {
    arrChan := make(chan string)
    var swg sync.WaitGroup
    base_url := "http://test/%s/%d"
    zips := []string{"78705", "78710", "78715"}
    for _, item := range zips {
        swg.Add(1)
        go getPage(fmt.Sprintf(base_url, item, 1), arrChan, &swg)
    }

    var wg2 sync.WaitGroup
    wg2.Add(1)
    go readChannel(arrChan, &wg2)

    swg.Wait()
    // All written, signal end to readChannel by closing the channel
    close(arrChan)
    wg2.Wait()
}
Here I close the channel to signal to readChannel that there is nothing left to read, so it can exit cleanly at the proper time. But sometimes you might instead want to tell readChannel to read exactly 3 strings and finish. Or maybe you would want to start one reader per writer, with each reader reading exactly one string... Well, there are many ways to skin a cat, and the choice is all yours.
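For the read-exactly-N variant, a minimal sketch reusing the setup above:

func readChannelN(c <-chan string, n int, wg *sync.WaitGroup) {
    defer wg.Done()
    for i := 0; i < n; i++ { // read exactly n strings, then exit cleanly
        fmt.Printf("Received: %s\n", <-c)
    }
}

Start it with go readChannelN(arrChan, len(zips), &wg2); you then no longer need the close(arrChan), but you do need to know the count up front.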
Note: if you remove the wg2.Wait() line, this code becomes equivalent to photoionized's response and will only print two strings whilst writing 3. This is because the code exits once all writers finish (ensured by swg.Wait()), but it does not wait for readChannel to finish.
If you remove the close(arrChan) line instead, the code will crash with a deadlock after printing 3 lines, as it waits for readChannel to finish, but readChannel waits to read from a channel that nobody writes to anymore.
If you just remove the go before the readChannel call, it becomes equivalent to reading from the channel inside the main function. It will again crash with a deadlock after printing 3 strings, because readChannel keeps reading after all the writers have finished (and readChannel has already read everything they wrote). A tricky point here is that the swg.Wait() line will never be reached by this code, as readChannel never exits.
If you move the readChannel call after swg.Wait(), the code will crash before even printing a single string. But this is a different deadlock. This time the code reaches swg.Wait() and stops there, waiting for the writers. The first writer succeeds, but the channel is not buffered, so the next writer blocks until someone reads the already-written data from the channel. The trouble is, nobody reads from the channel yet, as readChannel has not been called. So the code stalls and crashes with a deadlock. This particular issue can be "fixed" by making the channel buffered, as in make(chan string, 3), since that allows writers to keep writing even though nobody is reading from the channel yet. And sometimes this is what you want. But here again you have to know the maximum number of messages that will ever be in the channel buffer. And most of the time that merely defers the problem: just add one more writer and you are back where you started, with the code stalling and crashing because the channel buffer is full and that one extra writer is waiting for someone to read from it.
Well, this should cover all bases. So check your code and see which case is yours.

Reading a file concurrently

The reading part isn't concurrent, but the processing is. I phrased the title this way because I'm most likely to search for this problem again using that phrase. :)
I'm getting a deadlock after trying to go beyond the examples, so this is a learning experience for me. My goals are these:
Read a file line by line (eventually use a buffer to do groups of lines).
Pass off the text to a func() that does some regex work.
Send the results somewhere, but avoid mutexes or shared variables. I'm sending ints (always the number 1) to a channel. It's sort of silly, but if it's not causing problems I'd like to leave it like this unless you folks have a neater option.
Use a worker pool to do this. I'm not sure how to tell the workers to requeue themselves?
Here is the playground link. I tried to write helpful comments; hopefully this makes sense. My design could be completely wrong, so don't hesitate to refactor.
package main

import (
    "bufio"
    "fmt"
    "regexp"
    "strings"
    "sync"
)

func telephoneNumbersInFile(path string) int {
    file := strings.NewReader(path)
    var telephone = regexp.MustCompile(`\(\d+\)\s\d+-\d+`)

    // do I need buffered channels here?
    jobs := make(chan string)
    results := make(chan int)

    // I think we need a wait group, not sure.
    wg := new(sync.WaitGroup)

    // start up some workers that will block and wait?
    for w := 1; w <= 3; w++ {
        wg.Add(1)
        go matchTelephoneNumbers(jobs, results, wg, telephone)
    }

    // go over a file line by line and queue up a ton of work
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        // Later I want to create a buffer of lines, not just line-by-line here ...
        jobs <- scanner.Text()
    }
    close(jobs)
    wg.Wait()

    // Add up the results from the results channel.
    // The rest of this isn't even working so ignore for now.
    counts := 0
    // for v := range results {
    //     counts += v
    // }
    return counts
}

func matchTelephoneNumbers(jobs <-chan string, results chan<- int, wg *sync.WaitGroup, telephone *regexp.Regexp) {
    // Decreasing internal counter for wait-group as soon as goroutine finishes
    defer wg.Done()

    // eventually I want to have a []string channel to work on a chunk of lines not just one line of text
    for j := range jobs {
        if telephone.MatchString(j) {
            results <- 1
        }
    }
}

func main() {
    // An artificial input source. Normally this is a file passed on the command line.
    const input = "Foo\n(555) 123-3456\nBar\nBaz"
    numberOfTelephoneNumbers := telephoneNumbersInFile(input)
    fmt.Println(numberOfTelephoneNumbers)
}
You're almost there, you just need a little bit of work on the goroutines' synchronisation. Your problem is that you're trying to feed the parser and collect the results in the same routine, but that can't be done.
I propose the following:
Run the scanner in a separate routine and close the input channel once everything is read.
Run a separate routine waiting for the parsers to finish their job, then close the output channel.
Collect all the results in your main routine.
The relevant changes could look like this:
// Go over a file line by line and queue up a ton of work
go func() {
scanner := bufio.NewScanner(file)
for scanner.Scan() {
jobs <- scanner.Text()
}
close(jobs)
}()
// Collect all the results...
// First, make sure we close the result channel when everything was processed
go func() {
wg.Wait()
close(results)
}()
// Now, add up the results from the results channel until closed
counts := 0
for v := range results {
counts += v
}
Fully working example on the playground: http://play.golang.org/p/coja1_w-fY
Worth adding: you don't necessarily need the WaitGroup to achieve the same; all you need to know is when to stop receiving results. This could be achieved, for example, by the scanner advertising (on a channel) how many lines were read, and the collector reading only the specified number of results (you would need to send zeros for non-matching lines as well, though).
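For illustration, a rough sketch of that counting variant (assuming the worker is changed to send a 0 for every non-matching line, so each line yields exactly one result, and the WaitGroup is dropped entirely):

// Worker: one result per line, zeros included.
func matchTelephoneNumbers(jobs <-chan string, results chan<- int, telephone *regexp.Regexp) {
    for j := range jobs {
        if telephone.MatchString(j) {
            results <- 1
        } else {
            results <- 0 // zeros keep the result count equal to the line count
        }
    }
}

// In telephoneNumbersInFile: the scanner goroutine advertises how many
// lines it queued, and the collector drains results until it has seen
// exactly that many.
lineCount := make(chan int, 1)
go func() {
    n := 0
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        jobs <- scanner.Text()
        n++
    }
    close(jobs)
    lineCount <- n // advertise the expected number of results
}()

counts, received, total := 0, 0, -1
for total < 0 || received < total {
    select {
    case total = <-lineCount: // the scanner finished; now we know when to stop
    case v := <-results:
        counts += v
        received++
    }
}
return counts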
Edit: The answer by #tomasz above is the correct one. Please disregard this answer.
You need to do two things:
use buffered channels so that sending doesn't block
close the results channel so that receiving doesn't block
The use of buffered channels is essential because unbuffered channels need a receive for each send, which is what causes the deadlock you're hitting.
If you fix that, you'll run into a deadlock when you try to receive the results, because results hasn't been closed.
Here's the fixed playground: http://play.golang.org/p/DtS8Matgi5
