The reading part isn't concurrent but the processing is. I phrased the title this way because I'm most likely to search for this problem again using that phrase. :)
I'm getting a deadlock after trying to go beyond the examples so this is a learning experience for me. My goals are these:
Read a file line by line (eventually use a buffer to do groups of lines).
Pass off the text to a func() that does some regex work.
Send the results somewhere but avoid mutexes or shared variables. I'm sending ints (always the number 1) to a channel. It's sort of silly but if it's not causing problems I'd like to leave it like this unless you folks have a neater option.
Use a worker pool to do this. I'm not sure how to tell the workers to requeue themselves.
Here is the playground link. I tried to write helpful comments, hopefully this makes sense. My design could be completely wrong so don't hesitate to refactor.
package main

import (
    "bufio"
    "fmt"
    "regexp"
    "strings"
    "sync"
)

func telephoneNumbersInFile(path string) int {
    file := strings.NewReader(path)
    var telephone = regexp.MustCompile(`\(\d+\)\s\d+-\d+`)

    // do I need buffered channels here?
    jobs := make(chan string)
    results := make(chan int)

    // I think we need a wait group, not sure.
    wg := new(sync.WaitGroup)

    // start up some workers that will block and wait?
    for w := 1; w <= 3; w++ {
        wg.Add(1)
        go matchTelephoneNumbers(jobs, results, wg, telephone)
    }

    // go over a file line by line and queue up a ton of work
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        // Later I want to create a buffer of lines, not just line-by-line here ...
        jobs <- scanner.Text()
    }
    close(jobs)
    wg.Wait()

    // Add up the results from the results channel.
    // The rest of this isn't even working so ignore for now.
    counts := 0
    // for v := range results {
    //     counts += v
    // }
    return counts
}

func matchTelephoneNumbers(jobs <-chan string, results chan<- int, wg *sync.WaitGroup, telephone *regexp.Regexp) {
    // Decreasing internal counter for wait-group as soon as goroutine finishes
    defer wg.Done()

    // eventually I want to have a []string channel to work on a chunk of lines not just one line of text
    for j := range jobs {
        if telephone.MatchString(j) {
            results <- 1
        }
    }
}

func main() {
    // An artificial input source. Normally this is a file passed on the command line.
    const input = "Foo\n(555) 123-3456\nBar\nBaz"
    numberOfTelephoneNumbers := telephoneNumbersInFile(input)
    fmt.Println(numberOfTelephoneNumbers)
}
You're almost there, just need a little bit of work on goroutines' synchronisation. Your problem is that you're trying to feed the parser and collect the results in the same routine, but that can't be done.
I propose the following:
Run the scanner in a separate goroutine, closing the input channel once everything is read.
Run another goroutine waiting for the parsers to finish their job, then close the output channel.
Collect all the results in your main goroutine.
The relevant changes could look like this:
// Go over a file line by line and queue up a ton of work
go func() {
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        jobs <- scanner.Text()
    }
    close(jobs)
}()

// Collect all the results...
// First, make sure we close the result channel when everything was processed
go func() {
    wg.Wait()
    close(results)
}()

// Now, add up the results from the results channel until closed
counts := 0
for v := range results {
    counts += v
}
Fully working example on the playground: http://play.golang.org/p/coja1_w-fY
It's worth adding that you don't necessarily need the WaitGroup to achieve this; all you need to know is when to stop receiving results. This could be achieved, for example, by the scanner advertising (on a channel) how many lines were read, and the collector reading only that number of results (you would need to send zeros for non-matching lines, though).
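For illustration, here is a minimal, untested sketch of that counting variant, reusing the file, jobs and results channels from above; lineCount is a made-up name, and matchTelephoneNumbers would have to send a 0 for every non-matching line so that exactly one result arrives per line:

lineCount := make(chan int, 1) // the scanner reports the total here

go func() {
    n := 0
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        jobs <- scanner.Text()
        n++
    }
    close(jobs)
    lineCount <- n
}()

// Drain results while waiting for the total, then stop once
// exactly that many results have been received.
counts, received, total := 0, 0, -1
for total < 0 || received < total {
    select {
    case n := <-lineCount:
        total = n
    case v := <-results:
        counts += v
        received++
    }
}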
Edit: The answer by @tomasz above is the correct one. Please disregard this answer.
You need to do two things:
use buffered channels so that sending doesn't block
close the results channel so that receiving doesn't block
The use of buffered channels is essential because unbuffered channels need a receive for each send, which is causing the deadlock you're hitting.
If you fix that, you'll run into a deadlock when you try to receive the results, because results hasn't been closed.
Here's the fixed playground: http://play.golang.org/p/DtS8Matgi5
Related
I'm kinda new to Golang and trying to develop a program that uploads images asynchronously to imgur. However, I'm having some difficulties with my code.
So this is my task:

func uploadT(url string, c chan string, d chan string) {
    var subtask string
    subtask = upload(url)
    var status string
    if subtask != "" {
        status = "Success!"
        url = subtask
    } else {
        status = "Failed!"
        url = subtask
    }
    c <- url
    d <- status
}
And here is my POST request loop for async uploading:

c := make(chan string, len(js.Urls))
d := make(chan string, len(js.Urls))
wg := sync.WaitGroup{}
for i := range js.Urls {
    wg.Add(1)
    go uploadT(js.Urls[i], c, d)
    // Below commented-out code is slowing down the routine, therefore I commented it out.
    // It needs to be working as well; however, it can work if I put this in the task too. I think I'm kinda confused with this one as well.
    //pol = append(pol, retro{Url: <-c, Status: <-d})
}
<-c
<-d
wg.Done()
FinishedTime := time.Now().UTC().Format(time.RFC3339)
qwe = append(qwe, outputURLs{
    jobID:        jobID,
    retro:        pol,
    CreateTime:   CreateTime,
    FinishedTime: FinishedTime,
})
fmt.Println(jobID)
So I think my channels and goroutines do not work: it prints out jobID before the upload tasks finish, and the uploads also seem too slow for async uploading.
I know the code is kind of a mess, sorry for that. Any help is highly appreciated! Thanks in advance!
You're actually not using WaitGroup correctly. Every time you call wg.Done(), it subtracts 1 from the count added by a previous wg.Add(), marking a given task as complete. Finally, you need wg.Wait() to synchronously wait for all tasks. WaitGroups are typically for fan-out usage, running multiple tasks in parallel.
The simplest way, based on your code example, is to pass the wg into your task, uploadT, and call wg.Done() inside the task. Note that you'll also want to pass a pointer instead of the struct value.
The next implementation detail is to call wg.Wait() outside of the loop, because you want to block until all the tasks are complete; all your tasks are run with go, which makes them async. If you don't wg.Wait(), it will log the jobID immediately, like you said. Let me know if that is clear.
As a boilerplate, it should look something like this:

func task(wg *sync.WaitGroup) {
    defer wg.Done() // mark this task complete when it returns
}

wg := &sync.WaitGroup{}
for i := 0; i < 10; i++ {
    wg.Add(1)
    go task(wg)
}
wg.Wait()

// do something after the tasks are done
fmt.Println("done")
The other thing I want to note is that in your current code example you're using channels, but you're not doing anything with the values you push into them, so you can technically remove them (see the sketch below).
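To make that concrete, here is a rough sketch (not a drop-in fix) of the loop with the channels removed, reusing upload and js.Urls from the question and assuming you only care about completion:

var wg sync.WaitGroup
for _, u := range js.Urls {
    wg.Add(1)
    go func(url string) {
        defer wg.Done()
        upload(url) // result ignored; add channels back if you need it
    }(u)
}
wg.Wait()
fmt.Println(jobID) // now prints only after every upload has finished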
Your code is kinda confusing. But if I understand correctly what you are trying to do: you are processing a list of requests, and you want to return the url and status of each request and the time each request completed. And you want to process these in parallel.
You don't need to use WaitGroups at all. WaitGroups are good when you just want to run a bunch of tasks without bothering with the results, just want to know when everything is done. But if you are returning results, channels are sufficient.
Here is example code that does what I think you are trying to do:
package main

import (
    "fmt"
    "time"
)

type Result struct {
    URL      string
    Status   string
    Finished string
}

func task(url string, c chan string, d chan string) {
    time.Sleep(time.Second)
    c <- url
    d <- "Success"
}

func main() {
    var results []Result
    urls := []string{"url1", "url2", "url3", "url4", "url5"}
    c := make(chan string, len(urls))
    d := make(chan string, len(urls))
    for _, u := range urls {
        go task(u, c, d)
    }
    for i := 0; i < len(urls); i++ {
        res := Result{}
        res.URL = <-c
        res.Status = <-d
        res.Finished = time.Now().UTC().Format(time.RFC3339)
        results = append(results, res)
    }
    fmt.Println(results)
}
You can try it in the playground https://play.golang.org/p/N3oeA7MyZ8L
That said, this is a bit fragile. You are making the channels the same size as your url list. That works fine for a few urls, but if you have a list of a million urls you will create a rather large channel. You might want to fix the channel buffer size to some reasonable value and check whether the channel is ready before sending a request; that way you avoid making a million requests all at once. One way to bound the work is sketched below.
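A common way to cap the number of in-flight requests is a semaphore channel; here is a rough sketch on top of the example above (maxInFlight and sem are my names):

maxInFlight := 10
sem := make(chan struct{}, maxInFlight)

for _, u := range urls {
    sem <- struct{}{} // blocks once maxInFlight tasks are running
    go func(u string) {
        defer func() { <-sem }() // release the slot when done
        task(u, c, d)
    }(u)
}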
I have a case where I want to spin up a goroutine that fetches some data from a source periodically. If the call fails, it stores the error until the next call succeeds. Now there are several places in the code where an instance would access this data pulled by the goroutine. How can I implement something like that?
UPDATE
I have had some sleep and coffee, and I think I need to rephrase the problem more coherently, using Java-ish semantics.
I have come up with a basic singleton pattern that returns an interface implementation which internally runs a goroutine in a forever loop (let's put the cardinal sin of forever loops aside for a moment). The problem is that this interface implementation is accessed by multiple threads to get the data collected by the goroutine. Essentially, the data is pulled every 10 minutes by the goroutine and then requested an infinite number of times. How can I implement something like that?
Here's a very basic example of how you can periodically fetch and collect data.
Bear in mind: running this code will do nothing, as main will return before anything really happens; how you handle that depends on your specific use case. This code is really bare-bones and needs improvements. It is a sketch of a possible solution to one part of your problem :)
I didn't handle errors here, but you could handle them the same way the fetched data is handled (so, one more chan for errors and one more goroutine to read from it); a sketch of that follows the code.
func main() {
    period := time.Second
    respChan := make(chan string)
    cancelChan := make(chan struct{})
    dataCollection := []string{}

    // periodically fetch data and send it to respChan
    go func(period time.Duration, respChan chan string, cancelChan chan struct{}) {
        ticker := time.NewTicker(period)
        for {
            select {
            case <-ticker.C:
                go fetchData(respChan)
            case <-cancelChan:
                // close respChan to stop the reading goroutine
                close(respChan)
                return
            }
        }
    }(period, respChan, cancelChan)

    // read from respChan and write to dataCollection
    go func(respChan chan string) {
        for data := range respChan {
            dataCollection = append(dataCollection, data)
        }
    }(respChan)

    // close cancelChan to gracefully stop the app
    // close(cancelChan)
}

func fetchData(respChan chan string) {
    data := "fetched data"
    respChan <- data
}
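For example, a sketch of that error handling, mirroring the shape of the code above (doFetch is a hypothetical stand-in for the real call):

// fetchData now reports failures on errChan instead of dropping them.
func fetchData(respChan chan string, errChan chan error) {
    data, err := doFetch() // hypothetical fetch call
    if err != nil {
        errChan <- err
        return
    }
    respChan <- data
}

// ... and in main, one more channel plus one more reading goroutine:
errChan := make(chan error)
errCollection := []error{}

go func() {
    for err := range errChan {
        errCollection = append(errCollection, err) // or keep only the latest error
    }
}()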
You can use a channel for that, but then you would push data rather than pull. I guess that wouldn't be a problem.
var channelXY = make(chan struct{}, 5000) // Change queue limits to your need; if push is much faster than pull you need to calculate the buffer

var wg sync.WaitGroup

wg.Add(1)
go func(ch <-chan struct{}) {
    defer wg.Done()
    for range ch {
        // DO STUFF
    }
}(channelXY)

go func() {
    channelXY <- struct{}{}
}()
Remember to manage all goroutines with the WaitGroup, otherwise your program will end before all goroutines are done.
EDIT: Close the channel to stop the channel-read go-routine:
close(channelXY)
For some reason, once I started adding strings through a channel in my goroutine, the code stalls when I run it. I thought that it was a scope/closure issue so I moved all code directly into the function to no avail. I have looked through Golang's documentation and all examples look similar to mine so I am kind of clueless as to what is going wrong.
func getPage(url string, c chan<- string, swg sizedwaitgroup.SizedWaitGroup) {
    defer swg.Done()
    doc, err := goquery.NewDocument(url)
    if err != nil {
        fmt.Println(err)
    }
    nodes := doc.Find(".v-card .info")
    for i := range nodes.Nodes {
        el := nodes.Eq(i)
        var name string
        if el.Find("h3.n span").Size() != 0 {
            name = el.Find("h3.n span").Text()
        } else if el.Find("h3.n").Size() != 0 {
            name = el.Find("h3.n").Text()
        }
        address := el.Find(".adr").Text()
        phoneNumber := el.Find(".phone.primary").Text()
        website, _ := el.Find(".track-visit-website").Attr("href")
        // c <- map[string]string{"name": name, "address": address, "Phone Number": phoneNumber, "website": website}
        c <- fmt.Sprintf("%s%s%s%s", name, address, phoneNumber, website)
        fmt.Println([]string{name, address, phoneNumber, website})
    }
}

func getNumPages(url string) int {
    doc, err := goquery.NewDocument(url)
    if err != nil {
        fmt.Println(err)
    }
    pagination := strings.Split(doc.Find(".pagination p").Contents().Eq(1).Text(), " ")
    numItems, _ := strconv.Atoi(pagination[len(pagination)-1])
    return int(math.Ceil(float64(numItems) / 30))
}

func main() {
    arrChan := make(chan string)
    swg := sizedwaitgroup.New(8)
    zips := []string{"78705", "78710", "78715"}
    for _, item := range zips {
        swg.Add()
        go getPage(fmt.Sprintf(base_url, item, 1), arrChan, swg)
    }
    swg.Wait()
}
Edit:
So I fixed it by passing the sizedwaitgroup by reference, but when I remove the buffer it doesn't work. Does that mean I need to know how many elements will be sent to the channel in advance?
Issue
Building off of Colin Stewart's answer, from the code you have posted, as far as I can tell, your issue is actually with reading your arrChan. You write into it, but there's no place where you read from it in your code.
From the documentation:

If the channel is unbuffered, the sender blocks until the receiver has received the value. If the channel has a buffer, the sender blocks only until the value has been copied to the buffer; if the buffer is full, this means waiting until some receiver has retrieved a value.
By making the channel buffered, your code no longer blocks on the channel write operations, i.e. the line that looks like:
c <- fmt.Sprintf("%s%s%s%s", name, address, phoneNumber, website)
My guess is that if you're still hanging when the channel has a size of 5000, it's because you have more than 5000 values returned across all of your loops over nodes.Nodes. Once your buffered channel is full, the write operations block until the channel has space, just like writing to an unbuffered channel.
Fix
Here's a minimal example showing you how you would fix something like this (basically just add a reader)
package main

import "sync"

func getPage(item string, c chan<- string) {
    c <- item
}

func readChannel(c <-chan string) {
    for {
        <-c
    }
}

func main() {
    arrChan := make(chan string)
    wg := sync.WaitGroup{}
    zips := []string{"78705", "78710", "78715"}
    for _, item := range zips {
        wg.Add(1)
        go func(item string) { // pass item in so each goroutine gets its own copy
            defer wg.Done()
            getPage(item, arrChan)
        }(item)
    }
    go readChannel(arrChan) // comment this out and you'll deadlock
    wg.Wait()
}
Your channel has no buffer, so writes will block until the value can be read, and at least in the code you have posted, there are no readers.
You don't need to know the size to make it work, but you might need it in order to exit cleanly. That can be a bit tricky to observe at times, because your program will exit once your main function exits, and any goroutines still running are killed immediately, finished or not.
As a warm up example, change readChannel in photoionized's response to this:
func readChannel(c <-chan string) {
    for {
        url := <-c
        fmt.Println(url)
    }
}
It only adds printing to the original code, but now you'll see better what is actually happening. Notice how it usually only prints two strings while the code actually writes 3. This is because the code exits once all writing goroutines finish, and the reading goroutine is aborted midway as a result. You can "fix" it by removing the go before readChannel (which is the same as reading the channel in the main function). Then you'll see 3 strings printed, but the program crashes with a deadlock, as readChannel is still reading from the channel but nobody writes into it anymore. You can fix that too by reading exactly 3 strings in readChannel(), but that requires knowing how many strings you expect to receive.
Here is my minimal working example (I'll use it to illustrate the rest):
package main

import (
    "fmt"
    "sync"
)

func getPage(url string, c chan<- string, wg *sync.WaitGroup) {
    defer wg.Done()
    c <- fmt.Sprintf("Got page for %s\n", url)
}

func readChannel(c chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    var url string
    ok := true
    for ok {
        url, ok = <-c
        if ok {
            fmt.Printf("Received: %s\n", url)
        } else {
            fmt.Println("Exiting readChannel")
        }
    }
}

func main() {
    arrChan := make(chan string)
    var swg sync.WaitGroup
    base_url := "http://test/%s/%d"
    zips := []string{"78705", "78710", "78715"}
    for _, item := range zips {
        swg.Add(1)
        go getPage(fmt.Sprintf(base_url, item, 1), arrChan, &swg)
    }
    var wg2 sync.WaitGroup
    wg2.Add(1)
    go readChannel(arrChan, &wg2)
    swg.Wait()
    // All written, signal end to readChannel by closing the channel
    close(arrChan)
    wg2.Wait()
}
Here I close the channel to signal to readChannel that there is nothing left to read, so it can exit cleanly at the proper time. But sometimes you might want instead to tell readChannel to read exactly 3 strings and finish (a sketch of that variant appears at the end of this answer). Or maybe you would want to start one reader for each writer, with each reader reading exactly one string... Well, there are many ways to skin a cat, and the choice is all yours.
Note that if you remove the wg2.Wait() line, your code becomes equivalent to photoionized's response and will only print two strings while writing 3. This is because the code exits once all writers finish (ensured by swg.Wait()), but it does not wait for readChannel to finish.
If you remove the close(arrChan) line instead, your code will crash with a deadlock after printing 3 lines, as the code waits for readChannel to finish, but readChannel waits to read from a channel which nobody writes to anymore.
If you just remove the go before the readChannel call, it becomes equivalent to reading from the channel inside the main function. It will again crash with a deadlock after printing 3 strings, because readChannel is still reading when all writers have already finished (and readChannel has already read everything they wrote). A tricky point here is that the swg.Wait() line will never be reached by this code, as readChannel never exits.
If you move the readChannel call after swg.Wait(), the code will crash before even printing a single string. But this is a different deadlock. This time the code reaches swg.Wait() and stops there waiting for the writers. The first writer succeeds, but the channel is not buffered, so the next writer blocks until someone reads the already-written data from the channel. The trouble is, nobody reads from the channel yet, as readChannel has not been called. So the code stalls and crashes with a deadlock. This particular issue can be "fixed" by making the channel buffered, as in make(chan string, 3), since that allows writers to keep writing even though nobody is reading from the channel yet. And sometimes this is what you want. But here again you have to know the maximum number of messages that will ever sit in the channel buffer, and most of the time it only defers the problem: just add one more writer and you are back where you started, with the code stalling and crashing because the channel buffer is full and that one extra writer is waiting for someone to read from the buffer.
Well, this should cover all the bases. So, check your code and see which case is yours.
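For example, the "read exactly N strings" variant could look like this sketch (readN is my name; main would call go readN(arrChan, len(zips), &wg2) and could then drop the close(arrChan)):

func readN(c <-chan string, n int, wg *sync.WaitGroup) {
    defer wg.Done()
    for i := 0; i < n; i++ {
        fmt.Printf("Received: %s\n", <-c)
    }
    fmt.Println("Exiting readN")
}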
Suppose I have a simple loop which does sequential tests like this.
for f := 1; f <= 1000; f++ {
    if doTest(f) {
        break
    }
}
I loop through range of numbers and do a test for each number. If test fails for one number, I break and exit the main thread. Simple enough.
Now, how do I correctly feed the test numbers to, say, four or more goroutines? Basically, I want to test the numbers from 1 to 1000 in batches of 4 (or whatever the number of goroutines is).
Do I create 4 goroutines reading from one channel and feed the numbers sequentially into that channel? Or do I make 4 goroutines, each with its own channel?
And another question: how do I stop all 4 goroutines if one of them fails the test? I've been reading some texts on channels but I cannot put the pieces together.
You can create a producer/consumer system: https://play.golang.org/p/rks0gB3aDb
func main() {
    ch := make(chan int)
    clients := 4
    // make it buffered, so all clients can fail without hanging
    notifyCh := make(chan struct{}, clients)
    go produce(100, ch, notifyCh)
    var wg sync.WaitGroup
    wg.Add(clients)
    for i := 0; i < clients; i++ {
        go func() {
            consumer(ch, notifyCh)
            wg.Done()
        }()
    }
    wg.Wait()
}

func consumer(in chan int, notifyCh chan struct{}) {
    fmt.Printf("Start consumer\n")
    for i := range in {
        <-time.After(100 * time.Millisecond)
        if i == 42 {
            fmt.Printf("%d fails\n", i)
            notifyCh <- struct{}{}
            return
        } else {
            fmt.Printf("%d\n", i)
        }
    }
    fmt.Printf("Consumer stopped working\n")
}

func produce(N int, out chan int, notifyCh chan struct{}) {
    for i := 0; i < N; i++ {
        select {
        case out <- i:
        case <-notifyCh:
            close(out)
            return
        }
    }
    close(out)
}
The producer pushes numbers from 0 to 99 to the channel, the consumer consumes until the channel is closed. In main we create 4 clients and add them to a waitgroup to reliably check if every goroutine returned.
Every consumer can signal on the notifyCh; the producer then stops working and no further numbers are generated, so all consumers return after their current number.
There's also the option to create 4 goroutines, wait for all of them to return, then start the next 4 goroutines. But that adds quite a lot of waiting overhead.
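As an aside, and not something this example uses, the same stop signal is nowadays often expressed with the standard context package; a rough sketch:

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

ch := make(chan int)
go func() { // producer
    defer close(ch)
    for i := 0; i < 100; i++ {
        select {
        case ch <- i:
        case <-ctx.Done(): // some consumer called cancel()
            return
        }
    }
}()

// Each consumer calls cancel() when its test fails; the producer then
// closes ch and every other consumer's range loop ends.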
Since you mentioned prime numbers, here's a really cool prime seive: https://golang.org/doc/play/sieve.go
Whether you create one channel shared by all goroutines or a channel per goroutine depends on what you want.
If you only want to put some numbers (or, more generally, requests) in, and you don't care which goroutine serves each one, then of course it is better to share a channel. If, for example, you want the first 250 requests to be served by goroutine 1, then of course you cannot share a channel.
For a channel it is good practice to use it as either input or output, and the simplest way for a sender to signal that it is finished is to close the channel. A good article about that is https://blog.golang.org/pipelines
What is not mentioned in the question: you also need another channel (or channels), or some other communication primitive, to get the results back. And here the channel is more interesting for collecting results than for feeding.
What information should be sent back? Should a bool be sent after every doTest, or do you just need to know when everything is done (in that case no bool is necessary, just close a channel)?
If you prefer the program to fail fast, then I would use a buffered shared channel to feed the numbers. Don't forget to close it when all the numbers have been fed.
And use another, unbuffered channel to let the main thread know that the tests are done. It can be a channel where you only put the number where the test failed, or, if you also want positive results, a channel of structs containing the number, the result, and any other information returned from doTest. A sketch along these lines follows at the end of this answer.
A very good article about channels is also http://dave.cheney.net/2014/03/19/channel-axioms
Each of your four goroutines can report a failure (by sending the error and closing the channel). But the gotcha is what the goroutines should do when all numbers pass and the feeding channel is closed. There is also a nice article about that: http://nathanleclaire.com/blog/2014/02/15/how-to-wait-for-all-goroutines-to-finish-executing-before-continuing/
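Putting that advice together, a rough sketch (jobs and failed are my names, doTest is from the question; note the remaining workers still drain the queue after a failure, so combine this with the notify pattern from the other answer if you need them to stop early):

jobs := make(chan int, 1000) // buffered feed channel, closed by the feeder
failed := make(chan int, 4)  // buffered so a late failure never blocks

go func() {
    for f := 1; f <= 1000; f++ {
        jobs <- f
    }
    close(jobs)
}()

var wg sync.WaitGroup
for w := 0; w < 4; w++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for f := range jobs {
            if doTest(f) { // doTest reports a failure, as in the question
                failed <- f
                return
            }
        }
    }()
}

go func() {
    wg.Wait()
    close(failed) // all numbers passed: the receive below sees ok == false
}()

if f, ok := <-failed; ok {
    fmt.Println("a test failed at:", f)
}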
I'm trying to understand more about Go's channels and goroutines, so I decided to make a little program that counts words from a file, read by a bufio.NewScanner object:
nCPUs := flag.Int("cpu", 2, "number of CPUs to use")
flag.Parse()
runtime.GOMAXPROCS(*nCPUs)

scanner := bufio.NewScanner(file)
lines := make(chan string)
results := make(chan int)

for i := 0; i < *nCPUs; i++ {
    go func() {
        for line := range lines {
            fmt.Printf("%s\n", line)
            results <- len(strings.Split(line, " "))
        }
    }()
}

for scanner.Scan() {
    lines <- scanner.Text()
}
close(lines)

acc := 0
for i := range results {
    acc += i
}
fmt.Printf("%d\n", acc)
Now, in most examples I've found so far, both the lines and results channels would be buffered, such as make(chan int, NUMBER_OF_LINES_IN_FILE). Still, after running this code, my program exits with a fatal error: all goroutines are asleep - deadlock! message.
Basically my thought is that I need two channels: one to communicate the lines from the file to the goroutines (as the file can be of any size, I don't like the idea of having to pass that size to the make(chan) call). The other channel would collect the results from the goroutines, and in the main function I would use it to e.g. calculate an accumulated result.
What is the best way to program this with goroutines and channels? Any help is much appreciated.
As @AndrewN has pointed out, the problem is that each goroutine gets to the point where it's trying to send to the results channel, but those sends will block because the results channel is unbuffered and nothing reads from it until the for i := range results loop. You never get to that loop, because you first need to finish the for scanner.Scan() loop, which is trying to send all the lines down the lines channel, which is blocked because the goroutines never loop back to range lines, being stuck sending to results.
The first thing you might try to do to fix this is to put the scanner.Scan() stuff in a goroutine, so that something can start reading off the results channel right away. However, the next problem you'll have is knowing when to end the for i := range results loop. You want to have something close the results channel, but only after the original goroutines are done reading off the lines channel. You could close the results channel right after closing the lines channel, however I think that might introduce a potential race, so the safest thing to do is also wait for the original two goroutines to be done before closing the results channel: (playground link):
package main

import "fmt"
import "runtime"
import "bufio"
import "strings"
import "sync"

func main() {
    runtime.GOMAXPROCS(2)
    scanner := bufio.NewScanner(strings.NewReader(`
hi mom
hi dad
hi sister
goodbye`))
    lines := make(chan string)
    results := make(chan int)
    wg := sync.WaitGroup{}
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            for line := range lines {
                fmt.Printf("%s\n", line)
                results <- len(strings.Split(line, " "))
            }
            wg.Done()
        }()
    }
    go func() {
        for scanner.Scan() {
            lines <- scanner.Text()
        }
        close(lines)
        wg.Wait()
        close(results)
    }()
    acc := 0
    for i := range results {
        acc += i
    }
    fmt.Printf("%d\n", acc)
}
Channels in Go are unbuffered by default, which means that none of the anonymous goroutines you spawn can send to the results channel until you start receiving from that channel. That receive doesn't start executing in the main program until scanner.Scan() is done filling up the lines channel... which it's blocked from doing until your anonymous functions can send to the results channel and restart their loops. Deadlock.
The other problem in your code, even when trivially fixing the above by buffering the channels, is that for i := range results will also deadlock once there are no more results being fed into it, since the channel hasn't been closed.
Edit: Here's one potential solution, if you want to avoid buffered channels. Basically, the first issue is avoided by performing the send to the results channel via a new goroutine, allowing the lines loop to complete. The second issue (not knowing when to stop reading a channel) is avoided by counting the goroutines as they are created and explicitly closing down the channel when every goroutine is accounted for. It's probably better to do something similar with waitgroups, but this is just a very fast way to show how to do it unbuffered; a sketch follows.
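For illustration, here is one way to read that description as code (a sketch, not necessarily the exact solution; here the scanner loop moves into its own goroutine, and the workers are counted on a done channel instead of a waitgroup):

lines := make(chan string)
results := make(chan int)
done := make(chan struct{})
workers := 2

go func() {
    for scanner.Scan() {
        lines <- scanner.Text()
    }
    close(lines)
}()

for i := 0; i < workers; i++ {
    go func() {
        for line := range lines {
            results <- len(strings.Split(line, " "))
        }
        done <- struct{}{} // this worker is accounted for
    }()
}

go func() {
    for i := 0; i < workers; i++ {
        <-done
    }
    close(results) // every worker accounted for; stop the range below
}()

acc := 0
for i := range results {
    acc += i
}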