Go code with execution control using channels

I'm extracting all the data from a long Redshift table in chunks, writing each chunk to a CSV file. I want to control how many files are created at the "same" time (concurrently), i.e. if the whole process will create 10 files, I want to create 4 files, wait until they are done, then create another 4, and then the remaining 2.
How can I achieve this using channels?
I've tried changing the slice below to a channel, but I couldn't get it to work: my implementation did not wait for the first 4 files to finish before creating the following ones.
Right now I'm doing the following with a WaitGroup:
// package, imports, var, etc...

// Inside func main:

// Create a WaitGroup
var wg = sync.WaitGroup{}

// Open the connection
db, err := sql.Open("postgres", connStr)
if err != nil {
    panic(err.Error())
}
defer db.Close()

// Define chunks using a slice
chunkSizer := Slicer(totalRowsInTable, numberRowsByChunk) // e.g. []int{100, 100, 100... 100}

// Iterate over the slice
for index, value := range chunkSizer {
    wg.Add(1)
    // queriedSection and expFileName are built from index and value (omitted here)
    go ExtractorToCSV(db, queriedSection, expFileName)
    if (index+1)%4 == 0 { // <-- 4 is the maximum number of files created at the "same" time
        wg.Wait()
    }
}
wg.Wait() // <-- waits for the remaining files (last 2 in this case)

// Outside main
func ExtractorToCSV(db *sql.DB, queryToExtract, fileName string) {
    // ... do its process
    wg.Done() // note: for this to compile, wg must be declared at package level
}
I've tried using a buffered channel of the size I wanted to stop at (4 in this case), but I didn't use it properly, I don't know...
Thanks in advance.

UPDATED - STOP CONDITION
You can use a channel to hold back the next lines of code, like this. This is minimal code that I wrote for you; tweak it as you like.
var doneCh = make(chan bool)

func main() {
    WRITE_POOL := 4
    // RANGE is the slice of chunks and MAX is len(RANGE); both are placeholders
    for index, val := range RANGE {
        go extractToFile(val)
        if (index+1)%WRITE_POOL == 0 {
            // wait for a full pool of WRITE_POOL goroutines to finish
            // when the iteration count is divisible by WRITE_POOL
            for i := 0; i < WRITE_POOL; i++ {
                <-doneCh
            }
        } else if index == MAX-1 {
            // wait for whatever goroutines are left
            // when the current val is the last one
            LEFT := MAX % WRITE_POOL // the size of the final, incomplete pool
            for i := 0; i < LEFT; i++ {
                <-doneCh
            }
        }
    }
}

func extractToFile(val int) {
    f, _ := os.Create(fmt.Sprintf("test-%d", val)) // error handling omitted
    f.Close()
    doneCh <- true
}
For better performance, try to:
Create a data channel, so the main function can send work to it and ExtractorToCSV can receive from it.
Run ExtractorToCSV as a fixed set of goroutines that read from the data channel.
Send the DB data to the data channel, and after ExtractorToCSV finishes writing a file, have it send true to doneCh.
A sketch of this pattern follows. I will update this post if you need more examples.
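Here is a minimal, self-contained sketch of that pattern, with a simplified stand-in for ExtractorToCSV (chunk ids instead of real queries) and a sync.WaitGroup in place of doneCh for the final wait. Because exactly four workers drain the jobs channel, at most four files are ever being written at once:

package main

import (
    "fmt"
    "os"
    "sync"
)

// extractToCSV stands in for ExtractorToCSV: it "writes" one chunk to a file.
func extractToCSV(id int) {
    f, err := os.Create(fmt.Sprintf("chunk-%d.csv", id))
    if err != nil {
        panic(err)
    }
    defer f.Close()
    fmt.Fprintf(f, "data for chunk %d\n", id)
}

func main() {
    const maxConcurrent = 4 // at most 4 files in flight
    jobs := make(chan int)  // carries one chunk id per job
    var wg sync.WaitGroup

    // Start exactly maxConcurrent workers; they limit concurrency by construction.
    for w := 0; w < maxConcurrent; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for id := range jobs {
                extractToCSV(id)
            }
        }()
    }

    // Feed all 10 chunks; sends block until a worker is free.
    for id := 0; id < 10; id++ {
        jobs <- id
    }
    close(jobs) // lets the workers' range loops exit
    wg.Wait()   // wait for the last files to be written
}

Note that this keeps up to four files in flight at all times rather than writing them in strict batches of four, which is usually what limiting concurrency actually calls for.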

Related

How do I terminate an infinite loop from inside of a goroutine?

I'm writing an app in Go that interacts with Spotify's API, and I find myself needing an infinite for loop to call an endpoint until the length of the returned slice is less than the limit, signalling that I've reached the end of the available entries.
For my user account, there are 1644 saved albums (I determined this by looping through without goroutines). However, when I add goroutines, I get back 2544 saved albums, with duplicates. I'm also using the semaphore pattern to limit the number of goroutines so that I don't exceed the rate limit.
I assume the issue is using the active variable rather than channels, but my attempt at that just resulted in an infinite loop.
wg := &sync.WaitGroup{}
sem := make(chan bool, 20)
active := true
offset := 0
for {
    sem <- true
    if active {
        // add each new goroutine to waitgroup
        wg.Add(1)
        go func() error {
            // remove from waitgroup when goroutine is complete
            defer wg.Done()
            // release the worker
            defer func() { <-sem }()
            savedAlbums, err := client.CurrentUsersAlbums(ctx, spotify.Limit(50), spotify.Offset(offset))
            if err != nil {
                return err
            }
            userAlbums = append(userAlbums, savedAlbums.Albums...)
            if len(savedAlbums.Albums) < 50 {
                // since the limit is set to 50, we know that if the number of returned albums
                // is less than 50 that we're done retrieving data
                active = false
                return nil
            } else {
                offset += 50
                return nil
            }
        }()
    } else {
        wg.Wait()
        break
    }
}
Thanks in advance!
I suspect that your main issue may be a misunderstanding of what the go keyword does; from the docs:
A "go" statement starts the execution of a function call as an independent concurrent thread of control, or goroutine, within the same address space.
So go func() error { starts the execution of the closure; it does not mean that any of its code runs immediately. In fact, because client.CurrentUsersAlbums will take a while, it's likely you will be requesting the first 50 items 20 times. This can be demonstrated with a simplified version of your application (playground):
func main() {
    wg := &sync.WaitGroup{}
    sem := make(chan bool, 20)
    active := true
    offset := 0
    for {
        sem <- true
        if active {
            // add each new goroutine to waitgroup
            wg.Add(1)
            go func() error {
                // remove from waitgroup when goroutine is complete
                defer wg.Done()
                // release the worker
                defer func() { <-sem }()
                fmt.Println("Getting from:", offset)
                time.Sleep(time.Millisecond) // simulate the query
                // Pretend that we got back 50 albums
                offset += 50
                if offset > 2000 {
                    active = false
                }
                return nil
            }()
        } else {
            wg.Wait()
            break
        }
    }
}
Running this will produce somewhat unpredictable results (note that the playground caches results, so try it on your machine), but you will probably see Getting from: 0 printed 20 times.
A further issue is data races: updating a variable from multiple goroutines without protection (e.g. a sync.Mutex) results in undefined behaviour (the race detector, go run -race, will flag this).
You will want to know how to fix this, but unfortunately you will need to rethink your algorithm. Currently the process you are following is:
1. Set pos to 0.
2. Get 50 records starting from pos.
3. If we got 50 records, then set pos = pos + 50 and loop back to step 2.
This is a sequential algorithm; you don't know whether you have all of the data until you have requested the previous section. I guess you could make speculative queries (and handle failures) but a better solution would be to find some way to determine the number of results expected and then split the queries to get that number of records between multiple goroutines.
Note that if you do know the number of responses then you can do something like the following (playground):
noOfResultsToGet := 1644       // in the below we are getting 0-1643
noOfResultsPerRequest := 50
noOfSimultaneousRequests := 20 // you may not need this, but many services will limit the number of simultaneous requests you can make (or, at least, rate limit them)

requestChan := make(chan int)       // will be passed the starting #
responseChan := make(chan []string) // response from whatever request we are making (can be any type really)

// Start goroutines to make the requests
var wg sync.WaitGroup
wg.Add(noOfSimultaneousRequests)
for i := 0; i < noOfSimultaneousRequests; i++ {
    go func(routineNo int) {
        defer wg.Done()
        for startPos := range requestChan {
            // Simulate making the request
            maxResult := startPos + noOfResultsPerRequest
            if maxResult > noOfResultsToGet {
                maxResult = noOfResultsToGet
            }
            rsp := make([]string, 0, noOfResultsPerRequest)
            for x := startPos; x < maxResult; x++ {
                rsp = append(rsp, strconv.Itoa(x))
            }
            responseChan <- rsp
            fmt.Printf("Goroutine %d handling data from %d to %d\n", routineNo, startPos, maxResult)
        }
    }(i)
}

// Close the response channel when all goroutines have shut down
go func() {
    wg.Wait()
    close(responseChan)
}()

// Send the requests
go func() {
    for reqFrom := 0; reqFrom < noOfResultsToGet; reqFrom += noOfResultsPerRequest {
        requestChan <- reqFrom
    }
    close(requestChan) // allow goroutines to exit
}()

// Receive responses (note that these may arrive out of order)
result := make([]string, 0, noOfResultsToGet)
for x := range responseChan {
    result = append(result, x...)
}

// Order the results and output (results from goroutines may come back in any order)
sort.Slice(result, func(i, j int) bool {
    a, _ := strconv.Atoi(result[i])
    b, _ := strconv.Atoi(result[j])
    return a < b
})
fmt.Printf("Result: %v", result)
Relying on channels to pass messages often makes this kind of thing easier to think about and reduces the chance that you will make a mistake.
Pass offset as an argument: go func(offset int) error {.
Increment offset by 50 after calling go func.
Change active's type to chan bool.
To avoid a data race on userAlbums = append(userAlbums, res...), create a channel of the same type as userAlbums, send the results to it from the goroutines, and append from a single loop that reads the channel.
Here is the example: https://go.dev/play/p/yzk8qCURZFC
Applied to your code:
wg := &sync.WaitGroup{}
worker := 20
active := make(chan bool, worker)
for i := 0; i < worker; i++ {
    active <- true
}
// I assume the type of userAlbums is []string
resultsChan := make(chan []string, worker)
go func() {
    offset := 0
    for {
        if <-active {
            // add each new goroutine to waitgroup
            wg.Add(1)
            go func(offset int) error {
                // remove from waitgroup when goroutine is complete
                defer wg.Done()
                savedAlbums, err := client.CurrentUsersAlbums(ctx, spotify.Limit(50), spotify.Offset(offset))
                if err != nil {
                    // active <- false // maybe you need this
                    return err
                }
                resultsChan <- savedAlbums.Albums
                if len(savedAlbums.Albums) < 50 {
                    // since the limit is set to 50, we know that if the number of returned albums
                    // is less than 50 that we're done retrieving data
                    active <- false
                    return nil
                } else {
                    active <- true
                    return nil
                }
            }(offset)
            offset += 50
        } else {
            wg.Wait()
            close(resultsChan)
            break
        }
    }
}()
for res := range resultsChan {
    userAlbums = append(userAlbums, res...)
}

Conditionally Run Consecutive Go Routines

I have the following piece of code. I'm trying to run 3 goroutines at the same time, never exceeding three. This works as expected, but the code running in the goroutines updates a table in the DB.
The first routine should process the first 50 rows, the second the next 50, the third the next 50, and so on. I don't want two routines processing the same rows at the same time, but because of how long the update takes, that happens almost every time.
To solve this, I started flagging the rows with a new boolean column, processing. I set it to true for all rows about to be updated when the routine starts, and sleep the script for 6 seconds to give the flag time to be written.
This works for a random amount of time, but every now and then I'll see 2-3 jobs processing the same rows again. I feel like the method I'm using to prevent duplicate updates is a bit janky and was wondering if there is a better way.
stopper := make(chan struct{}, 3)
var counter int
for {
    counter++
    stopper <- struct{}{}
    go func(db *sqlx.DB, c int) {
        fmt.Println("start")
        updateTables(db)
        fmt.Println("stop")
        <-stopper
    }(db, counter)
    time.Sleep(6 * time.Second)
}
In updateTables:
var ids []string
err := sqlx.Select(db, &data, `select * from table_data where processing = false`)
if err != nil {
    panic(err)
}
for _, row := range data {
    ids = append(ids, row.Id)
}
if len(ids) == 0 {
    return
}
for _, row := range data {
    _, err = db.Exec(`update table_data set processing = true where id = $1`, row.Id)
    if err != nil {
        panic(err)
    }
}
// Additional row processing
I think there's a misunderstanding in the approach to goroutines in this case.
Goroutines for this kind of work should be approached like worker threads, using channels as the communication method between the main routine (which does the synchronization) and the worker goroutines (which do the actual job).
package main

import (
    "log"
    "sync"
    "time"
)

type record struct {
    id int
}

func main() {
    const WORKER_COUNT = 10
    recordschan := make(chan record)
    var wg sync.WaitGroup
    for k := 0; k < WORKER_COUNT; k++ {
        wg.Add(1)
        // Create the worker which will be doing the updates
        go func(workerID int) {
            defer wg.Done() // marking the worker as done
            for record := range recordschan {
                updateRecord(record)
                log.Printf("req %d processed by worker %d", record.id, workerID)
            }
        }(k)
    }
    // Feeding the records channel
    for _, record := range fetchRecords() {
        recordschan <- record
    }
    // Closing our channel as we're not using it anymore
    close(recordschan)
    // Waiting for all the goroutines to finish
    wg.Wait()
    log.Println("we're done!")
}

func fetchRecords() []record {
    result := []record{}
    for k := 0; k < 100; k++ {
        result = append(result, record{k})
    }
    return result
}

func updateRecord(req record) {
    time.Sleep(200 * time.Millisecond)
}
You can even batch things in the main goroutine if you need to update 50 rows at once; a sketch follows.
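One way to batch, as a rough sketch: reuse record, fetchRecords, and WORKER_COUNT from the example above, and assume a hypothetical updateBatch helper that updates a whole batch in one statement. The main goroutine groups the records into slices of 50 and the workers receive whole batches:

const batchSize = 50

batcheschan := make(chan []record) // carries whole batches instead of single records
var wg sync.WaitGroup

// Workers receive a full batch and update its rows in one go.
for k := 0; k < WORKER_COUNT; k++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for batch := range batcheschan {
            updateBatch(batch) // hypothetical helper: one UPDATE covering the whole batch
        }
    }()
}

// The main goroutine slices the fetched records into batches of 50.
records := fetchRecords()
for start := 0; start < len(records); start += batchSize {
    end := start + batchSize
    if end > len(records) {
        end = len(records)
    }
    batcheschan <- records[start:end]
}
close(batcheschan) // no more batches
wg.Wait()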

Deadlock in the book The Go Programming Language: how would it happen and why?

There are several places in this book that talk about deadlocks arising from incorrect usage of goroutines and channels, and I fail to grasp why the deadlocks happen.
First I want to say that I know channel sends and receives block until there is an item to be read or room to send an item into, and maybe that's the underlying reason for some of these deadlocks. But please enlighten me with some explanation of the following excerpts from the book:
Page 240
This code crawls URLs concurrently, BFS style:
func main() {
    worklist := make(chan []string)
    // Start with the command-line arguments.
    go func() { worklist <- os.Args[1:] }()
    // Crawl the web concurrently.
    seen := make(map[string]bool)
    for list := range worklist {
        for _, link := range list {
            if !seen[link] {
                seen[link] = true
                go func(link string) {
                    worklist <- crawl(link)
                }(link)
            }
        }
    }
}
quoting the book's second paragraph:
...the initial send of the command-line arguments to the worklist must run in its own goroutine to avoid deadlock, a stuck situation in which both the main goroutine and a crawler goroutine attempt to send to each other while neither is receiving...
Suppose the initial send to worklist were not in its own goroutine; I imagine it would work like this:
1. the main goroutine sends the initial list to worklist and blocks until it is received
2. the for range receives the initial item, unblocking the worklist channel
3. the crawler goroutines send their items into worklist, and the loop continues...
So to my understanding, it won't block and deadlock. Where am I wrong?
UPDATE: #mkopriva helped me realize that since step 1 blocks, steps 2 and 3 are unreachable. So I'm clear on this one.
Page 243
This deadlock situation might be the same as the one on page 240:
func main() {
    worklist := make(chan []string)  // list of URLs, may have duplicates
    unseenLinks := make(chan string) // de-duplicated URLs
    // Add command-line arguments to worklist.
    go func() { worklist <- os.Args[1:] }()
    // Create 20 crawler goroutines to fetch each unseen link.
    for i := 0; i < 20; i++ {
        go func() {
            for link := range unseenLinks {
                foundLinks := crawl(link)
                go func() { worklist <- foundLinks }()
            }
        }()
    }
    // The main goroutine de-duplicates worklist items
    // and sends the unseen ones to the crawlers.
    seen := make(map[string]bool)
    for list := range worklist {
        for _, link := range list {
            if !seen[link] {
                seen[link] = true
                unseenLinks <- link
            }
        }
    }
}
So if I omit the go inside the for-range loop, how does the deadlock happen?
In the first snippet, the initial channel send needs to be done in a goroutine because without a goroutine the statement would block indefinitely and execution would never reach the loop that receives from that channel. That is, to get from step 1 to step 2, step 1 needs to be done in a goroutine; if step 1 blocks, step 2 is never reached.
Where the comments start is where the execution stops:
func main() {
    worklist := make(chan []string)
    worklist <- os.Args[1:]
    //
    // seen := make(map[string]bool)
    // for list := range worklist {
    //     for _, link := range list {
    //         if !seen[link] {
    //             seen[link] = true
    //             go func(link string) {
    //                 worklist <- crawl(link)
    //             }(link)
    //         }
    //     }
    // }
// }
In the second snippet you have a for-range loop over a channel; such a loop will NOT exit until the ranged-over channel is closed. That means that, while the body of such a loop may continue to be executed, the code after the loop-with-unclosed-channel will never be reached.
https://golang.org/ref/spec#For_range
For channels, the iteration values produced are the successive values sent on the channel until the channel is closed. If the channel is nil, the range expression blocks forever.
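As a quick illustration of the quoted rule, a range over a channel only ends when the sender closes it; a minimal example:

package main

import "fmt"

func main() {
    ch := make(chan int)
    go func() {
        for i := 0; i < 3; i++ {
            ch <- i
        }
        // Without this close, the range below would never exit and the
        // program would deadlock after printing 0, 1 and 2.
        close(ch)
    }()
    for v := range ch { // exits only because ch is closed
        fmt.Println(v)
    }
}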
Where the comments start is where the execution stops:
func main() {
    worklist := make(chan []string)
    unseenLinks := make(chan string)
    go func() { worklist <- os.Args[1:] }()
    for i := 0; i < 20; i++ {
        for link := range unseenLinks {
            // foundLinks := crawl(link)
            // go func() { worklist <- foundLinks }()
        // }
    // }
    //
    // seen := make(map[string]bool)
    // for list := range worklist {
    //     for _, link := range list {
    //         if !seen[link] {
    //             seen[link] = true
    //             unseenLinks <- link
    //         }
    //     }
    // }
// }
In the second snippet, I think the author is talking about the goroutine below:
go func() { worklist <- foundLinks }()
"Links found by crawl are sent to the worklist from a dedicated goroutine to avoid deadlock."
Why must this go exist? Without it, a crawler blocks on worklist <- foundLinks until the main goroutine receives, while the main goroutine may itself be blocked on unseenLinks <- link waiting for a crawler to receive; once all 20 crawlers are stuck sending, each side waits on the other forever. Wrapping the send in its own goroutine frees the crawler to keep receiving from unseenLinks.

Unpredictable results with http.Client and goroutines

I'm new to Go, trying to build a system that fetches content from a set of URLs and extracts specific lines with a regex. The problems start when I wrap the code with goroutines: I get a different number of regex results on each run, and many of the fetched lines are duplicates.
max_routines := 3
sem := make(chan int, max_routines) // to control the number of working routines
var wg sync.WaitGroup
ch_content := make(chan string)
client := http.Client{}
for i := 2; ; i++ {
    // for testing
    if i > 5 {
        break
    }
    // loop should be broken if feebbacks_checstr is found in content
    if loop_break {
        break
    }
    wg.Add(1)
    go func(i int) {
        defer wg.Done()
        sem <- 1 // will block if > max_routines
        final_url = url + a.tm_id + "/page=" + strconv.Itoa(i)
        resp, _ := client.Get(final_url)
        var bodyString string
        if resp.StatusCode == http.StatusOK {
            bodyBytes, _ := ioutil.ReadAll(resp.Body)
            bodyString = string(bodyBytes)
        }
        // checking for stop word in content
        if false == strings.Contains(bodyString, feebbacks_checstr) {
            res2 = regex.FindAllStringSubmatch(bodyString, -1)
            for _, v := range res2 {
                ch_content <- v[1]
            }
        } else {
            loop_break = true
        }
        resp.Body.Close()
        <-sem
    }(i)
}
for {
    select {
    case r := <-ch_content:
        a.feedbacks = append(a.feedbacks, r) // collecting the data
    case <-time.After(500 * time.Millisecond):
        show(len(a.feedbacks)) // <-- always a different result; many entries in a.feedbacks are duplicates
        fmt.Printf(".")
    }
}
As a result, len(a.feedbacks) sometimes gives 130, sometimes 139, and a.feedbacks contains duplicates. If I remove the duplicates, the number of results is about half of what I'm expecting (109 without duplicates).
You're creating a closure with the anonymous goroutine function. I notice your final_url is assigned with = rather than :=, which means it's defined outside the closure. All goroutines therefore share the same final_url variable, and there's a race condition: some goroutines overwrite final_url before other goroutines have made their requests, which produces the duplicates. (The same applies to res2 and loop_break, which are also shared across goroutines.)
If you define final_url inside the goroutine, the goroutines won't step on each other's toes, and it should work as you expect.
That's the simple fix for what you have. A more idiomatically Go way to do this would be to create an input channel (containing the URLs to request) and an output channel (eventually containing whatever you're pulling out of the response), and instead of trying to manage the life and death of dozens of goroutines you would keep a constant number of goroutines alive, emptying out the input channel. A sketch of that pattern follows.
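The one-line fix is to write final_url := url + a.tm_id + "/page=" + strconv.Itoa(i) inside the closure. For the idiomatic version, here is a rough, self-contained sketch (the URLs and the "extraction" are hypothetical stand-ins; the real code would apply the regex to each body):

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
)

func main() {
    urls := make(chan string)    // input: URLs to request
    results := make(chan string) // output: whatever we pull out of each response

    var wg sync.WaitGroup
    for w := 0; w < 3; w++ { // a constant number of worker goroutines
        wg.Add(1)
        go func() {
            defer wg.Done()
            client := &http.Client{}
            for u := range urls { // each worker drains the input channel
                resp, err := client.Get(u)
                if err != nil {
                    continue // error handling elided in this sketch
                }
                body, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                // Stand-in for the regex extraction in the real code.
                results <- fmt.Sprintf("%d bytes from %s", len(body), u)
            }
        }()
    }

    // Close results once every worker is done, so the collecting loop ends.
    go func() {
        wg.Wait()
        close(results)
    }()

    // Feed the input channel, then close it so the workers exit.
    go func() {
        for i := 2; i <= 5; i++ {
            urls <- fmt.Sprintf("https://example.com/page=%d", i)
        }
        close(urls)
    }()

    var feedbacks []string
    for r := range results { // ends when results is closed, so no timeout is needed
        feedbacks = append(feedbacks, r)
    }
    fmt.Println(len(feedbacks))
}

Because the collecting loop ranges over results and ends when the channel is closed, there is no need for the time.After timeout that the original select uses.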

Efficient way to process lines of file

I'm trying to learn some Go by making a script that reads from a CSV file, does some processing, and collects the results.
I am following a pipeline pattern: one goroutine reads the file line by line with a scanner, lines are sent to a channel, and different goroutines consume the channel's content.
An example of what I am trying to do:
https://gist.github.com/pkulak/93336af9bb9c7207d592
My problem is: the CSV file has lots of records. I want to do some math between consecutive lines. Let's say record 1 is r1, record 2 is r2, and so on.
When I read the file I have r1, and in the next scanner loop I have r2. I want to check whether the pair r1-r2 is valid for me. If it is, do some math between them, then check r3 and r4 the same way. If r1-r2 is not valid, ignore r2 and check r1-r3, and so on.
Should I handle this while reading the file, or after I have put the lines into the channel, handling them from the channel's content?
Any suggestion that does not break concurrency?
I think you should determine whether a pair of lines is valid inside the "Read the lines into the work queue" function.
So you read the current line, then read the following lines one by one until you have a valid pair. When you get one, send the pair into the workQueue channel and start searching for the next pair.
This is your code with changes:
package main

import (
    "bufio"
    "errors"
    "log"
    "os"
)

var concurrency = 100

type Pair struct {
    line1 string
    line2 string
}

func main() {
    // It would be better to receive the file path from somewhere (like args or something like that)
    filePath := "/path/to/file.csv"

    // This channel has no buffer, so it only accepts input when something is ready
    // to take it out. This keeps the reading from getting ahead of the writers.
    workQueue := make(chan Pair)

    // We need to know when everyone is done so we can exit.
    complete := make(chan bool)

    // Read the lines into the work queue.
    go func() {
        file, e := os.Open(filePath)
        if e != nil {
            log.Fatal(e)
        }
        // Close when the function returns
        defer file.Close()

        scanner := bufio.NewScanner(file)
        // Get pairs and send them into the "workQueue" channel
        for {
            line1, e := getNextCorrectLine(scanner)
            if e != nil {
                break
            }
            line2, e := getNextCorrectLine(scanner)
            if e != nil {
                break
            }
            workQueue <- Pair{line1, line2}
        }
        // Close the channel so everyone reading from it knows we're done.
        close(workQueue)
    }()

    // Now read them all off, concurrently.
    for i := 0; i < concurrency; i++ {
        go startWorking(workQueue, complete)
    }

    // Wait for everyone to finish.
    for i := 0; i < concurrency; i++ {
        <-complete
    }
}

func getNextCorrectLine(scanner *bufio.Scanner) (string, error) {
    var line string
    for scanner.Scan() {
        line = scanner.Text()
        if isCorrect(line) {
            return line, nil
        }
    }
    return "", errors.New("no more lines")
}

func isCorrect(str string) bool {
    // Put your validation here
    return true
}

func startWorking(pairs <-chan Pair, complete chan<- bool) {
    for pair := range pairs {
        doTheWork(pair)
    }
    // Let the main process know we're done.
    complete <- true
}

func doTheWork(pair Pair) {
    // Do the work with the pair
}