I have written a producer-consumer pattern in Go that reads multiple CSV files and processes their records. Each CSV file is read in full in one go.
I want to log the percentage of processing completed at intervals of 5% of the total records across all CSV files. For example, if I have 3 CSV files to process with 20, 30, and 50 rows (100 records in total), I want to log progress every time 5 more records have been processed.
func processData(inputCSVFiles []string) {
producerCount := len(inputCSVFiles)
consumerCount := producerCount
link := make(chan []string, 100)
wp := &sync.WaitGroup{}
wc := &sync.WaitGroup{}
wp.Add(producerCount)
wc.Add(consumerCount)
for i := 0; i < producerCount; i++ {
go produce(link, inputCSVFiles[i], wp)
}
for i := 0; i < consumerCount; i++ {
go consume(link, wc)
}
wp.Wait()
close(link)
wc.Wait()
fmt.Println("Completed data migration process for all CSV data files.")
}
func produce(link chan<- []string, filePath string, wg *sync.WaitGroup) {
defer wg.Done()
records := readCsvFile(filePath)
totalNumberOfRecords := len(records)
for _, record := range records {
link <- record
}
}
func consume(link <-chan []string, wg *sync.WaitGroup) {
defer wg.Done()
for record := range link {
// process csv record
}
}
I have used an atomic variable and a counter channel: each consumer pushes a count to the channel when a record is processed, and another goroutine reads from the channel and calculates the percentage of records processed.
var progressPercentageStep float64 = 5.0
var totalRecordsToProcess int32
func processData(inputCSVFiles []string) {
producerCount := len(inputCSVFiles)
consumerCount := producerCount
link := make(chan []string, 100)
counter := make(chan int, 100)
defer close(counter)
wp := &sync.WaitGroup{}
wc := &sync.WaitGroup{}
wp.Add(producerCount)
wc.Add(consumerCount)
for i := 0; i < producerCount; i++ {
go produce(link, inputCSVFiles[i], wp)
}
go progressStats(counter)
for i := 0; i < consumerCount; i++ {
go consume(link, counter, wc) // consume takes the counter channel as its second argument
}
wp.Wait()
close(link)
wc.Wait()
}
func produce(link chan<- []string, filePath string, wg *sync.WaitGroup) {
defer wg.Done()
records := readCsvFile(filePath)
atomic.AddInt32(&totalRecordsToProcess, int32(len(records)))
for _, record := range records {
link <- record
}
}
func consume(link <-chan []string,counter chan<- int, wg *sync.WaitGroup) {
defer wg.Done()
for record := range link {
// process csv record
counter <- 1
}
}
func progressStats(counter <-chan int) {
var feedbackThreshold = progressPercentageStep
var processed int32 // records consumed so far
for count := range counter {
processed += int32(count)
// The total may still be growing while producers are reading their files.
total := atomic.LoadInt32(&totalRecordsToProcess)
donePercent := 100.0 * float64(processed) / float64(total)
// log progress
if donePercent >= feedbackThreshold {
log.Printf("Progress ************** Total Records: %d, Processed Records : %d, Processed Percentage: %.2f **************\n", total, processed, donePercent)
feedbackThreshold += progressPercentageStep
}
}
}
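One way to get stable percentages, sketched here under the assumption that every file is read before any record is processed (so the total is known up front), is to keep the count behind a mutex and report each time a threshold is crossed. The names progress, done, and run are hypothetical, and records are simplified to plain strings instead of []string.

```go
package main

import (
	"fmt"
	"sync"
)

// progress counts processed records and reports each time another
// stepPct percent of the total has been completed. stepPct must be > 0.
type progress struct {
	mu        sync.Mutex
	total     int
	processed int
	stepPct   int
	nextPct   int
	report    func(pct, processed, total int)
}

// done records one processed record and reports every threshold it crosses.
func (p *progress) done() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.total == 0 {
		return
	}
	p.processed++
	pct := p.processed * 100 / p.total
	for p.nextPct <= pct {
		p.report(p.nextPct, p.processed, p.total)
		p.nextPct += p.stepPct
	}
}

// run mirrors the pipeline in the question: one producer per file and the
// same number of consumers. The total is computed before any goroutine starts.
func run(files [][]string, stepPct int, report func(pct, processed, total int)) {
	total := 0
	for _, records := range files {
		total += len(records)
	}
	p := &progress{total: total, stepPct: stepPct, nextPct: stepPct, report: report}

	link := make(chan string, 100)
	var wp, wc sync.WaitGroup
	for _, records := range files {
		wp.Add(1)
		go func(records []string) {
			defer wp.Done()
			for _, r := range records {
				link <- r
			}
		}(records)
	}
	for range files {
		wc.Add(1)
		go func() {
			defer wc.Done()
			for range link {
				// process the record here, then count it
				p.done()
			}
		}()
	}
	wp.Wait()
	close(link)
	wc.Wait()
}

func main() {
	// Three stand-in "files" with 20, 30, and 50 records.
	files := [][]string{make([]string, 20), make([]string, 30), make([]string, 50)}
	run(files, 5, func(pct, processed, total int) {
		fmt.Printf("progress: %d%% (%d of %d records)\n", pct, processed, total)
	})
}
```

Because the report callback runs while the mutex is held, reports come out in increasing order even with several consumers. If the total cannot be known before processing starts, the percentage can only be an estimate.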
In the code below, I don't understand why the "Worker" functions seem to exit instead of pulling values from the input channel "in" and processing them.
I had assumed they would only return after consuming and processing all input from the channel "in".
package main
import (
"fmt"
"sync"
)
type ParallelCallback func(chan int, chan Result, int, *sync.WaitGroup)
type Result struct {
i int
val int
}
func Worker(in chan int, out chan Result, id int, wg *sync.WaitGroup) {
for item := range in {
item *= item // returns the square of the input value
fmt.Printf("=> %d: %d\n", id, item)
out <- Result{item, id}
}
wg.Done()
fmt.Printf("%d exiting ", id)
}
func Run_parallel(n_workers int, in chan int, out chan Result, Worker ParallelCallback) {
wg := sync.WaitGroup{}
for id := 0; id < n_workers; id++ {
fmt.Printf("Starting : %d\n", id)
wg.Add(1)
go Worker(in, out, id, &wg)
}
wg.Wait() // wait for all workers to complete their tasks
close(out) // close the output channel when all tasks are completed
}
const (
NW = 4
)
func main() {
in := make(chan int)
out := make(chan Result)
go func() {
for i := 0; i < 100; i++ {
in <- i
}
close(in)
}()
Run_parallel(NW, in, out, Worker)
for item := range out {
fmt.Printf("From out : %d: %d", item.i, item.val)
}
}
The output is
Starting : 0
Starting : 1
Starting : 2
Starting : 3
=> 3: 0
=> 0: 1
=> 1: 4
=> 2: 9
fatal error: all goroutines are asleep - deadlock!
The full error shows where each goroutine is "stuck". If you run this in the playground, it will even show you the line number. That made it easy for me to diagnose.
Your Run_parallel runs in the main goroutine, so before main can read from out, Run_parallel must return. Before Run_parallel can return, wg.Wait() must return. But before the workers call wg.Done(), they must write to out, which is unbuffered and has no reader yet. That's what causes the deadlock.
One solution is simple: just run Run_parallel concurrently in its own Goroutine.
go Run_parallel(NW, in, out, Worker)
Now, main ranges over out, waiting on out's closure to signal completion. Run_parallel waits for the workers with wg.Wait(), and the workers range over in. All the work will get done, and the program won't end until it's all done. (https://go.dev/play/p/oMrgH2U09tQ)
Solution:
Run_parallel has to run in its own goroutine:
package main
import (
"fmt"
"sync"
)
type ParallelCallback func(chan int, chan Result, int, *sync.WaitGroup)
type Result struct {
id int
val int
}
func Worker(in chan int, out chan Result, id int, wg *sync.WaitGroup) {
defer wg.Done()
for item := range in {
item *= 2 // returns the double of the input value (Bogus handling of data)
out <- Result{id, item}
}
}
func Run_parallel(n_workers int, in chan int, out chan Result, Worker ParallelCallback) {
wg := sync.WaitGroup{}
for id := 0; id < n_workers; id++ {
wg.Add(1)
go Worker(in, out, id, &wg)
}
wg.Wait() // wait for all workers to complete their tasks
close(out) // close the output channel when all tasks are completed
}
const (
NW = 8
)
func main() {
in := make(chan int)
out := make(chan Result)
go func() {
for i := 0; i < 10; i++ {
in <- i
}
close(in)
}()
go Run_parallel(NW, in, out, Worker)
for item := range out {
fmt.Printf("From out [%d]: %d\n", item.id, item.val)
}
println("- - - All done - - -")
}
Alternative formulation of the solution:
In this alternative formulation, it is not necessary to start Run_parallel as a goroutine (it starts its own goroutine).
I prefer this second solution because Run_parallel() itself guarantees that it runs concurrently with the main function. For the same reason it is safer and less error-prone: there is no need to remember to call Run_parallel with the go keyword.
package main
import (
"fmt"
"sync"
)
type ParallelCallback func(chan int, chan Result, int, *sync.WaitGroup)
type Result struct {
id int
val int
}
func Worker(in chan int, out chan Result, id int, wg *sync.WaitGroup) {
defer wg.Done()
for item := range in {
item *= 2 // returns the double of the input value (Bogus handling of data)
out <- Result{id, item}
}
}
func Run_parallel(n_workers int, in chan int, out chan Result, Worker ParallelCallback) {
go func() {
wg := sync.WaitGroup{}
defer close(out) // close the output channel when all tasks are completed
for id := 0; id < n_workers; id++ {
wg.Add(1)
go Worker(in, out, id, &wg)
}
wg.Wait() // wait for all workers to complete their tasks, then return so the deferred close(out) runs
}()
}
const (
NW = 8
)
func main() {
in := make(chan int)
out := make(chan Result)
go func() {
defer close(in)
for i := 0; i < 10; i++ {
in <- i
}
}()
Run_parallel(NW, in, out, Worker)
for item := range out {
fmt.Printf("From out [%d]: %d\n", item.id, item.val)
}
println("- - - All done - - -")
}
package main
import (
"fmt"
"sync"
)
// PUT function
func put(hashMap map[string](chan int), key string, value int, wg *sync.WaitGroup) {
defer wg.Done()
fmt.Printf("this is getting printed")
hashMap[key] <- value
fmt.Printf("this is not getting printed")
fmt.Printf("PUT sent %d\n", value)
}
func main() {
var value int
var key string
wg := &sync.WaitGroup{}
hashMap := make(map[string](chan int), 100)
key = "xyz"
value = 100
for i := 0; i < 5; i++ {
wg.Add(1)
go put(hashMap, key, value, wg)
}
wg.Wait()
}
The last two print statements in the put function are not printed. I am trying to put values into the map based on the key.
Also, how do I close the hashMap in this case?
You need to create a channel first, for example hashMap[key] = make(chan int)
Since you are not reading from the channel, you need a buffered channel to make it work:
key := "xyz"
hashMap[key] = make(chan int, 5)
Try the following code:
func put(hashMap map[string](chan int), key string, value int, wg *sync.WaitGroup) {
hashMap[key] <- value
fmt.Printf("PUT sent %d\n", value)
wg.Done()
}
func main() {
var wg sync.WaitGroup
hashMap := map[string]chan int{}
key := "xyz"
hashMap[key] = make(chan int, 5)
for i := 0; i < 5; i++ {
wg.Add(1)
go put(hashMap, key, 100, &wg)
}
wg.Wait()
}
Output:
PUT sent 100
PUT sent 100
PUT sent 100
PUT sent 100
PUT sent 100
My solution to fix the problem is:
// PUT function
func put(hashMap map[string](chan int), key string, value int, wg *sync.WaitGroup) {
defer wg.Done()
fmt.Printf("this is getting printed")
hashMap[key] <- value // <-- nil problem
fmt.Printf("this is not getting printed")
fmt.Printf("PUT sent %d\n", value)
}
In the line hashMap[key] <- value in the put function, the send cannot proceed because the chan int looked up in the hashMap map[string](chan int) parameter is nil: no channel was ever stored under that key, and a send on a nil channel blocks forever.
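The nil-channel behaviour can be checked in isolation. The sketch below uses a hypothetical helper, wouldBlock, that attempts a non-blocking send: a lookup of a missing key yields a nil channel, and a send on it can never proceed.

```go
package main

import "fmt"

// wouldBlock reports whether a send on ch cannot proceed right now.
// A send on a nil channel can never proceed, so it always reports true.
// Note that a successful attempt really does send the value 1.
func wouldBlock(ch chan int) bool {
	select {
	case ch <- 1:
		return false
	default:
		return true
	}
}

func main() {
	hashMap := make(map[string]chan int)

	// A missing key yields the zero value of the element type: a nil channel.
	fmt.Println(hashMap["xyz"] == nil)      // true
	fmt.Println(wouldBlock(hashMap["xyz"])) // true

	// Once a buffered channel is stored under the key, the send goes through.
	hashMap["xyz"] = make(chan int, 5)
	fmt.Println(wouldBlock(hashMap["xyz"])) // false
}
```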
// PUT function
func put(hashMap map[string](chan int), cval chan int, key string, value int, wg *sync.WaitGroup) {
defer wg.Done()
fmt.Println("this is getting printed")
cval <- value // put the value in chan int (cval) which is initialized
hashMap[key] = cval // set the cval(chan int) to hashMap with key
fmt.Println("this is not getting printed")
fmt.Printf("PUT sent %s %d\n", key, value)
}
func main() {
var value int
wg := &sync.WaitGroup{}
cval := make(chan int,100)
hashMap := make(map[string](chan int), 100)
value = 100
for i := 0; i < 5; i++ {
wg.Add(1)
go put(hashMap, cval, fmt.Sprintf("key%d",i), value, wg)
}
wg.Wait()
/* uncomment to test cval
close(cval)
fmt.Println("Result:",<-hashMap["key2"])
fmt.Println("Result:",<-hashMap["key1"])
cval <- 88 // cannot send value to a close channel
hashMap["key34"] = cval
fmt.Println("Result:",<-hashMap["key1"])
*/
}
In my code example, I initialized cval as a buffered channel of size 100, the same size hint given to hashMap, and passed cval to the put function, which stores it as the map value. You can only close cval, not the hashMap itself.
Also, your code can be reduced to the following; there is no need to pass the extra parameters. One further modification is that I send different values to make the concept clearer.
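Since a map cannot be closed, "closing the hashMap" comes down to closing the channels stored in it. The sketch below uses a hypothetical helper, closeAll, and assumes all senders have finished before it is called. It closes each distinct channel once, because closing the same channel twice panics and the example above stores one shared channel under several keys.

```go
package main

import "fmt"

// closeAll closes every distinct, non-nil channel stored in the map.
// Channels are comparable, so they can be used as keys of the "closed" set.
func closeAll(m map[string]chan int) {
	closed := make(map[chan int]bool)
	for _, ch := range m {
		if ch == nil || closed[ch] {
			continue
		}
		close(ch)
		closed[ch] = true
	}
}

func main() {
	cval := make(chan int, 100)
	hashMap := map[string]chan int{"key0": cval, "key1": cval}
	cval <- 100

	closeAll(hashMap)

	// Buffered values can still be received after close; the loop ends once
	// the buffer is empty.
	for v := range hashMap["key0"] {
		fmt.Println("Result:", v)
	}
}
```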
package main
import (
"log"
"sync"
)
func put(hash chan int, wg *sync.WaitGroup) {
defer wg.Done()
log.Println("Put sent: ", <-hash)
}
func main() {
hashMap := map[string]chan int{}
key := "xyz"
var wg sync.WaitGroup
hashMap[key] = make(chan int, 5)
for i := 0; i < 5; i++ {
value := i
wg.Add(1)
go func(val int) {
hashMap[key] <- val
put(hashMap[key], &wg)
}(value)
}
wg.Wait()
}
I want to use goroutines to batch requests for different customers with different dates.
I mean 50 consumer goroutines to consume all customers from the db, and 2 date consumer goroutines to consume the date slice.
The main code is below, but it hangs and doesn't exit as expected.
Why doesn't it exit as expected?
func Run(){
var syncWg sync.WaitGroup
syncWg.Add(1)
go SyncCustomerMetricsHistory(&syncWg)
syncWg.Wait()
}
func SyncCustomerMetricsHistory(wg *sync.WaitGroup){
defer wg.Done()
odb := orm.NewOrm()
start := time.Now()
logs.Info("start sync customer metrics, time:[%v]", start)
qs := odb.QueryTable("gg_customer")
var customers []*db.GgCustomer
if num, err := qs.All(&customers); err != nil || num == 0 {
logs.Error("Get customer error, rows:[%v], err:[%v]", num, err)
}
customersChan := make(chan *db.GgCustomer, 50)
var wgC sync.WaitGroup
wgC.Add(50)
for i := 0; i < 50; i++ {
go syncCustomerMetricsHistory(customersChan, &wgC)
}
go func() {
for _, customer := range customers {
customersChan <- customer
}
close(customersChan)
}()
wgC.Wait()
}
func syncCustomerMetricsHistory(customerChan <- chan *db.GgCustomer, wg *sync.WaitGroup){
defer wg.Done()
for customer := range customerChan{
dateChan := make(chan string, 2)
var wgD sync.WaitGroup
wgD.Add(2)
for i := 1; i < 2; i++{
go test(dateChan, customer, &wgD)
}
go func(){
for _, date := range GetAllYearDate(){
dateChan <- date
}
close(dateChan)
}()
wgD.Wait()
}
}
func test(dateChan <- chan string, customer *db.GgCustomer, wg *sync.WaitGroup){
defer wg.Done()
for date := range dateChan{
fmt.Println(date, customer)
}
}
func GetAllYearDate() []string{
return []string{"2019-10-01", "2019-10-02"}
}
I have not tried to run this (as it requires additional code) but believe your issue is:
wgD.Add(2)
for i := 1; i < 2; i++{
go test(dateChan, customer, &wgD)
}
That for loop will only iterate once but you called wgD.Add(2) (I think you probably meant the loop to iterate twice; try i <= 2).
One other bit of feedback; the way you are using waitgroups will work but is hard to follow (perhaps leading to you not spotting the issue); how about something like:
func Run(){
SyncCustomerMetricsHistory() // No wait group needed as this will not return before done
}
func SyncCustomerMetricsHistory(){
odb := orm.NewOrm()
start := time.Now()
logs.Info("start sync customer metrics, time:[%v]", start)
qs := odb.QueryTable("gg_customer")
var customers []*db.GgCustomer
if num, err := qs.All(&customers); err != nil || num == 0 {
logs.Error("Get customer error, rows:[%v], err:[%v]", num, err)
}
customersChan := make(chan *db.GgCustomer, 50)
var wgC sync.WaitGroup
wgC.Add(50)
for i := 0; i < 50; i++ {
go func() {
syncCustomerMetricsHistory(customersChan)
wgC.Done()
}()
}
go func() {
for _, customer := range customers {
customersChan <- customer
}
close(customersChan)
}()
wgC.Wait()
}
func syncCustomerMetricsHistory(customerChan <- chan *db.GgCustomer){
for customer := range customerChan{
dateChan := make(chan string, 2)
var wgD sync.WaitGroup
wgD.Add(2)
for i := 1; i <= 2; i++{ // two workers, matching wgD.Add(2)
go func() {
test(dateChan, customer)
wgD.Done()
}()
}
go func(){
for _, date := range GetAllYearDate(){
dateChan <- date
}
close(dateChan)
}()
wgD.Wait()
}
}
I think this is easier to follow because you can see where wg.Done() is being called. It's also really easy to stick some fmt.Println commands on either side which makes it simpler to debug this kind of issue.
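A way to rule out the Add/loop mismatch entirely is to call wg.Add(1) inside the loop, right before each goroutine is started, so the counter cannot drift from the number of workers. This is a minimal sketch with a hypothetical function fanOut; it is not the asker's code.

```go
package main

import (
	"fmt"
	"sync"
)

// fanOut starts the given number of workers, feeds them the items, waits for
// all of them, and returns how many items were handled.
func fanOut(items []string, workers int) int {
	ch := make(chan string, len(items))
	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := 0

	for i := 0; i < workers; i++ {
		wg.Add(1) // one Add per goroutine actually started
		go func() {
			defer wg.Done()
			for range ch {
				mu.Lock()
				handled++
				mu.Unlock()
			}
		}()
	}

	for _, item := range items {
		ch <- item
	}
	close(ch) // lets the workers' range loops end
	wg.Wait()
	return handled
}

func main() {
	dates := []string{"2019-10-01", "2019-10-02"}
	fmt.Println(fanOut(dates, 2))
}
```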
I'm currently staring at a beefed up version of the following code:
func embarrassing(data []string) []string {
resultChan := make(chan string)
var waitGroup sync.WaitGroup
for _, item := range data {
waitGroup.Add(1)
go func(item string) {
defer waitGroup.Done()
resultChan <- doWork(item)
}(item)
}
go func() {
waitGroup.Wait()
close(resultChan)
}()
var results []string
for result := range resultChan {
results = append(results, result)
}
return results
}
This is just blowing my mind. All this is doing can be expressed in other languages as
results = parallelMap(data, doWork)
Even if it can't be done quite this easily in Go, isn't there still a better way than the above?
If you need all the results, you don't need the channel (and the extra goroutine to close it) to communicate the results, you can write directly into the results slice:
func cleaner(data []string) []string {
results := make([]string, len(data))
wg := &sync.WaitGroup{}
wg.Add(len(data))
for i, item := range data {
go func(i int, item string) {
defer wg.Done()
results[i] = doWork(item)
}(i, item)
}
wg.Wait()
return results
}
This is possible because slice elements act as distinct variables, and thus can be written individually without synchronization. For details, see Can I concurrently write different slice elements. You also get the results in the same order as your input for free.
Another variation: if doWork() did not return the result but instead received the address where the result should be "placed", plus the sync.WaitGroup to signal completion, then doWork() could be launched "directly" as a new goroutine.
We can create a reusable wrapper for doWork():
func doWork2(item string, result *string, wg *sync.WaitGroup) {
defer wg.Done()
*result = doWork(item)
}
If you have the processing logic in such format, this is how it can be executed concurrently:
func cleanest(data []string) []string {
results := make([]string, len(data))
wg := &sync.WaitGroup{}
wg.Add(len(data))
for i, item := range data {
go doWork2(item, &results[i], wg)
}
wg.Wait()
return results
}
Yet another variation could be to pass a channel to doWork() on which it is supposed to deliver the result. This solution doesn't even require a sync.WaitGroup, as we know how many elements we want to receive from the channel:
func cleanest2(data []string) []string {
ch := make(chan string)
for _, item := range data {
go doWork3(item, ch)
}
results := make([]string, len(data))
for i := range results {
results[i] = <-ch
}
return results
}
func doWork3(item string, res chan<- string) {
res <- "done:" + item
}
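cleanest2 stores results in the order they arrive on the channel. If input order matters, one option is to send the item's index along with its result. The sketch below assumes a hypothetical indexed struct and a doWorkIndexed variant of doWork3.

```go
package main

import "fmt"

// indexed pairs a result with the position of its input item.
type indexed struct {
	i   int
	val string
}

// doWorkIndexed is a stand-in for the real work; it sends the result
// together with the index it was given.
func doWorkIndexed(i int, item string, res chan<- indexed) {
	res <- indexed{i, "done:" + item}
}

// ordered runs one goroutine per item and places each result at the index
// of its input, so the output order matches the input order.
func ordered(data []string) []string {
	ch := make(chan indexed)
	for i, item := range data {
		go doWorkIndexed(i, item, ch)
	}
	results := make([]string, len(data))
	for range data {
		r := <-ch
		results[r.i] = r.val
	}
	return results
}

func main() {
	fmt.Println(ordered([]string{"a", "b", "c"})) // [done:a done:b done:c]
}
```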
"Weakness" of this last solution is that it may collect the result "out-of-order" (which may or may not be a problem). This approach can be improved to retain order by letting doWork() receive and return the index of the item. For details and examples, see How to collect values from N goroutines executed in a specific order?
You can also use reflection to achieve something similar.
In this example it distributes the handler function over 4 goroutines and returns the results in a new slice of the same type as the given source slice.
package main
import (
"fmt"
"reflect"
"strings"
"sync"
)
func parralelMap(some interface{}, handle interface{}) interface{} {
rSlice := reflect.ValueOf(some)
rFn := reflect.ValueOf(handle)
dChan := make(chan reflect.Value, 4)
rChan := make(chan []reflect.Value, 4)
var waitGroup sync.WaitGroup
for i := 0; i < 4; i++ {
waitGroup.Add(1)
go func() {
defer waitGroup.Done()
for v := range dChan {
rChan <- rFn.Call([]reflect.Value{v})
}
}()
}
nSlice := reflect.MakeSlice(rSlice.Type(), rSlice.Len(), rSlice.Cap())
for i := 0; i < rSlice.Len(); i++ {
dChan <- rSlice.Index(i)
}
close(dChan)
go func() {
waitGroup.Wait()
close(rChan)
}()
i := 0
for v := range rChan {
nSlice.Index(i).Set(v[0])
i++
}
return nSlice.Interface()
}
func main() {
fmt.Println(
parralelMap([]string{"what", "ever"}, strings.ToUpper),
)
}
Test here https://play.golang.org/p/iUPHqswx8iS
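With type parameters (this assumes Go 1.18 or later), a helper close to the parallelMap(data, doWork) call from the question can be written without reflection. This is a sketch, not a standard library function; unlike the reflection version it starts one goroutine per item instead of a fixed pool of 4, and it keeps results in input order.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// parallelMap applies fn to every element of data concurrently and returns
// the results in input order. Each goroutine writes only to its own slice
// element, so no further synchronization is needed.
func parallelMap[T, R any](data []T, fn func(T) R) []R {
	results := make([]R, len(data))
	var wg sync.WaitGroup
	wg.Add(len(data))
	for i, item := range data {
		go func(i int, item T) {
			defer wg.Done()
			results[i] = fn(item)
		}(i, item)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(parallelMap([]string{"what", "ever"}, strings.ToUpper)) // [WHAT EVER]
}
```

For very large inputs, a bounded number of workers is likely a better fit than one goroutine per element.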
I have 5 huge (4 million rows each) logfiles that I process in Perl currently and I thought I may try to implement the same in Go and its concurrent features. So, being very inexperienced in Go, I was thinking of doing as below. Any comments on the approach will be greatly appreciated.
Some rough pseudocode:
var wg1 sync.WaitGroup
var wg2 sync.WaitGroup
func processRow (r Row) {
wg2.Add(1)
defer wg2.Done()
res = <process r>
return res
}
func processFile(f File) {
wg1.Add(1)
open(newfile File)
defer wg1.Done()
line = <row from f>
result = go processRow(line)
newFile.Println(result) // Write new processed line to newFile
wg2.Wait()
newFile.Close()
}
func main() {
for each f logfile {
go processFile(f)
}
wg1.Wait()
}
So, idea is that I process these 5 files concurrently and then all rows of each file will in turn also be processed concurrently.
Will that work?
You should definitely use channels to manage your processed rows. Alternatively you could also write another goroutine to handle your output.
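var numGoWriters = 1 // each writer buffers on its own, so more than one could split lines in the shared file
func processRow(r string, ch chan<- string) {
res := process(r) // process stands for your own row-processing function
ch <- res
}
func writeRow(f *os.File, ch <-chan string, wg *sync.WaitGroup) {
defer wg.Done()
w := bufio.NewWriter(f)
defer w.Flush() // push buffered lines to the file before returning
for s := range ch {
if _, err := w.WriteString(s + "\n"); err != nil {
// handle it
}
}
}
func processFile(f *os.File) {
outFile, err := os.Create("/path/to/file.out")
if err != nil {
// handle it
}
defer outFile.Close()
ch := make(chan string, 10) // play with this number for performance
// Start the writers first so the row goroutines have someone to send to.
var writers sync.WaitGroup
for i := 0; i < numGoWriters; i++ {
writers.Add(1)
go writeRow(outFile, ch, &writers)
}
var wg sync.WaitGroup
fScanner := bufio.NewScanner(f)
for fScanner.Scan() {
line := fScanner.Text() // copy the line before the next Scan replaces it
wg.Add(1)
go func() {
defer wg.Done()
processRow(line, ch)
}()
}
wg.Wait() // every row has been sent
close(ch) // once we're done processing rows, we close the channel so the writers exit
writers.Wait() // the writers have flushed, so the deferred Close is safe
}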
Here processRow does all the processing (I assumed the result is a string), writeRow does all the output I/O, and processFile ties each file together. Then all main has to do is hand off the files and spawn the goroutines, et voilà.
func main() {
var wg sync.WaitGroup
filenames := [...]string{"here", "are", "some", "log", "paths"}
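for _, fname := range filenames { // range yields (index, value); the path is the value
inFile, err := os.Open(fname)
if err != nil {
// handle it
}
defer inFile.Close()
wg.Add(1)
go func() {
defer wg.Done() // without this, wg.Wait() below never returns
processFile(inFile)
}()
}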
wg.Wait()