Efficient way to process lines of a file - Go

I'm trying to learn some Go, and I'm writing a script that reads a CSV file line by line, does some processing, and collects the results.
I'm following a pipeline pattern: a goroutine reads the file line by line with a scanner, the lines are sent to a channel, and different goroutines consume the channel's content.
An example of what I am trying to do:
https://gist.github.com/pkulak/93336af9bb9c7207d592
My problem is this: the CSV file contains lots of records, and I want to do some math between consecutive lines. Let's say record 1 is r1, record 2 is r2, and so on.
When I read the file I have r1; on the next scanner iteration I have r2. I want to see if r1-r2 is a valid number for me. If yes, do some math between them, then check r3 and r4 the same way. If r1-r2 is not valid, ignore r2 and try r1-r3, and so on.
Should I handle this while reading the file, or after the lines are in the channel, when handling the channel's content?
Any suggestion that doesn't break concurrency?

I think you should determine whether "r1-r2 is valid for you" inside the "Read the lines into the work queue" function.
So: read the current line, then read the following lines one by one until you have a valid pair of numbers. When you get the pair, send it into the workQueue channel and search for the next pair.
This is your code with changes:
package main

import (
    "bufio"
    "errors"
    "log"
    "os"
)

var concurrency = 100

type Pair struct {
    line1 string
    line2 string
}

func main() {
    // It would be better to receive the file path from somewhere (like args)
    filePath := "/path/to/file.csv"
    // This channel has no buffer, so it only accepts input when something is ready
    // to take it out. This keeps the reading from getting ahead of the workers.
    workQueue := make(chan Pair)
    // We need to know when everyone is done so we can exit.
    complete := make(chan bool)
    // Read the lines into the work queue.
    go func() {
        file, e := os.Open(filePath)
        if e != nil {
            log.Fatal(e)
        }
        // Close when the function returns.
        defer file.Close()
        scanner := bufio.NewScanner(file)
        // Get pairs and send them into the "workQueue" channel.
        for {
            line1, e := getNextCorrectLine(scanner)
            if e != nil {
                break
            }
            line2, e := getNextCorrectLine(scanner)
            if e != nil {
                break
            }
            workQueue <- Pair{line1, line2}
        }
        // Close the channel so everyone reading from it knows we're done.
        close(workQueue)
    }()
    // Now read them all off, concurrently.
    for i := 0; i < concurrency; i++ {
        go startWorking(workQueue, complete)
    }
    // Wait for everyone to finish.
    for i := 0; i < concurrency; i++ {
        <-complete
    }
}

func getNextCorrectLine(scanner *bufio.Scanner) (string, error) {
    for scanner.Scan() {
        line := scanner.Text()
        if isCorrect(line) {
            return line, nil
        }
    }
    return "", errors.New("no more lines")
}

func isCorrect(str string) bool {
    // Make your validation here.
    return true
}

func startWorking(pairs <-chan Pair, complete chan<- bool) {
    for pair := range pairs {
        doTheWork(pair)
    }
    // Let the main goroutine know we're done.
    complete <- true
}

func doTheWork(pair Pair) {
    // Do the work with the pair.
}
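The isCorrect and doTheWork stubs are where your own logic goes. As a hedged sketch (assuming, since the question doesn't specify, that a line is valid when its first CSV field parses as a number and that "the math" is a difference; strconv and strings would need to be added to the imports):

func isCorrect(str string) bool {
    // Hypothetical rule: a line is usable when its first CSV field is a float.
    _, err := strconv.ParseFloat(strings.Split(str, ",")[0], 64)
    return err == nil
}

func doTheWork(pair Pair) {
    // Hypothetical math: the difference of the first fields of the two lines.
    a, _ := strconv.ParseFloat(strings.Split(pair.line1, ",")[0], 64)
    b, _ := strconv.ParseFloat(strings.Split(pair.line2, ",")[0], 64)
    log.Printf("difference: %f", a-b)
}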


How to batch process files using goroutines?

Assuming I have a bunch of files to deal with (say 1000 or more), first they should be processed by function A(), which generates a file; then that file should be processed by B().
If we do it one by one, that's too slow, so I'm thinking of processing 5 files at a time using goroutines (we can't process too many at a time because the CPU can't bear it).
I'm a newbie in Go and I'm not sure if my thinking is correct. I see function A() as a producer and function B() as a consumer: B() deals with the file produced by A(). I wrote some code below; forgive me, I really don't know how to write this code. Can anyone help? Thank you in advance!
package main

import "fmt"

var Box = make(chan string, 1024)

func A(file string) {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1"
    Box <- fileGenByA
}

func B(file string) {
    fmt.Println(file, "is processing in func B()...")
}

func main() {
    // assuming that this is the file list read from a directory
    fileList := []string{
        "/path/to/file1",
        "/path/to/file2",
        "/path/to/file3",
    }
    // it seems I can't do this, because fileList may have 1000 or more files
    for _, v := range fileList {
        go A(v)
    }
    // can I do this?
    for file := range Box {
        go B(file)
    }
}
Update:
Sorry, maybe I haven't made myself clear. The file generated by function A() is actually stored on the hard disk (generated by a command-line tool that I simply execute with exec.Command()), not in a variable (in memory), so it doesn't have to be passed to function B() immediately.
I think there are two approaches:
approach 1
approach 2
Actually I prefer approach 2. As you can see, the first B() doesn't have to process file1GenByA; it's the same for B() to process any file in the box, because file1GenByA may be generated after file2GenByA (maybe the file is larger, so it takes more time).
You could spawn 5 goroutines that read from a work channel. That way you have 5 goroutines running at all times, and you don't need to batch the work, where you would have to wait for 5 to finish before starting the next 5.
func main() {
    stack := []string{"a", "b", "c", "d", "e", "f", "g", "h"}
    work := make(chan string)
    results := make(chan string)

    // create 5 worker goroutines
    wg := sync.WaitGroup{}
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for s := range work {
                results <- B(A(s))
            }
        }()
    }

    // Send the work to the workers. This happens in a goroutine
    // in order to not block the main function once all 5
    // workers are busy.
    go func() {
        for _, s := range stack {
            // could read the file from disk
            // here and pass a pointer to the file
            work <- s
        }
        // close the work channel after
        // all the work has been sent
        close(work)
        // wait for the workers to finish,
        // then close the results channel
        wg.Wait()
        close(results)
    }()

    // Collect the results. The iteration stops when the results
    // channel is closed and the last value has been received.
    for result := range results {
        // could write the file to disk
        fmt.Println(result)
    }
}
https://play.golang.com/p/K-KVX4LEEoK
You're halfway there. There are a few things you need to fix:
your program deadlocks because nothing closes Box, so the main function can never finish ranging over it;
you aren't waiting for your goroutines to finish, and you're starting far more than 5 goroutines at a time. (The solutions to these are too intertwined to describe separately.)
1. Deadlock
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
main.main()
When you range over a channel, you read each value from the channel until it is both closed and empty. Since you never close the channel, the range over that channel can never complete, and the program can never finish.
This is a fairly easy problem to solve in your case: we just need to close the channel when we know there will be no more writes to the channel.
for _, v := range fileList {
    go A(v)
}
close(Box)
Keep in mind that closing a channel doesn't stop it from being read, only written to. Consumers can now distinguish between an empty channel that may receive more data in the future and an empty channel that will never receive more data.
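For example, the two-value receive form tells a consumer which case it is in:

v, ok := <-Box
if !ok {
    // Box is closed and drained: no more values will ever arrive
    return
}
fmt.Println("got", v) // v is only meaningful when ok is true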
Once you add the close(Box), the program doesn't deadlock anymore, but it still doesn't work.
2. Too Many Goroutines and not waiting for them to complete
To run a certain maximum number of concurrent executions, instead of creating a goroutine for each input, create the goroutines in a "worker pool":
Create a channel to pass the workers their work
Create a channel for the goroutines to return their results, if any
Start the number of goroutines you want
Start at least one additional goroutine to either dispatch work or collect the result, so you don't have to try doing both from the main goroutine
use a sync.WaitGroup to wait for all data to be processed
close the channels to signal to the workers and the results collector that their channels are done being filled.
Before we get into the implementation, let's talk about how A and B interact.
first they should be processed by function A(), function A() will generate a file, then this file will be processed by B().
A() and B() must, then, execute serially. They can still pass their data through a channel, but since their execution must be serial, the channel does nothing for you. It's simpler to run them sequentially in the workers. For that, we need to change A() either to call B(), or to return the path for B() so the worker can make the call. I chose the latter.
func A(file string) string {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1"
    return fileGenByA
}
Before we write our worker function, we must also consider the result of B. Currently, B returns nothing. In the real world, unless B() cannot fail, you would want to at least return the error, or panic. I'll skip over collecting results for now.
Now we can write our worker function.
func worker(wg *sync.WaitGroup, incoming <-chan string) {
    defer wg.Done()
    for file := range incoming {
        B(A(file))
    }
}
Now all we have to do is start 5 such workers, write the incoming files to the channel, close it, and wg.Wait() for the workers to complete.
incoming_work := make(chan string)
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
    wg.Add(1)
    go worker(&wg, incoming_work)
}
for _, v := range fileList {
    incoming_work <- v
}
close(incoming_work)
wg.Wait()
Full example at https://go.dev/play/p/A1H4ArD2LD8
Returning Results.
It's all well and good to be able to kick off goroutines and wait for them to complete. But what if you need results back from your goroutines? In all but the simplest of cases, you would at least want to know if files failed to process so you could investigate the errors.
We have only 5 workers but many files, so we have many results, and each worker will have to return several of them. So: another channel. It's usually worth defining a struct for your return value:
type result struct {
    file string
    err  error
}
This tells us not just that there was an error, but also clearly identifies which file the error resulted from.
How will we test an error case in our current code? In your example, B always gets the same value from A. If we add A's incoming file name to the path it passes to B, we can mock an error based on a substring. My mocked error will be that file3 fails.
func A(file string) string {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1/" + file
    return fileGenByA
}

func B(file string) (r result) {
    r.file = file
    fmt.Println(file, "is processing in func B()...")
    if strings.Contains(file, "file3") {
        r.err = fmt.Errorf("Test error")
    }
    return
}
Our workers will be sending results, but we need to collect them somewhere. main() is busy dispatching work to the workers, blocking on its write to incoming_work when the workers are all busy, so the simplest place to collect the results is another goroutine. Our results collector has to read from a results channel, print out errors for debugging, and then return the total number of failures so our program can exit with a final status indicating overall success or failure.
failures_chan := make(chan int)
go func() {
    var failures int
    for result := range results {
        if result.err != nil {
            failures++
            fmt.Printf("File %s failed: %s", result.file, result.err.Error())
        }
    }
    failures_chan <- failures
}()
Now we have another channel to close, and it's important we close it after all workers are done. So we close(results) after we wg.Wait() for the workers.
close(incoming_work)
wg.Wait()
close(results)
if failures := <-failures_chan; failures > 0 {
    os.Exit(1)
}
Putting all that together, we end up with this code:
package main

import (
    "fmt"
    "os"
    "strings"
    "sync"
)

func A(file string) string {
    fmt.Println(file, "is processing in func A()...")
    fileGenByA := "/path/to/fileGenByA1/" + file
    return fileGenByA
}

func B(file string) (r result) {
    r.file = file
    fmt.Println(file, "is processing in func B()...")
    if strings.Contains(file, "file3") {
        r.err = fmt.Errorf("Test error")
    }
    return
}

func worker(wg *sync.WaitGroup, incoming <-chan string, results chan<- result) {
    defer wg.Done()
    for file := range incoming {
        results <- B(A(file))
    }
}

type result struct {
    file string
    err  error
}

func main() {
    // assuming that this is the file list read from a directory
    fileList := []string{
        "/path/to/file1",
        "/path/to/file2",
        "/path/to/file3",
    }

    incoming_work := make(chan string)
    results := make(chan result)
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go worker(&wg, incoming_work, results)
    }

    failures_chan := make(chan int)
    go func() {
        var failures int
        for result := range results {
            if result.err != nil {
                failures++
                fmt.Printf("File %s failed: %s", result.file, result.err.Error())
            }
        }
        failures_chan <- failures
    }()

    for _, v := range fileList {
        incoming_work <- v
    }
    close(incoming_work)
    wg.Wait()
    close(results)
    if failures := <-failures_chan; failures > 0 {
        os.Exit(1)
    }
}
And when we run it, we get:
/path/to/file1 is processing in func A()...
/path/to/fileGenByA1//path/to/file1 is processing in func B()...
/path/to/file2 is processing in func A()...
/path/to/fileGenByA1//path/to/file2 is processing in func B()...
/path/to/file3 is processing in func A()...
/path/to/fileGenByA1//path/to/file3 is processing in func B()...
File /path/to/fileGenByA1//path/to/file3 failed: Test error
Program exited.
A final thought: buffered channels.
There is nothing wrong with buffered channels. Especially if you know the overall size of incoming work and results, buffered channels can obviate the results collector goroutine because you can allocate a buffered channel big enough to hold all results. However, I think it's more straightforward to understand this pattern if the channels are unbuffered. The key takeaway is that you don't need to know the number of incoming or outgoing results, which could indeed be different numbers or based on something that can't be predetermined.
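For instance, a sketch of that buffered variant, assuming (as in the example above) exactly one result per input file:

results := make(chan result, len(fileList)) // room for one result per file
// ... start the workers and send fileList into incoming_work exactly as before ...
close(incoming_work)
wg.Wait()      // workers never block sending results, so this returns
close(results) // safe: all sends have already happened
var failures int
for r := range results {
    if r.err != nil {
        failures++
    }
}

Because the buffer can hold every result, the workers never block on the results channel, and main can collect everything itself after wg.Wait().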

Go code with execution control using channels

I'm extracting all the data from a long Redshift table in chunks, each chunk to a CSV file. I want to control how many files are created at the "same" time (concurrently). I.e., if the whole process will create 10 files, I want to create, say, 4 files, wait until they are done, and once they are "done", create another 4, and then the remaining 2.
How can I achieve this using channels?
I've tried to change the following slice to a channel, but I couldn't get it to work: as I said, my implementation did not wait for the first 4 files to finish before creating the following ones.
Right now I'm doing the following using WaitGroup:
package, imports, var, etc...
//Inside func main:
//Create a WaitGroup
//Create a WaitGroup
var wg = sync.WaitGroup{}

//Opening the connection
db, err := sql.Open("postgres", connStr)
if err != nil {
    panic(err.Error())
}
defer db.Close()

//Define chunks using a slice
chunkSizer := Slicer(totalRowsInTable, numberRowsByChunk) // e.g. []int{100, 100, 100... 100}

//Iterating over the slice
for index := range chunkSizer {
    wg.Add(1)
    go ExtractorToCSV(db, queriedSection, expFileName)
    if (index+1)%4 == 0 { // <-- 4 is the maximum number of files created at the "same" time
        wg.Wait()
    }
}
wg.Wait() // <-- waits for the remaining files (last 2 in this case)

//Outside main
func ExtractorToCSV(db *sql.DB, queryToExtract, fileName string) {
    //... do its process
    wg.Done()
}
I've tried using a buffered channel of the size I wanted to stop at (4 in this case), but I didn't use it properly, I guess...
Thanks in advance.
UPDATED - STOP CONDITION
You can use a channel to hold back the next lines of code, like this. This is minimal code I wrote for you; tweak it as you like.
var doneCh = make(chan bool)

func main() {
    WRITE_POOL := 4
    MAX := len(RANGE) // RANGE is whatever you iterate over (e.g. your chunk slice)
    for index, val := range RANGE {
        go extractToFile(val)
        if (index+1)%WRITE_POOL == 0 {
            // wait for a full batch to finish when the iteration
            // count is divisible by WRITE_POOL
            for i := 0; i < WRITE_POOL; i++ {
                <-doneCh
            }
        } else if index == MAX-1 {
            // wait for whatever is left over
            // when this is the last value
            LEFT := (index + 1) % WRITE_POOL
            for i := 0; i < LEFT; i++ {
                <-doneCh
            }
        }
    }
}

func extractToFile(val int) {
    os.Create(fmt.Sprintf("test-%d", val))
    doneCh <- true
}
For better performance, try to:
Create a data channel so the main function can send data into it and ExtractorToCSV can receive from it.
Run ExtractorToCSV as goroutines reading from the data channel; after each one finishes writing its file, have it send true to doneCh.
I will update this post if you need more examples.
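As an aside, since the question mentions trying a buffered channel of size 4: a common way to make that work is to use the buffered channel as a semaphore. A minimal sketch, reusing names from the question (db, chunkSizer, ExtractorToCSV; the query and file-name arguments are elided as in the original):

sem := make(chan struct{}, 4) // capacity 4 = at most 4 files in flight
var wg sync.WaitGroup
for range chunkSizer {
    wg.Add(1)
    sem <- struct{}{} // blocks whenever 4 extractions are already running
    go func() {
        defer wg.Done()
        defer func() { <-sem }() // release the slot when finished
        ExtractorToCSV(db, queriedSection, expFileName)
    }()
}
wg.Wait() // wait for the remaining files

Unlike the strict batches-of-4 version, this starts a new extraction as soon as any slot frees up, keeping 4 in flight until the work runs out, which is usually what you actually want.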

Stop looking for user input with 'bufio.NewReader(os.Stdin)'

New to Go and to programming in general. I am currently writing a small quiz program for a learning task and ran into a snag that the tutorial does not address, because I have features not included in the tutorial.
Code is included below:
func runQuestions(randomize bool) int {
    tempqSlice := qSlice // Create a temporary set of questions (don't touch the original)
    if randomize {       // If the user has chosen to randomize the question order
        tempqSlice = shuffle(tempqSlice)
    }
    var runningscore int
    userinputchan := make(chan string)      // Create return channel for user input
    go getInput(userinputchan)              // Constantly wait for user input
    for num, question := range tempqSlice { // Iterate over each question
        fmt.Printf("Question %v:\t%v\n\t", num+1, question.GetQuestion())
        select {
        case <-time.After(5 * time.Second):
            fmt.Println("-time up! next question-")
            continue
        case input := <-userinputchan:
            if question.GetAnswer() == input { // If the answer is correct
                runningscore++
                continue // Continue to the next question
            }
        }
    }
    return runningscore
}

func getInput(returnchan chan<- string) {
    for {
        reader := bufio.NewReader(os.Stdin)    // Create reader
        input, _ := reader.ReadString('\n')    // Read until newline
        returnchan <- strings.TrimSpace(input) // Trim the input and send it
    }
}
Because the problem specification requires each question to have a time limit, I have an endless-for-loop goroutine running that waits for user input and sends it along when given. My problem is simple: I would like to stop the reader looking for input once the quiz is over, but since reader.ReadString('\n') is already awaiting input, I'm not sure how.
I would like to stop the reader looking for input once the quiz is over
While the reader is looking for input, you can use a goroutine that runs in the background to check whether the quiz timer has expired.
Suppose your quiz timer is 30 seconds. Pass the timer to the getInput goroutine and check for expiry.
var runningscore int
userinputchan := make(chan string) // Create return channel for user input
myTime := flag.Int("timer", 30, "time to complete the quiz")
timer := startTimer(myTime)
go getInput(userinputchan, timer)

func startTimer(myTime *int) *time.Timer {
    return time.NewTimer(time.Duration(*myTime) * time.Second)
}

func getInput(returnchan chan<- string, timer *time.Timer) {
    go checkTime(timer)                 // one watcher for the whole quiz, not one per read
    reader := bufio.NewReader(os.Stdin) // Create the reader once
    for {
        input, _ := reader.ReadString('\n')    // Read a line
        returnchan <- strings.TrimSpace(input) // Trim the input and send it
    }
}

func checkTime(timer *time.Timer) {
    <-timer.C
    fmt.Println("\nYour quiz is over!!!")
    // print the final score
    os.Exit(1)
}
You cannot do this out of the box, but you can work around it by reading character by character from stdin, buffering the input, and only sending the buffered data when the user hits Enter.
Unfortunately, there is no standard package for reading character by character, so I will be using github.com/eiannone/keyboard (you can get it with go get -u github.com/eiannone/keyboard).
The following type will be your input system. You can instantiate it by calling NewInput(yourResultChannel). It starts reading from stdin character by character and buffers the characters into buf. Once Enter (char == 0) is hit, it sends the contents of the buffer to the result channel and resets the buffer.
type Input struct {
    lock     sync.Mutex
    buf      []rune
    resultCh chan string
}

func NewInput(resultCh chan string) *Input {
    i := &Input{
        resultCh: resultCh,
    }
    go i.readStdin()
    return i
}

func (i *Input) ResetBuffer() {
    i.lock.Lock()
    defer i.lock.Unlock()
    i.resetBuffer()
}

func (i *Input) readStdin() {
    for {
        char, _, err := keyboard.GetSingleKey()
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s", string(char))
        i.lock.Lock()
        if char == 0 {
            i.resultCh <- strings.TrimSpace(string(i.buf))
            i.resetBuffer()
        } else {
            i.buf = append(i.buf, char)
        }
        i.lock.Unlock()
    }
}

func (i *Input) resetBuffer() {
    i.buf = []rune{}
}
Your only remaining job is to create an instance of the Input type at the point where you were previously starting your getInput() goroutine:
userinputchan := make(chan string) //Create return channel for user input
input := NewInput(userinputchan)
and reset the buffer when the question times out so that the previous garbage input won't be part of the new question's answer. In the select statement:
case <-time.After(5 * time.Second):
    fmt.Println("-time up! next question-")
    input.ResetBuffer()
    continue
Full code: https://play.golang.org/p/-IXVvY9dOZO

How to ensure the Golang channel waits for the data and the program does not terminate if Stdin does not have data

I have a Golang program that makes real-time predictions with a machine learning model built using TensorFlow. The data for a prediction needs to be read line by line from Stdin, and the prediction has to be performed on each line. The flow of data is not constant. I need a system that ensures that whenever there is data to read from Stdin, the prediction method is invoked, and that when there is no data, the program waits for new data instead of terminating.
I tried achieving this using channels and select, but if there is no data on Stdin, the program terminates. Below is the code snippet:
func run_the_model(in <-chan string) {
    go func() {
        ...
        ...
        ...
        //Fetch the model
        //Run the prediction
        //print the result on StdOut
    }()
}

func main() {
    data := make(chan string)
    // read data from Stdin
    go func() {
        scan := bufio.NewScanner(os.Stdin)
        for scan.Scan() {
            data <- scan.Text()
        }
    }()
    time.Sleep(time.Second * 5)
    select {
    case <-data:
        run_the_model(data)
        time.Sleep(time.Second * 5)
    default:
        println("Waiting for data")
        time.Sleep(time.Duration(math.MaxInt64))
    }
}
When there is no new data on Stdin, the select's default case must execute, and when there is new data in the data channel, run_the_model must execute. How can this be achieved?
Put your select in an infinite loop. (Also shorten the sleep in the default case: sleeping for math.MaxInt64 would park the program on its first pass and the loop would never check again.)
for {
    select {
    case <-data:
        run_the_model(data)
        time.Sleep(time.Second * 5)
    default:
        println("Waiting for data")
        time.Sleep(time.Second) // brief pause before checking again
    }
}
I think you're using select wrong; for your case this should work:
func runTheModel(in string) {
    // do whatever you want
}

func main() {
    data := make(chan string)
    // read data from Stdin
    go func() {
        scan := bufio.NewScanner(os.Stdin)
        for scan.Scan() {
            data <- scan.Text()
        }
    }()
    println("waiting for data:")
    for d := range data {
        // command to exit the program
        if d == "q" {
            return
        }
        go runTheModel(d)
    }
}

I want to split a file into equally sized "chunks", or slices and use goroutines to process them simultaneously

Using Go, I have large log files. Currently I open them, create a new scanner with bufio.NewScanner, and loop through the lines with scanner.Scan(). Each line is sent through a processing function, which matches it against regular expressions and extracts data. I would like to process each file in chunks simultaneously using goroutines; I believe this may be quicker than looping through the whole file sequentially.
It can take a few seconds per file, and I'm wondering if I can process a single file in, say, 10 pieces at a time. I believe I can sacrifice memory if needed: I have ~3 GB, and the biggest log file is maybe 75 MB.
I see that a scanner has a .Split() method where you can provide a custom split function, but I wasn't able to find a good solution using it.
I've also tried creating a slice of slices, looping through the scanner with scanner.Scan() and appending scanner.Text() to each slice. For example:
// pseudocode because I couldn't get this to work either
scanner := bufio.NewScanner(logInfo)
threads := make([][]string, 5)
i := 0
for scanner.Scan() {
    threads[i] = append(threads[i], scanner.Text())
    i = (i + 1) % 5
}
fmt.Println(threads)
I'm new to Go and concerned about efficiency and performance. I want to learn how to write good Go code! Any help or advice is really appreciated.
Peter gives a good starting point. If you wanted to do something like a fan-out, fan-in pattern, you could do something like this:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "sync"
)

func main() {
    file, err := os.Open("/path/to/file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    lines := make(chan string)
    // start four workers to do the heavy lifting
    wc1 := startWorker(lines)
    wc2 := startWorker(lines)
    wc3 := startWorker(lines)
    wc4 := startWorker(lines)

    scanner := bufio.NewScanner(file)
    go func() {
        defer close(lines)
        for scanner.Scan() {
            lines <- scanner.Text()
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }()

    merged := merge(wc1, wc2, wc3, wc4)
    for line := range merged {
        fmt.Println(line)
    }
}

func startWorker(lines <-chan string) <-chan string {
    finished := make(chan string)
    go func() {
        defer close(finished)
        for line := range lines {
            // Do your heavy work here
            finished <- line
        }
    }()
    return finished
}

func merge(cs ...<-chan string) <-chan string {
    var wg sync.WaitGroup
    out := make(chan string)

    // Start an output goroutine for each input channel in cs. output
    // copies values from c to out until c is closed, then calls wg.Done.
    output := func(c <-chan string) {
        for n := range c {
            out <- n
        }
        wg.Done()
    }
    wg.Add(len(cs))
    for _, c := range cs {
        go output(c)
    }

    // Start a goroutine to close out once all the output goroutines are
    // done. This must start after the wg.Add call.
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}
If it is acceptable that line N+1 is processed before line N, you can use a simple fan-out pattern to get started. The Go blog explains this and more advanced patterns, such as cancellation and fan-in.
Note that this is just a starting point to keep it simple and on point. You would almost certainly want to wait for the process functions to return before exiting, for instance. This is explained in the mentioned blog post.
package main

import "bufio"

func main() {
    var sc *bufio.Scanner // assume this is initialized from your file
    lines := make(chan string)
    go process(lines)
    go process(lines)
    go process(lines)
    go process(lines)
    for sc.Scan() {
        lines <- sc.Text()
    }
    close(lines)
}

func process(lines <-chan string) {
    for line := range lines {
        _ = line // implement processing here
    }
}
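As the note above says, you would almost certainly want to wait for the workers before exiting. A minimal sketch of that extension (the file path and worker count are placeholders):

package main

import (
    "bufio"
    "log"
    "os"
    "sync"
)

func main() {
    f, err := os.Open("/path/to/logfile") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    lines := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            process(lines)
        }()
    }

    sc := bufio.NewScanner(f)
    for sc.Scan() {
        lines <- sc.Text()
    }
    close(lines)
    wg.Wait() // all lines are processed before main returns
}

func process(lines <-chan string) {
    for line := range lines {
        _ = line // implement processing here
    }
}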
