I have a slice clientFiles which I am iterating over sequentially, writing each element to S3 one by one as shown below:
for _, v := range clientFiles {
    err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
    if err != nil {
        fmt.Println(err)
    }
}
The above code works fine, but I want to write to S3 in parallel so that I can speed things up. Would a worker-pool implementation work better here, or is there another, better option?
I found the code below, which uses a WaitGroup, but I am not sure whether it is the better option to work with here:
wg := sync.WaitGroup{}
for _, v := range clientFiles {
    wg.Add(1)
    go func(v ClientMapFile) {
        err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
        if err != nil {
            fmt.Println(err)
        }
    }(v)
}
Yes, parallelising should help.
Your code should work well after a couple of changes to how the WaitGroup is used: you need to mark each unit of work as Done and wait for all goroutines to finish after the for loop.
var wg sync.WaitGroup
for _, v := range clientFiles {
    wg.Add(1)
    go func(v ClientMapFile) {
        defer wg.Done()
        err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
        if err != nil {
            fmt.Println(err)
        }
    }(v)
}
wg.Wait()
Be aware that this solution creates N goroutines for N files, which may not be optimal if the number of files is very large. In that case, use the worker-pool pattern (https://gobyexample.com/worker-pools) and experiment with different numbers of workers to find what performs best for you; a sketch adapted to your code follows.
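For illustration, here is a minimal worker-pool sketch adapted to the code from your question. It assumes the same writeToS3 function, ClientMapFile type, s3Connection and bucketName from your snippet, and the worker count is an arbitrary starting point to tune:

const numWorkers = 8 // tune this; start near the number of CPUs or the I/O concurrency you want

jobs := make(chan ClientMapFile)
var wg sync.WaitGroup

// Start a fixed number of workers that drain the jobs channel.
for w := 0; w < numWorkers; w++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for v := range jobs {
            if err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName); err != nil {
                fmt.Println(err)
            }
        }
    }()
}

// Feed all files to the workers, then close the channel so they exit.
for _, v := range clientFiles {
    jobs <- v
}
close(jobs)
wg.Wait()

The fixed number of workers bounds how many uploads run at once, which also keeps you from exhausting connections or file handles when clientFiles is very large.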
Suppose I have a slice like:
stu = [{"id":"001","name":"A"} {"id":"002", "name":"B"}], possibly with more elements like this. Each element of the slice is a long JSON string, and I want to use json.Unmarshal to parse it.
type Student struct {
    Id   string `json:"id"`
    Name string `json:"name"`
}

studentList := make([]Student, len(stu))

for i, st := range stu {
    go func(st string) {
        studentList[i], err = getList(st)
        if err != nil {
            return ... //just example
        }
    }(st)
}
//and a function like this
func getList(stu string) (res Student, error) {
    var student Student
    err := json.Unmarshal([]byte(stu), &student)
    if err != nil {
        return
    }
    return &student, nil
}
I get a nil result, so I assume the goroutines execute out of order, and I am not sure whether using studentList[i] like this is a valid way to store the values.
Here are a few potential issues with your code:
Value of i is probably not what you expect
for i, st := range stu {
    go func(st string) {
        studentList[i], err = getList(st)
        if err != nil {
            return ... //just example
        }
    }(st)
}
You kick off a number of goroutines and reference i within them. The issue is that i is likely to have changed between the time you started a goroutine and the time that goroutine references it (the for loop runs concurrently with the goroutines it starts). It is quite possible that the for loop completes before any of the goroutines do, meaning that all output would be stored in the last element of studentList (the goroutines would overwrite each other, so you end up with a single value).
A simple solution is to pass i into the goroutine function, e.g. go func(st string, i int){...}(st, i), which creates a copy. See this for more info.
Output of studentList
You don't say in the question, but I suspect you are running fmt.Println(studentList[1]) (or similar) immediately after the for loop completes. As mentioned above, it is quite possible that none of the goroutines have completed at that point (or they may have; you don't know). Using a WaitGroup is a fairly easy way around this:
var wg sync.WaitGroup
wg.Add(len(stu))
for i, st := range stu {
    go func(st string, i int) {
        var err error
        studentList[i], err = getList(st)
        if err != nil {
            panic(err)
        }
        wg.Done()
    }(st, i)
}
wg.Wait()
I have corrected these issues in the playground.
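Since the playground link is not reproduced here, a complete, self-contained version of the corrected program might look like the following sketch. Note that getList is changed to return Student by value so it matches the element type of studentList (your original returned &student):

package main

import (
    "encoding/json"
    "fmt"
    "sync"
)

type Student struct {
    Id   string `json:"id"`
    Name string `json:"name"`
}

func getList(s string) (Student, error) {
    var student Student
    err := json.Unmarshal([]byte(s), &student)
    return student, err
}

func main() {
    stu := []string{`{"id":"001","name":"A"}`, `{"id":"002","name":"B"}`}
    studentList := make([]Student, len(stu))

    var wg sync.WaitGroup
    wg.Add(len(stu))
    for i, st := range stu {
        go func(i int, st string) {
            defer wg.Done()
            var err error
            studentList[i], err = getList(st)
            if err != nil {
                panic(err)
            }
        }(i, st)
    }
    wg.Wait()

    fmt.Println(studentList) // [{001 A} {002 B}]
}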
The nil result is not because, as you put it, "the goroutine is out-of-order to execute".
There are at least two issues here:
You should not use the for-loop variable i inside a goroutine.
Multiple goroutines read i while the for loop modifies it, which is a race condition. To make i work as expected, change the code to:
for i, st := range stu {
    go func(i int, st string) {
        studentList[i], err = getList(st)
        if err != nil {
            return ... //just example
        }
    }(i, st)
}
What's more, use a sync.WaitGroup to wait for all the goroutines to finish:
var wg sync.WaitGroup
for i, st := range stu {
    wg.Add(1)
    go func(i int, st string) {
        defer wg.Done()
        var err error // declare err inside the goroutine so it is not shared across goroutines
        studentList[i], err = getList(st)
        if err != nil {
            return ... //just example
        }
    }(i, st)
}
wg.Wait()
P.S. (a caveat that is not always significant): the line studentList[i], err = getList(st) does not cause a data race on the slice, since each goroutine writes a different element, but neighbouring elements can share a CPU cache line, so concurrent writes may hurt performance through false sharing. It is often better to avoid writing results into a shared slice like this; a channel-based alternative is sketched below.
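If you want to avoid writing into a shared slice from many goroutines altogether, a common alternative is to send results over a channel and collect them in one place. This is only a sketch reusing the Student and getList names from the question, and it assumes getList returns (Student, error):

type result struct {
    idx     int
    student Student
    err     error
}

results := make(chan result, len(stu))
for i, st := range stu {
    go func(i int, st string) {
        s, err := getList(st)
        results <- result{idx: i, student: s, err: err}
    }(i, st)
}

// Collect exactly one result per input; only this goroutine touches studentList.
studentList := make([]Student, len(stu))
for range stu {
    r := <-results
    if r.err != nil {
        fmt.Println(r.err)
        continue
    }
    studentList[r.idx] = r.student
}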
The objective of my backend service is to process 90 million records in total, at least 10 million of them in 1 day.
My system configuration:
RAM: 2000 MB
CPU: 2 core(s)
What I am doing right now is something like this:
var wg sync.WaitGroup

// length of evs is 4455
for range evs {
    wg.Add(1)
    go migrate(&wg)
}
wg.Wait()

func migrate(wg *sync.WaitGroup) {
    defer wg.Done()
    // processing
    time.Sleep(time.Second)
}
Without knowing more detail about the type of work you need to do, your approach seems good. Some things to think about:
Re-use variables and/or clients in your processing loop, for example reusing an HTTP client instead of recreating one each time.
Depending on how your use case needs to handle failures, it might be efficient to use errgroup. It is a convenience wrapper that stops all the goroutines on the first error, possibly saving you a lot of time.
In the migrate function, be aware of the caveats regarding closures and goroutines.
func main() {
    g := new(errgroup.Group)
    var urls = []string{
        "http://www.someasdfasdfstupidname.com/",
        "ftp://www.golang.org/",
        "http://www.google.com/",
    }
    for _, url := range urls {
        url := url // https://golang.org/doc/faq#closures_and_goroutines
        g.Go(func() error {
            resp, err := http.Get(url)
            if err == nil {
                resp.Body.Close()
            }
            return err
        })
    }
    fmt.Println("waiting")
    if err := g.Wait(); err == nil {
        fmt.Println("Successfully fetched all URLs.")
    } else {
        fmt.Println(err)
    }
}
I found the solution. To achieve this much processing, I limited the number of goroutines to 50 and increased the number of cores from 2 to 5 (a sketch is shown below).
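For example, the 50-goroutine cap can be implemented with a buffered channel used as a counting semaphore. This is just a sketch based on the evs and migrate names from the question, assuming migrate is changed to take the record to process instead of the WaitGroup, and Event stands in for whatever element type evs holds:

const maxConcurrent = 50

var wg sync.WaitGroup
sem := make(chan struct{}, maxConcurrent) // counting semaphore: at most 50 goroutines at once

for _, ev := range evs {
    wg.Add(1)
    sem <- struct{}{} // blocks once maxConcurrent goroutines are already running
    go func(ev Event) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot when done
        migrate(ev)              // assumed to do the actual work for one record
    }(ev)
}
wg.Wait()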
I want to create a simple app that continuously reads the output of another app, processes it, and writes the processed output to stdout. The other app can produce a lot of data within a second and then stay silent for a few minutes.
The problem is that my data-processing algorithm is quite slow, so the main loop blocks, and while it is blocked I lose the data that arrives in the meantime.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)

go func() {
    bufioReader := bufio.NewReader(stdoutReader)
    for {
        output, _, err := bufioReader.ReadLine()
        if err != nil || err == io.EOF {
            break
        }
        processedOutput := dataProcessor(output)
        fmt.Print(processedOutput)
    }
}()
Probably the best way to solve this is to buffer the output and process it in another goroutine, but I'm not sure how to implement this in Go. What is the most idiomatic way to solve this problem?
You can have two goroutines: a supplier and a consumer. The supplier executes the command and passes the data to the consumer over a channel.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)

Datas := make(chan Data, 100)
go dataProcessor(Datas)

bufioReader := bufio.NewReader(stdoutReader)
for {
    output, _, err := bufioReader.ReadLine()
    var tempData Data
    tempData.Out = output
    if err != nil || err == io.EOF {
        break
    }
    Datas <- tempData
}
Then you process the data in the dataProcessor function:
func dataProcessor(Datas <-chan Data) {
    // Receiving with range blocks until data arrives and exits when the
    // channel is closed, so no busy select/default loop is needed.
    for data := range Datas {
        fmt.Println(data)
    }
}
Obviously this is a very simple example and you should customize and improve it. Read up on channels and goroutines; reading this tutorial could be helpful.
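Putting the pieces together, a slightly fuller sketch could look like the following. Note the assumptions layered on top of the answer above: the command has to be started with cmd.Start(), the Data type is assumed to simply wrap one line of output, bufio.Scanner is used in place of ReadLine, and the channel buffer size of 1000 is arbitrary:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os/exec"
)

// Data wraps one line of output from the command (assumed type).
type Data struct {
    Out []byte
}

// dataProcessor is the consumer: it drains the channel and does the slow work.
func dataProcessor(datas <-chan Data, done chan<- struct{}) {
    defer close(done)
    for data := range datas {
        fmt.Println(string(data.Out)) // replace with the real (slow) processing
    }
}

func main() {
    cmd := exec.Command("someapp")
    stdoutPipe, err := cmd.StdoutPipe()
    if err != nil {
        log.Fatal(err)
    }
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    datas := make(chan Data, 1000) // the buffer absorbs bursts while processing lags behind
    done := make(chan struct{})
    go dataProcessor(datas, done)

    // The supplier: read lines as fast as they arrive and hand them off.
    scanner := bufio.NewScanner(stdoutPipe)
    for scanner.Scan() {
        // Copy the line, because the scanner reuses its internal buffer.
        line := append([]byte(nil), scanner.Bytes()...)
        datas <- Data{Out: line}
    }

    close(datas) // no more input: let the consumer finish and exit
    <-done
    _ = cmd.Wait()
}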
I have a very simple script that makes a GET request and then does something with the response. I have two versions, one using a goroutine and one without. I benchmarked both and there was no difference in speed. Here is a dumbed-down version of what I'm doing:
Regular version:
func main() {
    url := "http://finance.yahoo.com/q?s=aapl"
    for i := 0; i < 250; i++ {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println(err)
        }
        fmt.Println(resp.Status)
    }
}
Goroutine version:
func main() {
    url := "http://finance.yahoo.com/q?s=aapl"
    for i := 0; i < 250; i++ {
        wg.Add(1)
        go run(url, &wg)
        wg.Wait()
    }
}

func run(url string, wg *sync.WaitGroup) {
    defer wg.Done()
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println(err)
    }
    fmt.Println(resp.Status)
}
In most cases, when I used a goroutine the program took longer to execute. What concept am I missing about using concurrency efficiently?
The main problem with your example is that you call wg.Wait() within the for loop. This causes execution to block until the deferred wg.Done() call inside run executes. As a result, the execution isn't concurrent: it happens in a goroutine, but you block after starting goroutine i and before starting goroutine i+1. If you place that statement after the loop instead, as below, your code won't block until after the loop (all goroutines have been started, and some may already have completed).
func main() {
    var wg sync.WaitGroup // declared here so the snippet is self-contained
    url := "http://finance.yahoo.com/q?s=aapl"
    for i := 0; i < 250; i++ {
        wg.Add(1)
        go run(url, &wg)
        // wg.Wait() don't wait here because it serializes execution
    }
    wg.Wait() // wait here, now that all goroutines have been started
}
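If you also want to avoid firing all 250 requests at the same time (for example, to be gentler on the remote server), a buffered channel can act as a semaphore on top of the corrected code above; the limit of 25 here is just an arbitrary example:

var wg sync.WaitGroup
sem := make(chan struct{}, 25) // at most 25 requests in flight at once

for i := 0; i < 250; i++ {
    wg.Add(1)
    sem <- struct{}{} // blocks while 25 goroutines are already running
    go func() {
        defer func() { <-sem }() // free the slot when this request finishes
        run(url, &wg)            // run calls wg.Done itself
    }()
}
wg.Wait()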
I am trying to build a zip archive from a large number of small to medium sized files. I want to be able to do this concurrently, since compression is CPU intensive and I'm running on a multi-core server. Also, I don't want to hold the whole archive in memory, since it might turn out to be large.
My question is: do I have to compress every file separately and then manually combine everything together with the zip header, checksum, etc.?
Any help would be greatly appreciated.
I don't think you can combine the zip headers.
What you could do is run the zip.Writer sequentially in a separate goroutine, then spawn a new goroutine for each file that you want to read, and pipe those files to the goroutine that is zipping them.
This should reduce the IO overhead that you get by reading the files sequentially, although it probably won't leverage multiple cores for the archiving itself.
Here's a working example. Note that, to keep things simple,
it does not handle errors nicely, just panics if something goes wrong,
and it does not use the defer statement too much, to demonstrate the order in which things should happen.
Since defer is LIFO, it can sometimes be confusing when you stack a lot of them together.
package main

import (
    "archive/zip"
    "io"
    "os"
    "sync"
)

func ZipWriter(files chan *os.File) *sync.WaitGroup {
    f, err := os.Create("out.zip")
    if err != nil {
        panic(err)
    }
    var wg sync.WaitGroup
    wg.Add(1)
    zw := zip.NewWriter(f)
    go func() {
        // Note the order (LIFO):
        defer wg.Done() // 2. signal that we're done
        defer f.Close() // 1. close the file
        var err error
        var fw io.Writer
        for f := range files {
            // Loop until channel is closed.
            if fw, err = zw.Create(f.Name()); err != nil {
                panic(err)
            }
            io.Copy(fw, f)
            if err = f.Close(); err != nil {
                panic(err)
            }
        }
        // The zip writer must be closed *before* f.Close() is called!
        if err = zw.Close(); err != nil {
            panic(err)
        }
    }()
    return &wg
}

func main() {
    files := make(chan *os.File)
    wait := ZipWriter(files)

    // Send all files to the zip writer.
    var wg sync.WaitGroup
    wg.Add(len(os.Args) - 1)
    for i, name := range os.Args {
        if i == 0 {
            continue
        }
        // Read each file in parallel:
        go func(name string) {
            defer wg.Done()
            f, err := os.Open(name)
            if err != nil {
                panic(err)
            }
            files <- f
        }(name)
    }
    wg.Wait()

    // Once we're done sending the files, we can close the channel.
    close(files)
    // This will cause ZipWriter to break out of the loop, close the file,
    // and unblock the next mutex:
    wait.Wait()
}
Usage: go run example.go /path/to/*.log.
This is the order in which things should be happening:
1. Open the output file for writing.
2. Create a zip.Writer with that file.
3. Kick off a goroutine listening for files on a channel.
4. Go through each file; this can be done in one goroutine per file.
5. Send each file to the goroutine created in step 3.
6. After processing each file in said goroutine, close the file to free up resources.
7. Once each file has been sent to said goroutine, close the channel.
8. Wait until the zipping has been done (which happens sequentially).
9. Once zipping is done (channel exhausted), the zip writer should be closed.
10. Only when the zip writer is closed should the output file be closed.
11. Finally, everything is closed, so mark the sync.WaitGroup as done to tell the calling function that we're good to go. (A channel could also be used here, but sync.WaitGroup seems more elegant.)
When you get the signal from the zip writer that everything is properly closed, you can exit from main and terminate nicely.
This might not answer your question, but I used similar code to generate zip archives on the fly for a web service some time ago. It performed quite well, even though the actual zipping was done in a single goroutine. Overcoming the IO bottleneck can already be an improvement.
From the look of it, you won't be able to parallelise the compression using the standard library archive/zip package because:
Compression is performed by the io.Writer returned by zip.Writer.Create or CreateHeader.
Calling Create/CreateHeader implicitly closes the writer returned by the previous call.
So passing the writers returned by Create to multiple goroutines and writing to them in parallel will not work.
If you wanted to write your own parallel zip writer, you'd probably want to structure it something like this:
Have multiple goroutines compress files using the compress/flate package, and keep track of the CRC32 value and length of the uncompressed data. The output should be directed to temporary files. Note the compressed size of the data. (A sketch of this step appears below.)
Once everything has been compressed, start writing the Zip file starting with the header.
Write out the file header followed by the contents of the corresponding temporary file for each compressed file.
Write out the central directory record and end record at the end of the file. All the required information should be available at this point.
For added parallelism, step 1 could be performed in parallel with the remaining steps by using a channel to indicate when compression of each file completes.
Due to the file format, you won't be able to perform parallel compression without either storing compressed data in memory or in temporary files.
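As an illustration of step 1 only, a helper along these lines could deflate one file into a temporary file while recording the CRC32 and sizes that the later steps need for the local file header and central directory. compressToTemp and the compressed struct are made-up names, it uses the compress/flate, hash/crc32, io and os packages, and several of these calls can run in parallel from separate goroutines; steps 2-4 (writing the actual zip structures) are not shown:

// compressed records what the later zip-writing steps need to know about
// one file that has already been deflated into a temporary file.
type compressed struct {
    tmpPath          string
    crc              uint32
    uncompressedSize int64
    compressedSize   int64
}

// compressToTemp deflates srcPath into a temporary file and returns the
// CRC32 and sizes; it is safe to run many of these concurrently.
func compressToTemp(srcPath string) (compressed, error) {
    src, err := os.Open(srcPath)
    if err != nil {
        return compressed{}, err
    }
    defer src.Close()

    tmp, err := os.CreateTemp("", "zippart-*")
    if err != nil {
        return compressed{}, err
    }
    defer tmp.Close()

    crc := crc32.NewIEEE()
    fw, err := flate.NewWriter(tmp, flate.DefaultCompression)
    if err != nil {
        return compressed{}, err
    }

    // Tee the uncompressed bytes through the CRC32 hash while deflating them.
    n, err := io.Copy(fw, io.TeeReader(src, crc))
    if err != nil {
        return compressed{}, err
    }
    if err := fw.Close(); err != nil {
        return compressed{}, err
    }

    st, err := tmp.Stat()
    if err != nil {
        return compressed{}, err
    }
    return compressed{
        tmpPath:          tmp.Name(),
        crc:              crc.Sum32(),
        uncompressedSize: n,
        compressedSize:   st.Size(),
    }, nil
}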
With Go 1.17, parallel compression and merging of zip files are possible using the archive/zip package.
An example is below. In it, I create zip workers that each build an individual zip file, plus an entry-provider worker that feeds the entries to be added via a channel to the zip workers. Actual files could be provided to the zip workers, but I skipped that part.
package main

import (
    "archive/zip"
    "context"
    "fmt"
    "io"
    "log"
    "os"
    "strings"

    "golang.org/x/sync/errgroup"
)

const numOfZipWorkers = 10

type entry struct {
    name string
    rc   io.ReadCloser
}

func main() {
    log.SetFlags(log.LstdFlags | log.Lshortfile)

    entCh := make(chan entry, numOfZipWorkers)
    zpathCh := make(chan string, numOfZipWorkers)

    group, ctx := errgroup.WithContext(context.Background())

    for i := 0; i < numOfZipWorkers; i++ {
        group.Go(func() error {
            return zipWorker(ctx, entCh, zpathCh)
        })
    }

    group.Go(func() error {
        defer close(entCh) // Signal workers to stop.
        return entryProvider(ctx, entCh)
    })

    err := group.Wait()
    if err != nil {
        log.Fatal(err)
    }

    f, err := os.OpenFile("output.zip", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatal(err)
    }

    zw := zip.NewWriter(f)

    close(zpathCh)
    for path := range zpathCh {
        zrd, err := zip.OpenReader(path)
        if err != nil {
            log.Fatal(err)
        }
        for _, zf := range zrd.File {
            err := zw.Copy(zf)
            if err != nil {
                log.Fatal(err)
            }
        }
        _ = zrd.Close()
        _ = os.Remove(path)
    }

    err = zw.Close()
    if err != nil {
        log.Fatal(err)
    }
    err = f.Close()
    if err != nil {
        log.Fatal(err)
    }
}

func entryProvider(ctx context.Context, entCh chan<- entry) error {
    for i := 0; i < 2*numOfZipWorkers; i++ {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case entCh <- entry{
            name: fmt.Sprintf("file_%d", i+1),
            rc:   io.NopCloser(strings.NewReader(fmt.Sprintf("content %d", i+1))),
        }:
        }
    }
    return nil
}

func zipWorker(ctx context.Context, entCh <-chan entry, zpathch chan<- string) error {
    f, err := os.CreateTemp(".", "tmp-part-*")
    if err != nil {
        return err
    }
    zw := zip.NewWriter(f)

Loop:
    for {
        var (
            ent entry
            ok  bool
        )
        select {
        case <-ctx.Done():
            err = ctx.Err()
            break Loop
        case ent, ok = <-entCh:
            if !ok {
                break Loop
            }
        }

        hdr := &zip.FileHeader{
            Name:   ent.name,
            Method: zip.Deflate, // zip.Store can also be used.
        }
        hdr.SetMode(0644)

        w, e := zw.CreateHeader(hdr)
        if e != nil {
            _ = ent.rc.Close()
            err = e
            break
        }
        _, e = io.Copy(w, ent.rc)
        _ = ent.rc.Close()
        if e != nil {
            err = e
            break
        }
    }

    if e := zw.Close(); e != nil && err == nil {
        err = e
    }
    if e := f.Close(); e != nil && err == nil {
        err = e
    }
    if err == nil {
        select {
        case <-ctx.Done():
            err = ctx.Err()
        case zpathch <- f.Name():
        }
    }
    return err
}