I want to create a simple app which will be reading continuously output from one app, process it and write processed output to stdout. This app can produce a lot of data within a second and next is silent for a few minutes.
The problem is that mine data processing algorithm is quite slow so main loop is blocked. When the loop is blocked I'm loosing a data which comes and this moment.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)
go func() {
bufioReader := bufio.NewReader(stdoutReader)
for {
output, _, err := bufioReader.ReadLine()
if err != nil || err == io.EOF {
break
}
processedOutput := dataProcessor(output);
fmt.Print(processedOutput)
}
}()
Probably the best way to solve this problem is to buffer all output and process it in another Goroutine but I'm not sure how to implement this in Golang. What is the most idiomatic way to solve this problem?
You can have two goroutines, one supplier and other is a consumer. supplier execute a command and pass data with a channel to the consumer.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)
Datas := make(chan Data, 100)
go dataProcessor(Datas)
bufioReader := bufio.NewReader(stdoutReader)
for {
output, _, err := bufioReader.ReadLine()
var tempData Data
tempData.Out = output
if err != nil || err == io.EOF {
break
}
Datas <- tempData
}
}
and then you will prccess data in dataProcessor func:
func dataProcessor(Datas <-chan Data) {
for {
select {
case data := <-Datas:
fmt.Println(data)
default:
continue
}
}
}
obviesly it is very simple example and you should customize it and make it better. search about chanle and goroutins. reading this tutorial could be helpfull.
Related
I have a slice clientFiles which I am iterating over sequentially and writing it in S3 one by one as shown below:
for _, v := range clientFiles {
err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
if err != nil {
fmt.Println(err)
}
}
The above code works fine but I want to write in S3 in parallel so that I can speed things up. Does worker pool implementation works better here or is there any other better option here?
I got below code which uses wait group but I am not sure if this is better option here to work with?
wg := sync.WaitGroup{}
for _, v := range clientFiles {
wg.Add(1)
go func(v ClientMapFile) {
err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
if err != nil {
fmt.Println(err)
}
}(v)
}
Yes, parallelising should help.
Your code should work well after changes regarding usage of WaitGroup. You need to mark work as Done and wait for all goroutines to finish after for-loop.
var wg sync.WaitGroup
for _, v := range clientFiles {
wg.Add(1)
go func(v ClientMapFile) {
defer wg.Done()
err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
if err != nil {
fmt.Println(err)
}
}(v)
}
wg.Wait()
Be aware that your solution creates N goroutines for N files, which can be not optimal if number of files is very big. In such case use this pattern https://gobyexample.com/worker-pools and try different number of workers to find which works best for you in terms of performance.
The objective of my backend service is to process 90 milllion data and at least 10 million of data in 1 day.
My system config:
Ram 2000 Mb
CPU 2core(s)
what I am doing right now is something like this:
var wg sync.WaitGroup
//length of evs is 4455
for i, ev := range evs {
wg.Add(1)
go migrate(&wg)
}
wg.Wait()
func migrate(wg *sync.WaitGroup) {
defer wg.Done()
//processing
time.Sleep(time.Second)
}
Without knowing more detail about the type of work you need to do, your approach seems good. Some things to think about:
Re-using variables and or clients in your processing loop. For example reusing an HTTP client instead of recreating one.
Depending on how your use case calls to handle failures. It might be efficient to use erroGroup. It's a convenience wrapper that stops all the threads on error possibly saving you a lot of time.
In the migrate function be sure to be aware of the caveats regarding closure and goroutines.
func main() {
g := new(errgroup.Group)
var urls = []string{
"http://www.someasdfasdfstupidname.com/",
"ftp://www.golang.org/",
"http://www.google.com/",
}
for _, url := range urls {
url := url // https://golang.org/doc/faq#closures_and_goroutines
g.Go(func() error {
resp, err := http.Get(url)
if err == nil {
resp.Body.Close()
}
return err
})
}
fmt.Println("waiting")
if err := g.Wait(); err == nil {
fmt.Println("Successfully fetched all URLs.")
} else {
fmt.Println(err)
}
}
I have got the solution. to achieve this much huge processing what I have done is
a limited number of goroutine to 50 and increased the number of cores from 2 to 5.
I wrote a piece of code with a IPC purpose. The expected behaviour is that the code reads the content from the named-pipe and prints the string (with the Send("log", buff.String())). First I open the named-pipe 'reader' inside the goroutine, while the reader is open I send a signal that the data can be written to the named-pipe (with the Send("datarequest", "")). Here is the code:
var wg sync.WaitGroup
wg.Add(1)
go func() {
//reader part
file, err := os.OpenFile("tmp/"+os.Args[1], os.O_RDONLY, os.ModeNamedPipe)
if err != nil {
Send("error", err.Error())
}
var buff bytes.Buffer
_, err = io.Copy(&buff, file)
Send("log", buff.String())
if err != nil {
Send("error", err.Error())
}
wg.Done()
}()
Send("datarequest", "")
wg.Wait()
And here is the code which executes when the signal is send:
//writer part
file, err := os.OpenFile("tmp/" + execID, os.O_WRONLY, 0777)
if err != nil {
c <- "[error] error opening file: " + err.Error()
}
bytedata, _ := json.Marshal(moduleParameters)
file.Write(bytedata)
So the behaviour I get it that the code blocks indefinitely when I try to copy it. I really don't know why this happens. When I test it with cat in the terminal I do get the intended result so my question is how do I get the same result with code?
Edit
The execID is the same as os.Args[1]
The writer should close the file after it's done sending using file.Close(). Note that file.Close() may return error.
A simple execution of go command gives some output as given here: How do you get the output of a system command in Go??
But the code I am using is for showing the output with progress from : https://blog.kowalczyk.info/article/wOYk/advanced-command-execution-in-go-with-osexec.html?
Now, I can't actually filter the output that I am getting from this as I don't want everything to be printed and only a part of it. Is there a way to do so?
I have already tried implementing a string to get the output instead of go routine way. But it didn't work. I want the progress too.
The sample you're pointing to reads from the subprocess's stdout, and for each read it writes what it read to its own stdout while also capturing it:
func copyAndCapture(w io.Writer, r io.Reader) ([]byte, error) {
var out []byte
buf := make([]byte, 1024, 1024)
for {
n, err := r.Read(buf[:])
if n > 0 {
d := buf[:n]
out = append(out, d...)
_, err := w.Write(d)
if err != nil {
return out, err
}
}
if err != nil {
// Read returns io.EOF at the end of file, which is not an error for us
if err == io.EOF {
err = nil
}
return out, err
}
}
}
This function is called with os.Stdout as w.
Now, you're free to filter the data d before you print it out with w.Write.
I am trying build a zip archive from a large number of small-medium sized files. I want to be able to do this concurrently, since compression is CPU intensive, and I'm running on a multi core server. Also I don't want to have the whole archive in memory, since its might turn out to be large.
My question is that do I have to compress every file and then combine manually combine everything together with zip header, checksum etc?
Any help would be greatly appreciated.
I don't think you can combine the zip headers.
What you could do is, run the zip.Writer sequentially, in a separate goroutine, and then spawn a new goroutine for each file that you want to read, and pipe those to the goroutine that is zipping them.
This should reduce the IO overhead that you get by reading the files sequentially, although it probably won't leverage multiple cores for the archiving itself.
Here's a working example. Note that, to keep things simple,
it does not handle errors nicely, just panics if something goes wrong,
and it does not use the defer statement too much, to demonstrate the order in which things should happen.
Since defer is LIFO, it can sometimes be confusing when you stack a lot of them together.
package main
import (
"archive/zip"
"io"
"os"
"sync"
)
func ZipWriter(files chan *os.File) *sync.WaitGroup {
f, err := os.Create("out.zip")
if err != nil {
panic(err)
}
var wg sync.WaitGroup
wg.Add(1)
zw := zip.NewWriter(f)
go func() {
// Note the order (LIFO):
defer wg.Done() // 2. signal that we're done
defer f.Close() // 1. close the file
var err error
var fw io.Writer
for f := range files {
// Loop until channel is closed.
if fw, err = zw.Create(f.Name()); err != nil {
panic(err)
}
io.Copy(fw, f)
if err = f.Close(); err != nil {
panic(err)
}
}
// The zip writer must be closed *before* f.Close() is called!
if err = zw.Close(); err != nil {
panic(err)
}
}()
return &wg
}
func main() {
files := make(chan *os.File)
wait := ZipWriter(files)
// Send all files to the zip writer.
var wg sync.WaitGroup
wg.Add(len(os.Args)-1)
for i, name := range os.Args {
if i == 0 {
continue
}
// Read each file in parallel:
go func(name string) {
defer wg.Done()
f, err := os.Open(name)
if err != nil {
panic(err)
}
files <- f
}(name)
}
wg.Wait()
// Once we're done sending the files, we can close the channel.
close(files)
// This will cause ZipWriter to break out of the loop, close the file,
// and unblock the next mutex:
wait.Wait()
}
Usage: go run example.go /path/to/*.log.
This is the order in which things should be happening:
Open output file for writing.
Create a zip.Writer with that file.
Kick off a goroutine listening for files on a channel.
Go through each file, this can be done in one goroutine per file.
Send each file to the goroutine created in step 3.
After processing each file in said goroutine, close the file to free up resources.
Once each file has been sent to said goroutine, close the channel.
Wait until the zipping has been done (which is done sequentially).
Once zipping is done (channel exhausted), the zip writer should be closed.
Only when the zip writer is closed, should the output file be closed.
Finally everything is closed, so close the sync.WaitGroup to tell the calling function that we're good to go. (A channel could also be used here, but sync.WaitGroup seems more elegant.)
When you get the signal from the zip writer that everything is properly closed, you can exit from main and terminate nicely.
This might not answer your question, but I've been using similar code to generate zip archives on-the-fly for a web service some time ago. It performed quite well, even though the actual zipping was done in a single goroutine. Overcoming the IO bottleneck can already be an improvement.
From the look of it, you won't be able to parallelise the compression using the standard library archive/zip package because:
Compression is performed by the io.Writer returned by zip.Writer.Create or CreateHeader.
Calling Create/CreateHeader implicitly closes the writer returned by the previous call.
So passing the writers returned by Create to multiple goroutines and writing to them in parallel will not work.
If you wanted to write your own parallel zip writer, you'd probably want to structure it something like this:
Have multiple goroutines compress files using the compress/flate module, and keep track of the CRC32 value and length of the uncompressed data. The output should be directed to temporary files. Note the compressed size of the data.
Once everything has been compressed, start writing the Zip file starting with the header.
Write out the file header followed by the contents of the corresponding temporary file for each compressed file.
Write out the central directory record and end record at the end of the file. All the required information should be available at this point.
For added parallelism, step 1 could be performed in parallel with the remaining steps by using a channel to indicate when compression of each file completes.
Due to the file format, you won't be able to perform parallel compression without either storing compressed data in memory or in temporary files.
With Go1.17, parallel compression and merging of zip files are possible using the archive/zip package.
An example is below. In the example, I create zip workers to create individual zip files and an entry provider worker which provides entries to be added to a zip file via a channel to zip workers. Actual files can be provided to the zip workers but I skipped that part.
package main
import (
"archive/zip"
"context"
"fmt"
"io"
"log"
"os"
"strings"
"golang.org/x/sync/errgroup"
)
const numOfZipWorkers = 10
type entry struct {
name string
rc io.ReadCloser
}
func main() {
log.SetFlags(log.LstdFlags | log.Lshortfile)
entCh := make(chan entry, numOfZipWorkers)
zpathCh := make(chan string, numOfZipWorkers)
group, ctx := errgroup.WithContext(context.Background())
for i := 0; i < numOfZipWorkers; i++ {
group.Go(func() error {
return zipWorker(ctx, entCh, zpathCh)
})
}
group.Go(func() error {
defer close(entCh) // Signal workers to stop.
return entryProvider(ctx, entCh)
})
err := group.Wait()
if err != nil {
log.Fatal(err)
}
f, err := os.OpenFile("output.zip", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
if err != nil {
log.Fatal(err)
}
zw := zip.NewWriter(f)
close(zpathCh)
for path := range zpathCh {
zrd, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
for _, zf := range zrd.File {
err := zw.Copy(zf)
if err != nil {
log.Fatal(err)
}
}
_ = zrd.Close()
_ = os.Remove(path)
}
err = zw.Close()
if err != nil {
log.Fatal(err)
}
err = f.Close()
if err != nil {
log.Fatal(err)
}
}
func entryProvider(ctx context.Context, entCh chan<- entry) error {
for i := 0; i < 2*numOfZipWorkers; i++ {
select {
case <-ctx.Done():
return ctx.Err()
case entCh <- entry{
name: fmt.Sprintf("file_%d", i+1),
rc: io.NopCloser(strings.NewReader(fmt.Sprintf("content %d", i+1))),
}:
}
}
return nil
}
func zipWorker(ctx context.Context, entCh <-chan entry, zpathch chan<- string) error {
f, err := os.CreateTemp(".", "tmp-part-*")
if err != nil {
return err
}
zw := zip.NewWriter(f)
Loop:
for {
var (
ent entry
ok bool
)
select {
case <-ctx.Done():
err = ctx.Err()
break Loop
case ent, ok = <-entCh:
if !ok {
break Loop
}
}
hdr := &zip.FileHeader{
Name: ent.name,
Method: zip.Deflate, // zip.Store can also be used.
}
hdr.SetMode(0644)
w, e := zw.CreateHeader(hdr)
if e != nil {
_ = ent.rc.Close()
err = e
break
}
_, e = io.Copy(w, ent.rc)
_ = ent.rc.Close()
if e != nil {
err = e
break
}
}
if e := zw.Close(); e != nil && err == nil {
err = e
}
if e := f.Close(); e != nil && err == nil {
err = e
}
if err == nil {
select {
case <-ctx.Done():
err = ctx.Err()
case zpathch <- f.Name():
}
}
return err
}