Parallel zip compression in Go

Parallel zip compression in Go - go

I am trying build a zip archive from a large number of small-medium sized files. I want to be able to do this concurrently, since compression is CPU intensive, and I'm running on a multi core server. Also I don't want to have the whole archive in memory, since its might turn out to be large.
My question is that do I have to compress every file and then combine manually combine everything together with zip header, checksum etc?
Any help would be greatly appreciated.

I don't think you can combine the zip headers.
What you could do is, run the zip.Writer sequentially, in a separate goroutine, and then spawn a new goroutine for each file that you want to read, and pipe those to the goroutine that is zipping them.
This should reduce the IO overhead that you get by reading the files sequentially, although it probably won't leverage multiple cores for the archiving itself.
Here's a working example. Note that, to keep things simple,
it does not handle errors nicely, just panics if something goes wrong,
and it does not use the defer statement too much, to demonstrate the order in which things should happen.
Since defer is LIFO, it can sometimes be confusing when you stack a lot of them together.
package main
import (
"archive/zip"
"io"
"os"
"sync"
)
func ZipWriter(files chan *os.File) *sync.WaitGroup {
f, err := os.Create("out.zip")
if err != nil {
panic(err)
}
var wg sync.WaitGroup
wg.Add(1)
zw := zip.NewWriter(f)
go func() {
// Note the order (LIFO):
defer wg.Done() // 2. signal that we're done
defer f.Close() // 1. close the file
var err error
var fw io.Writer
for f := range files {
// Loop until channel is closed.
if fw, err = zw.Create(f.Name()); err != nil {
panic(err)
}
io.Copy(fw, f)
if err = f.Close(); err != nil {
panic(err)
}
}
// The zip writer must be closed *before* f.Close() is called!
if err = zw.Close(); err != nil {
panic(err)
}
}()
return &wg
}
func main() {
files := make(chan *os.File)
wait := ZipWriter(files)
// Send all files to the zip writer.
var wg sync.WaitGroup
wg.Add(len(os.Args)-1)
for i, name := range os.Args {
if i == 0 {
continue
}
// Read each file in parallel:
go func(name string) {
defer wg.Done()
f, err := os.Open(name)
if err != nil {
panic(err)
}
files <- f
}(name)
}
wg.Wait()
// Once we're done sending the files, we can close the channel.
close(files)
// This will cause ZipWriter to break out of the loop, close the file,
// and unblock the next mutex:
wait.Wait()
}
Usage: go run example.go /path/to/*.log.
This is the order in which things should be happening:
Open output file for writing.
Create a zip.Writer with that file.
Kick off a goroutine listening for files on a channel.
Go through each file, this can be done in one goroutine per file.
Send each file to the goroutine created in step 3.
After processing each file in said goroutine, close the file to free up resources.
Once each file has been sent to said goroutine, close the channel.
Wait until the zipping has been done (which is done sequentially).
Once zipping is done (channel exhausted), the zip writer should be closed.
Only when the zip writer is closed, should the output file be closed.
Finally everything is closed, so close the sync.WaitGroup to tell the calling function that we're good to go. (A channel could also be used here, but sync.WaitGroup seems more elegant.)
When you get the signal from the zip writer that everything is properly closed, you can exit from main and terminate nicely.
This might not answer your question, but I've been using similar code to generate zip archives on-the-fly for a web service some time ago. It performed quite well, even though the actual zipping was done in a single goroutine. Overcoming the IO bottleneck can already be an improvement.

From the look of it, you won't be able to parallelise the compression using the standard library archive/zip package because:
Compression is performed by the io.Writer returned by zip.Writer.Create or CreateHeader.
Calling Create/CreateHeader implicitly closes the writer returned by the previous call.
So passing the writers returned by Create to multiple goroutines and writing to them in parallel will not work.
If you wanted to write your own parallel zip writer, you'd probably want to structure it something like this:
Have multiple goroutines compress files using the compress/flate module, and keep track of the CRC32 value and length of the uncompressed data. The output should be directed to temporary files. Note the compressed size of the data.
Once everything has been compressed, start writing the Zip file starting with the header.
Write out the file header followed by the contents of the corresponding temporary file for each compressed file.
Write out the central directory record and end record at the end of the file. All the required information should be available at this point.
For added parallelism, step 1 could be performed in parallel with the remaining steps by using a channel to indicate when compression of each file completes.
Due to the file format, you won't be able to perform parallel compression without either storing compressed data in memory or in temporary files.

With Go1.17, parallel compression and merging of zip files are possible using the archive/zip package.
An example is below. In the example, I create zip workers to create individual zip files and an entry provider worker which provides entries to be added to a zip file via a channel to zip workers. Actual files can be provided to the zip workers but I skipped that part.
package main
import (
"archive/zip"
"context"
"fmt"
"io"
"log"
"os"
"strings"
"golang.org/x/sync/errgroup"
)
const numOfZipWorkers = 10
type entry struct {
name string
rc io.ReadCloser
}
func main() {
log.SetFlags(log.LstdFlags | log.Lshortfile)
entCh := make(chan entry, numOfZipWorkers)
zpathCh := make(chan string, numOfZipWorkers)
group, ctx := errgroup.WithContext(context.Background())
for i := 0; i < numOfZipWorkers; i++ {
group.Go(func() error {
return zipWorker(ctx, entCh, zpathCh)
})
}
group.Go(func() error {
defer close(entCh) // Signal workers to stop.
return entryProvider(ctx, entCh)
})
err := group.Wait()
if err != nil {
log.Fatal(err)
}
f, err := os.OpenFile("output.zip", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
if err != nil {
log.Fatal(err)
}
zw := zip.NewWriter(f)
close(zpathCh)
for path := range zpathCh {
zrd, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
for _, zf := range zrd.File {
err := zw.Copy(zf)
if err != nil {
log.Fatal(err)
}
}
_ = zrd.Close()
_ = os.Remove(path)
}
err = zw.Close()
if err != nil {
log.Fatal(err)
}
err = f.Close()
if err != nil {
log.Fatal(err)
}
}
func entryProvider(ctx context.Context, entCh chan<- entry) error {
for i := 0; i < 2*numOfZipWorkers; i++ {
select {
case <-ctx.Done():
return ctx.Err()
case entCh <- entry{
name: fmt.Sprintf("file_%d", i+1),
rc: io.NopCloser(strings.NewReader(fmt.Sprintf("content %d", i+1))),
}:
}
}
return nil
}
func zipWorker(ctx context.Context, entCh <-chan entry, zpathch chan<- string) error {
f, err := os.CreateTemp(".", "tmp-part-*")
if err != nil {
return err
}
zw := zip.NewWriter(f)
Loop:
for {
var (
ent entry
ok bool
)
select {
case <-ctx.Done():
err = ctx.Err()
break Loop
case ent, ok = <-entCh:
if !ok {
break Loop
}
}
hdr := &zip.FileHeader{
Name: ent.name,
Method: zip.Deflate, // zip.Store can also be used.
}
hdr.SetMode(0644)
w, e := zw.CreateHeader(hdr)
if e != nil {
_ = ent.rc.Close()
err = e
break
}
_, e = io.Copy(w, ent.rc)
_ = ent.rc.Close()
if e != nil {
err = e
break
}
}
if e := zw.Close(); e != nil && err == nil {
err = e
}
if e := f.Close(); e != nil && err == nil {
err = e
}
if err == nil {
select {
case <-ctx.Done():
err = ctx.Err()
case zpathch <- f.Name():
}
}
return err
}

Related

How to make concurrent GET requests from url pool

I completed the suggested go-tour, watched some tutorials and gopher-conferences on YouTube. And that's pretty much it.
I have a project which requires me to send get requests and store the results in files. But amount of URL's is around 80 million.
I'm testing with 1000 URLs only.
Problem: I think I couldn't managed to make it concurrent, although I've followed some guidelines. I don't know what's wrong. But maybe I'm wrong and it's concurrent, just did not seem fast to me, the speed felt like sequential requests.
Here is the code I've written:
package main
import (
"bufio"
"io/ioutil"
"log"
"net/http"
"os"
"sync"
"time"
)
var wg sync.WaitGroup // synchronization to wait for all the goroutines
func crawler(urlChannel <-chan string) {
defer wg.Done()
client := &http.Client{Timeout: 10 * time.Second} // single client is sufficient for multiple requests
for urlItem := range urlChannel {
req1, _ := http.NewRequest("GET", "http://"+urlItem, nil) // generating the request
req1.Header.Add("User-agent", "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/74.0") // changing user-agent
resp1, respErr1 := client.Do(req1) // sending the prepared request and getting the response
if respErr1 != nil {
continue
}
defer resp1.Body.Close()
if resp1.StatusCode/100 == 2 { // means server responded with 2xx code
text1, readErr1 := ioutil.ReadAll(resp1.Body) // try to read the sourcecode of the website
if readErr1 != nil {
log.Fatal(readErr1)
}
f1, fileErr1 := os.Create("200/" + urlItem + ".txt") // creating the relative file
if fileErr1 != nil {
log.Fatal(fileErr1)
}
defer f1.Close()
_, writeErr1 := f1.Write(text1) // writing the sourcecode into our file
if writeErr1 != nil {
log.Fatal(writeErr1)
}
}
}
}
func main() {
file, err := os.Open("urls.txt") // the file containing the url's
if err != nil {
log.Fatal(err)
}
defer file.Close() // don't forget to close the file
urlChannel := make(chan string, 1000) // create a channel to store all the url's
scanner := bufio.NewScanner(file) // each line has another url
for scanner.Scan() {
urlChannel <- scanner.Text()
}
close(urlChannel)
_ = os.Mkdir("200", 0755) // if it's there, it will create an error, and we will simply ignore it
for i := 0; i < 10; i++ {
wg.Add(1)
go crawler(urlChannel)
}
wg.Wait()
}
My question is: why is this code not working concurrently? How can I solve the problem I've mentioned above. Is there something that I'm doing wrong for making concurrent GET requests?

Here's some code to get you thinking. I put the URLs in the code so it is self-sufficient, but you'd probably be piping them to stdin in practice. There's a few things I'm doing here that I think are improvements, or at least worth thinking about.
Before we get started, I'll point out that I put the complete url in the input stream. For one thing, this lets me support http and https both. I don't really see the logic behind hard coding the scheme in the code rather than leaving it in the data.
First, it can handle arbitrarily sized response bodies (your version reads the body into memory, so it is limited by some number of concurrent large requests filling memory). I do this with io.Copy().
[edited]
text1, readErr1 := ioutil.ReadAll(resp1.Body) reads the entire http body. If the body is large, it will take up lots of memory. io.Copy(f1,resp1.Body) would instead copy the data from the http response body directly to the file, without having to hold the whole thing in memory. It may be done in one Read/Write or many.
http.Response.Body is an io.ReadCloser because the HTTP protocol expects the body to be read progressively. http.Response does not yet have the entire body, until it is read. That's why it's not just a []byte. Writing it to the filesystem progressively while the data "streams" in from the tcp socket means that a finite amount of system resources can download an unlimited amount of data.
But there's even more benefit. io.Copy will call ReadFrom() on the file. If you look at the linux implementation (for example): https://golang.org/src/os/readfrom_linux.go , and dig a bit, you'll see it actually uses copy_file_range That system call is cool because
The copy_file_range() system call performs an in-kernel copy between two file descriptors without the additional cost of transferring data from the kernel to user space and then back into the kernel.
*os.File knows how to ask the kernel to deliver data directly from the tcp socket to the file without your program even having to touch it.
See https://golang.org/pkg/io/#Copy.
Second, I make sure to use all the url components in the filename. URLs with different query strings go to different files. The fragment probably doesn't differentiate response bodies, so including that in the path may be ill considered. There's no awesome heuristic for turning URLs into valid file paths - if this were a serious task, I'd probably store the data in files based on a shasum of the url or something - and create an index of results stored in a metadata file.
Third, I handle all errors. req1, _ := http.NewRequest(... might seem like a convenient shortcut, but what it really means is that you won't know the real cause of any errors - at best. I usually add some descriptive text to the errors when percolating up, to make sure I can easily tell which error I'm returning.
Finally, I return successfully processed URLs so that I can see the final results. When scanning millions of URLS, you'd probably also want a list of which failed, but a count of successful is a good start at sending final data back for summary.
package main
import (
"bufio"
"bytes"
"fmt"
"io"
"log"
"net/http"
"net/url"
"os"
"path/filepath"
"time"
)
const urls_text = `http://danf.us/
https://farrellit.net/?3=2&#1
`
func crawler(urls <-chan *url.URL, done chan<- int) {
var processed int = 0
defer func() { done <- processed }()
client := http.Client{Timeout: 10 * time.Second}
for u := range urls {
if req, err := http.NewRequest("GET", u.String(), nil); err != nil {
log.Printf("Couldn't create new request for %s: %s", u.String(), err.Error())
} else {
req.Header.Add("User-agent", "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/74.0") // changing user-agent
if res, err := client.Do(req); err != nil {
log.Printf("Failed to get %s: %s", u.String(), err.Error())
} else {
filename := filepath.Base(u.EscapedPath())
if filename == "/" || filename == "" {
filename = "response"
} else {
log.Printf("URL Filename is '%s'", filename)
}
destpath := filepath.Join(
res.Status, u.Scheme, u.Hostname(), u.EscapedPath(),
fmt.Sprintf("?%s",u.RawQuery), fmt.Sprintf("#%s",u.Fragment), filename,
)
if err := os.MkdirAll(filepath.Dir(destpath), 0755); err != nil {
log.Printf("Couldn't create directory %s: %s", filepath.Dir(destpath), err.Error())
} else if f, err := os.OpenFile(destpath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644); err != nil {
log.Printf("Couldn't open destination file %s: %s", destpath, err.Error())
} else {
if b, err := io.Copy(f, res.Body); err != nil {
log.Printf("Could not copy %s body to %s: %s", u.String(), destpath, err.Error())
} else {
log.Printf("Copied %d bytes from body of %s to %s", b, u.String(), destpath)
processed++
}
f.Close()
}
res.Body.Close()
}
}
}
}
const workers = 3
func main() {
urls := make(chan *url.URL)
done := make(chan int)
var submitted int = 0
var inputted int = 0
var successful int = 0
for i := 0; i < workers; i++ {
go crawler(urls, done)
}
sc := bufio.NewScanner(bytes.NewBufferString(urls_text))
for sc.Scan() {
inputted++
if u, err := url.Parse(sc.Text()); err != nil {
log.Printf("Could not parse %s as url: %w", sc.Text(), err)
} else {
submitted++
urls <- u
}
}
close(urls)
for i := 0; i < workers; i++ {
successful += <-done
}
log.Printf("%d urls input, %d could not be parsed. %d/%d valid URLs successful (%.0f%%)",
inputted, inputted-submitted,
successful, submitted,
float64(successful)/float64(submitted)*100.0,
)
}

When setting up a concurrent pipeline, a good guideline to follow is to always first set up and instantiate the listeners that will execute concurrently (in your case, crawlers), and then start feeding them data through the pipeline (in your case, the urlChannel).
In your example, the only thing preventing a deadlock is the fact that you've instantiated a buffered channel with the same number of rows that your test file has (1000 rows). What the code does is it puts URLs inside the urlChannel. Since there are 1000 rows inside your file, the urlChannel can take all of them without blocking. If you put more URLs inside the file, the execution will block after filling up the urlChannel.
Here is the version of the code that should work:
package main
import (
"bufio"
"io/ioutil"
"log"
"net/http"
"os"
"sync"
"time"
)
func crawler(wg *sync.WaitGroup, urlChannel <-chan string) {
defer wg.Done()
client := &http.Client{Timeout: 10 * time.Second} // single client is sufficient for multiple requests
for urlItem := range urlChannel {
req1, _ := http.NewRequest("GET", "http://"+urlItem, nil) // generating the request
req1.Header.Add("User-agent", "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/74.0") // changing user-agent
resp1, respErr1 := client.Do(req1) // sending the prepared request and getting the response
if respErr1 != nil {
continue
}
if resp1.StatusCode/100 == 2 { // means server responded with 2xx code
text1, readErr1 := ioutil.ReadAll(resp1.Body) // try to read the sourcecode of the website
if readErr1 != nil {
log.Fatal(readErr1)
}
resp1.Body.Close()
f1, fileErr1 := os.Create("200/" + urlItem + ".txt") // creating the relative file
if fileErr1 != nil {
log.Fatal(fileErr1)
}
_, writeErr1 := f1.Write(text1) // writing the sourcecode into our file
if writeErr1 != nil {
log.Fatal(writeErr1)
}
f1.Close()
}
}
}
func main() {
var wg sync.WaitGroup
file, err := os.Open("urls.txt") // the file containing the url's
if err != nil {
log.Fatal(err)
}
defer file.Close() // don't forget to close the file
urlChannel := make(chan string)
_ = os.Mkdir("200", 0755) // if it's there, it will create an error, and we will simply ignore it
// first, initialize crawlers
wg.Add(10)
for i := 0; i < 10; i++ {
go crawler(&wg, urlChannel)
}
//after crawlers are initialized, start feeding them data through the channel
scanner := bufio.NewScanner(file) // each line has another url
for scanner.Scan() {
urlChannel <- scanner.Text()
}
close(urlChannel)
wg.Wait()
}

Non-blocking way to read from a Pipe

I want to create a simple app which will be reading continuously output from one app, process it and write processed output to stdout. This app can produce a lot of data within a second and next is silent for a few minutes.
The problem is that mine data processing algorithm is quite slow so main loop is blocked. When the loop is blocked I'm loosing a data which comes and this moment.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)
go func() {
bufioReader := bufio.NewReader(stdoutReader)
for {
output, _, err := bufioReader.ReadLine()
if err != nil || err == io.EOF {
break
}
processedOutput := dataProcessor(output);
fmt.Print(processedOutput)
}
}()
Probably the best way to solve this problem is to buffer all output and process it in another Goroutine but I'm not sure how to implement this in Golang. What is the most idiomatic way to solve this problem?

You can have two goroutines, one supplier and other is a consumer. supplier execute a command and pass data with a channel to the consumer.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)
Datas := make(chan Data, 100)
go dataProcessor(Datas)
bufioReader := bufio.NewReader(stdoutReader)
for {
output, _, err := bufioReader.ReadLine()
var tempData Data
tempData.Out = output
if err != nil || err == io.EOF {
break
}
Datas <- tempData
}
}
and then you will prccess data in dataProcessor func:
func dataProcessor(Datas <-chan Data) {
for {
select {
case data := <-Datas:
fmt.Println(data)
default:
continue
}
}
}
obviesly it is very simple example and you should customize it and make it better. search about chanle and goroutins. reading this tutorial could be helpfull.

Farm out work to a slice but limit number of workers

I'm trying to improve the performance of an app.
One part of its code uploads a file to a server in chunks.
The original version simply does this in a sequential loop. However, it's slow and during the sequence it also needs to talk to another server before uploading each chunk.
The upload of chunks could simply be placed in a goroutine. It works, but is not a good solution because if the source file is extremely large it ends up using a large amount of memory.
So, I try to limit the number of active goroutines by using a buffered channel. Here is some code that shows my attempt. I've stripped it down to show the concept and you can run it to test for yourself.
package main
import (
"fmt"
"io"
"os"
"time"
)
const defaultChunkSize = 1 * 1024 * 1024
// Lets have 4 workers
var c = make(chan int, 4)
func UploadFile(f *os.File) error {
fi, err := f.Stat()
if err != nil {
return fmt.Errorf("err: %s", err)
}
size := fi.Size()
total := (int)(size/defaultChunkSize + 1)
// Upload parts
buf := make([]byte, defaultChunkSize)
for partno := 1; partno <= total; partno++ {
readChunk := func(offset int, buf []byte) (int, error) {
fmt.Println("readChunk", partno, offset)
n, err := f.ReadAt(buf, int64(offset))
if err != nil {
return n, err
}
return n, nil
}
// This will block if there are not enough worker slots available
c <- partno
// The actual worker.
go func() {
offset := (partno - 1) * defaultChunkSize
n, err := readChunk(offset, buf)
if err != nil && err != io.EOF {
return
}
err = uploadPart(partno, buf[:n])
if err != nil {
fmt.Println("Uploadpart failed:", err)
}
<-c
}()
}
return nil
}
func uploadPart(partno int, buf []byte) error {
fmt.Printf("Uploading partno: %d, buflen=%d\n", partno, len(buf))
// Actually upload the part. Lets test it by instead writing each
// buffer to another file. We can then use diff to compare the
// source and dest files.
// Open file. Seek to (partno - 1) * defaultChunkSize, write buffer
f, err := os.OpenFile("/home/matthewh/Downloads/out.tar.gz", os.O_CREATE|os.O_WRONLY, 0755)
if err != nil {
fmt.Printf("err: %s\n", err)
}
n, err := f.WriteAt(buf, int64((partno-1)*defaultChunkSize))
if err != nil {
fmt.Printf("err=%s\n", err)
}
fmt.Printf("%d bytes written\n", n)
defer f.Close()
return nil
}
func main() {
filename := "/home/matthewh/Downloads/largefile.tar.gz"
fmt.Printf("Opening file: %s\n", filename)
f, err := os.Open(filename)
if err != nil {
panic(err)
}
UploadFile(f)
}
It almost works. But there are several problems.
1) The final partno 22 is occuring 3 times. The correct length is actually 612545 as the file length isn't a multiple of 1MB.
// Sample output
...
readChunk 21 20971520
readChunk 22 22020096
Uploading partno: 22, buflen=1048576
Uploading partno: 22, buflen=612545
Uploading partno: 22, buflen=1048576
Another problem, the upload could fail and I am not familiar enough with go and how best to solve failure of the goroutine.
Finally, I want to ordinarily return some data from the uploadPart when it succeeds. Specifically, it'll be a string (an HTTP ETag header value). These etag values need to be collected by the main function.
What is a better way to structure this code in this instance? I've not yet found a good golang design pattern that correctly fulfills my needs here.

Skipping for the moment the question of how better to structure this code, I see a bug in your code which may be causing the problem you're seeing. Since the function you're running in the goroutine uses the variable partno, which changes with each iteration of the loop, your goroutine isn't necessarily seeing the value of partno at the time you invoked the goroutine. A common way of fixing this is to create a local copy of that variable inside the loop:
for partno := 1; partno <= total; partno++ {
partno := partno
// ...
}

Data race #1
Multiple goroutines are using the same buffer concurrently. Note that one gorouting may be filling it with a new chunk while another is still reading an old chunk from it. Instead, each goroutine should have it's own buffer.
Data race #2
As Andy Schweig has pointed, the value in partno is updated by the loop before the goroutine created in that iteration has a chance to read it. This is why the final partno 22 occurs multiple times. To fix it, you can pass partno as a argument to the anonymous function. That will ensure each goroutine has it's own part number.
Also, you can use a channel to pass the results from the workers. Maybe a struct type with the part number and error. That way, you will be able to observe the progress and retry failed uploads.
For an example of a good pattern check out this example from the GOPL book.

Suggested changes
As noted by dev.bmax buf moved into go routine, as noted by Andy Schweig partno is param to anon function, also added WaitGroup since UploadFile was exiting before uploads were complete. Also defer f.Close() file, good habit.
package main
import (
"fmt"
"io"
"os"
"sync"
"time"
)
const defaultChunkSize = 1 * 1024 * 1024
// wg for uploads to complete
var wg sync.WaitGroup
// Lets have 4 workers
var c = make(chan int, 4)
func UploadFile(f *os.File) error {
// wait for all the uploads to complete before function exit
defer wg.Wait()
fi, err := f.Stat()
if err != nil {
return fmt.Errorf("err: %s", err)
}
size := fi.Size()
fmt.Printf("file size: %v\n", size)
total := int(size/defaultChunkSize + 1)
// Upload parts
for partno := 1; partno <= total; partno++ {
readChunk := func(offset int, buf []byte, partno int) (int, error) {
fmt.Println("readChunk", partno, offset)
n, err := f.ReadAt(buf, int64(offset))
if err != nil {
return n, err
}
return n, nil
}
// This will block if there are not enough worker slots available
c <- partno
// The actual worker.
go func(partno int) {
// wait for me to be done
wg.Add(1)
defer wg.Done()
buf := make([]byte, defaultChunkSize)
offset := (partno - 1) * defaultChunkSize
n, err := readChunk(offset, buf, partno)
if err != nil && err != io.EOF {
return
}
err = uploadPart(partno, buf[:n])
if err != nil {
fmt.Println("Uploadpart failed:", err)
}
<-c
}(partno)
}
return nil
}
func uploadPart(partno int, buf []byte) error {
fmt.Printf("Uploading partno: %d, buflen=%d\n", partno, len(buf))
// Actually do the upload. Simulate long running task with a sleep
time.Sleep(time.Second)
return nil
}
func main() {
filename := "/home/matthewh/Downloads/largefile.tar.gz"
fmt.Printf("Opening file: %s\n", filename)
f, err := os.Open(filename)
if err != nil {
panic(err)
}
defer f.Close()
UploadFile(f)
}
I'm sure you can deal a little smarter with the buf situation. I'm just letting go deal with the garbage. Since you are limiting your workers to specific number 4 you really need only 4 x defaultChunkSize buffers. Please do share if you come up with something simple and shareworth.
Have fun!

using io.Pipes() for sending and receiving message

I am using os.Pipes() in my program, but for some reason it gives a bad file descriptor error each time i try to write or read data from it.
Is there some thing I am doing wrong?
Below is the code
package main
import (
"fmt"
"os"
)
func main() {
writer, reader, err := os.Pipe()
if err != nil {
fmt.Println(err)
}
_,err= writer.Write([]byte("hello"))
if err != nil {
fmt.Println(err)
}
var data []byte
_, err = reader.Read(data)
if err != nil {
fmt.Println(err)
}
fmt.Println(string(data))
}
output :
write |0: Invalid argument
read |1: Invalid argument

You are using an os.Pipe, which returns a pair of FIFO connected files from the os. This is different than an io.Pipe which is implemented in Go.
The invalid argument errors are because you are reading and writing to the wrong files. The signature of os.Pipe is
func Pipe() (r *File, w *File, err error)
which shows that the returns values are in the order "reader, writer, error".
and io.Pipe:
func Pipe() (*PipeReader, *PipeWriter)
Also returning in the order "reader, writer"
When you check the error from the os.Pipe function, you are only printing the value. If there was an error, the files are invalid. You need to return or exit on that error.
Pipes are also blocking (though an os.Pipe has a small, hard coded buffer), so you need to read and write asynchronously. If you swapped this for an io.Pipe it would deadlock immediately. Dispatch the Read method inside a goroutine and wait for it to complete.
Finally, you are reading into a nil slice, which will read nothing. You need to allocate space to read into, and you need to record the number of bytes read to know how much of the buffer is used.
A more correct version of your example would look like:
reader, writer, err := os.Pipe()
if err != nil {
log.Fatal(err)
}
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
data := make([]byte, 1024)
n, err = reader.Read(data)
if n > 0 {
fmt.Println(string(data[:n]))
}
if err != nil && err != io.EOF {
fmt.Println(err)
}
}()
_, err = writer.Write([]byte("hello"))
if err != nil {
fmt.Println(err)
}
wg.Wait()

golang zlib reader output not being copied over to stdout

I've modified the official documentation example for the zlib package to use an opened file rather than a set of hardcoded bytes (code below).
The code reads in the contents of a source text file and compresses it with the zlib package. I then try to read back the compressed file and print its decompressed contents into stdout.
The code doesn't error, but it also doesn't do what I expect it to do; which is to display the decompressed file contents into stdout.
Also: is there another way of displaying this information, rather than using io.Copy?
package main
import (
"compress/zlib"
"io"
"log"
"os"
)
func main() {
var err error
// This defends against an error preventing `defer` from being called
// As log.Fatal otherwise calls `os.Exit`
defer func() {
if err != nil {
log.Fatalln("\nDeferred log: \n", err)
}
}()
src, err := os.Open("source.txt")
if err != nil {
return
}
defer src.Close()
dest, err := os.Create("new.txt")
if err != nil {
return
}
defer dest.Close()
zdest := zlib.NewWriter(dest)
defer zdest.Close()
if _, err := io.Copy(zdest, src); err != nil {
return
}
n, err := os.Open("new.txt")
if err != nil {
return
}
r, err := zlib.NewReader(n)
if err != nil {
return
}
defer r.Close()
io.Copy(os.Stdout, r)
err = os.Remove("new.txt")
if err != nil {
return
}
}

Your defer func doesn't do anything, because you're shadowing the err variable on every new assignment. If you want a defer to run, return from a separate function, and call log.Fatal after the return statement.
As for why you're not seeing any output, it's because you're deferring all the Close calls. The zlib.Writer isn't flushed until after the function exits, and neither is the destination file. Call Close() explicitly where you need it.
zdest := zlib.NewWriter(dest)
if _, err := io.Copy(zdest, src); err != nil {
log.Fatal(err)
}
zdest.Close()
dest.Close()

I think you messed up the code logic with all this defer stuff and your "trick" err checking.
Files are definitively written when flushed or closed. You just copy into new.txt without closing it before opening it to read it.
Defering the closing of the file is neat inside a function which has multiple exits: It makes sure the file is closed once the function is left. But your main requires the new.txt to be closed after the copy, before re-opening it. So don't defer the close here.
BTW: Your defense against log.Fatal terminating the code without calling your defers is, well, at least strange. The files are all put into some proper state by the OS, there is absolutely no need to complicate the stuff like this.

Check the error from the second Copy:
2015/12/22 19:00:33
Deferred log:
unexpected EOF
exit status 1
The thing is, you need to close zdest immediately after you've done writing. Close it after the first Copy and it works.

I would have suggested to use io.MultiWriter.
In this way you read only once from src. Not much gain for small files but is faster for bigger files.
w := io.MultiWriter(dest, os.Stdout)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Parallel zip compression in Go - go

Related

How to make concurrent GET requests from url pool

Non-blocking way to read from a Pipe

Farm out work to a slice but limit number of workers

using io.Pipes() for sending and receiving message

golang zlib reader output not being copied over to stdout

Categories

Resources