Gzip writer not writing gzip data to S3 in Golang - go

I've got a (hopefully) simple problem. I'm trying to write the results of an HTTP request to a gzip file in S3. However, when downloading the resultant file from S3, it's just plain text and not compressed. Below is a snippet of the code (sans bootstrapping). The code builds, lints, and runs without error, so I'm not sure where I'm going wrong... any pointers would be greatly appreciated!
r, w := io.Pipe()
gw := gzip.NewWriter(w)

go func() {
    defer w.Close()
    defer gw.Close()
    _, err := gw.Write(httpResponse)
    if err != nil {
        fmt.Println("error")
    }
}()

cfg, _ := config.LoadDefaultConfig(context.TODO())
s3Client := s3.NewFromConfig(cfg)
ul := manager.NewUploader(s3Client)

_, err := ul.Upload(context.TODO(), &s3.PutObjectInput{
    Bucket:          aws.String(bucket),
    ContentEncoding: aws.String("gzip"),
    Key:             aws.String(fileName),
    Body:            r,
})
if err != nil {
    fmt.Println("error")
}

This is a side effect of downloading via the browser: the browser decodes the gzip but leaves the .gz extension (which is frankly confusing). If you use the AWS CLI or the API to download the file, it will remain gzip-compressed.
See: https://superuser.com/questions/940605/chromium-prevent-unpacking-tar-gz?noredirect=1&lq=1
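If you want to sanity-check that the io.Pipe + gzip.Writer pattern really emits gzip bytes before they ever reach S3, here is a minimal, self-contained sketch (standard library only, no AWS calls). The payload variable is a stand-in for the question's httpResponse; the reader side just collects the piped bytes and round-trips them through gzip.NewReader:

// Self-contained sketch: same io.Pipe + gzip.Writer pattern as above,
// with the reader side collecting the bytes and proving they are gzip.
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "log"
)

func main() {
    payload := []byte("pretend this is the HTTP response body")

    r, w := io.Pipe()
    gw := gzip.NewWriter(w)

    go func() {
        // Deferred calls run LIFO: gw.Close() flushes the gzip footer first,
        // then w.Close() delivers EOF to the reader.
        defer w.Close()
        defer gw.Close()
        if _, err := gw.Write(payload); err != nil {
            w.CloseWithError(err)
        }
    }()

    compressed, err := io.ReadAll(r) // stands in for the S3 upload
    if err != nil {
        log.Fatal(err)
    }

    gz, err := gzip.NewReader(bytes.NewReader(compressed))
    if err != nil {
        log.Fatalf("not gzip data: %v", err)
    }
    defer gz.Close()

    roundTripped, err := io.ReadAll(gz)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("compressed %d bytes to %d, round-trip ok: %v\n",
        len(payload), len(compressed), bytes.Equal(payload, roundTripped))
}

If this prints a successful round-trip, the writer side is producing valid gzip and the plain-text appearance is just the browser's transparent decoding described above.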

Related

How to convert aws.WriteAtBuffer to an io.Reader?

I need to download a file from S3, and then upload the same file into a different S3 bucket. So far I have:
sess := session.Must(session.NewSession())

downloader := s3manager.NewDownloader(sess)
buffer := aws.NewWriteAtBuffer([]byte{})
n, err := downloader.Download(buffer, &s3.GetObjectInput{
    Bucket: aws.String(sourceS3Bucket),
    Key:    aws.String(documentKey),
})

uploader := s3manager.NewUploader(sess)
result, err := uploader.Upload(&s3manager.UploadInput{
    Bucket: aws.String(targetS3Bucket),
    Key:    aws.String(documentKey),
    Body:   buffer,
})
I have used an aws.WriteAtBuffer, as per the answer here: https://stackoverflow.com/a/48254996/504055
However, I am currently stuck on how to treat this buffer as something that implements the io.Reader interface, which is what the uploader's Upload method requires.
Use bytes.NewReader to create an io.Reader on the bytes in the buffer:
result, err := uploader.Upload(&s3manager.UploadInput{
    Bucket: aws.String(targetS3Bucket),
    Key:    aws.String(documentKey),
    Body:   bytes.NewReader(buffer.Bytes()),
})
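For completeness, here is a hedged end-to-end sketch of the same approach (AWS SDK for Go v1, as in the question), with the error checks the snippets above omit. The bucket and key names are placeholders:

package main

import (
    "bytes"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
    // Placeholder names; substitute your own buckets and key.
    sourceS3Bucket, targetS3Bucket, documentKey := "source-bucket", "target-bucket", "path/to/doc"

    sess := session.Must(session.NewSession())

    // Download the whole object into an in-memory WriteAtBuffer.
    buffer := aws.NewWriteAtBuffer([]byte{})
    downloader := s3manager.NewDownloader(sess)
    n, err := downloader.Download(buffer, &s3.GetObjectInput{
        Bucket: aws.String(sourceS3Bucket),
        Key:    aws.String(documentKey),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("downloaded %d bytes", n)

    // Re-upload the same bytes; bytes.NewReader satisfies io.Reader.
    uploader := s3manager.NewUploader(sess)
    result, err := uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String(targetS3Bucket),
        Key:    aws.String(documentKey),
        Body:   bytes.NewReader(buffer.Bytes()),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("uploaded to", result.Location)
}

Note that this buffers the entire object in memory. For a pure bucket-to-bucket copy, S3 also offers the server-side CopyObject call, which avoids downloading the data at all.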

How to Achieve Performance of AWS CLI Sync Command in AWS SDK

The aws s3 sync command in the CLI can download a large collection of files very quickly, and I cannot achieve the same performance with the AWS Go SDK. I have millions of files in the bucket, so this is critical for me. I also need to list objects page by page so that I can filter by a prefix, which the sync CLI command does not support well.
I have tried using multiple goroutines (10 up to 1000) to make requests to the server, but it is still much slower than the CLI. It takes about 100 ms per file to run the Go GetObject call, which is unacceptable for the number of files I have. I know the AWS CLI uses the Python SDK under the hood, so how does it get so much better performance (I tried my script in boto as well as Go)?
I am using ListObjectsV2Pages and GetObject. My region is the same as the S3 bucket's.
logMtx := &sync.Mutex{}
logBuf := bytes.NewBuffer(make([]byte, 0, 100000000))

err = s3c.ListObjectsV2Pages(
    &s3.ListObjectsV2Input{
        Bucket:  bucket,
        Prefix:  aws.String("2019-07-21-01"),
        MaxKeys: aws.Int64(1000),
    },
    func(page *s3.ListObjectsV2Output, lastPage bool) bool {
        fmt.Println("Received", len(page.Contents), "objects in page")
        worker := make(chan bool, 10)
        for i := 0; i < cap(worker); i++ {
            worker <- true
        }
        wg := &sync.WaitGroup{}
        wg.Add(len(page.Contents))
        objIdx := 0
        objIdxMtx := sync.Mutex{}
        for {
            <-worker
            objIdxMtx.Lock()
            if objIdx == len(page.Contents) {
                break
            }
            go func(idx int, obj *s3.Object) {
                gs := time.Now()
                resp, err := s3c.GetObject(&s3.GetObjectInput{
                    Bucket: bucket,
                    Key:    obj.Key,
                })
                check(err)
                fmt.Println("Get: ", time.Since(gs))
                rs := time.Now()
                logMtx.Lock()
                _, err = logBuf.ReadFrom(resp.Body)
                check(err)
                logMtx.Unlock()
                fmt.Println("Read: ", time.Since(rs))
                err = resp.Body.Close()
                check(err)
                worker <- true
                wg.Done()
            }(objIdx, page.Contents[objIdx])
            objIdx += 1
            objIdxMtx.Unlock()
        }
        fmt.Println("ok")
        wg.Wait()
        return true
    },
)
check(err)
Many results look like:
Get: 153.380727ms
Read: 51.562µs
Have you tried using the s3manager package (https://docs.aws.amazon.com/sdk-for-go/api/service/s3/s3manager/)?
iter := new(s3manager.DownloadObjectsIterator)
var files []*os.File
defer func() {
    for _, f := range files {
        f.Close()
    }
}()

err := client.ListObjectsV2PagesWithContext(ctx, &s3.ListObjectsV2Input{
    Bucket: aws.String(bucket),
    Prefix: aws.String(prefix),
}, func(output *s3.ListObjectsV2Output, last bool) bool {
    for _, object := range output.Contents {
        nm := filepath.Join(dstdir, *object.Key)
        err := os.MkdirAll(filepath.Dir(nm), 0755)
        if err != nil {
            panic(err)
        }
        f, err := os.Create(nm)
        if err != nil {
            panic(err)
        }
        log.Println("downloading", *object.Key, "to", nm)
        iter.Objects = append(iter.Objects, s3manager.BatchDownloadObject{
            Object: &s3.GetObjectInput{
                Bucket: aws.String(bucket),
                Key:    object.Key,
            },
            Writer: f,
        })
        files = append(files, f)
    }
    return true
})
if err != nil {
    panic(err)
}

downloader := s3manager.NewDownloader(s)
err = downloader.DownloadWithIterator(ctx, iter)
if err != nil {
    panic(err)
}
I ended up settling for the script in my initial post. I tried 20 goroutines and that seemed to work pretty well. On my laptop (i7, 8 threads, 16 GB RAM, NVMe), the initial script is definitely slower than the CLI. However, on an EC2 instance the difference was small enough that it was not worth my time to optimize further. I used a c5.xlarge instance in the same region as the S3 bucket.
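For reference, the bounded-concurrency idea in that script can be written more compactly with a semaphore channel and a per-page WaitGroup, avoiding the index/mutex bookkeeping. This is only a sketch against the v1 SDK used above; the package name, the downloadPage helper, and the concurrency parameter are mine, and the body is simply drained where a real consumer would write it to a file or buffer:

package s3fetch

import (
    "io"
    "log"
    "sync"

    "github.com/aws/aws-sdk-go/service/s3"
)

// downloadPage fetches every object in one list page with at most
// `concurrency` GetObject calls in flight, using a semaphore channel and a
// WaitGroup instead of manual index bookkeeping.
func downloadPage(s3c *s3.S3, bucket *string, page *s3.ListObjectsV2Output, concurrency int) {
    sem := make(chan struct{}, concurrency)
    var wg sync.WaitGroup

    for _, obj := range page.Contents {
        wg.Add(1)
        sem <- struct{}{} // blocks once `concurrency` downloads are in flight
        go func(obj *s3.Object) {
            defer wg.Done()
            defer func() { <-sem }()

            resp, err := s3c.GetObject(&s3.GetObjectInput{
                Bucket: bucket,
                Key:    obj.Key,
            })
            if err != nil {
                log.Println("get", *obj.Key, err)
                return
            }
            defer resp.Body.Close()

            // Drain the body; a real consumer would write it somewhere.
            if _, err := io.Copy(io.Discard, resp.Body); err != nil {
                log.Println("read", *obj.Key, err)
            }
        }(obj)
    }
    wg.Wait()
}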

Download a zip file using io.Pipe() read/write golang

I am trying to stream out the bytes of a zip file using the io.Pipe() function in Go. I use the pipe reader to read the bytes of each file in the zip and stream those out, and the pipe writer to write the bytes into the response object.
func main() {
    r, w := io.Pipe()
    // go routine to make the write/read non-blocking
    go func() {
        defer w.Close()
        bytes, err := ReadBytesforEachFileFromTheZip()
        err := json.NewEncoder(w).Encode(bytes)
        handleErr(err)
    }()
This is not a working implementation, just the structure of what I am trying to achieve. I don't want to use ioutil.ReadAll since the file is going to be very large, and Pipe() will help me avoid bringing all of the data into memory. Can someone help with a working implementation using io.Pipe()?
I made it work using Go's io.Pipe(). The PipeWriter writes bytes to the pipe in chunks and the PipeReader reads them from the other end. The reason for using a goroutine is to have a non-blocking write operation while simultaneous reads happen from the pipe.
Note: it's important to close the pipe writer (w.Close()) to send EOF on the stream; otherwise the stream will never be closed.
func DownloadZip() ([]byte, error) {
    r, w := io.Pipe()
    defer r.Close()
    defer w.Close()

    zip, err := os.Stat("temp.zip")
    if err != nil {
        return nil, err
    }

    go func() {
        f, err := os.Open(zip.Name())
        if err != nil {
            // Propagate the error so ioutil.ReadAll below is not blocked forever.
            w.CloseWithError(err)
            return
        }
        defer f.Close()

        buf := make([]byte, 1024)
        for {
            chunk, err := f.Read(buf)
            if err != nil && err != io.EOF {
                panic(err)
            }
            if chunk == 0 {
                break
            }
            if _, err := w.Write(buf[:chunk]); err != nil {
                return
            }
        }
        // Closing the writer delivers EOF to the reader.
        w.Close()
    }()

    body, err := ioutil.ReadAll(r)
    if err != nil {
        return nil, err
    }
    return body, nil
}
Please let me know if someone has another way of doing it.
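One other way to do it, as a sketch rather than a drop-in replacement: keep the io.Pipe, but let io.Copy drive the chunked read/write loop and use CloseWithError so a failed open or read reaches the reading side instead of hanging it. The DownloadZipStream name is mine; main just drains the stream to show the shape, where in practice the caller would copy into an http.ResponseWriter. The "temp.zip" file name is taken from the answer above.

package main

import (
    "io"
    "log"
    "os"
)

// DownloadZipStream returns a reader that streams the file through an
// io.Pipe without ever holding the whole file in memory.
func DownloadZipStream(path string) (io.ReadCloser, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    r, w := io.Pipe()
    go func() {
        defer f.Close()
        _, err := io.Copy(w, f) // copies in chunks
        w.CloseWithError(err)   // a nil err closes normally, giving the reader EOF
    }()
    return r, nil
}

func main() {
    rc, err := DownloadZipStream("temp.zip")
    if err != nil {
        log.Fatal(err)
    }
    defer rc.Close()

    // Copy wherever an io.Writer is available (an http.ResponseWriter, a file, ...).
    n, err := io.Copy(io.Discard, rc)
    if err != nil {
        log.Fatal(err)
    }
    log.Println("streamed", n, "bytes without buffering the whole file")
}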

Push data from Golang to OpenTSDB

I have stored the last one hour of data in a file, so I have to upload that previous data to OpenTSDB.
So, the code is as follows:
go func() {
    file, err := os.Open("/var/lib/agent/agent.db")
    if err != nil {
        fmt.Println(err, "Err")
    }
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        arr := []byte(scanner.Text())
        url := "http://192.168.2.40:4242/api/put"
        req, err := http.NewRequest("POST", url, bytes.NewBuffer(arr))
        req.Header.Set("Content-Type", "")
        client := &http.Client{}
        resp, err := client.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
    }
}()
The above code pushes the last hour of data to OpenTSDB.
Current data is also pushed to OpenTSDB by another goroutine.
The code is as follows:
// Regular run
go func() {
    timeStamp, _ := strconv.ParseInt(strconv.FormatInt(time.Now().UnixNano()/1e9, 10), 10, 64)
    err := opentsdb.Put(
        MetricName,
        4,
        timeStamp,
        opentsdb.Tag{"host", hostname},
    )
}()
The problem is that if the last record is 4, my previous record gets uploaded along with the old data (e.g. 4+4).
If I run a single goroutine, it works correctly. If I push both the old and the current data, the result is wrong.
How do I fix this? Any help is greatly appreciated. Thanks in advance.

Do I copy resp.Body?

I am learning go and I have the following code which works fine:
resp, err := http.Get(url) // get the html
...
doc, err := html.Parse(resp.Body) // parse the html page
Now I want to print out the HTML first and then do the parsing:
resp, err := http.Get(url)
...
b, err := ioutil.ReadAll(resp.Body) // this line is added, not working now...
doc, err := html.Parse(resp.Body)
I guess the reason is that resp.Body is a reader, so I cannot read it twice? Any idea how I can do this correctly? Should I copy resp.Body?
Because the client streams the response body from the network, it's not possible to read the body twice.
Read the response into a []byte as you are already doing, then create an io.Reader over those bytes for the HTML parser using bytes.NewReader.
resp, err := http.Get(url)
...
b, err := io.ReadAll(resp.Body)
doc, err := html.Parse(bytes.NewReader(b))
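Put together, a minimal sketch of that pattern looks like this (the URL is a placeholder, and html.Parse comes from golang.org/x/net/html):

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    resp, err := http.Get("https://example.com") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Drain the network body exactly once.
    b, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("%s\n", b) // first use: print the raw HTML

    // Second use: parse the same bytes via a fresh reader.
    doc, err := html.Parse(bytes.NewReader(b))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("parsed document, root node type:", doc.Type)
}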
