io.Copy vs gsutil - copying large files to GS - go

After a lot of tests, we cannot seem to match the speed of gsutil when using the GCS Go client library.
Even a skeleton program using the simplest io.Copy() takes a lot longer than the simplest gsutil invocation.
ctx := context.Background()
client, err := storage.NewClient(ctx, option.WithCredentialsFile(*flags.credsFile))
// error handling omitted for brevity
bucket := client.Bucket("my_bucket")
file, _ := os.Open("path_to_file")
wc := bucket.Object("remoteFile").NewWriter(ctx)
_, _ = io.Copy(wc, file)
err = wc.Close()
We also tried io.CopyBuffer() with a 128×1024-byte buffer; that was better, but still slow.
Is there any way to speed up the upload while using Go? We don't want to call any external utilities...

It sounds like the io.Copy implementation is not GCS-aware and is doing actual byte copies (reading from the source file and writing to the destination). In contrast, gsutil calls the GCS Rewrite API which, in cases where the source and destination are in the same location and storage class, performs metadata-only copies and avoids byte copying entirely. That is far faster, which would match what you're observing.
Can you use a GCS-aware Go implementation, i.e. one that calls Rewrite rather than reading and writing the underlying object bytes?
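For GCS-to-GCS copies, the Go client library exposes the Rewrite API through ObjectHandle.CopierFrom. A minimal sketch, with bucket and object names taken from the question (the destination name is illustrative):

package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	src := client.Bucket("my_bucket").Object("remoteFile")
	dst := client.Bucket("my_bucket").Object("remoteFileCopy")

	// Copier drives the GCS Rewrite API: for same-location, same-storage-class
	// copies this is metadata-only on the server, so no bytes flow through
	// the client.
	if _, err := dst.CopierFrom(src).Run(ctx); err != nil {
		log.Fatal(err)
	}
}

For genuine uploads from local disk, where the bytes do have to flow, the storage.Writer's exported ChunkSize field is the main client-side knob worth experimenting with.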

Related

Why does writing files with the syscall.O_DIRECT flag make writing slower in Go?

I've got a small piece of code named test.go. It measures the time (in ns) taken by two writes of the same byte slice to two files, one opened with the syscall.O_DIRECT flag and the other without.
The code is below:
package main

import (
	"bytes"
	"fmt"
	"os"
	"strconv"
	"syscall"
	"time"
)

func main() {
	num, _ := strconv.Atoi(os.Args[1])
	writeContent := bytes.Repeat([]byte("1"), num)

	// abc.txt must already exist; it is opened without O_CREAT.
	t1 := time.Now().UnixNano()
	fd1, err := syscall.Open("abc.txt", syscall.O_WRONLY|syscall.O_DIRECT|syscall.O_TRUNC, 0)
	if err != nil {
		panic(err)
	}
	// NB: O_DIRECT generally requires sector-aligned buffers and sizes;
	// the Write error is ignored here.
	syscall.Write(fd1, writeContent)
	t2 := time.Now().UnixNano()
	fmt.Println("sysW1:", t2-t1)
	syscall.Close(fd1)

	t1 = time.Now().UnixNano()
	fd2, err := syscall.Open("abc.txt", syscall.O_WRONLY|syscall.O_TRUNC, 0)
	if err != nil {
		panic(err)
	}
	syscall.Write(fd2, writeContent)
	t2 = time.Now().UnixNano()
	fmt.Println("sysW2:", t2-t1)
	syscall.Close(fd2)
}
The program is run from the Linux command line like this (after being compiled with go build ./test.go):
./test 1024
I had expected writing with the syscall.O_DIRECT flag to be faster, but the result showed that it was about 30 times slower than writing without it :(
result:
sysW1: 1107377
sysW2: 37155
Why? I thought writing with syscall.O_DIRECT does less copying and would be faster, but it turns out to be much slower. Please help me explain this :(
PS: I will not provide a playground link since, for some reason, the result is always 0 when the program runs on the playground.
O_DIRECT doesn't do what you think. While it does less memory copying (since it doesn't copy to the page cache before copying to the device driver), that doesn't give you a performance boost.
The filesystem cache allows the system call to return before the data is written to the device, and it buffers data so it can be sent to the device in larger chunks.
With O_DIRECT, the system call waits until the data has been completely transferred to the device.
From the man page for the open call:
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from
user-space buffers. The O_DIRECT flag on its own makes an
effort to transfer data synchronously, but does not give
the guarantees of the O_SYNC flag that data and necessary
metadata are transferred.
See also: What does O_DIRECT really mean?
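As a side note: if the actual goal is durability (data on the device before the call returns) rather than bypassing the page cache, the O_SYNC flag mentioned in the man page excerpt is usually the right tool. A minimal sketch, assuming the same abc.txt target; os.OpenFile accepts platform-specific flags OR'ed into its flag argument:

package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	// O_SYNC makes each write return only after data and the metadata
	// needed to retrieve it have reached the device, without O_DIRECT's
	// buffer-alignment constraints.
	f, err := os.OpenFile("abc.txt",
		os.O_WRONLY|os.O_CREATE|os.O_TRUNC|syscall.O_SYNC, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("hello")); err != nil {
		log.Fatal(err)
	}
}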
You don't need to manually release the cache after using it.
The cache is considered freely available memory by the Linux kernel: if a process needs memory that is occupied by the cache, the kernel flushes/releases cache pages at that point. The cache doesn't "use up" memory.

How to save, and then serve again, data of type io.Reader?

I would like to parse data I retrieve through an HTTP call several times with gocal. Since I would like to avoid making the call for each parse, I would like to save this data and reuse it.
The Body I get from http.Get is of type io.ReadCloser. The gocal parser requires an io.Reader, so that works.
Since I can read Body only once, I can save it with body, _ := io.ReadAll(get.Body), but then I do not know how to serve the []byte back as an io.Reader (to the gocal parser, several times, to account for different parsing conditions).
As you have figured out, http.Response.Body is exposed as an io.Reader. This reader is not reusable because it is connected straight to the underlying connection* (which might be TCP/UDP or any other stream-like reader under the net package).
Once you read bytes out of the connection, new bytes are sitting there waiting for another read.
In order to save the response, you do indeed need to drain it first and save the result in a variable.
body, _ := io.ReadAll(get.Body)
To reuse that slice of bytes many times, the standard library provides bytes.NewReader, which wraps a []byte in an in-memory reader.
This reader also offers a Reset([]byte) method to reset its state.
bytes.Reader.Reset is very useful for reading the same byte slice multiple times with no extra allocations; by comparison, bytes.NewReader allocates a new reader on every call.
Finally, between two consecutive calls to c.Parse, reset the reader with the byte slice you collected previously, such as:
buf := bytes.NewReader(body)
// initialize the parser with buf, e.g. c := gocal.NewParser(buf)
c.Parse()
// process the result ...
// rewind buf so the same bytes can be read again, then parse again
buf.Reset(body)
c.Parse()
You can try this version: https://play.golang.org/p/YaVtCTZHZEP. It uses a strings.NewReader instead, but the interface and behavior are similar.
*Not entirely obvious, but that is the general principle: the transport reads the headers and leaves the body untouched until you consume it.
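Putting the pieces together, a minimal runnable sketch of the fetch-once, read-many pattern (the URL is illustrative, and plain io.ReadAll stands in for the gocal parse step):

package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("https://example.com/calendar.ics") // illustrative URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Drain the body exactly once; the connection cannot be re-read.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	r := bytes.NewReader(body)
	first, _ := io.ReadAll(r) // first consumer

	r.Reset(body)              // rewind with no new allocation
	second, _ := io.ReadAll(r) // second consumer sees the same bytes

	fmt.Println(len(first) == len(second)) // true
}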

Corrupted image file in Go API image download

I am building a simple testing API in Go for uploading and downloading image files (PNG, JPEG, JPG):
/pic [POST] for uploading an image and saving it to a folder; /pic [GET] for downloading the image to the client.
I have successfully built /pic [POST]: the image is uploaded to the server and I can open the file in the storage folder (on both a Windows localhost server and an Ubuntu server).
However, with /pic [GET] I am able to download the file to the client (my computer), but the downloaded file is somehow corrupted. When I try to open it with image viewers such as the gallery app or Photoshop, they say "It looks like we don't support this file format", so it seems the download is not successful.
[Screenshots: Postman result; the file failing to open in the gallery app]
Any ideas on why this happens and how should I fix it?
The Go code for downloading the picture is as follows (error handling omitted):
func PicDownload(w http.ResponseWriter, r *http.Request) {
	request := make(map[string]string)
	reqBody, _ := ioutil.ReadAll(r.Body)
	err := json.Unmarshal(reqBody, &request)
	// Error handling
	file, err := os.OpenFile("./resources/pic/"+request["filename"], os.O_RDONLY, 0666)
	// Error handling
	buffer := make([]byte, 512)
	_, err = file.Read(buffer)
	// Error handling
	contentType := http.DetectContentType(buffer)
	fileStat, _ := file.Stat()
	// Set headers
	w.Header().Set("Content-Disposition", "attachment; filename="+request["filename"])
	w.Header().Set("Content-Type", contentType)
	w.Header().Set("Content-Length", strconv.FormatInt(fileStat.Size(), 10))
	// Copy the file content to the response body
	io.Copy(w, file)
	return
}
When you read the first 512 bytes from the file to determine the content type, the file's read offset moves forward by 512 bytes. When you later call io.Copy, reading continues from that position, so the first 512 bytes of the image never reach the response.
There are two ways to correct this.
The first is to call file.Seek(0, io.SeekStart) before the call to io.Copy(). This puts the offset back at the start of the file. It requires the least code, but it means reading the same 512 bytes twice, which adds a little overhead.
The second is to allocate a buffer that holds the entire file, buffer := make([]byte, fileStat.Size()), and use it both for the http.DetectContentType() call and for writing the output (w.Write(buffer) instead of io.Copy()). The downside is that the entire file is loaded into memory at once, which isn't ideal for very large files (io.Copy streams in 32 KB chunks instead).
Note: As Peter mentioned in a comment, you must also ensure users cannot traverse your filesystem by posting ../../ or similar as the filename.
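A sketch of the first fix applied to the handler above; the filepath.Base guard for the traversal issue is my addition, not part of the original answer:

func PicDownload(w http.ResponseWriter, r *http.Request) {
	request := make(map[string]string)
	reqBody, _ := ioutil.ReadAll(r.Body)
	_ = json.Unmarshal(reqBody, &request)

	// filepath.Base strips directory components, so a posted
	// "../../etc/passwd" collapses to "passwd".
	name := filepath.Base(request["filename"])

	file, err := os.Open("./resources/pic/" + name)
	if err != nil {
		http.NotFound(w, r)
		return
	}
	defer file.Close()

	// Sniff the content type from the first 512 bytes...
	buffer := make([]byte, 512)
	file.Read(buffer)
	contentType := http.DetectContentType(buffer)

	// ...then rewind so io.Copy streams the file from byte 0.
	file.Seek(0, io.SeekStart)

	fileStat, _ := file.Stat()
	w.Header().Set("Content-Disposition", "attachment; filename="+name)
	w.Header().Set("Content-Type", contentType)
	w.Header().Set("Content-Length", strconv.FormatInt(fileStat.Size(), 10))
	io.Copy(w, file)
}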

Append to a file using Go where the file is on an NFS-attached volume

As part of a large-file upload feature, we're using the following to write a 'chunk' of bytes to a file at 'path'. This works fine on local filesystems: each chunk is correctly written at 'offset'.
f, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY, os.ModeAppend)
n, err := f.WriteAt(bytes, offset)
On NFS-attached storage, however, the bytes are written at the beginning of the file and not at the requested 'offset'.
It also doesn't appear that the process can obtain a lock on the file over NFS. Is there a technique or workaround we could follow to write to the file at 'offset'?
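One detail worth checking, offered as a hypothesis rather than a confirmed diagnosis: the open(2) man page warns that O_APPEND may corrupt files on NFS because NFS does not support true atomic appends, and on Linux pwrite(2), which backs File.WriteAt, appends to the end of the file regardless of the offset when the file was opened with O_APPEND; current Go even documents that os.File.WriteAt returns an error for files opened with O_APPEND. Since WriteAt positions every write explicitly, O_APPEND can simply be dropped. A minimal sketch:

// writeChunkAt writes chunk into the file at path, starting at offset.
func writeChunkAt(path string, chunk []byte, offset int64) error {
	// No O_APPEND: WriteAt supplies the position itself, and the two
	// conflict (WriteAt documents an error for O_APPEND files).
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE, 0644)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = f.WriteAt(chunk, offset)
	return err
}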

Correct usage of os.NewFile in Go

I'm attempting to compose an image in memory and send it out through http.ResponseWriter without ever touching the file system.
I use the following to create a new file:
file := os.NewFile(0, "temp_destination.png")
However, I don't seem to be able to do anything at all with this file. Here is the function I'm using (called within an http.HandleFunc handler that just sends the file's bytes to the browser); it is intended to draw a blue rectangle onto a temporary file and encode it as a PNG:
func ComposeImage() []byte {
	img := image.NewRGBA(image.Rect(0, 0, 640, 480))
	blue := color.RGBA{0, 0, 255, 255}
	draw.Draw(img, img.Bounds(), &image.Uniform{blue}, image.ZP, draw.Src)

	// in-memory destination file, instead of going to the file system
	file := os.NewFile(0, "temp_destination.png")

	// write the image to the destination io.Writer
	png.Encode(file, img)

	bytes, err := ioutil.ReadAll(file)
	if err != nil {
		log.Fatal("Couldn't read temporary file as bytes.")
	}
	return bytes
}
If I remove the png.Encode call and just return the file bytes, the server hangs and does nothing forever.
Leaving the png.Encode call in results in the encoded bytes (including some of the PNG chunks I'd expect to see) being vomited out to stderr/stdout (I can't tell which) and the server hanging indefinitely.
I assume I'm just not using os.NewFile correctly. Can anyone point me in the right direction? Alternative suggestions on how to properly perform in-memory file manipulations are welcome.
os.NewFile is a low-level function that most people will never use directly. It takes an already existing file descriptor (the system's representation of a file) and converts it to an *os.File (Go's representation).
If you never want the picture to touch your filesystem, stay out of the os package entirely. Just treat your ResponseWriter as an io.Writer and pass it to png.Encode.
png.Encode(yourResponseWriter, img)
If you insist on writing to an "in memory file", I suggest using bytes.Buffer:
buf := new(bytes.Buffer)
png.Encode(buf, img)
return buf.Bytes()
Please have a detailed read of the os.NewFile documentation. NewFile does not create a new file, not at all! It wraps a Go os.File around an already existing file with the given file descriptor (0 in your case, which is standard input).
Serving images without files is much easier: just Encode your image to your ResponseWriter. That's what interfaces are for. No need to write to some magic "in-memory file", no need to read it back with ReadAll; plain and simple: write to your response.
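A runnable sketch of that approach (the route and image contents are illustrative):

package main

import (
	"image"
	"image/color"
	"image/draw"
	"image/png"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/blue.png", func(w http.ResponseWriter, r *http.Request) {
		img := image.NewRGBA(image.Rect(0, 0, 640, 480))
		blue := color.RGBA{0, 0, 255, 255}
		draw.Draw(img, img.Bounds(), &image.Uniform{blue}, image.Point{}, draw.Src)

		// Encode straight into the response; no file, no intermediate buffer.
		w.Header().Set("Content-Type", "image/png")
		if err := png.Encode(w, img); err != nil {
			log.Println(err)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}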
