How to save data streams in S3? aws-sdk-go example not working? - go

I am trying to persist a given stream of data to an S3 compatible storage.
The size is not known before the stream ends and can vary from 5MB to ~500GB.
I tried different possibilities but did not find a better solution than implementing sharding myself. My best guess is to make a buffer of a fixed size, fill it with my stream, and write it to S3.
Is there a better solution? Maybe a way where this is transparent to me, without writing the whole stream to memory?
The aws-sdk-go readme has an example program that takes data from stdin and writes it to S3: https://github.com/aws/aws-sdk-go#using-the-go-sdk
When I try to pipe data in with a pipe (|), I get the following error:
failed to upload object, SerializationError: failed to compute request body size
caused by: seek /dev/stdin: illegal seek
Am I doing something wrong or is the example not working as I expect it to?
I also tried minio-go, with PutObject() or client.PutObjectStreaming().
This is functional but consumes as much memory as the data to store.
Is there a better solution?
Is there a small example program that can pipe arbitrary data into S3?

You can use the SDK's Uploader to handle uploads of unknown size, but you'll need to make os.Stdin "unseekable" by wrapping it in a type that exposes only io.Reader. The Uploader requires only an io.Reader as the input body, but under the hood it checks whether that body also implements Seeker and, if it does, calls Seek on it. Since os.Stdin is an *os.File, which implements the Seeker interface, you would by default get the same error you got from PutObjectWithContext.
The Uploader also lets you upload the data in chunks whose size you can configure, and you can also configure how many of those chunks should be uploaded concurrently.
Here's a modified version of the linked example, stripped of the code that can remain unchanged.
package main

import (
    // ...
    "io"

    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

type reader struct {
    r io.Reader
}

func (r *reader) Read(p []byte) (int, error) {
    return r.r.Read(p)
}

func main() {
    // ... parse flags

    sess := session.Must(session.NewSession())
    uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
        u.PartSize = 20 << 20 // 20MB
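        // u.Concurrency controls how many of those parts are uploaded at once (the SDK default is 5)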
        // ... more configuration
    })

    // ... context stuff

    _, err := uploader.UploadWithContext(ctx, &s3manager.UploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
        Body:   &reader{os.Stdin},
    })

    // ... handle error
}
As to whether this is a better solution than minio-go, I do not know; you'll have to test that yourself.

Related

Does closing io.PipeWriter close the underlying file?

I am using logrus for logging and have a few custom format loggers. Each is initialized to write to a different file like:
fp, _ := os.OpenFile(path, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0755)
// error handling left out for brevity
log.Out = fp
Later in the application, I need to change the file the logger is writing to (for log rotation logic). What I want to achieve is to properly close the current file before changing the logger's output file. But the closest thing to the file handle that logrus provides is the Writer() method, which returns an io.PipeWriter pointer. So would calling Close() on the PipeWriter also close the underlying file?
If not, what are my options, other than keeping the file pointer stored somewhere?
For the record, twelve-factor tells us that applications should not concern themselves with log rotation. If and how logs are handled best depends on how the application is deployed. Systemd has its own logging system, for instance. Writing to files when deployed in (Docker) containers is annoying. Rotating files are annoying during development.
Now, pipes don't have an "underlying file". There's a Reader end and a Writer end, and that's it. From the docs for PipeWriter:
Close closes the writer; subsequent reads from the read half of the pipe will return no bytes and EOF.
So what happens when you close the writer depends on how Logrus handles EOF on the Reader end. Since Logger.Out is an io.Writer, Logrus cannot possibly call Close on your file.
Your best bet would be to wrap *os.File, perhaps like so:
package main

import "os"

type RotatingFile struct {
    *os.File
    rotate chan struct{}
}

func NewRotatingFile(f *os.File) RotatingFile {
    return RotatingFile{
        File:   f,
        rotate: make(chan struct{}, 1),
    }
}

func (r RotatingFile) Rotate() {
    r.rotate <- struct{}{}
}

func (r RotatingFile) doRotate() error {
    // file rotation logic here
    return nil
}

func (r RotatingFile) Write(b []byte) (int, error) {
    select {
    case <-r.rotate:
        if err := r.doRotate(); err != nil {
            return 0, err
        }
    default:
    }
    return r.File.Write(b)
}
Implementing log file rotation in a robust way is surprisingly tricky. For instance, closing the old file before creating the new one is not a good idea. What if the log directory permissions changed? What if you run out of inodes? If you can't create a new log file you may want to keep writing to the current file. Are you okay with ripping lines apart, or do you only want to rotate after a newline? Do you want to rotate empty files? How do you reliably remove old logs if someone deletes the N-1th file? Will you notice the Nth file or stop looking at the N-2nd?
The best advice I can give you is to leave log rotation to the pros. I like svlogd (part of runit) as a standalone log rotation tool.
Closing the io.PipeWriter will not affect the actual Writer behind it. The chain of close calls is:
PipeWriter.Close() -> PipeWriter.CloseWithError(err error) -> pipe.CloseWrite(err error)
and none of these touch the underlying io.Writer.
To close the actual writer, you just need to close Logger.Out, which is an exported field.
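A minimal sketch of that step, opening the new file before swapping and closing the old one (the function and variable names here are mine, and the usual io, os, and github.com/sirupsen/logrus imports are assumed):
// rotateTo swaps the logger's output to a freshly opened file at path and
// then closes the previous output if it is closable. Error handling is minimal.
func rotateTo(logger *logrus.Logger, path string) error {
    fp, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0755)
    if err != nil {
        return err // keep writing to the current file if the new one cannot be opened
    }
    old := logger.Out
    logger.Out = fp
    if c, ok := old.(io.Closer); ok {
        return c.Close() // closes the previous *os.File, not a PipeWriter
    }
    return nil
}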

Reading from a file from bufio with a semi complex sequencing through file

So there may already be questions like this, but it's not a super easy thing to google. Basically I have a file that's a set of protobufs encoded and sequenced as they normally are per the protobuf spec.
So think of the byte values being chunked something like this throughout the file:
[EncodeVarint(size of protobuf struct)] [protobuf struct bytes]
So a few bytes are read one at a time to decode the varint size, which is then used for the large jump of a read over the protobuf structure.
My implementation, using the os ReadAt method on a file, currently looks something like this:
// getting the next value in a file context feature
func (geobuf *Geobuf_Reader) Next() bool {
    if geobuf.EndPos <= geobuf.Pos {
        return false
    } else {
        startpos := int64(geobuf.Pos)
        for int(geobuf.Get_Byte(geobuf.Pos)) > 127 {
            geobuf.Pos += 1
        }
        geobuf.Pos += 1
        sizebytes := make([]byte, geobuf.Pos-int(startpos))
        geobuf.File.ReadAt(sizebytes, startpos)
        size, _ := DecodeVarint(sizebytes)
        geobuf.Feat_Pos = [2]int{int(size), geobuf.Pos}
        geobuf.Pos = geobuf.Pos + int(size)
        return true
    }
    return false
}

// reads a geobuf feature as geojson
func (geobuf *Geobuf_Reader) Feature() *geojson.Feature {
    // getting raw bytes
    a := make([]byte, geobuf.Feat_Pos[0])
    geobuf.File.ReadAt(a, int64(geobuf.Feat_Pos[1]))
    return Read_Feature(a)
}
How can I implement something like bufio or another chunked-reading mechanism to avoid so many file ReadAt calls? Most bufio examples I've seen assume a specific delimiter. Thanks in advance; hopefully this wasn't a horrible question.
Package bufio
import "bufio"
type SplitFunc
SplitFunc is the signature of the split function used to tokenize the
input. The arguments are an initial substring of the remaining
unprocessed data and a flag, atEOF, that reports whether the Reader
has no more data to give. The return values are the number of bytes to
advance the input and the next token to return to the user, plus an
error, if any. If the data does not yet hold a complete token, for
instance if it has no newline while scanning lines, SplitFunc can
return (0, nil, nil) to signal the Scanner to read more data into the
slice and try again with a longer slice starting at the same point in
the input.
If the returned error is non-nil, scanning stops and the error is
returned to the client.
The function is never called with an empty data slice unless atEOF is
true. If atEOF is true, however, data may be non-empty and, as always,
holds unprocessed text.
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
Use bufio.Scanner and write a custom SplitFunc that tokenizes your varint-prefixed protobuf structs.
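For instance, a rough sketch of such a SplitFunc, assuming DecodeVarint is proto.DecodeVarint from github.com/golang/protobuf/proto (which returns the decoded value and the number of bytes it consumed); the function name is mine:
import (
    "bufio"
    "io"

    "github.com/golang/protobuf/proto"
)

// splitProtobuf tokenizes a stream of [varint length][struct bytes] records.
func splitProtobuf(data []byte, atEOF bool) (advance int, token []byte, err error) {
    size, n := proto.DecodeVarint(data)
    if n == 0 {
        // The varint itself is incomplete: ask for more data, unless the input ended mid-record.
        if atEOF && len(data) > 0 {
            return 0, nil, io.ErrUnexpectedEOF
        }
        return 0, nil, nil
    }
    if uint64(len(data)-n) < size {
        // The struct is not fully buffered yet.
        if atEOF {
            return 0, nil, io.ErrUnexpectedEOF
        }
        return 0, nil, nil
    }
    return n + int(size), data[n : n+int(size)], nil
}
The reading loop then becomes something like:
scanner := bufio.NewScanner(geobuf.File)
scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // raise the 64KB default if features can be larger; the 16MB cap is arbitrary
scanner.Split(splitProtobuf)
for scanner.Scan() {
    feature := Read_Feature(scanner.Bytes()) // copy the bytes first if you keep them past the next Scan
    _ = feature
}
if err := scanner.Err(); err != nil {
    // handle the error
}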

Measure upload speed when using http.ResponseBody

Is there a way to measure a client's download speed when uploading a large quantity of data using an http.ResponseWriter?
Update for context: I'm writing a streaming download endpoint for blob storage which stores blobs in chunks. The files are very large, so loading and buffering whole blobs is not feasible. Being able to monitor the buffer state, bytes written or similar would allow better scheduling of the chunk downloads.
E.g. when Write()ing to the response, is there a way to check how much data is already queued?
An example of the context (though the real code does not read from a file object):
func downloadHandler(w http.ResponseWriter, req *http.Request, ps httprouter.Params) {
    // Open some file.
    f, err := os.Open("somefile.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Adjust the iteration speed of this loop to the client's download speed.
    for {
        data := make([]byte, 1000)
        count, err := f.Read(data)
        if err != nil && err != io.EOF {
            log.Fatal(err)
        }
        if count == 0 {
            break
        }

        // Upload data chunk to client.
        w.Write(data[:count])
    }
}
You could implement a custom http.ResponseWriter that measures the bytes sent and calculates throughput.
There are likely packages to do similar things already. Google found this one (which I haven't used).
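A rough sketch of that idea, with made-up names (not taken from any package):
import (
    "net/http"
    "time"
)

// meteredWriter wraps an http.ResponseWriter and counts the bytes written to it.
type meteredWriter struct {
    http.ResponseWriter
    start time.Time
    bytes int64
}

func (m *meteredWriter) Write(p []byte) (int, error) {
    if m.start.IsZero() {
        m.start = time.Now()
    }
    n, err := m.ResponseWriter.Write(p)
    m.bytes += int64(n)
    return n, err
}

// Throughput returns the average bytes per second since the first Write.
func (m *meteredWriter) Throughput() float64 {
    elapsed := time.Since(m.start).Seconds()
    if elapsed <= 0 {
        return 0
    }
    return float64(m.bytes) / elapsed
}
In the handler you would wrap the writer, e.g. mw := &meteredWriter{ResponseWriter: w}, pass mw to the code writing the chunks, and consult mw.Throughput() between writes to pace the chunk downloads. Note that a Write returning only means the data was handed to the server's buffers, so this is an approximation of the client's actual download speed.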

How to receive binary data in multiple frames in Go

I successfully sent file data in the form of a frame to the websocket. I can split the file data into multiple frames and send them to the websocket, but I don't know how to receive and merge the frames into one data array.
I am doing this to get the progress of sending a file to the websocket:
import (
    "golang.org/x/net/websocket"
    "io/ioutil"
    ...
    ...
)

...
...

var data []byte
err = websocket.Message.Receive(ws, &data)
if err == nil {
    ioutil.WriteFile("/home/img.jpg", data, 0644)
}
Usually it should be unnecessary to split the data into multiple frames and then merge them.
If you want to get progress, you can instead call func (ws *Conn) Read(msg []byte) (n int, err error) with a relatively small buffer, so you don't read the whole message in one go. You may have to always send the file size before the file content (e.g. the first 8 bytes are always the file size, and the file content starts from the 9th), so that you can show progress as sumOfReceivedSize / fileSize after each call to Read.
If you want to split the file data no matter what: simply send the file size in the first message (as 8 bytes), then send all the remaining chunks in separate messages. The recipient keeps receiving messages until it has enough data, i.e. until it reaches the file size, as sketched below.
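A rough sketch of the receiving side under those assumptions (an 8-byte big-endian size message first, then raw binary chunk messages; the encoding is a choice made here, not something the websocket package prescribes):
import (
    "encoding/binary"
    "fmt"
    "io/ioutil"

    "golang.org/x/net/websocket"
)

func receiveFile(ws *websocket.Conn) error {
    // First message: the total file size as 8 big-endian bytes.
    var sizeMsg []byte
    if err := websocket.Message.Receive(ws, &sizeMsg); err != nil {
        return err
    }
    if len(sizeMsg) < 8 {
        return fmt.Errorf("short size message: %d bytes", len(sizeMsg))
    }
    total := binary.BigEndian.Uint64(sizeMsg)

    // Following messages: the file content, chunk by chunk.
    buf := make([]byte, 0, total)
    for uint64(len(buf)) < total {
        var chunk []byte
        if err := websocket.Message.Receive(ws, &chunk); err != nil {
            return err
        }
        buf = append(buf, chunk...)
        fmt.Printf("progress: %.1f%%\n", 100*float64(len(buf))/float64(total))
    }
    return ioutil.WriteFile("/home/img.jpg", buf, 0644)
}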

How to check if a file is a valid image?

I am building a web application.
On one of the pages there is an upload form, where a user can upload a file. After the upload is done, I want to check on the server whether the uploaded file is an image.
Is it possible to check this beyond simple file extension checking (i.e. not assuming that a *.png filename is actually a PNG image)?
For example, if I edit a JPEG image, adding or editing a byte in a random place to make an invalid JPEG file, I want to detect that it is not a JPEG image anymore. I used to do this kind of thing in PHP some time ago, using the GD library.
I would like to know whether this is possible with Go.
DetectContentType is way better than manual magic-number checking. Its use is simple:
clientFile, _, _ := r.FormFile("img") // or get your file from the file system
defer clientFile.Close()

buff := make([]byte, 512) // the docs say DetectContentType considers only the first 512 bytes
if _, err := clientFile.Read(buff); err != nil {
    fmt.Println(err) // do something with that error
    return
}

fmt.Println(http.DetectContentType(buff)) // do something based on your detection
Using this method you are still not guaranteed to have a correct file, so I would recommend also doing some image manipulation with it (like decoding or resizing it) to make sure it really is an image.
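For example, a stricter check could attempt a full decode with the standard image package (a sketch; remember to Seek the upload back to the start after the 512-byte sniffing read above, since FormFile returns a seekable multipart.File):
import (
    "image"
    _ "image/gif"  // register the formats you are willing to accept
    _ "image/jpeg"
    _ "image/png"
    "io"
)

// looksLikeImage reports whether r decodes as one of the registered image formats.
func looksLikeImage(r io.Reader) bool {
    _, _, err := image.Decode(r)
    return err == nil
}
A full decode is more expensive than sniffing the first 512 bytes, but it rejects many files that merely start with the right magic numbers.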
The http package can do this for you:
func DetectContentType(data []byte) string
DetectContentType implements the algorithm described at
http://mimesniff.spec.whatwg.org/ to determine the Content-Type of the
given data. It considers at most the first 512 bytes of data.
DetectContentType always returns a valid MIME type: if it cannot
determine a more specific one, it returns "application/octet-stream".
Code: https://golang.org/src/net/http/sniff.go
What is usually done is checking whether the file has the right magic number for the image format you want. While this test is not super accurate, it is usually good enough. You can use code like this:
package foo

import "strings"

// image formats and magic numbers
var magicTable = map[string]string{
    "\xff\xd8\xff":      "image/jpeg",
    "\x89PNG\r\n\x1a\n": "image/png",
    "GIF87a":            "image/gif",
    "GIF89a":            "image/gif",
}

// mimeFromIncipit returns the mime type of an image file from its first few
// bytes, or the empty string if the file does not look like a known file type.
func mimeFromIncipit(incipit []byte) string {
    incipitStr := string(incipit)
    for magic, mime := range magicTable {
        if strings.HasPrefix(incipitStr, magic) {
            return mime
        }
    }
    return ""
}
