I'm completely new to Golang. I am trying to send a file from the client to the server. The client should split it into smaller chunks and send them to the REST endpoint exposed by the server. The server should combine those chunks and save the file.
This is the client and server code I have written so far. When I run it to copy a file of 39 bytes, the client sends two requests to the server, but the server displays the following errors.
2017/05/30 20:19:28 Was not able to access the uploaded file: unexpected EOF
2017/05/30 20:19:28 Was not able to access the uploaded file: multipart: NextPart: EOF
You are dividing the buffer holding the file into separate chunks and sending each of them as a separate HTTP message. This is not how multipart is intended to be used.
multipart MIME means that a single HTTP message may contain one or more entities. Quoting the HTTP RFC:
MIME provides for a number of "multipart" types -- encapsulations of one or more entities within a single message-body. All multipart types share a common syntax, as defined in section 5.1.1 of RFC 2046
You should send the whole file in a single HTTP message (the file contents should be a single entity). The HTTP protocol will take care of the rest, but you may consider using FTP if the files you plan to transfer are large (say, > 2GB).
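For illustration, here is a minimal sketch of such a client in Go, streaming the file into a single multipart/form-data request. The URL, form field name and file path are placeholders, not taken from the question's code.

package main

import (
    "io"
    "log"
    "mime/multipart"
    "net/http"
    "os"
    "path/filepath"
)

// upload sends one file as a single multipart/form-data request,
// streaming it through a pipe so it is never fully buffered in memory.
func upload(url, path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    pr, pw := io.Pipe()
    mw := multipart.NewWriter(pw)

    go func() {
        part, err := mw.CreateFormFile("file", filepath.Base(path)) // "file" is an arbitrary field name
        if err != nil {
            pw.CloseWithError(err)
            return
        }
        if _, err := io.Copy(part, f); err != nil {
            pw.CloseWithError(err)
            return
        }
        pw.CloseWithError(mw.Close()) // finish the multipart body
    }()

    resp, err := http.Post(url, mw.FormDataContentType(), pr)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}

func main() {
    // Placeholder endpoint and file name.
    if err := upload("http://localhost:8080/upload", "big.bin"); err != nil {
        log.Fatal(err)
    }
}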
If you are using multipart/form-data, then you are expected to take the entire file and send it up as a single byte stream. Go can handle multi-gigabyte files easily this way, but your code needs to be smart about it.
ioutil.ReadAll(r.Body) is out of the question unless you know for sure that the file will be very small. Please don't do this.
Use a multipart reader instead: multipartReader, err := r.MultipartReader(). This will iterate over the uploaded files in the order they are included in the encoding. This is important because you can keep the file entirely out of memory and do a Copy from one file handle to another. This is how large files are handled easily.
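A sketch of that streaming approach on the server side. The route and the destination directory are assumptions for illustration, and the directory must already exist.

package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "path/filepath"
)

// uploadHandler iterates over the multipart parts and streams each file
// part straight to disk, so nothing large is held in memory.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
    mr, err := r.MultipartReader()
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    for {
        part, err := mr.NextPart()
        if err == io.EOF {
            break // no more parts
        }
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        if part.FileName() == "" {
            continue // skip ordinary form fields
        }
        // "/tmp/uploads" is an assumption; it must already exist.
        dst, err := os.Create(filepath.Join("/tmp/uploads", filepath.Base(part.FileName())))
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        _, err = io.Copy(dst, part) // file handle to file handle, no big buffer
        dst.Close()
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
    }
    w.WriteHeader(http.StatusCreated)
}

func main() {
    http.HandleFunc("/upload", uploadHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}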
You will have issues with middle-boxes and reverse proxies. We had to change defaults in Nginx so that it would not cut off large files. Nginx (or whatever reverse proxy you might use) will need to cooperate, as they often default to a small maximum request size (e.g. client_max_body_size in Nginx), like 300MB.
Even if you think you have dealt with this issue on upload with some file-part trick, you will then need to deal with large files on download. Go can serve single large files very efficiently by doing a Copy from file handle to file handle. You will also end up needing to support partial content (HTTP 206) and not modified (304) if you want great performance for downloading the files that you uploaded. Some browsers will ignore your pleas not to ask for partial content when things like large video are involved, so if you don't support this, some content will fail to download.
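For the download side, Go's standard library already implements the Range (206) and If-Modified-Since (304) handling in http.ServeContent, so a handler can stay small. A sketch, with made-up route and path:

package main

import (
    "log"
    "net/http"
    "os"
)

func downloadHandler(w http.ResponseWriter, r *http.Request) {
    // Hypothetical fixed path; a real handler would derive it from the request.
    f, err := os.Open("/tmp/uploads/big.bin")
    if err != nil {
        http.NotFound(w, r)
        return
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // ServeContent copies straight from the file handle and answers
    // Range (206) and conditional (304) requests for us.
    http.ServeContent(w, r, fi.Name(), fi.ModTime(), f)
}

func main() {
    http.HandleFunc("/download", downloadHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}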
If you want to use tricks to cut up files and send them in parts, then you will end up needing a particular JavaScript library. This is going to be quite harmful to interoperability if you are going for programmatic access to your Go server from arbitrary clients. But maybe you can't fix the middle-boxes that impose size limits, and you really do want to cut files into chunks. In that case, you will have a lot of work to handle downloading the files that you managed to upload in chunks.
What you are trying to do is the kind of code that is typically written over a raw TCP connection in most other languages. In Go you can use TCP too, with net.Listen and then Accept on the listener object. Then this should be fine.
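A minimal sketch of that raw-TCP alternative, under some assumptions of my own: the port, the output file name and the one-file-per-connection convention are all placeholders. The client simply writes the file to the connection and closes it, and the server copies whatever arrives into a file.

package main

import (
    "io"
    "log"
    "net"
    "os"
)

func main() {
    ln, err := net.Listen("tcp", ":9000") // port is a placeholder
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Print(err)
            continue
        }
        go func(c net.Conn) {
            defer c.Close()
            // One file per connection; the fixed name is only for the sketch
            // and would be overwritten by the next connection.
            f, err := os.Create("received.bin")
            if err != nil {
                log.Print(err)
                return
            }
            defer f.Close()
            // The client writes the file to the connection and closes it;
            // Copy returns when the client is done.
            if _, err := io.Copy(f, c); err != nil {
                log.Print(err)
            }
        }(conn)
    }
}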
Related
I want to download a file that is around 60GB in size.
My internet speed is 100 Mbps, but the download is not utilizing my entire bandwidth.
If I use aria2c to download this single file, can I take advantage of an increased "connections per server" setting? It seems aria2c lets me use a maximum of 16 connections. Would this option even work for downloading a single file?
The way I'm visualizing the download is that one connection tries to download from one sector of the file while another connection downloads from a different sector, and the optimal number of concurrent connections is whatever saturates the host's bandwidth limit (mine being 100 Mbps). And when two connections collide on the sector they are downloading, aria2c would immediately see that that sector is already downloaded and skip to a different one. Is this how it plays out when using multiple connections for a single file?
Is this how it plays out when using multiple connections for a single file?
The HTTP standard provides the Range request header, which allows a client to say, for example: I want part of the file, starting at byte X and ending at byte Y. If the server supports this, it responds with 206 Partial Content. Thus, knowing the length (size) of the file (see Content-Length), it is possible to lay out the parts so that they are disjoint and cover the whole file.
Beware that not all servers support this. You need to check whether the server hosting the file you want to download does. This can be done with a HEAD request; the HTTP range requests documentation provides an example using curl:
curl -I http://i.imgur.com/z4d4kWk.jpg
HTTP/1.1 200 OK
...
Accept-Ranges: bytes
Content-Length: 146515
If Accept-Ranges contains bytes, the server supports range requests. If you wish, you can use any other tool that is able to send a HEAD request and show you the response headers.
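The same check, sketched in Go for anyone who prefers code to curl: a HEAD request to read Accept-Ranges, then a ranged GET that should come back as 206 Partial Content if the server honours Range. The range of 1 KiB is arbitrary.

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    const url = "http://i.imgur.com/z4d4kWk.jpg"

    // 1. HEAD: does the server advertise byte ranges?
    head, err := http.Head(url)
    if err != nil {
        log.Fatal(err)
    }
    head.Body.Close()
    fmt.Println("Accept-Ranges:", head.Header.Get("Accept-Ranges"))
    fmt.Println("Content-Length:", head.ContentLength)

    // 2. GET with a Range header: expect "206 Partial Content".
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Range", "bytes=0-1023") // first 1 KiB only
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("Status:", resp.Status) // 206 if ranges are supported
}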
My use case requires transcoding a remote MOV file that can't be stored locally. I was hoping to use the HTTP protocol to stream the file into ffmpeg. This works, but I'm observing it to be a very expensive operation with (seemingly) redundant network traffic, so I am looking for suggestions.
What I see is that ffmpeg starts out with a Range request "0-" (which brings in the entire file), followed by a number of open-ended requests (no ending offset) at different positions, each of which makes the HTTP server return large chunks of the file again and again, from the starting position to the very end.
For example, the HTTP range requests for a short 10MB file look like this:
bytes=0-
bytes=10947419-
bytes=36-
bytes=3153008-
bytes=5876422-
Is there another input method that would be more network-efficient for my use case? I control the server where the video file resides, so I’m flexible in what code runs there.
Any help is greatly appreciated
This is a follow-up to knt's question about PUT vs POST, with more details. The answer may be independently more useful to future answer-seekers.
can I use PUT instead of POST for uploading using fineuploader?
We have a mostly S3-compatible back-end that supports multipart upload, but not form POST, specifically policy signing. I see in the v5 migration notes that "even if chunking is enabled, a chunked upload request is only sent to traditional endpoints if the associated file must be broken into more than 1 chunk". How is the threshold determined for whether a file needs to be chunked? How can the threshold be adjusted? (or ideally, set to zero)
Thanks,
Fine Uploader will chunk a file if its size is greater than the number of bytes specified in the chunking.partSize option (the default value is 2000000 bytes). If your file is smaller than the size specified in that value, then it will not be chunked.
To effectively set it to "zero", you could just increase the partSize to an extremely large value. I also did some experimenting, and it seems like a partSize of -1 will make Fine Uploader NOT chunk files at all. AFAIK, that is not supported behavior, and I have not looked at why that is even possible.
Note that S3 requires chunks to be a minimum of 5MB.
Also, note that you may run into limitations on request size as imposed by certain browsers if you make partSize extremely large.
I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?
Most of the time we prefer parsing the entire file that has been pulled into memory because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily, if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we’re interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.
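If it helps to see the streaming idea in another language, here is a sketch using Go's encoding/xml, purely for comparison; the principle is the same as Nokogiri's SAX parser: react to tokens/events instead of loading the whole document. The file name and the <record> element are made up.

package main

import (
    "encoding/xml"
    "fmt"
    "log"
    "os"
)

func main() {
    f, err := os.Open("huge.xml") // placeholder file name
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    dec := xml.NewDecoder(f)
    for {
        tok, err := dec.Token()
        if err != nil {
            break // io.EOF when the document ends
        }
        if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "record" {
            // Decode just this element; memory use stays bounded.
            var rec struct {
                ID string `xml:"id,attr"`
            }
            if err := dec.DecodeElement(&rec, &se); err != nil {
                log.Fatal(err)
            }
            fmt.Println(rec.ID)
        }
    }
}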
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2, or use a hosted provider such as https://hostedftp.com/
I'm working on a project right now where I need to read header data from files on remote servers. I'm talking about many large files, so I can't read the whole files, just the header data I need.
The only solution I have is to mount the remote server with FUSE and then read the header from the files as if they were on my local computer. I've tried it and it works. But it has some drawbacks, especially with FTP:
Really slow (FTP via curlftpfs, compared to SSH). From the same server, 90 files were read in 18 seconds over SSH, but only 10 files in 39 seconds over FTP.
Not dependable: sometimes the mount point will not be unmounted.
If the server is active and a passive mount is done, that mount point and the parent folder get locked within about 3 minutes.
It times out, even when there is a data transfer going on (I guess this is the FTP protocol and not curlftpfs).
FUSE is a solution, but I don't like it very much because I don't feel I can trust it. So my question is basically whether there are any other solutions to the problem. The language is preferably Ruby, but any other will work if Ruby does not support the solution.
Thanks!
What type of information are you looking for?
You could try using Ruby's open-uri module.
The following example is from http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html
require 'open-uri'

open("http://www.ruby-lang.org/en") {|f|
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}
EDIT: It seems that the OP wanted to retrieve ID3 tag information from the remote files. This is more complex.
This appears to be a difficult problem. From the Wikipedia article on ID3:
Tag location within file
Only with the ID3v2.4 standard has it been possible to place the tag data at the end of the file, in common with ID3v1. ID3v2.2 and 2.3 require that the tag data precede the file. Whilst for streaming data this is absolutely required, for static data it means that the entire audio file must be updated to insert data at the front of the file. For initial tagging this incurs a large penalty as every file must be re-written. Tag writers are encouraged to introduce padding after the tag data in order to allow for edits to the tag data without requiring the entire audio file to be re-written, but these are not standard and the tag requirements may vary greatly, especially if APIC (associated pictures) are also embedded.
This means that depending on the ID3 tag version of the file, you may have to read different parts of the file.
Here's an article that outlines the basics of reading ID3 tags using Ruby. It covers ID3 tag v1.1, but it should serve as a good starting point: http://rubyquiz.com/quiz136.html
You could also look into using an ID3 parsing library, such as id3.rb or id3lib-ruby; however, I'm not sure whether either supports parsing a remote file (most likely they could with some modifications).
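If the remote files are also reachable over HTTP and the server honours Range (see the range discussion earlier on this page), you can fetch just the slices an ID3 reader needs instead of whole files. A sketch in Go rather than Ruby, purely to show the HTTP mechanics; the URL is a placeholder and the server supporting Range is an assumption.

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

// fetchRange downloads only the requested byte range, e.g. "bytes=0-9"
// for an ID3v2 header or "bytes=-128" for a trailing ID3v1 tag.
func fetchRange(url, rng string) ([]byte, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Range", rng)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusPartialContent {
        return nil, fmt.Errorf("server ignored Range, got %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

func main() {
    const url = "http://example.com/song.mp3" // placeholder

    head, err := fetchRange(url, "bytes=0-9") // ID3v2 header, if present
    if err != nil {
        log.Fatal(err)
    }
    tail, err := fetchRange(url, "bytes=-128") // ID3v1 tag lives in the last 128 bytes
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("got %d header bytes and %d trailing bytes\n", len(head), len(tail))
}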
A "best-as-nothing" solution would be to start the transfer, and stop it when dowloaded file has more than bytes. Since not many (if any) libraries will allow interruption of the connection, it is more complex and will probably require you to manually code a specific ftp client, with two threads, one doing the FTP connection and transfer, and the other monitoring the size of the downloaded file and killing the first thread.
Or, at least, you could parallelize the file transfers, so that you don't have to wait for all the files to be fully transferred before analyzing the start of each one; the transfers would then continue in the background.
There has been a proposal for a RANG command, which would allow retrieving only part of a file (here, the first bytes).
However, I didn't find any reference to this proposal being adopted, nor any implementation.
So, for a specific server, it could be worth testing for it (or checking the FTP server's docs) and using it if available.