Force S3 multipart uploads - fine-uploader

This is a follow-up to knt's question about PUT vs POST, with more details. The answer may be independently useful to future answer-seekers.
can I use PUT instead of POST for uploading using fineuploader?
We have a mostly S3-compatible back-end that supports multipart upload but not form POST (specifically, policy signing). I see in the v5 migration notes that "even if chunking is enabled, a chunked upload request is only sent to traditional endpoints if the associated file must be broken into more than 1 chunk". How is the threshold determined for whether a file needs to be chunked? How can the threshold be adjusted? (or ideally, set to zero)
Thanks,

Fine Uploader will chunk a file if its size is greater than the number of bytes specified in the chunking.partSize option (default: 2000000 bytes). If your file is smaller than that value, it will not be chunked.
To effectively set it to "zero", you could just increase the partSize to an extremely large value. I also did some experimenting, and it seems like a partSize of -1 will make Fine Uploader NOT chunk files at all. AFAIK, that is not supported behavior, and I have not looked at why that is even possible.
Note that S3 requires each chunk to be a minimum of 5MB (only the final chunk may be smaller).
Also, note that you may run into limitations on request size as imposed by certain browsers if you make partSize extremely large.

Related

How to increase the size of request in ab benchmarking tool?

I am testing with ab - Apache HTTP server benchmarking tool.
Can I increase the size of the request in ab? (Right now I see that one request has a size of 146 bytes.)
I tried to increase the size of the TCP send/receive buffer (the -b option), but it does not seem to work, because I still see "Total transferred" as 146 bytes.
Do you know any way to increase the size of the request (by changing the source code or something)?
Or, if that is impossible, can you suggest some tools which are similar to ab but can increase the size of the request?
Thank you so much!
Although the -b option does seem like it should have worked, I can't say for sure, as I haven't used it.
Alternatively, have you tried sending a large dummy file in your POST request? That can be accomplished with the -p option followed by, for instance, a plain-text file that you either create yourself or find by Googling something like "generate large file in bytes online", download, and pass into the command.
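For instance, something along these lines might do it (the payload file name, size, request counts, and URL here are just placeholders):

    dd if=/dev/zero of=payload.txt bs=1K count=512
    ab -n 100 -c 10 -p payload.txt -T "text/plain" http://localhost:8080/upload

If ab accepts the payload, the upload size should then show up in the report under a line like "Total body sent" rather than "Total transferred", which counts response bytes.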
As far as alternatives go, I've heard that another open-source project from HP called httperf is a great option as well, though I doubt we won't be able to figure out how to do this with Apache Benchmark.

Unable to send data in chunks to server in Golang

I'm completely new to Golang. I am trying to send a file from the client to the server. The client should split it into smaller chunks and send them to the REST endpoint exposed by the server. The server should combine those chunks and save the file.
This is the client and server code I have written so far. When I run this to copy a file of 39 bytes, the client sends two requests to the server, but the server displays the following errors.
2017/05/30 20:19:28 Was not able to access the uploaded file: unexpected EOF
2017/05/30 20:19:28 Was not able to access the uploaded file: multipart: NextPart: EOF
You are dividing the buffer holding the file into separate chunks and sending each of them as a separate HTTP message. This is not how multipart is intended to be used.
Multipart MIME means that a single HTTP message may contain one or more entities. Quoting the HTTP RFC:
MIME provides for a number of "multipart" types -- encapsulations of
one or more entities within a single message-body. All multipart types
share a common syntax, as defined in section 5.1.1 of RFC 2046
You should send the whole file in a single HTTP message (the file contents should be a single entity). The HTTP protocol will take care of the rest, but you may want to consider FTP if the files you are planning to transfer are large (like > 2GB).
If you are using multipart/form-data, then the expectation is that you take the entire file and send it up as a single byte stream. Go can handle multi-gigabyte files easily this way. But your code needs to be smart about this.
ioutil.ReadAll(r.Body) is out of the question unless you know for sure that the file will be very small. Please don't do this.
Use a multipart reader: multipartReader, err := r.MultipartReader(). This will iterate over the uploaded files, in the order they are included in the encoding. This is important because you can keep the file entirely out of memory and do a Copy from one file handle to another; this is how large files are handled easily.
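A minimal sketch of that pattern, assuming a made-up /upload route and a /tmp/uploads directory that already exists (error handling trimmed to the basics):

    package main

    import (
        "io"
        "log"
        "net/http"
        "os"
        "path/filepath"
    )

    // uploadHandler streams each uploaded part straight to disk, so memory
    // use stays small no matter how large the file is.
    func uploadHandler(w http.ResponseWriter, r *http.Request) {
        mr, err := r.MultipartReader()
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        for {
            part, err := mr.NextPart()
            if err == io.EOF {
                break // no more parts in the encoding
            }
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            if part.FileName() == "" {
                continue // skip non-file form fields
            }
            // /tmp/uploads is a made-up destination; it must already exist.
            dst, err := os.Create(filepath.Join("/tmp/uploads", filepath.Base(part.FileName())))
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            // Copy from one file handle to the other without buffering the
            // whole file in memory.
            if _, err := io.Copy(dst, part); err != nil {
                dst.Close()
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            dst.Close()
        }
        w.WriteHeader(http.StatusCreated)
    }

    func main() {
        http.HandleFunc("/upload", uploadHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }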
You will have issues with middle-boxes and reverse proxies. We had to change defaults in Nginx so that it would not cut off large files. Nginx (or whatever reverse proxy you use) will need to cooperate, as such proxies often default to some fairly small maximum request size, like 300MB.
Even if you think you dealt with this issue on upload with some file-part trick, you will then need to deal with large files on download. Go can serve single large files very efficiently by doing a Copy from file handle to file handle. You will also end up needing to support partial content (HTTP 206) and not modified (304) if you want great performance for downloading the files that you uploaded. Some browsers will ignore your pleas not to ask for partial content when things like large video are involved. So, if you don't support this, some content will fail to download.
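On the download side, Go's standard library already negotiates Range (206) and If-Modified-Since (304) if you hand it an io.ReadSeeker and a modification time. A rough sketch, reusing the hypothetical /tmp/uploads layout and the same imports as the upload sketch above:

    // downloadHandler serves a stored file and lets net/http negotiate
    // partial content (206) and not modified (304) responses.
    func downloadHandler(w http.ResponseWriter, r *http.Request) {
        name := filepath.Base(r.URL.Path)
        f, err := os.Open(filepath.Join("/tmp/uploads", name))
        if err != nil {
            http.NotFound(w, r)
            return
        }
        defer f.Close()
        fi, err := f.Stat()
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        // ServeContent seeks within the file for Range requests and sets
        // Last-Modified so clients can revalidate and get a 304.
        http.ServeContent(w, r, name, fi.ModTime(), f)
    }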
If you want to use some tricks to cut up files and send them in parts, then you will end up needing to use a particular JavaScript library. This is going to be quite harmful to interoperability if you are going for programmatic access from any client to your Go server. But maybe you can't fix the middle-boxes that impose size limits, and you really want to cut files up into chunks. You will then have a lot of work to do to handle downloading the files that you managed to upload in chunks.
What you are trying to do is the typical code that is written over a raw TCP connection in most other languages. In Go you can use TCP too, with net.Listen and then Accept on the listener object; that should work fine.
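For completeness, a bare-bones sketch of that raw TCP approach (the port and destination path are arbitrary, and it assumes the client simply streams the file bytes and closes the connection):

    package main

    import (
        "io"
        "log"
        "net"
        "os"
    )

    func main() {
        ln, err := net.Listen("tcp", ":9000") // arbitrary port
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go func(c net.Conn) {
                defer c.Close()
                f, err := os.Create("/tmp/received.bin") // arbitrary destination
                if err != nil {
                    log.Println(err)
                    return
                }
                defer f.Close()
                // Stream the bytes straight from the socket to disk.
                if _, err := io.Copy(f, c); err != nil {
                    log.Println(err)
                }
            }(conn)
        }
    }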

Handling large files with Azure search blob extractor

I am receiving errors from the blob extractor that files are too large for the current tier, which is Basic. I will be upgrading to a higher tier, but I notice that the max size is currently 256MB.
When I have PPTX files that are mostly video and audio, but have text I'm interested in, is there a way to index those? What does the blob extractor max file size actually mean?
Can I tell the extractor to only take the first X MB or chars and just stop?
There are two related limits in the blob indexer:
1. The max file size limit, which is what you are hitting. If the file size exceeds that limit, the indexer doesn't attempt to download it and produces an error to make sure you are aware of the issue. The reason we don't just take the first N bytes is that, for parsing many formats correctly, the entire file is needed. You can mark blobs as skippable or configure the indexer to ignore a number of errors if you want it to make forward progress when encountering blobs that are too large.
2. The max size of extracted text. In case a file contains more text than that, the indexer takes the first N characters up to the limit and includes a warning so you can be aware of the issue. Content that doesn't get extracted (such as video, at least today) doesn't contribute to this limit, of course.
How large are the PPTX you need indexed? I'll add my contact info in a comment.

NiFi-1.0 - content_repo & flowfile_repo

I have a flow, pretty big, which takes a CSV and eventually converts it to SQL statements (via Avro and JSON).
For a file of 5GB, flowfile_repo (while processing) went up to 24 GB and content_repo to 18 GB.
content_repo max 18 GB
flowfile_repo max 26 GB
Is there a way to predict how much space I would need for processing N files?
Why does it take so much space?
The flow file repo is check-pointed every 2 minutes by default, and is storing the state of every flow file as well as the attributes of every flow file. So it really depends on how many flow files and how many attributes per flow file are being written during that 2 minute window, as well as how many processors the flow files are passing through and how many of them are modifying the attributes.
The content repo is storing content claims, where each content claim contains the content of one or more flow files. Periodically there is a clean up thread that runs and determines if a content claim can be cleaned up. This is based on whether or not you have archiving enabled. If you have it disabled, then a content claim can be cleaned up when no active flow files reference any of the content in that claim.
The flow file content also follows a copy-on-write pattern, meaning the content is immutable and when a processor modifies the content it is actually writing a new copy. So if you had a 5GB flow file and it passed through a processor that modified the content like ReplaceText, it would write another 5GB to the content repo, and the original one could be removed based on the logic above about archiving and whether or not any flow files reference that content.
If you are interested in more info, there is an in-depth document about how all this works here:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
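The relevant knobs live in nifi.properties; as a rough reference, these are the property names (the values shown are the usual NiFi 1.x defaults, so double-check them against your install):

    # How often the flow file repository is check-pointed
    nifi.flowfile.repository.checkpoint.interval=2 mins
    # Whether content claims are archived rather than deleted immediately
    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%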

How to tell if end of remote file is reached using Winsock (VB6)?

I am developing a program on top of a basecode that takes care of the low-level socket programming. The problem seems to be that the only way it knows (apparently) when to stop and close the connection is when the number of bytes received reaches the amount given in the "Content-Length" field. But since that field is not set by many sites, I don't know how to tell it to stop. What happens now is that in those cases the entire file is downloaded but the connection is still kept open.
There must be something to look for in the incoming data/messages? Thanks.
Since you mention a Content-Length header, are you downloading from an HTTP server? If so, then the Content-Length header is omitted when a Transfer-Encoding header indicates that a chunked transfer is being used, which means the data is sent in small sequential chunks (thus the full size cannot be reported in the Content-Length header ahead of time). Each chunk has its own header that specifies the size of that chunk. You need to parse and discard each chunk header so you do not save them in your target file, only the chunk data. The end of the file is reached when you receive a chunk header that reports a chunk data size of 0.
Read RFC 2616 Section 3.6.1 for more details.
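To illustrate the framing described above, here is a rough Go sketch (not VB6) of decoding a chunked body; it ignores trailers and chunk extensions, so treat it as an outline rather than a complete implementation:

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "strconv"
        "strings"
    )

    // readChunkedBody consumes a Transfer-Encoding: chunked body from r and
    // writes only the chunk data (never the chunk headers) to w.
    func readChunkedBody(r *bufio.Reader, w io.Writer) error {
        for {
            // Each chunk starts with its size, in hex, on its own line.
            line, err := r.ReadString('\n')
            if err != nil {
                return err
            }
            sizeField := strings.TrimSpace(strings.SplitN(line, ";", 2)[0])
            size, err := strconv.ParseInt(sizeField, 16, 64)
            if err != nil {
                return err
            }
            if size == 0 {
                // A zero-size chunk marks the end of the body
                // (trailers, if any, are ignored here).
                return nil
            }
            // Copy exactly `size` bytes of chunk data to the output.
            if _, err := io.CopyN(w, r, size); err != nil {
                return err
            }
            // Discard the CRLF that terminates the chunk data.
            if _, err := r.ReadString('\n'); err != nil {
                return err
            }
        }
    }

    func main() {
        body := "5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
        var out strings.Builder
        if err := readChunkedBody(bufio.NewReader(strings.NewReader(body)), &out); err != nil {
            fmt.Println("error:", err)
            return
        }
        fmt.Println(out.String()) // prints "Hello, world"
    }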
