Rate limiting a Ruby file stream

I am working on a project which involves uploading Flash video files to an S3 bucket from a number of geographically distributed nodes.
The video files are about 2-3 MB each, and we only send one file (per node) every ten minutes. However, the bandwidth we consume needs to be rate limited to ~20k/s, as these nodes are delivering streaming media to a CDN, and due to the locations we can only get 512k max upload.
I have been looking into the AWS-S3 gem, and while it doesn't offer any kind of rate limiting, I am aware that you can pass in an IO stream. Given this, I am wondering whether it might be possible to create a rate-limited stream which overrides the read method, adds the rate-limiting logic (e.g., in its simplest form, a call to sleep between reads) and then calls out to the super of the overridden method.
Another option I considered is hacking the code for Net::HTTP and putting the rate limiting into the send_request_with_body_stream method, which uses a while loop, but I'm not entirely sure which would be the best option.
I have attempted to extend the IO class, but that didn't work at all; simply inheriting from the class with class ThrottledIO < IO didn't do anything.
Any suggestions will be greatly appreciated.

You need to use Delegate if you want to "augment" an IO. This puts a "facade" around your IO object that will be used by all "external" readers of the object but will have no effect on the operation of the object itself.
Here's an example for an IO that gets read from. I've extracted it into a gem, since it proved to be generally useful:
http://rubygems.org/gems/progressive_io
There, an aspect is added to all reading methods; I think you might be able to extend that to do basic throttling. Once you're done you will be able to wrap your, say, File, into it:
throttled_file = ProgressiveIO.new(some_file) do |offset, size|
  # compute the rate and, if needed, sleep()
end
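For the throttling itself, something along these lines might work. This is a minimal sketch built on SimpleDelegator rather than the gem's own hooks; the pacing calculation and the 20 KB/s figure are illustrative, not tested against S3:

require 'delegate'

# Sketch of a throttled IO facade: sleeps between reads so the average
# throughput stays near a target bytes-per-second rate. Anything that
# accepts an IO (e.g. S3Object.store from the aws-s3 gem) can be handed
# the wrapped object instead of the raw File.
class ThrottledIO < SimpleDelegator
  def initialize(io, bytes_per_second)
    super(io)
    @bytes_per_second = bytes_per_second
    @started_at = nil
    @bytes_read = 0
  end

  def read(*args)
    @started_at ||= Time.now
    data = __getobj__.read(*args)
    if data
      @bytes_read += data.bytesize
      # If we are ahead of the allowed rate, sleep until we are back on pace.
      expected = @bytes_read.to_f / @bytes_per_second
      elapsed  = Time.now - @started_at
      sleep(expected - elapsed) if expected > elapsed
    end
    data
  end
end

# Usage: cap reads at roughly 20 KB/s
# throttled = ThrottledIO.new(File.open('video.flv', 'rb'), 20 * 1024)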

We've used aiaio's active_resource_throttle to limit requests pulled from the Harvest API on a project at work. I didn't set it up, but it works.

Related

Save file from POST request to disk without storing in memory with Python's BaseHTTPServer

I'm writing an HTTP server in Python 2 with BaseHTTPServer, and it is expected to accept multiple connections at the same time; on each connection the user can send a large file through a POST request. However, my understanding is that the whole request will be stored in the server's memory before being processed, and multiple files uploaded at the same time could exceed the amount of memory on the server. Is there any way to stream the file/request directly to a file on disk instead of storing it in memory?
BaseHTTPServer doesn't come with a POST handler out of the box, so you'll have to implement it yourself or find an implementation that works for you (these are easy to search for; here's one I found that looked straightforward).
Your question is similar to this question about limiting the max size of a POST; the answer points out you'll need to read through all that data in order to ensure proper browser functionality. The comments to that answer suggest using other techniques instead (e.g. AJAX and realtime notifications via WebSocket, per dmitry-nedbaylo).

How to avoid lambda trigger recursive call

I've written a Lambda function that is triggered via an S3 bucket's putObject event. I am modifying the headers of an object post-upload: downloading the object and re-uploading it with the appropriate headers. But because the function itself uses putObject to re-upload the object, the Lambda triggers itself.
Three options:
Use a different API to upload your changes than the one that you have an event on, i.e. if your Lambda is triggered by PUT, then use a POST to modify the content afterwards (tough to do since POST isn't supported well by SDKs AFAIK, so this may not be an option).
Track usage and have a small guard at the beginning of your handler to short-circuit if the only changes made to a file are ones you made (a sketch of such a guard follows at the end of this answer). If you can't programmatically detect the headers you've set, you'll probably need a small Dynamo table or similar for keeping track of which files you've already touched. This will let you abort immediately and only be charged the minimum 100ms fee.
Reorganize your project to have an 'ingest' bucket and an output bucket. Unprocessed files are put into the former, modified, and then placed into the latter. This has a number of advantages. The first is that you don't end up with the current situation, so that's a plus. The second is that you don't have whatever process consumes these modified files potentially pulling an unmodified version. The third is that you get better insight into the process: if something goes wrong, it's easy to see which batches of files have undergone which process.
Overall, I'd recommend option 3 for you, though I know that in my lazier moments I might try to opt for 1 or 2.
Either way, good luck.
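For what it's worth, the guard in option 2 could look roughly like this in Ruby. This is a hedged sketch: the 'processed' metadata key and the structure of the re-upload step are illustrative, not something the S3 event gives you for free.

require 'aws-sdk-s3'

S3 = Aws::S3::Client.new

# Short-circuit when the object that fired the event already carries the
# marker metadata we set on re-upload, so the function does not re-trigger
# itself on its own putObject call.
def handler(event:, context:)
  record = event['Records'].first
  bucket = record['s3']['bucket']['name']
  key    = record['s3']['object']['key']   # may need URI-decoding

  head = S3.head_object(bucket: bucket, key: key)
  return 'already processed' if head.metadata['processed'] == 'true'

  # ... download the object, rewrite its headers, then re-upload it with the
  # marker set so the next invocation bails out above:
  # S3.put_object(bucket: bucket, key: key, body: body,
  #               metadata: { 'processed' => 'true' })
end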

Why does OpenURI treat files under 10kb in size as StringIO?

I fetch images with open-uri from a remote website and persist them on my local server within my Ruby on Rails application. Most of the images were shown without a problem, but some images just didn't show up.
After a very long debugging session I finally found out (thanks to this blogpost) that the reason for this is that the Buffer class in the open-uri library treats files less than 10kb in size as StringIO objects instead of Tempfiles.
I managed to get around this problem by following the answer from Micah Winkelspecht to this StackOverflow question, where I put the following code within a file in my initializers:
require 'open-uri'
# Don't allow downloaded files to be created as StringIO. Force a tempfile to be created.
OpenURI::Buffer.send :remove_const, 'StringMax' if OpenURI::Buffer.const_defined?('StringMax')
OpenURI::Buffer.const_set 'StringMax', 0
This works as expected so far, but I keep wondering why they put this code into the library in the first place. Does anybody know a specific reason why files under 10kb in size get treated as StringIO?
Since the above code practically resets this behaviour globally for my entire application, I just want to make sure that I am not breaking anything else.
When doing network programming, you allocate a buffer of a reasonably large size and send and read units of data which will fit in the buffer. However, when dealing with files (or sometimes things called BLOBs) you cannot assume that the data will fit into your buffer, so you need special handling for these large streams of data.
(Sometimes the units of data which fit into the buffer are called packets. However, packets are really a layer 4 thing, like frames are at layer 2. Since this is happening at layer 7, they might better be called messages.)
For replies larger than 10K, the open-uri library sets up the extra overhead of writing to a stream object. When the reply is under the StringMax size, it just includes the string in the message, since it knows it can fit in the buffer.
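A quick way to see that boundary in action (the URLs are hypothetical, this is illustrative only):

require 'open-uri'

# open-uri keeps a reply in memory while it fits under
# OpenURI::Buffer::StringMax (10 KB by default) and spills it to a
# Tempfile once it grows past that limit.
small = OpenURI.open_uri('http://example.com/tiny-icon.png')   # < 10 KB
large = OpenURI.open_uri('http://example.com/big-video.flv')   # > 10 KB

puts small.class  # => StringIO (held entirely in the in-memory buffer)
puts large.class  # => Tempfile (written to disk as it was read)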

When sending a file via AJAX, does it get read into memory first?

I'm writing an uploader that has to be able to transmit files of any size (up to 30gigs) to the server.
My original intention was to write a java applet that would break the file up into pieces, send those to the server, and then reassemble them there.
However, someone has suggested that AJAX's XMLHttpRequest can do the job in conjunction with nsIFileInputStream
(example here: https://developer.mozilla.org/en/using_xmlhttprequest#Sending_files_using_a_FormData_object )
and by using PUT instead of POST.
I'm worried about 2 things and can't seem to find the answer.
1) Will AJAX attempt to read the file into memory before sending it? (That obviously would break the whole thing.)
[EDIT]
This http://www.codeproject.com/KB/ajax/AJAXFileUpload.aspx?msg=2329446 example explicitly states that they're using ActiveXObject because that DOESN'T load the file into memory... which suggests to me that XMLHttpRequest would load it into memory. I'm surprised I'm having such a hard time finding this info, to be honest.
2) How reliable is this approach? I realize that if the connection just dies the upload would have to resume from scratch, but realistically, how likely is it that, using a standard cable connection with an upload throttle of about .5MB/s, a 30 gig file would arrive at the server?
I'm trying something similar using the File API and blob.slice, but it turned out to chew up memory on large files. However, you could use Google Gears, which plays much better with large sliced files. It also doesn't cause errors with the slice order, which FileReader combined with XHR does frequently and randomly.
I do, however, generally find that uploading files via JavaScript is very unstable.

Streaming files from EventMachine handler?

I am creating a streaming eventmachine server. I'm concerned about avoiding blocking IO or doing anything else to muck up the event loop.
From what I've read, ruby's non-blocking IO can be used to stream files in a non-blocking way, or I can call next_tick, but I'm a little unclear about which of these approaches is preferable.
Part of the problem is that I have not found a good explanation of non-blocking IO library functions in ruby.
Short version:
Assuming a long-lived network IO operation (several wall-clock minutes of streaming per file transfer), what is the best way to do this in EventMachine without gumming up the event loop?
while (bytes = file.read(16 * 1024))  # read the file in chunks
  @conn.send_data bytes               # send each chunk over the connection
end
I understand that the above code will block and I'm wondering what to put in its place. Also, I cannot use the FileStreamer class that is part of eventmachine as is, because I need to manipulate the data after it's read but before it's sent.
I think you can still use FileStreamer. FileStreamer expects its first argument to be a Connection, but this is a loose contract. As long as you implement the methods that FileStreamer expects, it should work. Take a look at this gist:
https://gist.github.com/f4d997c3eeb6bdc5a9f3
The methods you'll need to handle are send_data and send_file_data. You can perform your manipulations here. Then pass the result along to EM::Connection.
Also, from my reading of the code, the special property of FileStreamer is that it allocates a memory-mapped file (unless the file is small). You could do essentially the same thing by opening a regular Ruby File, reading blocks out of it, doing your manipulation, and emulating the behavior of FileStreamer.stream_one_chunk, which is basically:
Each iteration must either send some data to the Connection, or reschedule itself using next_tick
Data can be repeatedly written to the Connection until the outbound buffer is full (according to get_outbound_data_size)
Once the file has been fully read, it should be closed (of course)
In fact, it seems to me that you had better not use FileStreamer unless your file will comfortably fit in memory.
You can look at EM::Protocols for ideas about how to transform the data as it streams through.
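As a rough illustration of that recipe (a sketch only: the chunk size, buffer limit and the transform block are placeholders, but get_outbound_data_size, send_data and EM.next_tick are the real EventMachine calls):

require 'eventmachine'

CHUNK_SIZE   = 16 * 1024
BUFFER_LIMIT = 64 * 1024   # back off once the outbound buffer grows past this

# Stream `file` to `conn`, transforming each chunk before it is sent, and
# yield back to the reactor between bursts so the event loop never blocks.
def stream_file(conn, file, &transform)
  loop do
    # If the connection's outbound buffer is full, reschedule and return.
    if conn.get_outbound_data_size > BUFFER_LIMIT
      EM.next_tick { stream_file(conn, file, &transform) }
      return
    end

    chunk = file.read(CHUNK_SIZE)
    if chunk.nil?           # EOF
      file.close
      return
    end

    conn.send_data(transform ? transform.call(chunk) : chunk)
  end
end

# Usage, from inside an EM::Connection handler:
#   stream_file(self, File.open('video.flv', 'rb')) { |bytes| rewrite(bytes) }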
