Streaming files from an EventMachine handler? (ruby)

I am creating a streaming eventmachine server. I'm concerned about avoiding blocking IO or doing anything else to muck up the event loop.
From what I've read, ruby's non-blocking IO can be used to stream files in a non-blocking way, or I can call next_tick, but I'm a little unclear about which of these approaches is preferable.
Part of the problem is that I have not found a good explanation of non-blocking IO library functions in ruby.
Short version:
Assuming a long-lived network IO operation (several wall-clock minutes of streaming per file transfer), what is the best way to do this in EventMachine without gumming up the event loop?
while true
  bytes = file.read(16 * 1024)   # blocking read of one chunk
  break unless bytes
  conn.send_data bytes           # (after manipulating the data)
end
I understand that the above code will block and I'm wondering what to put in its place. Also, I cannot use the FileStreamer class that is part of eventmachine as is, because I need to manipulate the data after it's read but before it's sent.

I think you can still use FileStreamer. FileStreamer expects its first argument to be a Connection, but this is a loose contract. As long as you implement the methods that FileStreamer expects, it should work. Take a look at this
https://gist.github.com/f4d997c3eeb6bdc5a9f3
The methods you'll need to handle are send_data and send_file_data. You can perform your manipulations here. Then pass the result along to EM::Connection.
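As a rough sketch of that loose contract: the TransformingSink name, the transform block, and the scramble/path names in the usage line are invented here for illustration, not part of EventMachine.

class TransformingSink
  def initialize(connection, &transform)
    @connection = connection   # the real EM::Connection
    @transform  = transform    # whatever manipulation you need
  end

  # FileStreamer hands chunks (or whole small files) to these two methods.
  def send_data(data)
    @connection.send_data(@transform.call(data))
  end

  def send_file_data(filename)
    send_data(File.binread(filename))
  end

  # FileStreamer also paces itself on the outbound buffer size.
  def get_outbound_data_size
    @connection.get_outbound_data_size
  end
end

# EM::FileStreamer.new(TransformingSink.new(conn) { |b| scramble(b) }, path)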
Also, from my reading of the code, the special property of FileStreamer is that it allocates a memory-mapped file (unless the file is small). You could do essentially the same thing by opening a regular Ruby File, reading blocks out of it, doing your manipulation, and emulating the behavior of FileStreamer#stream_one_chunk (see the sketch after this list), which is basically:
Each iteration must either send some data to the Connection, or reschedule itself using next_tick
Data can be repeatedly written to the Connection until the outbound buffer is full (according to get_outbound_data_size)
Once the file has been fully read, it should be closed (of course)
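A hand-rolled version of that loop might look roughly like this; the chunk size, buffer limit, and the transform call are placeholders, not EventMachine APIs.

CHUNK_SIZE   = 16 * 1024   # placeholder
BUFFER_LIMIT = 64 * 1024   # placeholder back-pressure threshold

def stream_file(conn, file)
  if conn.get_outbound_data_size > BUFFER_LIMIT
    # Outbound buffer is full: don't send, just reschedule ourselves.
    EM.next_tick { stream_file(conn, file) }
    return
  end

  if (chunk = file.read(CHUNK_SIZE))
    conn.send_data(transform(chunk))          # your manipulation goes here
    EM.next_tick { stream_file(conn, file) }  # yield back to the reactor
  else
    file.close                                # fully read, so close it
  end
end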
In fact, it seems to me that you had better not use FileStreamer unless your file will comfortably fit in memory.
You can look at the EM::Protocols for ideas about how to transform the data as it is streaming through.

Related

Is there any way to use IOCP to notify when a socket is readable / writeable?

I'm looking for some way to get a signal on an I/O completion port when a socket becomes readable/writeable (i.e. the next send/recv will complete immediately). Basically I want an overlapped version of WSASelect.
(Yes, I know that for many applications, this is unnecessary, and you can just keep issuing overlapped send calls. But in other applications you want to delay generating the message to send until the last moment possible, as discussed e.g. here. In these cases it's useful to do (a) wait for socket to be writeable, (b) generate the next message, (c) send the next message.)
So far the best solution I've been able to come up with is to spawn a thread just to call select and then PostQueuedCompletionStatus, which is awful and not particularly scalable... is there any better way?
It turns out that this is possible!
Basically the trick is:
Use the WSAIoctl SIO_BASE_HANDLE to peek through any "layered service providers"
Use DeviceIoControl to submit an AFD_POLL request for the base handle, to the AFD driver (this is what select does internally)
There are many, many complications that are probably worth understanding, but at the end of the day the above should just work in practice. This is supposed to be a private API, but libuv uses it, and MS's compatibility policies mean that they will never break libuv, so you're fine. For details, read the thread starting from this message: https://github.com/python-trio/trio/issues/52#issuecomment-424591743
For detecting that a socket is readable, it turns out that there is an undocumented but well-known piece of folklore: you can issue a "zero byte read", i.e., an overlapped WSARecv with a zero-byte receive buffer, and that will not complete until there is some data to be read. This has been recommended for servers that are trying to do simultaneous reads from a large number of mostly-idle sockets, in order to avoid problems with memory usage (apparently IOCP receive buffers get pinned into RAM). An example of this technique can be seen in the libuv source code. They also have an additional refinement, which is that to use this with UDP sockets, they issue a zero-byte receive with MSG_PEEK set. (This is important because without that flag, the zero-byte receive would consume a packet, truncating it to zero bytes.) MSDN claims that you can't combine MSG_PEEK with overlapped I/O, but apparently it works for them...
Of course, that's only half of an answer, because there's still the question of detecting writability.
It's possible that a similar "zero-byte send" trick would work? (Used directly for TCP, and adding the MSG_PARTIAL flag on UDP sockets, to avoid actually sending a zero-byte packet.) Experimentally I've checked that attempting to do a zero-byte send on a non-writable non-blocking TCP socket returns WSAEWOULDBLOCK, so that's a promising sign, but I haven't tried with overlapped I/O. I'll get around to it eventually and update this answer; or alternatively if someone wants to try it first and post their own consolidated answer then I'll probably accept it :-)

What do I need the Socket::MSG_* constants for in Ruby?

I want to develop a P2P app which communicates via UDPSockets. I'm just starting to read the docs and I couldn't understand that part of Ruby's socket API.
Specifically, it's possible to pass those "flags", as ruby-doc calls them, to every send call (http://www.ruby-doc.org/stdlib-1.9.3/libdoc/socket/rdoc/UDPSocket.html#method-i-send).
But when do I use those and how?
You'll probably know if you need to use them as you'll have an example or some documentation that refers to them.
Some of the more common options used with recvfrom are: MSG_OOB to process out-of-band data, MSG_PEEK to peek at the incoming message without de-queueing it, and MSG_WAITALL to wait for the receive buffer to fill up.
These are really quite edge-case so you probably won't ever see one used.
Those flags come from the low-level recv call on which Socket is based.
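As a small illustration, MSG_PEEK is probably the easiest one to see in action; this sketch uses loopback UDP sockets and a made-up payload.

require 'socket'

receiver = UDPSocket.new
receiver.bind('127.0.0.1', 0)           # bind to any free port
port = receiver.addr[1]

sender = UDPSocket.new
sender.send('hello', 0, '127.0.0.1', port)

# Peek at the datagram without removing it from the queue...
peeked, _addr = receiver.recvfrom(1024, Socket::MSG_PEEK)
# ...so a plain recvfrom afterwards still returns the same datagram.
data, _addr = receiver.recvfrom(1024)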

Can a read() by one process see a partial write() by another?

If one process does a write() of size (and alignment) S (e.g. 8KB), is it possible for another process to do a read (also of size and alignment S, on the same file) that sees a mix of old and new data?
The writing process adds a checksum to each data block, and I'd like to know whether I can use a reading process to verify the checksums in the background. If the reader can see a partial write, then it will falsely indicate corruption.
What standards or documents apply here? Is there a portable way to avoid problems here, preferably without introducing lots of locking?
When a function is guaranteed to complete without there being any chance of any other process/thread/anything seeing things in a half finished state, it's said to be atomic. It either has or hasn't happened, there is no part way. While I can't speak to Windows, there are very few file operations in POSIX (which is what Linux/BSD/etc attempt to stick to) that are guaranteed to be atomic. Reading and writing are not guaranteed to be atomic.
While it would be pretty unlikely for you to write 2 bytes to a file and another process only see one of those bytes written, if by dumb luck your write straddled two different pages in memory and the VM system had to do something to prepare the second page, it's possible you'd see one byte without the other in a second process. Usually if things are page aligned in your file, they will be in memory, but again you can't rely on that.
Here's a list someone made of what is atomic in POSIX, which is pretty short, and I can't vouch for its authenticity. (I can't think of why unlink isn't listed, for example.)
I'd also caution you against testing what appears to work and running with it, the moment you start accessing files over a network file system (NFS on Unix, or SMB mounts in Windows) a lot of things that seemed to be atomic before no longer are.
If you want to have a second process calculating checksums while a first process is writing the file, you may want to open a pipe between the two and have the first process write a copy of everything down the pipe to the checksumming process. That may be faster than dealing with locking.
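A Unix-style sketch of that pipe arrangement follows; the file name, the 8 KB block size, and the SHA-256 choice are placeholders.

require 'digest'

reader, writer = IO.pipe

checksummer = fork do
  writer.close
  while (block = reader.read(8192))
    puts Digest::SHA256.hexdigest(block)   # verify or record the checksum
  end
end

reader.close
File.open('data.bin', 'rb') do |f|
  while (block = f.read(8192))
    # ... write the block to its real destination here ...
    writer.write(block)                    # and send a copy to the checksummer
  end
end
writer.close
Process.wait(checksummer)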

Does using the Ruby "File" class without closing leak memory?

So I was doing some research on the File class in Ruby. As I was digging I learned that File was a subclass of IO. To my understanding when you create an IO object (or File object), a buffer is opened to that file that allows you to read and write to that file. I don't completely understand what a buffer is, but apparently it stays open until you call the #close method on the object. To my understanding this buffer is opened whether you call File.new or File.open (please correct me if I'm wrong on any of this).
So say you like to use the File class for paths and stuff like this:
f = File.new('spec/tmp/testfile.md')
File.basename(f)
But you never call f.close. Does leaving this buffer open leak memory? If I called this several hundred times for a tree in a filesystem would I be in deep trouble?
Thanks for your replies!
PS I know you can just use File.basename('spec/tmp/testfile.md') instead, I'm just using this as an example
Yes
Except for the sys* family of operations, Ruby's IO ops ultimately allocate both file descriptors and buffers.
If you don't close the IO object then you are correct ... you most likely leak both the fd and the buffer.
Now, if you allocate it in such a way as to overwrite or otherwise end the lifetime of the old reference, then Ruby can g/c the entire object. This will definitely free the buffer, and it will eventually free the FD as well.
In all languages, however, it's considered quite bad practice to rely upon a g/c-triggered finalizer as it's unpredictable how long it will take and how many outstanding OS-level resources will exist at one time. You may exceed some local limit before the g/c machinery even starts up.
The general rule is to allocate and free OS resources synchronously.
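In Ruby the easiest way to do that is the block form of File.open, which closes the handle for you when the block exits (even on an exception); the explicit begin/ensure version below is equivalent.

basename = File.open('spec/tmp/testfile.md') { |f| File.basename(f) }

f = File.new('spec/tmp/testfile.md')
begin
  basename = File.basename(f)
ensure
  f.close    # release the fd and buffer right away, no g/c involved
end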
And as long as I'm beating the subject to death, there is an exception. If you are allocating a fixed number of descriptors or something, and they all must exist at once anyway, and the program is going to exit after finishing its work, then it's OK to just leave them. The OS cleans up everything. For example, it's best not to free memory right before exit. The processing needed to manage the heap is completely wasted if the program is about to exit. The OS is just going to put every single page of the program on its free list. And there is an exception to the exception. If it's homework, I would free everything.

Rate limiting a ruby file stream

I am working on a project which involves uploading flash video files to a S3 bucket from a number of geographically distributed nodes.
The video files are about 2-3mb each, and we are only sending one file (per node) every ten minutes, however the bandwidth we consume needs to be rate limited to ~20k/s, as these nodes are delivering streaming media to a CDN, and due to the locations we are only able to get 512k max upload.
I have been looking into the aws-s3 gem, and while it doesn't offer any kind of rate limiting, I am aware that you can pass in an IO stream. Given this, I am wondering if it might be possible to create a rate-limited stream which overrides the read method, adds in the rate-limiting logic (e.g., in its simplest form, a call to sleep between reads) and then calls out to the super of the overridden method.
Another option I considered is hacking the code for Net::HTTP and putting the rate limiting into the send_request_with_body_stream method, which uses a while loop, but I'm not entirely sure which would be the best option.
I have attempted to extend the IO class; however, that didn't work at all. Simply inheriting from the class with class ThrottledIO < IO didn't do anything.
Any suggestions will be greatly appreciated.
You need to use Delegate if you want to "augment" an IO. This puts a "facade" around your IO object that will be used by all "external" readers of the object but will have no effect on the operation of the object itself.
I've extracted that into a gem, since it proved to be generally useful: http://rubygems.org/gems/progressive_io
Here's an example for an IO that gets read from. An aspect is added to all reading methods; I think you might be able to extend that to do basic throttling. After you are done you will be able to wrap your File (say) into it:
throttled_file = ProgressiveIO.new(some_file) do |offset, size|
  # compute the rate and, if needed, sleep()
end
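If you'd rather not pull in a gem, the same facade idea can be sketched with DelegateClass from Ruby's standard delegate library; ThrottledIO and its bytes_per_second argument are made up here, and only read is wrapped.

require 'delegate'

class ThrottledIO < DelegateClass(IO)
  def initialize(io, bytes_per_second)
    super(io)
    @bytes_per_second = bytes_per_second
  end

  def read(*args)
    started = Time.now
    data = __getobj__.read(*args)
    if data && !data.empty?
      expected = data.bytesize.to_f / @bytes_per_second
      elapsed  = Time.now - started
      sleep(expected - elapsed) if expected > elapsed   # pace the reader
    end
    data
  end
end

# throttled = ThrottledIO.new(File.open('video.flv', 'rb'), 20 * 1024)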
We've used aiaio's active_resource_throttle to limit requests pulled from the Harvest API on a project at work. I didn't set it up, but it works.
