Storing and processing large XML files with Heroku? - ruby

I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?

Most of the time we prefer parsing the entire file that has been pulled into memory because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily, if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we’re interested in, then giving the parser some XML to process. The parser will notify you when it encounters events your said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.

You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2? Or use a hosted provider such as https://hostedftp.com/

Related

How to create a partially modifiable binary file format?

I'm creating my custom binary file extension.
I use the RIFF standard for encoding data. And it seems to work pretty well.
But there are some additional requirements:
Binary files could be large up to 500 MB.
Real-time saving data into the binary file in intervals when data on the application has changed.
Application could run on the browser.
The problem I face is when I want to save data it needs to read everything from memory and rewrite the whole binary file.
This won't be a problem when data is small. But when it's getting larger, the Real-time saving feature seems to be unscalable.
So main requirement of this binary file could be:
Able to partially read the binary file (Cause file is huge)
Able to partially write changed data into the file without rewriting the whole file.
Streaming protocol like .m3u8 is not an option, We can't split it into chunks and point it using separate URLs.
Any guidance on how to design a binary file system that scales in this scenario?
There is an answer from a random user that has been deleted here.
It seems great to me.
You can claim your answer back and I'll delete this one.
He said:
If we design the file to be support addition then we able to add whatever data we want without needing to rewrite the whole file.
This idea gives me a very great starting point.
So I can append more and more changes at the end of the file.
Then obsolete old chunks of data in the middle of the file.
I can then reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slot when I have a chance to rewrite the whole file.

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am not able to write the file to disk anymore and I receive the following error;
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'
File.open(filename, File::RDWR) do |file|
file.flock(File::LOCK_EX)
puts "content: #{file.read}"
puts 'doing some heavy-lifting now'
sleep(10)
end
Take a look at transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me. So when I need to read/write atomically I (ab)use it to store data as a Hash.

Sencha Touch - big XML file issue

I am reading the content out from a xml file over the internet!
The file contains about 10000 xml-elements and is loaded into a list (one picture and headline for each element)!
This slows down the app extremly!
Is there a way to speed this up?
Maybe with a select-command?
Are there some examples or tutorials out there?
You are out of luck for a easy-straight forward answer.
If you control the server that the XML file is coming from, you should make the changes on it to support pagination of the results instead of sending the complete document.
If you don't control the server, you could set up one to proxy the results and do the pagination for the application on the server side.
The last option is the process the file in chunks. This would mean, processing sub-strings of the text. Just take a sub-string of the first x characters, parse it and then do something with the results. If you needed more you would process the next x characters. This could get very messy fast (as XML doesn't really parse nicely in this manner) and just downloading a document with 10k elements and loading it into memory is probably going to be taxing/slow/expensive (if downloading over a 3G connection) for mobile devices.

What is the best way to read files in an EventMachine-based app?

In order not to block the reactor I would like to read files asynchronously, but I've found no obvious way of doing it using EventMachine. I've tried a few different approaches, but none of them feels right:
Just read the file, it'll block the reactor, but what the hell, it's not that slow (unless it's a big file, and then it definitely is).
Open the file for reading and read a chunk on each tick (but how much to read? too much and it'll block the reactor, too little and reading will get slower than necessary).
EM.popen('cat some/file', FileReader) feels really weird, but works better than the alternatives above. In combination with the LineAndTextProtocol it reads lines pretty swiftly.
EM.attach, but I haven't found any examples of how to use it, and the only thing I've found on the mailing list is that it's deprecated in favour of…
EM.watch, which I've found no examples of how to use for reading files.
How do you read files within a EventMachine reactor loop?
EM.attach/watch cannot be used on files, as select/epoll on a disk-based file descriptor will always return readable.
Ultimately, it depends on what you're trying to do. If it's a small file, just File.read it. If it is larger, you can read small chunks over time. For example, EM::FileStreamer does this to send large file over the network.
Another common use-case is to tail a file and read in new contents when it changes. This can be achieved using EM.watch_file: http://github.com/jordansissel/eventmachine-tail

Resources