open-uri and sax parsing for a giant xml document - ruby

I need to connect to an external XML file (300MB+), download it, and process it.
Then I run through the XML document and save elements to the database.
I am already doing this without problems on a production server, using Saxerator to be gentle on memory. It works great. Here is my issue now:
I need to use open-uri (though there could be alternative solutions?) to grab the file to parse through. The problem is that open-uri has to load the whole file before anything starts parsing, which defeats the entire purpose of using a SAX parser to save on memory... any workarounds? Can I just read from the external XML document directly? I cannot load the entire file or it crashes my server, and since the document is updated every 30 minutes, I can't just save a copy of it on my server (though this is what I am doing currently to make sure everything is working).
I am doing this in Ruby, by the way.

You may want to try Net::HTTP's streaming interface instead of open-uri. This will give Saxerator (via the underlying Nokogiri::XML::SAX::Parser) an IO object rather than the entire file.
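To make that concrete, here is a minimal sketch of one way to do it: stream the response body through an IO.pipe so Saxerator can read it incrementally instead of buffering the whole 300MB document. The URL and the product tag name are placeholders, so adjust them to your feed.

require 'net/http'
require 'saxerator'

uri = URI('http://example.com/feed.xml')   # placeholder URL
reader, writer = IO.pipe

# Producer: stream the HTTP response body into the pipe chunk by chunk.
producer = Thread.new do
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request_get(uri.request_uri) do |response|
      response.read_body { |chunk| writer.write(chunk) }
    end
  end
  writer.close
end

# Consumer: Saxerator reads from the pipe as data arrives.
Saxerator.parser(reader).for_tag(:product).each do |element|
  # save each element to the database here
end

producer.join
reader.close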

I took a few minutes to write this up and then realized you tagged this question with ruby. My solution is in Java so I apologize for that. I'm still including it here since it could be useful to you or someone down the road.
This is how I've always processed large external XML files:
XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
xmlReader.setFeature("http://xml.org/sax/features/namespaces", true);
XMLFilter filter = new XMLFilterImpl();  // attach your own ContentHandler/filter here to handle the events
filter.setParent(xmlReader);
filter.parse(new InputSource(new BufferedReader(new InputStreamReader(
        new URL("<url to external document here>").openConnection().getInputStream(), "UTF-8"))));

Related

Parsing a JSON file without JSON.parse()

This is my first time using Ruby. I'm writing an application that parses data and performs some calculations based on it, the source of which is a JSON file. I'm aware I can use JSON.parse() here but I'm trying to write my program so that it will work with other sources of data. Is there a clear cut way of doing this? Thank you.
When your source file is JSON, use JSON.parse. Do not implement a JSON parser on your own. If the source file is a CSV, then use the CSV class.
When your application should be able to read multiple different formats, just add one reader class for each data type, like JSONReader, CSVReader, etc., and then decide based on the file extension which reader to use to read the file.
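As an illustration (the class and method names here are made up for the example), one simple way to wire that up:

require 'json'
require 'csv'

# One reader per format; each returns plain Ruby data structures.
class JSONReader
  def read(path)
    JSON.parse(File.read(path))
  end
end

class CSVReader
  def read(path)
    CSV.read(path, headers: true).map(&:to_h)
  end
end

READERS = { '.json' => JSONReader.new, '.csv' => CSVReader.new }.freeze

# Pick the reader based on the file extension.
def read_data(path)
  reader = READERS.fetch(File.extname(path)) { raise "Unsupported format: #{path}" }
  reader.read(path)
end

data = read_data('input.json')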

How to parse large xml file in ruby

I need to parse a large (4 GB) XML file in Ruby, preferably with Nokogiri. I've seen a lot of code examples using
File.open(path)
but this takes too much time in my case. Is there an option to read the XML node by node in order to avoid loading the whole file at once? Or what would be the fastest way to parse such a large file?
Best,
Phil
You can try using Nokogiri::XML::SAX
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
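For example, a minimal sketch of a SAX handler (the element name 'item' and the file name are placeholders):

require 'nokogiri'

class ItemHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @buffer = +'' if name == 'item'     # start collecting text for each <item>
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    return unless name == 'item'
    puts @buffer.strip                  # handle the completed element here
    @buffer = nil
  end
end

Nokogiri::XML::SAX::Parser.new(ItemHandler.new).parse(File.open('large.xml'))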
I do this kind of work with LibXML http://xml4r.github.io/libxml-ruby/ (require 'xml') and its LibXML::XML::Reader API. It's simpler than SAX and allows you to do almost everything. REXML includes a similar API as well, but it's quite buggy. Stream APIs like the one I mention, or SAX, shouldn't have any problem with huge files. I have not tested Nokogiri.
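An untested sketch of that Reader approach (the element name and file name are placeholders; I load the gem with require 'libxml' and use fully qualified constants to be explicit):

require 'libxml'

reader = LibXML::XML::Reader.file('large.xml')
while reader.read
  next unless reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT
  next unless reader.name == 'item'
  puts reader.read_outer_xml            # or pull out attributes/text as needed
end
reader.close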
you may like to try this out - https://github.com/amolpujari/reading-huge-xml
HugeXML.read xml, elements_lookup do |element|
  # => element { :name, :value, :attributes }
end
I also tried using Ox.

How do I get Zlib to compress to a stream in Ruby?

I’m trying to upload files to Amazon S3 using AWS::S3, but I’d like to compress them with Zlib first. AWS::S3 expects its data to be a stream object, i.e. you would usually upload a file with something like
AWS::S3::S3Object.store('remote-filename.txt', open('local-file.txt'), 'bucket')
(Sorry if my terminology is off; I don’t actually know much about Ruby.) I know that I can zlib-compress a file with something like
data = Zlib::Deflate.deflate(File.read('local-file.txt'))
but passing data as the second argument to S3Object.store doesn’t seem to do what I think it does. (The upload goes fine but when I try to access the file from a web browser it doesn’t come back correctly.) How do I get Zlib to deflate to a stream, or whatever kind of object S3Object.store wants?
I think my problem before was not that I was passing the wrong kind of thing to S3Object.store, but that I was generating a zlib-compressed data stream without the header you’d usually find in a .gz file. In any event, the following worked:
require 'zlib'
require 'stringio'

str = StringIO.new
gz = Zlib::GzipWriter.new(str)          # wraps the StringIO and writes the gzip header
gz.write File.read('local-file.txt')
gz.close                                # flushes the compressed data into str
AWS::S3::S3Object.store('remote-filename.txt', str.string, 'bucket')

Better approach to saving Webpage content to database for caching

I want to know which approach is better for saving webpage content to a database for caching:
Using the ntext data type and saving the content as a flat string
Using ntext, but compressing the content and then saving it
Using varbinary(MAX) to save the content (how can I convert a flat string to binary? ;-))
Another approach that you would suggest to me
UPDATE
In more depth: I have many tables (URLs, Caches, ParsedContents, Words, Hits, etc.). For each URL in the URLs table I send a request and save the response into the Caches table. This is the downloader (Google's URLResolver) section of my engine. The indexer section then performs parsing and the other tasks associated with it. Compression/decompression is performed only when new content is about to be cached or parsed.
The better approach would be to use the built-in caching features in ASP.NET. Searching StackOverflow for [asp.net] [caching] is a good start, and after (or before) that, similar searches on both www.asp.net and Google will get you quite far.
In response to your comment, I would probably save the data as a flat string. It might not be the best option performance-wise when it comes to storage, but if you're going to perform searches on the text content, you don't want to have to compress/decompress or convert to/from binary every time, since there is probably no (easy) way to do this inside SQL Server. Just make sure you have all the indexes and full-text features you need set up correctly.

Reading a zip file from a GET request without saving first in Ruby?

I am trying to read a zip file from an HTTP GET request. One way to do it is by saving the response body to a physical file first and then reading that zip file to get at the files inside it.
Is there a way to read the files inside directly, without having to save the zip file to a physical file first?
My current code:
require 'net/http'
require 'zip/zip'

Net::HTTP.start("clinicaltrials.gov") do |http|
  resp = http.get("/ct2/results/download?id=15002A")
  open("C:\\search_result.zip", "wb") do |file|
    file.write(resp.body)
  end
end

Zip::ZipFile.open("C:\\search_result.zip") do |zipfile|
  xml = zipfile.file.read("search_result.xml")
end
Looks like you're using rubyzip, which can't unzip from an in-memory buffer.
You might want to look at using Chilkat's Ruby Zip Library instead as it supports in-memory processing of zip data. It claims to be able to "Create or open in-memory Zips", though I have no experience with the library myself. Chilkat's library isn't free, however, so that may be a consideration. Not sure if there is a free library that has this feature.
One way might be to implement an in-memory file, so that RubyZip can still play with your file without changing anything.
You should take a look at this Ruby Hack.
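For illustration, here is a rough sketch of the in-memory idea using a StringIO in place of a physical file. It assumes a rubyzip version whose Zip::InputStream accepts an IO object (newer releases do); with the older Zip::ZipFile API from the question you would need the hack linked above.

require 'net/http'
require 'zip'        # newer rubyzip; see the assumption noted above
require 'stringio'

body = Net::HTTP.start("clinicaltrials.gov") do |http|
  http.get("/ct2/results/download?id=15002A").body
end

# Wrap the raw bytes in a StringIO so the zip reader treats them like a file.
Zip::InputStream.open(StringIO.new(body)) do |zip|
  while (entry = zip.get_next_entry)
    xml = zip.read if entry.name == "search_result.xml"
  end
end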
