Nokogiri builder performance on huge XML? - ruby

I need to build a huge XML file, about 1-50 MB. I thought that using the builder would be efficient enough and, well, it is, somewhat. The problem is that after the program reaches its last line it doesn't end immediately; Ruby keeps doing something for several seconds, maybe garbage collection? After that the program finally ends.
To give a real example, I measured the time it takes to build an XML file. It reports 55 seconds (there is a database behind it, so it takes that long) once the XML is built, but Ruby keeps processing for about 15 more seconds and the processor goes crazy.
The pseudo/real code is as follows:
...
builder = Nokogiri::XML::Builder.with(doc) do |xml|
  build_node(xml)
end
...
def build_node(xml)
  ...
  xml["#{namespace}"] if namespace
  xml.send("#{elem_name}", attrs_hash) do |elem_xml|
    ...
    if has_children
      if type
        case type
        when XML::TextContent::PLAIN
          elem_xml.text text_content
        when XML::TextContent::COMMENT
          elem_xml.comment text_content
        when XML::TextContent::CDATA
          elem_xml.cdata text_content
        end
      else
        build_node(elem_xml)
      end
    end
  end
end
Note that I was previously using a different approach with my own structure of classes; the build speed was the same, but the program ended normally after the last line. Now I am forced to use Nokogiri, so I have to find a solution.
What can I do to avoid that X-seconds-long overhead after the XML is built? Is it even possible?
UPDATE:
Thanks to a suggestion from Adiel Mittmann, I was able to locate the problem while creating a minimal working example. I now have a small (well, not that small) example demonstrating it.
The following code is causing the problem:
xml.send("#{elem_name}_") do |elem_xml|
  ...
  elem_xml.text text_content # This line is the problem
  ...
end
Based on Nokogiri's documentation, that line executes the following code:
def create_text_node string, &block
  Nokogiri::XML::Text.new string.to_s, self, &block
end
So the text node creation code gets executed for every call. What exactly is happening here?
UPDATE 2:
After some more experimenting, the problem can be reproduced easily with:
builder = Nokogiri::XML::Builder.new do |xml|
  0.upto(81900) do
    xml.text "test"
  end
end
puts "End"
So is it really Nokogiri itself? Is there any option for me?

Your example also takes a long time to execute here. And you were right: it's the garbage collector that's taking so long. Try this:
require 'nokogiri'

class A
  def a
    builder = Nokogiri::XML::Builder.new do |xml|
      0.upto(81900) do
        xml.text "test"
      end
    end
  end
end

A.new.a
puts "End1"
GC.start
puts "End2"
Here, the delay happens between "End1" and "End2". After "End2" is printed, the program closes immediately.
Notice that I created an object to demonstrate this: once the method returns, the builder's data is out of scope and can be collected. Otherwise, the data generated by the builder could only be garbage collected when the program finishes.
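If you want to confirm where the time goes, GC::Profiler (standard since Ruby 1.9) can measure it; a minimal sketch against the same repro:

require 'nokogiri'

GC::Profiler.enable

class A
  def a
    Nokogiri::XML::Builder.new do |xml|
      0.upto(81900) { xml.text "test" }
    end
  end
end

A.new.a                        # builder data goes out of scope when the method returns
GC.start                       # the same point where the delay shows up above
puts GC::Profiler.total_time   # seconds spent in garbage collection so far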
As for the best way to do what you're trying to accomplish, I suggest you ask another question giving details of what exactly you're trying to do with the XML files.

Try using the Builder gem instead (it is a separate gem, not actually built into Ruby). I use it to generate large XML files as well, and it has a small footprint.
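For instance, Builder can stream straight to an IO target instead of keeping a document tree in memory; a minimal sketch (the element names here are made up):

require 'builder'  # gem install builder

File.open("huge.xml", "w") do |file|
  xml = Builder::XmlMarkup.new(:target => file, :indent => 2)
  xml.instruct!                                 # emits the <?xml ...?> declaration
  xml.records do
    0.upto(81900) do |i|
      xml.record(:id => i) { xml.text! "test" } # text! escapes and appends text
    end
  end
end

Because each element is written to the file as it is generated, there is no large object graph left over for the GC to sweep at the end.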

Related

How to read a file one line at a time, on function call

I am reading a massive file of JSON objects like this:
File.open('massive.json') do |file|
  file.lazy.each_slice(BATCH_SIZE) do |lines|
    lines.each do |line|
      employee_data_hash = JSON.parse(line, symbolize_names: true)
      add_employee(employee_data_hash)
    end
  end
end
The thing is, I am finding it tricky to write a correct test for this. Moreover, it would be better for the overall design of my project if I had a function get_json_line that returned one line at a time (there is one JSON object per line). That way I could process the data line by line without memory problems.
Should I use yield? What effect would yield have on performance?
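A sketch of that design, assuming one JSON object per line as described (each_json_line stands in for the get_json_line idea; add_employee is from the question). File.foreach reads lazily, and returning an Enumerator when no block is given makes it easy to pull a single line at a time in tests. A plain yield adds no meaningful overhead; the I/O and JSON parsing dominate.

require 'json'

# Yields one parsed object per line; without a block, returns an Enumerator
def each_json_line(path)
  return enum_for(:each_json_line, path) unless block_given?
  File.foreach(path) do |line|
    yield JSON.parse(line, symbolize_names: true)
  end
end

# Stream the whole file, one object at a time:
each_json_line('massive.json') { |hash| add_employee(hash) }

# Or pull a single line on demand (handy in tests):
lines = each_json_line('massive.json')
first_employee = lines.next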

Parsing huge (~100mb) kml (xml) file taking *hours* without any sign of actual parsing

I'm currently trying to parse a very large kml (xml) file with ruby (Nokogiri) and am having a little bit of trouble.
The parsing code is good, in fact I'll share it just for the heck of it, even though this code doesn't have much to do with my problem:
geofactory = RGeo::Geographic.projected_factory(:projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs", :projection_srid => 3361)
f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)

kmldoc.css("//Placemark").each_with_index do |placemark, i|
  puts i
  tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
  h = HorryParcel.new
  h.owner_name = tds.shift.text
  tds.shift
  tds.each_slice(2) do |k, v|
    col = k.text.downcase
    eval("h.#{col} = v.text")
  end
  coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map {|x| x.split(",")}
  points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
  geo_shape = geofactory.polygon(geofactory.linear_ring(points))
  proj_shape = geo_shape.projection
  h.geo_shape = geo_shape
  h.proj_shape = proj_shape
  h.save
end
Anyway, I've tested this code with a much, much smaller sample of kml and it works.
However, when I load the real thing, Ruby simply waits, as if it were processing something. This "processing" has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks, and during this multi-hour period not a single i value has been printed to the command line. Oddly enough, it hasn't timed out yet, but even if this works there has got to be a better way.
I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable kml files, but the way things appear to be set up, this would be a very manual, unprofessional process.
Here's a sample of the kml (w/ just one placemark) if that helps.
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
  <Document>
    <name>justone.kml</name>
    <Style id="PolyStyle00">
      <LabelStyle>
        <color>00000000</color>
        <scale>0</scale>
      </LabelStyle>
      <LineStyle>
        <color>ff0000ff</color>
      </LineStyle>
      <PolyStyle>
        <color>00f0f0f0</color>
      </PolyStyle>
    </Style>
    <Folder>
      <name>justone</name>
      <open>1</open>
      <Placemark id="ID_010161">
        <name>STUART CHARLES A JR</name>
        <Snippet maxLines="0"></Snippet>
        <description>""</description>
        <styleUrl>#PolyStyle00</styleUrl>
        <MultiGeometry>
          <Polygon>
            <outerBoundaryIs>
              <LinearRing>
                <coordinates>
-78.941896,33.867893,0 -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0
                </coordinates>
              </LinearRing>
            </outerBoundaryIs>
          </Polygon>
        </MultiGeometry>
      </Placemark>
    </Folder>
  </Document>
</kml>
EDIT:
99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (off of my laptop) and run it until it either times out or finishes.
class ClassName
  attr_reader :before, :after

  def go
    @before = Time.now
    run_actual_code
    @after = Time.now
    puts "process took #{@after - @before} seconds to complete"
  end

  def run_actual_code
    ...
  end
end
The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.
For a huge XML file, you should not use Nokogiri's default XML parser, because it builds the whole document in memory as a DOM. A much better parsing strategy for large XML files is SAX. Luckily for us, Nokogiri supports SAX.
The downside is that with a SAX parser, all the logic has to be done with callbacks. The idea is simple: the SAX parser starts reading the file and lets you know whenever it finds something interesting, for example an opening tag, a closing tag, or text. You can bind callbacks to these events and extract whatever you need.
Of course you don't want to use a SAX parser to load the whole file into memory and work with it there - that is exactly what SAX is meant to avoid. You will need to do whatever you want with the file part by part.
So this basically means rewriting your parsing logic with callbacks. To learn more about DOM vs. SAX parsers, you might want to check this FAQ from cs.nmsu.edu.
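A minimal sketch of that callback style with Nokogiri's SAX parser (counting Placemark elements is just an illustration, not a full solution to the question):

require 'nokogiri'

class PlacemarkCounter < Nokogiri::XML::SAX::Document
  def initialize
    @count = 0
  end

  # Called once per opening tag as the file streams past
  def start_element(name, attrs = [])
    @count += 1 if name == "Placemark"
  end

  def end_document
    puts "saw #{@count} placemarks"
  end
end

# The document is streamed; it is never held in memory as a whole
Nokogiri::XML::SAX::Parser.new(PlacemarkCounter.new).parse(File.open("horry_parcels.kml"))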
I actually ended up getting a copy of the data from a more accessible source, but I'm back here because I wanted to present a possible solution to the general problem: less. less was built a long time ago and is part of Unix by default in most cases.
http://en.wikipedia.org/wiki/Less_%28Unix%29
Not related to the stylesheet language ("LESS"), less is a text viewer (it cannot edit files, only read them) that does not load the entire document until you have scrolled through the whole thing yourself. I.e., it loads the first "page", so to speak, and waits for you to call for the next one.
If a Ruby script could somehow pipe "pages" of text into... oh wait... the XML structure wouldn't allow it, because the undigested text at the end would be missing its closing delimiters. So you would have to do some custom work on the front end: cut out the first couple of parent tags so that you can pluck out the XML children one by one, and accept that the final closing parent tags will break the script, because the parser will think it is finished and then come across another closing tag, I guess.
I haven't tried this and don't have anything to try it on. But if I did, I'd probably try piping n-line blocks of text into Ruby (or Python, etc.) via less or something similar to it - perhaps something even more primitive than less, I'm not sure.

Parse huge file (10+gb) and write content in another one

I'm trying to use Sphinx Search Server to index a really huge file (around 14gb).
The file is whitespace separated, one entry per line.
To be able to use it with Sphinx, I need to provide an XML file to the Sphinx server.
How can I do it without killing my computer ?
What is the best strategy? Should I try to split the main file in several little files? What's the best way to do it?
Note: I'm doing it in Ruby, but I'm totally open to other hints.
Thanks for your time.
I think the main idea would be to parse the main file line by line while generating a result XML, and every time it gets large enough, feed it to Sphinx. Rinse and repeat.
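A rough sketch of that rinse-and-repeat loop (the file names are placeholders, and the <docs>/<doc> layout here is not Sphinx's actual xmlpipe2 schema; real field values would also need XML-escaping):

doc_id = 0
batch_no = 0
File.open("huge_input.txt") do |input|
  # each_line returns an Enumerator, so each_slice pulls one batch at a time
  input.each_line.each_slice(100_000) do |lines|
    batch_no += 1
    File.open("batch_#{batch_no}.xml", "w") do |out|
      out.puts "<docs>"
      lines.each do |line|
        doc_id += 1
        fields = line.split   # whitespace-separated, one entry per line
        out.puts %(  <doc id="#{doc_id}">#{fields.join(" ")}</doc>)
      end
      out.puts "</docs>"
    end
    # hand the finished batch file to the Sphinx indexer here, then repeat
  end
end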
What parsing do you need to do? If the transformations are restricted to one line of the input at a time and are not too complicated, I would use awk instead of Ruby...
I hate guys who don't write the solution after their question, so I'll try not to be one of them; hopefully this will help somebody.
I added a simple reader method to the File class, then used it to loop over the file in chunks of a size of my choosing. Quite simple actually, and it works like a charm with Sphinx.
class File
  # New class method: read a file sequentially, yielding one chunk at a time
  def self.seq_read(file_path, chunk_size = nil)
    open(file_path, "rb") do |f|
      f.each_chunk(chunk_size) do |chunk|
        yield chunk
      end
    end
  end

  # New instance method
  def each_chunk(chunk_size = nil)
    # Default applied here so an explicit nil from seq_read still works;
    # 1.kilobyte is an ActiveSupport extension - use 1024 in plain Ruby
    chunk_size ||= 1.kilobyte
    yield read(chunk_size) until eof?
  end
end
Then just use it like this:
source_path = "./my_very_big_file.txt"
CHUNK_SIZE = 10.megabytes

File.seq_read(source_path, CHUNK_SIZE) do |chunk|
  chunk.each_line do |line|
    ...
  end
end

How to tidy up malformed xml in ruby

I'm having issues tidying up malformed XML I'm getting back from the SEC's EDGAR database.
For some reason it is horribly formed: tags that contain any sort of string aren't closed, and a tag can actually contain other XML or HTML documents inside it. Normally I'd hand this off to Tidy, but that isn't being maintained anymore.
I've tried using Nokogiri::XML::SAX::Parser, but that seems to choke because the tags aren't closed. It seems to work all right until it hits the first closing tag, and then it doesn't fire on any more of them. But it is spitting out the right characters.
class Filing < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    puts "starting: #{name}"
  end

  def characters str
    puts "chars: #{str}"
  end

  def end_element name
    puts "ending: #{name}"
  end
end
It seems like this would be the best option, because I can simply have it ignore the other XML or HTML documents. It also makes the most sense because some of these documents can get quite large, so storing the whole DOM in memory would probably not work.
Here are some example files: 1 2 3
I'm starting to think I'll just have to write my own custom parser.
Nokogiri's normal DOM mode is able to automatically fix up the XML so it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and will shift closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
I saved the XML #1 out to a document and loaded it:
require 'nokogiri'

doc = ''
File.open('./test.xml') do |fi|
  doc = Nokogiri::XML(fi)
end
puts doc.to_xml
After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.
doc.errors
If using Nokogiri's DOM model isn't good enough, have you considered using XMLLint to preprocess and clean the data, emitting clean XML so the SAX will work? Its --recover option might be of use.
xmllint --recover test.xml
It will output errors on stderr, and the code on stdout, so you can pipe it easily to another file.
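If you'd rather drive that from Ruby than from the shell, IO.popen (standard library) can capture the recovered output directly; a small sketch:

require 'nokogiri'

# Run xmllint --recover and capture the repaired XML from stdout;
# the errors still go to stderr as usual
clean_xml = IO.popen(["xmllint", "--recover", "test.xml"], &:read)
doc = Nokogiri::XML(clean_xml)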
As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.

Rails File I/O: What works in Ruby doesn't work in Rails?

So, I wrote a simple Ruby class, and put it in my rails /lib directory. This class has the following method:
def Image.make_specific_image(paths, newfilename)
  puts "making specific image"
  @new_image = File.open(newfilename, "w")
  puts @new_image.inspect
  # String#each(separator) is Ruby 1.8; on 1.9+ use each_line(">")
  @@blank.each(">") do |line|
    puts line + "~~~~~"
    @new_image.puts line
    if line =~ /<g/
      paths.each do |p|
        puts "adding a path"
        puts p
        @new_image.puts p
      end
    end
  end
end
This creates a new file and copies a hardcoded string (@@blank) to it, adding custom content at a certain location (after a g tag is found).
If I run this code from ruby, everything is just peachy.
HOWEVER, if I run this code from rails, the file gets CREATED, but is then empty. I've inspected each line of the code: nothing I'm trying to write to the file is nil, but the file is empty nonetheless.
I'm really stumped here. Is it a permissions thing? If so, why on EARTH would Rails have the permissions necessary to MAKE a file, but then not WRITE to the file it made?
Does File I/O somehow work differently in rails?
Specifically, I have a model method that calls:
Image.make_specific_image(paths, creature.id.to_s + ".svg")
which successfully makes a file named like "47.svg", but it is empty.
Have you tried calling close on the file after you're done writing it? (You could also use the block-based File.open syntax, which will automatically close the file once the block is complete.) I'm guessing the problem is that the writes aren't getting flushed to disk.
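A sketch of what the block form would look like for the method in the question (prints removed; @@blank and paths as above):

def Image.make_specific_image(paths, newfilename)
  File.open(newfilename, "w") do |new_image|
    @@blank.each(">") do |line|
      new_image.puts line
      if line =~ /<g/
        paths.each { |p| new_image.puts p }
      end
    end
  end  # the file is flushed and closed here, when the block exits
end

With the block form, nothing waits on the garbage collector or interpreter exit to flush the buffered writes.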
So.
Apparently File I/O DOES work in Rails... just very, very slowly. In Ruby, as soon as I go to look at the file, it's there, it works, everything is spiffy.
Before, after seeing blank files from Rails, I would get frustrated, delete the file, change some code, and try again (so as not to end up with a pile of spam, since a file is generated on every creature creation, I'd soon have files like "47.svg", "48.svg", etc.).
....So. I took my lunch break, came back to see if the permissions of the Rails-generated file were different from the Ruby-generated one... and noticed that the RAILS file is no longer blank.
It seems to take about five minutes for Rails to finally write to the file, even AFTER it claims it's done processing the whole call. Ruby takes a few seconds. I'm not really sure WHY they are so different, but at least now I know it's not a permissions thing.
Edit: Actually, some files take that long; others are instant...
