Ruby LibXML skip large nodes

I have an xml file that has a very large text node (>10 MB). While reading the file, is it possible to skip (ignore) this node?
I tried the following:
reader = XML::Reader.io(path)
while reader.read do
  next if reader.name.eql?('huge-node')
end
But this still results in the error: parser error : xmlSAX2Characters: huge text node
The only other solution I can think of is to first read the file as a string and remove the huge node through a gsub, and then parse the file. However, this method seems very inefficient.

That's probably because by the time you are trying to skip it, it's already read the node. According to the documentation for the #read method:
reader.read -> nil|true|false
Causes the reader to move to the next node in the stream, exposing its properties.
Returns true if a node was successfully read or false if there are no more nodes to read. On errors, an exception is raised.
You would need to skip the node prior to calling the #read method on it. I'm sure there are many ways you could do that but it doesn't look like this library supports XPath expressions, or I would suggest something like that.
EDIT: The question was clarified so that the SAX parser is a required part of the solution. I have removed links that would not be helpful given this constraint.

You don't have to skip the node. The cause is that since version 2.7.3 libxml limits the maximum size of a single text node to 10MB.
This limit can be removed with a new option, XML_PARSE_HUGE.
Below is an example (this one uses PHP's SimpleXML, where the option is exposed as LIBXML_PARSEHUGE):
# Reads entire file into a string
$result = file_get_contents("https://www.ncbi.nlm.nih.gov/gene/68943?report=xml&format=text");
# Returns the xml string into an object
$xml = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE);
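In Ruby with libxml-ruby, the same underlying libxml2 option can be passed to the reader. Below is a minimal sketch, assuming the gem exposes the option as XML::Parser::Options::HUGE (the file name is a placeholder):
require 'libxml'
include LibXML

# HUGE relaxes libxml2's hard-coded limits, including the 10 MB text-node cap
reader = XML::Reader.file('big.xml', :options => XML::Parser::Options::HUGE)
while reader.read
  # process nodes as usual
end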

Related

Search/Parse XML and exclude certain nodes without removing them?

The command below allows me to parse the text in all nodes except for nodes 'wp14:sizeRelH' & 'wp14:sizeRelV'
XML.search('//wp14:sizeRelH', '//wp14:sizeRelV').remove.search('//text()')
I would like to do the same thing but I do not want to remove nodes 'wp14:sizeRelH' and 'wp14:sizeRelV' from the XML.
This way I can parse through the XML tree and make changes to the text in each node without affecting nodes 'wp14:sizeRelH' and 'wp14:sizeRelV'
EDIT: It appears that if nodes '//wp14:sizeRelH' or '//wp14:sizeRelV' are not in the XML, then my command returns nothing, which is not good :(
Looks like I found the answer. I used //text()[not...] but had to find the ancestor names of the text I didn't want to include:
XML.search('//text()[not(ancestor::wp14:pctHeight or ancestor::wp14:pctWidth or ancestor::wp:posOffset)]')
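To illustrate the ancestor-exclusion idea on a self-contained document (the namespace URI and element contents below are invented for the example):
require 'nokogiri'

doc = Nokogiri::XML(<<XML)
<root xmlns:wp14="urn:example:wp14">
  <title>change me</title>
  <wp14:sizeRelH><wp14:pctWidth>100000</wp14:pctWidth></wp14:sizeRelH>
</root>
XML

# Select every text node that is not inside a wp14:sizeRelH or wp14:sizeRelV subtree
texts = doc.search('//text()[not(ancestor::wp14:sizeRelH or ancestor::wp14:sizeRelV)]')
texts.each { |t| t.content = t.content.upcase }
puts doc.to_xml   # <title> is upcased, the wp14 values are untouched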

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, that would work too. The file basically has one big line, so I can't even read it line by line; the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
Remember that Marshaling is Ruby version dependent. It's only compatible under specific circumstances with other Ruby versions. So keep that in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.
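For completeness, the same thing can be done with explicit binary modes, as a sketch using the file name from the question:
# write in binary mode
File.open('mat_save', 'wb') { |f| f.write(mat_dump) }

# read the whole file back in binary mode
mat_dump = File.binread('mat_save')
@mat = Marshal.load(mat_dump)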

Ruby - Files - gets method

I am following the Wicked Cool Ruby Scripts book.
Here,
there are two files, file_output = file_list.txt and oldfile_output = file_list.old. These two files contain the list of all files the program has gone through and is going to go through.
Now, the file is renamed as the old file if a 'file_list.txt' file exists.
After that, I am not able to understand the code.
Apparently every line of the file is read and each line is stored in the oldfile hash.
Can someone explain the code from the 4th line on?
And also, why is gets used here? Why can't an .each method be used to read through every line?
if File.exists?(file_output)
  File.rename(file_output, oldfile_output)
  File.open(oldfile_output, 'rb') do |infile|
    while (temp = infile.gets)
      line = /(.+)\s{5,5}(\w{32,32})/.match(temp)
      puts "#{line[1]} ---> #{line[2]}"
      oldfile_hash[line[1]] = line[2]
    end
  end
end
Judging from the redundant use of quantifiers ({5,5} and {32,32}) in the regex (which would be better written as {5} and {32}), it looks like the person who wrote that code is not a professional Ruby programmer. So you can assume that the choices made in the code are not necessarily the best.
As you pointed out, the code could have used each instead of while with gets. The latter approach is sort of an old-school Ruby way of doing it; there is nothing wrong with using it. Until the end of file is reached, gets returns a string, and when it does reach the end of file, gets returns nil, so the while loop works the same as when you use each: in each iteration, it reads the next line.
It looks like each line is supposed to represent a key-value pair. The regex assumes that the key is not an empty string, that the key and the value are separated by exactly five whitespace characters, and that the value consists of exactly thirty-two word characters. Each key-value pair is printed (perhaps for monitoring progress) and is stored in oldfile_hash, which is most likely a hash.
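For reference, here is a sketch of the same loop using each_line instead of gets, keeping the variables and line format from the question:
File.open(oldfile_output, 'rb') do |infile|
  infile.each_line do |temp|
    line = /(.+)\s{5}(\w{32})/.match(temp)
    puts "#{line[1]} ---> #{line[2]}"
    oldfile_hash[line[1]] = line[2]
  end
end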
So the point of using gets is to tell when the file is finished being read. Essentially, it's tied to the
while (condition)
  ...
end
block. So gets serves as a little method that will keep giving Ruby the next line of the file until there are no more lines to give.

Using Nokogiri with multiple search elements

In this XML snippet I need to replace the data in the UID for some of the blocks. The actual file contains more than 100 similar blocks.
Although I have been able to extract subsets based on name="Track (TimeLine)", I am struggling to reduce this subset to the specific block I need by also using the data in <TrackID>: if name="Track (TimeLine)" and the text of <TrackID> is 0x1200, then set UID to xxxx.
I am new to Nokogiri and, although I write test scripts, I do not consider myself a programmer.
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
  <EditRate>25/1</EditRate>
  <Origin>0</Origin>
  <Sequence>32-04-25-67-E7-A7-86-4A-9B-28-53-6F-66-74-65-6C</Sequence>
  <TrackID>0x1200</TrackID>
  <TrackName>Softel VBI Data</TrackName>
  <TrackNumber>0x17010101</TrackNumber>
  <UID>34-C1-B9-B9-5F-07-A4-4E-8F-F4-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
  <EditRate>25/1</EditRate>
  <Origin>0</Origin>
  <Sequence>35-12-2D-86-E6-74-0B-4C-B4-24-53-6F-66-74-65-6C</Sequence>
  <TrackID>0x1300</TrackID>
  <TrackName>Softel VBI Data</TrackName>
  <TrackNumber>0x0</TrackNumber>
  <UID>37-0C-80-34-4C-8D-CE-41-85-F3-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
Using XPath:
//StructuralMetadata
will select all StructuralMetadata elements in your XML. The double slash at the start means to select nodes wherever they appear in the document.
You don't want all the nodes though, you can filter the ones you want with a predicate:
//StructuralMetadata[@name="Track (TimeLine)" and TrackID="0x1200"]
This will select all StructuralMetadata elements that have a name attribute with the value Track (TimeLine), and a TrackID child element with contents 0x1200.
As you're interested in the UID element, you can further refine the expression:
//StructuralMetadata[@name="Track (TimeLine)" and TrackID="0x1200"]/UID
This expression will match all the UID elements that are children of StructuralMetadata elements that match the predicate described above.
Putting this to use:
require 'nokogiri'
# Parse the document, assuming xml_file is a File object containing the XML
doc = Nokogiri::XML(xml_file)
# I'm assuming there is only one element in the document that matches
# the criteria, so I'm using at_xpath
node = doc.at_xpath('//StructuralMetadata[@name="Track (TimeLine)" and TrackID="0x1200"]/UID')
# At this point, doc contains a representation of the xml, and node points to
# the UID node within that representation. We can update the contents of
# this node
node.content = 'XXX'
# Now write out the updated XML. This just writes it to standard output,
# you could write it to a file or elsewhere if needed
puts doc.to_xml
A great way to approach this problem is with the ‘map reduce’ style of programming, which takes a large list of things, narrows it down, and combines it into the result you're after. Specifically, Array#find and Array#select are really useful for this sort of problem. Check out this example:
require 'nokogiri'
xml = Nokogiri::XML.parse(File.read "sample.xml")
element = xml.css('StructuralMetadata').find { |item|
  item['name'] == "Track (TimeLine)" and item.css('TrackID').text == "0x1200"
}
puts element.to_xml
This little program first uses the CSS selector to get all of the <StructuralMetadata> elements in the document. It returns a list that behaves like an array, which we can filter down to just what we want using the Array#find method. Array#select is its cousin, which returns all of the matching objects instead of the first one it happens to find.
Inside the block we have a test to check whether the <StructuralMetadata> tag is the one we're after. The script then puts the element.to_xml string to the console, so you can see which element it found if you run this as a command-line script. Now that you can find the element, you can modify it in the usual way and save out a new XML file or whatever; a sketch of that follows.
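For example, continuing from the snippet above, the matched block's UID can be updated and the document written back out (the output file name here is just a placeholder):
if element
  element.at_css('UID').content = 'xxxx'   # replace the UID text
  File.write('sample_updated.xml', xml.to_xml)
end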

Processing large XML file with libxml-ruby chunk by chunk

I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record but I am not sure this is the right approach since my code eats up memory. Hence, I'm looking for a recipe how to conveniently process record by record with constant memory usage. Below is my main loop:
File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  i = 0
  while dblp.read do
    case dblp.name
    when 'article', 'inproceedings', 'book' then
      pub = pubFactory.create(dblp.expand)
      i += 1
      puts pub
      pub = nil
      $stderr.puts i if i % 10000 == 0
      dblp.next
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis' then
      # ignore for now
      dblp.next
    else
      # nothing
    end
  end
end
The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level XPath-like expressions to extract the content of elements, like below. Again, is this viable?
def first(root, node)
  x = root.find(node).first
  x ? x.content : nil
end

pub.pages = first(node, 'pages') # node contains the expanded node from dblp.expand
When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches:
Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up / go down, etc.
I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with case... when... constructs.
Pull parsers are, in my opinion, a good alternative between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
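For instance, here is a minimal pull-parser sketch with REXML, reading the dblp.xml file from the question (the event handling is deliberately trivial):
require 'rexml/parsers/pullparser'

parser = REXML::Parsers::PullParser.new(File.open('dblp.xml'))
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'article'
    # an <article> record starts here; keep pulling events to collect its fields
  end
end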
When processing XML, the two common options are tree-based and event-based parsing. The tree-based approach typically reads the entire XML document into memory and can consume a large amount of it. The event-based approach uses very little additional memory but doesn't do anything unless you write your own handler logic.
The event-based model is employed by the SAX-style parser, and derivative implementations.
Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html
REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like
my_node = dblp.expand
# do what you have to do with my_node
dblp.next
my_node.remove!
Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and you have to call Node#remove! to free it.
Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
