How do I traverse an inner node using SAX in Nokogiri? - ruby

I'm quite new to Nokogiri and Ruby and seeking a little help.
I am parsing a very large XML file using class MyDoc < Nokogiri::XML::SAX::Document. Now I want to traverse the inner part of a block.
Here's the format of my XML file:
<Content id="83087">
<Title></Title>
<PublisherEntity id="1067">eBooksLib</PublisherEntity>
<Publisher>eBooksLib</Publisher>
......
</Content>
I can already tell when the "Content" tag is found; now I want to know how to traverse inside it. Here's my shortened code:
class MyDoc < Nokogiri::XML::SAX::Document
  # check the start element. set flag for each element
  def start_element name, attrs = []
    if(name == 'Content')
      # get the <Title>
      # get the <PublisherEntity>
      # get the Publisher
    end
  end
  def cdata_block(string)
    characters(string)
  end
  def characters(str)
    puts str
  end
end

Purists may disagree with me, but the way I've been doing it is to use Nokogiri to traverse the huge file, and then use XmlSimple to work with a smaller object in the file. Here's a snippet of my code:
require 'nokogiri'
require 'xmlsimple'
def isend(node)
  return (node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT)
end
reader = Nokogiri::XML::Reader(File.open('database.xml', 'r'))
# traverse the file looking for tag "content"
reader.each do |node|
  next if node.name != 'content' || isend(node)
  # if we get here, then we found start of node 'content',
  # so read it into an array and work with the array:
  content = XmlSimple.xml_in(node.outer_xml())
  title = content['title'][0]
  # ...etc.
end
This works very well for me. Some may object to mixing SAX and non-SAX (nokogiri and XmlSimple) in the same code, but for my purposes, it gets the job done with minimal hassle.

It's trickier to do with SAX. I think the solution will need to look something like this:
class MyDoc < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    @inside_content = true if name == 'Content'
    @current_element = name
  end
  def end_element name
    @inside_content = false if name == 'Content'
    @current_element = nil
  end
  def characters str
    puts "#{@current_element} - #{str}" if @inside_content && %w{Title PublisherEntity Publisher}.include?(@current_element)
  end
end

Related

Parsing Large xml file [duplicate]

So I'm attempting to parse a 400k+ line XML file using Nokogiri.
The XML file has this basic format:
<?xml version="1.0" encoding="windows-1252"?>
<JDBOR date="2013-09-01 04:12:31" version="1.0.20 [2012-12-14]" copyright="Orphanet (c) 2013">
<DisorderList count="6760">
*** Repeated Many Times ***
<Disorder id="17601">
<OrphaNumber>166024</OrphaNumber>
<Name lang="en">Multiple epiphyseal dysplasia, Al-Gazali type</Name>
<DisorderSignList count="18">
<DisorderSign>
<ClinicalSign id="2040">
<Name lang="en">Macrocephaly/macrocrania/megalocephaly/megacephaly</Name>
</ClinicalSign>
<SignFreq id="640">
<Name lang="en">Very frequent</Name>
</SignFreq>
</DisorderSign>
</DisorderSignList>
</Disorder>
*** Repeated Many Times ***
</DisorderList>
</JDBOR>
Here is the code I've created to parse and return each DisorderSign id and name into a database:
require 'nokogiri'
sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()
symptomsList = []
@doc.xpath("////DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text()
  name = x.at('ClinicalSign').element_children().text()
  symptomsList.push([signId, name])
end
symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
This works perfectly on the test files I've used, although they were much smaller, around 10000 lines.
When I attempt to run this on the large XML file, it simply does not finish. I left it running overnight and it seemed to just lock up. Is there any fundamental reason the code I've written would be very memory intensive or inefficient? I realize I store every possible pair in a list, but that shouldn't be large enough to fill up memory.
Thank you for any help.
I see a few possible problems. First of all, this:
@doc = Nokogiri::XML(sympFile)
will slurp the whole XML file into memory as some sort of libxml2 data structure, and that will probably be larger than the raw XML file.
Then you do things like this:
@doc.xpath(...).each
That may not be smart enough to produce an enumerator that just maintains a pointer to the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.
Then you make your copy of what you're interested in:
symptomsList.push([signId, name])
and finally iterate over that array:
symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
I find that SAX parsers work better with large data sets but they are more cumbersome to work with. You could try creating your own SAX parser something like this:
class D < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      @data = { }
    elsif(name == 'ClinicalSign')
      @key = :sign
      @data[@key] = ''
    elsif(name == 'SignFreq')
      @key = :freq
      @data[@key] = ''
    elsif(name == 'Name')
      @in_name = true
    end
  end
  def characters(str)
    @data[@key] += str if(@key && @in_name)
  end
  def end_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      # Dump @data into the database here.
      @data = nil
    elsif(name == 'ClinicalSign')
      @key = nil
    elsif(name == 'SignFreq')
      @key = nil
    elsif(name == 'Name')
      @in_name = false
    end
  end
end
The structure should be pretty clear: you watch for the opening of the elements that you're interested in and do a bit of bookkeeping setup when they open, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the
# Dump @data into the database here.
comment.
This structure makes it pretty easy to watch for the <Disorder id="17601"> elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.
A SAX parser is definitely what you want to be using. If you're anything like me and can't jibe with the Nokogiri documentation, there is an awesome gem called Saxerator that makes this process really easy.
An example for what you are trying to do --
require 'saxerator'
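parser = Saxerator.parser(File.new("Temp.xml"))
parser.for_tag(:DisorderSign).each do |sign|
  # Saxerator uses string keys by default; XML attributes are exposed via #attributes
  signId = sign['ClinicalSign'].attributes['id']
  name = sign['ClinicalSign']['Name'].to_s
  Symptom.create!(:name => name, :signid => signId.to_i)
end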
You're likely running out of memory because symptomsList is getting too large in memory size. Why not perform the SQL within the xpath loop?
require 'nokogiri'
sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()
@doc.xpath("////DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text()
  name = x.at('ClinicalSign').element_children().text()
  Symptom.where(:name => name, :signid => signId.to_i).first_or_create
end
It's possible too that the file is just too large for the buffer to handle. In that case you could chop it up into smaller temp files and process them individually.
You can also use Nokogiri::XML::Reader. It's more memory intensive than the Nokogiri::XML::SAX parser, but you can keep the XML structure, e.g.
class NodeHandler < Struct.new(:node)
  def process
    # Node processing logic
    # e.g.
    signId = node.at('ClinicalSign').attribute('id').text()
    name = node.at('ClinicalSign').element_children().text()
  end
end
Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
  if node.name == 'DisorderSign' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    NodeHandler.new(
      Nokogiri::XML(node.outer_xml).at('./DisorderSign')
    ).process
  end
end
Based on this blog

REXML: Equivalent of javascript-DOM's .innerHTML=

Is there a way to pass a string to an REXML::Element in such a way that the string will be parsed as XML, and the elements so found inserted into the target?
You can extend the REXML::Element class to include innerHTML as shown below.
require "rexml/element"
class REXML::Element
  def innerHTML=(xml)
    require "rexml/document"
    self.to_a.each do |e|
      self.delete e
    end
    d = REXML::Document.new "<root>#{xml}</root>"
    d.root.to_a.each do |e|
      case e
      when REXML::Text
        self.add_text e
      when REXML::Element
        self.add_element e
      else
        puts "ERROR"
      end
    end
    xml
  end
  def innerHTML
    ret = ''
    self.to_a.each do |e|
      ret += e.to_s
    end
    ret
  end
end
You can then use innerHTML as you would in javascript (more or less).
require "rexml/document"
doc = REXML::Document.new "<xml><alice><b>bob</b><chuck>ch<u>u</u>ck</chuck></alice><alice/></xml>"
c = doc.root.get_elements('//chuck').first
t = c.innerHTML
c.innerHTML = "#{t}<david>#{t}</david>"
c = doc.root.get_elements('//alice').last
c.innerHTML = "<david>#{t}</david>"
doc.write( $stdout, 2 )
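If reopening REXML::Element is not desirable, the same steps can live in a standalone helper. The sketch below is one possible shape for that; replace_children is a made-up name, and node types other than text and elements (comments, for instance) are silently dropped here rather than reported.

```ruby
require 'rexml/document'

# Hypothetical helper: replaces all children of +element+ with the nodes
# parsed from the +xml+ fragment, without modifying REXML::Element itself.
def replace_children(element, xml)
  # Remove the existing children (iterate over a copy of the child list).
  element.to_a.each { |child| element.delete(child) }

  # Wrap the fragment so that it parses as a single well-formed document.
  wrapper = REXML::Document.new("<root>#{xml}</root>").root
  wrapper.to_a.each do |node|
    case node
    when REXML::Element then element.add_element(node)
    when REXML::Text    then element.add_text(node)
    end
  end
  element
end

doc = REXML::Document.new('<xml><alice><old/></alice></xml>')
alice = doc.root.elements['alice']
replace_children(alice, 'hi <b>bob</b>')
puts alice.to_s
```

The trade-off is a slightly less JavaScript-like call site in exchange for leaving the library class untouched.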
It would help if you could provide an example to further illustrate exactly what you had in mind.
With JS innerHTML you can insert text or HTML in one shot and changes are immediately displayed in the HTML document. The only way I know how to do this in REXML is with separate steps for inserting content/elements and saving/reloading the document.
To modify the text of a specific REXML Element you can use the text=() method.
# e represents a REXML Element
e.text = "blah"
If you want to insert another element you have to use the add_element() method.
# e represents a REXML Element
b = e.add_element('blah') # adds <blah/> to the existing element and returns the new, empty Element
b.text = 'some text' # add some text to Element blah
Then of course save the XML document with the changes. ruby-doc.org/REXML/Element
text() will return the inner content as a string

Add a class to an element with Nokogiri

Apparently Nokogiri's add_class method only works on NodeSets, making this code invalid:
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
anchor.add_class("whatever") # WHOOPS!
end
What can I do to make this code work? I figured it'd be something like
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
Nokogiri::XML::NodeSet.new(anchor).add_class("whatever")
end
but this doesn't work either. Please tell me I don't have to implement my own add_class for single nodes!
A CSS class is just another attribute on an element:
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
anchor['class']="whatever"
end
Since CSS classes are space-delimited in the attribute, if you're not sure if one or more classes might already exist you'll need something like
anchor['class'] ||= ""
anchor['class'] = anchor['class'] << " whatever"
You need to explicitly set the attribute using = instead of just mutating the string returned for the attribute. This, for example, will not change the DOM:
anchor['class'] ||= ""
anchor['class'] << " whatever"
Even though it results in more work being done, I'd probably do this like so:
class Nokogiri::XML::Node
  def add_css_class( *classes )
    existing = (self['class'] || "").split(/\s+/)
    self['class'] = existing.concat(classes).uniq.join(" ")
  end
end
If you don't want to monkey-patch the class, you could alternatively:
module ClassMutator
  def add_css_class( *classes )
    existing = (self['class'] || "").split(/\s+/)
    self['class'] = existing.concat(classes).uniq.join(" ")
  end
end
anchor.extend ClassMutator
anchor.add_css_class "whatever"
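To see what the split, concat, uniq, join pipeline does to a class attribute without involving Nokogiri at all, here is a small plain-Ruby sketch. merge_classes is a hypothetical helper, not a Nokogiri method. It uses split with no argument rather than split(/\s+/), because the regex form keeps an empty first field when the string starts with whitespace (as it would after appending " whatever" to an empty attribute).

```ruby
# Hypothetical helper mirroring the split / concat / uniq / join steps.
# existing: current value of the class attribute (may be nil)
# added:    class names to add
def merge_classes(existing, *added)
  (existing || '').split.concat(added).uniq.join(' ')
end

p merge_classes(nil, 'whatever')                       # => "whatever"
p merge_classes('btn  primary', 'primary', 'whatever') # => "btn primary whatever"
p merge_classes(' whatever', 'boo')                    # => "whatever boo"
```

The result of such a helper would then be assigned back with anchor['class'] = ..., since the attribute has to be set explicitly.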
Edit: You can see that this is basically what Nokogiri does internally for the add_class method you found by clicking on the class to view the source:
# File lib/nokogiri/xml/node_set.rb, line 136
def add_class name
  each do |el|
    next unless el.respond_to? :get_attribute
    classes = el.get_attribute('class').to_s.split(" ")
    el.set_attribute('class', classes.push(name).uniq.join(" "))
  end
  self
end
Nokogiri's add_class works on a NodeSet, as you found. Trying to add the class inside the each block wouldn't work, though, because at that point you are working on an individual node.
Instead:
require 'nokogiri'
html = '<p>one</p><p>two</p>'
doc = Nokogiri::HTML(html)
doc.search('p').tap{ |ns| ns.add_class('boo') }.each do |n|
puts n.text
end
puts doc.to_html
Which outputs:
# >> one
# >> two
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p class="boo">one</p>
# >> <p class="boo">two</p>
# >> </body></html>
The tap method, implemented in Ruby 1.9+, gives access to the NodeSet itself, allowing the add_class method to add the "boo" class to the <p> tags.
Old thread, but it's the top Google hit. You can now do this with the append_class method without having to mess with space-delimiters:
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
anchor.append_class('whatever')
end

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
    end
  end
  def characters(text)
    puts text
  end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text between the title tags?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
  def initialize
    @in_title = false
  end
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end
  def end_element(element)
    # Doesn't really matter what element we're closing unless there is nesting,
    # then you'd want "@in_title = false if element == 'title'"
    @in_title = false
  end
  def characters(text)
    puts text if @in_title
  end
end
The accepted answer above is correct; however, it has the drawback that it will go through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing the saxy Ruby gem, which aims to be efficient in such situations. Under the hood it uses Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It stops as soon as it finds the first occurrence of the <title> tag.
