Parsing XML with Ruby - ruby

I'm way new to working with XML but just had a need dropped in my lap. I have been given an usual (to me) XML format. There are colons within the tags.
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name</PART1:Name>
</THING1:things>
It is a large file and there is much more to it than this but I hope this format will be familiar to someone. Does anyone know a way to approach an XML document of this sort?
I'd rather not just write a brute-force way of parsing the text but I can't seem to make any headway with REXML or Hpricot and I suspect it is due to these unusual tags.
my ruby code:
require 'hpricot'
xml = File.open( "myfile.xml" )
doc = Hpricot::XML( xml )
(doc/:things).each do |thg|
[ 'Id', 'Name' ].each do |el|
puts "#{el}: #{thg.at(el).innerHTML}"
end
end
...which is just lifted from: http://railstips.org/blog/archives/2006/12/09/parsing-xml-with-hpricot/
And I figured I would be able to figure some stuff out from here but this code returns nothing. It doens't error. It just returns.

As #pguardiario mentioned, Nokogiri is the de facto XML and HTML parsing library. If you wanted to print out the Id and Name values in your example, here is how you would do it:
require 'nokogiri'
xml_str = <<EOF
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name</PART1:Name>
</THING1:things>
EOF
doc = Nokogiri::XML(xml_str)
thing = doc.at_xpath('//things')
puts "ID = " + thing.at_xpath('//Id').content
puts "Name = " + thing.at_xpath('//Name').content
A few notes:
at_xpath is for matching one thing. If you know you have multiple items, you want to use xpath instead.
Depending on your document, namespaces can be problematic, so calling doc.remove_namespaces! can help (see this answer for a brief discussion).
You can use the css methods instead of xpath if you're more comfortable with those.
Definitely play around with this in irb or pry to investigate methods.
Resources
Parsing an HTML/XML document
Getting started with Nokogiri
Update
To handle multiple items, you need a root element, and you need to remove the // in the xpath query.
require 'nokogiri'
xml_str = <<EOF
<root>
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name1</PART1:Name>
</THING1:things>
<THING2:things type="Container">
<PART2:Id type="Property">2234</PART2:Id>
<PART2:Name type="Property">The Name2</PART2:Name>
</THING2:things>
</root>
EOF
doc = Nokogiri::XML(xml_str)
doc.xpath('//things').each do |thing|
puts "ID = " + thing.at_xpath('Id').content
puts "Name = " + thing.at_xpath('Name').content
end
This will give you:
Id = 1234
Name = The Name1
ID = 2234
Name = The Name2
If you are more familiar with CSS selectors, you can use this nearly identical bit of code:
doc.css('things').each do |thing|
puts "ID = " + thing.at_css('Id').content
puts "Name = " + thing.at_css('Name').content
end

If in a Rails environment, the Hash object is extended and one can take advantage of the the method from_xml:
xml = File.open("myfile.xml")
data = Hash.from_xml(xml)

Related

Nokogiri - Checking if the value of an xpath exists and is blank or not in Ruby

I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv","w",{:force_quotes => true}) do |out|
out << headers
Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
gdsn_doc = GDSNDoc.new(xml_file)
logger.info("Processing xml file #{xml_file}")
:x
#desc_exists = #gdsn_doc.xpath("//productData/description")
if !#desc_exists.empty?
row = []
headers.each do |col|
row << product[col]
end
out << row
end
end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
#desc_exists = #gdsn_doc.xpath("//productData/description")
if !#desc_exists.empty?
Here is a sample of the XML file:
<productData>
<description>Chocolate biscuits </description>
<productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
def initialize(xml_file)
#doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
#doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri was initialised. It doesn't get runtime errors, but it does let XML files with blank descriptions get through and it shouldn't.
class GDSNDoc
def initialize(xml_file)
#doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
#doc.remove_namespaces!
desc_exists = #doc.xpath("//productData/descriptions")
if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
then use it like this:
#desc_exists = #gdsn_doc.xpath("//productData/description")
#gdsn_doc and gdsn_doc are two different things in Ruby - try just using the version without the #:
#desc_exists = gdsn_doc.xpath("//productData/description")
The basic test is to use:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<productData>
<description>Chocolate biscuits </description>
<productData>
EOT
# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
#desc_exists = #gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as #gdsn_doc. The later doesn't appear to have been initialized.
#doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
While that should work, it's idiomatic to write it as:
#doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading then passing the content to something else for processing, hence the use of File.read(...) which slurps the file. (Slurping isn't necessary a good practice because it can have scalability problems, but for reasonable sized XML/HTML it's OK because it's easier to use DOM-based parsing than SAX.)
If Nokogiri doesn't raise an exception it was able to parse the content, however that still doesn't mean the content was valid. It's a good idea to check
#doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
Nokogiri includes a command-line version nokogiri that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.

xpath search using libxml + ruby

I am trying to search for a specific node in an XML file using XPath. This search worked just fine under REXML but REXML was too slow for large XML docs. So moved over to LibXML.
My simple example is processing a Yum repomd.xml file, an example can be found here: http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml
My test script is as follows:
require 'rubygems'
require 'libxml'
p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.find_first("/repomd/data[#type='filelists']/location#href")
puts "Length: " + filelist.length.to_s
filelist.each do |f|
puts f.attributes['href']
end
I get this error:
Error: Invalid expression.
/usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find': Error: Invalid expression. (LibXML::XML::Error)
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find'
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:130:in `find_first'
from /tmp/scripty.rb:6
I have also tried simpler examples like below, but still no dice.
p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.root.find(".//location")
puts "Length: " + filelist.length.to_s
In the above case I get the output:
Length: 0
Your inspired guidance would be greatly appreciated, and I have searched for what I am doing wrong, and I just can't figure it out...
Here is some code that will fetch the file and process it, still doesn't work...
require 'rubygems'
require 'open-uri'
require 'libxml'
raw_xml = open('http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml').read
p = LibXML::XML::Parser.string(raw_xml)
repomd = p.parse
filelist = repomd.find_first("//data[#type='filelists']/location[#href]")
puts "First: " + filelist
In the end I reverted back to REXML and used stream processing. Much faster and much easier XPath syntax implementation.
Looking at your code,it seems you want to collect only those location elements which has href attribute. If that's the case below should work:
"//data[#type='filelists']/location[#href]"

How to parse XML to CSV where data is in attributes only

The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
config.strict
end
content = doc.xpath("//ig:prescribed_item/#class_ref").map {|i|
i.search("//ig:prescribed_item/ig:prescribed_property/#property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
puts c.join('|')
end
I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
pi.css('|prescribed_property').each do |pp|
data << [
pi['class_ref'],
pp['property_ref'],
pp['is_required'],
pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
]
end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
Could you explain this line in greater detail "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"
In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As a FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatability; We used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect it's to avoid confusion with the regular use of the operators, at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness; They support namespaces, like the XPath accessors do, only they use |. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match “h3″ tags that have a class
attribute, we write:
h3[#class]
To match “h3″ tags whose class
attribute is equal to the string “r”,
we write:
h3[#class = "r"]
Using the attribute matching
construct, we can modify our previous
query to:
//h3[#class = "r"]/a[#class = "l"]

Optimizing Ruby RSS

I'm writing a very simple Ruby script to parse tweets out of a twitter RSS feed. Here's the code I have:
require 'rss'
#rss = RSS::Parser.parse('statuses.xml', false)
outputfile = open("output.txt", "w")
#rss.items.each do |i|
pubdate = i.published.to_s
if pubdate.include? '2011-05'
tweet = i.title.to_s
tweet = tweet.gsub(/<title>SlyFlourish: /, "")
tweet = tweet.gsub(/<\/title>/, "\n\n")
outputfile << tweet
end
end
I think I'm missing something about dealing with the objects coming out of the RSS parser. Can someone tell me how I can better pull out the title and date entries from the object returned by the parser?
Is there a reason you chose RSS? Parsing XML is expensive.
I'd consider using JSON instead.
There's also a twitter Ruby gem that makes this really easy:
require "twitter"
Twitter.user_timeline("gavin_morrice").each do |tweet|
puts tweet.text
puts tweet.created_at
end

Using Nokogiri with XML files in Ruby

I have this XML:
<Experiment>
<mzData version="1.05" accessionNumber="1635">
<description>
<admin>
<sampleName>Fas-induced and control Jurkat T-lymphocytes</sampleName>
<sampleDescription>
<cvParam cvLabel="MeSH" accession="D017209" name="apoptosis" />
<cvParam cvLabel="UNITY" accession="D2135" name="Jurkat cells" />
<cvParam cvLabel="MeSH" accession="D019014" name="Antigens, CD95" />
</sampleDescription>
</admin>
</description>
</mzData>
</Experiment>
</ExperimentCollection>
I also have the following code:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("my.xml"))
sampleName = doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleName" ).text
sampleDescription = doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleDescription/MeSH/#accession" ).text
puts sampleName + " " + sampleDescription
foo = sampleName + " " + sampleDescription
f = File.new("my.txt","w")
f.write(foo)
f.close()
The code grabs the sampleName just fine, but not the accession letters/numbers. I only want to grab all the letters/numbers after MeSH -> accession (D017209 and D019014). What do I have to change in the doc.xpath command to make this work?
doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleDescription/MeSH/#accession" )
Returns nothing because there is no tag MeSH. You need to replace MeSH with cvParam[#cvLabel=\"MeSH\"] (read: a cvParam tag which has an attribute cvLabel with the value MeSH).
Once you fixed that xpath will return a collection of Nokogiri::XML::Attr objects. By calling text on that collection you will get back the string value of the first element. Since you want all of the elements you should instead use map(&:text) (or map {|n| n.text} in ruby 1.8.6) which will return an array containing the string value of each accession attribute (i.e. ["D017209", "D019014"] for the example XML-file).
Since you seem to be confused, here's a clarification:
#Bobby: When I said "xpath will return a collection of Nokogiri::XML::Attr objects", I meant just that. You call xpath and then xpath creates and returns a collection of Attr objects. In no way did I mean that you should manually create any Attr objects yourself.
And when I said you should use map, I just meant you should call map on the collection returned by xpath (though instead of using map you can just call puts with the collection as an argument).
So what you need to do is 1. fix your xpath like I described.
use xpath with the fixed xpath to get a collection
use puts to print it
In other words:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("my.xml"))
common_prefix = "/ExperimentCollection/Experiment/mzData/description/admin"
sample_name = doc.xpath( common_prefix+"/sampleName" ).text
accessions = doc.xpath( common_prefix+
"/sampleDescription/cvParam[#cvLabel=\"MeSH\"]/#accession" )
puts sample_name
puts accessions
Here is a simple way to do it, although this is probably too clever, because you'll probably want to do other things as well:
File.open("my.txt","w") do |f|
doc.xpath('//cvParam[#cvLabel="MeSH"]').each {|n| f << "#{n['name']} #{n['accession']}\n"}
end
You may need a more selective xpath statement.

Resources