Nokogiri check XML root/file validity - ruby

Is there a simple method/way to check if a Nokogiri XML file has a proper root, like xml.valid? A way to check if the XML file contains specific content is very welcome as well.
I'm thinking of something like xml.valid? or xml.has_valid_root?. Thanks!

How are you going to determine what is a proper root?
<foo></foo>
has a proper root:
require 'nokogiri'
xml = '<foo></foo>'
doc = Nokogiri::XML(xml)
doc.root # => #<Nokogiri::XML::Element:0x3fd3a9471b7c name="foo">
Nokogiri has no way of determining that something else should have been the root. You might be able to test if you have foreknowledge of what the root node's name should be:
doc_root_ok = (doc.root.name == 'foo')
doc_root_ok # => true
You can see if the document parsed was well-formed (not needing any fixup), by looking at errors:
doc.errors # => []
If Nokogiri had to modify the document just to parse it, errors will return a list of the syntax errors it recovered from while parsing:
xml = '<foo><bar><bar></foo>'
doc = Nokogiri::XML(xml)
doc.errors # => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: bar line 1 and foo>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag bar line 1>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]

A common and useful pattern is
doc = Nokogiri::XML(xml) do |config|
config.strict
end
This will raise a Nokogiri::XML::SyntaxError if the document is not well formed. I like to do this in order to prevent Nokogiri from being too kind to my XML.

Related

How do I select an attribute from a Nokogiri::XML.parse result set element [duplicate]

I have a simple task of accessing the values of some attributes. This is a simple script that uses Nokogiri::XML::Builder to create a simple XML doc.
require 'nokogiri'
builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
xml.Placement(:messageId => "392847-039820-938777", :system => "MOD", :version => "2.0") {
xml.objects {
xml.object(:myattribute => "99", :anotherattrib => "333")
xml.nextobject_ '9387toot'
xml.Entertainment "Last Man Standing"
}
}
end
puts builder.to_xml
puts builder.root.attributes["messageId"]
The results are:
<?xml version="1.0" encoding="UTF-8"?>
<Placement messageId="392847-039820-938777" version="2.0" system="MOD">
<objects>
<object anotherattrib="333" myattribute="99"/>
<nextobject>9387toot</nextobject>
<Entertainment>Last Man Standing</Entertainment>
</objects>
</Placement>
C:/Ruby/lib/ruby/gems/1.8/gems/nokogiri-1.4.2-x86-mingw32/lib/nokogiri/xml/document.rb:178:in `add_child': Document already has a root node (RuntimeError)
from C:/Ruby/lib/ruby/gems/1.8/gems/nokogiri-1.4.2-x86-mingw32/lib/nokogiri/xml/node.rb:455:in `parent='
from C:/Ruby/lib/ruby/gems/1.8/gems/nokogiri-1.4.2-x86-mingw32/lib/nokogiri/xml/builder.rb:358:in `insert'
from C:/Ruby/lib/ruby/gems/1.8/gems/nokogiri-1.4.2-x86-mingw32/lib/nokogiri/xml/builder.rb:350:in `method_missing'
from C:/Documents and Settings/etrojan/workspace/Lads/tryXPATH2.rb:15
The XML that is generated looks fine. However, my attempts to access attributes cause an error to be generated:
Document already has a root node
I don't understand why puts would cause this error.
Using Nokogiri::XML::Reader works for your example, but probably isn't the full answer you are looking for (Note that there is no attributes method for Builder).
reader = Nokogiri::XML::Reader(builder.to_xml)
reader.read #Moves to next node in document
reader.attribute("messageId")
Note that if you issued reader.read again and then tried reader.attribute("messageId") the result will be nil since the current node will not have this attribute.
What you probably want to do is use Nokogiri::XML::Document if you want to search an XML document by attribute.
doc = Nokogiri::XML(builder.to_xml)
elems = doc.xpath("//*[@messageId]") # get all elements with an attribute of 'messageId'
elems[0].attr('messageId') # gets the value of the attribute of the first element
Here is a slightly more succinct way to access attributes using Nokogiri (assuming you already have your xml stored in a variable called xml, as covered by @atomicules' answer):
xml.xpath("//Placement").attr("messageId")

Nokogiri - Checking if the value of an xpath exists and is blank or not in Ruby

I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv","w",{:force_quotes => true}) do |out|
out << headers
Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
gdsn_doc = GDSNDoc.new(xml_file)
logger.info("Processing xml file #{xml_file}")
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
row = []
headers.each do |col|
row << product[col]
end
out << row
end
end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
Here is a sample of the XML file:
<productData>
<description>Chocolate biscuits </description>
</productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri was initialised. It doesn't get runtime errors, but it does let XML files with blank descriptions get through and it shouldn't.
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
desc_exists = @doc.xpath("//productData/descriptions")
if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
then use it like this:
@desc_exists = @gdsn_doc.xpath("//productData/description")
@gdsn_doc and gdsn_doc are two different things in Ruby - try just using the version without the @:
@desc_exists = gdsn_doc.xpath("//productData/description")
The basic test is to use:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<productData>
<description>Chocolate biscuits </description>
</productData>
EOT
# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)"
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
@desc_exists = @gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as @gdsn_doc. The latter doesn't appear to have been initialized.
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
While that should work, it's idiomatic to write it as:
@doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading then passing the content to something else for processing, hence the use of File.read(...) which slurps the file. (Slurping isn't necessarily a good practice because it can have scalability problems, but for reasonably sized XML/HTML it's OK because it's easier to use DOM-based parsing than SAX.)
If Nokogiri doesn't raise an exception, it was able to parse the content. However, that still doesn't mean the content was valid. It's a good idea to check
@doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
Nokogiri includes a command-line version nokogiri that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.

How to parse XML to CSV where data is in attributes only

The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
puts c.join('|')
end
I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
pi.css('|prescribed_property').each do |pp|
data << [
pi['class_ref'],
pp['property_ref'],
pp['is_required'],
pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
]
end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
Could you explain this line in greater detail "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"
In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect it's to avoid confusion with the regular use of the operators, at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness; they support namespaces, like the XPath accessors do, only they use |. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match "h3" tags that have a class attribute, we write:
h3[@class]
To match "h3" tags whose class attribute is equal to the string "r", we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]

How do I tell the line number for a node using the Nokogiri reader interface?

I'm trying to write a Nokogiri script that will grep XML for text nodes containing ASCII double-quotes («"»). Since I want a grep-like output I need the line number, and the contents of each line. However, I am unable to see how to tell the line number where the element starts at. Here is my code:
require 'rubygems'
require 'nokogiri'
ARGV.each do |filename|
xml_stream = File.open(filename)
reader = Nokogiri::XML::Reader(xml_stream)
titles = []
text = ''
grab_text = false
reader.each do |elem|
if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
data = elem.value
lines = data.split(/\n/, -1);
lines.each_with_index do |line, idx|
if (line =~ /"/) then
STDOUT.printf "%s:%d:%s\n", filename, elem.line()+idx, line
end
end
end
end
end
elem.line() does not work.
XML and parsers don't really have a concept of line numbers. You're talking about the physical layout of the file.
You can play a game with the parser using accessors looking for text nodes containing linefeeds and/or carriage returns but that can be thrown off because XML allows nested nodes.
require 'nokogiri'
xml =<<EOT_XML
<atag>
<btag>
<ctag
id="another_node">
other text
</ctag>
</btag>
<btag>
<ctag id="another_node2">yet
another
text</ctag>
</btag>
<btag>
<ctag id="this_node">this text</ctag>
</btag>
</atag>
EOT_XML
doc = Nokogiri::XML(xml)
# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"
# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12
# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]
This works because of the parser's ability to figure out what is a text node and cleanly return it, without relying on regex or text matches.
EDIT: I went through some old code, snooped around in Nokogiri's docs some, and came up with the above edited changes. It's working correctly, including working with some pathological cases. Nokogiri FTW!
As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.
It appears to use the libxml2 function xmlGetLineNo().
require 'nokogiri'
doc = Nokogiri::XML(open 'tmpfile.xml')
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
puts '%4d %s' % [node.line, node['name']]
end
NOTE: If you are working with large XML files (> 65535 lines), be sure to use Nokogiri 1.13.0 or newer (released 2022-01-06), or your Node#line results will not be accurate for large line numbers. See PR 2309 for an explanation.
