Why does Nokogiri's to_xhtml create new `id` attributes from `name`?

Consider the following code:
require 'nokogiri' # v1.5.2
doc = Nokogiri.XML('<body><a name="foo">ick</a></body>')
puts doc.to_html
#=> <body><a name="foo">ick</a></body>
puts doc.to_xml
#=> <?xml version="1.0"?>
#=> <body>
#=> <a name="foo">ick</a>
#=> </body>
puts doc.to_xhtml
#=> <body>
#=> <a name="foo" id="foo">ick</a>
#=> </body>
Notice the new id attribute that has been created.
Who is responsible for this, Nokogiri or libxml2?
Why does this occur? (Is this enforcing a standard?)
The closest I can find is this spec describing how you may put both an id and name attribute with the same value.
Is there any way to avoid this, given the desire to use the to_xhtml method on input that may have <a name="foo">?
This problem arises because I have some input I am parsing with an id attribute on one element and a separate element with a name attribute that happens to conflict.

Apparently it's a feature of libxml2. In http://www.w3.org/TR/xhtml1/#h-4.10 we find:
In XML, fragment identifiers are of type ID, and there can only be a single attribute of type ID per element. Therefore, in XHTML 1.0 the id attribute is defined to be of type ID. In order to ensure that XHTML 1.0 documents are well-structured XML documents, XHTML 1.0 documents MUST use the id attribute when defining fragment identifiers on the elements listed above.
[...]
Note that in XHTML 1.0, the name attribute of these elements is formally deprecated, and will be removed in a subsequent version of XHTML.
The best 'workaround' I've come up with is:
# Destroy all <a name="..."> elements, replacing with children
# if another element with a conflicting id already exists in the document
doc.xpath('//a[@name][not(@id)][not(@href)]').each do |a|
  a.replace(a.children) if doc.at_css("##{a['name']}")
end

Perhaps you could add some other id value to these elements to prevent libxml adding its own.
doc.xpath('//a[@name and not(@id)]').each do |n|
  n['id'] = n['name'] + 'some_suffix'
end
(Obviously you’ll need to determine how to create a unique id value for your document).

Related

Is there a way to access Nokogiri::XML::Attr by using a symbol key, not a string key

I have code like this:
require 'nokogiri'
url = ENV['URL']
doc = Nokogiri::HTML(open(url))
link = doc.css('a#foo').attr('href').value
I want to access the Nokogiri::XML::Attr using a symbol, like this:
doc = Nokogiri::HTML(open(url), hash_key_symbol: true)
link = doc.css('a#foo').attr(:href).value
I couldn't find information for it, but maybe I've overlooked it.
Is there an option like this?
You are calling attr on the NodeSet returned from css, not on a single Node object. attr on a Node will accept a symbol to specify the attribute, and has done for a while, but it looks like the corresponding change hasn’t been made to NodeSet#attr. Note that the NodeSet version of attr is for setting the attribute on all nodes in the set, and will only return the value of the attribute on the first node it contains if you don’t specify a value.
You can use at_css to explicitly select only the first matching node of your query, then you can use a symbol. Note that Node#attr is an alias for Node#[] and returns the attribute's value as a String, so no .value call is needed:
doc.at_css('a#foo').attr(:href)
Alternatively you could select the node from the node set by its index:
doc.css('a#foo')[0].attr(:href)
The simplest way to access an attribute of a tag is to use the "hash" [] syntax:
require 'nokogiri'
html = <<EOT
<html>
<body>
<a id="foo" href="blah">bar</a>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
doc.at('a#foo')['href'] # => "blah"
But we can use a symbol:
doc.at('a#foo')[:href] # => "blah"
Note that at('a#foo') is equivalent to search('a#foo').first; both return a Node. search and its CSS and XPath variants return NodeSets, which don't have the ability to return the attribute of a specific node or of all nodes. To process multiple nodes, use map:
html = <<EOT
<html>
<body>
<a class="foo" href="blah">bar1</a>
<a class="foo" href="more_blah">bar2</a>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
doc.search('a.foo').map{ |n| n['href'] } # => ["blah", "more_blah"]

Get all attributes for elements in XML file

I'm trying to parse a file and get all of the attributes for each <row> tag in the file. The file looks generally like this:
<?xml version="1.0" standalone="yes"?>
<report>
<table>
<columns>
<column name="month"/>
<column name="campaign"/>
<!-- many columns -->
</columns>
<rows>
<row month="December 2009" campaign="Campaign #1"
adgroup="Python" preview="Not available"
headline="We Write Apps in Python"
and="many more attributes here" />
<row month="December 2009" campaign="Campaign #1"
adgroup="Ruby" preview="Not available"
headline="We Write Apps in Ruby"
and="many more attributes here" />
<!-- many such rows -->
</rows></table></report>
Here is the full file: http://pastie.org/7268456#2.
I've looked at every tutorial and answer I can find on various help boards, but they all assume the same thing: that I'm searching for one or two specific tags and just need one or two values for those tags. I actually have 18 attributes for each <row> tag, and I have a MySQL table with a column for each of the 18 attributes. I need to put the information into an object/hash/array that I can use to insert into the table with ActiveRecord/Ruby.
I started out using Hpricot; you can see the code (which is not relevant) in the edit history of this question.
require 'nokogiri'
doc = Nokogiri.XML(my_xml_string)
doc.css('row').each do |row|
  # row is a Nokogiri::XML::Element
  row.attributes.each do |name,attr|
    # name is a string
    # attr is a Nokogiri::XML::Attr
    p name => attr.value
  end
end
#=> {"month"=>"December 2009"}
#=> {"campaign"=>"Campaign #1"}
#=> {"adgroup"=>"Python"}
#=> {"preview"=>"Not available"}
#=> {"headline"=>"We Write Apps in Python"}
#=> etc.
Alternatively, if you just want an array of hashes mapping attribute names to string values:
rows = doc.css('row').map{ |row| Hash[ row.attributes.map{|n,a| [n,a.value]} ] }
#=> [
#=> {"month"=>"December 2009", "campaign"=>"Campaign #1", "adgroup"=>"Python", … },
#=> {"month"=>"December 2009", "campaign"=>"Campaign #1", "adgroup"=>"Ruby", … },
#=> …
#=> ]
The Nokogiri.XML method is the simplest way to parse an XML string and get a Nokogiri::XML::Document back.
The css method is the simplest way to find all the elements with a given name (ignoring their containment hierarchy and any XML namespaces). It returns a Nokogiri::XML::NodeSet, which is very similar to an array.
Each Nokogiri::XML::Element has an attributes method that returns a Hash mapping the name of the attribute to a Nokogiri::XML::Attr object containing all the information about the attribute (name, value, namespace, parent element, etc.)

Get text directly inside a tag in Nokogiri

I have some HTML that looks like:
<dt>
<a href="...">Hello</a>
(2009)
</dt>
I already have all my HTML loaded into a variable called record. I need to parse out the year i.e. 2009 if it exists.
How can I get the text inside the dt tag but not the text inside the a tag? I've used record.search("dt").inner_text and this gives me everything.
It's a trivial question but I haven't managed to figure this out.
To get all the direct children with text, but not any further sub-children, you can use XPath like so:
doc.xpath('//dt/text()')
Or if you wish to use search:
doc.search('dt').xpath('text()')
Using XPath to select exactly what you want (as suggested by @Casper) is the right answer.
def own_text(node)
# Find the content of all child text nodes and join them together
node.xpath('text()').text
end
Here's an alternative, fun answer :)
def own_text(node)
node.clone(1).tap{ |copy| copy.element_children.remove }.text
end
Seen in action:
require 'nokogiri'
root = Nokogiri.XML('<r>hi <a>BOO</a> there</r>').root
puts root.text #=> hi BOO there
puts own_text(root) #=> hi there
The dt element has two children, so you can access it by:
doc.search("dt").children.last.text

Calling super's method (add namespace to Nokogiri XML document)

I have an XML document which is missing some namespace declaration. I know I can define it when I use doc.xpath() method, like the following:
doc.xpath('//dc:title', 'dc' => 'http://purl.org/dc/elements/1.1/')
However I would like to add it once since I have a lot of xpath calls.
I found out that my Nokogiri::XML::Document is inherited from Nokogiri::XML::Node. And the Node class contains an add_namespace() method. However I can't call it, because it says it is undefined.
Is this because Ruby does not allow calling parent class's functions? Is there a way to go around this?
EDIT
I add the following console example:
> c = Nokogiri.XML(doc_text)
> c.class
=> Nokogiri::XML::Document
> c.add_namespace('a','b')
NoMethodError: undefined method `add_namespace' for #<Nokogiri::XML::Document:0x007fea4ee22c60>
And here is the API document about Nokogiri::XML class
EDIT again:
The original doc was valid xml like this:
<root xmlns:ra="...">
<item>
<title/>
<ra:price/>
</item>
<item>...
</root>
However, there are too many items, and I have to create one object for each of them, serialize it, and save it in the database. So for each object I took the item node, turned it into a string, and saved it in the object.
Now, after I revive the object from the DB and want to parse the item node again, I run into this namespace issue.
While Nokogiri::XML::Document does inherit from Nokogiri::XML::Node, some methods are explicitly removed at the document level, including add_namespace:
https://github.com/tenderlove/nokogiri/blob/master/lib/nokogiri/xml/document.rb#L203
As @pguardiario notes, you want to add namespaces to the root element, not the document.
However, doing this after parsing the document is too late. Nokogiri has already created the nodes, discarding the namespaces:
require 'nokogiri'
xml = "<r><a:b/></r>"
doc = Nokogiri.XML(xml)
p doc.at('b').namespace
#=> nil
doc.root.add_namespace 'a', 'foo'
puts doc
#=> <?xml version="1.0"?>
#=> <r xmlns:a="foo">
#=> <b/>
#=> </r>
You'll need to fix your source XML as a string before parsing with Nokogiri. (Unless there's some way with the SAX parser to add the namespace when you hit the first node, before moving on.)

How to parse XML to CSV where data is in attributes only

The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
  i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
puts c.join('|')
end
I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
  pi.css('|prescribed_property').each do |pp|
    data << [
      pi['class_ref'],
      pp['property_ref'],
      pp['is_required'],
      pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
    ]
  end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
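If a value could ever contain the delimiter itself, Ruby's csv library can handle the quoting instead of a plain join. A minimal sketch, with the rows written out literally rather than parsed so it stands alone:

```ruby
require 'csv'

data = [
  %w[class_ref property_ref is_required UOM_ref],
  ['0161-1#01-765557#1', '0161-1#02-016058#1', 'false', '0161-1#05-003260#1'],
  ['0161-1#01-765557#1', '0161-1#02-016059#1', 'false', '0161-1#05-003260#1']
]

# col_sep switches the delimiter from a comma to a pipe
output = CSV.generate(col_sep: '|') do |csv|
  data.each { |row| csv << row }
end

puts output
```

For this data the result matches the join-based output; the difference only shows up when a field contains a pipe, a quote, or a newline.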
Could you explain this line in greater detail "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"
In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect it's to avoid confusion with the regular use of the operators; at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness; They support namespaces, like the XPath accessors do, only they use |. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match "h3" tags that have a class attribute, we write:
h3[@class]
To match "h3" tags whose class attribute is equal to the string "r", we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]
