Nokogiri: How to get node name with namespace prefix

I am trying (for testing purposes) to parse a Google Merchant XML feed, defined as:
<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="cs" xmlns="http://www.w3.org/2005/Atom" xmlns:g="http://base.google.com/ns/1.0">
<link rel="alternate" type="text/html" href="http://www.example.com"/>
<link rel="self" type="application/atom+xml" href="http://www.example.com/cs/feed/google.xml"/>
<title>EasyOptic</title>
<updated>2014-08-01T16:31:11Z</updated>
<entry>
<title>Sluneční Brýle Producer 1 133a code_color_1 Color 1 133a RayBan</title>
<link href="http://www.example.com/cs/katalog/price-category-1-style-1-optical-glasses-producer-1-rayban-133a-code_color_1-color-1"/>
<summary>Moc krásný a velmi levný produkt</summary>
<updated>2014-08-01T16:31:11Z</updated>
<g:id>EO111</g:id>
<g:condition>new</g:condition>
<g:price>100 Kč</g:price>
<g:availability>in stock</g:availability>
<g:image_link>http://www.example.com/images/fallback/default.png</g:image_link>
<g:additional_image_link>http://www.example.com/images/fallback/default.png</g:additional_image_link>
<g:brand>Producer 1</g:brand>
<g:mpn>EO111</g:mpn>
<g:gender>female</g:gender>
<g:google_product_category>Apparel &amp; Accessories &gt; Clothing Accessories &gt; Sunglasses</g:google_product_category>
<g:product_type>Sluneční Brýle </g:product_type>
</entry>
<entry>
<title>Sluneční Brýle Producer 1 133a code_color_1 Color 1 133a RayBan</title>
<link href="http://www.example.com/cs/katalog/price-category-1-style-1-optical-glasses-producer-1-rayban-133a-code_color_1-color-1"/>
<summary>Moc krásný a velmi levný produkt</summary>
<updated>2014-08-01T16:31:10Z</updated>
<g:id>EO111</g:id>
<g:condition>new</g:condition>
<g:price>100 Kč</g:price>
<g:availability>in stock</g:availability>
<g:image_link>http://www.example.com/images/fallback/default.png</g:image_link>
<g:additional_image_link>http://www.example.com/images/fallback/default.png</g:additional_image_link>
<g:brand>Producer 1</g:brand>
<g:mpn>EO111</g:mpn>
<g:gender>female</g:gender>
<g:google_product_category>Apparel &amp; Accessories &gt; Clothing Accessories &gt; Sunglasses</g:google_product_category>
<g:product_type>Sluneční Brýle </g:product_type>
</entry>
</feed>
with this ruby script:
require 'nokogiri'
def have_node_with_children(body, path_type, path, children_names)
  doc = Nokogiri::XML(body)
  case path_type
  when :xpath
    nodes = doc.xpath(path)
  when :css
    nodes = doc.css(path)
  else
    nodes = doc.xpath(path)
  end
  nodes.each do |node|
    nchildren_names = []
    for child in node.children
      # Nokogiri treats formatting whitespace as a blank node named "text"
      nchildren_names << child.name unless child.to_s.strip == ""
    end
    puts("demanded_nodes: #{children_names.sort.join(", ")} , nodes found: #{nchildren_names.sort.join(", ")} ")
    missing = children_names - nchildren_names
    over = nchildren_names - children_names
    puts("Missing: #{missing.sort.join(", ")} , Over: #{over.sort.join(", ")} ")
  end
end
EXPECTED_ENTRY_NODES=[
'title',
'link',
'summary',
'updated',
'g:id',
'g:condition',
'g:price',
'g:availability',
'g:image_link',
'g:additional_image_link',
'g:brand',
'g:mpn',
'g:gender',
'g:google_product_category',
'g:product_type'
]
file=File.open('google.xml')
have_node_with_children(file.read,:xpath,'//xmlns:entry',EXPECTED_ENTRY_NODES)
It finds the 'entry' node (thanks for this tip).
But when collecting its children, child.name returns the name without the namespace prefix (e.g. the name of <g:brand> is returned as 'brand').
So the comparison with the demanded fields fails.
Does anybody have a tip on how to get the node name together with its namespace prefix?
If I delete the namespace definitions everything works fine, but I cannot change the original XML.
I use this check in an rspec request test, so other namespaces, possibly with identical base node names, can appear.

xml_doc = Nokogiri::XML(xml)
xml_doc.xpath("//xmlns:entry").each do |entry|
  entry.xpath("./*").each do |element| # step through all element nodes that are direct children of <entry>
    prefix = element.namespace && element.namespace.prefix # nil when there is no namespace or it is the default one
    puts prefix ? "#{prefix}:#{element.name}" : element.name
  end
  break # only show output for the first <entry>
end
--output:--
title
link
summary
updated
g:id
g:condition
g:price
g:availability
g:image_link
g:additional_image_link
g:brand
g:mpn
g:gender
g:google_product_category
g:product_type
Now about this:
for child in node.children
A well-grounded Rubyist rarely uses a for loop, because a for loop just calls each(), so Rubyists call each() directly:
node.children.each do |child|

Related

How to filter XML elements by date range in Ruby

I typically use Nokogiri as my XML parser.
I have the following XML:
<albums>
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<classix_nouveaux album="Night People"/>
<release_date value="19820501"/>
</classix_nouveaux>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
</albums>
I want to get all albums that were released between 1/1/1980 and 4/15/1982:
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
How do I filter/query the XML by a release_date range?

Your XML is malformed. After parsing, here's what Nokogiri has to say about it:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: albums line 1 and classix_nouveaux>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]
That's because:
<classix_nouveaux album="Night People"/>
and
<engligh_beat album="I Just Can't Stop It"/>
are self-closing (they end with />), so the later closing tags have nothing to match. Instead they should be:
<classix_nouveaux album="Night People">
and
<engligh_beat album="I Just Can't Stop It">
You can use CSS or XPath selectors to find exact matches, or even sub-string matches, but neither CSS nor XPath 1.0 has a Date type or a range operator, so the straightforward approach is to extract all the nodes, convert each date value into a Date object (or an integer in this case), then compare it to the range:
date_range = 19800501..19820401
selected_albums = doc.search('//release_date').select { |rd| date_range.include?(rd['value'].to_i) }.map { |rd| rd.parent }
selected_albums.map(&:to_xml)
# => ["<aldo_nova album=\"aldo nova\">\n" +
# " <release_date value=\"19820401\"/>\n" +
# "</aldo_nova>",
# "<engligh_beat album=\"I Just Can't Stop It\">\n" +
# " <release_date value=\"19800501\"/>\n" +
# "</engligh_beat>"]
I think your XML is poorly designed because you have varying tag names for what should be an album. <album> should be a child of <albums>. I'd recommend something like this:
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
Once the XML is in a standard form, then it becomes easier to navigate and search:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
EOT
doc.search('album').last['title'] # => "I Just Can't Stop It"
band = 'aldo nova'
doc.search("//album[@band='#{band}']").map { |a| a['title'] } # => ["aldo nova"]
and searching for dates becomes more straightforward because it's not necessary to find the parent of the node:
date_range = 19800501..19820401
selected_albums = doc.search('album').select { |a| date_range.include?(a['release_date'].to_i) }
selected_albums.map(&:to_xml)
# => ["<album band=\"aldo nova\" title=\"aldo nova\" release_date=\"19820401\"/>",
# "<album band=\"english beat\" title=\"I Just Can't Stop It\" release_date=\"19800501\"/>"]
I'd recommend reading some tutorials on XML itself as it's easy to paint ourselves into corners if the data isn't represented logically and correctly.

Can't address XML attribute through XPath in Ruby (using Nokogiri)

I'm trying to filter an XML file to get nodes with a certain attribute. I can successfully filter by node (e.g. //top_manager), but when I try //top_manager[@salary='great'] I get nothing.
<?xml version= "1.0"?>
<employee xmlns="http://www.w3schools.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="employee.xsd">
<top_manager>
<ceo salary="great" respect="enormous" type="extra">
<fname>
Vasya
</fname>
<lname>
Pypkin
</lname>
<hire_date>
19
</hire_date>
<descr>
Big boss
</descr>
</ceo>
<cio salary="big" respect="great" type="intro">
<fname>
Petr
</fname>
<lname>
Pypkin
</lname>
<hire_date>
25
</hire_date>
<descr>
Resposible for information security
</descr>
</cio>
</top_manager>
......
How I need to correct this code to get what I need?
require 'nokogiri'
f = File.open("employee.xml")
doc = Nokogiri::XML(f)
doc.xpath("//top_manager[@salary='great']").each do |node|
puts node.text
end
thank you.

That's because salary is not an attribute of the <top_manager> element; it is an attribute of <top_manager>'s child elements. Note also that the document declares a default namespace, so element names in the XPath need the xmlns: prefix:
//xmlns:top_manager[*[@salary='great']]
The XPath above selects the <top_manager> element that has any child element whose salary attribute equals "great". Or, if you meant to select the children themselves (the <ceo> element in this case):
//xmlns:top_manager/*[@salary='great']

Ruby Hash parsed_response error

BACKGROUND
I am using HTTParty to parse an XML hash response. Unfortunately, when the hash response only has one entry(?), the resulting hash is not indexable. I have confirmed the resulting XML syntax is the same for single and multiple entry(?). I have also confirmed my code works when there are always multiple entries(?) in the hash.
QUESTION
How do I accommodate the single hash entry case and/or is there an easier way to accomplish what I am trying to do?
CODE
require 'httparty'
class Rest
  include HTTParty
  format :xml
end
def test_redeye
  # rooms and devices
  roomID = Hash.new
  deviceID = Hash.new { |h,k| h[k] = Hash.new }
  rooms = Rest.get(@reIp["theater"] + "/redeye/rooms/").parsed_response["rooms"]
  puts "rooms #{rooms}"
  rooms["room"].each do |room|
    puts "room #{room}"
    roomID[room["name"].downcase.strip] = "/redeye/rooms/" + room["roomId"]
    puts "roomid #{roomID}"
    devices = Rest.get(@reIp["theater"] + roomID[room["name"].downcase.strip] + "/devices/").parsed_response["devices"]
    puts "devices #{devices}"
    devices["device"].each do |device|
      puts "device #{device}"
      deviceID[room["name"].downcase.strip][device["displayName"].downcase.strip] = "/devices/" + device["deviceId"]
      puts "deviceid #{deviceID}"
    end
  end
  say "Done"
end
XML - SINGLE ENTRY
<?xml version="1.0" encoding="UTF-8" ?>
<devices>
<device manufacturerName="Philips" description="" portType="infrared" deviceType="0" modelName="" displayName="TV" deviceId="82" />
</devices>
XML - MULTIPLE ENTRY
<?xml version="1.0" encoding="UTF-8" ?>
<devices>
<device manufacturerName="Denon" description="" portType="infrared" deviceType="6" modelName="Avr-3311ci" displayName="AVR" deviceId="77" />
<device manufacturerName="Philips" description="" portType="infrared" deviceType="0" modelName="" displayName="TV" deviceId="82" />
</devices>
RESULTING ERROR
[Info - Plugin Manager] Matches, executing block
rooms {"room"=>[{"name"=>"Home Theater", "currentActivityId"=>"78", "roomId"=>"-1", "description"=>""}, {"name"=>"Living", "currentActivityId"=>"-1", "roomId"=>"81", "description"=>"2nd Floor"}, {"name"=>"Theater", "currentActivityId"=>"-1", "roomId"=>"80", "description"=>"1st Floor"}]}
room {"name"=>"Home Theater", "currentActivityId"=>"78", "roomId"=>"-1", "description"=>""}
roomid {"home theater"=>"/redeye/rooms/-1"}
devices {"device"=>[{"manufacturerName"=>"Denon", "description"=>"", "portType"=>"infrared", "deviceType"=>"6", "modelName"=>"Avr-3311ci", "displayName"=>"AVR", "deviceId"=>"77"}, {"manufacturerName"=>"Philips", "description"=>"", "portType"=>"infrared", "deviceType"=>"0", "modelName"=>"", "displayName"=>"TV", "deviceId"=>"82"}]}
device {"manufacturerName"=>"Denon", "description"=>"", "portType"=>"infrared", "deviceType"=>"6", "modelName"=>"Avr-3311ci", "displayName"=>"AVR", "deviceId"=>"77"}
deviceid {"home theater"=>{"avr"=>"/devices/77"}}
device {"manufacturerName"=>"Philips", "description"=>"", "portType"=>"infrared", "deviceType"=>"0", "modelName"=>"", "displayName"=>"TV", "deviceId"=>"82"}
deviceid {"home theater"=>{"avr"=>"/devices/77", "tv"=>"/devices/82"}}
room {"name"=>"Living", "currentActivityId"=>"-1", "roomId"=>"81", "description"=>"2nd Floor"}
roomid {"home theater"=>"/redeye/rooms/-1", "living"=>"/redeye/rooms/81"}
devices {"device"=>{"manufacturerName"=>"Philips", "description"=>"", "portType"=>"infrared", "deviceType"=>"0", "modelName"=>"", "displayName"=>"TV", "deviceId"=>"82"}}
device ["manufacturerName", "Philips"]
/usr/local/rvm/gems/ruby-1.9.3-p374@SiriProxy/gems/siriproxy-0.3.2/plugins/siriproxy-redeye/lib/siriproxy-redeye.rb:145:in `[]': can't convert String into Integer (TypeError)

There are a couple of options I see. If you control the endpoint, you could modify the XML being sent to accommodate HTTParty's underlying XML parser, Crack, by putting a type="array" attribute on the devices XML element.
Otherwise, you could check to see what class the device is before indexing into it:
case devices["device"]
when Array
  # act on the collection
else
  # act on the single element
end
It's much less than ideal whenever you have to do type-checking in a dynamic language, so if you find yourself doing this more than once it may be worth introducing polymorphism or at the very least extracting a method to do this.
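One way to extract such a method is a tiny helper that always hands back an array. This is just a sketch, and as_list is a hypothetical name. Kernel#Array is deliberately avoided here, because Array(hash) turns a Hash into an array of [key, value] pairs, much like iterating the Hash directly did in the error output above.

```ruby
# Hypothetical helper: wrap a lone Hash (or any non-Array value) in an Array.
def as_list(value)
  case value
  when nil then []
  when Array then value
  else [value]
  end
end

# Shapes as they appear in the parsed responses from the question.
single = { "device" => { "displayName" => "TV", "deviceId" => "82" } }
multiple = { "device" => [{ "displayName" => "AVR", "deviceId" => "77" },
                          { "displayName" => "TV", "deviceId" => "82" }] }

[single, multiple].each do |devices|
  as_list(devices["device"]).each do |device|
    puts "#{device["displayName"].downcase} => /devices/#{device["deviceId"]}"
  end
end
```

With this in place, the loops in the question could iterate over as_list(rooms["room"]) and as_list(devices["device"]) without caring how many entries came back.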

Should Nokogiri::XML.parse be creating separate Text nodes for linefeeds?

I have an XML document created by an outside tool:
<?xml version="1.0" encoding="UTF-8"?>
<suite>
<id>S1</id>
<name>First Suite</name>
<description></description>
<sections>
<section>
<name>section 1</name>
<cases>
<case>
<id>C1</id>
<title>Test 1.1</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
<case>
<id>C2</id>
<title>Test 1.2</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
</cases>
</section>
</sections>
</suite>
From irb, I do the following: (Output suppressed until final command)
> require('nokogiri')
> doc = Nokogiri::XML.parse(open('./test.xml'))
> test_case = doc.search('case').first
=> #<Nokogiri::XML::Element:0x3ff75851bc44 name="case" children=[#<Nokogiri::XML::Text:0x3ff75851b8fc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b7bc name="id" children=[#<Nokogiri::XML::Text:0x3ff75851b474 "C1">]>, #<Nokogiri::XML::Text:0x3ff75851b1cc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b078 name="title" children=[#<Nokogiri::XML::Text:0x3ff75851ad58 "Test 1.1">]>, #<Nokogiri::XML::Text:0x3ff75851aa9c "\n ">, #<Nokogiri::XML::Element:0x3ff75851a970 name="type" children=[#<Nokogiri::XML::Text:0x3ff75851a6c8 "Other">]>, #<Nokogiri::XML::Text:0x3ff7585191d8 "\n ">, #<Nokogiri::XML::Element:0x3ff7585190d4 name="priority" children=[#<Nokogiri::XML::Text:0x3ff758518d64 "4 - Must Test">]>, #<Nokogiri::XML::Text:0x3ff758518ad0 "\n ">, #<Nokogiri::XML::Element:0x3ff7585189a4 name="estimate">, #<Nokogiri::XML::Text:0x3ff758518670 "\n ">, #<Nokogiri::XML::Element:0x3ff758518558 name="milestone">, #<Nokogiri::XML::Text:0x3ff7585182b0 "\n ">, #<Nokogiri::XML::Element:0x3ff758518184 name="references">, #<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">]>
This results in a number of children that look like the following:
#<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">
I want to iterate through these XML nodes without having to do something like:
> real_nodes = test_case.children.reject{|n| n.node_name == 'text' && n.content.strip.empty?}
I couldn't find a parse parameter in the Nokogiri docs to suppress the treating of newlines as separate nodes. Is there a way to do this during the parse instead of after?

Check the documentation. You can just do this:
doc = Nokogiri::XML.parse(open('./test.xml')) do |config|
  config.noblanks
end
That will load the file without any empty nodes.

The text nodes are the result of pretty-printing the XML. The spec doesn't require whitespace between tags, and, for efficiency, a huge XML file could be stripped of inter-tag whitespace to save space and reduce transfer time, without sacrificing the data content.
This might show what's happening:
require 'nokogiri'
xml = '<foo></foo>'
Nokogiri::XML(xml).at('foo').child
=> nil
With no whitespace between the tags there's no text node either.
xml = '<foo>
</foo>'
Nokogiri::XML(xml).at('foo').child
=> #<Nokogiri::XML::Text:0x3fcee9436ff0 "\n">
Nokogiri::XML(xml).at('foo').child.class
=> Nokogiri::XML::Text
With whitespace for pretty-printing, the XML has a text node following the foo tag.

XPath Query to select hyperlink

The following is a subset of xml from a twitter atom feed:
<entry>
<id>tag:search.twitter.com,2005:18232030105964545</id>
<published>2010-12-24T09:10:29Z</published>
<link type="text/html" rel="alternate" href="http://twitter.com/KTNKenya/statuses/18232030105964545"/>
<title>Synovate Poll: PM Raila Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... http://fb.me/yjmMbmBx</title>
<content type="html">Synovate Poll: PM <b>Raila</b> Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... <a href="http://fb.me/yjmMbmBx">http://fb.me/yjmMbmBx</a></content>
<updated>2010-12-24T09:10:29Z</updated>
<link type="image/png" rel="image" href="http://a3.twimg.com/profile_images/701825859/NEW_KTN_normal.png"/>
<google:location>nairobi, kenya</google:location>
<twitter:geo>
</twitter:geo>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>KTNKenya (KTN Kenya)</name>
<uri>http://twitter.com/KTNKenya</uri>
</author>
</entry>
From the <title>...</title> element, I need to select the hyperlink http://fb.me/yjmMbmBx via an XPath query. How do I do it? Is it possible?
*I'm an XPath newbie.
Thanks.

You have two options:
1. Use <title> (XPath: "/entry/title/text()") and extract the URL yourself (e.g. using a regex or finding the last instance of "http://" in the string).
2. Get the data first:
/entry/content[@type="html"]/text()
Then parse this as HTML, extract the <a> tags, and use the href attribute of those tags. How you do this last part depends on the language/environment you are working in.
Update: Added basic example code for option 1 above, as requested:
xmlpp::Element *node = parser.get_document()->get_root_node();
xmlpp::NodeSet results = node->find("/entry/title/text()");
xmlpp::ContentNode* content = dynamic_cast<xmlpp::ContentNode*>(results.front());
std::string text = content->get_content();
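std::string link;
// rfind returns std::string::npos (an unsigned value) when nothing is found
std::string::size_type res = text.rfind("http://");
if (res == std::string::npos)
    res = text.rfind("https://");
if (res != std::string::npos)
    link = text.substr(res);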

With atom prefix bound to http://www.w3.org/2005/Atom namespace URI, use:
/atom:feed/atom:entry/atom:title[contains(.,'http://')]
This selects every atom:title element child of atom:entry, having the string "http://" contained in its string value.
