Hpricot + Ruby XML parsing and logical selection.
Objective: Find all titles written by the author Bob.
My XML file:
<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>
<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>
<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
</channel>
</rss>
# My Hpricot code. Running it returns no output, although the search pattern works on its own.
(doc % :rss % :channel / :item).each do |item|
a=item.search("author[text()*='Bob']")
#puts "FOUND" if a.include?"Bob"
puts item.at("title") if a.include?"Bob"
end
If you're not set on Hpricot, here's one way to do this with XPath in Nokogiri:
require 'nokogiri'
doc = Nokogiri::XML( my_rss_string )
bobs_titles = doc.xpath("//title[parent::item/author[text()='Bob']]")
p bobs_titles.map{ |node| node.text }
#=> ["Book1"]
Edit: @theTinMan's XPath also works well, is more readable, and may very well be faster:
bobs_titles = doc.xpath("//author[text()='Bob']/../title")
One of the ideas behind XPath is it allows us to navigate a DOM similarly to a disk directory:
require 'hpricot'
xml = <<EOT
<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>
<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>
<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
<item>
<title>Book4</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>
</channel>
</rss>
EOT
doc = Hpricot(xml)
titles = (doc / '//author[text()="Bob"]/../title' )
titles # => #<Hpricot::Elements[{elem <title> "Book1" </title>}, {elem <title> "Book4" </title>}]>
That means: "find every author element whose text is Bob, then go up one level and find the title tag".
I added an extra book by "Bob" to test getting all occurrences.
To get the item containing a book by Bob, just move back up a level:
items = (doc / '//author[text()="Bob"]/..' )
puts items # => nil
# >> <item>
# >> <title>Book1</title>
# >> <pubdate>march 1 2010</pubdate>
# >> <author>Bob</author>
# >> </item>
# >> <item>
# >> <title>Book4</title>
# >> <pubdate>march 1 2010</pubdate>
# >> <author>Bob</author>
# >> </item>
I also figured out what (doc % :rss % :channel / :item) is doing. It's equivalent to nesting the searches, minus the wrapping parentheses, and these should all be the same in Hpricot-ese:
(doc % :rss % :channel / :item).size # => 4
(((doc % :rss) % :channel) / :item).size # => 4
(doc / '//rss/channel/item').size # => 4
(doc / 'rss channel item').size # => 4
Because '//rss/channel/item' is how you'd normally see an XPath accessor, and 'rss channel item' is a CSS accessor, I'd recommend using those formats for maintenance and clarity.
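As for why the loop in the question prints nothing: search returns a collection of element objects (Hpricot::Elements is an Array subclass), so a.include?"Bob" most likely asks whether the string "Bob" is itself a member of that collection, which it never is. The sketch below imitates this in plain Ruby; FakeElement is a hypothetical stand-in for a parsed element, not part of Hpricot.

```ruby
# FakeElement is a hypothetical stand-in for a parsed <author> element.
FakeElement = Struct.new(:name, :inner_text)

matches = [FakeElement.new("author", "Bob")]

# The collection holds element objects, not strings, so this is false.
contains_string = matches.include?("Bob")

# Testing for a non-empty result, or comparing the text, behaves as intended.
found_any     = !matches.empty?
found_by_text = matches.any? { |el| el.inner_text == "Bob" }

puts contains_string  # false
puts found_any        # true
puts found_by_text    # true
```

In the original loop that would suggest something like: puts item.at("title") unless a.empty?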
Related
I'm generating an RSS feed using Ruby's built-in RSS library, which seems to escape HTML when generating feeds. For certain elements I'd prefer that it preserve the original HTML by wrapping it in a CDATA block.
A minimal working example:
require 'rss/2.0'
feed = RSS::Rss.new("2.0")
feed.channel = RSS::Rss::Channel.new
feed.channel.title = "Title & Show"
feed.channel.link = "http://foo.net"
feed.channel.description = "<strong>Description</strong>"
item = RSS::Rss::Channel::Item.new
item.title = "Foo & Bar"
item.description = "<strong>About</strong>"
feed.channel.items << item
puts feed
...which generates the following RSS:
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Title &amp; Show</title>
<link>http://foo.net</link>
<description>&lt;strong&gt;Description&lt;/strong&gt;</description>
<item>
<title>Foo &amp; Bar</title>
<description>&lt;strong&gt;About&lt;/strong&gt;</description>
</item>
</channel>
</rss>
Instead of HTML-encoding the channel and item descriptions, I'd like to keep the original HTML and wrap them in CDATA blocks, e.g.:
<description><![CDATA[<strong>Description</strong>]]></description>
Monkey-patching the element-generating method works for me:
require 'rss/2.0'
class RSS::Rss::Channel
def description_element need_convert, indent
markup = "#{indent}<description>"
markup << "<![CDATA[#{@description}]]>"
markup << "</description>"
markup
end
end
# ...
This prevents the call to Utils.html_escape, which escapes a few special characters.
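One caveat with any hand-built CDATA wrapper: the content must not contain the sequence ]]>, because that would end the section early. A common workaround is to split that sequence across two CDATA sections. A minimal sketch, where cdata_wrap is a hypothetical helper name and not part of the RSS library:

```ruby
# Hypothetical helper: wrap text in a CDATA section, splitting any "]]>"
# inside the text so it cannot close the section prematurely.
def cdata_wrap(text)
  safe = text.to_s.gsub("]]>", "]]]]><![CDATA[>")
  "<![CDATA[#{safe}]]>"
end

puts cdata_wrap("<strong>Description</strong>")
# <![CDATA[<strong>Description</strong>]]>
puts cdata_wrap("a ]]> b")
# <![CDATA[a ]]]]><![CDATA[> b]]>
```

Inside the patched method, cdata_wrap(@description) could then stand in for the interpolated string.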
My code is supposed to "guess" the path(s) that lie before the relevant text nodes in my XML file. Relevant in this case means: text nodes nested within the recurring product/person/something tag, but not text nodes that are used outside of it.
This code:
@doc, items = Nokogiri.XML(@file), []
path = []
@doc.traverse do |node|
if node.class.to_s == "Nokogiri::XML::Element"
is_path_element = false
node.children.each do |child|
is_path_element = true if child.class.to_s == "Nokogiri::XML::Element"
end
path.push(node.name) if is_path_element == true && !path.include?(node.name)
end
end
final_path = "/"+path.reverse.join("/")
works for simple XML files, for example:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
</channel>
</rss>
puts final_path # => "/rss/channel/item"
But when it gets more complicated, how should I then approach the challenge? For example with this one:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
</channel>
</rss>
If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.
Although I think your own code could be adjusted to achieve the same output, I was convinced the same thing could be achieved by using xpath. And my motivation is to get my XML skills unrusty (not used Nokogiri yet, but I will need to do so professionally soon). So here is how to get all parent paths that have just one child level beneath them, using xpath:
xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }
The output of this for your second example file is:
/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands
. . . if you took this list and gsub out the indexes, then make the array unique, then this looks a lot like the output of your loop . . .
paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths.map! { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq!
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
Or in one line:
paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
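The same selection can also be written as a plain recursive walk, which may be easier to adapt than the XPath. The sketch below swaps in REXML (from Ruby's standard distribution) for Nokogiri so that it runs without extra gems; leaf_parent_paths is a hypothetical helper name, and the sample document is a shortened version of the second example file.

```ruby
require 'rexml/document'

# Collect the paths of elements that have child elements but no grandchildren,
# i.e. the same set the XPath //*[* and not(*/*)] selects, without indexes.
def leaf_parent_paths(element, prefix = "")
  path = "#{prefix}/#{element.name}"
  children = element.elements.to_a
  return [] if children.empty?                              # text-only element
  return [path] if children.all? { |c| c.elements.empty? }  # deepest parent
  children.flat_map { |c| leaf_parent_paths(c, path) }.uniq
end

xml = <<XML
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <item>
      <titles><title>Some product title</title></titles>
      <brands><brand>Some product brand</brand></brands>
    </item>
    <item>
      <titles><title>Some product title</title></titles>
      <brands><brand>Some product brand</brand></brands>
    </item>
  </channel>
</rss>
XML

paths = leaf_parent_paths(REXML::Document.new(xml).root)
p paths
```

For the first, simpler example file the same helper should return the single path /rss/channel/item, matching the output of the original loop.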
I've created a library to build XPath expressions.
xpath = Jini.new
.add_path('parent')
.add_path('child')
.add_all('toys')
.add_attr('name', 'plane')
.to_s
puts xpath # -> /parent/child//toys[@name="plane"]
This is simple stuff, but it is driving me really crazy now. I have spent hours trying to figure it out, even though I have done this many times before.
I am trying to read and parse an XmlSimple doc, but I don't know why I can't access elements by index number. When I try this in the console it works, but not in the actual code. It gives me this error on the view page:
undefined method `[]' for nil:NilClass
Code:
@i = 0
list =""
while @i <= 2
puts xml
a = parsed_items["Item"][@i]["ItemId"]
list << a.to_s << ","
@i += 1
end
puts list.to_s
If I give an integer value manually in my code, it works:
a = parsed_items["Item"][0]["ItemId"] # it works with otherwise identical code
When I change it to @i, it stops working:
a = parsed_items["Item"][@i]["ItemId"] # it does not work with otherwise identical code
XML:
1.9.2p290 :013 > items = "<ItemList> <Item> <ItemId>123</ItemId> <ItemName>abc</ItemName> <ItemType>xyz</ItemType> <Status>bad</Status> </Item> <Item> <ItemId>456</ItemId> <ItemName>fgh</ItemName> <ItemType>nbv</ItemType> <Status>bad</Status> </Item> </ItemList>"
=> "<ItemList> <Item> <ItemId>123</ItemId> <ItemName>abc</ItemName> <ItemType>xyz</ItemType> <Status>bad</Status> </Item> <Item> <ItemId>456</ItemId> <ItemName>fgh</ItemName> <ItemType>nbv</ItemType> <Status>bad</Status> </Item> </ItemList>"
1.9.2p290 :014 > parsed_items = XmlSimple.xml_in(items, { 'KeyAttr' => 'name' })
=> {"Item"=>[{"ItemId"=>["123"], "ItemName"=>["abc"], "ItemType"=>["xyz"], "Status"=>["bad"]}, {"ItemId"=>["456"], "ItemName"=>["fgh"], "ItemType"=>["nbv"], "Status"=>["bad"]}]}
XML:
<ItemList>
<Item>
<ItemId>123</ItemId>
<ItemName>abc</ItemName>
<ItemType>xyz</ItemType>
<Status>bad</Status>
</Item>
<Item>
<ItemId>456</ItemId>
<ItemName>fgh</ItemName>
<ItemType>nbv</ItemType>
<Status>bad</Status>
</Item>
</ItemList>
Paraphrased, that error means "Hey, you put [] after something that was nil, but nil doesn't have that method!"
You only have 2 items in your array, so when @i gets to 2, which would be the third item in a 0-based list, the code parsed_items["Item"][@i] returns nil; when you then try to call ["ItemId"] on that value, you get the error you stated.
Simplest change to fix this:
while @i<2 # instead of <=2
Better change (let Ruby iterate for you):
list = ""
parsed_items["Item"].each do |item|
list << item["ItemId"].to_s << ","
end
puts list
Even better change (let Ruby do your work for you):
puts parsed_items["Item"].map{ |item| item["ItemId"] }.join(',')
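Note that in the parsed structure shown in the question every ItemId is itself a one-element array (["123"]), so on Ruby 1.9 and later calling to_s on it gives the array's inspect form rather than the bare ID. A small sketch that uses that hash literally (so XmlSimple itself is not needed) and takes the first entry of each array:

```ruby
# The structure XmlSimple returned in the question's irb session.
parsed_items = {
  "Item" => [
    { "ItemId" => ["123"], "ItemName" => ["abc"], "ItemType" => ["xyz"], "Status" => ["bad"] },
    { "ItemId" => ["456"], "ItemName" => ["fgh"], "ItemType" => ["nbv"], "Status" => ["bad"] }
  ]
}

# Each value is a one-element array, so take its first entry.
ids  = parsed_items["Item"].map { |item| item["ItemId"].first }
list = ids.join(",")

puts list  # 123,456
```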
For some reason you're defining an instance variable instead of a local one. Also, converting list into a string is completely unnecessary since it's a string from the very beginning. Working code should look somewhat like this:
i = 0
list =""
while i < parsed_items["Item"].length
puts xml
a = parsed_items["Item"][i]["ItemId"]
list << a.to_s << ","
i += 1
end
puts list
I strongly suggest reading about the different variable types in Ruby.
I am using XmlSimple. The problem I am having is with parsing a list of entries: I need to determine the number of entries that share the same XML tag.
<ItemList>
<Item>
<ItemId>123</ItemId>
<ItemName>abc</ItemName>
<ItemType>xyz</ItemType>
<Status>ok</Status>
</Item>
</ItemList>
Above gets parsed as this -
"ItemList"=> {
"Item"=>{ "ItemId"=>"123",
"ItemName"=>"abc",
"ItemType"=>"xyz",
"Status"=>"ok"
}
},
And I access it as ['ItemList']['Item']['ItemId'], without any index number anywhere.
But if ItemList has more than one entry, it messes up my application.
<ItemList>
<Item>
<ItemId>123</ItemId>
<ItemName>abc</ItemName>
<ItemType>xyz</ItemType>
<Status>bad</Status>
</Item>
<Item>
<ItemId>456</ItemId>
<ItemName>fgh</ItemName>
<ItemType>nbv</ItemType>
<Status>bad</Status>
</Item>
</ItemList>
Above gets parsed as this -
"ItemList"=> {
"Item"=>{ "ItemId"=>"123",
"ItemName"=>"abc",
"ItemType"=>"xyz",
"Status"=>"bad"
},
"Item"=>{ "ItemId"=>"456",
"ItemName"=>"fgh",
"ItemType"=>"nbv",
"Status"=>"bad"
}
},
I can access it as ['ItemList']['Item'][0]['ItemId'] and ['ItemList']['Item'][1]['ItemId'], providing an index number manually.
But since I don't know how many items there are in the list, I cannot provide an index number in the actual app. The XML might have no entries or might have hundreds of them.
Thought of using Nokogiri, but it has the same parsing behavior.
How do I handle this?
Sample processing of your data using xml-simple gem
1.9.2p290 :013 > items = "<ItemList> <Item> <ItemId>123</ItemId> <ItemName>abc</ItemName> <ItemType>xyz</ItemType> <Status>bad</Status> </Item> <Item> <ItemId>456</ItemId> <ItemName>fgh</ItemName> <ItemType>nbv</ItemType> <Status>bad</Status> </Item> </ItemList>"
=> "<ItemList> <Item> <ItemId>123</ItemId> <ItemName>abc</ItemName> <ItemType>xyz</ItemType> <Status>bad</Status> </Item> <Item> <ItemId>456</ItemId> <ItemName>fgh</ItemName> <ItemType>nbv</ItemType> <Status>bad</Status> </Item> </ItemList>"
1.9.2p290 :014 > parsed_items = XmlSimple.xml_in(items, { 'KeyAttr' => 'name' })
=> {"Item"=>[{"ItemId"=>["123"], "ItemName"=>["abc"], "ItemType"=>["xyz"], "Status"=>["bad"]}, {"ItemId"=>["456"], "ItemName"=>["fgh"], "ItemType"=>["nbv"], "Status"=>["bad"]}]}
1.9.2p290 :015 > parsed_items.class
=> Hash
1.9.2p290 :016 > parsed_items["Item"].class
=> Array
1.9.2p290 :017 > parsed_items["Item"].length
=> 2
So your Item will be an array and you can apply length method on it. With my example above you can always do parsed_items["Item"].length
On Ruby 1.8+, I use REXML, which makes this easy. See the Accessing Elements section: http://www.germane-software.com/software/rexml/docs/tutorial.html
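A minimal sketch of that approach, assuming the ItemList/Item layout from the question; item_ids is a hypothetical helper name. Because the XPath match always returns an array, the same code covers zero, one, or many items:

```ruby
require 'rexml/document'

# Return the ItemId of every <Item>, however many there are (including none).
def item_ids(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "/ItemList/Item").map do |item|
    item.elements["ItemId"].text
  end
end

one  = "<ItemList><Item><ItemId>123</ItemId></Item></ItemList>"
two  = "<ItemList><Item><ItemId>123</ItemId></Item><Item><ItemId>456</ItemId></Item></ItemList>"
none = "<ItemList></ItemList>"

p item_ids(one)   # ["123"]
p item_ids(two)   # ["123", "456"]
p item_ids(none)  # []
```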
If 'result' is what you get from parsing your XML doc, then you could test
result['ItemList']['Item']
to check whether it is an array (or enumerable). If it is, then there's more than 1 item, and you'll have to enumerate over the items.
Alternatively, you could do this (assuming ruby 1.9):
[*result['ItemList']['Item']].each do |item|
...
end
The splat operator is cool and when used like this lets you transparently handle a value that could be nil, a scalar, or a collection.
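One thing to watch for: a Hash is splatted into its key/value pairs, so if a single Item comes back as a Hash (as in the question), [*value] would not give a one-element list of items. An explicit wrapper avoids that. A sketch, where wrap_items is a hypothetical helper name:

```ruby
# Hypothetical helper: always return an Array of items, whether the parser
# produced nil (no items), a single Hash (one item), or an Array (several).
def wrap_items(value)
  case value
  when nil   then []
  when Array then value
  else            [value]
  end
end

single = { "ItemId" => "123" }
many   = [{ "ItemId" => "123" }, { "ItemId" => "456" }]

p wrap_items(nil)                             # []
p wrap_items(single).map { |i| i["ItemId"] }  # ["123"]
p wrap_items(many).map { |i| i["ItemId"] }    # ["123", "456"]
p [*single]                                   # [["ItemId", "123"]]
```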
I have a simple XML file, items.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logicteh</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech - Inc.</manufacturer>
</item>
<item>
<name>webcam</name>
<manufacturer>Logistech</manufacturer>
</item>
</items>
I am trying to insert a new node with the following code:
require 'rubygems'
require 'nokogiri'
f = File.open('items.xml')
@items = Nokogiri::XML(f)
f.close
price = Nokogiri::XML::Node.new "price", @items
price.content = "10"
@items.xpath('//items/item/manufacturer').each do |node|
node.add_next_sibling(price)
end
file = File.open("items_fixed.xml",'w')
file.puts @items.to_xml
file.close
However this code adds a new node only after the last <manufacturer> node, items_fixed.xml:
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>webcam</name>
<manufacturer>Logitech</manufacturer><price>10</price>
</item>
</items>
Why?
It would be helpful to distinguish between a Node (a particular piece of structured XML data at a particular place in a tree), and a "node template" which is the structure of the data.
Nokogiri (and most other XML libraries) only allow you to specify Nodes, not node templates. So when you created price = Nokogiri::XML::Node.new "price", @items, you had a particular piece of data that belongs in a particular place, but hadn't defined the place yet.
When you added it to the first <item>, you defined its place. When you added it to the second <item>, you uprooted it from its place and put it in a new place. At that point this Node appeared only in the second <item>. This continues when you add the same Node to each item, until you reach the last <item>, which is where the node stays.
Nokogiri doesn't have any way to specify a node template. What you need to do is:
@items.xpath('//items/item/manufacturer').each do |node|
price = Nokogiri::XML::Node.new "price", @items
price.content = "10"
node.add_next_sibling(price)
end
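The same "one new node per insertion" pattern can be tried without installing Nokogiri. The sketch below swaps in REXML from Ruby's standard distribution, so the calls differ from the Nokogiri code above; the sample document is a simplified stand-in for items.xml.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<XML)
<items>
  <item><name>mouse</name><manufacturer>Logitech</manufacturer></item>
  <item><name>keyboard</name><manufacturer>Logitech</manufacturer></item>
  <item><name>webcam</name><manufacturer>Logitech</manufacturer></item>
</items>
XML

# Build a fresh <price> element on every pass, since a single node can
# occupy only one place in a tree.
REXML::XPath.match(doc, "//item/manufacturer").each do |manufacturer|
  price = REXML::Element.new("price")
  price.text = "10"
  manufacturer.parent.insert_after(manufacturer, price)
end

price_count = REXML::XPath.match(doc, "//item/price").size
puts price_count  # 3
```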
I'd start with this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech - Inc.</manufacturer>
</item>
</items>
EOT
doc.search('manufacturer').each { |n| n.after('<price>10</price>') }
Which results in:
puts doc.to_xml
# >> <?xml version="1.0" encoding="UTF-8"?>
# >> <items>
# >> <item>
# >> <name>mouse</name>
# >> <manufacturer>Logitech</manufacturer><price>10</price>
# >> </item>
# >> <item>
# >> <name>keyboard</name>
# >> <manufacturer>Logitech - Inc.</manufacturer><price>10</price>
# >> </item>
# >> </items>
It's easy to build upon this to insert different values for the price.