Using ruby/nokogiri to transform xml to another xml - ruby

I've never had to transform XML from one form to another before. I hear that XSLT is made for exactly that, but I don't want to go there. So, using only Ruby and Nokogiri, how can I:
remove all child elements of item except time from the initial XML, and also rename the element time to HammerTime?
Initial XML:
...
<item>
<time>05.04.2011 9:53:23</time>
<iddqd>42</iddqd>
<idkfa>woot</idkfa>
</item>
<item>
...
Desired result:
...
<item>
<HammerTime>05.04.2011 9:53:23</HammerTime>
</item>
<item>
...
I figured out how to put data from the XML into an array using Nokogiri's .xpath, but is there a way to make the desired transformation into another XML document without manually having to write something like puts "<HammerTime>#{array['time']}</HammerTime>"?

Here you go:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML <<-EOHTML
<html>
<body>
<item>
<time>05.04.2011 9:53:23</time>
<iddqd>42</iddqd>
<idkfa>woot</idkfa>
</item>
</body>
</html>
EOHTML
hammer = doc.at_css "time"
hammer.name = 'hammertime'
doc.css("iddqd").remove
doc.css("idkfa").remove
outfile = File.new("output.html", "w")
outfile.puts doc.to_html
outfile.close

What do you mean by
into another XML without manually having to write something like puts "<HammerTime>#{array['time']}</HammerTime>"?
If you want to transform one XML element into another in a language-independent way, you can use an XSLT transformation (a stylesheet). Once you have your XSLT file, you can apply it with Nokogiri's Nokogiri::XSLT::Stylesheet#apply_to.

Related

Getting a node value depending on an another value at the same level

For each "item" node in the following XML structure, I want to select the corresponding "title" (the text nodes are located at the same level as the item nodes, I can't modify it).
The link between those two nodes will be the "ref" node which is a kind of primary key between the "item" and "title" trees.
Is this possible in XPath?
I think it should be something like this: //root/item/../title[ref/text()=??????]/label
An example:
<root>
<item>
<ref>ITEM001</ref>
</item>
<item>
<ref>ITEM002</ref>
</item>
<item>
<ref>ITEM003</ref>
</item>
<item>
<ref>ITEM004</ref>
</item>
<title>
<ref>ITEM002</ref>
<label>Hello world!</label>
</title>
<title>
<ref>ITEM003</ref>
<label>Goodbye world!</label>
</title>
<title>
<ref>ITEM007</ref>
<label>This is a test!</label>
</title>
<title>
<ref>ITEM0010</ref>
<label>No this a question!</label>
</title>
</root>
The result would be:
ITEM001: empty
ITEM002: Hello world!
ITEM003: Goodbye world!
ITEM004: empty
Thanks in advance for your help.
I assume that if you follow the steps below, you will get your desired output.
Step 1: Iterate through all the item elements and capture their ref values in an array.
Step 2: Loop over the array and use the XPath below to find the corresponding label, putting the current ref value between the quotes.
//title[contains(.,'')]/label
Step 3: If you find a matching element, print the text of its label to the console; otherwise print empty.

Get low level xpath from XML with Nokogiri

I'm trying to store in an array all the unique XPaths of the lowest-level elements in the XML below, but the way I'm doing it, array a ends up holding all the XML, not just the XPaths themselves. The XML has paths of different depths: some child elements have only 2 ancestors and others have more.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could show me how to do this.
Thanks for the help.
What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping

Get attribute value from XML

I have this chunk of XML:
<show name="Are We There Yet?">
<sid>24588</sid>
<network>TBS</network>
<title>The Kwandanegaba Children's Fund Episode</title>
<ep>03x31</ep>
<link>
http://www.tvrage.com/shows/id-24588/episodes/1065228407
</link>
</show>
I am trying to get "Are we there yet?" via Nokogiri. It is effectively the 'name' attribute of 'show'. I'm struggling to figure out how to parse this.
xml.at_css('show').value was my best guess but doesn't work.
You can use the following:
xml.at('//show/@name').text
which is an XPath expression that returns the name attribute of the show element.
Use:
require 'nokogiri'
xml =<<EOT
<show name="Are We There Yet?">
<sid>24588</sid>
<network>TBS</network>
<title>The Kwandanegaba Children's Fund Episode</title>
<ep>03x31</ep>
<link>
http://www.tvrage.com/shows/id-24588/episodes/1065228407
</link>
</show>
EOT
xml = Nokogiri::XML(xml)
puts xml.at('show')['name']
=> Are We There Yet?
at accepts either CSS or XPath expressions, so feel free to use it for both. Use at_css or at_xpath if you need to declare the expression explicitly as CSS or XPath, respectively. at returns a Node, so you can simply reference the attributes of the node like you would the keys of a hash.

XQuery ancestor axis doesn't work, but explicit XPath does

Consider the following XML snippet:
<doc>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
</doc>
In XQuery, I have a function that needs to do some things based on the ancestor chapter of a given "para" element that is passed in as a parameter, as shown in the stripped down example below:
declare function doSomething($para){
let $chapter := $para/ancestor::chapter
return "some stuff"
};
In that example, $chapter keeps coming up empty. However, if I write the function similar to the following (i.e., without using the ancestor axis), I get the desired "chapter" element:
declare function doSomething($para){
let $chapter := $para/../..
return "some stuff"
};
The problem is that I cannot use explicit paths as in the latter example because the XML I will be searching is not guaranteed to have the "chapter" element as a grandparent every time. It may be a great-grandparent or great-great-grandparent, and so on, as shown below:
<doc>
<chapter id="1">
<item>
<subItem>
<para>some text here</para>
</subItem>
</item>
</chapter>
</doc>
Does anyone have an explanation as to why the axis doesn't work, while the explicit XPath does? Also, does anyone have any suggestions on how to solve this problem?
Thank you.
SOLUTION:
The mystery is now solved.
The node in question was re-created in another function, which had the result of stripping it of all of its ancestor information. Unfortunately, the previous developer did not document this wonderful, little function and has cost us all a good deal of time.
So, the ancestor axis worked exactly as it should - it was just being applied to a deceptive node.
I thank all of you for your efforts in answering my questions.
The ancestor axis does work fine. I suspect your problem is namespaces. The example you showed, which I ran (below), has XML without any namespaces. If your XML has a namespace, then you need to provide it in the ancestor XPath, like this: $para/ancestor::foo:chapter, where the prefix foo is bound to the correct namespace for the chapter element.
let $doc := <doc>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
</doc>
let $para := $doc//para
return $para/ancestor::chapter
RESULT:
<?xml version="1.0" encoding="UTF-8"?>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
These things almost always boil down to namespaces! As a diagnostic to confirm 100% that namespaces are not the issue, can you try:
declare function local:doSomething($para) {
let $chapter := $para/ancestor::*[local-name() = 'chapter']
return $chapter
};
This seems surprising to me; which XQuery implementation are you using? With BaseX, the following query...
declare function local:doSomething($para) {
let $chapter := $para/ancestor::chapter
return $chapter
};
let $xml :=
<doc>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
</doc>
return local:doSomething($xml//para)
...returns...
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
I suspect namespaces too. If $para/../.. works but $para/parent::item/parent::chapter turns up empty, then you know it's a question of namespaces.
Look for an xmlns declaration at the top of your content, e.g.:
<doc xmlns="http://example.com">
...
</doc>
In your XQuery, you then need to bind that namespace to a prefix and use that prefix in your XQuery/XPath expressions, like this:
declare namespace my="http://example.com";
declare function doSomething($para){
let $chapter := $para/ancestor::my:chapter
return "some stuff"
};
What prefix you use doesn't matter. The important thing is that the namespace URI (http://example.com in the above example) matches up.
It makes sense that ../.. selects the element you want, because .. is short for parent::node() which selects the parent node regardless of its name (or namespace). Whereas ancestor::chapter will only select <chapter> elements that are not in a namespace (unless you have declared a default element namespace, which is usually not a good idea in XQuery because it affects both your input and your output).

Parse data from multiple XML files and output to csv file

I've got a dozen XML files which contain the results of some wcat web performance tests. Within each XML file there are data nodes that contain the name of each page requested and the average time it took to load. I want to extract that information from each XML file and output it to a CSV file so I can create a nice pretty graph in Excel.
I could do the task in my main working language of C# but in an attempt to improve my scripting skills I'd like to try and do it using unix/cygwin commands or a scripting language such as Ruby.
The format of the XML file is:
<report name="wcat" version="6.3.1" level="1" top="100">
<section name="header" key="90000">
... lots of other XML junk...
<item>
<data name="reportt" >Request Name I</data>
...
<data name="avgttlb" >628</data>
</item>
<item>
<data name="reportt" >Request Name II</data>
...
<data name="avgttlb" >793</data>
</item>
... lots of other XML junk...
</section>
</report>
And the csv output I need is:
Request,File 1,File 2,...,File 12
Request Name I,628,123,...,789
Request Name II,793,456,...,987
Are there any good cygwin command line utilities that could parse the XML? Or failing that is there a nice way to do it in Ruby?
What you're describing could be done in XSLT, which supports a text output method, multiple input files (using the document() function), and of course templates.
I know some people find XSLT gross, but I use it all the time for this kind of thing and rather like it. Plus it's pretty much platform-independent.
Ruby has a nice parser called Nokogiri, that I really like. It supports both XML and HTML, DOM and SAX, and can build XML if that's your fancy. It's built on libxml2.
#!/usr/bin/env ruby -w
xml = <<END_XML
<report name="wcat" version="6.3.1" level="1" top="100">
<section name="header" key="90000">
<item>
<data name="reportt" >Request Name I</data>
<data name="avgttlb" >628</data>
</item>
<item>
<data name="reportt" >Request Name II</data>
<data name="avgttlb" >793</data>
</item>
</section>
</report>
END_XML
require 'nokogiri'
doc = Nokogiri::XML(xml)
content = doc.search('item').map { |i|
i.search('data').map { |d| d.text }
}
content.each do |c|
puts c.join(',')
end
# >> Request Name I,628
# >> Request Name II,793
Notice that Nokogiri allows use of CSS accessors, which I'm using here, in addition to the standard XPath accessors. The actual parsing took the middle four lines.
Ruby's got a built-in CSV generator/parser, but for this quick 'n dirty example I didn't use it.
In Python:
import csv
import xml.etree.ElementTree as ET

result = []
tree = ET.parse('test.xml')
# <section> is a direct child of the <report> root element
section = tree.getroot().find('section')
for item in section.findall('item'):
    # one CSV row per <item>, one column per <data> element
    row = [rec.text for rec in item.findall('data')]
    result.append(row)
# newline='' lets the csv module control line endings itself
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(result)

Resources