Get low level xpath from XML with Nokogiri - ruby

I'm trying to store in an array all the unique XPaths of the lowest-level elements in the XML below, but with the code I have, array a ends up holding all of the XML rather than just the XPaths. The elements sit at different depths: some child elements have only 2 ancestors and others have more.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could help me in how to do this.
Thanks for the help.

What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping

Related

How can I copy nodes from one xml file to another, using Nokogiri?

I am trying to do the following:
I have the following xml_1 file, which I generated.
<document>
<TITLE>Computer Parts</TITLE>
<header>
<ITEM>Motherboard</ITEM>
<MANUFACTURER>ASUS</MANUFACTURER>
<MODEL>P3B-F</MODEL>
<COST> 123.00</COST>
</header>
<part1>
<ITEM>Video Card</ITEM>
<MANUFACTURER>ATI</MANUFACTURER>
<MODEL>All-in-Wonder Pro</MODEL>
<COST> 160.00</COST>
</part1>
.....
<part5>
</part5>
{HERE I WANT TO ADD NODES FROM OTHER XML FILES}
</document>
Because I am trying to generate a big xml file, I prefer to generate them in pieces and combine them in the end.
In that way I have cleaner and more readable code.
In the end I want to copy the xml files (xml_2, xml_3, etc.) into the xml_1 file in sequence.
So, let's say that I have another xml_2 file like the following:
<?xml version="1.0"?>
<part6>
</part6>
...
<part10>
</part10>
And so on.. I can have xml_3 .. xml_n.
My question is:
Is it possible using Nokogiri in a ruby file to copy the nodes of one xml file to another?
Thanks in advance!
See Nokogiri::XML::Node#<< to append children:
require 'nokogiri'
doc1 = Nokogiri::XML('<doc><foo>Foo</foo></doc>')
doc2 = Nokogiri::XML('<doc><bar>Bar</bar></doc>')
doc3 = Nokogiri::XML('<doc><gah>Gah</gah></doc>')
doc1.root << doc2.root.children # Append doc2's root's children to doc1's root.
doc1.root << doc3.root.children # Append doc3's root's children to doc1's root.
doc1.to_xml # =>
# <doc>
# <foo>Foo</foo>
# <bar>Bar</bar>
# <gah>Gah</gah>
# </doc>
Per the docs, you can append any node, document fragment, or node set, so you can select the target nodes in just about any way you want (CSS selectors, XPath, DOM, etc).

Ruby + Nokogiri + Xpath navigate Node_Set

<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
I have created a Nokogiri-NodeSet with this structure, i.e. a list of items with links and data children.
How can I filter any items that don't match a certain value in the 'target'-attribute of <FirstLink>?
Actually, what I want in the end is to extract the <Data><String> content of every <Item> that matches a certain value in its <FirstLink> "target" attribute.
I've tried several approaches already but I'm at a loss as to how to identify an element by an attribute of its grandchild, then extract the content of this grandchild's parent's sibling, X(.
We can build up an XPath expression to do this. Assuming we are starting from the whole XML document, rather than the node-set you already have, something like
//Item
will select all <Item> elements (I’m guessing you already have something like that to get this node-set).
Next, to select only those <Item> elements which have <Links><FirstLink> where FirstLink has a target attribute value of one:
//Item[Links/FirstLink[@target='one']]
and finally to select the Data/String children of those nodes:
//Item[Links/FirstLink[@target='one']]/Data/String
So with Nokogiri you could use something like this (where doc is your parsed document):
doc.xpath("//Item[Links/FirstLink[@target='one']]/Data/String")
or if you want to use the node-set you already have you can use a relative expression:
nodeset.xpath("self::Item[Links/FirstLink[@target='one']]/Data/String")
I didn't fully understand what your goal is, but based on a guess, here is how you could proceed in this case:
require 'nokogiri'
doc = Nokogiri::XML <<-xml
<Items>
<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content1</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content2</String>
</Data>
</Item>
</Items>
xml
The two Item elements are wrapped in a single Items root here, because a well-formed XML document can have only one root element. The #xpath method with the expression "//Item" selects all the Item nodes. Those Item nodes are then passed to the #select method, which keeps only the nodes whose Links/FirstLink child has a target attribute value containing "one".
node.at_xpath("Links/FirstLink")['target'] gives you the value of the target attribute of that Item's FirstLink node: "one" for the first Item node, then "two" for the second. The expression is relative (no leading //), so it is evaluated against each Item rather than against the whole document. The part ['one'] in node.at_xpath("Links/FirstLink")['target']['one'] is a call to the String#[] method, which returns nil when the substring is not found.
Remember that this approach also gives you the flexibility to use a regular expression, since String#[] accepts one.
nodeset = doc.xpath("//Item").select do |node|
  # Keep the Item only if its own FirstLink target contains "one".
  node.at_xpath("Links/FirstLink")['target']['one']
end
Now nodeset contains only the required Item nodes. Next I use #map, passing each Item node into the block to collect the content of its String node. The #at_xpath method with the relative expression Data/String selects the String node, and #text gives you the content of that String node.
nodeset.map { |n| n.at_xpath('Data/String').text } # => ["content1"]

How do I parse XML with Nokogiri css selectors, using loops?

I am trying to parse this sample XML file:
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
Here is my current code:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
somevar = doc.css("collection")
#create loop
somevar.each do |item|
puts "Item "
puts item['Title']
puts "\n"
end#items
Starting at the root of the XML document, I'm trying to go from the root "Collections" down to each new level.
I start in the node sets, and get information from the nodes, and the nodes contain elements. How do I assign the node to a variable, and extract every single layer underneath that and the text?
I can do something like the code below, but I want to know how to systematically move through each nested element of XML using loops, and output the data for each line. When finished showing text, how do I move back up to the previous element/node, whatever it may be (traversing a node in the tree)?
puts somevar.css("Keyworks Keyword").text
Nokogiri's NodeSet and Node support very similar APIs, with the key semantic difference that NodeSet's methods tend to operate on all the contained nodes in turn. For example, while a single node's children gets that node's children, a NodeSet's children gets all contained nodes' children (ordered as they occur in the document). So, to print all the titles and authors of all your items, you could do this:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
coll = doc.css("Collection")
coll.css("Items").children.each do |item|
title = item.css("Title")[0]
authors = item.css("Authors")[0]
puts title.content if title
puts authors.content if authors
end
You can get at any level of the tree in this way. Another example -- depth-first search printing every node in the tree (NB. the printed representation of a node includes the printed representations of its children, so the output will be quite long):
def rec(node)
puts node
node.children.each do |child|
rec child
end
end
Since you ask about this specifically, if you want to get at the parent of a given node, you can use the parent method. You may never need to though, if you can put your processing in blocks passed to each and the like on NodeSets containing subtrees of interest.

How can I concatenate two XML tags using Ruby/Nokogiri?

I am using Ruby to retrieve an XML document with the following format:
<project>
<users>
<person>
<name>LUIS</name>
</person>
<person>
<name>JOHN</name>
</person>
</users>
</project>
I want to know how to produce the following result, with the tags concatenated:
<project>
<users>
<person>
<name>LUIS JOHN</name>
</person>
</users>
</project>
Here is the code I am using:
file = File.new( "proyectos.xml" )
doc3 = Nokogiri::XML(file)
a=0
@participa = doc3.search("person")
@participa.each do |i|
@par = @participa.search("name").map { |node| node.children.text }
@par.each do |i|
puts @par[a]
puts '--'
a = a + 1
end
end
Rather than supply code, here's how to fish:
To parse your XML into Nokogiri, which I recommend highly:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<project>
<users>
<person>
<name>LUIS</name>
</person>
<person>
<name>JOHN</name>
</person>
</users>
</project>
EOT
That gives you a doc variable which is the DOM as a Nokogiri::XML::Document. From that you can search, either for matching nodes or a particular node. search allows you to pass an XPath or CSS accessor to locate what you are looking for. I recommend CSS for most things because it is more readable, but XPath has some great tools to dig into the structure of your XML, so often I end up with both in my code.
So, doc.at('users') is the CSS accessor to find the first users node. doc.search('person') will return all nodes matching the person tag as a NodeSet, which is basically an array which you can enumerate or loop over.
Nokogiri has a text method for a node that lets you get the text content of that node, including all the carriage-returns between nodes that would normally be considered formatting in the XML as it flows down the document. When you have the text of the node, you can apply the normal Ruby string processing methods, such as strip or chomp (or squish, if ActiveSupport is loaded), to massage the text into a more usable format.
Nokogiri also has a children= method which lets you redefine the child nodes of a node. You can pass in a node you've created, a NodeSet, or even the text you want rendered into the XML at that point.
In a quick experiment, I have code that does what you want in basically four lines. But, I want to see your work before I share what I wrote.
Finally, puts doc.to_xml will let you easily see if your changes to the document were successful.
Here's how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<project>
<users>
<person>
<name>LUIS</name>
</person>
<person>
<name>JOHN</name>
</person>
</users>
</project>
EOT
The XML is parsed into a DOM now. Search for the users tags, then locate the embedded name tags and extract the text from them. Join the results into a single space-delimited string. Then replace the children of the users tag with the desired results:
doc.search('users').each do |users|
user_names = users.search('name').map(&:text).join(' ')
users.children = "<person><name>#{ user_names }</name></person>"
end
If you output the resulting XML you'll get:
puts doc.to_xml
<?xml version="1.0"?>
<project>
<users><person><name>LUIS JOHN</name></person></users>
</project>

Using ruby/nokogiri to transform xml to another xml

I've never had to transform XML from one form to another. I hear that XSLT is made for exactly that, but I don't want to go there. So, using only ruby and nokogiri, how can I:
remove all child elements of item except time from the initial XML, and also rename the time element to HammerTime?
Initial XML:
...
<item>
<time>05.04.2011 9:53:23</time>
<iddqd>42</iddqd>
<idkfa>woot</idkfa>
</item>
<item>
...
Desired result:
...
<item>
<HammerTime>05.04.2011 9:53:23</HammerTime>
</item>
<item>
...
I figured out how to put data from XML to array using nokogiri's .xpath, but is there a way to make the desired transformation into another XML without manually having to write something like puts "<HammerTime>#{array['time']}</HammerTime>"?
Here you go:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML <<-EOHTML
<html>
<body>
<item>
<time>05.04.2011 9:53:23</time>
<iddqd>42</iddqd>
<idkfa>woot</idkfa>
</item>
</body>
</html>
EOHTML
hammer = doc.at_css "time"
hammer.name = 'hammertime'
doc.css("iddqd").remove
doc.css("idkfa").remove
outfile = File.new("output.html", "w")
outfile.puts doc.to_html
outfile.close
What do you mean by
into another XML without manually having to write something like puts "<HammerTime>#{array['time']}</HammerTime>"?
If you want to transform an XML element into another in a language-independent way, you can use an XSLT transformation (stylesheet). Once you have your XSLT file you can apply it with Nokogiri's Nokogiri::XSLT::Stylesheet#apply_to.
