How do I parse XML with Nokogiri css selectors, using loops? - ruby

I am trying to parse this sample XML file:
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
Here is my current code:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
somevar = doc.css("collection")
#create loop
somevar.each do |item|
puts "Item "
puts item['Title']
puts "\n"
end#items
Starting at the root of the XML document, I'm trying to go from the root "Collections" down to each new level.
I start in the node sets, and get information from the nodes, and the nodes contain elements. How do I assign the node to a variable, and extract every single layer underneath that and the text?
I can do something like the code below, but I want to know how to systematically move through each nested element of XML using loops, and output the data for each line. When finished showing text, how do I move back up to the previous element/node, whatever it may be (traversing a node in the tree)?
puts somevar.css("Keyworks Keyword").text

Nokogiri's NodeSet and Node support very similar APIs, with the key semantic difference that NodeSet's methods tend to operate on all the contained nodes in turn. For example, while a single node's children gets that node's children, a NodeSet's children gets all contained nodes' children (ordered as they occur in the document). So, to print all the titles and authors of all your items, you could do this:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
coll = doc.css("Collection")
coll.css("Items").children.each do |item|
title = item.css("Title")[0]
authors = item.css("Authors")[0]
puts title.content if title
puts authors.content if authors
end
You can get at any level of the tree in this way. Another example -- depth-first search printing every node in the tree (NB. the printed representation of a node includes the printed representations of its children, so the output will be quite long):
def rec(node)
puts node
node.children.each do |child|
rec child
end
end
Since you ask about this specifically, if you want to get at the parent of a given node, you can use the parent method. You may never need to though, if you can put your processing in blocks passed to each and the like on NodeSets containing subtrees of interest.

Related

How can I copy nodes from one xml file to another, using Nokogiri?

I am trying to do the following:
I have the following xml_1 file, which I generated.
<document>
<TITLE>Computer Parts</TITLE>
<header>
<ITEM>Motherboard</ITEM>
<MANUFACTURER>ASUS</MANUFACTURER>
<MODEL>P3B-F</MODEL>
<COST> 123.00</COST>
</header>
<part1>
<ITEM>Video Card</ITEM>
<MANUFACTURER>ATI</MANUFACTURER>
<MODEL>All-in-Wonder Pro</MODEL>
<COST> 160.00</COST>
</part1>
.....
<part5>
</part5>
{HERE I WANT TO ADD NODES FROM OTHER XML FILES}
</document>
Because I am trying to generate a big xml file, I prefer to generate them in pieces and combine them in the end.
In that way I have cleaner and more readable code.
In the end I want to copy the xml files (xml_2,xml_3,etc) in sequence in the xml_1 file.
So, lets say that I have another xml_2 file like the following:
<?xml version="1.0"?>
<part6>
</part6>
...
<part10>
</part10>
And so on.. I can have xml_3 .. xml_n.
My question is:
Is it possible using Nokogiri in a ruby file to copy the nodes of one xml file to another?
Thanks in advance!
See Nokogiri::XML::Node#<< to append children:
require 'nokogiri'
doc1 = Nokogiri::XML('<doc><foo>Foo</foo></doc>')
doc2 = Nokogiri::XML('<doc><bar>Bar</bar></doc>')
doc3 = Nokogiri::XML('<doc><gah>Gah</gah></doc>')
doc1.root << doc2.root.children # Append doc2's root's children to doc1's root.
doc1.root << doc3.root.children # Append doc3's root's children to doc1's root.
doc1.to_xml # =>
# <doc>
# <foo>Foo</foo>
# <bar>Bar</bar>
# <gah>Gah</gah>
# </doc>
Per the docs, you can append any node, document fragment, or node set, so you can select the target nodes in just about any way you want (CSS selectors, XPath, DOM, etc).

Get low level xpath from XML with Nokogiri

I'm trying to store in an array all the unique Xpaths of the low level elements in the XML below, but like I'm doing in array a is being stored all the XML, not only the Xpath themselves. The XML has different levels of Xpath. I mean, some child elements only have 2 ancestors and others more than one.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could help me in how to do this.
Thanks for the help.
What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping

Ruby + Nokogiri + Xpath navigate Node_Set

<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
I have created a Nokogiri-NodeSet with this structure, i.e. a list of items with links and data children.
How can I filter any items that don't match a certain value in the 'target'-attribute of <FirstLink>?
Actually, what I want in the end is to extract the <Data><String>-Content of every <Item> that matches a certain value in it's <FirstLink> "Target"-Attribute.
I've tried several approaches already but I'm at a loss as to how to identify an element by an attribute of it's grandchild, then extracting the content of this grandchild's parent's sibling, X(.
We can build up an XPath expression to do this. Assuming we are starting from the whole XML document, rather than the node-set you already have, something like
//Item
will select all <Item> elements (I’m guessing you already have something like that to get this node-set).
Next, to select only those <Item> elements which have <Links><FirstLink> where FirstLink has a target attribute value of one:
//Item[Links/FirstLink[#target='one']]
and finally to select the Data/String children of those nodes:
//Item[Links/FirstLink[#target='one']]/Data/String
So with Nokogiri you could use something like this (where doc is your parsed document):
doc.xpath("//Item[Links/FirstLink[#target='one']]/Data/String")
or if you want to use the node-set you already have you can use a relative expression:
nodeset.xpath("self::Item[Links/FirstLink[#target='one']]/Data/String")
I completely didn't understand what your goal is. But using a guess, I am trying to show you, how to proceed in this case :
require 'nokogiri'
doc = Nokogiri::XML <<-xml
<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content1</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content2</String>
</Data>
</Item>
xml
#xpath method with the expression "//Item", will select all the Item nodes. Then those Item nodes will be passed to the #reject method to select only those nodes, that has a node called Links having the target attribute value is "one". If any of the links, either FirstLink or SecondLink has the target attribute value "one", for that nodes grandparent node Item will be selected.
node.at("//Links/FirstLink")['target'] will give you the string say "one" which is a value of target attribute of the node, FirstLink of first Item nodes , then "two" from the second Item node. The part ['any vaue'] in node.at("//Links/FirstLink")['target']['any vaue'] is a call to the String#[] method.
Remember below approach will give you the flexibility of the use regular expression too.
nodeset = doc.xpath("//Item").reject do |node|
node.at("//Links/FirstLink")['target']['any vaue']
end
Now nodeset contains only the required Item nodes. Now I use #map, passing each item node inside it to collect the content of the String node. Then #at method with an expression //Data/String, will select the String node. Then #text, will give you the content of each String node.
nodeset.map { |n| n.at('//Data/String').text } # => ["content1"]

counting elements in xml with Nokogiri

I'd like to understand why count gives me 5?
If I'm at the root element and I want to know my children, it is supposed to give me 2.
doc = Nokogiri::XML(open('link..to....element.xml'))
root = doc.root.children.count
puts root
<element>
<name>Married with Children</name>
<name>Married with Children</name>
</element>
You get 5 as the result because there are five child nodes under the root <element> node. There are two <name> nodes and three text nodes that each consist of whitespace; one between the opening <element> and the first <name>, one between the two <names>, and one between the second <name> and the closing </element>:
doc.root.children.each do |c|
p c
end
output:
#<Nokogiri::XML::Text:0x80544a04 "\n ">
#<Nokogiri::XML::Element:0x80544900 name="name" children=[#<Nokogiri::XML::Text:0x8054470c "Married with Children">]>
#<Nokogiri::XML::Text:0x80544554 "\n ">
#<Nokogiri::XML::Element:0x80544478 name="name" children=[#<Nokogiri::XML::Text:0x80544284 "Married with Children">]>
#<Nokogiri::XML::Text:0x805440cc "\n">
If you use the noblanks option when parsing Nokogiri won’t include these whitespace nodes:
doc = Nokogiri::XML(open('link..to....element.xml')) { |c| c.noblanks }
Now doc.root.children.count will equal 2, only the two <name> element nodes will be included.

How do I force parsing an XML node as hash array?

This is my simplified myXML:
<?xml version="1.0" encoding="utf-8"?>
<ShipmentRequest>
<Message>
<MemberId>A00000001</MemberId>
<MemberName>Bruce</MemberName>
<Line>
<LineNumber>3.1</LineNumber>
<Item>fruit-004</Item>
<Description>Peach</Description>
</Line>
<Line>
<LineNumber>4.1</LineNumber>
<Item>fruit-001</Item>
<Description>Peach</Description>
</Line>
</Message>
</ShipmentRequest>
When I parse it with the Crack gem myHash is:
{
"MemberId"=>"A00000001",
"MemberName"=>"Bruce",
"Line"=>[
{"LineNumber"=>"3.1", "Item"=>"A0001", "Description"=>"Apple"},
{"LineNumber"=>"4.1", "Item"=>"A0002", "Description"=>"Peach"}
]
}
The Crack gem creates the hash Line as an array, because there two <Line> nodes in myXML. But if myXML would contain only one <Line> node, the Crack gem would not parse it as an array:
{
"MemberId"=>"ABC0001",
"MemberName"=>"Alan",
"Line"=> {"LineNumber"=>"4.1", "Item"=>"fruit-004", "Description"=>"Apple"}
}
I want to see it still as an array no matter if there's only one node:
{
"MemberId"=>"ABC0001",
"MemberName"=>"Alan",
"Line"=> [{"LineNumber"=>"4.1", "Item"=>"fruit-004", "Description"=>"Apple"}]
}
After you convert the XML document to a hash you could do this:
myHash["Line"] = [myHash["Line"]] if myHash["Line"].kind_of?(Hash)
It will ensure that the Line node will be wrapped in Array.
The problem is, you're relying on code to do what you really should do. Crack has no idea that you want a single node to be an array of a single element, and that behavior makes it a lot more difficult for you when trying to dive into that portion of the data.
Parsing XML isn't hard, and, by parsing it yourself, you'll know what to expect, and will avoid the hassle of dealing with the "sometimes it's an array and sometimes it's not" returned by Crack.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<ShipmentRequest>
<Message>
<MemberId>A00000001</MemberId>
<MemberName>Bruce</MemberName>
<Line>
<LineNumber>3.1</LineNumber>
<Item>fruit-004</Item>
<Description>Peach</Description>
</Line>
<Line>
<LineNumber>4.1</LineNumber>
<Item>fruit-001</Item>
<Description>Peach</Description>
</Line>
</Message>
</ShipmentRequest>
EOT
That sets up the DOM, so it can be navigated:
hash = {}
message = doc.at('Message')
hash[:member_id] = message.at('MemberId').text
hash[:member_name] = message.at('MemberName').text
lines = message.search('Line').map do |line|
line_number = line.at('LineNumber').text
item = line.at('Item').text
description = line.at('Description').text
{
:line_number => line_number,
:item => item,
:description => description
}
end
hash[:lines] = lines
message = doc.at('Message') finds the first <Message> node.
message.at('MemberId').text finds the first <MemberID> node inside <Message>.
message.at('MemberName').text is similar to the above step.
message.search('Line') looks for all <Line> nodes inside <Message>.
From those descriptions you can figure out the rest.
After running, hash looks like:
{:member_id=>"A00000001",
:member_name=>"Bruce",
:lines=>
[{:line_number=>"3.1", :item=>"fruit-004", :description=>"Peach"},
{:line_number=>"4.1", :item=>"fruit-001", :description=>"Peach"}]}
If I remove one of the <Line> blocks from the XML, and re-run, I get:
{:member_id=>"A00000001",
:member_name=>"Bruce",
:lines=>[{:line_number=>"3.1", :item=>"fruit-004", :description=>"Peach"}]}
Using search to locate the <Line> nodes is the trick. search returns a NodeSet, which is akin to an Array, so by iterating over it using map it'll return an array of hashes of the contents of <Line> tags.
Nokogiri is a great tool for parsing HTML and XML, then allowing us to search, add, change or remove nodes. It supports CSS and XPath accessors, so if you are used to jQuery or how CSS works, or XPath expressions, you'll be off and running quickly. The tutorials for Nokogiri are a good starting place to learn how it works.

Resources