Nokogiri (Ruby): Extract tag contents for a specific attribute inside each node - ruby

I have a XML with the following structure
<Root>
<Batch name="value">
<Document id="ID1">
<Tags>
<Tag id="ID11" name="name11">Contents</Tag>
<Tag id="ID12" name="name12">Contents</Tag>
</Tags>
</Document>
<Document id="ID2">
<Tags>
<Tag id="ID21" name="name21">Contents</Tag>
<Tag id="ID22" name="name22">Contents</Tag>
</Tags>
</Document>
</Batch>
</Root>
I want to extract the contents of specific tags for each Document node, using something like this:
xml.xpath('//Document/Tags').each do |node|
puts xml.xpath('//Root/Batch/Document/Tags/Tag[#id="ID11"]').text
end
Which is expected to extract the contents of the tag with id = "ID11" for each 2 nodes, but retrieves nothing. Any ideas?

You have a minor error in the xpath, you are using /Documents/Document while the XML you pasted is a bit different.
This should work:
//Root/Batch/Document/Tags/Tag[#id="ID11"]
My favorite way to do this is by using the #css method like this:
xml.css('Tag[#id="ID11"]').each do |node|
puts node.text
end

It seemed that xpath used was wrong.
'//Root/Batch/Documents/Document/Tags/Tag[#id="ID11"]'
shoud be
'//Root/Batch/Document/Tags/Tag[#id="ID11"]'

I managed to get it working with the following code:
xml.xpath('//Document/Tags').each do |node|
node.xpath("Tag[#id='ID11']").text
end

Related

Get low level xpath from XML with Nokogiri

I'm trying to store in an array all the unique Xpaths of the low level elements in the XML below, but like I'm doing in array a is being stored all the XML, not only the Xpath themselves. The XML has different levels of Xpath. I mean, some child elements only have 2 ancestors and others more than one.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could help me in how to do this.
Thanks for the help.
What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping

Ruby + Nokogiri + Xpath navigate Node_Set

<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
I have created a Nokogiri-NodeSet with this structure, i.e. a list of items with links and data children.
How can I filter any items that don't match a certain value in the 'target'-attribute of <FirstLink>?
Actually, what I want in the end is to extract the <Data><String>-Content of every <Item> that matches a certain value in it's <FirstLink> "Target"-Attribute.
I've tried several approaches already but I'm at a loss as to how to identify an element by an attribute of it's grandchild, then extracting the content of this grandchild's parent's sibling, X(.
We can build up an XPath expression to do this. Assuming we are starting from the whole XML document, rather than the node-set you already have, something like
//Item
will select all <Item> elements (I’m guessing you already have something like that to get this node-set).
Next, to select only those <Item> elements which have <Links><FirstLink> where FirstLink has a target attribute value of one:
//Item[Links/FirstLink[#target='one']]
and finally to select the Data/String children of those nodes:
//Item[Links/FirstLink[#target='one']]/Data/String
So with Nokogiri you could use something like this (where doc is your parsed document):
doc.xpath("//Item[Links/FirstLink[#target='one']]/Data/String")
or if you want to use the node-set you already have you can use a relative expression:
nodeset.xpath("self::Item[Links/FirstLink[#target='one']]/Data/String")
I completely didn't understand what your goal is. But using a guess, I am trying to show you, how to proceed in this case :
require 'nokogiri'
doc = Nokogiri::XML <<-xml
<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content1</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content2</String>
</Data>
</Item>
xml
#xpath method with the expression "//Item", will select all the Item nodes. Then those Item nodes will be passed to the #reject method to select only those nodes, that has a node called Links having the target attribute value is "one". If any of the links, either FirstLink or SecondLink has the target attribute value "one", for that nodes grandparent node Item will be selected.
node.at("//Links/FirstLink")['target'] will give you the string say "one" which is a value of target attribute of the node, FirstLink of first Item nodes , then "two" from the second Item node. The part ['any vaue'] in node.at("//Links/FirstLink")['target']['any vaue'] is a call to the String#[] method.
Remember below approach will give you the flexibility of the use regular expression too.
nodeset = doc.xpath("//Item").reject do |node|
node.at("//Links/FirstLink")['target']['any vaue']
end
Now nodeset contains only the required Item nodes. Now I use #map, passing each item node inside it to collect the content of the String node. Then #at method with an expression //Data/String, will select the String node. Then #text, will give you the content of each String node.
nodeset.map { |n| n.at('//Data/String').text } # => ["content1"]

How can I concatenate two XML tags using Ruby/Nokogiri?

I am using Ruby to retrieve an XML document with the following format:
<project>
<users>
<person>
<name>LUIS</name>
</person>
<person>
<name>JOHN</name>
</person>
</users>
</project>
I want to know how to produce the following result, with the tags concatenated:
<project>
<users>
<person>
<name>LUIS JOHN</name>
</person>
</users>
</project>
Here is the code I am using:
file = File.new( "proyectos.xml" )
doc3 = Nokogiri::XML(file)
a=0
#participa = doc3.search("person")
#participa.each do |i|
#par = #participa.search("name").map { |node| node.children.text }
#par.each do |i|
puts #par[a]
puts '--'
a = a + 1
end
end
Rather than supply code, here's how to fish:
To parse your XML into Nokogiri, which I recommend highly:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<project>
<users>
<person>
<name>LUIS</name>
</person>
<person>
<name>JOHN</name>
</person>
</users>
</project>
EOT
That gives you a doc variable which is the DOM as a Nokogiri::XML::Document. From that you can search, either for matching nodes or a particular node. search allows you to pass an XPath or CSS accessor to locate what you are looking for. I recommend CSS for most things because it is more readable, but XPath has some great tools to dig into the structure of your XML, so often I end up with both in my code.
So, doc.at('users') is the CSS accessor to find the first users node. doc.search('person') will return all nodes matching the person tag as a NodeSet, which is basically an array which you can enumerate or loop over.
Nokogiri has a text method for a node that lets you get the text content of that node, including all the carriage-returns between nodes that would normally be considered formatting in the XML as it flows down the document. When you have the text of the node, you can apply the normal Ruby string processing commands, such as strip, squish, chomp, etc., to massage the text into a more usable format.
Nokogiri also has a children= method which lets you redefine the child nodes of a node. You can pass in a node you've created, a NodeSet, or even the text you want rendered into the XML at that point.
In a quick experiment, I have code that does what you want in basically four lines. But, I want to see your work before I share what I wrote.
Finally, puts doc.to_xml will let you easily see if your changes to the document were successful.
Here's how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<project>
<users>
<person>
<name>LUIS</name>
</person>
<person>
<name>JOHN</name>
</person>
</users>
</project>
EOT
The XML is parsed into a DOM now. Search for the users tags, then locate the embedded name tags and extract the text from them. Join the results into a single space-delimited string. Then replace the children of the users tag with the desired results:
doc.search('users').each do |users|
user_names = users.search('name').map(&:text).join(' ')
users.children = "<person><name>#{ user_names }</name></person>"
end
If you output the resulting XML you'll get:
puts doc.to_xml
<?xml version="1.0"?>
<project>
<users><person><name>LUIS JOHN</name></person></users>
</project>

XmlNodeList SelectNodes trouble

I'm trying to parse an xml file
My code looks like:
string path2 = "xmlFile.xml";
XmlDocument xDoc = new XmlDocument();
xDoc.Load(path2);
XmlNodeList xnList = xDoc.DocumentElement["feed"].SelectNodes("entry");
But can't seem to get the listing of nodes. I get the error message- "Use the 'new' keyword to create an object instance." and it seems to be on 'SelectNodes("entry")'
This code worked when I loaded the xml from an rss feed, but not a local file. Can you tell me what I'm doing wrong?
My xml looks like:
<?xml version="1.0"?>
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<entry gr:crawl-timestamp-msec="1318667375230">
<title type="html">Title 1 text</title>
<summary>summary 1 text text text</summary>
</entry>
<entry gr:crawl-timestamp-msec="1318667375230">
<title type="html">title 2 text</title>
<summary>summary 2 text text text</summary>
</entry>
</feed>
Take the namespace into acount:
XmlNamespaceManager mgr = new XmlNamespaceManager(XDoc.NameTable);
mgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
XmlNodeList xnList = xDoc.SelectNodes("//atom:entry", mgr);
This is the infamous most FAQ about XPath -- referring to the names of elements that are in a default namespace.
Short answer: search for "XPath default namespace" and understand the problem.
Then use an XmlNamespaceManager instance to add an association between a prefix (say "x") and the default namespace (in your case "http://www.w3.org/2005/Atom").
Finally, replace any Name with x:Name in your XPath expression.

Using ruby/nokogiri to transform xml to another xml

I've never encountered task of transforming XML from one form to another. I hear that XSLT is just for that, but I don't want to go there. So, using only ruby and nokogiri, how can I:
remove all item elements but time from initial XML and also rename element time to HammerTime?
Initial XML:
...
<item>
<time>05.04.2011 9:53:23</time>
<iddqd>42</iddqd>
<idkfa>woot</idkfa>
</item>
<item>
...
Desired result:
...
<item>
<HammerTime>05.04.2011 9:53:23</HammerTime>
</item>
<item>
...
I figured out how to put data from XML to array using nokogiri's .xpath, but is there a way to make the desired transformation into another XML without manually having to write something like puts "<HammerTime>#{array['time']}</HammerTime>"?
Here you go:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML <<-EOHTML
<html>
<body>
<item>
<time>05.04.2011 9:53:23</time>
<iddqd>42</iddqd>
<idkfa>woot</idkfa>
</item>
</body>
</html>
EOHTML
hammer = doc.at_css "time"
hammer.name = 'hammertime'
doc.css("iddqd").remove
doc.css("idkfa").remove
outfile = File.new("output.html", "w")
outfile.puts doc.to_html
outfile.close
What do you mean with
into another XML without manually having to write something like puts "<HammerTime>#{array['time']}</HammerTime>"?
If you want to transform an XML element into another in a language-independent way, you can use XSLT transformations (or stylesheet). Once you have your XSLT file you can apply it with Nokogiri's Nokogiri::XSLT::Stylesheet#apply_to.

Resources