Ruby + Nokogiri + XPath: navigating a NodeSet

<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
I have created a Nokogiri NodeSet with this structure, i.e. a list of items with links and data children.
How can I filter out any items that don't match a certain value in the target attribute of <FirstLink>?
What I actually want in the end is to extract the <Data><String> content of every <Item> whose <FirstLink> has a certain value in its target attribute.
I've tried several approaches already, but I'm at a loss as to how to identify an element by an attribute of its grandchild and then extract the content of that grandchild's parent's sibling.

We can build up an XPath expression to do this. Assuming we are starting from the whole XML document, rather than the node-set you already have, something like
//Item
will select all <Item> elements (I'm guessing you already have something like that to get this node-set).
Next, to select only those <Item> elements which have a <Links><FirstLink> child whose target attribute has the value one:
//Item[Links/FirstLink[@target='one']]
and finally to select the Data/String children of those nodes:
//Item[Links/FirstLink[@target='one']]/Data/String
So with Nokogiri you could use something like this (where doc is your parsed document):
doc.xpath("//Item[Links/FirstLink[@target='one']]/Data/String")
or, if you want to use the node-set you already have, you can use a relative expression:
nodeset.xpath("self::Item[Links/FirstLink[@target='one']]/Data/String")

I didn't completely understand what your goal is, but based on a guess, here is how you could proceed in this case:
require 'nokogiri'
doc = Nokogiri::XML <<-xml
<Items>
<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content1</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content2</String>
</Data>
</Item>
</Items>
xml
The #xpath method with the expression "//Item" selects all the Item nodes. Those nodes are then passed to #select, which keeps only the Items whose FirstLink has the target attribute value "one". Note that the path inside the block is relative ("Links/FirstLink", not "//Links/FirstLink"): an expression starting with // always searches from the document root, so it would return the first FirstLink of the whole document for every Item.
node.at_xpath("Links/FirstLink")['target'] gives you the value of the target attribute as a String, say "one" for the first Item node and "two" for the second. The part ['one'] in node.at_xpath("Links/FirstLink")['target']['one'] is a call to String#[], which returns the matching substring, or nil if there is no match.
Because String#[] also accepts a regular expression, this approach gives you the flexibility of pattern matching too.
nodeset = doc.xpath("//Item").select do |node|
  node.at_xpath("Links/FirstLink")['target']['one']
end
Now nodeset contains only the required Item nodes. I then use #map to collect the content of each String node: #at_xpath with the relative expression Data/String selects the String node, and #text gives you its content.
nodeset.map { |n| n.at_xpath('Data/String').text } # => ["content1"]

Related

Getting a node value depending on an another value at the same level

For each "item" node in the following XML structure, I want to select the corresponding "title" (the title nodes are located at the same level as the item nodes, and I can't modify that).
The link between those two nodes is the "ref" node, which acts as a kind of primary key between the "item" and "title" trees.
Is this possible in XPath?
I think it should be something like this: //root/item/../title[ref/text()=??????]/label
An example :
<root>
<item>
<ref>ITEM001</ref>
</item>
<item>
<ref>ITEM002</ref>
</item>
<item>
<ref>ITEM003</ref>
</item>
<item>
<ref>ITEM004</ref>
</item>
<title>
<ref>ITEM002</ref>
<label>Hello world!</label>
</title>
<title>
<ref>ITEM003</ref>
<label>Goodbye world!</label>
</title>
<title>
<ref>ITEM007</ref>
<label>This is a test!</label>
</title>
<title>
<ref>ITEM0010</ref>
<label>No this a question!</label>
</title>
</root>
The result would be:
ITEM001: empty
ITEM002: Hello world!
ITEM003: Goodbye world!
ITEM004: empty
Thanks in advance for your help.
If you follow the steps below, you should get your desired output.
Step 1: Iterate through all the item elements and collect their ref values in an array.
Step 2: Loop over the array and, for each ref value, use the XPath below to find the corresponding label (ITEM002 stands for the current ref value).
//title[ref='ITEM002']/label
Step 3: If you find a matching element, print the text of the label; otherwise print empty.

How to write XML path expression for the following code?

Write an expression that selects the ISBN and title of all items whose return date
is "3/12/2017".
Code:
<itemlist>
<item>
<title>
The Bonfire of the Vanities
</title>
<type>Book</type>
<authors>
<author>Wolfe, Tom</author>
</authors>
<subjects>
<subject>New York</subject>
<subject>Race Relations</subject>
</subjects>
<isbn>0374115370</isbn>
<location>Adult</location>
<collection>Fiction</collection>
<status return="3/12/2017">Checked Out</status>
</item>
</itemlist>
//itemlist/item[status/@return='3/12/2017']/(isbn|title)
Find item elements whose status element child has return attribute that is "3/12/2017", then take those items' children that are isbn or title elements.

xpath multiple scope : select data from multiple trees

Problem: select data based on a node which is in another part of the tree.
How do I select the data in the rows for the column with label = "status"?
The data should be "data2" from /result/rows/items/item/c/items/item/v
and the selection should be based on label='status', i.e. /result/cols/items/item/label=status
In the XML below "status" is column number 2, but it may move to column number 1, in which case the XPath should return the data of column number 1.
<result>
<cols>
<items>
<item>
<id>c1</id>
<label>result</label>
<type>string</type>
</item>
<item>
<id>c2</id>
<label>status</label>
<type>string</type>
</item>
<item>
<id>c3</id>
<label>message</label>
<type>string</type>
</item>
</items>
</cols>
<rows>
<items>
<item>
<c>
<items>
<item>
<v>data1</v>
</item>
<item>
<v>data2</v>
</item>
<item>
<v />
</item>
</items>
</c>
</item>
</items>
</rows>
</result>
Your description is not very clear.
This is how I understood it:
There is a node which identifies the column. The label of the column is "status". You get this label with
/result/cols/items/item/label[text()='status']
But that's not what you want. First, you want to find out at which position that column is. You get that position with
count(/result/cols/items/item[label/text()='status']/preceding-sibling::*)+1
But that's still not what you want. Based on that information, you want to select the actual data within the rows. You get the data of the second column with
/result/rows/items/item/c/items/item[2]/v/text()
But you don't always want the second column of the row; you want the column at the index determined earlier. So you need to combine both:
/result/rows/items/item/c/items/item[count(/result/cols/items/item[label/text()='status']/preceding-sibling::*)+1]/v/text()
The last expression does not contain any hard coded indexes and uses only the column header text "status" to determine where the data is. In your example, it returns data2. If you change the column header text to "result", it gives you data1.
I'm not sure what you are asking for, but if you are looking for an expression which gets the "type" element for every label with the text "status", try:
//label[text()='status']/following-sibling::type

Get low level xpath from XML with Nokogiri

I'm trying to store in an array all the unique XPaths of the lowest-level elements in the XML below, but with my current code the array a ends up holding all of the XML, not just the XPaths. The elements sit at different depths: some have only two ancestors and others have more.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could help me in how to do this.
Thanks for the help.
What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping

How do I parse XML with Nokogiri css selectors, using loops?

I am trying to parse this sample XML file:
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
Here is my current code:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
somevar = doc.css("collection")
#create loop
somevar.each do |item|
puts "Item "
puts item['Title']
puts "\n"
end#items
Starting at the root of the XML document, I'm trying to go from the root "Collection" element down through each nested level.
I start with node sets and get information from the nodes, and the nodes contain elements. How do I assign a node to a variable and extract every layer underneath it, along with the text?
I can do something like the code below, but I want to know how to systematically move through each nested element of the XML using loops and output the data for each one. When I have finished showing the text, how do I move back up to the previous element/node, whatever it may be (traversing the tree)?
puts somevar.css("Keyworks Keyword").text
Nokogiri's NodeSet and Node support very similar APIs, with the key semantic difference that NodeSet's methods tend to operate on all the contained nodes in turn. For example, while a single node's children gets that node's children, a NodeSet's children gets all contained nodes' children (ordered as they occur in the document). So, to print all the titles and authors of all your items, you could do this:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
coll = doc.css("Collection")
coll.css("Items").children.each do |item|
  title = item.css("Title")[0]
  authors = item.css("Authors")[0]
  puts title.content if title
  puts authors.content if authors
end
You can get at any level of the tree in this way. Another example -- depth-first search printing every node in the tree (NB. the printed representation of a node includes the printed representations of its children, so the output will be quite long):
def rec(node)
  puts node
  node.children.each do |child|
    rec child
  end
end
Since you ask about this specifically, if you want to get at the parent of a given node, you can use the parent method. You may never need to though, if you can put your processing in blocks passed to each and the like on NodeSets containing subtrees of interest.

Resources