Find string in NodeSet with XPath (Nokogiri) - ruby

I have this XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1">
<text top="91">Rapport</text>
<text top="102">foo</text>
</page>
<page number="2">
<text top="91">Rapport</text>
<text top="102">bar</text>
</page>
<page number="3">
<text top="91">Rapport</text>
<text top="102">asdf</text>
</page>
</pdf2xml>
which I'm parsing with:
require 'nokogiri'
doc = Nokogiri::XML(File.read("file.xml"))
pages = doc.xpath("//page")
nodeset = pages[0].xpath("./text") + pages[1].xpath("./text")
I want to find a node by string in nodeset, like this
irb(main):011:0> nodeset.at_xpath("//text[text()[contains(., 'bar')]]")
=> #<Nokogiri::XML::Element:0x3fea6a4821d4 name="text" attributes=[#<Nokogiri::XML::Attr:0x3fea6a482170 name="top" value="102">] children=[#<Nokogiri::XML::Text:0x3fea6a481cac "bar">]>
but I don't want to use //
I have managed to do this
irb(main):018:0> nodeset.at_xpath("text()[contains(., 'bar')]")
=> #<Nokogiri::XML::Text:0x3fea6a481cac "bar">
but I want the whole <text> node.
What should my xpath query on nodeset look like?

To select the parent of the current node you can use the .. step. For example,
/pdf2xml/page[1]
points to the first <page> node. If you want to select its parent again you can write
/pdf2xml/page[1]/..
This will select <pdf2xml> node which is the parent of <page>.
Along similar lines, you can use .. to select the parent node in your example.
Hope this helps.

Simpler than selecting the text() node and then selecting the parent node is to just select the node you want in the first place:
pages = doc.xpath("//page")
puts pages.xpath("text[contains(.,'bar')]")
#=> <text top="102">bar</text>
If it makes you feel better, you could alternatively explicitly test the text() child node of the text element instead of using the text equivalent for the element:
pages.xpath("text[contains(text(),'bar')]")

I just discovered that
nodeset.at_xpath("../text[text()[contains(., 'bar')]]")
works too.
Edit: But I think this is slower than using /.. to step up from the text() node.

Related

Getting a single child element with a given name with Nokogiri

Let's say I have XML which looks like this:
<paper>
<header>
</header>
<body>
<paragraph>
</paragraph>
</body>
<conclusion>
</conclusion>
</paper>
Is there a way I can just get conclusion, without making an ugly loop like:
for child in paper.children do
if child.name == "conclusion"
conclusion = child
end
end
puts conclusion
Ideally something like python's Element.find('conclusion').
Try with xpath method.
node = doc.xpath("//conclusion")[0]
or, if you know there is just one
node = doc.at_xpath("//conclusion")

Ruby + Nokogiri + Xpath navigate Node_Set

<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content</String>
</Data>
</Item>
I have created a Nokogiri-NodeSet with this structure, i.e. a list of items with links and data children.
How can I filter any items that don't match a certain value in the 'target'-attribute of <FirstLink>?
Actually, what I want in the end is to extract the <Data><String> content of every <Item> that matches a certain value in its <FirstLink> target attribute.
I've tried several approaches already but I'm at a loss as to how to identify an element by an attribute of its grandchild, and then extract the content of this grandchild's parent's sibling, X(.
We can build up an XPath expression to do this. Assuming we are starting from the whole XML document, rather than the node-set you already have, something like
//Item
will select all <Item> elements (I’m guessing you already have something like that to get this node-set).
Next, to select only those <Item> elements which have <Links><FirstLink> where FirstLink has a target attribute value of one:
//Item[Links/FirstLink[@target='one']]
and finally to select the Data/String children of those nodes:
//Item[Links/FirstLink[@target='one']]/Data/String
So with Nokogiri you could use something like this (where doc is your parsed document):
doc.xpath("//Item[Links/FirstLink[@target='one']]/Data/String")
or if you want to use the node-set you already have you can use a relative expression:
nodeset.xpath("self::Item[Links/FirstLink[@target='one']]/Data/String")
I didn't fully understand what your goal is, but here is my best guess at how to proceed in this case:
require 'nokogiri'
doc = Nokogiri::XML <<-xml
<Items>
<Item id="item0">
<Links>
<FirstLink id="link1" target="one"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content1</String>
</Data>
</Item>
<Item id="item1">
<Links>
<FirstLink id="link1" target="two"/>
<SecondLink id="link2" target="two"/>
</Links>
<Data>
<String>content2</String>
</Data>
</Item>
</Items>
xml
The two Item elements are wrapped in an <Items> root here, because a well-formed XML document has exactly one root element; without it the parser may stop after the first Item.
The xpath method with the expression "//Item" selects all the Item nodes. Those nodes are then passed to select, which keeps only the Items whose Links/FirstLink child has a target attribute containing "one".
node.at("./Links/FirstLink")['target'] gives you the value of the target attribute of that Item's FirstLink as a string: "one" for the first Item, "two" for the second. The trailing ['one'] in node.at("./Links/FirstLink")['target']['one'] is a call to String#[], which returns the matching substring, or nil when there is no match. The paths start with ./ so that they are evaluated relative to the current node; a path starting with // searches the whole document no matter which node it is called on.
Because String#[] also accepts a regular expression, this approach lets you match by pattern too.
nodeset = doc.xpath("//Item").select do |node|
node.at("./Links/FirstLink")['target']['one']
end
Now nodeset contains only the required Item nodes. Next I use map, passing each Item node to the block to collect the content of its String node. The at method with the expression ./Data/String selects the String node, and text gives you its content.
nodeset.map { |n| n.at('./Data/String').text } # => ["content1"]

How do I parse XML with Nokogiri css selectors, using loops?

I am trying to parse this sample XML file:
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
Here is my current code:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
somevar = doc.css("collection")
#create loop
somevar.each do |item|
puts "Item "
puts item['Title']
puts "\n"
end#items
Starting at the root of the XML document, I'm trying to go from the root "Collection" down to each new level.
I start in the node sets, and get information from the nodes, and the nodes contain elements. How do I assign the node to a variable, and extract every single layer underneath that and the text?
I can do something like the code below, but I want to know how to systematically move through each nested element of XML using loops, and output the data for each line. When finished showing text, how do I move back up to the previous element/node, whatever it may be (traversing a node in the tree)?
puts somevar.css("Keywords Keyword").text
Nokogiri's NodeSet and Node support very similar APIs, with the key semantic difference that NodeSet's methods tend to operate on all the contained nodes in turn. For example, while a single node's children gets that node's children, a NodeSet's children gets all contained nodes' children (ordered as they occur in the document). So, to print all the titles and authors of all your items, you could do this:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
coll = doc.css("Collection")
coll.css("Items").children.each do |item|
title = item.css("Title")[0]
authors = item.css("Authors")[0]
puts title.content if title
puts authors.content if authors
end
You can get at any level of the tree in this way. Another example -- depth-first search printing every node in the tree (NB. the printed representation of a node includes the printed representations of its children, so the output will be quite long):
def rec(node)
puts node
node.children.each do |child|
rec child
end
end
Since you ask about this specifically, if you want to get at the parent of a given node, you can use the parent method. You may never need to though, if you can put your processing in blocks passed to each and the like on NodeSets containing subtrees of interest.

How do I force parsing an XML node as hash array?

This is my simplified myXML:
<?xml version="1.0" encoding="utf-8"?>
<ShipmentRequest>
<Message>
<MemberId>A00000001</MemberId>
<MemberName>Bruce</MemberName>
<Line>
<LineNumber>3.1</LineNumber>
<Item>fruit-004</Item>
<Description>Peach</Description>
</Line>
<Line>
<LineNumber>4.1</LineNumber>
<Item>fruit-001</Item>
<Description>Peach</Description>
</Line>
</Message>
</ShipmentRequest>
When I parse it with the Crack gem myHash is:
{
"MemberId"=>"A00000001",
"MemberName"=>"Bruce",
"Line"=>[
{"LineNumber"=>"3.1", "Item"=>"fruit-004", "Description"=>"Peach"},
{"LineNumber"=>"4.1", "Item"=>"fruit-001", "Description"=>"Peach"}
]
}
The Crack gem creates the value of Line as an array, because there are two <Line> nodes in myXML. But if myXML contained only one <Line> node, the Crack gem would not parse it as an array:
{
"MemberId"=>"ABC0001",
"MemberName"=>"Alan",
"Line"=> {"LineNumber"=>"4.1", "Item"=>"fruit-004", "Description"=>"Apple"}
}
I want to see it still as an array no matter if there's only one node:
{
"MemberId"=>"ABC0001",
"MemberName"=>"Alan",
"Line"=> [{"LineNumber"=>"4.1", "Item"=>"fruit-004", "Description"=>"Apple"}]
}
After you convert the XML document to a hash you could do this:
myHash["Line"] = [myHash["Line"]] if myHash["Line"].kind_of?(Hash)
It ensures that a single Line node is wrapped in an Array.
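The same idea can be put in a small helper. This is only a sketch: ensure_array is a made-up name, and the hashes below are hand-written stand-ins for what the parser might return, so no XML gem is needed to run it. Note that Kernel#Array is not a substitute here, because it turns a Hash into key/value pairs.

```ruby
# Hypothetical helper: wrap a lone value in an Array, leave Arrays alone.
def ensure_array(value)
  return [] if value.nil?
  value.is_a?(Array) ? value : [value]
end

# Stand-ins for parsed results with one and with two <Line> nodes.
single = {
  "MemberId" => "ABC0001",
  "Line" => { "LineNumber" => "4.1", "Item" => "fruit-004", "Description" => "Apple" }
}
many = {
  "MemberId" => "A00000001",
  "Line" => [
    { "LineNumber" => "3.1", "Item" => "fruit-004", "Description" => "Peach" },
    { "LineNumber" => "4.1", "Item" => "fruit-001", "Description" => "Peach" }
  ]
}

single["Line"] = ensure_array(single["Line"])
many["Line"]   = ensure_array(many["Line"])

p single["Line"]
p many["Line"].length

# Pitfall: Array() on a Hash yields an array of [key, value] pairs.
pairs = Array({ "LineNumber" => "4.1" })
p pairs
```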
The problem is, you're relying on code to do what you really should do. Crack has no idea that you want a single node to be an array of a single element, and that behavior makes it a lot more difficult for you when trying to dive into that portion of the data.
Parsing XML isn't hard, and, by parsing it yourself, you'll know what to expect, and will avoid the hassle of dealing with the "sometimes it's an array and sometimes it's not" returned by Crack.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<ShipmentRequest>
<Message>
<MemberId>A00000001</MemberId>
<MemberName>Bruce</MemberName>
<Line>
<LineNumber>3.1</LineNumber>
<Item>fruit-004</Item>
<Description>Peach</Description>
</Line>
<Line>
<LineNumber>4.1</LineNumber>
<Item>fruit-001</Item>
<Description>Peach</Description>
</Line>
</Message>
</ShipmentRequest>
EOT
That sets up the DOM, so it can be navigated:
hash = {}
message = doc.at('Message')
hash[:member_id] = message.at('MemberId').text
hash[:member_name] = message.at('MemberName').text
lines = message.search('Line').map do |line|
line_number = line.at('LineNumber').text
item = line.at('Item').text
description = line.at('Description').text
{
:line_number => line_number,
:item => item,
:description => description
}
end
hash[:lines] = lines
message = doc.at('Message') finds the first <Message> node.
message.at('MemberId').text finds the first <MemberId> node inside <Message>.
message.at('MemberName').text is similar to the above step.
message.search('Line') looks for all <Line> nodes inside <Message>.
From those descriptions you can figure out the rest.
After running, hash looks like:
{:member_id=>"A00000001",
:member_name=>"Bruce",
:lines=>
[{:line_number=>"3.1", :item=>"fruit-004", :description=>"Peach"},
{:line_number=>"4.1", :item=>"fruit-001", :description=>"Peach"}]}
If I remove one of the <Line> blocks from the XML, and re-run, I get:
{:member_id=>"A00000001",
:member_name=>"Bruce",
:lines=>[{:line_number=>"3.1", :item=>"fruit-004", :description=>"Peach"}]}
Using search to locate the <Line> nodes is the trick. search returns a NodeSet, which is akin to an Array, so by iterating over it using map it'll return an array of hashes of the contents of <Line> tags.
Nokogiri is a great tool for parsing HTML and XML, then allowing us to search, add, change or remove nodes. It supports CSS and XPath accessors, so if you are used to jQuery or how CSS works, or XPath expressions, you'll be off and running quickly. The tutorials for Nokogiri are a good starting place to learn how it works.

getting XmlSearch to return siblings only, not children

I'm getting a SOAP response that looks like this:
<Activity>
<Id>A</Id>
<Subject>foo</Subject>
<Activity>Task</Activity>
</Activity>
<Activity>
<Id>B</Id>
<Subject>bar</Subject>
<Activity>Appointment</Activity>
</Activity>
<Activity>
<Id>C</Id>
<Subject>snafu</Subject>
<Activity>Task</Activity>
</Activity>
In ColdFusion, I was trying to parse out the Activity nodes with this:
<cfset arrMainNodes = XmlSearch(soapResponse, "//*[name()='Activity']") />
The problem is, instead of getting an array with three elements, I get an array with six: 3 of the parent, and 3 of the children.
I can't for the life of me figure out the XPath statement that will find siblings only, and not children.
Please Help.
Use:
//*[name()='Activity' and not(ancestor::*[name()='Activity' ])]
This selects all elements in the document, whose name is "Activity" and that do not have an ancestor with name "Activity".

Resources