counting elements in xml with Nokogiri - ruby

I'd like to understand why count gives me 5?
If I'm at the root element and I want to know my children, it is supposed to give me 2.
doc = Nokogiri::XML(open('link..to....element.xml'))
root = doc.root.children.count
puts root
<element>
<name>Married with Children</name>
<name>Married with Children</name>
</element>

You get 5 as the result because there are five child nodes under the root <element> node. There are two <name> nodes and three text nodes that each consist of whitespace: one between the opening <element> and the first <name>, one between the two <name> elements, and one between the second <name> and the closing </element>:
doc.root.children.each do |c|
p c
end
output:
#<Nokogiri::XML::Text:0x80544a04 "\n ">
#<Nokogiri::XML::Element:0x80544900 name="name" children=[#<Nokogiri::XML::Text:0x8054470c "Married with Children">]>
#<Nokogiri::XML::Text:0x80544554 "\n ">
#<Nokogiri::XML::Element:0x80544478 name="name" children=[#<Nokogiri::XML::Text:0x80544284 "Married with Children">]>
#<Nokogiri::XML::Text:0x805440cc "\n">
If you use the noblanks option when parsing, Nokogiri won't include these whitespace nodes:
doc = Nokogiri::XML(open('link..to....element.xml')) { |c| c.noblanks }
Now doc.root.children.count will equal 2, because only the two <name> element nodes are included.

Related

XPath to get parents with multiple children but only one type of child

I need an XPath (1.0) to get all parent nodes with multiple children but only one type of child (e.g., either <div> or <li> but not <div> and <li>). Any help? Thank you!
<doc>
<tom>
<janet />
</tom>
<dick>
<janet />
<jane />
</dick>
<harry>
<jane />
</harry>
</doc>
So for the above we should get tom and harry but not dick
Using the example as a reference, the following XPath 1.0 expression:
/doc/*[count(./*) = count(./*[name(.) = name(../*[1])])]
Will return all children of doc where the total number of children of that element equals the number of children with the same name as the first child of that element. Or, more simply put, all children have the same name aka 'type'.
However, the above will return nodes that have 0 or 1 children, so to restrict it to only those where there are multiple child nodes, we can use:
/doc/*[count(./*) = count(./*[name(.) = name(../*[1])]) and count(./*) > 1]
If you want to further restrict it so that all children have to be a certain element, for example jane, you could use:
/doc/*[count(./*) = count(./*[name(.) = name(../*[1])]) and count(./*) > 1 and ./*[1] = ./jane[1]]

CSS/Xpath sibling selector in Nokogiri

I have the following XML tree and need to get out the first name and surname only for the contrib tags with child xref nodes of ref-type "corresp".
<pmc-articleset>
<article>
<front>
<article-meta>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Wereszczynski</surname>
<given-names>Jeff</given-names>
</name>
<xref rid="aff1" ref-type="aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Andricioaei</surname>
<given-names>Ioan</given-names>
</name>
<xref rid="aff1" ref-type="aff"/>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
</contrib-group>
</article-meta>
</front>
</article>
</pmc-articleset>
I saw "Getting the siblings of a node with Nokogiri" which points out the CSS sibling selectors that can be used in Nokogiri, but, following the example given, my code gives siblings indiscriminately.
require "net/http"
require "nokogiri"
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=PMC1637560&db=pmc"
xml_data = Net::HTTP.get_response(URI.parse(url)).body
parsedoc = Nokogiri::XML.parse(xml_data)
corrdetails = parsedoc.at('contrib:has(xref[text()="*"])')
puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text
=> WereszczynskiAndricioaei
=> JeffIoan
I only want the sibling nodes when the contrib contains <xref ref-type="corresp">*</xref>, that is, an output of:
=> Andricioaei
=> Ioan
I've currently implemented this without referring to ref-type but rather selecting the asterisk within the xref tag (either is appropriate).
The problem is actually with your XPath for getting the surname and given names, i.e., the XPath is incorrect in these lines:
puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text
Starting the XPath with // means to look for the node anywhere in the document. You only want to look within the corrdetails node, which means the XPath needs to start with a dot, e.g., .//.
Change the two lines to:
puts surname = corrdetails.xpath( ".//surname" ).text
puts givennames = corrdetails.xpath(".//given-names").text

How do I parse XML with Nokogiri css selectors, using loops?

I am trying to parse this sample XML file:
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
Here is my current code:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
somevar = doc.css("collection")
#create loop
somevar.each do |item|
puts "Item "
puts item['Title']
puts "\n"
end#items
Starting at the root of the XML document, I'm trying to go from the root "Collections" down to each new level.
I start in the node sets, and get information from the nodes, and the nodes contain elements. How do I assign the node to a variable, and extract every single layer underneath that and the text?
I can do something like the code below, but I want to know how to systematically move through each nested element of XML using loops, and output the data for each line. When finished showing text, how do I move back up to the previous element/node, whatever it may be (traversing a node in the tree)?
puts somevar.css("Keywords Keyword").text
Nokogiri's NodeSet and Node support very similar APIs, with the key semantic difference that NodeSet's methods tend to operate on all the contained nodes in turn. For example, while a single node's children gets that node's children, a NodeSet's children gets all contained nodes' children (ordered as they occur in the document). So, to print all the titles and authors of all your items, you could do this:
require 'nokogiri'
doc = Nokogiri::XML(File.open("sample.xml"))
coll = doc.css("Collection")
coll.css("Items").children.each do |item|
title = item.css("Title")[0]
authors = item.css("Authors")[0]
puts title.content if title
puts authors.content if authors
end
You can get at any level of the tree in this way. Another example -- depth-first search printing every node in the tree (NB. the printed representation of a node includes the printed representations of its children, so the output will be quite long):
def rec(node)
puts node
node.children.each do |child|
rec child
end
end
Since you ask about this specifically, if you want to get at the parent of a given node, you can use the parent method. You may never need to though, if you can put your processing in blocks passed to each and the like on NodeSets containing subtrees of interest.

Get the non-empty element using XPATH

I have the following XML
<?xml version = "1.0" encoding = "UTF-8"?>
<root>
<group>
<p1></p1>
</group>
<group>
<p1>value1</p1>
</group>
<group>
<p1></p1>
</group>
</root>
Is it possible to get the last node with a value? In this case, that would be the value of the second group/p1.
This xpath should work as well:
//group/p1[string-length(text()) > 0]
How about something like /root/group/p1[text() and not(../following-sibling::group/p1/text())]
In other words: get the p1 elements that have text and whose group parents are not followed by group nodes that have non-empty p1 elements.
If you need the opposite, i.e. the empty elements, you may use the [not(node())] predicate.
Example: //group/p1[not(node())]
It actually can be simplified as below:
//group/p1[string-length() > 0] => element text is non-empty
//group/p1[string-length() = 6] => element text has length 6

How to get content from next node

I have an XML below -
<document>
<node name="Node 0 Text here" ID="01" >aa
</node>
<node name="Node 1 Text here" ID="11">bb
</node>
<node name="Node 2 Text here" ID="12">cc
</node>
<node name="Node 3 Text here" ID="22">dd
</node>
<node name="Node 4 Text here" ID="23">ee
</node>
</document>
I need to search content in a particular node within this XML.
If search keyword does not exist in that node, then I have to begin searching from the next node of current node, you could say sibling.
If that keyword does not exist in all the nodes after the current node then it should begin search from start..
I have to achieve this in my code behind- dotnet class. I have used -
XmlNodeList xmlNodes = xd.SelectNodes("//12/following-sibling::*");
Here, 12 refers to nodeid of the current node,which will be passed as an argument. But I am getting error.
Any help is appreciated.
I need to search content in a particular node within this XML
To get a node matching by its content, the XPath is:
node[contains(text(),'aa')]
This will return the first node for example and any other node whose content text contains aa.
If search keyword does not exist in that node, then I have to begin searching from the next node of current node, you could say sibling. If that keyword does not exist in all the nodes after the current node then it should begin search from start.
This requirement does not map directly onto XPath. The expression above returns all nodes matching the keyword. If you want only the first matched node, you can either take it from the XmlNodeList afterwards or get it directly by changing the XPath expression to:
node[contains(text(),'aa')][1]
12 refers to nodeid of the current node,which will be passed as an argument
That's not correct. To select the node by id you should use, for instance:
node[@ID='12']/text()
This will get the content of the node whose ID attribute is 12.
Use:
(/*/node[@ID='12']/following-sibling::*[contains(.,$pattern)][1]
|
/*/node[@ID='12']/preceding-sibling::*[contains(.,$pattern)][1]
)
[last()]
This expression selects the last from the two wanted selections -- the first of the following siblings that contains the value of $pattern and the first of the preceding siblings that contains the value of $pattern.
You need to substitute $pattern with the exact value you want to search for.
