I'm new to XML/Nokogiri. I'm trying to fetch all the nodes with a certain name from an XML document someone else generated. The document looks like:
<taxonomy>
<taxonomy_name>World</taxonomy_name>
<node atlas_node_id = "val">
<node_name></node_name>
<node atlas_node_id = "val">
<node_name></node_name>
<node atlas_node_id = "val">
<node_name></node_name>
</node>
<node atlas_node_id = "val">
<node_name></node_name>
</node>
</node>
<node atlas_node_id = "val">
<node_name></node_name>
</node>
<node atlas_node_id = "val">
<node_name></node_name>
</node>
</node>
</taxonomy>
I want to pull ALL the nodes with the attribute atlas_node_id. In my build_files method I have the following line:
destinations = tax_file.xpath("//node")
where tax_file is previously set to point to the XML file.
The above returns what seems like ALL the nodes in the file and if I try to set destinations to tax_file.xpath("//node_name/node") then I get an empty NodeSet. Is there some way I can pull all the nodes with the attribute atlas_node_id?
I glanced through "Searching a XML/HTML Document" but didn't really see anything that could help. Am I missing something really obvious?
Update
After trying the solutions suggested by haradwaith and Alexey Shein - both solutions seem to fetch all the nodes as one large node? Testing in irb:
destinations = tax_file.xpath("//node[@atlas_node_id]") (OR)
destinations = tax_file.css('[atlas_node_id]')
d = destinations[0]
d.content
>> \n Africa\n \n South Africa\n \n Cape Town\n \n Table Mountain National Park\n \n \n \n Free State\n \n Bloemfontein\n \n \n \n Gauteng\n \n Johannesburg\n \n \n Pretoria\n \n \n \n KwaZulu-Natal\n \n Durban\n \n \n Pietermaritzburg\n \n \n \n Mpumalanga\n \n Kruger National Park\n \n \n \n The Drakensberg\n \n Royal Natal National Park\n \n \n \n The Garden Route\n \n Oudtshoorn\n \n \n Tsitsikamma Coastal National Park\n \n \n \n\nSudan\n\nEastern Sudan\n\nPort Sudan\n\n\n\nKhartoum\n\n\n\nSwaziland\n\n
Where I would have expected to see just 'Africa'. Any ideas as to why this is happening?
Just use the [] CSS selector:
xml = <<EOD
<taxonomy>
<taxonomy_name>World</taxonomy_name>
<node atlas_node_id = "val">
<node_name>Africa</node_name>
<node atlas_node_id = "val">
<node_name>Capetown</node_name>
</node>
</node>
</taxonomy>
EOD
tax_file = Nokogiri::XML(xml)
nodes = tax_file.css('[atlas_node_id] > node_name')
p nodes.first.text # => "Africa"
You can read a short introduction to CSS selectors on MDN.
Oh, it seems you don't need the nodes with the atlas_node_id attribute themselves, but their <node_name> children.
What the code above says is: find every element that has an attribute named "atlas_node_id" and get its immediate (i.e. one level deep) children with the tag "node_name".
You can find an explanation of the XPath 1.0 syntax in the documentation.
To get all the nodes with an attribute atlas_node_id, you can do:
tax_file.xpath("//node[@atlas_node_id]")
Related
Based on the example XML file employees.xml below, and using the Ruby Nokogiri gem, I want to open this file, change the building number to 320 and the room number to 99 for Sandra Defoe, and save the changes. What is the recommended way to do it?
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be129">
<firstname>Jane</firstname>
<lastname>Doe</lastname>
<building>327</building>
<room>19</room>
</employee>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>327</building>
<room>22</room>
</employee>
<employee id="be133">
<firstname>Steve</firstname>
<lastname>Casey</lastname>
<building>327</building>
<room>24</room>
</employee>
</employees>
I'd use this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>327</building>
<room>22</room>
</employee>
</employees>
EOT
first_name = 'Sandra'
last_name = 'Defoe'
node = doc.at("//employee[firstname/text()='%s' and lastname/text()='%s']" % [first_name, last_name])
node.at('building').content = '320'
node.at('room').content = '99'
Which results in:
doc.to_xml
# => "\uFEFF<?xml version=\"1.0\" encoding=\"utf-16\"?>\n" +
# "<employees>\n" +
# " <employee id=\"be130\">\n" +
# " <firstname>William</firstname>\n" +
# " <lastname>Defoe</lastname>\n" +
# " <building>326</building>\n" +
# " <room>14a</room>\n" +
# " </employee>\n" +
# " <employee id=\"be132\">\n" +
# " <firstname>Sandra</firstname>\n" +
# " <lastname>Defoe</lastname>\n" +
# " <building>320</building>\n" +
# " <room>99</room>\n" +
# " </employee>\n" +
# "</employees>\n"
Normally I recommend using CSS selectors because they tend to result in less visual noise. However, CSS doesn't let us peek into the text of nodes, and working around that, while possible, results in even more noise. XPath, on the other hand, can be very noisy, but for this sort of task it's more usable.
XPath is very well documented and figuring out what this is doing should be pretty easy.
The Ruby side of it is using a "format string":
"//employee[firstname/text()='%s' and lastname/text()='%s']" % [first_name, last_name]
similar to
"%s %s" % [first_name, last_name] # => "Sandra Defoe"
"//employee[firstname/text()='%s' and lastname/text()='%s']" % [first_name, last_name]
# => "//employee[firstname/text()='Sandra' and lastname/text()='Defoe']"
Just for thoroughness, here's what I'd do if I wanted to use CSS exclusively:
node = doc.search('employee').find { |node|
node.at('firstname').text == first_name && node.at('lastname').text == last_name
}
This gets ugly though, because search tells Nokogiri to retrieve all employee nodes from libXML, then Ruby has to walk through them all telling Nokogiri to tell libXML to look in the child firstname and lastname nodes and return their text. That's slow, especially if there are many employee nodes and the one you want is at the bottom of the file.
The XPath selector tells Nokogiri to pass the search to libXML which parses it, finds the employee node with the child nodes containing the first and last names and returns only that node. It's much faster.
Note that at('employee') is equivalent to search('employee').first.
# File 'lib/nokogiri/xml/searchable.rb', line 70
def at(*args)
search(*args).first
end
Finally, meditate on the difference between NodeSet#text and Node#text, as the first will lead to insanity.
Assume your content is a string:
xml=%q(
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be129">
<firstname>Jane</firstname>
<lastname>Doe</lastname>
<building>327</building>
<room>19</room>
</employee>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>327</building>
<room>22</room>
</employee>
<employee id="be133">
<firstname>Steve</firstname>
<lastname>Casey</lastname>
<building>327</building>
<room>24</room>
</employee>
</employees>)
doc = Nokogiri.parse(xml)
This will work, but it assumes the combination of first and last name is unique; otherwise it modifies only the first employee that matches both.
target = doc.css('employee').find do |node|
node.search('firstname').text == 'Sandra' &&
node.search('lastname').text == 'Defoe'
end
target.at_css('building').content = '320'
target.at_css('room').content = '99'
doc # outputs the updated xml
=> <?xml version="1.0"?>
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be129">
<firstname>Jane</firstname>
<lastname>Doe</lastname>
<building>327</building>
<room>19</room>
</employee>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>320</building>
<room>99</room>
</employee>
<employee id="be133">
<firstname>Steve</firstname>
<lastname>Casey</lastname>
<building>327</building>
<room>24</room>
</employee>
</employees>
I have a very strange XML file that I need to update using Augeas.
<root>
<node name="Client">
<node name="Attributes">
<info>
<test>
<entry><key>colour</key><value type="string">blue</value></entry>
</test>
</info>
</node>
</node>
<node name="Network">
<node name="Server">
<info>
<test>
<entry><key>transport</key><value type="string">internet</value></entry>
<entry><key>ipAddr</key><value type="string">125.125.125.142</value></entry>
<entry><key>portNo</key><value type="string">1234</value></entry>
<entry><key>protocolType</key><value type="string">tcp</value></entry>
</test>
</info>
</node>
</node>
</root>
I need to update the element "value" which is just after the element "key" which contains the text ipAddr.
Based on your description of the node you want to update, here's a suggestion:
set /files/path/to/your/file.xml//entry[key/#text="ipAddr"]/value/#text "255.255.255.0"
This selects the entry node at any level in the file, which has a key/#text subnode with value ipAddr and then it updates its value/#text subnode to have value 255.255.255.0.
I am trying to select the node Prp[@name='node name'] that has an ancestor named item20, using the XPath expression //Prp[@name='node name' and ../../../*[@name='item20']], but this works only if my file contains only this part of XML:
<Node name="item20">
<Node name="config">
<Node name="runmodeparams">
<Node name="simple">
<Prp name="filename" type="S" value="p"/>
<Prp name="filepath" type="S" value="r"/>
</Node>
<Prp name="activerunmode" type="S" value="Simple"/>
</Node>
<Prp name="node name" type="S" value="lastversion"/>
</Node>
If it also contains another part of the XML file like the following one, then XPath returns an empty result.
<Node name="item20">
<Node name="config">
<Node name="runmodeparams">
<Node name="simple">
<Prp name="filename" type="S" value="p"/>
<Prp name="filepath" type="S" value="r"/>
</Node>
<Prp name="activerunmode" type="S" value="Simple"/>
</Node>
<Prp name="node name" type="S" value="lastversion"/>
</Node>
</Node>
<Node name="item21">
<Node name="config">
<Node name="runmodeparams">
<Node name="simple">
<Prp name="filename" type="S" value="p"/>
<Prp name="filepath" type="S" value="r"/>
</Node>
<Prp name="activerunmode" type="S" value="Simple"/>
</Node>
<Prp name="node name" type="S" value="lastversion"/>
</Node>
</Node>
How can I properly select the node?
The second XML snippet you gave is not well-formed XML, as it contains two root nodes. If this really is your full XML input, you should
fix it if possible, or at least wrap it in a single root node, and
try to get an error message from your XPath engine.
I wrapped it in another element and your XPath expression did run, but it probably didn't return the expected result: the node name elements of both item20 and item21 are returned, because you're stepping out too far.
Anyway, you'd better check for "item20" in a predicate when stepping down the XML tree:
//Node[@name='item20']//Prp[@name='node name']
This not only limits the result to the node you're looking for, but should also be faster in most cases.
If performance really matters and the <Prp/> element you're looking for is always at the same position, try to avoid the descendant-or-self-steps // and provide a full distinct path, here it would be
//Node[@name='item20']/Node[@name='config']/Prp[@name='node name']
<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
I want to remove the "pencil" node and print the remaining XML using REXML (Ruby). Can anybody tell me how to do that?
By using one of the delete methods http://rubydoc.info/stdlib/rexml/
require "rexml/document"
string = <<EOF
<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
EOF
doc = REXML::Document.new(string)
doc.delete_element('//pencil')
puts doc
There is also a nice tutorial to get you started: http://www.germane-software.com/software/rexml/docs/tutorial.html
Given a search term, how can I search the attributes of nodes in an XML document and return XML that contains only those nodes that match the term, along with their parents, tracing all the way to the root node?
Here is an example of the input XML:
<root>
<node name = "Amaths">
<node name = "Bangles"/>
</node>
<node name = "C">
<node name = "Dangles">
<node name = "E">
<node name = "Fangles"/>
</node>
</node>
<node name = "Gdecimals" />
</node>
<node name = "Hnumbers"/>
<node name = "Iangles"/>
</root>
The output I'm looking for, given the search term "angles":
<root>
<node name = "Amaths">
<node name = "Bangles"/>
</node>
<node name = "C">
<node name = "Dangles">
<node name = "E">
<node name = "Fangles"/>
</node>
</node>
</node>
<node name = "Iangles"/>
</root>
The XPath that I use to search the XML is "//*[contains(@name,'angles')]"
I'm using Nokogiri in Ruby to search the XML which provides me a NodeSet of all nodes that match the term. I cannot figure out how to construct back the XML from that set of nodes.
Thanks!
EDIT: Fixed the example should have been . Thanks Dimitre.
EDIT 2: Fixed the xml again for well-formedness.
First, do note that the presented wanted output is incorrect and the following element has no end tag later in the document:
<node name = "C">
The result of evaluating an XPath expression can be a set of nodes from the XML document, but these nodes can't be altered by XPath.
This XPath expression selects the
nodes that match the term along with
their parents all the way tracing to
the root node
//*[contains(@name,'angles') and not(node())]/ancestor::*
However, the nodes are not changed and they still contain all their children, meaning that the complete subtree rooted in root is still a subtree of root in the returned result.
In case you want to obtain a new document (set of nodes) with different structure than the original XML document, you have to use another language that is hosting XPath. There are many such languages, such as XSLT, XQuery and any language with an XML DOM implementation.
Here is an XSLT transformation, producing the wanted result:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(descendant-or-self::*[contains(@name, 'angles')])]"/>
</xsl:stylesheet>
When this transformation is applied to the provided XML document (corrected to be well-formed):
<root>
<node name = "Amaths">
<node name = "Bangles"/>
</node>
<node name = "C">
<node name = "Dangles">
<node name = "E">
<node name = "Fangles"/>
</node>
<node name = "Gdecimals" />
</node>
</node>
<node name = "Hnumbers"/>
<node name = "Iangles"/>
</root>
the wanted (correct) result is produced:
<root>
<node name="Amaths">
<node name="Bangles"/>
</node>
<node name="C">
<node name="Dangles">
<node name="E">
<node name="Fangles"/>
</node>
</node>
</node>
<node name="Iangles"/>
</root>