Traverse xml structure to determine if a certain text node exists - ruby

Alright I have an xml document that looks something like this:
<xml>
<list>
<partner>
<name>Some Name</name>
<status>active</status>
<id>0</id>
</partner>
<partner>
<name>Another Name</name>
<status>active</status>
<id>1</id>
</partner>
</list>
</xml>
I am using Ruby's libxml bindings to parse it.
I want to find out whether there is a partner with the name 'Some Name' in a quick and idiomatic Ruby way.
How can I do this in one line of Ruby code, assuming I have the document parsed in a variable named document, so that I can call document.find(xpath) to retrieve nodes? I have had to do this several times in slightly different scenarios and now it's starting to bug me.
I know I can do the following (but it's ugly):
found = false
document.find('//partner/name').each do |name|
if (name.content == 'Some Name')
found = true
break
end
end
assert(found, "Some Name should have been found")
but I find this really ugly. I thought about using Enumerable's include? method, but that won't work because I need to compare the .content of each node rather than the node itself.
While writing this, I thought of the following, which is more elegant but seems somewhat inefficient:
found = document.find('//partner/name').collect{|name| name.content}.member?("Some Name")
Are there any other ways of doing this?

What about this?
found = !document.find("//partner[name='Some Name']").empty?

I tried this solution:
found = document.find("//partner[name='Some Name']") != nil
but I got an error saying the XPath expression was invalid.
However, reading some XPath documentation, it looks like you can call text() in the expression to get the text node. I tried the following and it appears to work:
found = document.find("//partner/name/text()='Some Name'")
Here found is not an XML node but a boolean, because the expression is a comparison rather than a node selection, so this works.
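Staying on the Ruby side, the loop from the question can also be collapsed with Enumerable#any?, which stops at the first match and avoids the intermediate array that collect builds. The following is only a minimal sketch: it uses REXML from Ruby's standard distribution instead of libxml so that it runs without extra gems, and the variable names are illustrative. With libxml the same block should work on the result of document.find, assuming that result is Enumerable (the question's use of collect suggests it is), with name.content in place of name.text.

```ruby
require 'rexml/document'

# Sample document from the question, compacted.
xml = <<~XML
  <xml>
    <list>
      <partner><name>Some Name</name><status>active</status><id>0</id></partner>
      <partner><name>Another Name</name><status>active</status><id>1</id></partner>
    </list>
  </xml>
XML

doc = REXML::Document.new(xml)

# Select every partner name once, then let any? short-circuit on the first hit.
names = REXML::XPath.match(doc, '//partner/name')
found = names.any? { |name| name.text == 'Some Name' }
missing = names.any? { |name| name.text == 'No Such Name' }

puts found
puts missing
```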

I would use a language that natively operates on XML (XQuery, for example). With XQuery it is possible to formulate this sort of query over XML data in a concise and elegant way.

Related

Nokogiri not parsing XML in ruby - xmlns issue?

Given the following ruby code :
require 'nokogiri'
xml = "<?xml version='1.0' encoding='UTF-8'?>
<ProgramList xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='http://publisher.webservices.affili.net/'>
<TotalRecords>145</TotalRecords>
<Programs>
<ProgramSummary>
<ProgramID>6540</ProgramID>
<Title>Matalan</Title>
<Limitations>A bit of text
</Limitations>
<URL>http://www.matalan.co.uk</URL>
<ScreenshotURL>http://www.matalan.co.uk/</ScreenshotURL>
<LaunchDate>2009-11-02T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
<ProgramSummary>
<ProgramID>11787</ProgramID>
<Title>Club 18-30</Title>
<Limitations/>
<URL>http://www.club18-30.com/</URL>
<ScreenshotURL>http://www.club18-30.com</ScreenshotURL>
<LaunchDate>2013-05-16T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
</Programs>
</ProgramList>"
doc = Nokogiri::XML(xml)
p doc.xpath("//Programs")
gives :
=> []
Not what is expected.
On further investigation if I remove xmlns='http://publisher.webservices.affili.net/' from the initial <ProgramList> tag I get the expected output.
Indeed if I change xmlns='http://publisher.webservices.affili.net/' to xmlns:anything='http://publisher.webservices.affili.net/' I get the expected output.
So my question is what is going on here? Is this malformed XML? And what is the best strategy for dealing with it?
While it's hardcoded in this example the XML is (will be) coming from a web service.
Update
I realise I can use the remove_namespaces! method but the Nokogiri docs do say that it's "...probably is not a good thing in general" to do this. Also I'm interested in why it's happening and what the 'correct' XML should be.
The xmlns='http://publisher.webservices.affili.net/' indicates the default namespace for all elements under the one where it appears (including the element itself). That means that all elements that don’t otherwise have an explicit namespace fall under this namespace.
XPath queries don’t have default namespaces (at least in XPath 1.0), so any name that appears in one without a prefix refers to that element in no namespace.
In your code, you want to find Programs elements in the http://publisher.webservices.affili.net/ namespace (since that is the default namespace), but your XPath query is looking for Programs elements in no namespace.
To explicitly specify the namespace in the query, you can do something like this:
doc.xpath("//pub:Programs", "pub" => "http://publisher.webservices.affili.net/")
Nokogiri makes this a little easier for namespaces declared on the root element (as in this case), declaring them for you with the same prefix. It will also declare the default namespace using the xmlns prefix, so you can also do:
doc.xpath("//xmlns:Programs")
which will give you the same result.
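The explicit-prefix idea is not specific to Nokogiri. As a rough illustration, here is the same kind of query with REXML (swapped in because it ships with Ruby; it is not what the question uses). The prefix pub is arbitrary, as in the answer above, and the document is a trimmed copy of the one in the question. REXML's handling of unprefixed names may differ from Nokogiri's, so only the explicit-prefix form is shown.

```ruby
require 'rexml/document'

# Trimmed copy of the question's document; the default namespace is what matters.
xml = <<~XML
  <ProgramList xmlns='http://publisher.webservices.affili.net/'>
    <TotalRecords>145</TotalRecords>
    <Programs>
      <ProgramSummary><ProgramID>6540</ProgramID><Title>Matalan</Title></ProgramSummary>
      <ProgramSummary><ProgramID>11787</ProgramID><Title>Club 18-30</Title></ProgramSummary>
    </Programs>
  </ProgramList>
XML

doc = REXML::Document.new(xml)

# Bind a prefix of our own choosing to the namespace URI and use it in every step.
ns = { 'pub' => 'http://publisher.webservices.affili.net/' }
titles = REXML::XPath.match(doc, '//pub:ProgramSummary/pub:Title', ns).map(&:text)

puts titles
```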

How do I get the input value from a Nokogiri::XML::NodeSet?

I am looking for my input element using Nokogiri's xpath method.
It's returning an object of class Nokogiri::XML::NodeSet:
[#<Nokogiri::XML::Element:0x3fcc0e07de14 name="input" attributes=[#<Nokogiri::XML::Attr:0x3fcc0e07dba8 name="type" value="text">, #<Nokogiri::XML::Attr:0x3fcc0e07db94 name="name" value="creditInstallmentAmount">, #<Nokogiri::XML::Attr:0x3fcc0e07db44 name="style" value="width:240px">, #<Nokogiri::XML::Attr:0x3fcc0e07dae0 name="value" value="94.8">, #<Nokogiri::XML::Attr:0x3fcc0e07da18 name="readonly" value="true">]>
Is there a faster and cleaner way to get the value of input than casting this using to_s:
"<input type=\"text\" name=\"creditInstallmentAmount\" style=\"width:240px\" value=\"94.8\" readonly>"
and match with regular expressions?
A couple things will help:
Nokogiri has the at method, which is the equivalent of search(...).first, and, instead of returning a NodeSet, it returns the Node itself, making it easy to grab values from it:
require 'nokogiri'
doc = Nokogiri::HTML('<input type="text" name="creditInstallmentAmount" style="width:240px" value="94.8" readonly>')
doc.at('input')['value'] # => "94.8"
doc.at('input')['value'].to_f # => 94.8
Also, notice I'm using CSS notation, instead of XPath. Nokogiri supports both, and a lot of times the CSS is more obvious and easily readable. The at_css method is an alias to at for convenience.
Note that Nokogiri uses a little test in search and at to try to determine whether the selector is CSS or XPath, and then branches accordingly to the specific method. The test can be fooled, at which point you should use the specific CSS or XPath variant, or always use them if you're paranoid. In years of using Nokogiri I've only once encountered the situation where the code was confused.
If you want to be more explicit about which input you want, you can match on the tag's attributes:
doc.at('input[@name="creditInstallmentAmount"]')['value'] # => "94.8"
Get familiar with the difference between search and at and their variants, and Nokogiri will really become useful to you. Learn how to access attributes and text() nodes and you'll know 99% of what you need to know for parsing HTML and XML.
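The same pattern (take the first matching node, then read the attribute instead of parsing to_s with a regular expression) carries over to other parsers. A small sketch with REXML, used here only because it ships with Ruby; the wrapping form element and the readonly="true" value are assumptions added so that the snippet is well-formed XML:

```ruby
require 'rexml/document'

# Well-formed variant of the input from the question (wrapper element is an assumption).
html = '<form><input type="text" name="creditInstallmentAmount" ' \
       'style="width:240px" value="94.8" readonly="true"/></form>'
doc = REXML::Document.new(html)

# XPath.first returns a single element rather than a list, much like Nokogiri's at.
input = REXML::XPath.first(doc, '//input[@name="creditInstallmentAmount"]')
amount = input.attributes['value'].to_f

puts amount
```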
Ok, I found the answer:
.map{|node| node["value"]}.first
Ok, this works for me
require 'nokogiri'
require 'open-uri'
html = URI.open(ARGV[0]) # URI.open is required on Ruby 3.0+, where plain open no longer fetches URLs
doc = Nokogiri::HTML(html)
inputs = doc.search 'input'
inputs.map{|node| node['name']}
or all in one
inputs = Nokogiri::HTML(html).search('input').map{|node| node['name']}

What's the xpath syntax to get tag names?

I'm using Nokogiri to parse a large XML file. Say I've got the following structure:
<menagerie>
<penguin>Pablo</penguin>
<penguin>Mortimer</penguin>
<bull>Ferdinand</bull>
<aardvark>James Cornelius Madison Humphrey Zophar Handlebrush III</aardvark>
</menagerie>
I can count the non-penguins like this:
xml.xpath('//menagerie//*[not(self::penguin)]').length # => 2
But how do I get a list of the tags, like this? (The exact format isn't important; I just want to visually scan the non-penguins.)
bull
aardvark
Update
This gave me the list I wanted - thanks Oded and TMN and delnan!
xml.xpath('//menagerie/*[not(self::penguin)]').each do |node|
puts node.name
end
You can use the name() or local-name() XPath function.
See the examples on zvon.
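I know it's a bit outdated, but you can also put the function call in the expression itself: xml.xpath('//menagerie/*[not(self::penguin)]/name()'). Note the slash, not the dot: this is how XPath 2.0 applies a function to each selected node. XPath 1.0 does not allow a function call as a path step, and Nokogiri's XPath engine (libxml2) implements XPath 1.0, so there you still need to loop over the nodes and read node.name.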
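The name() function mentioned above can also do the filtering inside a predicate, which works in XPath 1.0. A minimal sketch, using REXML as a stand-in for Nokogiri so that it runs with a plain Ruby install; the variable names are illustrative:

```ruby
require 'rexml/document'

xml = <<~XML
  <menagerie>
    <penguin>Pablo</penguin>
    <penguin>Mortimer</penguin>
    <bull>Ferdinand</bull>
    <aardvark>James Cornelius Madison Humphrey Zophar Handlebrush III</aardvark>
  </menagerie>
XML

doc = REXML::Document.new(xml)

# name() inside the predicate tests each child's own tag name,
# whereas a bare not(penguin) would test for a penguin child element.
non_penguins = REXML::XPath.match(doc, "/menagerie/*[name() != 'penguin']")
tag_names = non_penguins.map(&:name)

puts tag_names
```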

Any way to strip namespace garbage from XML file?

I need to select some nodes from an XML file (AppNamespace.xaml from a Silverlight XAP file, not that it matters), but the file has namespace stuff so XPath doesn't work. I could waste most of a day trial-and-erroring the bondage-and-discipline nightmare of XmlNamespaceManager and end up with hopelessly fragile code that can't tolerate the slightest variation in the input file (not a great idea in production code), or I could use the ludicrous local-name() syntax[1].
But it would be more convenient to use XPath as a human-readable query language that can be used to return specified nodes or attribute values from arbitrary XML files.
So is there any way to strip the line-noise out of the file? Or am I stuck? Is the labyrinthine imbecility of Linq-to-XML truly the lesser evil?
[1]
//*[local-name() = 'Deployment']/*[local-name() = 'Deployment.Parts']/*[local-name() = 'AssemblyPart']/@*[local-name()='Name']
Update
Five years down the road, I stand behind the term "labyrinthine imbecility" with every fiber of my being, except for a few fibers that want to use something much stronger.
Ed, here's an example of using namespaces with the System.Xml.XPath Extensions class. I've modified it to match the input you're looking at:
string markup = @"
<Deployment xmlns=""http://schemas.microsoft.com/client/2007/deployment""
xmlns:x=""http://schemas.microsoft.com/winfx/2006/xaml"" ...>
<Deployment.Parts>
<AssemblyPart x:Name=""xamlName"" Source=""assembly"" />
</Deployment.Parts>
</Deployment>
";
XmlReader reader = XmlReader.Create(new StringReader(markup));
XElement root = XElement.Load(reader);
XmlNameTable nameTable = reader.NameTable;
XmlNamespaceManager nsm = new XmlNamespaceManager(nameTable);
nsm.AddNamespace("x", "http://schemas.microsoft.com/winfx/2006/xaml");
nsm.AddNamespace("dep", "http://schemas.microsoft.com/client/2007/deployment");
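// XPathSelectElements only returns elements, so use XPathEvaluate to select the attribute
// (needs using System.Collections; and using System.Linq;).
IEnumerable<XAttribute> names =
((IEnumerable)root.XPathEvaluate("//dep:Deployment/dep:Deployment.Parts/dep:AssemblyPart/@x:Name", nsm)).Cast<XAttribute>();
foreach (XAttribute name in names)
Console.WriteLine(name);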
Not very complicated. Obviously you already know about XmlNamespaceManager, but I think you got a worse impression of it than it deserves.
When you say "hopelessly fragile code that can't tolerate the slightest variation in the input file", are you blaming namespaces in general, or XmlNamespaceManager? I don't see how either one makes it fragile... any more so than XML processing code without namespaces will not tolerate certain changes in the input document, but will tolerate others.
Have a little respect for other intelligent people in the industry, take a little time to understand the advantages behind a design before you dismiss it, and you will usually find that there are good reasons for what was done.
Not that XML namespaces couldn't be improved upon. However nobody has managed to produce a better standard and get it accepted by the community.
In XPath 2.0 you can use namespace wildcards (if you know what you are doing):
//*:Deployment/*:Deployment.Parts/*:AssemblyPart/@Name
btw. If an attribute doesn't have a prefix it is in no namespace at all. As this is most often the case, I guess, you don't need local-name() for the attribute.
I came here as the result of a search, and I am adding an "Answer" to cheer on your "5 years on" update.
I was motivated to do this because I have an XML document that uses a tonne of namespaces -
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="urn:schemas-microsoft-com:office:excel2" version="1.0" exclude-result-prefixes="msxsl">
and APPARENTLY I have to know what all those namespaces are in advance in order to hard code the XmlNamespaceManager, or write some code that parses the namespace declarations and adds the relevant name spaces myself. Why in the name of all that is holy does the XmlDocument not manage to do that all by itself?
XmlDocument databaseXml = new XmlDocument();
databaseXml.LoadXml(xslt.XslTransform);
var dbnsmgr = new XmlNamespaceManager(databaseXml.NameTable);
dbnsmgr.AddNamespace("xsl", "http://www.w3.org/1999/XSL/Transform");
dbnsmgr.AddNamespace("ss", "urn:schemas-microsoft-com:office:spreadsheet");
XmlElement databaseStylesElement = (XmlElement)databaseXml.DocumentElement.SelectSingleNode("/xsl:stylesheet/xsl:template", dbnsmgr);

XPath concat multiple nodes

I'm not very familiar with xpath. But I was working with xpath expressions and setting them in a database. Actually it's just the BAM tool for biztalk.
Anyway, I have an xml which could look like:
<File>
<Element1>element1</Element1>
<Element2>element2</Element2>
<Element3>
<SubElement>sub1</SubElement>
<SubElement>sub2</SubElement>
<SubElement>sub3</SubElement>
</Element3>
</File>
I was wondering if there is a way to use an XPath expression to get all the SubElements concatenated. At the moment, I am using:
/*[local-name()='File']/*[local-name()='Element3']/*[local-name()='SubElement']
This works if there is only one SubElement. But apparently my XML sometimes has more nodes, in which case it gives NULL. I could just use
/*[local-name()='File']/*[local-name()='Element3']/*[local-name()='SubElement'][1]
but I need all the nodes. Is there a way to do this?
Thanks a lot!
Edit: I changed the XML, I was wrong, it's different, it should look like this:
<item>
<element1>el1</element1>
<element2>el2</element2>
<element3>el3</element3>
<element4>
<subEl1>subel1a</subEl1>
<subEl2>subel2a</subEl2>
</element4>
<element4>
<subEl1>subel1b</subEl1>
<subEl2>subel2b</subEl2>
</element4>
</item>
And I need a one-line expression that gives a result like: "subel2a subel2b".
It has to be one line because I set this XPath expression as an XML attribute (not my choice, it's specified). I tried string-join but it's not really working:
string-join(/file/Element3/SubElement, ',')
/File/Element3/SubElement will match all of the SubElement elements in your sample XML. What are you using to evaluate it?
If your evaluation method is subject to the "first node rule", then it will only match the first one. If you are using a method that returns a nodeset, then it will return all of them.
You can get all SubElements by using:
//SubElement
But this won't keep them grouped together how you want. You will want to do a query for all elements that contain a SubElement (basically do a search for the parent of any SubElements).
//*[SubElement]
Once you have that, you could (depending on your programming language) loop through the parents and concatenate the SubElements.
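string-join is an XPath 2.0 function, so with an XPath 1.0 engine the concatenation has to happen in the host language, as suggested above. A minimal Ruby sketch of that idea, using REXML and the edited sample from the question; it assumes you are free to post-process the node set, which may not hold when the expression must live in a single attribute:

```ruby
require 'rexml/document'

# Edited sample from the question, compacted.
xml = <<~XML
  <item>
    <element1>el1</element1>
    <element4><subEl1>subel1a</subEl1><subEl2>subel2a</subEl2></element4>
    <element4><subEl1>subel1b</subEl1><subEl2>subel2b</subEl2></element4>
  </item>
XML

doc = REXML::Document.new(xml)

# Select every subEl2 in document order, then join their text in Ruby.
joined = REXML::XPath.match(doc, '/item/element4/subEl2').map(&:text).join(' ')

puts joined
```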

Resources