Nokogiri not parsing XML in ruby - xmlns issue? - ruby

Given the following ruby code :
require 'nokogiri'
xml = "<?xml version='1.0' encoding='UTF-8'?>
<ProgramList xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='http://publisher.webservices.affili.net/'>
<TotalRecords>145</TotalRecords>
<Programs>
<ProgramSummary>
<ProgramID>6540</ProgramID>
<Title>Matalan</Title>
<Limitations>A bit of text
</Limitations>
<URL>http://www.matalan.co.uk</URL>
<ScreenshotURL>http://www.matalan.co.uk/</ScreenshotURL>
<LaunchDate>2009-11-02T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
<ProgramSummary>
<ProgramID>11787</ProgramID>
<Title>Club 18-30</Title>
<Limitations/>
<URL>http://www.club18-30.com/</URL>
<ScreenshotURL>http://www.club18-30.com</ScreenshotURL>
<LaunchDate>2013-05-16T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
</Programs>
</ProgramList>"
doc = Nokogiri::XML(xml)
p doc.xpath("//Programs")
gives :
=> []
Not what is expected.
On further investigation if I remove xmlns='http://publisher.webservices.affili.net/' from the initial <ProgramList> tag I get the expected output.
Indeed if I change xmlns='http://publisher.webservices.affili.net/' to xmlns:anything='http://publisher.webservices.affili.net/' I get the expected output.
So my question is what is going on here? Is this malformed XML? And what is the best strategy for dealing with it?
While it's hardcoded in this example the XML is (will be) coming from a web service.
Update
I realise I can use the remove_namespaces! method but the Nokogiri docs do say that it's "...probably is not a good thing in general" to do this. Also I'm interested in why it's happening and what the 'correct' XML should be.

The xmlns='http://publisher.webservices.affili.net/' indicates the default namespace for all elements under the one where it appears (including the element itself). That means that all elements that don’t otherwise have an explicit namespace fall under this namespace.
XPath queries don’t have default namespaces (at least in XPath 1.0), so any name that appears in one without a prefix refers to that element in no namespace.
In your code, you want to find Program elements in the http://publisher.webservices.affili.net/ namespace (since that is the default namespace), but are looking (in your XPath query) for Program elements in no namespace.
To explicitly specify the namespace in the query, you can do something like this:
doc.xpath("//pub:Programs", "pub" => "http://publisher.webservices.affili.net/")
Nokogiri makes this a little easier for namespaces declared on the root element (as in this case), declaring them for you with the same prefix. It will also declare the default namespace using the xmlns prefix, so you can also do:
doc.xpath("//xmlns:Programs")
which will give you the same result.

Related

Xpath Expression evaluation on attributes with any namespace prefix

Could you please help me on this xpath expression evaluation
I am working on fetching the proxy references. In the xml file the references will get stored as:
One way of XML file will have the reference as below:
con1:service ref="MyProject/ProxyServices/service1"
xsi:type="con2:PipelineRef" xmlns:ref="http://www.bea.com/wli/sb/reference"/
here in the xml file the name spaces are:
xmlns:con1="http://www.bea.com/wli/sb/stages/config"
xmlns:con2="http://www.bea.com/wli/sb/pipeline/config"
Another way of XML will have the reference as below.
con1:service ref="MyProject/ProxyServices/service2"
xsi:type="ref:ProxyRef" xmlns:ref="http://www.bea.com/wli/sb/reference"/
here in the xml file the name spaces are:
xmlns:con1="http://www.bea.com/wli/sb/stages/config"
xmlns:ref="http://www.bea.com/wli/sb/reference"
I have used this xpath expression, this is not fetching the reference service values, could you please help what is wrong in it.
"//service[#type= #*[local-name() ='ProxyRef' or #type=#*[local-name() ='PipelineRef']]/#ref"
when I used like this it is working but, name space prefix is keep on changes when there are multiple references in the xml file.
"//service[#type='ref:ProxyRef'or #type='con:PipelineRef' or #type='con1:PipelineRef' or #type='con2:PipelineRef' or #type='con3:PipelineRef' ...#type='con20:PipelineRef' ]/#ref";
Now here basically the type attribute PipelineRef is keep on changing the name space prefix from con to con(n). Now I am looking for something which supports some thing like #type='*:PipelineRef' or #type='con*:PipelineRef' or the best way to fetch the service element reference attribute value.
Thanks in advance.
Try using contains() like so :
//service[contains(#type,':ProxyRef') or contains(#type,':PipelineRef')]
Another alternative would be using ends-with() function which is more precise for this purpose compared to contains() function. However, ends-with() isn't available in xpath 1.0, so there is a chance that you need to implement it yourself (feasible, but the xpath result is less intuitive for me).

Xpath - How to navigate to a value (Ruby Nokogiri)

If I want to grab a currencies rate, say "USD", given a certain time, say "2015-02-09", how would I go about doing this?
I tried the following:
/gesmes:Envelope/def:Cube/def:Cube[#time="2014-11-19"]/def:Cube[#currency="USD"]/#rate
Though I suppose due a lack of understanding this is wrong, well at least, I know it is wrong because Nokogiri does not run it.
http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml
EDIT:
I'm going to go ahead and guess that I am not correctly using Nokogiri and XPath.
#doc = Nokogiri::XML(File.open("exchange_data.xml"))
#values = #doc.xpath('XPATH HERE')
#values.each {|i| puts i}
I have read the tutorial, and managed to get it working for other xml files, but this one seems harder to crack.
require 'nokogiri'
doc = Nokogiri::XML(File.open("xml4.xml"))
target_date = "2015-02-09"
target_currency = 'USD'
xpaths = [
"//gesmes:Envelope",
"/xmlns:Cube",
"/xmlns:Cube[#time='#{target_date}']",
"/xmlns:Cube[#currency='#{target_currency}']",
]
xpath = xpaths.join
target_cube = doc.at_xpath(xpath)
puts target_cube.attribute('rate')
--output:--
1.1297
Response to comment:
Your root tag:
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
...declares two namespaces with xmlns, which stands for xml namespace. The namespace:
xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
declares that any child tag whose name is prefixed by gesmes, e.g.:
<gesmes:subject>
...
</gesmes:subject>
will actually have a tag name that incorporates the specified url into the tag name, something like this:
<http://www.gesmes.org/xml/2002-08-01:subject>
...
</http://www.gesmes.org/xml/2002-08-01:subject>
The reason you would want to use a namespace is to create a unique name for the Cube tag, so that it doesn't clash with another xml document's Cube tag.
The second namespace declaration:
xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref"
is a default namespace declaration. It declares that any child tag that does not specify a prefix will have the specified url incorporated into its tag name. So a tag like this:
<Cube>
...
</Cube>
becomes something like this:
<http://www.ecb.int/vocabulary/2002-08-01/eurofxref:Cube>
...
</http://www.ecb.int/vocabulary/2002-08-01/eurofxref:Cube>
However, it would be unwieldy to have to write a tag name like that in your xpaths, so in place of the url you instead use the shortcut xmlns:
/xmlns:Cube
This might be due to the namespaces in this document:
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
To test this hypothesis, apply the following XPath expression:
/*[local-name() = 'Envelope']/*[local-name() = 'Cube']/*[local-name() = 'Cube'][#time="2014-11-19"]/*[local-name() = 'Cube'][#currency="USD"]/#rate
and let me know what you get. If you are otherwise correctly using XPath, you should end up with:
rate="1.2535"
If not, you are not using the XPath facilities of Nokogiri correctly, and then you'd really need to show all of your Ruby code to get help.
EDIT
Responding to a comment:
I look forward to seeing some examples added to your answer, so that I can learn something new about xml namespaces. – 7stud
7stud already gave the correct answer, I'll only add info I think is missing from this answer.
Explicit namespaces
First of all, if a namespace URI is explicitly present on an element, the correct syntax uses curly brackets, both for a prefixed and default namespace:
<{http://www.gesmes.org/xml/2002-08-01}subject>
Internally, this is how namespaces could be represented on elements (although some applications have other ways to associate elements with namespaces). Prefixes and default namespaces are there to simplify this process.
Namespaces in Nokogiri
Prefixes (gesmes:) do not have any inherent meaning. They can be associated with an arbitrary namespace URI and every document can use gesmes: to mean something different. Namespace declarations are not available to an XPath engine per se - usually, if you'd like to use a prefix in an XPath expression, you need to declare this namespace again for the XPath processor.
Yet, Nokogiri tries to simplify namespace handling for you by redeclaring namespace declarations found on the root element of the input document. This is important because it allows you to reuse the prefixes declared on the root element of the input without actually declaring the namespace. For default namespaces declared on the root element that do not have a prefix, Nokogiri has defined a special syntax:
xmlns:Cube
Namespaces that are present in the document, but declared on an element other than the root element:
<root>
<child xmlns:gesmes="http://other.com"/>
</root>
must be explicitly declared in Nokogiri:
#doc.xpath('//other:Cube', 'other' => 'http://other.com/')
What's wrong with your original code?
Your code:
/gesmes:Envelope/def:Cube/def:Cube[#time="2014-11-19"]/def:Cube[#currency="USD"]/#rate
does not work because you are using an unknown prefix def:. This prefix is not declared on the root element of the input, and neither did you declare it with Nokogiri. The Cube elements are in the default namespace, and, as we have seen, the correct way to address them is
/gesmes:Envelope/xmlns:Cube
and so on, 7stud gave you the correct answer.

How to build an Object from XML, modify, then write to File in Ruby

I'm having an awful time trying to use a library to parse an XML File into a hash like object, modify it, then print it back out to another XML file in Ruby. For a class I'm taking, we're supposed to use a Java JAXB like library where we convert XML into an object. We've already done SAX and DOM methods so we can't use those methods of XML de-serialization. Nokogiri helped me with both of these in Ruby.
The only problem is that besides the SIMPLE modifications I'm making to the objects, when I write to file it has drastic differences. Is there a Ruby library meant for doing just this? I've tried: ROXML, XML::Mapping, and ActiveSupport::CoreExt. The only one I can get to even run is ActiveSupport, and even then it starts putting element attributes as child elements in the output XML.
I'm willing to try out XmlSimple, but I'm curious has anyone actually had to do this before/run into the same problems? Again, I can't read in lines one at a time like SAX or build a Tree like structure like DOM, it needs to be a hash like object.
Any help is much appreciated!
You should have a look into nokogiri: http://nokogiri.org/
Then you can parse the XML like this :
xml_file = "some_path"
#xml = Nokogiri::XML(File.open xml_file)
#xml.xpath('//listing').each do |node|
style = node.search("style").text
end
With Xpath, you can perform queries in the XML :
#xml.xpath("//listing[name='John']").first(10)
OK, I got it working. After looking at ActiveSupport::CoreExt 's source code I found it just uses a gem called xml-simple. What's obnoxious is the gem, library name in the require statement, and class name are a mixture of hyphenated and non hyphenated spellings. For future reference here's what I did:
# gem install xml-simple
# ^ all lowercase, hyphenated
require 'xmlsimple'
# ^ all lowercase, not hyphenated
doc = XmlSimple.xml_in 'hw3.xml', 'KeepRoot' => true
# ^ Camel cased (it's a class), not hyphenated
# doc.class => Hash
# manipulate doc as a hash
file = File.new('HW3a.xml', 'w')
file.write("<?xml version='1.0' encoding='utf-8'?>\n")
file.write(XmlSimple.xml_out doc, 'KeepRoot' => true)
I hope this helps someone. Also make sure you pay attention to case and hyphens with this gem!!!

Select default namespace in XPath with HtmlUnit

I want to parse a Feedburner feed with HtmlUnit.
The feed is this one: http://feeds.feedburner.com/alcoanewsreleases
From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.
groovy code snippet:
def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")
Sample of the XML feed:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
[...SNIP...]
<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&pageID=20110518006002en">
<title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
<dc:date>2011-05-18</dc:date
<link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
<description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
<feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&pageID=20100104006194en</feedburner:origLink>
</item>
[...SNIP...]
</rdf:RDF>
I suspect this to be an issue with the namespaces because this document has 4 namespaces. The namespaces are
(this is the default) xmlns="http://purl.org/rss/1.0/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"
I have tried to use Nokogiri with this (another XML Parser that I use for ruby scripts).
With Nokogiri I could just us the XPath //xmlns:item which works and returns all nodes from the feed.
I have tried the same XPath with HtmlUnit but it does not work.
So I think I can phrase my question as:
How can I select a node from the default namespace with HtmlUnit?
Any ideas?
From this feed I want to read all item
nodes, so normally a //item XPath
should do the trick. Unfortunately
that does not work in this case.
In XPath, that means "select all elements whose local name is item that are in no namespace". In RSS, the item elements must be in a namespace. So the above should never work with a conforming XML parser and XPath engine.
What's confusing is that in XML, <item> means "an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document;" whereas in XPath, "item" means an element in no namespace. (Or, you could say, it means an element in the default namespace, but unless you have a way to tell XPath what the default namespace is, the default namespace is no namespace. Usually (always?) in XPath 1.0 there is no way to declare the default namespace for XPath expressions.)
The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below...
With Nokogiri I could just us the
XPath //xmlns:item which works and
returns all nodes from the feed.
Whatever that is, it's not XPath. Maybe it's a Nokogiri extension to it (a very convenient one, but its syntax is really counter-intuitive).
So I think I can phrase my question
as: How can I select a node from the
default namespace with HtmlUnit?
Let's phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the "rss" namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.
It's kind of like asking, "How do I write a function (in Javascript, C, Java, etc.) that can tell me the value of the variable a?" Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you'll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you'd think that compiler was nuts. It's not supposed to care about variable names as long as you use valid identifiers.
In the same way, when processing RSS, we're not supposed to care about what namespace prefix is used, as long as it's a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).
In XPath 2.0, you can wildcard the namespace. This is very handy if you know you're not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don't think HTMLUnit supports XPath 2.0. Also in XPath 2.0 environments like XSLT 2.0, you can specify a default namespace for XPath expressions, but that won't help you in HTMLUnit.
So you have a couple of choices:
Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].
or
The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HTMLUnit and pass it to the XPath processor? I took a quick look in the docs and didn't find any facility for doing that.
Caveat: I should add that the above is in regard to conforming XPath processors. I have no idea what XPath processor HTMLUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.
I saw here that someone used the following syntax for elements in the default namespace in HTMLUnit:
//:item
But I wouldn't recommend that, for three reasons:
It's not valid XPath, so you can't expect it to work with other programs.
It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.
It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don't adequately support namespaces.
HTMLUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing ways to declare namespace prefixes is a bug. HTMLUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don't know if HTMLUnit exposes that functionality.
This sounds familiar enough that I'm quite sure I've used namespaces and XPath successfully with HtmlUnit in the past, but of course I can't find the code. I suspect it must have been with HTML pages only: the page reference in your example is an XmlPage which has a number of methods specific to namespaces, all of which throw a "not implemented yet" exception when used. :-(
The current version (2.8) of HtmlUnit is nearly a year old, so it may be that some work has been done in the meantime to support XML namespaces. The "HtmlUnit Users" mailing list would be the place to find out.
In the meantime, as always there is a workaround:
final XmlPage page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases");
// no good
List elements = page.getByXPath("//item");
System.out.println( elements.size() ) ;
// ugly, but it works
DomElement de = (DomElement)page.getFirstByXPath( "//rdf:RDF" );
List<DomNode> items = new ArrayList<DomNode>() ;
for( DomNode dn : de.getChildNodes() )
{
String name = dn.getLocalName() ;
if( ( name != null ) && ( name.equals( "item" ) ) )
items.add( dn ) ;
}
System.out.println( "found " + items.size() ) ;
Oh boy Java is painful after working in Scala... ;-)

Traverse xml structure to determine if a certain text node exists

Alright I have an xml document that looks something like this:
<xml>
<list>
<partner>
<name>Some Name</name>
<status>active</status>
<id>0</id>
</partner>
<partner>
<name>Another Name</name>
<status>active</status>
<id>1</id>
</partner>
</list>
</xml>
I am using ruby's lib-xml to parse it.
I want to find if there is a partner with the name 'Some Name' in a quick and ruby idiomatic way.
How can I do this in one line or ruby code, assuming i have a the document parsed in a variable named document.. Such that i can call document.find(xpath) to retrieve nodes. I have had to do this multiple times in slightly different scenarios and now its starting to bug me.
I know i can do the following (but its ugly)
found = false
document.find('//partner/name').each do |name|
if (name.content == 'Some Name')
found = true
break
end
end
assert(found, "Some Name should have been found")
but i find this really ugly. I thought about using the enumeration include? mixin method but that still won't work because I need to get the .content field of each node as opposed to the actual node...
While writing this, I though of this (but it seems somewhat inefficient albeit elegant)
found = document.find('//partner/name').collect{|name| name.content}.member?("Some Name")
Are there any other ways of doing this?
What about this?
found = document.find("//partner[name='Some Name']").empty?
I tried this solution:
found = document.find("//partner[name='Some Name']") != nil
but I got an error saying the xpath expression was invalid.
However, i was reading some xpath documentation it it looks like you can call a text() function in the expression to get the text node. I tried the following and it appears to work:
found = document.find("//partner/name/text()='Some Name'")
found actually is not a xml node but a true/false object so this works.
I would use a language that natively operates on XML (XQuery for example). With XQuery it is possible to formulate this sort of queries over xml data in a concise and elegant way.

Resources