Xpath - How to navigate to a value (Ruby Nokogiri) - ruby

If I want to grab a currency's rate, say "USD", for a given date, say "2015-02-09", how would I go about doing this?
I tried the following:
/gesmes:Envelope/def:Cube/def:Cube[@time="2014-11-19"]/def:Cube[@currency="USD"]/@rate
I suppose this is wrong due to a lack of understanding on my part; at least, I know it is wrong because Nokogiri does not run it.
http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml
EDIT:
I'm going to go ahead and guess that I am not correctly using Nokogiri and XPath.
@doc = Nokogiri::XML(File.open("exchange_data.xml"))
@values = @doc.xpath('XPATH HERE')
@values.each {|i| puts i}
I have read the tutorial, and managed to get it working for other xml files, but this one seems harder to crack.

require 'nokogiri'
doc = Nokogiri::XML(File.open("xml4.xml"))
target_date = "2015-02-09"
target_currency = 'USD'
xpaths = [
"//gesmes:Envelope",
"/xmlns:Cube",
"/xmlns:Cube[@time='#{target_date}']",
"/xmlns:Cube[@currency='#{target_currency}']",
]
xpath = xpaths.join
target_cube = doc.at_xpath(xpath)
puts target_cube.attribute('rate')
--output:--
1.1297
Response to comment:
Your root tag:
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
...declares two namespaces with xmlns, which stands for xml namespace. The namespace:
xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
declares that any child tag whose name is prefixed by gesmes, e.g.:
<gesmes:subject>
...
</gesmes:subject>
will actually have a tag name that incorporates the specified url into the tag name, something like this:
<http://www.gesmes.org/xml/2002-08-01:subject>
...
</http://www.gesmes.org/xml/2002-08-01:subject>
The reason you would want to use a namespace is to create a unique name for the Cube tag, so that it doesn't clash with another xml document's Cube tag.
The second namespace declaration:
xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref"
is a default namespace declaration. It declares that any child tag that does not specify a prefix will have the specified url incorporated into its tag name. So a tag like this:
<Cube>
...
</Cube>
becomes something like this:
<http://www.ecb.int/vocabulary/2002-08-01/eurofxref:Cube>
...
</http://www.ecb.int/vocabulary/2002-08-01/eurofxref:Cube>
However, it would be unwieldy to have to write a tag name like that in your xpaths, so in place of the url you instead use the shortcut xmlns:
/xmlns:Cube

This might be due to the namespaces in this document:
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
To test this hypothesis, apply the following XPath expression:
/*[local-name() = 'Envelope']/*[local-name() = 'Cube']/*[local-name() = 'Cube'][@time="2014-11-19"]/*[local-name() = 'Cube'][@currency="USD"]/@rate
and let me know what you get. If you are otherwise correctly using XPath, you should end up with:
rate="1.2535"
If not, you are not using the XPath facilities of Nokogiri correctly, and then you'd really need to show all of your Ruby code to get help.
EDIT
Responding to a comment:
I look forward to seeing some examples added to your answer, so that I can learn something new about xml namespaces. – 7stud
7stud already gave the correct answer, I'll only add info I think is missing from this answer.
Explicit namespaces
First of all, if a namespace URI is explicitly present on an element, the correct syntax uses curly brackets, both for a prefixed and default namespace:
<{http://www.gesmes.org/xml/2002-08-01}subject>
Internally, this is how namespaces could be represented on elements (although some applications have other ways to associate elements with namespaces). Prefixes and default namespaces are there to simplify this process.
Namespaces in Nokogiri
Prefixes (gesmes:) do not have any inherent meaning. They can be associated with an arbitrary namespace URI and every document can use gesmes: to mean something different. Namespace declarations are not available to an XPath engine per se - usually, if you'd like to use a prefix in an XPath expression, you need to declare this namespace again for the XPath processor.
Yet, Nokogiri tries to simplify namespace handling for you by redeclaring namespace declarations found on the root element of the input document. This is important because it allows you to reuse the prefixes declared on the root element of the input without actually declaring the namespace. For default namespaces declared on the root element that do not have a prefix, Nokogiri has defined a special syntax:
xmlns:Cube
Namespaces that are present in the document, but declared on an element other than the root element:
<root>
<child xmlns:gesmes="http://other.com"/>
</root>
must be explicitly declared in Nokogiri:
@doc.xpath('//other:Cube', 'other' => 'http://other.com')
What's wrong with your original code?
Your code:
/gesmes:Envelope/def:Cube/def:Cube[@time="2014-11-19"]/def:Cube[@currency="USD"]/@rate
does not work because you are using an unknown prefix def:. This prefix is not declared on the root element of the input, and neither did you declare it with Nokogiri. The Cube elements are in the default namespace, and, as we have seen, the correct way to address them is
/gesmes:Envelope/xmlns:Cube
and so on. 7stud gave you the correct answer.

Related

Xpath Expression evaluation on attributes with any namespace prefix

Could you please help me with this XPath expression evaluation?
I am working on fetching the proxy references. In the XML file, the references are stored in one of two forms.
The first form looks like this:
<con1:service ref="MyProject/ProxyServices/service1"
xsi:type="con2:PipelineRef" xmlns:ref="http://www.bea.com/wli/sb/reference"/>
Here the namespaces in the XML file are:
xmlns:con1="http://www.bea.com/wli/sb/stages/config"
xmlns:con2="http://www.bea.com/wli/sb/pipeline/config"
The second form looks like this:
<con1:service ref="MyProject/ProxyServices/service2"
xsi:type="ref:ProxyRef" xmlns:ref="http://www.bea.com/wli/sb/reference"/>
Here the namespaces in the XML file are:
xmlns:con1="http://www.bea.com/wli/sb/stages/config"
xmlns:ref="http://www.bea.com/wli/sb/reference"
I have used this XPath expression, but it does not fetch the reference service values. Could you please help me find what is wrong with it?
"//service[@type= @*[local-name() ='ProxyRef' or @type=@*[local-name() ='PipelineRef']]/@ref"
When I write it like this it works, but the namespace prefix keeps changing when there are multiple references in the XML file:
"//service[@type='ref:ProxyRef'or @type='con:PipelineRef' or @type='con1:PipelineRef' or @type='con2:PipelineRef' or @type='con3:PipelineRef' ...@type='con20:PipelineRef' ]/@ref";
Basically, the namespace prefix on the PipelineRef type attribute keeps changing from con to con(n). I am looking for something like @type='*:PipelineRef' or @type='con*:PipelineRef', or for the best way to fetch the ref attribute value of the service element.
Thanks in advance.
Try using contains() like so:
//service[contains(@type,':ProxyRef') or contains(@type,':PipelineRef')]
An alternative would be the ends-with() function, which is more precise for this purpose than contains(). However, ends-with() isn't available in XPath 1.0, so you may need to implement it yourself (feasible, but the resulting XPath is less intuitive to me).

Nokogiri not parsing XML in ruby - xmlns issue?

Given the following ruby code :
require 'nokogiri'
xml = "<?xml version='1.0' encoding='UTF-8'?>
<ProgramList xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='http://publisher.webservices.affili.net/'>
<TotalRecords>145</TotalRecords>
<Programs>
<ProgramSummary>
<ProgramID>6540</ProgramID>
<Title>Matalan</Title>
<Limitations>A bit of text
</Limitations>
<URL>http://www.matalan.co.uk</URL>
<ScreenshotURL>http://www.matalan.co.uk/</ScreenshotURL>
<LaunchDate>2009-11-02T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
<ProgramSummary>
<ProgramID>11787</ProgramID>
<Title>Club 18-30</Title>
<Limitations/>
<URL>http://www.club18-30.com/</URL>
<ScreenshotURL>http://www.club18-30.com</ScreenshotURL>
<LaunchDate>2013-05-16T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
</Programs>
</ProgramList>"
doc = Nokogiri::XML(xml)
p doc.xpath("//Programs")
gives :
=> []
Not what is expected.
On further investigation if I remove xmlns='http://publisher.webservices.affili.net/' from the initial <ProgramList> tag I get the expected output.
Indeed if I change xmlns='http://publisher.webservices.affili.net/' to xmlns:anything='http://publisher.webservices.affili.net/' I get the expected output.
So my question is what is going on here? Is this malformed XML? And what is the best strategy for dealing with it?
While it's hardcoded in this example the XML is (will be) coming from a web service.
Update
I realise I can use the remove_namespaces! method but the Nokogiri docs do say that it's "...probably is not a good thing in general" to do this. Also I'm interested in why it's happening and what the 'correct' XML should be.
The xmlns='http://publisher.webservices.affili.net/' indicates the default namespace for all elements under the one where it appears (including the element itself). That means that all elements that don’t otherwise have an explicit namespace fall under this namespace.
XPath queries don’t have default namespaces (at least in XPath 1.0), so any name that appears in one without a prefix refers to that element in no namespace.
In your code, you want to find Programs elements in the http://publisher.webservices.affili.net/ namespace (since that is the default namespace), but your XPath query looks for Programs elements in no namespace.
To explicitly specify the namespace in the query, you can do something like this:
doc.xpath("//pub:Programs", "pub" => "http://publisher.webservices.affili.net/")
Nokogiri makes this a little easier for namespaces declared on the root element (as in this case), declaring them for you with the same prefix. It will also declare the default namespace using the xmlns prefix, so you can also do:
doc.xpath("//xmlns:Programs")
which will give you the same result.

Extract a specific node from an XML file

I want to extract only the body node/tag from an XML file using doc.xpath in Ruby
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts the text of every node.
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But that does not work.
Does anyone have any suggestions on how to extract exactly the body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
is extracting every element with name wcm:element that is a child of the wcm:root element which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements with name Body which are children of the wcm:element.
What you need to do is get the value of the wcm:element element whose name attribute is set to Body. You access attributes in XPath by prefixing them with an @ sign, and to express a "where" condition you use [...], a predicate. Your XPath statement needs to be:
/wcm:root/wcm:element[@name = 'Body']
I'm assuming that your XPath execution environment is fine with the namespace prefix (wcm), because you say that your first query returned content.

Select default namespace in XPath with HtmlUnit

I want to parse a Feedburner feed with HtmlUnit.
The feed is this one: http://feeds.feedburner.com/alcoanewsreleases
From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.
groovy code snippet:
def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")
Sample of the XML feed:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
[...SNIP...]
<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&pageID=20110518006002en">
<title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
<dc:date>2011-05-18</dc:date>
<link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
<description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
<feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&pageID=20100104006194en</feedburner:origLink>
</item>
[...SNIP...]
</rdf:RDF>
I suspect this to be an issue with the namespaces because this document has 4 namespaces. The namespaces are
(this is the default) xmlns="http://purl.org/rss/1.0/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"
I have tried this with Nokogiri (another XML parser, which I use for Ruby scripts).
With Nokogiri I could just use the XPath //xmlns:item, which works and returns all item nodes from the feed.
I have tried the same XPath with HtmlUnit but it does not work.
So I think I can phrase my question as:
How can I select a node from the default namespace with HtmlUnit?
Any ideas?
"From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case."
In XPath, that means "select all elements whose local name is item that are in no namespace". In RSS, the item elements must be in a namespace. So the above should never work with a conforming XML parser and XPath engine.
What's confusing is that in XML, <item> means "an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document;" whereas in XPath, "item" means an element in no namespace. (Or, you could say, it means an element in the default namespace, but unless you have a way to tell XPath what the default namespace is, the default namespace is no namespace. Usually (always?) in XPath 1.0 there is no way to declare the default namespace for XPath expressions.)
The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below...
"With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed."
Whatever that is, it's not XPath. Maybe it's a Nokogiri extension to it (a very convenient one, but its syntax is really counter-intuitive).
"So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?"
Let's phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the "rss" namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.
It's kind of like asking, "How do I write a function (in Javascript, C, Java, etc.) that can tell me the value of the variable a?" Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you'll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you'd think that compiler was nuts. It's not supposed to care about variable names as long as you use valid identifiers.
In the same way, when processing RSS, we're not supposed to care about what namespace prefix is used, as long as it's a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).
In XPath 2.0, you can wildcard the namespace. This is very handy if you know you're not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don't think HTMLUnit supports XPath 2.0. Also in XPath 2.0 environments like XSLT 2.0, you can specify a default namespace for XPath expressions, but that won't help you in HTMLUnit.
So you have a couple of choices:
Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].
or
The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HTMLUnit and pass it to the XPath processor? I took a quick look in the docs and didn't find any facility for doing that.
Caveat: I should add that the above is in regard to conforming XPath processors. I have no idea what XPath processor HTMLUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.
I saw here that someone used the following syntax for elements in the default namespace in HTMLUnit:
//:item
But I wouldn't recommend that, for three reasons:
It's not valid XPath, so you can't expect it to work with other programs.
It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.
It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don't adequately support namespaces.
HTMLUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing ways to declare namespace prefixes is a bug. HTMLUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don't know if HTMLUnit exposes that functionality.
This sounds familiar enough that I'm quite sure I've used namespaces and XPath successfully with HtmlUnit in the past, but of course I can't find the code. I suspect it must have been with HTML pages only: the page reference in your example is an XmlPage which has a number of methods specific to namespaces, all of which throw a "not implemented yet" exception when used. :-(
The current version (2.8) of HtmlUnit is nearly a year old, so it may be that some work has been done in the meantime to support XML namespaces. The "HtmlUnit Users" mailing list would be the place to find out.
In the meantime, as always there is a workaround:
final XmlPage page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases");
// no good
List<?> elements = page.getByXPath("//item");
System.out.println( elements.size() ) ;
// ugly, but it works
DomElement de = (DomElement)page.getFirstByXPath( "//rdf:RDF" );
List<DomNode> items = new ArrayList<DomNode>() ;
for( DomNode dn : de.getChildNodes() )
{
    String name = dn.getLocalName() ;
    if( ( name != null ) && ( name.equals( "item" ) ) )
        items.add( dn ) ;
}
System.out.println( "found " + items.size() ) ;
Oh boy Java is painful after working in Scala... ;-)

Problem running xpath query with namespaces

I'm trying to use an xpath expression to select a node-set in an xml document with different namespaces defined.
The xml looks something like this:
<?POSTEN SND="SE00317644000" REC="5566420989" MSGTYPE="EPIX"?>
<ns:Msg xmlns:ns="http://www.noventus.se/epix1/genericheader.xsd">
<GenericHeader>
<SubsysId>1</SubsysId>
<SubsysType>30003</SubsysType>
<SendDateTime>2009-08-13T14:28:15</SendDateTime>
</GenericHeader>
<m:OrderStatus xmlns:m="http://www.noventus.se/epix1/orderstatus.xsd">
<Header>
<OrderSystemId>Soda SE</OrderSystemId>
<OrderNo>20090811</OrderNo>
<Status>0</Status>
</Header>
<Lines>...
I want to select only "Msg" nodes that have the "OrderStatus" child, so I want to use the following XPath expression: /Msg[count('OrderStatus') > 0]. But this won't work; I get an error message saying: "Namespace Manager or XsltContext needed. This query has a prefix, variable, or user-defined function".
So I think I want to use an expression that looks something like this: /*[local-name()='Msg'][count('OrderStatus') > 0], but that doesn't seem to work. Any ideas?
Br,
Andreas
"I want to use the following xpath expression: /Msg[count('OrderStatus')[ 0] but this won't work since I get an error message saying: 'Namespace Manager or XsltContext needed.'"
This is a FAQ.
In XPath an unprefixed name is always considered to belong to "no namespace".
However, the elements you want to select are in fact in the "http://www.noventus.se/epix1/genericheader.xsd" namespace.
You have two possible ways to write your XPath expression:
1. Use the facilities of the hosting language to associate prefixes with all the namespaces to which names in the expression belong. You haven't indicated what the hosting language is in this concrete case, so I can't help you with this. A C# example can be found here.
If you have associated the prefix "xxx" with the namespace "http://www.noventus.se/epix1/genericheader.xsd" and the prefix "yyy" with the namespace "http://www.noventus.se/epix1/orderstatus.xsd", then your expression can be written as:
/xxx:Msg[yyy:OrderStatus]
2. If you don't want to use any prefixes at all, an XPath expression can still be constructed, although it will not be very readable:
/*[local-name() = 'Msg' and *[local-name() = 'OrderStatus']]
Finally, do note:
In order to test if an element x has a child y it isn't necessary to test for a positive count(y). Just use: x[y]
Xpath positions are 1-based. This means that NodeSetExpression[0] never selects a node. You want: NodeSetExpression[1]

Resources