XPath-REXML-Ruby: Selecting multiple siblings/ancestors/descendants - ruby

This is my first post here. I have just started working with Ruby and am using REXML for some XML handling. I present a small sample of my xml file here:
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige vertoning van de grote en merk-waardige zons-verduistering</title>
</titleInfo>
</metadata>
</record>
My objective is to match the last numerical value in the <identifier> tag against a list of values that I have in an array. I have achieved this with the following code snippet:
ids = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']")
Having got a particular identifier that I wish to investigate, I now want to go back up to <header>, select its sibling <metadata>, and then select <titleInfo> to get the value in the <title> node for that particular identifier.
I have looked at the XPath tutorials and expressions and many of the related questions on this website as well and learnt about axes and the different concepts such as ancestor/following sibling etc. However, I am really confused and cannot figure this out easily.
I was wondering if I could get any help or if someone could point me towards an online resource "easy" to read.
Thank you.
UPDATE:
I have been trying various combinations of code such as:
idss = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']/parent::header/following-sibling::metadata/child::mods/child::titleInfo/child::title")
The code runs but does not output anything. I am wondering what I am doing wrong.

Here's a way to accomplish it using XPath, then going up to the record, then XPath to get the title:
require 'rexml/document'
include REXML
xml=<<END
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige</title>
</titleInfo>
</metadata>
</record>
END
doc=Document.new(xml)
mapid = "ct000379"
text = "oai:lcoa1.loc.gov:loc.gmd/g3195.#{mapid}"
identifier_nodes = XPath.match(doc, "//identifier[text()='#{text}']")
record_node = identifier_nodes.first.parent.parent
record_node.elements['metadata/titleInfo/title'].text
=> "Meet-konstige"
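The expression in the question's UPDATE includes a child::mods step, but the sample as posted has no <mods> element, which would explain the empty result for that sample. A minimal sketch, assuming the structure really is the one shown (the helper name title_for is made up for illustration), keeps the same axes and drops that step:

```ruby
require 'rexml/document'

# Hypothetical helper: returns the <title> text of the record whose
# <identifier> has exactly the given text, or nil when nothing matches.
def title_for(doc, identifier)
  # Walk up to <header>, across to its sibling <metadata>, then down.
  path = "//identifier[text()='#{identifier}']" \
         "/parent::header/following-sibling::metadata/titleInfo/title"
  node = REXML::XPath.first(doc, path)
  node && node.text
end

xml = <<~XML
  <record>
    <header>
      <identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
    </header>
    <metadata>
      <titleInfo>
        <title>Meet-konstige</title>
      </titleInfo>
    </metadata>
  </record>
XML

doc = REXML::Document.new(xml)
puts title_for(doc, "oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379")
```

If the real file does wrap <titleInfo> in another element, that step has to go back into the path in the right place.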

Related

How to get XmlInputParser work with self-closing XML tags?

I am trying to load XML files from SE data dump* into HDFS using MapReduce. These XML files consist of a number of <row> elements (enclosed in a top-level "category"), like so:
<badges>
<row Id="1" UserId="1" Name="Organizer" Date="2009-07-15T06:51:46.370" />
<row Id="2" UserId="3" Name="Organizer" Date="2009-07-15T06:51:46.387" />
<row Id="4" UserId="1" Name="Autobiographer" Date="2009-07-15T06:51:46.447" />
...
</badges>
I want each "row" to be processed by a separate map() function, and have configured org.apache.mahout.classifier.bayes.XmlInputFormat's start and end tags as below:
Configuration config = new Configuration();
config.set(XmlInputFormat.START_TAG_KEY, "<row>");
config.set(XmlInputFormat.END_TAG_KEY, "</row>");
However, this fails to parse the XML file, because the <row> element is self-closing. How do I get this to work, without artificially "closing" the self-closing tags?
* Linking to the SE blog rather than directly to the data dump, to prevent a dead link in case the location changes in future.
This is a somewhat ugly hack. Change the START_TAG_KEY and END_TAG_KEY as below:
config.set(XmlInputFormat.START_TAG_KEY, "<row");
config.set(XmlInputFormat.END_TAG_KEY, "/>");
The "keys" are being used like delimiters, and accept any string, rather than just XML tags. Not a "clean" solution, and may stop working on future implementations, but it gets the work done now.
Note: I figured it out while I was midway through posting the question. This seems rather obvious in hindsight, but I decided to go ahead with the post anyway, so that someone may find it useful in future.
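To see why treating the keys as plain delimiters helps, here is a small plain-Ruby sketch of the same idea. It is not Mahout's XmlInputFormat and makes no claim about its internals; split_records is a made-up function that simply cuts a string between a start marker and the next end marker.

```ruby
# Illustration only: delimiter-style record splitting on a string.
# split_records is a hypothetical helper, not part of Mahout or Hadoop.
def split_records(data, start_tag, end_tag)
  records = []
  pos = 0
  # Find each start marker, then the first end marker after it.
  while (s = data.index(start_tag, pos))
    e = data.index(end_tag, s + start_tag.length)
    break if e.nil?
    records << data[s...(e + end_tag.length)]
    pos = e + end_tag.length
  end
  records
end

xml = <<~XML
  <badges>
    <row Id="1" UserId="1" Name="Organizer" Date="2009-07-15T06:51:46.370" />
    <row Id="2" UserId="3" Name="Organizer" Date="2009-07-15T06:51:46.387" />
  </badges>
XML

# "<row>" never occurs literally, so nothing is found.
puts split_records(xml, "<row>", "</row>").length
# "<row" ... "/>" brackets each self-closing element.
puts split_records(xml, "<row", "/>").length
```

The same sketch shows the weakness of the hack: an attribute value containing "/>" or an element whose name merely starts with "row" would be split incorrectly.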

XPath expression using idref

I am reading about and testing XQuery, and as test tools I use BaseX (www.basex.org) and Saxon-HE 9.4.0.6N.
For the following simple XML file - no schema attached to the sample.xml:
<rootab>
<l1>
<items p="a">
<itema x1="10" id="abc">testa</itema>
<itemb x1="10" id="dfe">testb</itemb>
<itemc x1="10" id="jgh">testc</itemc>
</items>
</l1>
<l2>
<items p="b">
<itema x1="10" xidref="abc">testa</itema>
<itemc x1="10" xidref="jgh">testc</itemc>
<itemd x1="10" xidref="abc">testA101</itemd>
<iteme x1="10" xidref="jgh">testB202</iteme>
</items>
</l2>
</rootab>
In the BaseX GUI, if I enter the following XPath expression: //idref("abc")/..
the result is: <itema x1="10" xidref="abc">testa</itema>
In the BaseX GUI, if I enter this simple XQuery expression:
for $x in doc('sample.xml')//idref("abc")/..
return <aaa>{$x}</aaa>
the result is:
<aaa>
<itema x1="10" xidref="abc">testa</itema>
</aaa>
<aaa>
<itemd x1="10" xidref="abc">testA101</itemd>
</aaa>
q1) Why did the XPath expression return only one node? I expected two...
In Saxon, by using the below xql file:
<test>
{
doc('sample.xml')//idref("abc")/..
}
</test>
or the XQuery expression above, I get the same result from running the command query sample.xql:
<?xml version="1.0" encoding="UTF-8"?><test/>
q2) What is wrong in my Saxon test?
thank you in advance for your help!
Basically, idref() is sensitive to DTD validation - it recognizes attributes declared as type IDREF in your DTD.
You haven't shown us your DTD, and more importantly, you haven't shown how the input to the queries is supplied. There are many ways of constructing input in which the "IDREF-ness" of an attribute is lost - for example, going via a DOM. Even when you use the doc() function in Saxon, the way the input tree is built depends on many factors including configuration options and your URIResolver.
I see you are using .NET. When Saxon uses the Microsoft XML parser on .NET, it doesn't know which attributes are IDs and IDREFs, so the id() and idref() functions don't work (the MS parser simply doesn't supply this information). You therefore need to use the JAXP parser (Xerces) that comes with the Saxon product. I think this is the default these days.
So not really an answer, but hopefully some background that explains some of the things that can go wrong.

Xpath - How to select subnode where sibling-node contains certain text

I want to use XPath to select the sub tree containing the <name>-tag with "ABC" and not the other one from the following xml. Is this possible? And as a minor question, which keywords would I use to find something like that over Google (e.g. for selecting the sub tree by an attribute I would have the terminology for)?
<root>
<operation>
<name>ABC</name>
<description>Description 1</description>
</operation>
<operation>
<name>DEF</name>
<description>Description 2</description>
</operation>
</root>
Use:
/*/operation[name='ABC']
For your second question: I strongly recommend not to rely on online sources (there are some that aren't so good) but to read a good book on XPath.
See some resources listed here:
https://stackoverflow.com/questions/339930/any-good-xslt-tutorial-book-blog-site-online/341589#341589
For your first question, I think a more accurate way to do it would be: //operation[./name[text()='ABC']]. And according to this, we can also write it as: //operation[./name[text()[.='ABC']]]
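For anyone who wants to try the predicate from Ruby, here is a minimal sketch using REXML. The helper name description_for is invented for illustration, and the sketch assumes the sample document from the question.

```ruby
require 'rexml/document'

# Hypothetical helper: finds the <operation> whose <name> child equals
# the given string and returns its <description> text (nil if none).
def description_for(doc, name)
  op = REXML::XPath.first(doc, "/*/operation[name='#{name}']")
  op && op.elements['description'].text
end

xml = <<~XML
  <root>
    <operation>
      <name>ABC</name>
      <description>Description 1</description>
    </operation>
    <operation>
      <name>DEF</name>
      <description>Description 2</description>
    </operation>
  </root>
XML

doc = REXML::Document.new(xml)
puts description_for(doc, "ABC")
```

The predicate filters the <operation> elements themselves, so the whole subtree comes back, not just the <name> node.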

REXML - Having trouble asserting a value is present in XML response

I got help here this morning using REXML and the answer helped me understand more about it. However I've encountered another problem and can't seem to figure it out.
My response is like this:
<?xml version="1.0"?>
<file>
<link a:size="2056833" a:mimeType="video/x-flv" a:bitrate="1150000.0" a:height="240" a:width="320" rel="content.alternate"> https://link.com</link>
</file>
So, what I want to do is assert that a:mimeType is video/x-flv
Here's what I have tried:
xmlDoc = REXML::Document.new(xml)
assert_equal xmlDoc.elements().to_a("file/link['@a:mimeType']").first.text, 'video/x-flv'
and also:
assert xmlDoc.elements().to_a("file/link['@a:mimeType']").include? 'video/x-flv'
and various combinations. I actually get lots of these links back but I only really care if one of them has this mimeType. Also, some of the links don't have mimeType.
Any help greatly appreciated.
Thanks,
Adrian
text retrieves the text content of an element (the text between its tags). You want to access an attribute instead. Try
xmlDoc.elements().to_a("file/link").first.attributes['a:mimeType']
To see if either of the links has the correct mimeType, you can convert the array of elements
into an array of mimeType attributes and check if it contains the right value:
xmlDoc.elements().to_a("file/link").map { | elem | elem.attributes['a:mimeType'] }.include? 'video/x-flv'
UPDATE
Or much simpler, check whether there is an element whose mimeType attribute has the right value:
xmlDoc.elements().to_a("file/link[@a:mimeType='video/x-flv']") != []
Thanks for teaching me something about XPath ;-)
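A self-contained sketch of the attribute check, for reference. Two things here are assumptions that do not appear in the question: the helper name link_with_mime_type? and the xmlns:a declaration with a placeholder URI, added so that the a: prefix is declared when the sample is parsed.

```ruby
require 'rexml/document'

# Hypothetical helper: true if any <link> under <file> carries the given
# a:mimeType value. Links without the attribute yield nil and never match.
def link_with_mime_type?(doc, mime_type)
  doc.elements.to_a("file/link").any? do |link|
    link.attributes['a:mimeType'] == mime_type
  end
end

# The namespace URI below is a placeholder, not taken from the question.
xml = <<~XML
  <?xml version="1.0"?>
  <file xmlns:a="http://example.com/a">
    <link a:size="2056833" a:mimeType="video/x-flv" rel="content.alternate"> https://link.com</link>
    <link rel="content.alternate"> https://link.com/other</link>
  </file>
XML

doc = REXML::Document.new(xml)
puts link_with_mime_type?(doc, 'video/x-flv')
```

In a test this could be wrapped as assert link_with_mime_type?(xmlDoc, 'video/x-flv').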

Traverse xml structure to determine if a certain text node exists

Alright I have an xml document that looks something like this:
<xml>
<list>
<partner>
<name>Some Name</name>
<status>active</status>
<id>0</id>
</partner>
<partner>
<name>Another Name</name>
<status>active</status>
<id>1</id>
</partner>
</list>
</xml>
I am using Ruby's libxml to parse it.
I want to find out whether there is a partner with the name 'Some Name' in a quick and idiomatic Ruby way.
How can I do this in one line of Ruby code, assuming I have the document parsed in a variable named document, such that I can call document.find(xpath) to retrieve nodes? I have had to do this multiple times in slightly different scenarios and now it's starting to bug me.
I know i can do the following (but its ugly)
found = false
document.find('//partner/name').each do |name|
if (name.content == 'Some Name')
found = true
break
end
end
assert(found, "Some Name should have been found")
but I find this really ugly. I thought about using the Enumerable include? method, but that still won't work because I need the .content field of each node as opposed to the actual node...
While writing this, I thought of this (it seems somewhat inefficient, albeit elegant):
found = document.find('//partner/name').collect{|name| name.content}.member?("Some Name")
Are there any other ways of doing this?
What about this?
found = !document.find("//partner[name='Some Name']").empty?
I tried this solution:
found = document.find("//partner[name='Some Name']") != nil
but I got an error saying the xpath expression was invalid.
However, I was reading some XPath documentation and it looks like you can call the text() function in the expression to get the text node. I tried the following and it appears to work:
found = document.find("//partner/name/text()='Some Name'")
Here found is not an XML node but a true/false value, so this works.
I would use a language that natively operates on XML (XQuery for example). With XQuery it is possible to formulate this sort of queries over xml data in a concise and elegant way.
