How to get XmlInputParser work with self-closing XML tags? - hadoop

I am trying to load XML files from SE data dump* into HDFS using MapReduce. These XML files consist of a number of <row> elements (enclosed in a top-level "category"), like so:
<badges>
<row Id="1" UserId="1" Name="Organizer" Date="2009-07-15T06:51:46.370" />
<row Id="2" UserId="3" Name="Organizer" Date="2009-07-15T06:51:46.387" />
<row Id="4" UserId="1" Name="Autobiographer" Date="2009-07-15T06:51:46.447" />
...
</badges>
I want each "row" to be processed by a separate map() function, and have configured org.apache.mahout.classifier.bayes.XmlInputFormat's start and end tags as below:
Configuration config = new Configuration();
config.set(XmlInputFormat.START_TAG_KEY, "<row>");
config.set(XmlInputFormat.END_TAG_KEY, "</row>");
However, this fails to parse the XML file, because the <row> element is self-closing. How do I get this to work, without artificially "closing" the self-closing tags?
Linking to SE blog rather than directly to the data dump, to prevent dead link in case location changes in future.

This is a somewhat ugly hack. Change the START_TAG_KEY and END_TAG_KEY as below:
config.set(XmlInputFormat.START_TAG_KEY, "<row");
config.set(XmlInputFormat.END_TAG_KEY, "/>");
The "keys" are being used like delimiters, and accept any string, rather than just XML tags. Not a "clean" solution, and may stop working on future implementations, but it gets the work done now.
Note: I figured it out while I was midway through posting the question. This seems rather obvious in hindsight, but I decided to go ahead with the post anyway, so that someone may find it useful in future.

Related

Freemarker Dynamically reference tag outside my fo:table-body

I can't show you the whole code, but the following are basically the steps I'm taking to generate dynamic code inside my fo-table-body tag.
At a stage inside <fo:table-body>, I want to be able to reference the block named "ref" and change the value inside out if. Is this possible ?
<#assign value="Hello World"/>
<fo:block name"ref">
<fo:inline font-weight="bold">Value: </fo:inline>
<fo:inline>${Value}</fo:inline>
</fo:block>
<fo:table-body start-indent="0pt">
// All sorts of data inside the tags
<fo:table-row>
<fo:table-cell></fo:table-cell>
</fo:table-row>
</fo:table-body>
FreeMarker templates continuously write to the output as they are executed, so if you have already printed a piece of output, then it's not in the hands of FreeMarker anymore. (It might be still sitting in some buffer behind the Writer, but FreeMarker is not aware of that.) What you can do though is generating the dependency part (fo:table-body) first, but capturing it instead of printing it, like <#assign tableBody><fo:table-body...>...</fo:table-body></#assign>, then generate the dependent part (fo:block) as usual, then print the captured part (${tableBody}, or <#noescape>${tableBody}</#noescape>, depending on what kind of auto-escaping you use).

how to read text in child elements if the parent element's name has dots in freemarker

I have an xml document that I would like to parse using freemarker. The XML document itself was auto generated using SAX in my smooks script. This smooks script created the following XML with element names derived from the actual java package names that I have in my workspace.
<map>
<entry>
<string>RunReportMsg</string>
<com.web.ws.messages.v1__2.RunReportMsg>
<analyticsReport>
<columns>
<com.web.ws.objects.v1__2.ReportColumn>
<dataType>
<id>
<id>10</id>
</id>
</dataType>
</com.web.ws.objects.v1__2.ReportColumn>
</columns>
<analyticsReport>
</com.web.ws.messages.v1__2.RunReportMsg>
</entry>
</map>
A similar question has been posted on this site about this. But I cannot figure out how this would solve my problem.
Access XML elements with names containing a period/dot in FreeMarker templates
I know how to access "RunReportMsg" text in the element "string".
${map.entry.string}
How do I access data in the following child element using dotted notation in freemarker? As the element "com.web.ws.messages.v1__2.RunReportMsg" has multiple periods, I am not sure how to traverse down through further child elements. I need a way to find out the number in the following "id" element.
<id>10</id>
I read the documentation on expressions in freemarker site on ".vars". I am not sure if this applies to my case.
Any help is deeply appreciated.
You can use this syntax:
${map.entry["com.web.ws.messages.v1__2.RunReportMsg"].analyticsReport.columns["com.web.ws.objects.v1__2.ReportColumn"].dataType.id.id}

Xpath - How to select subnode where sibling-node contains certain text

I want to use XPath to select the sub tree containing the <name>-tag with "ABC" and not the other one from the following xml. Is this possible? And as a minor question, which keywords would I use to find something like that over Google (e.g. for selecting the sub tree by an attribute I would have the terminology for)?
<root>
<operation>
<name>ABC</name>
<description>Description 1</description>
</operation>
<operation>
<name>DEF</name>
<description>Description 2</description>
</operation>
</root>
Use:
/*/operation[name='ABC']
For your second question: I strongly recommend not to rely on online sources (there are some that aren't so good) but to read a good book on XPath.
See some resources listed here:
https://stackoverflow.com/questions/339930/any-good-xslt-tutorial-book-blog-site-online/341589#341589
For your first question, I think a more accurate way to do it would be://operation[./name[text()='ABC']].And according to this , we can also make it://operation[./name[text()[.='ABC']]]

XPath-REXML-Ruby: Selecting multiple siblings/ancestors/descendants

This is my first post here. I have just started working with Ruby and am using REXML for some XML handling. I present a small sample of my xml file here:
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige vertoning van de grote en merk-waardige zons-verduistering</title>
</titleInfo>
</metadata>
</record>
My objective is to match the last numerical value in the tag with a list of values that I have from an array. I have achieved this with the following code snippet:
ids = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']")
Having got a particular identifier that I wish to investigate, now I want to go back to and select and then select to get the value in the node for that particular identifier.
I have looked at the XPath tutorials and expressions and many of the related questions on this website as well and learnt about axes and the different concepts such as ancestor/following sibling etc. However, I am really confused and cannot figure this out easily.
I was wondering if I could get any help or if someone could point me towards an online resource "easy" to read.
Thank you.
UPDATE:
I have been trying various combinations of code such as:
idss = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']/parent::header/following-sibling::metadata/child::mods/child::titleInfo/child::title")
The code compiles but does not output anything. I am wondering what I am doing so wrong.
Here's a way to accomplish it using XPath, then going up to the record, then XPath to get the title:
require 'rexml/document'
include REXML
xml=<<END
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige</title>
</titleInfo>
</metadata>
</record>
END
doc=Document.new(xml)
mapid = "ct000379"
text = "oai:lcoa1.loc.gov:loc.gmd/g3195.#{mapid}"
identifier_nodes = XPath.match(doc, "//identifier[text()='#{text}']")
record_node = identifier_nodes.first.parent.parent
record_node.elements['metadata/titleInfo/title'].text
=> "Meet-konstig"

Any way to strip namespace garbage from XML file?

I need to select some nodes from an XML file (AppNamespace.xaml from a Silverlight XAP file, not that it matters), but the file has namespace stuff so XPath doesn't work. I could waste most of a day trial-and-erroring the bondage-and-discipline nightmare of XmlNamespaceManager and end up with hopelessly fragile code that can't tolerate the slightest variation in the input file (not a great idea in production code), or I could use the ludicrous local-name() syntax[1].
But it would be more convenient to use XPath as a human-readable query language that can be used to return specified nodes or attribute values from arbitrary XML files.
So is there any way to strip the line-noise out of the file? Or am I stuck? Is the labyrinthine imbecility of Linq-to-XML truly the lesser evil?
[1]
//*[local-name() = 'Deployment']/*[local-name() = 'Deployment.Parts']/*[local-name() = 'AssemblyPart']/#*[local-name()='Name']
Update
Five years down the road, I stand behind the term "labyrinthine imbecility" with every fiber of my being, except for a few fibers that want to use something much stronger.
Ed, here's an example of using namespaces with the System.Xml.XPath Extensions class. I've modified it to match the input you're looking at:
string markup = #"
<Deployment xmlns="http://schemas.microsoft.com/client/2007/deployment"
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" ...>
<Deployment.Parts>
<AssemblyPart x:Name="xamlName" Source="assembly" />
</Deployment.Parts>
</Deployment>
";
XmlReader reader = XmlReader.Create(new StringReader(markup));
XElement root = XElement.Load(reader);
XmlNameTable nameTable = reader.NameTable;
XmlNamespaceManager namespaceManager = new XmlNamespaceManager(nameTable);
nsm.AddNamespace("x", "http://schemas.microsoft.com/winfx/2006/xaml");
nsm.AddNamespace("dep", "http://schemas.microsoft.com/client/2007/deployment");
IEnumerable<XElement> elements =
root.XPathSelectElements("//dep:Deployment/dep:Deployment.Parts/dep:AssemblyPart/#x:Name", nsm);
foreach (XElement el in elements)
Console.WriteLine(el);
Not very complicated. Obviously you already know about XmlNamespaceManager, but I think you got a worse impression of it than it deserves.
When you say "hopelessly fragile code that can't tolerate the slightest variation in the input file", are you blaming namespaces in general, or XmlNamespaceManager? I don't see how either one makes it fragile... any more so than XML processing code without namespaces will not tolerate certain changes in the input document, but will tolerate others.
Have a little respect for other intelligent people in the industry, take a little time to understand the advantages behind a design before you dismiss it, and you will usually find that there are good reasons for what was done.
Not that XML namespaces couldn't be improved upon. However nobody has managed to produce a better standard and get it accepted by the community.
In XPath 2.0 you can use namespace wildcards (if you know what you are doing):
//*:Deployment/*:Deployment.Parts/*:AssemblyPart/#Name
btw. If an attribute doesn't have a prefix it is in no namespace at all. As this is most often the case, I guess, you don't need local-name() for the attribute.
I came here as a result of this search:
and I am adding an "Answer" to cheer on your "5 years on" update.
I was motivated to do this because I have an XML document that uses a tonne of namespaces -
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="urn:schemas-microsoft-com:office:excel2" version="1.0" exclude-result-prefixes="msxsl">
and APPARENTLY I have to know what all those namespaces are in advance in order to hard code the XmlNamespaceManager, or write some code that parses the namespace declarations and adds the relevant name spaces myself. Why in the name of all that is holy does the XmlDocument not manage to do that all by itself?
XmlDocument databaseXml = new XmlDocument();
databaseXml.LoadXml(xslt.XslTransform);
var dbnsmgr = new XmlNamespaceManager(databaseXml.NameTable);
dbnsmgr.AddNamespace("xsl", "http://www.w3.org/1999/XSL/Transform");
dbnsmgr.AddNamespace("ss", "urn:schemas-microsoft-com:office:spreadsheet");
XmlElement databaseStylesElement = (XmlElement)database
Xml.DocumentElement.SelectSingleNode("/xsl:stylesheet/xsl:template");

Resources