I am trying to parse a document using Dom4J. This document comes from various providers, and sometimes comes with namespaces and sometimes without.
For example:
<book>
<author>john</author>
<publisher>
<name>John Q</name>
</publisher>
</book>
or
<book xmlns="http://schemas.xml.com/XMLSchemaInstance">
<author>john</author>
<publisher>
<name>John Q</name>
</publisher>
</book>
or
<book xmlns:i="http://schemas.xml.com/XMLSchemaInstance">
<i:author>john</i:author>
<i:publisher>
<i:name>John Q</i:name>
</i:publisher>
</book>
I have a list of XPaths. I parse the document into a Document class, and then search on it using the xpaths.
Document doc = parseDocument(documentFile);
List<String> xmlPaths = new ArrayList<String>();
xmlPaths.add("book/author");
xmlPaths.add("book/publisher/name");
for (String searchPath : xmlPaths) {
    Node currentNode = doc.selectSingleNode(searchPath);
    assert currentNode != null;
}
This code does not work on the last document, the one that is using namespace prefixes.
I tried these techniques, but none of them seem to work.
1) changing the last element in the xpath to be namespace neutral:
/book/:author
/book/[local-name()='author']
/[local-name()='book']/[local-name()='author']
All of these throw an exception saying that the XPath format is not correct.
2) Adding namespace URIs to the XPath, after creating it using DocumentHelper.createXPath();
Any idea what I am doing wrong?
FYI I am using dom4j version 1.5
Your XPath does not contain a tag name. The general syntax in your case would be
/TAGNAMEPARENT[CONDITION_PARENT]/TAGNAMECHILD[CONDITION_CHILD]
The important aspect is that the tag names are mandatory while the conditions are optional. If you do not want to specify a tag name, you have to use * for "any tag". There may be performance implications for large XML files, since you will always have to iterate over a node set instead of using an index lookup. Maybe @MichaelKay can comment on this.
Try this instead:
/*[local-name()='book']/*[local-name()='author']
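To illustrate the namespace-neutral path, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath API rather than dom4j, so the exact calls differ from the question's code. The class name, the constants, and the evaluate helper are made up for this example; the three documents are the ones from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LocalNameXPathDemo {
    static final String NS = "http://schemas.xml.com/XMLSchemaInstance";

    // The three variants from the question: no namespace, default namespace, prefixed namespace.
    static final String PLAIN =
        "<book><author>john</author><publisher><name>John Q</name></publisher></book>";
    static final String DEFAULT_NS =
        "<book xmlns=\"" + NS + "\"><author>john</author>"
        + "<publisher><name>John Q</name></publisher></book>";
    static final String PREFIXED =
        "<book xmlns:i=\"" + NS + "\"><i:author>john</i:author>"
        + "<i:publisher><i:name>John Q</i:name></i:publisher></book>";

    // Every step needs a node test; * plus a local-name() predicate ignores the namespace.
    static final String NEUTRAL_PATH = "/*[local-name()='book']/*[local-name()='author']";

    // Returns the string value of the first node matched by the expression ("" if none).
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // resolve prefixes and default namespaces
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String xml : new String[] {PLAIN, DEFAULT_NS, PREFIXED}) {
            System.out.println(evaluate(xml, NEUTRAL_PATH));
        }
    }
}
```

With a namespace-aware parse, the plain path /book/author only matches the first document, because an unprefixed name in XPath 1.0 always means "no namespace".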
In XQuery 3.1 (under eXist-DB 4.7) I receive XML data like this, with no namespace:
<edit-request id="TC9999">
<title-collection>foocolltitle</title-collection>
<title-exempla>fooextitle</title-exempla>
<title-short>fooshorttitle</title-short>
</edit-request>
This is assigned to a variable $content and this statement:
let $collid := $content/edit-request/@id
...correctly returns: TC9999
Now, I need to actually transform all the data in $content into a TEI xml document.
I first need to get some info from an existing TEI file, so I assigned another variable:
let $oldcontent := doc(concat($globalvar:URIdata,$collid,"/",$collid,".xml"))
And then I create the new TEI document, referring to both $content and $oldcontent:
let $xml := <listBibl xmlns="http://www.tei-c.org/ns/1.0"
type="collection"
xml:id="{$collid}">
<bibl>
<idno type="old_sql_id">{$oldcontent//tei:idno[@type="old_sql_id"]/text()}</idno>
<title type="collection">{$content//title-exempla/text()}</title>
</bibl>
</listBibl>
The references to the TEI namespace in $oldcontent come through, but to my surprise the references to $content (no namespace) don't show up:
<listBibl xmlns="http://www.tei-c.org/ns/1.0"
type="collection"
xml:id="TC9999">
<bibl>
<idno type="old_sql_id">1</idno>
<title type="collection"/>
</bibl>
</listBibl>
The question is: how do I refer to the non-namespace elements in $content in the context of let $xml=...?
NB: the XQuery document has a declaration at the top (as it is the principal namespace of virtually all the documents):
declare namespace tei = "http://www.tei-c.org/ns/1.0";
In essence you are asking how to write an XPath expression to select nodes in an empty namespace in a context where the default element namespace is non-empty. One of the most direct solutions is to use the "URI plus local-name syntax" for writing QNames. Here is an example:
xquery version "3.1";
let $x := <x><y>Jbrehr</y></x>
return
<p xmlns="foo">Hey there,
{ $x/Q{}y => string() }!</p>
If instead of $x/Q{}y the example had used the more common form of the path expression, $x/y, its result would have been an empty sequence, since the local name y used to select the <y> element specifies no namespace and thus inherits the foo element namespace from its context. By using the "URI plus local-name syntax", though, we are able to specify the empty namespace we are looking for.
For more information on this, see the XPath 3.1 specification's discussion of expanded QNames: https://www.w3.org/TR/xpath-31/#doc-xpath31-EQName.
I have one really large input file which is an XML data.
When I put it in HDFS, HDFS blocks are created and the XML records are divided among those blocks, so a record may straddle a block boundary. The typical TextInputFormat handles this by skipping the first line of a split if the split does not begin at the start of a line; the previous mapper then reads (over RPC) from this block until the end of its last record.
How can we handle this scenario in the XML case? I don't want to use WholeFileInputFormat, as that would prevent me from exploiting parallelism.
<books>
<book>
<author>Test</author>
<title>Hadoop Recipes</title>
<ISBN>04567GHFR</ISBN>
</book>
<book>
<author>Test</author>
<title>Hadoop Data</title>
<ISBN>04567ABCD</ISBN>
</book>
<book>
<author>Test1</author>
<title>C++</title>
<ISBN>FTYU9876</ISBN>
</book>
<book>
<author>Test1</author>
<title>Baby Tips</title>
<ISBN>ANBMKO09</ISBN>
</book>
</books>
The initialize function of the XMLRecordReader looks like this:
public void initialize(InputSplit arg0, TaskAttemptContext arg1)
        throws IOException, InterruptedException {
    Configuration conf = arg1.getConfiguration();
    FileSplit split = (FileSplit) arg0;
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    fsin = fs.open(file);
    fsin.seek(start);
    DocumentBuilder db = null;
    try {
        db = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder();
    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    Document doc = null;
    try {
        doc = db.parse(fsin);
    } catch (SAXException e) {
        e.printStackTrace();
    }
    NodeList nodes = doc.getElementsByTagName("book");
    for (int i = 0; i < nodes.getLength(); i++) {
        Element element = (Element) nodes.item(i);
        BookWritable book = new BookWritable();
        NodeList author = element.getElementsByTagName("author");
        Element line = (Element) author.item(0);
        book.setBookAuthor(new Text(getCharacterDataFromElement(line)));
        NodeList title = element.getElementsByTagName("title");
        line = (Element) title.item(0);
        book.setBookTitle(new Text(getCharacterDataFromElement(line)));
        NodeList isbn = element.getElementsByTagName("ISBN");
        line = (Element) isbn.item(0);
        book.setBookISBN(new Text(getCharacterDataFromElement(line)));
        mapBooks.put(Long.valueOf(i), book);
    }
    this.startPos = 0;
    endPos = mapBooks.size();
}
I am using the DOM parser for the XML parsing part. I am not sure, but maybe doing a pattern match instead would resolve the DOM parsing issue (when a split contains broken XML). Would that also solve the problem of the last mapper completing its record from the next input split?
Please correct me if there is some fundamental issue here. If there is a solution, it would be a great help.
Thanks,
AJ
You could very well try out Mahout's XmlInputFormat class. There is more explanation in the book 'Hadoop in Action'.
I don't think an XML file is splittable in general, so I doubt there is a generic, publicly available solution for you. The problem is that there is no way to work out the tag hierarchy when you start reading in the middle of the XML, unless you know its structure a priori.
But your XML is very simple, so you can write an ad hoc splitter. As you explained, TextInputFormat skips the first characters of a split until it reaches the beginning of a new line. You can do the same thing, looking for the book tag instead of a newline: copy the code, but instead of looking for the "\n" character, look for the opening tag of your items.
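The boundary rule described above can be sketched without any Hadoop classes. The following is only an in-memory illustration: BookSplitScanner and readRecords are hypothetical names, a String stands in for the file, and character offsets stand in for the byte offsets a real RecordReader would get from its FileSplit. A record belongs to the split in which its opening tag starts, and it is read through to its closing tag even when that lies beyond the end of the split.

```java
import java.util.ArrayList;
import java.util.List;

public class BookSplitScanner {
    private static final String START_TAG = "<book>";
    private static final String END_TAG = "</book>";

    // Returns every <book>...</book> record whose opening tag starts inside [start, end).
    // Reading may run past `end` to finish the last record, just as TextInputFormat
    // finishes the last line of a split.
    static List<String> readRecords(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            int open = data.indexOf(START_TAG, pos);
            if (open < 0 || open >= end) {
                break; // no more records start in this split
            }
            int close = data.indexOf(END_TAG, open);
            if (close < 0) {
                break; // truncated input: the record never ends
            }
            int recordEnd = close + END_TAG.length();
            records.add(data.substring(open, recordEnd));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<books><book><title>A</title></book>"
            + "<book><title>B</title></book></books>";
        int boundary = 20; // falls in the middle of the first record
        System.out.println(readRecords(data, 0, boundary));
        System.out.println(readRecords(data, boundary, data.length()));
    }
}
```

Because the second split skips forward to the next opening tag, the record cut by the boundary is emitted only by the first split, so each record is processed exactly once.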
Be sure to use a SAX parser in your development; DOM is not a good option for dealing with big XML files. A SAX parser reads the tags one by one and lets you act on each event, instead of loading the whole file into memory as DOM tree generation does.
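As a rough sketch of the event-driven style, the following uses the JDK's SAX parser to collect the text of each title element as it streams past. BookTitleSaxDemo and titles are hypothetical names chosen for this example; a real record reader would build its own value objects in the callbacks.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BookTitleSaxDemo {

    // Streams through the XML and returns the text of every <title> element.
    static List<String> titles(String xml) {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(titles(
            "<books><book><title>Hadoop Recipes</title></book>"
            + "<book><title>C++</title></book></books>"));
    }
}
```

Only the text of the element currently being read is held in memory, which is the property that matters for very large inputs.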
Maybe split the XML file first. There are open-source XML splitters, and also at least two commercial split tools that claim to handle the XML structure automatically to ensure each split file is well-formed XML. Google "xml split tool" or "xml splitter".
I could not find many examples of evaluating XPath using Xerces-C++ 3.1.
Given the following sample XML input:
<abc>
<def>AAA BBB CCC</def>
</abc>
I need to retrieve the "AAA BBB CCC" string by the XPath "/abc/def/text()[0]".
The following code works:
XMLPlatformUtils::Initialize();
// create the DOM parser
XercesDOMParser *parser = new XercesDOMParser;
parser->setValidationScheme(XercesDOMParser::Val_Never);
parser->parse("test.xml");
// get the DOM representation
DOMDocument *doc = parser->getDocument();
// get the root element
DOMElement* root = doc->getDocumentElement();
// evaluate the xpath
DOMXPathResult* result=doc->evaluate(
XMLString::transcode("/abc/def"), // "/abc/def/text()[0]"
root,
NULL,
DOMXPathResult::ORDERED_NODE_SNAPSHOT_TYPE, //DOMXPathResult::ANY_UNORDERED_NODE_TYPE, //DOMXPathResult::STRING_TYPE,
NULL);
// look into the XPath evaluation result
result->snapshotItem(0);
std::cout << StrX(result->getNodeValue()->getFirstChild()->getNodeValue()) << std::endl;
XMLPlatformUtils::Terminate();
return 0;
But I really hate that:
result->getNodeValue()->getFirstChild()->getNodeValue()
Does it have to be a node set instead of the exact node I want?
I tried other forms of the XPath, such as "/abc/def/text()[0]", and other result types, such as DOMXPathResult::STRING_TYPE. Xerces always throws an exception.
What did I do wrong?
I don't code with Xerces-C++, but it seems to implement the W3C DOM Level 3, so based on that I would suggest selecting an element node with a path like /abc/def and then simply calling result->getNodeValue()->getTextContent() to get the contents of the element (e.g. AAA BBB CCC).
As far as I understand the DOM APIs, if you want a string value then you need to use a path like string(/abc/def), and then result->getStringValue() should do (if the evaluate method requests any type or STRING_TYPE as the result type).
Alternatively, if you know you are only interested in the first node in document order, you could evaluate /abc/def with FIRST_ORDERED_NODE_TYPE and then access result->getNodeValue()->getTextContent().
I have an XML document like:
<data>
<item type="apple">
<misc>something</misc>
<appleValue>23</appleValue>
<misc2>something else</misc2>
</item>
<item type="banana">
<bananaValue>47</bananaValue>
<random>something</random>
</item>
</data>
I can get the items with doc("data.xml")/data/item, but I need to get the text from the elements whose names end with Value. So I'd like to get "23" and "47", but I don't necessarily know the element names. All I really know is that there are elements ending in Value; I don't know whether it's appleValue, bananaValue, etc., except that I could look at the type attribute and build up a string.
let $type := (doc("data.xml")/data/item)[1]/@type
doc("data.xml")/data/item/$typeValue
...That last line is what I'm trying to get at, clearly that's not correct but I need to find elements whose name is known based on a variable (stored in a variable such as $type) and "Value".
Any ideas? I realize this variable element naming is strange/odd/bad...but that's the way it is and I have to deal with it.
I got it thanks to this post: Can XPath match on parts of an element's name?
doc("data.xml")/data/item/*[ends-with(name(), "Value")]
I would avoid using the name() function in favor of either node-name() or local-name(). The reason for this is that name() can give you different answers depending on what (and whether) namespace prefixes are used in the source. For example, the following three elements have the same exact name (QName):
<appleValue xmlns="http://example.com"/>
<x:appleValue xmlns:x="http://example.com"/>
<y:appleValue xmlns:y="http://example.com"/>
However, the name() function will give you a different answer for each one (appleValue, x:appleValue, and y:appleValue, respectively). So you're better off either ignoring the namespace by using local-name() (which returns the string appleValue for all three of the above cases) or explicitly specifying the namespace (even if it's empty, as Oliver showed), using node-name() (which returns a proper QName value, rather than a string). In this case, since you're not using namespaces (and since even if you added one later, the code will still work), I'd be slightly in favor of using local-name() as follows:
doc("data.xml")/data/item/*['Value' eq substring-after(local-name(),../@type)]
For elaboration on reasons to avoid the name() function (and exceptions), see "Perils of the name function".
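You can access the name of the node using name(). XPath 1.0 does not have an ends-with() function, but you can get there by combining substring() and string-length(): start the substring at string-length(name()) - 4 so that it covers the last five characters of the name.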
//item/*[ substring( name(), string-length(name() ) - 4 ) = 'Value']
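The XPath 1.0 workaround can be tried out with the JDK's javax.xml.xpath API, which implements XPath 1.0. This is only a sketch: EndsWithXPathDemo and valuesEndingInValue are hypothetical names, and the XML is the sample document from the question embedded as a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EndsWithXPathDemo {
    static final String DATA =
        "<data>"
        + "<item type=\"apple\"><misc>something</misc><appleValue>23</appleValue>"
        + "<misc2>something else</misc2></item>"
        + "<item type=\"banana\"><bananaValue>47</bananaValue>"
        + "<random>something</random></item>"
        + "</data>";

    // 'Value' has five characters, so the substring starts at length - 4 (XPath is 1-based).
    static final String EXPRESSION =
        "//item/*[substring(name(), string-length(name()) - 4) = 'Value']";

    // Returns the text of every child of <item> whose name ends with "Value".
    static List<String> valuesEndingInValue(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valuesEndingInValue(DATA));
    }
}
```

Names shorter than the suffix are safe: a start position of zero or less makes substring() return the whole name, which cannot equal 'Value'.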
A more precise way to implement this would be
for $item in doc("data.xml")/data/item
let $value-name := fn:QName('', concat($item/@type, 'Value'))
return $item/*[node-name() = $value-name]
I have this XML document and I want to find a specific GitHubCommiter using REXML. How do I do that?
<users>
<GitHubCommiter id="Nerian">
<username>name</username>
<password>12345</password>
</GitHubCommiter>
<GitHubCommiter id="xmawet">
<username>name</username>
<password>12345</password>
</GitHubCommiter>
<GitHubCommiter id="JulienChristophe">
<username>name</username>
<password>12345</password>
</GitHubCommiter>
</users>
I have tried:
log = REXML::Document.new(file)
root = log.root
username = root.elements["GitHubCommiter['#{github_user_name}']"].elements['username'].text
password = root.elements["GitHubCommiter['#{github_user_name}']"].elements['password'].text
root.elements["GitHubCommiter['id'=>'#{github_user_name}']"].text
But I can't find a way to do it. Any ideas?
The docs say for elements (emphasis mine):
[]( index, name=nil)
Fetches a child element. Filters only Element children, regardless of the XPath match.
index: the search parameter. This is either an Integer, which will be used to find the index'th child Element, or an XPath, which will be used to search for the Element.
So it needs to be XPath:
root.elements["./GitHubCommiter[@id = '#{github_user_name}']"]
etc.