I have an XML file like this:
<Doc> A0B100 </Doc>
<Doc> A0B101 </Doc>
<Doc> B1A100 </Doc>
<Doc> B1A101 </Doc>
I use an XPath query to select the values of the nodes that contain "B1".
My code:
$txtSearch = "B1";
$titles = $xpath->query("Doc[contains(text(),\"$txtSearch\")]");
It returned all 4 values:
A0B100
A0B101
B1A100
B1A101
But I only want to match values whose text starts with the search string, so the result I expect is:
B1A100
B1A101
How can I do that?
Use this XPath:
Doc[starts-with(normalize-space(text()),\"$txtSearch\")]
I added normalize-space() to trim the whitespace around the values in your sample XML.
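As a rough cross-check outside PHP, the same prefix test can be sketched in Python. This is only an illustration under assumptions: the <Docs> wrapper element is invented so that the sample parses as a single document, and because the standard-library ElementTree XPath subset has no starts-with(), the trimming and the prefix test are done in Python code rather than in the XPath expression itself.

```python
import xml.etree.ElementTree as ET

# The <Docs> root is an assumption; the sample in the question has no root element.
xml_text = """<Docs>
<Doc> A0B100 </Doc>
<Doc> A0B101 </Doc>
<Doc> B1A100 </Doc>
<Doc> B1A101 </Doc>
</Docs>"""


def docs_starting_with(xml_source, prefix):
    """Return the trimmed <Doc> values that begin with the given prefix."""
    root = ET.fromstring(xml_source)
    # strip() plays the role of normalize-space() here (it trims the ends only).
    values = ((doc.text or "").strip() for doc in root.findall("Doc"))
    return [value for value in values if value.startswith(prefix)]


print(docs_starting_with(xml_text, "B1"))
```

With "B1" as the search text, only the two values that begin with B1 are kept; the values that merely contain B1 are dropped.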
sample_xml='<employees>\
<person id="p1">\
<name value="Alice">ALICE</name>\
</person>\
<person id="p2">\
<name value="Alice">BOB</name>\
</person>\
<person id="p3">\
<name value="Alice"/>\
</person>\
</employees>'
data = [
[f'{sample_xml}']
]
df = spark.createDataFrame(data, ['data'])
df=df.selectExpr(
'xpath(data,"/employees/person/name[@value=\'Alice\']/text()") test'
)
This gives the expected ["ALICE", "BOB"].
Problem:
I want my result to be ["ALICE", "BOB", "NA"],
i.e. for an empty element like the one below
<name value="Alice"/>
I want to return a default value of NA.
Is it possible to achieve this?
Regards
With XPath itself this is not possible. It can only return you the actual values of the matching nodes or nothing if no match.
In order to get NA or any other data that is not actually contained in the XML, you should wrap the basic XPath request with some additional, external code to return the customized output in case of no match.
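A minimal sketch of such a wrapper, assuming plain Python with the standard-library ElementTree rather than Spark's xpath() function. The helper name names_with_default is hypothetical; in Spark, equivalent logic would have to run outside the xpath() call, and that wiring is not shown here.

```python
import xml.etree.ElementTree as ET

sample_xml = """<employees>
<person id="p1"><name value="Alice">ALICE</name></person>
<person id="p2"><name value="Alice">BOB</name></person>
<person id="p3"><name value="Alice"/></person>
</employees>"""


def names_with_default(xml_source, default="NA"):
    """Return the text of each matching <name>, or the default when it has none."""
    root = ET.fromstring(xml_source)
    # The XPath only selects the elements; the fallback is applied in Python.
    matches = root.findall("./person/name[@value='Alice']")
    return [name.text if name.text else default for name in matches]


print(names_with_default(sample_xml))
```

The XPath selects the name elements themselves instead of their text() children, so the empty element is still part of the result and can be mapped to the default.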
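In XPath 2.0, use /employees/person/name[@value='Alice']/(text()/string(), 'NA')[1]. Note that string(text()) would not work here, because string() of an empty sequence is the zero-length string rather than the empty sequence, so the 'NA' fallback would never be reached.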
It can't be done in XPath 1.0. In XPath 1.0 there's no such thing as a sequence of strings; you can only return a sequence of nodes, and you can only return nodes that are actually present in the input document.
Given the XML structure
<Doc>
<Other />
<Q1 />
<Q2 />
</Doc>
How can I select only nodes that begin with a "Q", e.g. /Doc/Q1 and /Doc/Q2?
It seems like this can be done with starts-with, but I have only found examples that apply starts-with to the value of the node.
/Doc/*[starts-with(name(), 'Q')]
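For comparison, the same name-based filter can be sketched in Python. This is an illustration only: the standard-library ElementTree XPath subset does not support starts-with() or name(), so the test is applied to each child's tag in Python code.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<Doc><Other /><Q1 /><Q2 /></Doc>")

# Iterating over an element yields its direct children, like /Doc/* does.
q_children = [child.tag for child in root if child.tag.startswith("Q")]

print(q_children)
```

This keeps Q1 and Q2 and skips Other. If the document used namespaces, ElementTree would report tags as {uri}localname, so the prefix test would need to look at the local part only.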
Consider the following XML snippet:
<doc>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
</doc>
In XQuery, I have a function that needs to do some things based on the ancestor chapter of a given "para" element that is passed in as a parameter, as shown in the stripped down example below:
declare function doSomething($para){
let $chapter := $para/ancestor::chapter
return "some stuff"
};
In that example, $chapter keeps coming up empty. However, if I write the function similar to the following (i.e., without using the ancestor axis), I get the desired "chapter" element:
declare function doSomething($para){
let $chapter := $para/../..
return "some stuff"
};
The problem is that I cannot use explicit paths as in the latter example because the XML I will be searching is not guaranteed to have the "chapter" element as a grandparent every time. It may be a great-grandparent or great-great-grandparent, and so on, as shown below:
<doc>
<chapter id="1">
<item>
<subItem>
<para>some text here</para>
</subItem>
</item>
</chapter>
</doc>
Does anyone have an explanation as to why the axis doesn't work, while the explicit XPath does? Also, does anyone have any suggestions on how to solve this problem?
Thank you.
SOLUTION:
The mystery is now solved.
The node in question was re-created in another function, which had the result of stripping it of all of its ancestor information. Unfortunately, the previous developer did not document this wonderful, little function and has cost us all a good deal of time.
So, the ancestor axis worked exactly as it should - it was just being applied to a deceptive node.
I thank all of you for your efforts in answering my questions.
The ancestor axis does work fine. I suspect your problem is namespaces. The example you showed and that I ran (below) has XML without any namespaces. If your XML has a namespace, then you would need to provide that in the ancestor XPath, like this: $para/ancestor::foo:chapter, where in this case the prefix foo is bound to the correct namespace for the chapter element.
let $doc := <doc>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
</doc>
let $para := $doc//para
return $para/ancestor::chapter
RESULT:
<?xml version="1.0" encoding="UTF-8"?>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
These things almost always boil down to namespaces! As a diagnostic to confirm 100% that namespaces are not the issue, can you try:
declare function local:doSomething($para) {
let $chapter := $para/ancestor::*[local-name() = 'chapter']
return $chapter
};
This seems surprising to me; which XQuery implementation are you using? With BaseX, the following query...
declare function local:doSomething($para) {
let $chapter := $para/ancestor::chapter
return $chapter
};
let $xml :=
<doc>
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
</doc>
return local:doSomething($xml//para)
...returns...
<chapter id="1">
<item>
<para>some text here</para>
</item>
</chapter>
I suspect namespaces too. If $para/../.. works but $para/parent::item/parent::chapter turns up empty, then you know it's a question of namespaces.
Look for an xmlns declaration at the top of your content, e.g.:
<doc xmlns="http://example.com">
...
</doc>
In your XQuery, you then need to bind that namespace to a prefix and use that prefix in your XQuery/XPath expressions, like this:
declare namespace my="http://example.com";
declare function local:doSomething($para){
let $chapter := $para/ancestor::my:chapter
return "some stuff"
};
What prefix you use doesn't matter. The important thing is that the namespace URI (http://example.com in the above example) matches up.
It makes sense that ../.. selects the element you want, because .. is short for parent::node() which selects the parent node regardless of its name (or namespace). Whereas ancestor::chapter will only select <chapter> elements that are not in a namespace (unless you have declared a default element namespace, which is usually not a good idea in XQuery because it affects both your input and your output).
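The same name-matching rule can be observed outside XQuery. The sketch below uses Python's standard-library ElementTree purely as an illustration; it is not the asker's environment. ElementTree has no ancestor axis, so the walk up the tree is done through a child-to-parent map, and the helper name ancestor_chapter is hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = """<doc xmlns="http://example.com">
  <chapter id="1"><item><subItem><para>some text here</para></subItem></item></chapter>
</doc>"""

NS = {"my": "http://example.com"}
root = ET.fromstring(xml_text)

# An unprefixed name means "no namespace", so it matches nothing here.
unqualified = root.findall(".//chapter")
# A prefix bound to the right URI matches, whatever the prefix is called.
qualified = root.findall(".//my:chapter", NS)


def ancestor_chapter(tree_root, node):
    """Walk up from node until a namespaced <chapter> is found (or return None)."""
    parent_of = {child: parent for parent in tree_root.iter() for child in parent}
    current = parent_of.get(node)
    while current is not None and current.tag != "{http://example.com}chapter":
        current = parent_of.get(current)
    return current


para = root.find(".//my:para", NS)
chapter = ancestor_chapter(root, para)
print(len(unqualified), len(qualified), chapter.get("id"))
```

The walk finds the chapter at any depth, which is what the ancestor axis provides in XQuery once the name test carries the right namespace.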
I need to select the text in a node, but not any child nodes.
The XML looks like this:
<a>
apples
<b><c/></b>
pears
</a>
If I select a/text(), all I get is "apples". How would I retrieve "apples pears" while omitting <b><c/></b>?
Well, the path a/text() selects all text child nodes of the a element, so the path is correct in my view. Only if you use that path in, e.g., XSLT 1.0 with <xsl:value-of select="a/text()"/> will it output the string value of just the first selected node. In XPath 2.0 and XQuery 1.0, string-join(a/text()/normalize-space(), ' ') yields the string "apples pears", so maybe that helps with your problem. If not, please explain in which context you use XPath or XQuery such that a/text() only returns the (string?) value of the first selected node.
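The idea of joining the text children of a, and only those, can be sketched in Python as well. This assumes the standard-library ElementTree, which has no text nodes as such: the text before the first child is stored in element.text and the text after each child in that child's tail. The helper name direct_text is hypothetical.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>\napples\n<b><c/></b>\npears\n</a>")


def direct_text(element):
    """Join the text that sits directly inside element, skipping child content."""
    # element.text is the text before the first child; each child's tail is
    # the text that follows that child but still belongs to the parent.
    pieces = [element.text or ""] + [child.tail or "" for child in element]
    # split()/join collapses whitespace, similar to normalize-space().
    return " ".join(" ".join(pieces).split())


print(direct_text(a))
```

Text nested inside b or c would not be included, which mirrors a/text() as opposed to a//text().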
To retrieve all the descendants I advise using the // notation. This will return all text descendants below an element. Below is an xquery snippet that gets all the descendant text nodes and formats it like Martin indicated.
xquery version "1.0";
let $a :=
<a>
apples
<b><c/></b>
pears
</a>
return normalize-space(string-join($a//text(), " "))
Or if you have your own formatting requirements you could start by looping through each text element in the following xquery.
xquery version "1.0";
let $a :=
<a>
apples
<b><c/></b>
pears
</a>
for $txt in $a//text()
return $txt
If I select a/text(), all I get is
"apples". How would I retrieve "apples
pears"
Just use:
normalize-space(/)
Explanation:
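The string value of the root node (/) of the document is the concatenation of all its text-node descendants. Because those text nodes carry line breaks around the words, normalize-space() is used to strip the leading and trailing whitespace and to collapse each internal run of whitespace into a single space.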
Here is a small demonstration how this solution works and what it produces:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
'<xsl:value-of select="normalize-space()"/>'
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<a>
apples
<b><c/></b>
pears
</a>
the wanted, correct result is produced:
'apples pears'
I'm using XPath 1.0 to parse an HTML file and I want to get a sequence of strings from a node-set. First I select a node-set (e.g. //div) and then I want the string-value of each node in the set. I've tried string(//div), but it only returns the string-value of the first node in the set.
Example:
<foo>
<div>
bbbb<p>aaa</p>
</div>
<div>
cccc<p>aaa</p>
</div>
</foo>
I expect a result like ('bbbbaaa', 'ccccaaa') but I only get 'bbbbaaa'.
In XPath 1.0 the "string-value of a node-set" is by definition the string value of the first node in the node-set.
In XPath 2.0 the following expression produces a sequence of the string values of all div elements in an XML document:
//div/string(.)
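When only XPath 1.0 is available, one workaround is to select the node-set and compute each node's string value in the host language. The sketch below assumes Python's standard-library ElementTree and the well-formed sample from the question; real-world HTML may need a more forgiving parser. The helper name string_values is hypothetical.

```python
import xml.etree.ElementTree as ET

foo = ET.fromstring("""<foo>
<div>
bbbb<p>aaa</p>
</div>
<div>
cccc<p>aaa</p>
</div>
</foo>""")


def string_values(root, tag):
    """Return the concatenated descendant text of every element named tag."""
    # itertext() yields all text inside the element in document order, which
    # corresponds to the XPath string-value; strip() drops the outer line breaks.
    return ["".join(element.itertext()).strip() for element in root.iter(tag)]


print(string_values(foo, "div"))
```

Each div contributes one string, so the result has one entry per node instead of only the first node's value.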