xpath multiple nodes query with custom strings - xpath

I have a working multiple node xpath query and I want to add some custom strings between the results.
<FooBar>
<Foo>
<Fooid>A</Fooid>
<Booid>222</Booid>
<Wooid>Z</Wooid>
</Foo>
<Foo>
<Fooid>B</Fooid>
<Booid>333</Booid>
<Wooid>Y</Wooid>
</Foo>
<Foo>
<Fooid>C</Fooid>
<Booid>444</Booid>
<Wooid>X</Wooid>
</Foo>
</FooBar>
I have tried different combinations of string-join and/or concat, but the result was always wrong or ended in a syntax error. My XPath version is 2.0.
//Foo/Fooid | //Foo/Booid | Foo/Wooid
The above xpath results in:
A
222
Z
My preferred result would be:
(A)
{222}
[Z]
What is the correct usage of string-join to get the brackets around the three ids?

After doing some research and with the help of your comments, I was able to achieve the desired result with this line:
//Foo/concat('(', Fooid, ')'), //Foo/concat('{', Booid, '}'),Foo/concat('[', Wooid, ']')
The '|' was replaced by a comma.
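If the wrapping is done in host code instead of XPath 2.0, the same idea can be sketched in Python. This is only an illustration, assuming the standard library's xml.etree.ElementTree (which supports a small XPath subset and has no concat()), so the brackets are added in Python. The names WRAPPERS and wrap_ids are made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<FooBar>
<Foo><Fooid>A</Fooid><Booid>222</Booid><Wooid>Z</Wooid></Foo>
<Foo><Fooid>B</Fooid><Booid>333</Booid><Wooid>Y</Wooid></Foo>
<Foo><Fooid>C</Fooid><Booid>444</Booid><Wooid>X</Wooid></Foo>
</FooBar>"""

# Hypothetical mapping: child element name -> (opening, closing) string.
WRAPPERS = {"Fooid": ("(", ")"), "Booid": ("{", "}"), "Wooid": ("[", "]")}


def wrap_ids(xml_text):
    """Return every id wrapped in its brackets, grouped by element name."""
    root = ET.fromstring(xml_text)
    lines = []
    for tag, (left, right) in WRAPPERS.items():
        # Same grouping as the comma expression: all Fooid first, then Booid, then Wooid.
        for node in root.findall(f"./Foo/{tag}"):
            lines.append(f"{left}{node.text}{right}")
    return lines


wrapped = wrap_ids(XML)
print("\n".join(wrapped))
```

Like the comma-separated XPath expression, this groups the output by element name rather than by Foo.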

To concat these characters, use their HTML entities instead.
concat('&lpar;', //Fooid, '&rpar;')
for parentheses use
&lpar;
&rpar;
for brackets
&lbrack;
&rbrack;
for braces
&lbrace;
&rbrace;
See full character entity sets here

Related

xpath return default value, if value of attribute not found using text()

sample_xml='<employees>\
<person id="p1">\
<name value="Alice">ALICE</name>\
</person>\
<person id="p2">\
<name value="Alice">BOB</name>\
</person>\
<person id="p3">\
<name value="Alice"/>\
</person>\
</employees>'
data = [[sample_xml]]
df = spark.createDataFrame(data, ['data'])
# xpath() returns an array with the text nodes matched by the expression
df = df.selectExpr(
    'xpath(data, "/employees/person/name[@value=\'Alice\']/text()") test'
)
This gives the expected ["ALICE", "BOB"].
Problem:
I want my result to be ["ALICE", "BOB", "NA"],
i.e. for an empty element like the one below
<name value="Alice"/>
I want to return a default NA.
Is it possible to achieve this?
Regards
With XPath itself this is not possible. It can only return you the actual values of the matching nodes or nothing if no match.
In order to get NA or any other data that is not actually contained in the XML, you should wrap the basic XPath request with some additional, external code to return the customized output in case of no match.
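In XPath 2.0, use /employees/person/name[@value='Alice']/(text()/string(), 'NA')[1]. Here text()/string() yields an empty sequence when the element has no text node, so 'NA' becomes the first item; string(text()) would instead yield a zero-length string and the default would never be selected.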
It can't be done in XPath 1.0. In XPath 1.0 there's no such thing as a sequence of strings; you can only return a sequence of nodes, and you can only return nodes that are actually present in the input document.
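The "wrap the XPath with external code" approach can be sketched in plain Python as follows. This is a minimal illustration, assuming the standard library's xml.etree.ElementTree rather than Spark's xpath function; names_with_default is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<employees>
<person id="p1"><name value="Alice">ALICE</name></person>
<person id="p2"><name value="Alice">BOB</name></person>
<person id="p3"><name value="Alice"/></person>
</employees>"""


def names_with_default(xml_text, default="NA"):
    """Return the text of every matching name element, or default if it is empty."""
    root = ET.fromstring(xml_text)
    matches = root.findall("./person/name[@value='Alice']")
    # node.text is None for an empty element such as <name value="Alice"/>
    return [node.text if node.text else default for node in matches]


result = names_with_default(SAMPLE_XML)
print(result)
```

In a Spark job, a function like this would have to be applied per row in some way (for example through a UDF); that wiring is not shown here.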

XPath 1.0 lowest value regardless of ordering

I have this data, and I'm looking for the lowest bid.
<root>
<current_bid>$1.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$3.00</current_bid>
<current_bid>$4.00</current_bid>
<current_bid>$5.00</current_bid>
</root>
This is my XPath 1.0 attempt:
//current_bid[not(translate (., '$,.','') > translate(//current_bid, '$,.',''))]
And it works fine (returns only the $1.00 bid) with the data above, but if I change the ordering of the data to let's say this here:
<root>
<current_bid>$5.00</current_bid>
<current_bid>$1.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$3.00</current_bid>
<current_bid>$4.00</current_bid>
</root>
Then it gives a wrong output (returns all values).
Shouldn't the order be irrelevant when I use //current_bid, since it queries the whole document?
Also: how would I go if I wanted the second lowest bid?
XPath 1.0 processes nodes in document order, so there's no way to sort them with pure XPath. It can be done with XSL processing.
The approach below works only if the minimum is in the first position.
Xpath:
'//current_bid[(position()<=last()) and not(translate (., "$,.","") > translate(//current_bid, "$,.",""))]'
Sample:
<root>
<current_bid>$1.00</current_bid>
<current_bid>$5.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$4.00</current_bid>
<current_bid>$3.00</current_bid>
</root>
Testing on command line with xmllint
xmllint --xpath '//current_bid[(position()<=last()) and not(translate (., "$,.","") > translate(//current_bid, "$,.",""))]' test.xml ; echo
Result:
<current_bid>$1.00</current_bid>
If the number of nodes is known in advance perhaps it could be done with nested conditions but would give a very complex XPath expression.
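On the question of why the ordering matters: translate() expects a string, and when it is given the node-set //current_bid, XPath 1.0 converts that node-set to the string-value of its first node in document order. Every bid is therefore compared against the first bid only. If post-processing outside XPath is acceptable, the lowest and second-lowest bids can be picked in host code. The following is a sketch under that assumption, using Python's standard library; sorted_bids is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

XML = """<root>
<current_bid>$5.00</current_bid>
<current_bid>$1.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$3.00</current_bid>
<current_bid>$4.00</current_bid>
</root>"""


def sorted_bids(xml_text):
    """Return the bid strings ordered by their numeric amount, lowest first."""
    root = ET.fromstring(xml_text)
    bids = []
    for node in root.iter("current_bid"):
        # Strip the currency sign and thousands separators, keep the decimal point.
        amount = Decimal(node.text.replace("$", "").replace(",", ""))
        bids.append((amount, node.text))
    return [text for _, text in sorted(bids)]


ordered = sorted_bids(XML)
lowest = ordered[0]
second_lowest = ordered[1]
print(lowest, second_lowest)
```

Ties are not deduplicated in this sketch, so two equal lowest bids would occupy both positions.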

xpath without specifying the tag? [duplicate]

Given this XML, what XPath returns all elements whose prop attribute contains Foo (the first three nodes):
<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>
//a[contains(@prop,'Foo')]
Works if I use this XML to get results back.
<bla>
<a prop="Foo1">a</a>
<a prop="Foo2">b</a>
<a prop="3Foo">c</a>
<a prop="Bar">a</a>
</bla>
Edit:
Another thing to note is that while the XPath above will return the correct answer for that particular XML, if you want to guarantee you only get the "a" elements in element "bla", you should, as others have mentioned, also use
/bla/a[contains(@prop,'Foo')]
This will search all "a" elements in your entire XML document, regardless of whether they are nested in a "bla" element:
//a[contains(@prop,'Foo')]
I added this for the sake of thoroughness and in the spirit of stackoverflow. :)
This XPath will give you all nodes that have attributes containing 'Foo' regardless of node name or attribute name:
//attribute::*[contains(., 'Foo')]/..
Of course, if you're more interested in the contents of the attribute themselves, and not necessarily their parent node, just drop the /..
//attribute::*[contains(., 'Foo')]
descendant-or-self::*[contains(@prop,'Foo')]
Or:
/bla/a[contains(@prop,'Foo')]
Or:
/bla/a[position() <= 3]
Dissected:
descendant-or-self::
The Axis - search through every node underneath and the node itself. It is often better to say this than //. I have encountered some implementations where // means anywhere (descendant or self of the root node). Others use the default axis.
* or /bla/a
The Tag - a wildcard match, and /bla/a is an absolute path.
[contains(@prop,'Foo')] or [position() <= 3]
The condition within [ ]. @prop is shorthand for attribute::prop, as attribute is another search axis. Alternatively you can select the first 3 by using the position() function.
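The expressions dissected above can be mimicked in Python to see what each part selects. This is a rough sketch, assuming the standard library's xml.etree.ElementTree, whose XPath subset has no contains() function, so the substring test and the position limit are done in Python instead.

```python
import xml.etree.ElementTree as ET

XML = """<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>"""

root = ET.fromstring(XML)

# Rough equivalent of descendant-or-self::*[contains(@prop, 'Foo')]:
# iter() walks the element itself and every descendant in document order.
matches = [el for el in root.iter() if "Foo" in el.get("prop", "")]
props = [el.get("prop") for el in matches]

# Rough equivalent of /bla/a[position() <= 3], using a slice for the position test.
first_three = [el.get("prop") for el in root.findall("./a")[:3]]

# The match is case sensitive, just like contains() in XPath.
lowercase = [el.get("prop") for el in root.iter() if "foo" in el.get("prop", "")]

print(props, first_three, lowercase)
```

The lowercase search returns nothing for this document, which shows why 'foo' and 'Foo' are not interchangeable in the expressions.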
Have you tried something like:
//a[contains(@prop, "Foo")]
I've never used the contains function before but suspect that it should work as advertised...
John C is the closest, but XPath is case sensitive, so the correct XPath would be:
/bla/a[contains(@prop, 'Foo')]
If you also need to match the content of the link itself, use text():
//a[contains(@href,"/some_link")][text()="Click here"]
/bla/a[contains(@prop, "foo")]
try this:
//a[contains(@prop,'foo')]
that should work for any "a" tags in the document
For the code above...
//*[contains(@prop,'foo')]

Get nodes from xml string using regex

I have string xml like below:
<Query>
<Code>USD</Code>
<Description>United States Dollars</Description>
<UpdateTime>2013-03-04 02:27:33</UpdateTime>
<toUSD>1</toUSD>
<USDto>1</USDto>
<toEUR>2</toEUR>
<EURto>3</EURto>
</Query>
All the text is on one line without whitespace. I can't write the right regex pattern. I want to get the nodes whose names begin with <to, for example <toEUR> and <toUSD>.
How should I write this pattern?
With nokogiri and the xpath function starts-with:
require 'nokogiri'
doc = Nokogiri::XML <<EOF
<Query>
<Code>USD</Code>
<Description>United States Dollars</Description>
<UpdateTime>2013-03-04 02:27:33</UpdateTime>
<toUSD>1</toUSD>
<USDto>1</USDto>
<toEUR>2</toEUR>
<EURto>3</EURto>
</Query>
EOF
doc.search('//*[starts-with(name(),"to")]').map &:to_s
#=> ["<toUSD>1</toUSD>", "<toEUR>2</toEUR>"]
Although the general consensus is that parsing xml etc with regex is not the way to go, something like this should do the trick:
<\s*(to[^>\s]+)[^>]*>([^<]+)<\s*/\s*\1\s*>
In ruby format:
/<\s*(to[^>\s]+)[^>]*>([^<]+)<\s*\/\s*\1\s*>/
Matches <toWhatever>value</toWhatever>; back-reference group 1 returns the name (toWhatever) and back-reference group 2 returns the value.
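For comparison, both approaches can be tried side by side in Python. This is a sketch that assumes the XML really is a single line as described in the question; the parser-based variant uses the standard library's xml.etree.ElementTree and a plain startswith() check in place of the XPath starts-with function.

```python
import re
import xml.etree.ElementTree as ET

XML = (
    "<Query><Code>USD</Code><Description>United States Dollars</Description>"
    "<UpdateTime>2013-03-04 02:27:33</UpdateTime><toUSD>1</toUSD>"
    "<USDto>1</USDto><toEUR>2</toEUR><EURto>3</EURto></Query>"
)

# Parser-based: the Python counterpart of starts-with(name(), "to").
root = ET.fromstring(XML)
parsed = [(el.tag, el.text) for el in root.iter() if el.tag.startswith("to")]

# Regex-based: the same pattern as above, written as a raw string.
PATTERN = re.compile(r"<\s*(to[^>\s]+)[^>]*>([^<]+)<\s*/\s*\1\s*>")
matched = PATTERN.findall(XML)

print(parsed)
print(matched)
```

Both variants return the element name and value as pairs, so they can be compared directly.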

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set will first be converted to its string-value. If no arguments are passed to normalize-space, it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. E.g., the following JavaScript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace-only text node between the span and b elements might be a problem.
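The string-value and normalize-space behaviour described above can be approximated in Python. This is a minimal sketch, assuming the standard library's xml.etree.ElementTree; the surrounding body element is a hypothetical wrapper added only so that the div can be looked up by its id.

```python
import xml.etree.ElementTree as ET

XML = """<body><div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div></body>"""

root = ET.fromstring(XML)
node = root.find(".//div[@id='theNode']")

# Counterpart of string(): concatenate all descendant text in document order.
string_value = "".join(node.itertext())

# Approximation of normalize-space(): trim the ends and collapse whitespace runs.
# str.split() treats any Unicode whitespace as a separator, which is slightly
# broader than the XML whitespace characters that normalize-space() handles.
normalized = " ".join(string_value.split())

print(repr(string_value))
print(normalized)
```

The whitespace-only text between the span and b elements survives as a single space in the normalized result.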
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html")  # Create HTML doc from HTML text
# Collect every text node under the div, in document order
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip()
print(final_txt)  # 'This is an example bolded text'
How about this:
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmm, I am not sure about the last part though. You might have to play with that.
Normally, use
//div[@id='theNode']
to get all the text, but if it comes back split, then use
//div[@id='theNode']/text()
I'm not sure, but if you provide the link I will try it.

Resources