XPath - find first occurrence of string

I'm trying to select an anchor element by first containing the text "To Be Coded", then extracting a number from a string using substring, then using the greater than comparison operator (>0). This is what I have thus far:
/a[number(substring(text(),???,string-length()-1))>0]
An example of the HTML is:
<a class="" href="javascript:submitRequest('getRec','30', '63', 'Z')">
To Be Coded (23)
</a>
My issue right now is I don't know how to find the first occurrence of the open parenthesis. I'm also not sure how to combine what I have with the contains(text(),"To Be Coded") function.
So my criteria for the selection is:
Must be an anchor element
Must include the text "To Be Coded"
Must contain a number greater than 0 in the parentheses
Edit: I suppose I could just "hard code" the starting position for the substring, but I'm not sure what that would be - will XPath count the white space before the text in the element? How would it handle/count the characters?

Try this:
a[contains(., 'To Be Coded') and number(substring-before(substring-after(., '('), ')')) > 0]
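To see how that predicate evaluates, here is a rough Python sketch that mirrors it step by step. The helper names (substring_after, substring_before, count_in_parens) are hypothetical stand-ins for the XPath 1.0 functions, and the sample markup is wrapped in a made-up div root with a second, non-matching anchor added for contrast. A real XPath engine would evaluate the expression above directly.

```python
import xml.etree.ElementTree as ET

def substring_after(s, sep):
    """Mimic XPath substring-after: text after the first sep, or '' if absent."""
    head, found, tail = s.partition(sep)
    return tail if found else ""

def substring_before(s, sep):
    """Mimic XPath substring-before: text before the first sep, or '' if absent."""
    head, found, tail = s.partition(sep)
    return head if found else ""

def count_in_parens(text):
    """Return the number between the first '(' and the next ')', or None."""
    inner = substring_before(substring_after(text, "("), ")")
    try:
        return float(inner)
    except ValueError:
        return None  # XPath would yield NaN here, which fails the > 0 test

# Assumed sample: the anchor from the question plus a hypothetical "(0)" anchor.
SAMPLE = """<div>
<a class="" href="javascript:submitRequest('getRec','30', '63', 'Z')">
To Be Coded (23)
</a>
<a href="#">
To Be Coded (0)
</a>
</div>"""

root = ET.fromstring(SAMPLE)
matches = []
for a in root.iter("a"):
    text = "".join(a.itertext())  # string-value of the element, like '.'
    n = count_in_parens(text)
    if "To Be Coded" in text and n is not None and n > 0:
        matches.append(text.strip())
print(matches)
```

Because the number is located relative to the parentheses rather than by a fixed character position, the leading whitespace and line breaks inside the element do not matter.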

Related

Extract last word using Xpath 1.0

I need to select only the last word using xpath 1.0. I have something like this:
<Example>
<Ctry> Portugal PT </Ctry>
</Example>
I want to select only the PT word, but the position is not fixed, e.g. <Ctry> Portugal - Lisbon - PT </Ctry>; the word I want to extract is always the last one.
I've already tried:
//*[name()='Example'][substring(., string-length(.) - string-length('PT')+1) = 'PT']/text() but it always extracts the whole string.
Can anyone help me please?
You're selecting a node using the substring as a predicate to filter out other nodes. If you want the substring to be your output, it shouldn't go inside brackets.
substring(//*[name()='Example'], string-length(//*[name()='Example']) - string-length('PT')+1)
Note that /text() can be omitted when working with string functions.
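The index arithmetic in that expression can be checked with a small Python sketch. The helpers below (normalize_space, xpath_substring, suffix_of_same_length) are hypothetical approximations of the XPath 1.0 functions for integer arguments, not a real XPath engine. One assumption worth flagging: the sample text has spaces around the value, so the sketch also shows the effect of trimming first, which in XPath would be done with normalize-space().

```python
def normalize_space(s):
    """Rough counterpart of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(s.split())

def xpath_substring(s, start, length=None):
    """Approximate XPath 1.0 substring() for integer arguments; start is 1-based."""
    begin = max(start, 1)
    if length is None:
        return s[begin - 1:]
    end = max(start + length, begin)  # first 1-based position that is excluded
    return s[begin - 1:end - 1]

def suffix_of_same_length(value, expected):
    """Mirror substring(v, string-length(v) - string-length(expected) + 1)."""
    return xpath_substring(value, len(value) - len(expected) + 1)

raw = " Portugal - Lisbon - PT "  # text as it appears inside <Ctry>
print(repr(suffix_of_same_length(raw, "PT")))
print(repr(suffix_of_same_length(normalize_space(raw), "PT")))
```

With the untrimmed text the last two characters are 'T ' rather than 'PT', so the surrounding whitespace has to be removed before the length-based substring lines up with the last word.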

xpath query omit results with parent tag

I'm fairly new to XPath, so I'm seeking some help with a pattern to match the following. My current attempt isn't matching what I would expect.
//text()[1][contains(.,'wordToMatch') and not(self::a)]
As I'm sure you can see from the pattern above, I'm a noob.
Sample payload 1:
<p>Sample 1 wordToMatch some
random text
to not be matched followed by wordToMatch, this should work.</p>
Expected Result 1:
wordToMatch (not the one inside the 'a' tags but the following one)
Sample payload 2:
<p>Sample 2 wordToMatch some
random text to not be matched followed by <b>wordToMatch</b> this
should work.</p>
Expected Result 2:
wordToMatch (the one inside the 'b' tags)
Sample payload 3:
<p>Sample 3 wordToMatch some
random text to not be matched followed by wordToMatch followed by
further occurrences of wordToMatch which should not be matched.</p>
Expected Result 3:
wordToMatch (The second occurrence of the term)
The expected result for all 3 payloads is the first occurrence of the term wordToMatch which is NOT wrapped inside an 'a' tag.
The end language that will implement this pattern is Java.
Please help.
It's still not clear from the question exactly what you're after; adding the exact expected output for each sample would clear things up, I think. Anyway, based on the current information, consider the following XPath, which matches any element whose inner text exactly equals 'wordToMatch' and which is not itself an <a> element:
//*[.='wordToMatch'][not(self::a)]
This returns the b element in the 2nd case and nothing in the other cases. If you want to relax the matching and return the text node (instead of the parent element), this will do:
//*[not(self::a)]/text()[contains(.,'wordToMatch')]
UPDATE:
In XPath 2.0 or above you can use a for expression:
for $t in //*[not(self::a)]/text()[contains(.,'wordToMatch')]
return 'wordToMatch'

Xpath - identify text that contains string 'AB' in 5th and 6th characters of a word

I am trying to write a matching XPath rule, but I can't seem to pinpoint words with the exact letters in the 5th and 6th positions.
Example: 'ab' in 'qwerabqwert'
/location1[Variable='variable1'][item1[contains(.,'AB')] or item1[contains(.,'ab')]]
Please help.
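You can use the substring() function as shown below. Note that comparing against the sequence ('AB', 'ab') requires XPath 2.0; in XPath 1.0, write two comparisons joined with or, i.e. substring(., 5, 2) = 'AB' or substring(., 5, 2) = 'ab'.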
/location1[Variable='variable1'][item1[substring(., 5, 2) = ('AB', 'ab')]]
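As a quick sanity check on the positions, here is a minimal Python sketch of the same test. The function name has_ab_at_5_6 is made up for illustration. XPath positions are 1-based, so substring(., 5, 2) corresponds to the 0-based Python slice [4:6].

```python
def has_ab_at_5_6(word):
    """True if the 5th and 6th characters (1-based) are 'AB' or 'ab'."""
    # XPath substring(., 5, 2) covers 1-based positions 5 and 6,
    # which is the 0-based slice [4:6] in Python.
    return word[4:6] in ("AB", "ab")

for candidate in ("qwerabqwert", "abqwerqwert", "qwerABxyz", "qweraBxyz"):
    print(candidate, has_ab_at_5_6(candidate))
```

Unlike contains(), this rejects words where 'ab' appears at some other position, and it also rejects mixed-case pairs such as 'aB', just as the comparison against 'AB' and 'ab' does.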

Xpath with htmlagilitypack

I am trying to select the "string b" text node using XPath with the HtmlAgilityPack.
<div>
string a<br/>
string b<br/>
string c<br/>
</div>
I am not sure how to select the text.
This doesn't work: //div/text(1)
Does anybody have any suggestions?
There are two problems with your expression:
XPath starts counting at 1, so you want the second text node.
text() is a node test that does not accept arguments. To limit the selection to the second text node, use the predicate [position() = 2] or its short form [2].
Use this expression:
//div/text()[2]
Selecting text nodes can be a hassle: whether leading and trailing whitespace is trimmed and whether whitespace-only text nodes are omitted is implementation-dependent.
Try:
//div/br[1]/following-sibling::text()[1]
This selects the text node that directly follows the first br.
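To illustrate which text nodes exist in that div, here is a sketch using Python's standard xml.etree.ElementTree rather than HtmlAgilityPack, so it is only an analogy. ElementTree has no text() step; it stores text nodes as the parent's .text and each child's .tail. The helper text_nodes is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
string a<br/>
string b<br/>
string c<br/>
</div>"""

def text_nodes(elem):
    """Yield the direct text-node children of elem in document order."""
    if elem.text is not None:
        yield elem.text          # text before the first child element
    for child in elem:
        if child.tail is not None:
            yield child.tail     # text that follows each child element

div = ET.fromstring(SAMPLE)
nodes = list(text_nodes(div))

second = nodes[1]                      # counterpart of //div/text()[2]
after_first_br = div.find("br").tail   # counterpart of br[1]/following-sibling::text()[1]
print(repr(second), repr(after_first_br))
```

Both routes reach the same node here, and the value still carries its leading line break, which is the whitespace hassle mentioned above. The list also ends with a whitespace-only node after the last br, which some parsers drop.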

Regex in xpath?

I want to find a table cell that contains the link (\d{0,3} )?pieces.
How would I need to write this xpath?
Can I simply insert the xpath directly into the Capybara search? Or do I need to do something special to indicate it is a regex? Or can I not do it at all?
XPath 1.0
XPath 1.0 does not include regular expression support. You should be able to achieve the desired match with the following expression:
//td/a['pieces'=substring(@href, string-length(@href) -
string-length('pieces') + 1) and
'pieces'=translate(@href, '0123456789', '') and
string-length(@href) > 5 and
string-length(@href) < 10]
The first test in the predicate checks that the string ends with pieces. The second test ensures that the entire string equals pieces when all of the digits are removed (i.e. there are no other characters). The final two tests ensure that the entire length of the string is between 6 and 9, which is the length of pieces plus zero to three digits.
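The four tests can be mirrored in a short Python sketch to see which values pass. The function name href_matches and the sample values are assumptions for illustration; str.translate with a deletion table plays the role of XPath's translate(..., '0123456789', '').

```python
# Translation table that deletes every ASCII digit, like translate(s, '0123456789', '').
DROP_DIGITS = str.maketrans("", "", "0123456789")

def href_matches(href):
    """Mirror the four XPath 1.0 tests on a single attribute value."""
    suffix = "pieces"
    ends_with_suffix = href.endswith(suffix)                # test 1: ends with 'pieces'
    only_digits_extra = href.translate(DROP_DIGITS) == suffix  # test 2: nothing but digits besides
    long_enough = len(href) > 5                             # test 3: at least 6 characters
    short_enough = len(href) < 10                           # test 4: at most 9 characters
    return ends_with_suffix and only_digits_extra and long_enough and short_enough

for value in ("pieces", "7pieces", "123pieces", "1234pieces", "xpieces", "pieces12"):
    print(value, href_matches(value))
```

Each test covers a gap left by the others: the digit-stripping test alone would accept 'pieces12', and the suffix test alone would accept 'xpieces'.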
Test it on the following document:
<table>
<tr>
<td>test0</td>
<td>no match</td>
<td>no match</td>
<td>test1</td>
<td>test2</td>
<td>no match</td>
<td>test3</td>
</tr>
</table>
It should match only the test0, test1, test2, and test3 links.
(Note: The expression may be further complicated by the possibility of other characters preceding the portion you're attempting to match.)
XPath 2.0
Achieving this in XPath 2.0 is trivial with the matches function.
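As a rough illustration of what a regex-based test accepts, here is a Python sketch using the pattern from the question, (\d{0,3} )?pieces. The function name is_pieces_link is hypothetical, and whether the pattern should be applied to the link text or to @href is an assumption left open here. In XPath 2.0 the pattern passed to matches() would need explicit ^ and $ anchors, because matches() succeeds when any substring matches.

```python
import re

# Pattern taken from the question; fullmatch anchors it at both ends.
LINK_TEXT = re.compile(r"(\d{0,3} )?pieces")

def is_pieces_link(text):
    """True if text is 'pieces', optionally preceded by up to 3 digits and a space."""
    return LINK_TEXT.fullmatch(text) is not None

for value in ("pieces", "12 pieces", "123 pieces", "1234 pieces", "12pieces"):
    print(repr(value), is_pieces_link(value))
```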
//td/a[
substring-after(concat(@href ,'x') ,'pieces')='x'
and
111>=concat(0 ,translate( substring-before(@href ,'pieces') ,'0123456789 -.' ,'1111111111xxx'))
]
This is another solution, not necessarily better, but, perhaps, interesting.
The first conjunct is true just when @href contains exactly one occurrence of 'pieces', and it is at the end.
The second conjunct is true just when the part of @href before 'pieces' is empty or is a numeral made entirely of digits (no '.', '-', or whitespace), with at most 3 digits.
The number of 1's in the '111>=' is the maximum number of digits that will match.
Reference: http://www.w3.org/TR/xpath
The substring-after function returns the substring of the first argument string that follows the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string.
The substring-before function returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string.
... a string that consists of optional whitespace followed by an optional minus sign followed by a Number followed by whitespace is converted to the IEEE 754 number ... any other string is converted to NaN
Number ::= Digits ('.' Digits?)? | '.' Digits
An attribute node has a string-value. The string-value is the normalized value as specified by the XML Recommendation [XML]
The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.
