Traversing and searching with XPath

My code:
"/root/pharagraph/sentence[" + y + "]/sequence/word"
which is the same as
"/root/pharagraph[1]/sentence[" + y + "]/sequence/word"
The problem is that I want something like:
"/root/pharagraph[*]/sentence[" + y + "]/sequence/word"
So my XPath searches for sentence y in the first pharagraph, but I want to search for sentence y in all pharagraphs.

No. Your first XPath expression is already equivalent to your hypothetical one (the third XPath): a step without a positional predicate matches every pharagraph element, not just the first. If you get only the first matched element from the first XPath, then the problem is in the code that executes the XPath, not in the XPath itself. For example, in .NET this happens when you call SelectSingleNode() where you need SelectNodes() to execute the XPath.
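The same distinction can be reproduced outside .NET. The sketch below uses Python's standard xml.etree.ElementTree, where find() plays the role of the single-node call and findall() the role of the node-set call. The sample document and its word values are made up for illustration, and ElementTree paths are written relative to the root element rather than starting with /root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names come from the question.
XML = """
<root>
  <pharagraph>
    <sentence><sequence><word>p1s1</word></sequence></sentence>
    <sentence><sequence><word>p1s2</word></sequence></sentence>
  </pharagraph>
  <pharagraph>
    <sentence><sequence><word>p2s1</word></sequence></sentence>
    <sentence><sequence><word>p2s2</word></sequence></sentence>
  </pharagraph>
</root>
"""

root = ET.fromstring(XML.strip())
y = 2

# No positional predicate on pharagraph, so every pharagraph is considered.
path = "./pharagraph/sentence[" + str(y) + "]/sequence/word"

first_only = root.find(path)      # single-node call: stops at the first match
all_matches = root.findall(path)  # node-set call: returns every match

print(first_only.text)                 # p1s2
print([w.text for w in all_matches])   # ['p1s2', 'p2s2']
```

The path string is identical in both calls; only the method used to evaluate it decides whether one word or all of them come back.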

Related

Extract last word using Xpath 1.0

I need to select only the last word using xpath 1.0. I have something like this:
<Example>
<Ctry> Portugal PT </Ctry>
</Example>
I want to select only the PT word, but the content is not fixed, e.g. <Ctry> Portugal - Lisbon - PT </Ctry>. The word I want to extract is always the last one.
I've already tried:
//*[name()='Example'][substring(., string-length(.) - string-length('PT')+1) = 'PT']/text()
but it always extracts the whole string.
Can anyone help me, please?
You're selecting a node and using the substring only as a predicate to filter out other nodes. If you want the substring to be your output, it shouldn't go inside the brackets:
substring(//*[name()='Example'], string-length(//*[name()='Example']) - string-length('PT')+1)
Note that /text() can be omitted when working with string functions.
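To check the position arithmetic, here is a small Python sketch that mimics the XPath 1.0 functions involved. The helper names are hypothetical stand-ins, not a real XPath engine. It also applies normalize-space first, which is an addition of mine: the sample text has leading and trailing spaces, and without trimming them the last two characters would not be PT.

```python
def normalize_space(s: str) -> str:
    """Mimics XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(s.split())


def substring_from(s: str, start: int) -> str:
    """Mimics XPath substring(s, start) for integer start (1-based)."""
    return s[max(start, 1) - 1:]


def last_two_chars_like_xpath(text: str, suffix: str = "PT") -> str:
    s = normalize_space(text)
    # Same arithmetic as the answer:
    # string-length(s) - string-length('PT') + 1
    start = len(s) - len(suffix) + 1
    return substring_from(s, start)


print(last_two_chars_like_xpath(" Portugal PT "))
print(last_two_chars_like_xpath(" Portugal - Lisbon - PT "))
```

As in the XPath, this relies on knowing the length of the last word in advance; it takes the final two characters rather than searching for the last space.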

Selecting up to second space / first two words in a text()

I am having trouble getting my head around how to achieve the following. I have gotten this far:
//*[@id="main"]/div[2]/section/div[2]/h1/span[1][starts-with(.,"IDENTIFIER")]/following::span[1]/text()
This will return a response such as:
Foo1 Foo2 Foo3 Foo4
I am trying to make this return only Foo1 & Foo2, where Foo1 & Foo2 can be any length of characters and there may be any number of additional Foo's following them.
I have tried
substring-before(//*[@id="main"]/div[2]/section/div[2]/h1/span[1][starts-with(.,"IDENTIFIER")]/following::span[1]/text(), ' ')
to extract up to the first space, but I have hit a wall and can't see what I am doing wrong.
I am using the XPath within a Scrapy spider. Any help is appreciated.
Example with:
<table>
<td>Pierre Paul Jacques Marie Maurice Jeanne</td>
</table>
XPath expression:
substring(//td,1,string-length(substring-before(//td," "))+string-length(substring-before(substring-after(//td," ")," "))+1)
Output:
Pierre Paul
The XPath works in three steps. First, we get the length of the first term with two functions (substring-before and string-length), using the space as the delimiter. Then we get the length of the second term with three functions (substring-after, substring-before, and string-length), again using the space as the delimiter. Finally, we use substring to extract what we need. Syntax: substring(content of the element, starting position (1), number of characters to keep (length of the first term + length of the second term + 1 for the space between them)).
You can replace //td with your XPath selector (remove the /text() at the end and try to find a shorter expression).
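The three steps can be traced with a short Python sketch. The helper functions are hypothetical stand-ins that mimic the XPath 1.0 string functions for simple integer arguments; they are only meant to make the length arithmetic visible, not to replace an XPath engine.

```python
def substring_before(s: str, sep: str) -> str:
    """Mimics XPath substring-before(): '' when sep is absent."""
    i = s.find(sep)
    return s[:i] if i >= 0 else ""


def substring_after(s: str, sep: str) -> str:
    """Mimics XPath substring-after(): '' when sep is absent."""
    i = s.find(sep)
    return s[i + len(sep):] if i >= 0 else ""


def substring(s: str, start: int, length: int) -> str:
    """Mimics XPath substring(s, start, length) for integers (1-based)."""
    first = max(start, 1)
    end = start + length  # exclusive, still 1-based
    return s[first - 1:max(end - 1, 0)]


td = "Pierre Paul Jacques Marie Maurice Jeanne"

len_first = len(substring_before(td, " "))                         # "Pierre"
len_second = len(substring_before(substring_after(td, " "), " "))  # "Paul"

# +1 accounts for the single space between the two words.
first_two = substring(td, 1, len_first + len_second + 1)
print(first_two)  # Pierre Paul
```

Like the XPath, this assumes the text has at least three space-separated words and exactly one space between the first two.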

XPath 1.0 using arithmetic operators

Let's say we have this:
<a href="www.something/page/1">something</a>
Now, is there a way to return the @href as "www.something/page/2"? Basically, I want to return the @href value, but with the substring-after(.,"page/") part incremented by 1. I've been trying something like
//a/@href[number(substring-after(.,"page/"))+1]
but it doesn't work, and I don't think I can use
//a/@href/number(substring-after(.,"page/"))+1
It's not precisely a paging thing where I could use the pagination; I just picked that as an example. The point is just to find a way to increment a value in XPath 1.0. Any help?
What you can do is
concat(
translate(//a/@href, '0123456789', ''),
translate(//a/@href, translate(//a/@href, '0123456789', ''), '') + 1
)
That concatenates the href attribute with all digits removed and the sum of 1 and the href with everything but digits removed.
That might suffice if all digits in your URLs occur at the end of the URL. But generally XPath 1.0 is good at selecting nodes in your input and bad at constructing new values based on parts of node values.
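To see why the position of the digits matters, here is a Python sketch of the same double-translate trick. The function names and the second sample URL are made up for illustration; remove_chars only mimics translate(s, chars, '').

```python
DIGITS = "0123456789"


def remove_chars(s: str, chars: str) -> str:
    """Mimics XPath translate(s, chars, ''): drop every char found in chars."""
    return "".join(c for c in s if c not in chars)


def increment_by_translate(href: str) -> str:
    non_digits = remove_chars(href, DIGITS)       # href with digits removed
    only_digits = remove_chars(href, non_digits)  # href with non-digits removed
    # Assumes at least one digit is present (XPath would yield NaN otherwise).
    return non_digits + str(int(only_digits) + 1)


print(increment_by_translate("www.something/page/1"))
print(increment_by_translate("www2.something/page/1"))
```

The first call gives www.something/page/2. The second gives www.something/page/22, because the 2 in the host name is merged with the page number, which is exactly the limitation described above.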
There is a simpler way to achieve this: take the substring after 'page/', add 1, and then munge it all back together.
This XPath assumes the current node is the @href attribute:
concat(substring-before(.,'page/'),
'page/',
substring-after(.,'page/')+1
)
Your order of operations is a little, well, out of order. Use something like this:
substring-after(//a/@href, 'page/') + 1
Note that it is not necessary to explicitly convert the string value to a number. From the spec:
The numeric operators convert their operands to numbers as if by
calling the number function.
Putting it all together:
concat(
substring-before(//a/@href, 'page/'),
'page/',
substring-after(//a/@href, 'page/') + 1)
Result:
www.something/page/2
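For comparison, the same split, increment, and reassemble logic looks like this in Python. This is only a sketch of the idea: increment_page is a hypothetical helper, and int() stands in for XPath's implicit string-to-number conversion.

```python
def increment_page(href: str, marker: str = "page/") -> str:
    # partition() returns (text before marker, marker, text after marker),
    # which covers substring-before and substring-after in one call.
    before, found, after = href.partition(marker)
    if not found:
        raise ValueError("marker not found in href: " + marker)
    # XPath converts the operand of + to a number automatically;
    # in Python the conversion has to be spelled out.
    return before + marker + str(int(after) + 1)


print(increment_page("www.something/page/1"))   # www.something/page/2
print(increment_page("www2.something/page/9"))  # www2.something/page/10
```

Unlike the translate-based version, digits elsewhere in the URL are left alone, since only the part after the marker is treated as a number.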

XPath: replace every other whitespace

I'd like to replace every other (odd?) space with x. The result should be:
axb axb axb axb axb
I tried something like:
replace ("a b a b a b a b" , " " , "x")[position() mod 2 = 0]
-- but with no result.
First of all: fn:replace requires an XPath 2.0 (or XQuery) compatible query processor.
You cannot use fn:replace with a predicate like this. There is no array-like access to characters in XPath (like you're used to from, e.g., C). You could probably also solve this using fn:tokenize and a for loop, but that gets rather complicated.
Your query did not return any result because fn:replace returns exactly one item (a sequence holding a single string) at position 1, while the predicate keeps only items at even positions.
Use a regular expression instead. This expression matches a run of non-space characters (\S+), whitespace (\s+), and another run of non-space characters with optional trailing whitespace, and replaces each match with a version that has x in between. The star quantifier at the end is important so that the final pair, which has no trailing space, still matches (as in your example).
replace("a b a b a b a b" , "(\S+)\s+(\S+\s*)", "$1x$2")
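If you want to try the pattern without an XPath 2.0 processor at hand, the same regular expression behaves the same way in Python's re module. This is just a sketch for experimenting with the pattern; the function name is made up, and note that Python writes the group references as \1 and \2 where XPath uses $1 and $2.

```python
import re


def replace_every_other_space(text: str) -> str:
    # Same pattern as the fn:replace call: word, whitespace, word,
    # optional trailing whitespace. The first gap becomes "x",
    # the second gap (inside group 2) is kept as-is.
    return re.sub(r"(\S+)\s+(\S+\s*)", r"\1x\2", text)


print(replace_every_other_space("a b a b a b a b"))  # axb axb axb axb
print(replace_every_other_space("a b a"))            # axb a
```

The second call shows what happens with an odd number of words: the last word has no partner, so it is left untouched.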

XPath 2.0: reference earlier context in another part of the XPath expression

In an XPath I would like to focus on certain elements and analyse them:
...
<field>aaa</field>
...
<field>bbb</field>
...
<field>aaa (1)</field>
...
<field>aaa (2)</field>
...
<field>ccc</field>
...
<field>ddd (7)</field>
I want to find the elements whose text content (apart from a possible enumeration) is unique. In the above example that would be bbb, ccc and ddd.
The following XPath gives me the distinct values:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
Now I would like to extend that and run another XPath on all the distinct values: count how many fields start with each of them and retrieve the ones whose count is bigger than 1.
A match could be a field whose content is equal to that particular value, or one that starts with that value followed by " (". The problem is that in the second part of that XPath I would have to refer to the context of that part itself and to the former context at the same time.
In the following XPath I will use c_outer and c_inner instead of "." as the context:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))[count(//field[(c_inner = c_outer) or starts-with(c_inner, concat(c_outer, ' ('))]) > 1]
I can't use "." for both, for obvious reasons. But how could I reference a particular distinct value, or the current one, from the outer expression within the inner expression?
Would that even be possible?
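XQuery can do it, e.g.:
for $s in distinct-values(
    //field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., ' ('))
where count(//field[(. = $s) or starts-with(., concat($s, ' ('))]) > 1
return $s
Here the cut is made at ' (' rather than '(' so that $s carries no trailing space; otherwise neither comparison in the where clause could match.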
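The role of $s, a name for the outer value that stays visible inside the inner predicate, may be easier to see in a general-purpose language. Below is a Python sketch of the same logic using only the standard library. The wrapping <doc> element is invented to make the sample parseable, and the base name is cut at " (" so that it carries no trailing space, which is an assumption about what the question intends.

```python
import re
import xml.etree.ElementTree as ET

# Sample fields from the question, wrapped in a hypothetical <doc> root.
XML = """<doc>
  <field>aaa</field>
  <field>bbb</field>
  <field>aaa (1)</field>
  <field>aaa (2)</field>
  <field>ccc</field>
  <field>ddd (7)</field>
</doc>"""

# Whitespace-normalized text of every field.
fields = [" ".join((f.text or "").split())
          for f in ET.fromstring(XML).iter("field")]

# Outer expression: distinct base names of fields ending in " (digit)".
bases = []
for text in fields:
    if re.search(r" \([0-9]\)$", text):
        base = text.split(" (", 1)[0]
        if base not in bases:
            bases.append(base)

# for $s in bases where count(matching fields) > 1 return $s
duplicated = [
    s for s in bases
    if sum(1 for t in fields if t == s or t.startswith(s + " (")) > 1
]

print(bases)       # ['aaa', 'ddd']
print(duplicated)  # ['aaa']
```

The loop variable s does the job that c_outer was meant to do in the question, while t corresponds to c_inner.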
