Finding the position index of a comment() - xpath

Faced with this:
<div>
some text
<!-- this is the hook comment-->
target part 1
target part 2
<!-- this is another comment-->
some other text
</div>
I'm trying to get to the desired output of:
target part 1
target part 2
The number of comments and text elements is unknown, but the target text always comes after the comment containing hook. So the idea is to find the position() of the relevant comment(), and get the next element.
There are some previous questions about finding the position of an element containing a certain text or by attribute, but comment() is an odd duck and I can't modify the answers there to this situation. For example, trying a variation on the answers:
//comment()[contains(string(),'hook')]/preceding::*
or using preceding-sibling::*, returns nothing.
So I decided to try something else. A count(//node()) of the xml returns 6. And //node()[2] returns the relevant comment(). But when I try to get the position of that comment by using index-of() (which should return 2)
index-of(//node(),//comment()[contains(string(),'hook')])
it returns 3!
Of course, I can disregard that and use the 3 index position as the position for the target text (instead of incrementing 2 by 1), but I was wondering, first, why is the outcome what it is and, second, does it have any unintended consequences.

There is no need to find the position() of the nodes first if you want the nodes between two comments (note that position() is always relative to the node set you selected).
You can select the nodes directly; here they are text() nodes. So a sample file like
<?xml version="1.0" encoding="UTF-8"?>
<root>
<div>
some text
<!-- this is the hook comment-->
target part 1
target part 2
<!-- this is another comment-->
some other text
<!-- this is another comment-->
no one needs this
<!-- this is another comment-->
this is also useless
<!-- this is another hook comment-->
second target text
<!-- this is another comment-->
again some useless crap
<!-- this is another comment-->
and the last piece that noone needs
</div>
</root>
can be queried with the following expression
//comment()[contains(string(),'hook')]/following-sibling::text()[preceding-sibling::comment()[1][contains(string(),'hook')]]
to result in
target part 1
target part 2
second target text
If you only want the first block, restrict the expression to the first item:
(//comment()[contains(string(),'hook')]/following-sibling::text()[preceding-sibling::comment()[1][contains(string(),'hook')]])[1]
Its result is
target part 1
target part 2
as desired.
If you can use XPath 2.0, you can append /position() to the expressions above to get positions. But, as mentioned above, these are relative to the sequence you selected, not to all the nodes in the document. So the result for the first expression would be 1 2.
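Outside an XPath engine, the same selection (text whose nearest preceding comment sibling contains 'hook') can be sketched with Python's standard library. This is an illustration only, not the XPath answer itself: ElementTree's path support does not cover comment() or the sibling axes, so the sketch walks the children directly, and it assumes Python 3.8+ for insert_comments=True. The helper name text_after_hook_comments is made up for this example. ElementTree stores the text that follows a node in that node's tail, so the text right after a hook comment is the comment's tail.

```python
import xml.etree.ElementTree as ET

XML = """<div>
some text
<!-- this is the hook comment-->
target part 1
target part 2
<!-- this is another comment-->
some other text
</div>"""

# Comments are dropped by default; this parser keeps them in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
div = ET.fromstring(XML, parser=parser)


def text_after_hook_comments(parent, needle="hook"):
    """Return the text directly following each child comment containing `needle`."""
    found = []
    for child in parent:
        # Comment nodes are elements whose tag is the ET.Comment factory.
        if child.tag is ET.Comment and needle in (child.text or ""):
            found.append((child.tail or "").strip())
    return found


targets = text_after_hook_comments(div)
print(targets)
```

Because the tail ends at the next sibling node, the text after "this is another comment" is never picked up, which mirrors the preceding-sibling::comment()[1] check in the expression above.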

Related

How do I count the XML properties using xpath in ruby?

I have this XML:
<SPEECH>
<SPEAKER>ADAM</SPEAKER>
<LINE>Yonder comes my master, your brother.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>ORLANDO</SPEAKER>
<LINE>Go apart, Adam, and thou shalt hear how he will</LINE>
<LINE>shake me up.</LINE>
</SPEECH>
<STAGEDIR>Enter OLIVER</STAGEDIR>
<SPEECH>
<SPEAKER>ADAM</SPEAKER>
<LINE>Now, sir! what make you here?</LINE>
</SPEECH>
How do I count how many lines a SPEAKER with the text ADAM has in total?
I tried something like this:
@source.xpath("//SPEAKER[//*[contains(text(), 'ADAM')]]//LINE")
I'm not familiar with Ruby, but the XPath to get all LINE elements from SPEAKER named "ADAM" would be:
//SPEECH[SPEAKER='ADAM']/LINE
or, if you want to use contains() instead of an exact match for SPEAKER:
//SPEECH[contains(SPEAKER, 'ADAM')]/LINE
Brief explanation:
//SPEECH: find SPEECH elements anywhere in the document...
[contains(SPEAKER, 'ADAM')]: ...where its child element SPEAKER contains text 'ADAM'
/LINE: from such SPEECH elements, select child element LINE
A few problems in your attempted XPath:
//*[contains(text(), 'ADAM')] matches any element in the entire XML document that contains the text 'ADAM', not just elements within SPEAKER, because a path that starts with / is evaluated from the document root. You should, at least, add . at the beginning to make it relative to the current SPEAKER.
LINE is not a descendant of SPEAKER, so //SPEAKER[...]//LINE will not match any element in the XML above.
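The exact-match expression can be tried quickly with Python's standard library, as a rough sketch rather than a Ruby solution. It assumes the fragment is wrapped in a hypothetical <PLAY> root element so that it parses as one document. ElementTree supports the [SPEAKER='ADAM'] child-text predicate but not contains(), so only the first variant is shown. In a full XPath engine, count(//SPEECH[SPEAKER='ADAM']/LINE) would return the number directly.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a hypothetical <PLAY> root.
XML = """<PLAY>
<SPEECH>
<SPEAKER>ADAM</SPEAKER>
<LINE>Yonder comes my master, your brother.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>ORLANDO</SPEAKER>
<LINE>Go apart, Adam, and thou shalt hear how he will</LINE>
<LINE>shake me up.</LINE>
</SPEECH>
<STAGEDIR>Enter OLIVER</STAGEDIR>
<SPEECH>
<SPEAKER>ADAM</SPEAKER>
<LINE>Now, sir! what make you here?</LINE>
</SPEECH>
</PLAY>"""

root = ET.fromstring(XML)

# ElementTree paths are relative, hence the leading "." before "//".
adam_lines = root.findall(".//SPEECH[SPEAKER='ADAM']/LINE")

print(len(adam_lines))  # number of LINE elements spoken by ADAM
for line in adam_lines:
    print(line.text)
```

ORLANDO's first line mentions "Adam" in its text but is not counted, because the predicate tests the SPEAKER child of each SPEECH rather than any text in the document.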

Select XML Node by position

I have the following XML structure
<Root>
<BundleItem>
<Item>1</Item>
<Item>2</Item>
<Item>3</Item>
</BundleItem>
<Item>4</Item>
<Item>5</Item>
<Item>6</Item>
<BundleItem>
<Item>7</Item>
<Item>8</Item>
<Item>9</Item>
</BundleItem>
</Root>
And by providing the following xPath
//Item[1]
I am selecting
<Item>1</Item>
<Item>4</Item>
<Item>7</Item>
My goal is to select only <Item>1</Item> or <Item>7</Item>, regardless of the parent element in which they are found, depending only on the position I provide in the XPath.
Is it possible to do that using only the position, without providing additional criteria in the XPath?
//Item[1] selects every <Item/> that is the first <Item/> child of its parent, whatever that parent is.
To get the two items you are looking for you could use //Item[text() = 1 or text() = 7].
A good tutorial can be found at w3schools.com and you can play with XPath expressions over your XML input here. (I am not affiliated with either of these resources but find them useful.)
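The per-parent behaviour of //Item[1] can be reproduced with Python's standard library. The sketch below is illustrative only. It also shows an alternative that the answer above does not mention: collecting all Item elements in document order and indexing that flat list, which corresponds to the parenthesized XPath forms (//Item)[1] and (//Item)[7], where the position applies to the whole result rather than to each parent.

```python
import xml.etree.ElementTree as ET

XML = """<Root>
<BundleItem>
<Item>1</Item>
<Item>2</Item>
<Item>3</Item>
</BundleItem>
<Item>4</Item>
<Item>5</Item>
<Item>6</Item>
<BundleItem>
<Item>7</Item>
<Item>8</Item>
<Item>9</Item>
</BundleItem>
</Root>"""

root = ET.fromstring(XML)

# Position predicate: evaluated separately for each parent element.
first_per_parent = [item.text for item in root.findall(".//Item[1]")]

# Flat list in document order, indexed afterwards (Python indexes from 0).
all_items = root.findall(".//Item")
first_overall = all_items[0].text    # like (//Item)[1]
seventh_overall = all_items[6].text  # like (//Item)[7]

print(first_per_parent, first_overall, seventh_overall)
```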

XPATH Select All Attributes attr Except One On Specific Element elem

I was selecting all attributes id and everything was going nicely then one day requirements changed and now I have to select all except one!
Given the following example:
<root>
<structs id="123">
<struct>
<comp>
<data id="asd"/>
</comp>
</struct>
</structs>
</root>
I want to select all attributes id except the one at /root/structs/struct/comp/data
Please note that the Xml could be different.
Meaning, what I really want is: given any Xml tree, I want to select all attributes id except the one on element /root/structs/struct/comp/data
I tried the following:
//@id[not(ancestor::struct)] It kinda worked but I want to provide a full xpath to the ancestor axis which I couldn't
//@id[not(contains(name(), 'data'))] It didn't work because name selector returns the name of the underlying node which is the attribute not its parent element
The following should achieve what you're describing:
//@id[not(parent::data/parent::comp/parent::struct/parent::structs/parent::root)]
As you can see, it simply checks from bottom to top whether the id attribute's parent matches the path root/structs/struct/comp/data.
I think this should be sufficient for your needs, but it does not 100% ensure that the parent is at the path /root/structs/struct/comp/data because it could be, for example, at the path /someOtherHigherRoot/root/structs/struct/comp/data. I'm guessing that's not a possible scenario in your XML structure, but if you had to check for that, you could do this:
//@id[not(parent::data/parent::comp/parent::struct/parent::structs/parent::root[not(parent::*)])]
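The bottom-to-top check can also be sketched procedurally in Python's standard library. This is not the XPath above but a different technique with the same intent: ElementTree cannot select attribute nodes, so the sketch builds a child-to-parent map, reconstructs each element's absolute path, and skips the one excluded path. The names parent_of, path_of and EXCLUDED are made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<structs id="123">
<struct>
<comp>
<data id="asd"/>
</comp>
</struct>
</structs>
</root>"""

root = ET.fromstring(XML)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_of(elem):
    """Return the absolute element path, e.g. /root/structs."""
    parts = []
    while elem is not None:
        parts.append(elem.tag)
        elem = parent_of.get(elem)  # None once the document root is passed
    return "/" + "/".join(reversed(parts))


EXCLUDED = "/root/structs/struct/comp/data"

ids = [
    elem.get("id")
    for elem in root.iter()
    if "id" in elem.attrib and path_of(elem) != EXCLUDED
]
print(ids)
```

Because the path is rebuilt all the way up to the document root, this version behaves like the stricter second expression: an element at /someOtherHigherRoot/root/structs/struct/comp/data would not be excluded.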

xpath - matching value of child in current node with value of element in parent

Edit: I think I found the answer, but I'll leave the question open for a bit to see if someone has a correction/improvement.
I'm using xpath in Talend's etl tool. I have xml like this:
<root>
<employee>
<benefits>
<benefit>
<benefitname>CDE</benefitname>
<benefit_start>2/3/2004</benefit_start>
</benefit>
<benefit>
<benefitname>ABC</benefitname>
<benefit_start>1/1/2001</benefit_start>
</benefit>
</benefits>
<dependent>
<benefits>
<benefit>
<benefitname>ABC</benefitname>
</benefit>
</dependent>
When parsing benefits for dependents, I want to get elements present in the employee's
benefit element. So in the example above, I want to get 1/1/2001 for the dependent's
start date. I want 1/1/2001, not 2/3/2004, because the dependent's benefit has benefitname ABC, matching the employee's benefit with the same benefitname.
What xpath, relative to /root/employee/dependent/benefits/benefit, will yield the value of
benefit_start for the benefit under parent employee that has the same benefit name as the
dependent benefit name? (Note: I don't know ahead of time what the literal value will be, so I can't just look for 'ABC'; I have to match whatever value is in the dependent's benefitname element.)
I'm trying:
../../../benefits/benefit[benefitname=??what??]/benefit_start
I don't know how to refer to the current node's ancestor in the middle of
the xpath (since I think "." at the point where I have ??what?? will refer to
the benefit node of the employee/benefits).
EDIT: I think what I want is "current()/benefitname" where the ??what?? is. Seems to work with saxon, I haven't tried it in the etl tool yet.
Your XML is malformed, and I don't think you've described your situation very well (the XPath you're trying has a bunch of ../../s at the beginning, but you haven't said what the context node is, whether you're iterating through certain nodes, or what).
Supposing the current context node were an employee element, you could select benefit_starts that match dependent benefits with
benefits/benefit[benefitname = ../../dependent/benefits/benefit/benefitname]
/benefit_start
If the current context node is a benefit element in a dependents section, and you want to get the corresponding benefit_start for just the current benefit element, you can do:
../../../benefits/benefit[benefitname = current()/benefitname]/benefit_start
Which is what I think you've already discovered.
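The matching step can be sketched in Python's standard library. Two assumptions apply: the XML is a repaired version of the sample (the missing </benefits>, </employee> and </root> closing tags are added here by guesswork), and the benefit names contain no single quotes, since the name is spliced into the path string. ElementTree has no current() function, so the dependent's benefitname is read first and then used in the predicate, which is the same idea as the second expression.

```python
import xml.etree.ElementTree as ET

# Repaired sample: the closing tags missing from the question are assumed.
XML = """<root>
<employee>
<benefits>
<benefit>
<benefitname>CDE</benefitname>
<benefit_start>2/3/2004</benefit_start>
</benefit>
<benefit>
<benefitname>ABC</benefitname>
<benefit_start>1/1/2001</benefit_start>
</benefit>
</benefits>
<dependent>
<benefits>
<benefit>
<benefitname>ABC</benefitname>
</benefit>
</benefits>
</dependent>
</employee>
</root>"""

root = ET.fromstring(XML)
employee = root.find("employee")

starts = []
for dep_benefit in employee.findall("dependent/benefits/benefit"):
    # Plays the role of current()/benefitname in the expression above.
    name = dep_benefit.findtext("benefitname")
    # Only the employee's own benefits: "benefits" must be a direct child.
    match = employee.find(f"benefits/benefit[benefitname='{name}']/benefit_start")
    starts.append((name, match.text if match is not None else None))

print(starts)
```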

Modify XPath to return second of two values

I have an XPath that returns two items. I want to modify it so that it returns only the second, or the last if there are more than 2.
//a[@rel='next']
I tried
//a[@rel='next'][2]
but that doesn't return anything at all. How can I rewrite the xpath so I get only the 2nd link?
Found the answer in
XPATH : finding an attribute node (and only one)
In my case the right XPath would be
(//a[@rel='next'])[last()]
EDIT (by Tomalak) - Explanation:
This selects all a[@rel='next'] nodes, and takes the last of the entire set:
(//a[@rel='next'])[last()]
This selects all a[@rel='next'] nodes that are the respective last a[@rel='next'] of the parent context each of them is in:
//a[@rel='next'][last()] equivalent: //a[@rel='next'][position()=last()]
This selects all a[@rel='next'] nodes that are the second a[@rel='next'] of the parent context each of them is in (in your case, each parent context had only one a[@rel='next'], that's why you did not get anything back):
//a[@rel='next'][2] equivalent: //a[@rel='next'][position()=2]
For the sake of completeness: This selects all a nodes that are the last of the parent context each of them is in, and of them only those that have @rel='next' (XPath predicates are applied from left to right!):
//a[last()][@rel='next'] equivalent: //a[position()=last() and @rel='next'] (NOT equivalent to //a[@rel='next'][last()]!)
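The difference between counting over the whole result and counting per parent can be imitated in plain Python, as a rough sketch. The page below is hypothetical (the question does not show its markup); it assumes two pager blocks that each hold exactly one rel="next" link. The position logic is written out by hand instead of relying on ElementTree's limited predicate support, and is_next is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: each pager <div> holds exactly one rel="next" link.
XML = """<html><body>
<div class="pager"><a rel="prev" href="/p1">prev</a><a rel="next" href="/p3-top">next</a></div>
<p>content</p>
<div class="pager"><a rel="prev" href="/p1">prev</a><a rel="next" href="/p3-bottom">next</a></div>
</body></html>"""

root = ET.fromstring(XML)


def is_next(elem):
    """True for <a rel="next"> elements."""
    return elem.tag == "a" and elem.get("rel") == "next"


# Like (//a[@rel='next'])[last()]: gather every match, then index the whole list.
all_next = [elem for elem in root.iter() if is_next(elem)]
last_overall = all_next[-1].get("href")

# Like //a[@rel='next'][2]: the position is counted among the matches of each parent.
second_per_parent = []
for parent in root.iter():
    matches = [child for child in parent if is_next(child)]
    if len(matches) >= 2:
        second_per_parent.append(matches[1].get("href"))

print(last_overall)
print(second_per_parent)
```

With one matching link per parent, the per-parent variant finds nothing, which is the situation described in the question, while indexing the combined list returns the last link on the page.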

Resources