Select `text()` that 1) precede a given node but 2) are also descendants of another given node - xpath

This is a follow-up to an earlier question, but unfortunately the answer there doesn't apply.
Say I have the following XML:
<body>
<div id="global-header">
header
</div>
<div id="a">
<h3>some title</h3>
<p>text 1
<b>bold</b>
</p>
<div>
<p>abc</p>
<p>text 2</p>
<p>def</p>
</div>
</div>
</body>
I want to:
1. find the <p> node whose value is "text 2" (assume there is exactly one such <p>), and then
2. find all the nodes that precede this particular <p> but are also descendants of the <div id='a'> node (you can use something like [@id='a'] to locate it), and finally
3. extract text() from step 2.
The desired output should look like:
some title
text 1
bold
abc
The caveat is that
the preceding nodes may contain arbitrary node type, not only <h3> and <p>.
the <p>text 2</p> node may be embedded arbitrarily deep in the tree, hence an XPath like .//p[text()="text 2"]/preceding-sibling::* would only extract <p>abc</p> and leave out the others.

You can try this XPath expression:
//p[.='text 2']/preceding::text()[ancestor::div[@id='a']]
The disadvantage of this approach is that the text() nodes may not be clearly separated, but rather merged for the sub-elements. To separate them, you'd need some kind of for-loop.
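The kind of loop mentioned above could look like the following minimal sketch. It assumes Python's standard `xml.etree.ElementTree`, whose XPath subset has no `preceding` axis, so the axis is emulated by walking `<div id='a'>` in document order and stopping at the target. The helper name `texts_before` is hypothetical, and the target test compares the stripped text of the `<p>` rather than its full string value.

```python
import xml.etree.ElementTree as ET

XML = """<body>
<div id="global-header">
header
</div>
<div id="a">
<h3>some title</h3>
<p>text 1
<b>bold</b>
</p>
<div>
<p>abc</p>
<p>text 2</p>
<p>def</p>
</div>
</div>
</body>"""


def texts_before(scope, is_target):
    """Collect non-blank text inside `scope` that precedes the target element."""
    found = []

    def walk(elem):
        # Returns True once the target is reached, which stops the whole walk.
        if is_target(elem):
            return True
        if elem.text and elem.text.strip():
            found.append(elem.text.strip())
        for child in elem:
            if walk(child):
                return True
            # child.tail is the text that follows the child inside elem.
            if child.tail and child.tail.strip():
                found.append(child.tail.strip())
        return False

    walk(scope)
    return found


root = ET.fromstring(XML)
div_a = root.find(".//div[@id='a']")
result = texts_before(
    div_a, lambda e: e.tag == "p" and (e.text or "").strip() == "text 2"
)
print(result)
```

Each text node is kept as its own list entry, and whitespace-only nodes are dropped, which the plain XPath expression would not do.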

Related

Getting single element with similar xpaths but with different same level, "neighboring" node

I'm trying to get the XPath of an element whose XPath is similar to that of other elements, but which has a different "neighbor" element. Please see the example below.
<div>
<div id='a'> </div>
<span> Text here </span> #this is what i'm trying to get
</div>
<div>
<div id='b'> </div>
<span> Text here </span>
</div>
I tried using //div//span, but this gives me both spans. So I tried //div//child::div[@id='a']//ancestor::div//child::span, but it looks unpleasant and repetitive. Is there a better way to write this?
Try
//div[div[@id='a']]/span
It says: get the span child of every div that has a child div with an @id equal to 'a'.
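A quick way to check the idea is the sketch below, assuming Python's standard `xml.etree.ElementTree`. As far as I know its limited XPath subset does not accept the nested predicate, so the sketch uses the roughly equivalent path `.//div[@id='a']/../span` (go to the marker div, step up to its parent, take the span children). Unlike the original expression, this does not require the parent to be a div. The `<root>` wrapper and the changed text of the second span are assumptions added so the two spans can be told apart.

```python
import xml.etree.ElementTree as ET

# The <root> wrapper and "Other text" are assumptions for this illustration.
XML = """<root>
<div>
<div id='a'> </div>
<span> Text here </span>
</div>
<div>
<div id='b'> </div>
<span> Other text </span>
</div>
</root>"""

root = ET.fromstring(XML)

# Find the marker div, move to its parent, then select the span children.
spans = root.findall(".//div[@id='a']/../span")
texts = [s.text.strip() for s in spans]
print(texts)
```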

xpath:how to find a node that not contains text?

I have a html like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div which does not contain text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. And if I don't know the text that the other divs have, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
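You need a not() statement + normalize-space():
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']
Note that in XPath 1.0, normalize-space(text()) only looks at the first text node child, so with these two forms a div whose non-whitespace text appears only after a child element would still match.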
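The "no directly contained non-whitespace text" rule can also be checked outside XPath. Here is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`, where the direct text of an element is its `.text` plus the `.tail` of each child. The `<root>` wrapper and the helper name `has_direct_text` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The <root> wrapper is an assumption; the question elides the surrounding markup.
XML = """<root>
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
</root>"""


def has_direct_text(elem):
    """True if elem has a non-whitespace text node as a direct child."""
    # Direct text is elem.text plus the tail that follows each child element.
    pieces = [elem.text] + [child.tail for child in elem]
    return any(piece and piece.strip() for piece in pieces)


root = ET.fromstring(XML)
grids = root.findall(".//div[@class='grid']")

# Positions (in document order) of the grid divs with no direct text.
without_text = [i for i, d in enumerate(grids) if not has_direct_text(d)]
print(without_text)
```

Because every direct text piece is inspected, this matches the behaviour of the `not(text()[normalize-space(.)])` form rather than the first-text-node-only forms.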

join all text from nodes xpath

Hello I have some html file:
<div class="text">
<p></p>
<p>text in p2</p>
<p></p>
<p>text in p4</p>
</div>
and other are like:
<div class="text">
<p>text in p1</p>
<p></p>
<p>text in p3</p>
<p></p>
</div>
My query is: (in rapidminer)
//h:div[contains(@class,'inside')]/h:div[contains(@class,'text')]/h:p/node()/text()
but it returns only the first <p>.
My question is: how can I join all the text in the <p> elements into a single string?
Thank you
I will limit my expressions to the HTML snippets you provided, so I cut off the first few axis steps.
First, this query should not return any result, as the paragraph nodes do not have any child nodes other than text nodes, and text nodes have no children of their own.
//h:div[contains(@class,'text')]/h:p/node()/text()
To access all text nodes, you should use something like
//h:div[contains(@class,'text')]/h:p/text()
Joining strings depends heavily on the XPath version you're able to use. If rapidminer provides XPath 2.0 (it probably does not), you're lucky and can use string-join(...), which joins all the strings into a single one. In XPath 2.0 the separator is a required second argument; pass an empty string to join without one:
string-join(//h:div[contains(@class,'text')]/h:p/text(), '')
If you're stuck with XPath 1.0, you can only do this for a fixed number of strings, by enumerating all of them. I added the newlines for readability; remove them if you want to:
concat(
//h:div[contains(@class,'text')]/h:p[1]/text(),
//h:div[contains(@class,'text')]/h:p[2]/text(),
//h:div[contains(@class,'text')]/h:p[3]/text(),
//h:div[contains(@class,'text')]/h:p[4]/text()
)
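If the text nodes can be post-processed outside XPath, the join is easy to do in the host language. The sketch below is only an illustration and assumes Python's standard `xml.etree.ElementTree` rather than rapidminer; the `h:` namespace prefix is left out, and the single-space separator is an assumption (the XPath versions above join without one).

```python
import xml.etree.ElementTree as ET

# First snippet from the question, without the h: namespace (an assumption).
XML = """<div class="text">
<p></p>
<p>text in p2</p>
<p></p>
<p>text in p4</p>
</div>"""

div = ET.fromstring(XML)

# p.text is None for an empty <p>, so those paragraphs are skipped.
joined = " ".join(p.text for p in div.findall("p") if p.text)
print(joined)
```

This works for any number of paragraphs, which is the limitation of the fixed-arity concat() approach.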

xquery/xpath- how to get number of descendant nodes of a particular type

Take a look at the sample XML below--
<div id="main">
<div id="1">
Some random text
</div>
<div id="2">
Some random text
</div>
<div id="3">
Some random text
</div>
<p> Some more random text</p>
<div id="4">
Some random text
</div>
</div>
Now, how do I find out the number of divs within the main div using Xquery? And how to do this in XPath?
You can use the following XPath:
count(div[@id="main"]/div)
The function count does the counting, the main div is selected by its id.
The XPath expressions below can be used both in XPath and XQuery. This is so, because XPath (2.0) is a proper subset of XQuery.
Use:
count(/*//div)
If "the main div" isn't the top element of the XML document, and it is the only div whose id attribute has the string value "main", use:
count((//div[@id='main'])[1]//div)
If it is guaranteed that the div children of the "main div" don't have div descendants, use:
count((//div[@id='main'])[1]/div)
Do note: the XPath pseudo-operator // can be very inefficient, so try to avoid it whenever the structure of the XML document is statically known and specific paths can be used.
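The difference between counting children (`/div`) and counting descendants (`//div`) can be seen in a short sketch, assuming Python's standard `xml.etree.ElementTree` and a slightly compacted copy of the sample document in which the main div is the top element.

```python
import xml.etree.ElementTree as ET

XML = """<div id="main">
<div id="1">Some random text</div>
<div id="2">Some random text</div>
<div id="3">Some random text</div>
<p> Some more random text</p>
<div id="4">Some random text</div>
</div>"""

main = ET.fromstring(XML)

# "div" matches direct children only; ".//div" matches all descendants
# (the starting element itself is not included).
child_divs = len(main.findall("div"))
descendant_divs = len(main.findall(".//div"))
print(child_divs, descendant_divs)
```

For this sample both counts agree, because none of the inner divs contains a further div.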

Using XPath expression how can i get the first text node immediately following a node?

I want to get to the exact node having this text: 'Company'. Once I get to this node I want to get to the next text node immediately following this node because this contains the company name. How can I do this with Xpath?
Fragment of XML is:
<div id="jobsummary">
<div id="jobsummary_content">
<h2>Job Summary</h2>
<dl>
<dt>Company</dt>
<!-- the following element is the one I'm looking for -->
<dd><span class="wrappable">Pinpoint IT Services, LLC</span></dd>
<dt>Location</dt>
<dd><span class="wrappable">Newport News, VA</span></dd>
<dt>Industries</dt>
<dd><span class="wrappable">All</span></dd>
<dt>Job Type</dt>
<dd class="multipledd"><span class="wrappable">Full Time</span></dd><dd class="multipleddlast"><span class="wrappable"> Employee</span></dd>
</dl>
</div>
</div>
I got to the Company tag with the following XPath: //*[text()= 'Company']
Now I want to get to the next text node. My XML is dynamic, so I can't hardcode the node type (like <dd>) to get the company value. But it is certain that the value will be in the immediately following text node.
So how can I get to the text node immediately after the node whose text is Company?
If you cannot hardcode any part of the following-sibling node your xpath should look like this:
//*[text()='Company']/following::*/*/text()
assuming that the desired text is always enclosed in another element like span.
To test for given dt text, modify your xpath to
//*[text()='Company' or text()='Company:' or text()='Company Name']/following::*/*/text()
Use //*[text()='Company']/following-sibling::dd[1] to get the next dd.
You can also add further conditions on that dd and navigate deeper into it.
following-sibling::elementName selects the following siblings with that name under the same parent; the [1] predicate keeps only the nearest one.
Without the [1], the expression returns every dd that follows 'Company'.
The text is in the span, so you might try
//*[text()='Company']/following-sibling::dd[1]/span
As another example, say you also want the Industries text that belongs to the currently selected 'Company'.
Starting from //*[text()='Company'],
you can modify it like this: //*[text()='Company']/following-sibling::dt[text()='Industries']/following-sibling::dd[1]/span
Of course, instead of hardcoding the values for text(), you can use variables.
You can use XPathNavigator and step through the nodes one by one.
I think XPathNavigator::MoveToNext is the method you are looking for.
There is sample code as well at:
http://msdn.microsoft.com/en-us/library/9yxc3x24.aspx
Use this general XPath expression that selects the wanted text node even when it is wrapped in statically unknown markup elements:
(//*[text()='Company']/following-sibling::*[1]//text())[1]
When this XPath expression is evaluated against the provided XML document:
<div id="jobsummary">
<div id="jobsummary_content">
<h2>Job Summary</h2>
<dl>
<dt>Company</dt>
<!-- the following element is the one I'm looking for -->
<dd><span class="wrappable">Pinpoint IT Services, LLC</span></dd>
<dt>Location</dt>
<dd><span class="wrappable">Newport News, VA</span></dd>
<dt>Industries</dt>
<dd><span class="wrappable">All</span></dd>
<dt>Job Type</dt>
<dd class="multipledd"><span class="wrappable">Full Time</span></dd><dd class="multipleddlast"><span class="wrappable"> Employee</span></dd>
</dl>
</div>
</div>
exactly the wanted text node is selected:
Pinpoint IT Services, LLC
Even if we change the XML to this one:
<div id="jobsummary">
<div id="jobsummary_content">
<h2>Job Summary</h2>
<div>
<p>Company</p>
<!-- the following element is the one I'm looking for -->
<dd><span class="wrappable"><b><i><u>Pinpoint IT Services, LLC</u></i></b></span></dd>
<dt>Location</dt>
<dd><span class="wrappable">Newport News, VA</span></dd>
<dt>Industries</dt>
<dd><span class="wrappable">All</span></dd>
<dt>Job Type</dt>
<dd class="multipledd"><span class="wrappable">Full Time</span></dd><dd class="multipleddlast"><span class="wrappable"> Employee</span></dd>
</div>
</div>
</div>
the XPath expression above still selects the wanted text node:
Pinpoint IT Services, LLC
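For readers who want to try the same idea without a full XPath engine, here is a minimal sketch assuming Python's standard `xml.etree.ElementTree`, which lacks the `following-sibling` axis. It finds the element whose text is the label, takes the next sibling element, and returns the first non-blank text inside it, however deeply it is wrapped. The helper name `first_text_after` and the shortened sample are assumptions; unlike the XPath expression, blank text nodes are skipped.

```python
import xml.etree.ElementTree as ET

# Shortened sample with extra wrapping around the company name (an assumption).
XML = """<dl>
<dt>Company</dt>
<dd><span class="wrappable"><b>Pinpoint IT Services, LLC</b></span></dd>
<dt>Location</dt>
<dd><span class="wrappable">Newport News, VA</span></dd>
</dl>"""


def first_text_after(root, label):
    """Return the first non-blank text inside the sibling following the labelled element."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children[:-1]):
            if (child.text or "").strip() == label:
                # itertext() yields text in document order, at any depth.
                for piece in children[i + 1].itertext():
                    if piece.strip():
                        return piece.strip()
    return None


root = ET.fromstring(XML)
company = first_text_after(root, "Company")
location = first_text_after(root, "Location")
print(company)
print(location)
```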

Resources