Detect first non-empty element - xpath

After reading the most relevant XPath questions about detecting empty nodes, I still cannot find the first non-empty element. The dataset looks like:
<div>
<p>
<elem> </elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem>   </elem>
</p>
<p>
<elem>Application</elem>
</p>
<p>
<elem>Other text that should not be detected.</elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem>Second application</elem>
</p>
</div>
Basically, the empty elements should not be taken into account, and we only want to detect the first Application element. We've been testing a lot with normalize-space and related functions but cannot get this working.
The main problem is the empty elements. The check we have right now solves the positioning flawlessly, but fails once the HTML contains &nbsp; elements:
/div/p[position() < 3]//*[normalize-space()='Application']
So, how can we ignore empty elements? Is this only possible via an additional step in between?

In my definition an empty element does not have any child nodes, so //*[not(node())] would select all empty elements by that definition. If you want to allow certain text content, you could remove those characters and then check normalize-space, e.g. //*[not(*) and not(normalize-space(translate(., '&#160;', '')))]. Basically, you need to list every character you want to remove in the second argument of the translate call before checking with normalize-space. The XPath expression as written works inside XSLT, where the numeric character reference is parsed by an XML parser; in general, how to escape such characters depends on the host language you use XPath with.
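The same filtering idea can be sketched outside XPath. The example below is a minimal illustration in Python, assuming the "empty" elements hold only ordinary or non-breaking spaces (U+00A0); the sample markup and the helper names `is_blank` and `first_non_blank` are hypothetical. The standard library's ElementTree implements only a subset of XPath (no normalize-space or translate), so the whitespace test is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question; \u00a0 is a non-breaking space.
SAMPLE = (
    "<div>"
    "<p><elem>\u00a0</elem></p>"
    "<p><elem> \u00a0 </elem></p>"
    "<p><elem>Application</elem></p>"
    "<p><elem>Other text that should not be detected.</elem></p>"
    "<p><elem>\u00a0</elem></p>"
    "<p><elem>Second application</elem></p>"
    "</div>"
)


def is_blank(elem):
    """True if elem has no child elements and only (non-breaking) whitespace."""
    # Replacing U+00A0 mirrors the translate() step of the XPath expression.
    text = "".join(elem.itertext()).replace("\u00a0", " ")
    return len(elem) == 0 and text.strip() == ""


def first_non_blank(root, tag):
    """Return the first `tag` element in document order that is not blank."""
    for elem in root.iter(tag):
        if not is_blank(elem):
            return elem
    return None


root = ET.fromstring(SAMPLE)
first = first_non_blank(root, "elem")
print(first.text)  # Application
```

Filtering first and taking position afterwards is the point here: the blank elements are dropped before "first" is evaluated, which is the step the positional predicate in the question skips.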

Related

how to remove everything after specific text with xpath

I am trying to set up a Telegram Instant View for a website.
I have something like this code and want to remove everything after the "remove from here" text:
<p> sample text <p> test</p> remove from here <p>test text</p> </p>
How can I access all text and nodes after this specific text ("remove from here") and remove them?
Update:
I want to have this result:
<p> sample text <p> test</p> remove from here</p>
To access the nodes after this specific text, you can use the following-sibling axis from XPath, which selects the nodes on the same level after the one you selected.
Then use the @remove function from the Instant View DSL:
$selected_node: //text()[normalize-space()="remove from here"]
@remove: $selected_node/following-sibling::*
You may want to be more specific with $selected_node. Depending on your needs, you may also want to restrict which following siblings are removed, for example following-sibling::p to remove only paragraphs, or use following-sibling::node() to include text nodes as well as elements.
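To see what "every following sibling of the marker text" means on the sample markup, here is a minimal sketch in Python using the standard library's minidom rather than the Instant View DSL. The helper name `remove_after_text` is hypothetical, and the sketch assumes the marker is a direct text child of the outer element.

```python
from xml.dom import minidom

SAMPLE = "<p> sample text <p> test</p> remove from here <p>test text</p> </p>"


def remove_after_text(parent, marker):
    """Remove every sibling that follows the text node whose trimmed content is marker."""
    for node in list(parent.childNodes):
        if node.nodeType == node.TEXT_NODE and node.data.strip() == marker:
            sibling = node.nextSibling
            while sibling is not None:
                # Remember the next node first: removeChild() clears the links.
                following = sibling.nextSibling
                parent.removeChild(sibling)
                sibling = following
            return True
    return False


doc = minidom.parseString(SAMPLE)
outer = doc.documentElement
found = remove_after_text(outer, "remove from here")
print(outer.toxml())
```

Unlike following-sibling::*, this loop also drops trailing text nodes, which corresponds to following-sibling::node() in XPath terms.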

Xpath extraction between two nodes

I have several web pages that have the following structure
<div id="w_33086" class = "eight columns">
<h2 id="about">About<span itemprop="name">Name of Place</span></h2>
<p style="margin-top: 0px;">
.
.
.
<p class="contactAdvisor"><a href="http://www.example.com/contact">
The text and markup between the two paragraphs is quite variable, as it is created by individual users. In some cases it includes markup and in some cases it does not. When it does include markup, the markup can be quite variable.
I'm trying to select all of the text and markup between these two <p> elements but have not been successful.
The best result I've achieved comes from //div[@id='w_33086']/node()
However, that is dropping the <p> tags when those are present. It also picks up the <h2> tag and the <p class="contactAdvisor"> that I would rather exclude.
I'm using Google Sheets (and/or Screaming Frog) to apply the XPath.
If the two <p> elements are siblings of each other, then you could try
//div[@id='w_33086']/p[@style="margin-top: 0px;"]/
following-sibling::node()[following-sibling::p[@class="contactAdvisor"]]
This is assuming that there is only one <p> under <div id="w_33086"> that has the style or class attribute value (respectively) that we're using to identify them.
Please note that XPath does not select any "markup". It selects nodes such as text nodes, elements (which have descendants attached), and attributes. Those nodes can be serialized as markup, but that's not XPath's business.
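The "siblings between two markers" idea can also be illustrated with a minimal Python sketch. The sample below is a hypothetical stand-in for the page (the real content between the markers varies), and `elements_between` is an illustrative helper name. It uses the standard library's ElementTree, which exposes child elements only, so bare text nodes between the markers are not returned here, unlike with following-sibling::node().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; the content between the markers is made up.
SAMPLE = """<div id="w_33086" class="eight columns">
<h2 id="about">About<span itemprop="name">Name of Place</span></h2>
<p style="margin-top: 0px;">Intro</p>
<ul><li>first</li></ul>
<p>Body <b>text</b></p>
<p class="contactAdvisor"><a href="http://www.example.com/contact">Contact</a></p>
</div>"""


def elements_between(parent, start, end):
    """Return the child elements that lie strictly between start and end."""
    children = list(parent)
    return children[children.index(start) + 1 : children.index(end)]


root = ET.fromstring(SAMPLE)
start = root.find("p[@style='margin-top: 0px;']")
end = root.find("p[@class='contactAdvisor']")
between = elements_between(root, start, end)

# Serializing the selected nodes is a separate step from selecting them.
print([ET.tostring(e, encoding="unicode").strip() for e in between])
```

The final line reflects the distinction made above: the selection yields nodes, and turning them back into markup is the job of the serializer.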

Trouble accessing a text with XPath query

I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this, but I get an error message saying it is not able to locate the element:
"//div[@id='overview']/strong[position()=2]/following-sibling"
I also tried this; I get the div with id=sub, but not the text (correctly so):
"//div[@id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there any way to get the text, other than doing some string matching or regex on the contents of the overview div?
Thanks.
following-sibling is the axis, you still need to specify the actual node (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node with ::.
Try this:
//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[@id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.
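For comparison, here is a minimal sketch of the same lookup in Python's standard library ElementTree. ElementTree has no text() node test; it stores the text that follows an element in that element's `tail` attribute, which plays the role of following-sibling::text()[1] in this case. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>"""

root = ET.fromstring(SAMPLE)
second_strong = root.findall("strong")[1]

# .tail holds the text between this element's end tag and the next sibling's
# start tag, including the surrounding newlines, so it is stripped here.
wanted = (second_strong.tail or "").strip()
print(wanted)  # TEXT I NEED IS HERE
```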

xpath accessing information in nodes

I need to scrape information from a website containing property details.
<div class="inner">
<div class="col">
<h2>House in Digana </h2>
<div class="meta">
<div class="date"></div>
<span class="category">Houses</span>,
<span class="location">Kandy</span>
</div>
</div>
<div class="attr polar">
<span class="data">Rs. 3,600,000</span>
</div>
What is the XPath notation for "Kandy" and "Rs. 3,600,000"?
It is not wise to address text nodes directly using text() because of nuances in an XML document.
Rather, addressing an element node directly returns the concatenation of all descendant text nodes as the element value, which is what people usually want (and think they are getting when they address text nodes).
The canonical example I use in the classroom is this example of OCR'ed content as XML:
<cost>39<!--that 9 may be an 8-->.22</cost>
The value of the element using the XPath address cost is "39.22", but in XSLT 1.0 the value of the XPath address cost/text() is "39", which is not complete. In XSLT 2.0 (which is how the question is tagged), you get two text nodes, "39" and ".22", which look correct if you concatenate them. But if you pass them to a function requiring a singleton argument, you will get a run-time error. When you address an element, the text returned is concatenated into a single string, which is suitable for a singleton argument.
I tell students that in all of my professional work there are only very (very!) few times that I ever have to use text() in my stylesheets.
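The difference between the first text node and the element's string value can be reproduced with a minimal Python sketch. It uses the standard library's minidom, which keeps comment nodes (so the text really is split in two), rather than an XSLT processor; the variable names are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString("<cost>39<!--that 9 may be an 8-->.22</cost>")
cost = doc.documentElement

# Only the first text node child, comparable to what cost/text() yields
# when XPath 1.0 converts the node set to a string.
first_text = cost.firstChild.data

# All text node children concatenated. This element has no child elements,
# so this equals its string value.
string_value = "".join(
    node.data for node in cost.childNodes if node.nodeType == node.TEXT_NODE
)

print(first_text)    # 39
print(string_value)  # 39.22
```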
So //span[@class='location' or @class='data'] would find the two fields if those were the only such elements in the entire document. You may need to use ".//span" from a location inside of the document tree.
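A minimal Python sketch of addressing the span elements and reading their string values follows. The closing </div> of the outer element is added as an assumption so the fragment parses, and since the standard library's ElementTree does not support "or" in predicates, the class test is done in Python.

```python
import xml.etree.ElementTree as ET

# The final </div> is an assumption; the snippet in the question is cut off.
SAMPLE = """<div class="inner">
<div class="col">
<h2>House in Digana </h2>
<div class="meta">
<div class="date"></div>
<span class="category">Houses</span>,
<span class="location">Kandy</span>
</div>
</div>
<div class="attr polar">
<span class="data">Rs. 3,600,000</span>
</div>
</div>"""

root = ET.fromstring(SAMPLE)
wanted_classes = {"location", "data"}

# itertext() concatenates all descendant text, i.e. the element's string value.
values = [
    "".join(span.itertext())
    for span in root.iter("span")
    if span.get("class") in wanted_classes
]
print(values)  # ['Kandy', 'Rs. 3,600,000']
```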

Selenium: Extracting only Text with out any sub elements from <p>

Below is the sample code
<p>
I want this Text
<sup> not this </sup>
.(Need this too).
<sup> and not this </sup>
</p>
Using Selenium RC, selenium.getText("//...") returns all the text, including the text inside the <sup> elements.
Is there any way to get the text from <p> without the <sup> content?
Please let me know. Thanks.
Your only option is to get the text of the three elements and strip out the parts you don't want. That, or resort to using getEval() to run some JavaScript that gets the <P> element's innerHTML property, then remove the parts inside the <SUP> elements yourself.
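The "keep the element's own text, drop the child elements" step can be sketched as post-processing on the markup. The example below is a minimal illustration in Python's standard library and does not use the Selenium RC API; it assumes you have already obtained the <p> markup as a well-formed string, and `own_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p>
I want this Text
<sup> not this </sup>
.(Need this too).
<sup> and not this </sup>
</p>"""


def own_text(elem):
    """Return elem's direct text, skipping everything inside child elements."""
    # elem.text is the text before the first child; each child's .tail is the
    # text that follows that child inside elem.
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    # Collapse runs of whitespace and newlines into single spaces.
    return " ".join(" ".join(parts).split())


p = ET.fromstring(SAMPLE)
print(own_text(p))  # I want this Text .(Need this too).
```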

Resources