how to remove everything after specific text with xpath - xpath

I am trying to set up a Telegram Instant View for a website.
I have something like this code and want to remove everything after the "remove from here" text:
<p> sample text <p> test</p> remove from here <p>test text</p> </p>
How can I access all the text and nodes after this specific text ("remove from here") and remove them?
Update:
I want to have this result:
<p> sample text <p> test</p> remove from here</p>

how can i access every text/nodes after this specific text
You can use the following-sibling axis from XPath to access the nodes on the same level after the one you selected.
Then use the @remove function from the Instant View DSL:
$selected_node: //text()[normalize-space()="remove from here"]
@remove: $selected_node/following-sibling::*
You may want to be more specific with $selected_node. Note that following-sibling::* matches only element siblings; use following-sibling::node() to include text nodes as well. Depending on your needs, you may want to add predicates to remove only certain types of following siblings, for example: following-sibling::node()[self::* or self::text()], which matches elements and text nodes but not comments.
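To check the intended result outside Instant View, the same idea can be sketched in Python. The standard-library xml.dom.minidom has no XPath engine, so this sketch emulates the following-sibling step by walking nextSibling, which covers text nodes as well as elements. The helper name remove_after_marker is hypothetical, and the sample markup is assumed to parse as well-formed XML.

```python
from xml.dom import minidom

def remove_after_marker(xml_text, marker):
    """Remove every sibling node that follows the text node equal to `marker`."""
    doc = minidom.parseString(xml_text)

    def walk(node):
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE and child.data.strip() == marker:
                # Drop all later siblings (elements and text nodes alike).
                sibling = child.nextSibling
                while sibling is not None:
                    nxt = sibling.nextSibling
                    node.removeChild(sibling)
                    sibling = nxt
                # Trim the whitespace left after the marker text.
                child.data = child.data.rstrip()
                return True
            if walk(child):
                return True
        return False

    walk(doc.documentElement)
    return doc.documentElement.toxml()

result = remove_after_marker(
    "<p> sample text <p> test</p> remove from here <p>test text</p> </p>",
    "remove from here",
)
print(result)
```

This only illustrates which nodes are affected; in an Instant View template the removal itself is still done by the rule shown above.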

Related

Extracting links (get href values) with certain text with Xpath under a div tag with certain class

SO contributors. I am fully aware of the following question How to obtain href values from a div using xpath?, which basically deals with one part of my problem yet for some reason the solution posted there does not work in my case, so I would kindly ask for help in resolving two related issues. In the example below, I would like to get the href value of the "more" hyperlink (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with content class.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More here.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this and (2nd part) how can I extract the href value of only the "here" hyperlink?

Detect first non-empty element

After reading the most relevant XPath questions about detecting empty nodes, I still cannot find the first non-empty element. The dataset looks like:
<div>
<p>
<elem> </elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem>   </elem>
</p>
<p>
<elem>Application</elem>
</p>
<p>
<elem>Other text that should not be detected.</elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem>Second application</elem>
</p>
</div>
Basically, the empty elements should not be taken into account, and we only want to detect the first Application element. We've been testing a lot with normalize-space and related functions but cannot get this working.
The main problem is the empty elements. The check we have right now solves the positioning flawlessly, but fails once the HTML contains such elements:
/div/p[position() < 3]//*[normalize-space()='Application']
So, how can we ignore empty elements? Is this only possible via an additional step in between?
In my definition an empty element does not have any child nodes, so //*[not(node())] would select all empty elements by that definition. If you want to allow certain text content, you could check normalize-space after removing it, e.g. //*[not(*) and not(normalize-space(translate(., '&#160;', '')))]. Basically, you need to list all the characters you want to remove as the second argument of the translate call before checking with normalize-space. The XPath expression I have written would work inside XSLT, where the numeric character reference is parsed by an XML parser; in general, how to escape characters depends on the host language you use XPath with.
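As a rough illustration of that check, here is a sketch using Python's standard-library ElementTree, assuming the "empty" elements contain non-breaking spaces (U+00A0). It mirrors the translate-then-normalize-space idea in plain Python rather than evaluating the XPath itself; the helper name first_non_empty and the shortened sample data are made up for the example.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # the character written as &#160; in XML

def first_non_empty(root, tag):
    """Return the first `tag` element whose text is not blank, treating NBSP as blank."""
    for elem in root.iter(tag):
        text = "".join(elem.itertext())
        # Remove NBSP first, then collapse whitespace like normalize-space().
        cleaned = " ".join(text.replace(NBSP, "").split())
        if cleaned:
            return elem
    return None

doc = ET.fromstring(
    "<div>"
    "<p><elem>&#160;</elem></p>"
    "<p><elem> &#160; </elem></p>"
    "<p><elem>Application</elem></p>"
    "<p><elem>Other text that should not be detected.</elem></p>"
    "<p><elem>Second application</elem></p>"
    "</div>"
)
hit = first_non_empty(doc, "elem")
print(hit.text)
```

Because iteration is in document order, the first element that survives the cleaning step is the one the question is after.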

Trouble accessing a text with XPath query

I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this, but I get an error message saying the element could not be located:
"//div[@id='overview']/strong[position()=2]/following-sibling"
I also tried this; I get the div with id=sub, but not the text (correctly so):
"//div[@id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there any way to get the text, other than doing some string matching or regex on the contents of the overview div?
Thanks.
following-sibling is the axis, you still need to specify the actual node (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node with ::.
Try this:
//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[@id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.
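For a quick check without a full XPath engine, a similar lookup can be sketched with Python's standard-library ElementTree. Its limited path syntax has no text() node test, so this sketch relies on the tail attribute, which holds the text directly after an element. The markup is assumed to be well-formed XML, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

overview = ET.fromstring(
    '<div id="overview">'
    "<strong>some text</strong>"
    "<br/>"
    "some other text"
    "<strong>more text</strong>"
    "\nTEXT I NEED IS HERE\n"
    '<div id="sub">...</div>'
    "</div>"
)
# strong[2] is the second <strong> child; its tail is the text that follows it.
second_strong = overview.find("strong[2]")
wanted = (second_strong.tail or "").strip()
print(wanted)
```

The strip() call removes the surrounding newlines and whitespace that the text node carries.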

Selenium: Extracting only Text with out any sub elements from <p>

Below is the sample code
<p>
I want this Text
<sup> not this </sup>
.(Need this too).
<sup> and not this </sup>
</p>
Using Selenium RC, selenium.getText("//...") returns all the text, including the text inside the <sup> elements.
Is there any way to get the text from <p> without the <sup> contents?
Please let me know. Thanks.
Your only option is to get the text of the three elements and strip out the parts you don't want. That, or resort to using getEval() to run some JavaScript that gets the <P> element's innerHTML property, and then remove the parts inside the <SUP> elements yourself.
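The "remove the parts yourself" step could look roughly like the following Python sketch. It assumes the element's markup has already been retrieved as a string and happens to be well-formed XML; it does not show any Selenium call, and the helper name own_text is hypothetical.

```python
import xml.etree.ElementTree as ET

def own_text(elem):
    """Collect an element's direct text, skipping text inside child elements."""
    parts = [elem.text or ""]
    # Each child's tail is text that belongs to the parent, after that child.
    parts.extend(child.tail or "" for child in elem)
    # Collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parts).split())

p = ET.fromstring(
    "<p>\nI want this Text\n<sup> not this </sup>\n.(Need this too).\n"
    "<sup> and not this </sup>\n</p>"
)
text = own_text(p)
print(text)
```

Real-world HTML is often not well-formed XML, in which case an HTML parser would be needed instead of ElementTree.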

Get XPath inner text without some but not all child

I have some HTML like this
<div>
<a>link that I do not want to get</a>
<div>Div that I do not want to get</div>
Text I want to get
<br> I like brs
<b>That text I also want, because I like bold text</b>
<div>I do not want all divs</div>
</div>
And I'd like to use xpath to extract out just the
Text I want to get
<br> I like brs
<b>That text I also want, because I like bold text</b>
In other words, I want all the child nodes of the outer div except the a and div elements.
How can I do this?
You can use self::a to detect a elements, and then use not to exclude them, i.e.:
/div/node()[not(self::a or self::div)]
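The effect of that predicate can be sketched with Python's standard-library xml.dom.minidom, which has no XPath support, so the filter is written out by hand: keep every child node unless it is an element named a or div. The br is written as a self-closing tag here so the sample parses as XML; that change is an assumption of the sketch.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<div>"
    "<a>link that I do not want to get</a>"
    "<div>Div that I do not want to get</div>"
    "Text I want to get"
    "<br/> I like brs"
    "<b>That text I also want, because I like bold text</b>"
    "<div>I do not want all divs</div>"
    "</div>"
)
excluded = {"a", "div"}
kept = [
    node.toxml()
    for node in doc.documentElement.childNodes
    # Text nodes are always kept; elements are filtered by tag name.
    if not (node.nodeType == node.ELEMENT_NODE and node.tagName in excluded)
]
print("".join(kept))
```

Text nodes pass the filter because only element nodes are tested against the excluded names, which is the same reason node() rather than * is used in the XPath.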
