I have some HTML like this
<div>
<a>link that I do not want to get</a>
<div>Div that I do not want to get</div>
Text I want to get
<br> I like brs
<b>That text I also want, because I like bold text</b>
<div>I do not want all divs</div>
</div>
I'd like to use XPath to extract just this:
Text I want to get
<br> I like brs
<b>That text I also want, because I like bold text</b>
In other words, I want all child nodes of the outer div except the a and div elements.
How can I do this?
You can use self::a to detect a elements (and self::div for div elements), and then use not() to exclude them, i.e.:
/div/node()[not(self::a or self::div)]
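To see which nodes that predicate keeps, here is a minimal Python sketch. It does not run the XPath itself; it mimics the same filter with the standard library's xml.dom.minidom, which has no XPath engine. The helper name wanted_children is made up for illustration, and the <br> from the question is written as <br/> so that the snippet parses as XML.

```python
from xml.dom import minidom

# The question's markup, with <br> self-closed so it is well-formed XML.
HTML = """<div>
<a>link that I do not want to get</a>
<div>Div that I do not want to get</div>
Text I want to get
<br/> I like brs
<b>That text I also want, because I like bold text</b>
<div>I do not want all divs</div>
</div>"""


def wanted_children(parent):
    """Mimic node()[not(self::a or self::div)]: keep every child node
    (text or element) unless it is an <a> or <div> element."""
    return [
        child for child in parent.childNodes
        if not (child.nodeType == child.ELEMENT_NODE
                and child.tagName in ("a", "div"))
    ]


root = minidom.parseString(HTML).documentElement
kept = wanted_children(root)
# Serialize the surviving nodes back to markup.
result = "".join(node.toxml() for node in kept)
print(result)
```

The whitespace-only text nodes between elements are kept as well, just as node() would keep them in XPath.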
I am trying to set up a Telegram Instant View for a website.
I have something like this code and want to remove everything after the "remove from here" text:
<p> sample text <p> test</p> remove from here <p>test text</p> </p>
How can I access all the text and nodes after this specific text ("remove from here") and remove them?
Update:
I want to have this result:
<p> sample text <p> test</p> remove from here</p>
how can i access every text/nodes after this specific text
You can use following-sibling::* from XPath to access the elements on the same level after the node you selected.
Then use the @remove function from the Instant View DSL:
$selected_node: //text()[normalize-space()="remove from here"]
@remove: $selected_node/following-sibling::*
You may want to be more specific with $selected_node. Note that following-sibling::* matches elements only; use following-sibling::node() if text nodes should be removed too. Depending on your needs, you can add a predicate to remove only certain kinds of following siblings, for example: following-sibling::node()[self::p or self::text()].
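To check the idea outside Instant View, here is a rough Python sketch of the same removal. It is not the Instant View DSL; it uses the standard library's xml.dom.minidom, and the helper remove_after_marker is a made-up name. Unlike an XPath starting with //, it only looks at the direct children of the node you pass in, and it removes text nodes as well as elements.

```python
from xml.dom import minidom

HTML = "<p> sample text <p> test</p> remove from here <p>test text</p> </p>"


def remove_after_marker(parent, marker):
    """Delete every sibling (element or text) that follows the child text
    node whose whitespace-normalized content equals `marker`.
    Returns True if the marker was found."""
    for child in list(parent.childNodes):
        if (child.nodeType == child.TEXT_NODE
                and " ".join(child.data.split()) == marker):
            # Drop following siblings one by one until none are left.
            while child.nextSibling is not None:
                parent.removeChild(child.nextSibling)
            return True
    return False


root = minidom.parseString(HTML).documentElement
found = remove_after_marker(root, "remove from here")
print(root.toxml())
```

The marker text node itself is left untouched, so the whitespace around "remove from here" is still there after the removal.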
I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this, but I get an error message saying the element could not be located:
"//div[@id='overview']/strong[position()=2]/following-sibling"
I also tried this, which gives me the div with id=sub but not the text (correctly so):
"//div[@id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there any way to get the text, other than doing some string matching or regex on the contents of the overview div?
Thanks.
following-sibling is the axis; you still need to specify the node test (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node test with ::.
Try this:
//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[@id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.
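As a sanity check on what following-sibling::text()[1] picks out, here is a small Python sketch. It does not evaluate the XPath; it walks the siblings by hand with the standard library's xml.dom.minidom, and first_text_after is a made-up helper name.

```python
from xml.dom import minidom

HTML = """<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>"""


def first_text_after(element):
    """Mimic following-sibling::text()[1]: return the nearest following
    sibling that is a text node, or None if there is none."""
    sibling = element.nextSibling
    while sibling is not None:
        if sibling.nodeType == sibling.TEXT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


root = minidom.parseString(HTML).documentElement
# Equivalent of strong[position()=2]: the second <strong> child.
strongs = [c for c in root.childNodes
           if c.nodeType == c.ELEMENT_NODE and c.tagName == "strong"]
text_node = first_text_after(strongs[1])
print(repr(text_node.data))    # raw text, surrounding newlines included
print(text_node.data.strip())  # trimmed text
```

The raw node data still carries the leading and trailing newlines, so you will usually want to strip it (or wrap the XPath in normalize-space()).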
I am trying to click on the link for the site www.qualtrapharma.com after searching Google for
"qualtra", but I have a problem writing the XPath because the <cite> tag contains a <b> tag inside it. Can anyone suggest how to do this?
<div class="f kv" style="white-space:nowrap">
<cite class="vurls">
www.
<b>qualtra</b>
pharma.com/
</cite>
<div>
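You can work around this by using '.' in the XPath. It refers to the current node, and when it is compared with a string the node's string value is used, i.e. the concatenated text of the node and all of its descendants, so the nested <b> does not get in the way.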
The XPath would look like the following:
//cite[.='www.qualtrapharma.com/']
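Here is a small Python sketch of what that string value looks like, using only the standard library's xml.etree.ElementTree. It assumes the markup really contains the line breaks shown in the question (on the live page it may well be on a single line), and it closes the outer div so the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# The question's markup, with the outer <div> closed so it parses as XML.
HTML = """<div class="f kv" style="white-space:nowrap">
<cite class="vurls">
www.
<b>qualtra</b>
pharma.com/
</cite>
</div>"""

root = ET.fromstring(HTML)
cite = root.find("cite")

# XPath's string value of a node: all descendant text, concatenated.
string_value = "".join(cite.itertext())
# Drop every whitespace character to recover the bare URL text.
compact = "".join(string_value.split())

print(repr(string_value))
print(compact)
```

If the real page has whitespace inside the cite like this, an exact comparison such as .='www.qualtrapharma.com/' would not match, and something like contains(., 'qualtra') may be a safer test.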
Below is the sample code
<p>
I want this Text
<sup> not this </sup>
.(Need this too).
<sup> and not this </sup>
</p>
Using Selenium RC, selenium.getText("//...") returns all of the text, including the text inside the <sup> tags.
Is there any way to get the text from <p> without the text in the <sup> tags?
Please let me know. Thanks
Your only option is to get the text of the three elements and strip out the parts you don't want. Either that, or use getEval() to run some JavaScript that gets the <P> element's innerHTML property, then remove the parts inside the <SUP> elements yourself.
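If you end up post-processing the markup yourself, the stripping step could look roughly like this. This is a Python sketch with the standard library's xml.etree.ElementTree, not Selenium RC code, and own_text is a made-up helper name; it assumes you have already fetched the <p> markup as a string and that it is well-formed.

```python
import xml.etree.ElementTree as ET

HTML = """<p>
I want this Text
<sup> not this </sup>
.(Need this too).
<sup> and not this </sup>
</p>"""


def own_text(element):
    """Return the element's direct text, skipping text inside child elements.

    ElementTree stores the text before the first child in `.text` and the
    text after each child in that child's `.tail`.
    """
    parts = [element.text or ""]
    parts += [child.tail or "" for child in element]
    # Collapse runs of whitespace (including newlines) to single spaces.
    return " ".join(" ".join(parts).split())


p = ET.fromstring(HTML)
print(own_text(p))
```

Real-world HTML is often not well-formed XML, so a proper HTML parser may be needed before this works on a live page.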
I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?
I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack Overflow has not preserved the indenting, so I do not know where the end of the first div element is, but I'm going to guess it ends before the text.
In that case you might want
//div[@id = 'another-easy-id']/following::node()
[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]
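For anyone who wants to see what that selects, here is a Python sketch that collects the same "between the two marker divs" nodes by hand. It uses the standard library's xml.dom.minidom rather than scrapy or lxml, the helper nodes_between is a made-up name, and the sample markup is hypothetical: only the three ids come from the question, and the h1, p and span tags are placeholders.

```python
from xml.dom import minidom

# Hypothetical markup shaped like the question's outline; the tag names
# <h1>, <p> and <span> are placeholders, not taken from the real site.
HTML = """<div id="easy-id">
<h1>unwanted heading</h1>
unwanted text
<div id="another-easy-id">unwanted</div>
first wanted text
<p>wanted paragraph</p>
second wanted text
<div id="one-more-easy-id">unwanted</div>
<span>unwanted tail</span>
</div>"""


def nodes_between(parent, start_id, end_id):
    """Collect the child nodes of `parent` that lie strictly between the
    child elements whose ids are `start_id` and `end_id`."""
    collected, inside = [], False
    for child in parent.childNodes:
        is_element = child.nodeType == child.ELEMENT_NODE
        if is_element and child.getAttribute("id") == end_id:
            break  # reached the closing marker, stop collecting
        if inside:
            collected.append(child)
        elif is_element and child.getAttribute("id") == start_id:
            inside = True  # start collecting from the next sibling
    return collected


root = minidom.parseString(HTML).documentElement
between = nodes_between(root, "another-easy-id", "one-more-easy-id")
fragment = "".join(node.toxml() for node in between)
print(fragment)
```

Both text nodes and elements are collected, which is the point of using node() rather than * in the XPath.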