XPath extract plain email text - xpath

I'm trying to extract the email text from a list but without success.
In particular I've used this code
//li/div/p//*[contains(., '#')]
but strangely it doesn't work! Could you help me?
Here's the code example:
<li class="bgmp_list-item">
<h3 class="bgmp_list-placemark-title">
Name1
</h3>
<div class="bgmp_list-description">
<p class="">
<strong class="">Responsible:</strong> John Doe <br>
<strong class="">Site:</strong> <a title="www.exemple.com" href="http://www.exemple.com" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.2ld.it']);" target="_blank" class="">www.2ld.it</a>
<br>
<strong class="">Email:</strong> some_email#email.com
<br><strong class="">Address:</strong> 3, Main Street 00000, London <br>
<strong>Tel:</strong> 00 000000 <strong>Fax:</strong> 0000000
</p>
</div>

You're almost there but not quite. For the sample code the correct xpath would be
//p/text()[contains(.,'#')]
Rather than reinvent the wheel, I'll point to another answer that explains this very well.

By using p//*[contains(., '#')] you apply the predicate to the individual descendant elements of <p>, and none of them matches because the target email address text is a direct child of <p>. This is one of the reasons why the initial XPath didn't work. Applying the predicate to <p> directly should work:
//li/div/p[contains(., '#')]
but that will return the <p> element. If you need only the text node that contains the email address, apply the predicate to the individual text nodes within <p>, as mentioned in the other answer:
//li/div/p/text()[contains(., '#')]
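To try the three expressions side by side, here is a minimal sketch in Python. It assumes the third-party lxml package is installed and uses a trimmed copy of the markup from the question. The variable names are made up for illustration.

```python
from lxml import html  # assumption: the third-party lxml package is installed

# Trimmed copy of the markup from the question.
SNIPPET = """
<li class="bgmp_list-item">
  <div class="bgmp_list-description">
    <p>
      <strong>Responsible:</strong> John Doe <br>
      <strong>Email:</strong> some_email#email.com
      <br><strong>Address:</strong> 3, Main Street 00000, London <br>
    </p>
  </div>
</li>
"""

doc = html.document_fromstring(SNIPPET)

# Predicate tested against each descendant element of <p>: none contains '#'.
elements = doc.xpath("//li/div/p//*[contains(., '#')]")

# Predicate tested against <p> itself: returns the whole element.
paragraphs = doc.xpath("//li/div/p[contains(., '#')]")

# Predicate tested against each text node that is a direct child of <p>.
emails = [t.strip() for t in doc.xpath("//li/div/p/text()[contains(., '#')]")]

print(elements)                     # []
print([p.tag for p in paragraphs])  # ['p']
print(emails)                       # ['some_email#email.com']
```

The text node comes back with its surrounding whitespace, which is why the sketch calls strip() on it.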

Related

How to get specific xpath tag value

<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>
I want to scrape a site whose HTML is formatted like the snippet above. I don't want the bdi value that sits under the del tag; I want the bdi values that sit directly under span class="price" and under the ins tag. Is there an XPath that does this?
Doesn't the usual //span/ins/bdi/text() work for you?
It reads as "the text of a <bdi> whose parent is an <ins> whose parent is a <span>".
The CSS variant span>ins>bdi::text should also work, I suppose.
Sorry, I hadn't noticed that you need two values. In that case .xpath('//bdi[not(parent::del)]/text()').extract() will work well.
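As a rough check of both expressions outside Scrapy, here is a small sketch using lxml (an assumption: the answer itself uses a Scrapy-style selector, and lxml is a third-party package that has to be installed). The variable names are illustrative only.

```python
from lxml import etree  # assumption: the third-party lxml package is installed

# The markup from the question, parsed as XML since it is well-formed.
ROOT = etree.fromstring("""
<div class="container">
  <span class="price">
    <bdi> 140 </bdi>
  </span>
  <span class="price">
    <del>
      <bdi>90</bdi>
    </del>
    <ins>
      <bdi> 120 </bdi>
    </ins>
  </span>
</div>
""")

# Only the <bdi> nested as span/ins/bdi.
only_ins = [t.strip() for t in ROOT.xpath("//span/ins/bdi/text()")]

# Every <bdi> whose parent is not <del>.
not_del = [t.strip() for t in ROOT.xpath("//bdi[not(parent::del)]/text()")]

print(only_ins)  # ['120']
print(not_del)   # ['140', '120']
```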

How to select by non-direct child condition in Xpath?

I would like to show an example.
This is how the page looks:
<a class="aclass">
<div class="divclass"></div>
<div id="innerclass">
<span class="spanclass">Hello</span>
</div>
</a>
<a class="aclass">
<div class="divclass"></div>
<div id="innerclass">
<span class="spanclass">Pick Delivery Location</span>
</div>
</a>
I want to select anchor tags that have a child (direct or non-direct) span that has the text 'Hello'.
Right now, I do something like this:
//a[@class='aclass'][div/span[text() = 'Hello']]
I want to be able to select without having to spell out the direct children (div in this case), like this:
//a[@class='aclass'][//span[text() = 'Hello']]
However, the second one finds all the anchor tags with the class 'aclass' rather than the one with the span with 'Hello' text.
I hope I worded my question clearly. Please feel free to edit if necessary.
In your attempt, // goes back to the root of the document. Effectively you are saying "give me the <a> elements for which there is a matching span anywhere in the document", which is why you get them all.
What you need is the descendant axis:
//a[@class='aclass' and descendant::span[text() = 'Hello']]
Note I have joined the conditions with and, but two separate conditions would also work.
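The difference is easy to see by running both forms against the sample markup. This is a sketch that assumes the third-party lxml package is installed; the <body> wrapper and the helper span_texts are added only for the illustration. It also includes the .//span form, which is relative to the current <a> in the same way as descendant::span.

```python
from lxml import etree  # assumption: the third-party lxml package is installed

# Sample markup from the question, wrapped in a root element so it parses as XML.
DOC = etree.fromstring("""
<body>
  <a class="aclass">
    <div class="divclass"></div>
    <div id="innerclass"><span class="spanclass">Hello</span></div>
  </a>
  <a class="aclass">
    <div class="divclass"></div>
    <div id="innerclass"><span class="spanclass">Pick Delivery Location</span></div>
  </a>
</body>
""")


def span_texts(anchors):
    """Return the text of the first <span> under each matched <a>."""
    return [a.xpath("string(.//span)") for a in anchors]


# '//' inside the predicate restarts at the document root, so every <a> matches.
absolute = DOC.xpath("//a[@class='aclass'][//span[text() = 'Hello']]")

# The descendant axis is evaluated relative to each <a>.
relative = DOC.xpath("//a[@class='aclass' and descendant::span[text() = 'Hello']]")

# A leading '.' also keeps the search relative to the current <a>.
dot_form = DOC.xpath("//a[@class='aclass'][.//span[text() = 'Hello']]")

print(len(absolute))         # 2
print(span_texts(relative))  # ['Hello']
print(span_texts(dot_form))  # ['Hello']
```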

Make XPath stop at a certain depth?

I have the following HTML
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>
I want an XPath that gets only the text 09:00 and not Some more text, without using text()[1] because that causes other problems. My current XPath looks like this:
("//span[1][contains(@class, 'day-time-clock')]/text()")
I want one that ignores this whole part of the HTML
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
You can limit the level of descendant:: nodes with position().
So the following expression does work:
span/descendant::node()[2 > position()]
Adjust the number in the predicate to your needs, 2 is only an example. A disadvantage of this approach is that the counting of the descendants is only accurate for the first child in the descending tree.
Another approach is to limit both the ancestors and the descendants:
span/descendant::node()[3 > count(ancestor::*) and 1 > count(descendant::*)]
Here, too, you have to adjust the numbers in the predicates to get any useful results.
Use normalize-space() to select only the text nodes that are not pure whitespace:
//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]
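Here is a small sketch of what the normalize-space() predicate filters out, assuming the third-party lxml package is installed (the variable names are illustrative). Note that /text() only selects text nodes that are direct children of the outer span, so "Some more text", which belongs to the inner div, is never a candidate in the first place.

```python
from lxml import etree  # assumption: the third-party lxml package is installed

# The markup from the question, parsed as XML since it is well-formed.
SPAN = etree.fromstring("""
<span class="medium bold day-time-clock">
  09:00
  <div class="tooltip-box first-free-tip ">
    <div class="tooltip-box-inner">
      <span class="fa fa-clock-o"></span>
      Some more text
    </div>
  </div>
</span>
""")

BASE = "//span[contains(@class, 'day-time-clock')]/text()"

# Direct text children of the outer span: the time, plus a whitespace-only
# node between the closing </div> and the closing </span>.
all_text = SPAN.xpath(BASE)

# The predicate is false for nodes whose normalized string is empty.
non_blank = [t.strip() for t in SPAN.xpath(BASE + "[normalize-space()]")]

print(len(all_text))  # 2
print(non_blank)      # ['09:00']
```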
I think (if I understand you correctly) that
"..//div[contains(@class, 'tooltip-box')]/parent::span"
gets you there.

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div class="title">
<div class="lead">
<BBB/>
</div class="lead">
<div class="photo">
<CCC/>
</div class="photo">
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="text">
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="more_text">
<div class="other_stuff">
<DDD/>
</div class="other_stuff">
</div class="document">
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff"> including the <div class="photo"> element. I tried several ways to insert not() selector in the formula itself
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[#class="lead"]/following::*[not(self::class="photo ")]
(the same things with /preceding::* part) but they don't work. It looks like this not() method is ignored – the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It’s not an option to select from <div class="photo"> element excluding it automatically because in other documents it can appear in any position or doesn't appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and to the beginning of the whole document. Would it be better to specify an exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn’t seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
/following::*[
count(. | //*[@class="other_stuff"]/preceding::*)
=
count(//*[@class="other_stuff"]/preceding::*)
][not(self::div[@class='photo'])]
/text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure whether it would be 'better'; what I can tell you is that following::[@class="other_stuff"] is an invalid expression. You need to name the node test to which the predicate applies, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
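To see what the extra predicate does and does not remove, here is a sketch that assumes the third-party lxml package is installed. The document is a shortened, well-formed stand-in for the one in the question, the trailing /text() step is left off so the selected elements can be listed, and the helper names() is made up for the illustration. One thing it shows: not(self::...) drops the photo div itself but keeps its children; using the ancestor-or-self axis instead is one way to drop those too.

```python
from lxml import etree  # assumption: the third-party lxml package is installed

# Shortened, well-formed stand-in for the document in the question.
DOC = etree.fromstring("""
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="more_text"><FFF/></div>
  <div class="other_stuff"><GGG/></div>
</div>
""")

# Kayessian intersection of "after lead" and "before other_stuff".
BETWEEN = (
    '//*[@class="lead"]/following::*'
    '[count(. | //*[@class="other_stuff"]/preceding::*)'
    ' = count(//*[@class="other_stuff"]/preceding::*)]'
)


def names(nodes):
    """Label each element by its class attribute, or by its tag if it has none."""
    return [n.get("class") or n.tag for n in nodes]


with_photo = names(DOC.xpath(BETWEEN))
without_photo = names(DOC.xpath(BETWEEN + "[not(self::div[@class='photo'])]"))
without_subtree = names(
    DOC.xpath(BETWEEN + "[not(ancestor-or-self::div[@class='photo'])]")
)

print(with_photo)       # ['photo', 'CCC', 'text', 'DDD', 'EEE', 'more_text', 'FFF']
print(without_photo)    # ['CCC', 'text', 'DDD', 'EEE', 'more_text', 'FFF']
print(without_subtree)  # ['text', 'DDD', 'EEE', 'more_text', 'FFF']
```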

How to check box in Capybara if there are no name, id or label text?

I am a newbie here. Please advise: how do I select the checkbox in my case?
<ul class="phrases-list" style="">
<li>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
The following doesn't work for me:
When /I check box "([^\"]+)"$/ do |label|
page.check(label)
end
step: And I check box "Dog - Wikipedia, the free encyclopedia"
If you can change the html, wrap the input and span in a label element
<ul class="phrases-list" style="">
<li>
<label>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
</label>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
which has the added benefit of clicks on the "Dog - Wikipedia ..." text triggering the checkbox too. With that change your step should work as written. If you can't modify the html then things get more difficult.
Something like
find('span', text: label).find(:xpath, './preceding-sibling::input').set(true)
should work, although I'm curious how you're using these checkboxes from JS with nothing tying them to any specific value
Let's assume that you are prevented from changing the HTML. In this case, it would probably be easiest to query for the element via XPath. For example:
# Here's the XPath query
q = "//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input"
# Use the query to find the checkbox. Then, check the checkbox.
page.find(:xpath, q).set(true)
Okay - it's not as bad as it looks! Let's analyze this XPath so we can understand what it's doing:
//span
This first part says "search the entire HTML document and find all span elements". Of course, there are probably a LOT of span elements in the HTML document, so we'll need to restrict this:
//span[contains(text(), 'Dog - Wikipedia')]
Now we're only searching for the "span" elements that contain the text "Dog - Wikipedia". Presumably, this text will uniquely identify the desired "span" element on the page (if not, then just search for more of the text).
At this point, we have the "span" element that is adjacent to the desired "input" element. So, we can query for the "input" element using the "preceding-sibling::" XPath Axis:
//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input
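If you want to check the XPath on its own before wiring it into a step definition, here is a rough sketch in Python. It assumes the third-party lxml package is installed and uses a trimmed copy of the list markup. It only confirms which element the query selects; it does not involve Capybara or a browser.

```python
from lxml import html  # assumption: the third-party lxml package is installed

# Trimmed copy of the list markup from the question.
SNIPPET = """
<ul class="phrases-list">
  <li>
    <input type="checkbox" class="select-phrase">
    <span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
    (en.wikipedia.org)
  </li>
</ul>
"""

doc = html.document_fromstring(SNIPPET)

# Find the span by part of its text, then step back to the sibling input.
QUERY = "//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input"
matches = doc.xpath(QUERY)
checkbox = matches[0]

print(len(matches), checkbox.tag, checkbox.get("type"))  # 1 input checkbox
```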
