XPath: extracting the text following <br> doesn't work

My HTML:
<strong><span style="text-decoration: underline;">Prices and Rentals</span></strong>
<br>
<br>
“ Prices of …”
<br>
<br>
I want to extract the text after "Prices and Rentals" --> br --> br
Desired Extracted Text:
“ Prices of …”
My xpath selector:
(//strong/span[contains(text(), "Prices and Rentals")])[1]/br/br//text()
It doesn't seem to detect the br elements. Thank you

//strong[span[contains(., 'Prices and Rentals')]]/following-sibling::br[2]/following-sibling::text()[1] would give a text node containing
“ Prices of …”

Here's another one, admittedly convoluted, but it doesn't depend on positioning (written with parent::span so that it is valid XPath 1.0):
//*[.//span[contains(., 'Prices and Rentals')]]//text()[not(parent::span)][string-length(normalize-space())>0]

A workable answer is
//strong/span[contains(text(), 'Prices and Rentals')]/following::br[2]/following-sibling::text()

//strong//following-sibling::br[text()]
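
The original selector fails because the br elements are siblings of the strong element, not children of the span, and the wanted text is a sibling text node as well. A minimal sketch of that structure follows, using Python's standard-library xml.etree.ElementTree instead of a full XPath engine (it has no text() or following-sibling:: support, so the sibling walk is done by hand). The wrapping div root, the self-closing br tags and the helper name text_after_second_br are assumptions made only so the snippet parses and runs.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <div> root and with <br>
# written as <br/> so the standard-library XML parser accepts it.
SNIPPET = """<div><strong><span style="text-decoration: underline;">Prices and Rentals</span></strong>
<br/>
<br/>
“ Prices of …”
<br/>
<br/>
</div>"""


def text_after_second_br(root, heading):
    """Return the text that follows the second <br/> after the heading."""
    children = list(root)
    for index, child in enumerate(children):
        if child.tag == "strong" and heading in "".join(child.itertext()):
            # The <br/> elements are later siblings of <strong>,
            # not children of the <span> inside it.
            breaks = [c for c in children[index + 1:] if c.tag == "br"]
            if len(breaks) >= 2:
                # ElementTree keeps the text node after an element in .tail.
                return (breaks[1].tail or "").strip()
    return None


root = ET.fromstring(SNIPPET)
result = text_after_second_br(root, "Prices and Rentals")
print(result)
```

With a real XPath 1.0 engine the same walk is what following-sibling::br[2]/following-sibling::text()[1] expresses.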

Related

XPATH Grab following sibling and stop at the next sibling in the tree

<div class="season-rate season-summer">
<p class="heading">Summer</p>
<p class="subHeading">from</p>
<p class="price">€180,000<span>p/week + expenses</span><span class="approx">Approx
$211,500</span></p>
</div>
I am trying to grab the price here (€180,000), based on the heading text being "Summer":
//p[contains(.,'Summer')]/following-sibling::p[2]
This returns:
€180,000p/week + expensesApprox
$211,500
But I only want:
€180,000
So I want to stop the XPATH before this next span class:
<span class="approx">Approx
$211,500</span>
I am trying variations of this without any luck!
//p[contains(.,'Summer')]/following-sibling::p[2] [not(preceding-sibling::span[contains(.,'p/week')])]
You can try this expression to get the price only:
//p[.="Summer"]/following-sibling::p[@class="price"]/text()
I think this should do it:
//div[p[.="Summer"]]/p[@class="price"]/text()[not(self::span)]
or even simpler, since text() never selects the span elements:
//div[p[.="Summer"]]/p[@class="price"]/text()
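
The difference between the whole string value of the price paragraph and its own text node can be sketched with Python's standard-library xml.etree.ElementTree. This is an illustration under assumptions, not the answerers' method: ElementTree supports only a small XPath subset, so the heading check is done in Python, and the helper name price_for is hypothetical.

```python
import xml.etree.ElementTree as ET

# The markup from the question, unchanged (it is already well-formed XML).
SNIPPET = """<div class="season-rate season-summer">
<p class="heading">Summer</p>
<p class="subHeading">from</p>
<p class="price">€180,000<span>p/week + expenses</span><span class="approx">Approx
$211,500</span></p>
</div>"""


def price_for(season_div, heading):
    """Return only the text that sits directly inside p.price."""
    heading_p = season_div.find("p[@class='heading']")
    if heading_p is None or (heading_p.text or "").strip() != heading:
        return None
    price_p = season_div.find("p[@class='price']")
    if price_p is None:
        return None
    # .text is the text node before the first child element, which is
    # what /text() selects here; the <span> contents are not included.
    return (price_p.text or "").strip()


div = ET.fromstring(SNIPPET)
price = price_for(div, "Summer")
# Joining itertext() gives the full string value, spans included,
# which is what the question's first attempt returned.
full = "".join(div.find("p[@class='price']").itertext())
print(price)
print(full)
```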

How to get the whole title which consists of several spans with XPATH?

How to get the whole title:
Iphone case :) #phonecases#xmas#iphone#case
When the title does not include hashtags, I can get the whole title with this XPath:
((//*[@class='pinWrapper'])[2]//span)[1]/text()
This line:
((//*[@class='pinWrapper'])[2]//span)[1]//text()[normalize-space()]
returns only the first one: Iphone case :).
And this:
((//*[@class='pinWrapper'])[2]//span)[1][string()]
returns the whole XML:
<span>Iphone case :) <span class="pinHashtag">#phonecases</span> <span class="pinHashtag">#xmas</span> <span class="pinHashtag">#iphone</span> <span class="pinHashtag">#case</span></span>
If ((//*[@class='pinWrapper'])[2]//span)[1]/text() returns only the first text node, try
string(((//*[@class='pinWrapper'])[2]//span)[1])
to get the complete string.
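
To see why string() gives the whole title while text() gives only the first piece, here is a small sketch with Python's standard-library xml.etree.ElementTree. It is not an XPath evaluation: joining itertext() is used here as the equivalent of the element's string value, and the snippet is only the span quoted in the question, without the surrounding pinWrapper markup.

```python
import xml.etree.ElementTree as ET

# The outer span exactly as quoted in the question.
SNIPPET = (
    '<span>Iphone case :) <span class="pinHashtag">#phonecases</span> '
    '<span class="pinHashtag">#xmas</span> '
    '<span class="pinHashtag">#iphone</span> '
    '<span class="pinHashtag">#case</span></span>'
)

outer = ET.fromstring(SNIPPET)

# Only the text node before the first nested span (what /text() returned).
first_text_node = outer.text

# All descendant text concatenated in document order, like string(span).
whole_title = "".join(outer.itertext())

print(repr(first_text_node))
print(repr(whole_title))
```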

Extract all text in between two nodes using xpath for web scraping?

<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke #short </div>
<div style="clear:both;"></div>
</div>
So I am trying to extract all text after the </h2> and before the <div align="right" style=...> node.
What I have tried so far:
jokes = response.xpath('//div[@class="jokeContent"]')
for joke in jokes:
    # keep only the div's own text nodes that are not pure whitespace
    text = joke.xpath('text()[normalize-space()]').extract()
    if len(text) > 0:
        yield text
This works to some extent, but the website's HTML is inconsistent: sometimes the text is embedded in <p> TEXT </p>, sometimes in <br> TEXT </br>, and sometimes it is just TEXT.
So I thought that extracting everything after the header and before the styled div might make sense, and the filtering can then be done afterwards.
If you are looking for a literal xpath of what you are describing, it could be something like:
In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
But there's probably a more logical, cleaner solution:
In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
This selects only the paragraph tags. You said the text might sit in other tags as well; you can match several different tags with the self::tag test:
In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
Edit: apparently I missed the text directly under the div itself. This can be amended with the | (union) operator:
In [4]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
Out[4]:
[u'\n What did Ellen Degeneres say to Kathy Lee? \n ',
u'Can I be Frank with you? ',
u'Submitted by Calamjo',
u'Edited by Curtis']
The normalize-space(.) predicate is there only to drop text nodes that contain nothing but whitespace (e.g. ' \n').
You can append the first part of this xpath to any of the above and you'd get similar results.
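
The "everything after the header and before the right-aligned div" idea can also be sketched without XPath, by walking the children of the container in order. The version below uses Python's standard-library xml.etree.ElementTree rather than Scrapy selectors; the helper name text_between and its start_tag/stop_tag parameters are hypothetical, and it assumes the markup parses as XML, which this sample does.

```python
import xml.etree.ElementTree as ET

# The markup from the question, unchanged.
SNIPPET = """<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke #short </div>
<div style="clear:both;"></div>
</div>"""


def text_between(container, start_tag="h2", stop_tag="div"):
    """Collect non-empty text after start_tag and before the next stop_tag."""
    pieces = []
    collecting = False
    for child in container:
        if not collecting:
            if child.tag == start_tag:
                collecting = True
                # Bare text right after the header lives in its .tail.
                pieces.append(child.tail or "")
            continue
        if child.tag == stop_tag:
            break
        # Text inside the element (<p>TEXT</p>) ...
        pieces.append("".join(child.itertext()))
        # ... and bare text after it (<br/>TEXT or just TEXT).
        pieces.append(child.tail or "")
    return [piece.strip() for piece in pieces if piece.strip()]


joke_div = ET.fromstring(SNIPPET)
joke_lines = text_between(joke_div)
print(joke_lines)
```

Because both the element contents and the tails are collected, the same loop covers text wrapped in p tags, text following a br, and bare text; any further filtering can happen on the returned list.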

C# HtmlAgilityPack XPath: get the text except what is contained in HTML tags

<div id="Dossuuu11Plus" style="display: block; ">
Text need
<br/>
Not need
<a class="bot_link" href="http://abc.com" target="_self">http://abc.com</a>
<br/>
</div>
This is the HTML code. I use: //td[@class='textdetaildrgI
but it gets all of the content in the element; I just need "Text need". Please help me. Thanks
You could use
//div[@id='Dossuuu11Plus']/text()[1][normalize-space()]
Explanation:
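It selects the first text node child of the DIV, which in this case is Text need. Note that the [normalize-space()] predicate only keeps that node if it contains non-whitespace text; it does not trim it. To strip the leading and trailing whitespace, wrap the whole expression as normalize-space(//div[@id='Dossuuu11Plus']/text()[1]) or trim the result in code.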
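
The distinction between selecting the first text node and trimming it can be shown with a short sketch. It uses Python's standard-library xml.etree.ElementTree rather than HtmlAgilityPack, so it only illustrates the node structure, not the C# API.

```python
import xml.etree.ElementTree as ET

# The markup from the question, unchanged.
SNIPPET = """<div id="Dossuuu11Plus" style="display: block; ">
Text need
<br/>
Not need
<a class="bot_link" href="http://abc.com" target="_self">http://abc.com</a>
<br/>
</div>"""

div = ET.fromstring(SNIPPET)

# The first text node of the div: everything before the first <br/>,
# including the surrounding line breaks.
raw_first_text = div.text or ""

# Trimming is a separate step, done here in code.
first_text = raw_first_text.strip()

print(repr(raw_first_text))
print(first_text)
```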

how to access this element

I am using Watir to write some tests for a web application. I need to get the text 'Bishop' from the HTML below but can't figure out how to do it.
<div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;">
<div class="workprolabel wpFieldLabel">
<span title="Please select a courtesy title from the list.">Title</span> <span class="validationIndicator wpValidationText"></span>
</div>
<span class="wpFieldViewContent" id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value"><p class="wpFieldValue ">Bishop</p></span>
</div>
Firebug tells me the xpath is:
html/body/form/div[5]/div[6]/div[2]/div[2]/div/div/span/span/div[2]/div[4]/div[1]/span[1]/div[2]/span/p/text()
but I can't get element_by_xpath to pick it up.
You should be able to access the paragraph right away if it's unique:
my_p = browser.p(:class, "wpFieldValue ")
my_text = my_p.text
See HTML Elements Supported by Watir
Try
//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']//text()
EDIT:
Maybe this will work
path = "//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']/p"
ie.element_by_xpath(path).text
And check if the span's id is constant
Maybe you have an extra space in the end of the name?
<p class="wpFieldValue ">
Try one of these (worked for me, please notice trailing space after wpFieldValue in the first example):
browser.p(:class => "wpFieldValue ").text
#=> "Bishop"
browser.span(:id => "dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value").text
#=> "Bishop"
It seems that at run time the DIV's style changes from none to block.
In that case we can collect the text (the entire page source or just the DIV's source) and pull the value out of it.
For example:
text=ie.text
particular_div=text.scan(%r{div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;(.*)</span></div>}im).flatten.to_s
particular_div.scan(%r{<p class="wpFieldValue ">(.*)</p>}im).flatten.to_s
The code above is a sample that should solve your problem.
