print HtmlAnchor text inside <span> tags - htmlunit

I have the following HtmlAnchor:
<a href="/property/tx/coldspring/77331/82-bradford-ln/167035284">
<span itemprop="streetAddress">SOMENUMBER Bradford Ln</span>,
<span itemprop="addressLocality">Coldspring</span>,
<span itemprop="addressRegion">TX</span>
<span itemprop="postalCode">77331</span>
</a>
When I print htmlAnchor.getTextContent(), it does not print the text inside the span tags. How can I get the whole text printed?

Related

Scrapy xpath select parent element based on text value in subelement and lacking of element

I want to select all elements article that don't contain a span element with class status and where the nested a element contains a href attribute which contains the text "rent.html".
I've managed to get the a element like so:
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]')
But after reading here, I tried to select the first parent article element like so, and it returns "data=0":
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]//parent::article and not //article[@class="car"]//span[@class="status"]')
I also tried this:
response.xpath('//article[@class="car"][//a[contains(@href,"rent.html")]/article and not //article[@class="car"]//span[@class="status"]')')
I don't know what the correct expression is for my use case.
<article class="car">
<div>
<div class="container">
<a href="/34625030/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34625230/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/12325230/buy.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34632230/rent.html">
</a>
</div>
</div>
<span class="status">Rented</span>
</article>
This XPath expression will do the work:
"//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
The entire command is:
response.xpath("//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]")
Explanations:
Translating your requirements into XPath syntax:
"select all elements article" - //article
"that don't contain a span element with class status" - [not(.//span[@class='status'])]
"and where the nested a element contains a href attribute which contains the text "rent.html"" - [.//a[contains(@href,'rent.html')]]
I tested the XPath above on the shared sample XML and it worked properly.
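As a cross-check outside Scrapy, the same filter can be sketched with Python's standard library. This is only an illustration under stated assumptions: xml.etree.ElementTree supports a limited XPath subset without not() or contains(), so the two predicates are written as plain Python, and the sample is the four articles from the question wrapped in a hypothetical <root> element. The helper name wanted is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# The four sample articles, wrapped in a hypothetical <root> so they parse as one document.
SAMPLE = """
<root>
  <article class="car"><div><div class="container"><a href="/34625030/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/34625230/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/12325230/buy.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/34632230/rent.html"></a></div></div>
    <span class="status">Rented</span>
  </article>
</root>
"""

root = ET.fromstring(SAMPLE)


def wanted(article):
    """Mirror [not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]."""
    has_status = any(s.get("class") == "status" for s in article.iter("span"))
    has_rent_link = any("rent.html" in a.get("href", "") for a in article.iter("a"))
    return has_rent_link and not has_status


matches = [article for article in root.iter("article") if wanted(article)]
hrefs = [article.find(".//a").get("href") for article in matches]
print(hrefs)  # ['/34625030/rent.html', '/34625230/rent.html']
```

The "buy.html" article fails the link test and the "Rented" article fails the status test, which leaves the first two articles.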

Hugo's equivalent of PHP's strpos()

I want to break .Content up to fit in different nav-tabs. If there's a better pattern to accomplish this, please let me know.
The front matter and body of my /content/shop/product-name/index.md contain:
# summary
summary: "This is the **product's** summary which will render markdown"
---
This is the first line of the full description of the product. This section of the ./index.md
page is referenced in the `single.html` file as `.Content`.|^^|This is the next part of the
.Content that I want to throw into a different nav-tab.
Then in /layouts/shop/single.html:
<div class="row">
<div class="col-lg-12">
{{ .Params.summary }}
</div>
</div>
<div class="row">
<div class="col-lg-12">
<nav class="product-info-tabs wc-tabs mt-5 mb-5">
<div class="nav nav-tabs nav-fill" id="nav-tab" role="tablist">
<a class="nav-item nav-link active" id="nav-home-tab" data-toggle="tab" href="#nav-home" role="tab"
aria-controls="nav-home" aria-selected="true">Description</a>
<a class="nav-item nav-link" id="nav-profile-tab" data-toggle="tab" href="#nav-profile" role="tab"
aria-controls="nav-profile" aria-selected="false">Additional Information</a>
<a class="nav-item nav-link" id="nav-contact-tab" data-toggle="tab" href="#nav-contact" role="tab"
aria-controls="nav-contact" aria-selected="false">Reviews</a>
</div>
</nav>
<div class="tab-content" id="nav-tabContent">
<div class="tab-pane fade show active" id="nav-home" role="tabpanel" aria-labelledby="nav-home-tab">
{{ .Content }}
</div>
...
In days gone by, in PHP, I could use strpos(.Content, '|^^|') and then substr(.Content, 0, strpos(.Content, '|^^|')) to get a section of text. You could also split the string into an array on a user-configured delimiter with split('|^^|', .Content).
So, back in Hugo, within .Content I could have something like:
This is the content. This is the last line before being split.|^^|This is the next line, that would be in array[1] or the next indexed substr.
I'm trying to get these two sections of .Content into different tabs of the single.html page. Each product's .Content will obviously be different, so I can't rely on a consistent character count with Hugo's substr().
The problem I see with using the front matter is that, although it is rendered as markdown, it can't span multiple lines. I know I could use \n for new lines, but that defeats the benefits of markdown.
Thanks.
Sounds like you could probably just do:
{{ replace .Content "|^^|" "</div><div class='tab-pane fade show'>" | safeHTML}}
Which would turn
<div class="tab-pane fade show active" id="nav-home" role="tabpanel" aria-labelledby="nav-home-tab">
This is the content. This is the last line before being split.|^^|This is the next line, that would be in array[1] or the next indexed substr.
</div>
into
<div class="tab-pane fade show active" id="nav-home" role="tabpanel" aria-labelledby="nav-home-tab">
This is the content. This is the last line before being split.
</div>
<div class='tab-pane fade show'>
This is the next line, that would be in array[1] or the next indexed substr.
</div>

Catching partial part of text with XPath

I have been having some difficulty finding an XPath for the following HTML:
<div>
<p> pppppppp
<span class="rollover-people">
<a class="rollover-people-link">pppppp</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>xxxx</a>
<a>xxxxx</a>
</span>
</span>
</span>
</span>pppppppp
</p>ppppppppp
<div>
So basically I need everything inside the <p> up to <span class="rollover-people-block">. In other words, I want <p> but not <span class="rollover-people-block">. Is that even possible? Keep in mind that the <p> is repeated more than once on the page.
This is something close to what you are looking for:
//p//text()[not(ancestor::span[@class='rollover-people-block'])]
This will get all the text nodes under p excluding the ones which are under span class='rollover-people-block'.
Sample html:
<!DOCTYPE html>
<html>
<body>
<div>
<p> A
<span class="rollover-people">
<a class="rollover-people-link">B</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>c</a>
<a>d</a>
</span>
</span>
</span>
</span>E
</p>f
<p> G
<span class="rollover-people">
<a class="rollover-people-link">H</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>i</a>
<a>j</a>
</span>
</span>
</span>
</span>K
</p>l
<div>
</body>
</html>
xpath output:
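For the sample above, the non-whitespace text nodes the expression keeps are A, B, E for the first paragraph and G, H, K for the second; c, d, i and j sit under the excluded span, and f and l are outside <p>. A rough way to check this without an XPath 1.0 engine is sketched below with Python's standard library. ElementTree has no ancestor axis, so the exclusion is done by a recursive walk; the helper name visible_text is hypothetical, and the <img> tags are left out of the trimmed sample so it parses as XML.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample (the <img> tags are omitted).
SAMPLE = """
<div>
  <p> A
    <span class="rollover-people">
      <a class="rollover-people-link">B</a>
      <span class="rollover-people-block">
        <span class="rollover-block"><span><a>c</a><a>d</a></span></span>
      </span>
    </span>E
  </p>f
  <p> G
    <span class="rollover-people">
      <a class="rollover-people-link">H</a>
      <span class="rollover-people-block">
        <span class="rollover-block"><span><a>i</a><a>j</a></span></span>
      </span>
    </span>K
  </p>l
</div>
"""


def visible_text(node, skip_class="rollover-people-block"):
    """Collect text under node, skipping any <span> subtree with class skip_class."""
    parts = []
    if node.text:
        parts.append(node.text)
    for child in node:
        excluded = child.tag == "span" and child.get("class") == skip_class
        if not excluded:
            parts.extend(visible_text(child, skip_class))
        # Tail text follows the child's closing tag, so it belongs to the parent
        # and is kept even when the child itself is excluded.
        if child.tail:
            parts.append(child.tail)
    return parts


root = ET.fromstring(SAMPLE)
result = [
    [chunk.strip() for chunk in visible_text(p) if chunk.strip()]
    for p in root.iter("p")
]
print(result)  # [['A', 'B', 'E'], ['G', 'H', 'K']]
```

Grouping the chunks per <p> also covers the case where the paragraph is repeated on the page.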

How can I get all text data of a node with xpath in scrapy

I'm trying to scrape user review data from a website. I hope to end up with two columns of data (ratings and reviews).
Here is a sample XML file that emulates my scraping problem. I tried it on https://www.freeformatter.com/xpath-tester.html#ad-output to get the outputs.
<root>
<div class="user-review">
<div class="rating"> 5,0 </div>
<p class="review-content"> Reiew text of item/movie.
<span class="details">
<span class="details-header">Detail: </span>
<span class="details-content">Some details to emphasis</span>
</span>
Continue to review
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
</root>
I can get the 3 rating values with the query below.
/root/div/div[@class="rating"]/text()
Output:
Text=' 5,0 '
Text=' 4,0 '
Text=' 4,0 '
When I try to get the review part, the first text is split into 2 sections. Because of that I have two lists of different sizes (3 ratings and 4 reviews) and cannot match reviews with ratings.
//p[@class="review-content"]/text()
Output:
Text=' Reiew text of item/movie.
'
Text='
Continue to review
'
Text='Reiew text of item/movie.
'
Text='Reiew text of item/movie.
'
Can anybody help me get one of my expected outputs?
Expected output1:
Text=' Reiew text of item/movie.
Continue to review
'
Text='Reiew text of item/movie.
'
Text='Reiew text of item/movie.
Expected output2:
Text=' Reiew text of item/movie. Some details to emphasis
Continue to review
'
Text='Reiew text of item/movie.
'
Text='Reiew text of item/movie.
Try this. Here sel is a selector; in your case it may be response.
tags = sel.xpath('//p[@class="review-content"]')
reviews = []
for tag in tags:
    # Join every text node under this <p>, including text inside nested spans
    text = " ".join(tag.xpath('.//text()').extract())
    reviews.append(text)
You'll have to loop over div elements with user-review class and extract the review content from each of these. If you want a one-liner, look at this:
import scrapy
text = """
<root>
<div class="user-review">
<div class="rating"> 5,0 </div>
<p class="review-content"> Reiew text of item/movie.
<span class="details">
<span class="details-header">Detail: </span>
<span class="details-content">Some details to emphasis</span>
</span>
Continue to review
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
</root>
"""
selector = scrapy.Selector(text=text)
review_content = [
    review.xpath('normalize-space(.//p[@class="review-content"])').extract_first()
    for review in selector.xpath('//div[@class="user-review"]')
]
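To end up with aligned (rating, review) pairs, the rating can be read inside the same per-review loop. The following is a minimal sketch of that idea using only Python's standard library rather than Scrapy, so it runs without any install; the variable name rows and the shortened two-review sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample: one review with nested spans, one without.
SAMPLE = """
<root>
  <div class="user-review">
    <div class="rating"> 5,0 </div>
    <p class="review-content"> Reiew text of item/movie.
      <span class="details">
        <span class="details-header">Detail: </span>
        <span class="details-content">Some details to emphasis</span>
      </span>
      Continue to review
    </p>
  </div>
  <div class="user-review">
    <div class="rating"> 4,0 </div>
    <p class="review-content">Reiew text of item/movie.
    </p>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)
rows = []
for review in root.findall("div[@class='user-review']"):
    rating = review.find("div[@class='rating']").text.strip()
    paragraph = review.find("p[@class='review-content']")
    # itertext() yields every descendant text node in document order;
    # split() + join() collapses runs of whitespace, like normalize-space().
    content = " ".join("".join(paragraph.itertext()).split())
    rows.append((rating, content))

for rating, content in rows:
    print(rating, "|", content)
```

Because each pair is built from one user-review element, the two columns cannot drift out of step even when a review contains nested spans.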

How to select the last li with XPath or CSS selector

I want to select the text in the last li with XPath; I can use a CSS selector too.
Here the value is "3".
<div class="pg">
<input type="hidden" value="1" name="PaginationForm.CurrentPage">
<input id="PaginationForm_TotalPage" type="hidden" value="41" name="PaginationForm.TotalPage">
<span class="pgPrev">‹</span>
<ul>
<li class="">
<span class="current">1</span>
</li>
<li class="">
<a>2</a>
</li>
<li class="">
<a>3</a>
</li>
</ul>
<a class="jsNxtPage pgNext">›</a>
</div>
I tried this in Selenium:
driver.find_elements_by_xpath('(//*[@class="pg"]/ul/li/text())[last()]')
As the method name suggests, it can only return elements, not text nodes. You can find the target <a> element first:
a = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')
And then you can get the inner text from a.text.
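The same two-step idea (select the element with li[last()], then read its text) can be tried without a browser. This is a small sketch using Python's standard library on a trimmed copy of the markup, not Selenium itself; the unclosed <input> tags are dropped so the snippet parses as XML, and the names last_link and last_text are made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed pagination markup; only the <ul> matters for this lookup.
SAMPLE = """
<div class="pg">
  <ul>
    <li class=""><span class="current">1</span></li>
    <li class=""><a>2</a></li>
    <li class=""><a>3</a></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)
# Step 1: select the element, not a text node.
last_link = root.find("ul/li[last()]/a")
# Step 2: read the text from the element that was found.
last_text = last_link.text
print(last_text)  # 3
```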
