print HtmlAnchor text inside <span> tags - htmlunit

I have the following HtmlAnchor
<a href="/property/tx/coldspring/77331/82-bradford-ln/167035284">
<span itemprop="streetAddress">SOMENUMBER Bradford Ln</span>,
<span itemprop="addressLocality">Coldspring</span>,
<span itemprop="addressRegion">TX</span>
<span itemprop="postalCode">77331</span>
</a>
When I print htmlAnchor.getTextContent(), it does not print the text inside the span tags. How can I print the whole text?

Related

Scrapy xpath select parent element based on text value in subelement and lacking of element

I want to select all elements article that don't contain a span element with class status and where the nested a element contains a href attribute which contains the text "rent.html".
I've managed to get the a element like so:
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]')
But after reading up on this and trying to select the first parent article element like so, I get "data=0":
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]//parent::article and not //article[@class="car"]//span[@class="status"]')
I also tried this:
response.xpath('//article[@class="car"][//a[contains(@href,"rent.html")]/article and not //article[@class="car"]//span[@class="status"]')')
I don't know what the expression is for my use case.
<article class="car">
<div>
<div class="container">
<a href="/34625030/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34625230/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/12325230/buy.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34632230/rent.html">
</a>
</div>
</div>
<span class="status">Rented</span>
</article>

This XPath expression will do the job:
"//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
The entire command is:
response.xpath("//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]")
Explanation, translating your requirements into XPath syntax:
"select all elements article" - //article
"that don't contain a span element with class status" - [not(.//span[@class='status'])]
"and where the nested a element contains a href attribute which contains the text "rent.html"" - [.//a[contains(@href,'rent.html')]]
I tested the XPath above on the shared sample XML and it worked properly.
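If you want to sanity-check the two predicates without Scrapy, here is a minimal sketch using only Python's standard library. ElementTree does not support not() or contains(), so the sketch expresses those two conditions in plain Python instead of XPath. The <root> wrapper and the helper name matching_articles are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The sample from the question, wrapped in a <root> element (an assumption)
# so that it parses as a single XML document.
SAMPLE = """
<root>
  <article class="car"><div><div class="container"><a href="/34625030/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/34625230/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/12325230/buy.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/34632230/rent.html"></a></div></div><span class="status">Rented</span></article>
</root>
"""


def matching_articles(root):
    """Return articles with a rent.html link and no span.status descendant."""
    matches = []
    for article in root.iter("article"):
        # Same idea as [not(.//span[@class='status'])]
        has_status = article.find(".//span[@class='status']") is not None
        # Same idea as [.//a[contains(@href,'rent.html')]]
        has_rent_link = any(
            "rent.html" in a.get("href", "") for a in article.iter("a")
        )
        if has_rent_link and not has_status:
            matches.append(article)
    return matches


root = ET.fromstring(SAMPLE)
hrefs = [article.find(".//a").get("href") for article in matching_articles(root)]
print(hrefs)  # ['/34625030/rent.html', '/34625230/rent.html']
```

The "buy" article and the article marked "Rented" are both filtered out, which is what the XPath above is meant to do.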

Hugo's equivalent of PHP's strpos()

I want to break .Content up to fit in different nav-tabs. If there's a better pattern to accomplish this, please let me know.
My /content/shop/product-name/index.md contains the following front matter and content:
# summary
summary: "This is the **product's** summary which will render markdown"
---
This is the first line of the full description of the product. This section of the ./index.md
page is referenced in the `single.html` file as `.Content`.|^^|This is the next part of the
.Content that I want to throw into a different nav-tab.
Then in /layouts/shop/single.html:
<div class="row">
<div class="col-lg-12">
{{ .Params.summary }}
</div>
</div>
<div class="row">
<div class="col-lg-12">
<nav class="product-info-tabs wc-tabs mt-5 mb-5">
<div class="nav nav-tabs nav-fill" id="nav-tab" role="tablist">
<a class="nav-item nav-link active" id="nav-home-tab" data-toggle="tab" href="#nav-home" role="tab"
aria-controls="nav-home" aria-selected="true">Description</a>
<a class="nav-item nav-link" id="nav-profile-tab" data-toggle="tab" href="#nav-profile" role="tab"
aria-controls="nav-profile" aria-selected="false">Additional Information</a>
<a class="nav-item nav-link" id="nav-contact-tab" data-toggle="tab" href="#nav-contact" role="tab"
aria-controls="nav-contact" aria-selected="false">Reviews</a>
</div>
</nav>
<div class="tab-content" id="nav-tabContent">
<div class="tab-pane fade show active" id="nav-home" role="tabpanel" aria-labelledby="nav-home-tab">
{{ .Content }}
</div>
...
In days gone by, in PHP, I could use strpos(.Content, '|^^|') and then substr(.Content, 0, strpos(.Content, '|^^|')) to get a section of text. I could also split the string into an array on a user-defined delimiter with split('|^^|', .Content).
So, back in Hugo, within .Content I could have something like:
This is the content. This is the last line before being split.|^^|This is the next line, that would be in array[1] or the next indexed substr.
I'm trying to get these two sections of .Content into different tabs of the single.html page. Each product's .Content will obviously be different, so there is no consistent offset I could pass to Hugo's substr.
The problem I see with using the front matter is that, although it is rendered as markdown, it can't span multiple lines. I know I could use \n for new lines, but that defeats the benefits of markdown.
Thanks.

Sounds like you could probably just do:
{{ replace .Content "|^^|" "</div><div class='tab-pane fade show'>" | safeHTML}}
Which would turn
<div class="tab-pane fade show active" id="nav-home" role="tabpanel" aria-labelledby="nav-home-tab">
This is the content. This is the last line before being split.|^^|This is the next line, that would be in array[1] or the next indexed substr.
</div>
into
<div class="tab-pane fade show active" id="nav-home" role="tabpanel" aria-labelledby="nav-home-tab">
This is the content. This is the last line before being split.
</div>
<div class='tab-pane fade show'>
This is the next line, that would be in array[1] or the next indexed substr.
</div>

Catching partial part of text with XPath

I have been having some difficulty finding an XPath for the following HTML:
<div>
<p> pppppppp
<span class="rollover-people">
<a class="rollover-people-link">pppppp</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>xxxx</a>
<a>xxxxx</a>
</span>
</span>
</span>
</span>pppppppp
</p>ppppppppp
<div>
So basically I need everything inside the <p> up to <span class="rollover-people-block">. In other words, I want <p> but not <span class="rollover-people-block">. Is that even possible? Keep in mind that the <p> is repeated more than once on the page.

This is something close to what you are looking for:
//p//text()[not(ancestor::span[@class='rollover-people-block'])]
This will get all the text nodes under p, excluding the ones that are under a span with class='rollover-people-block'.
Sample html:
<!DOCTYPE html>
<html>
<body>
<div>
<p> A
<span class="rollover-people">
<a class="rollover-people-link">B</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>c</a>
<a>d</a>
</span>
</span>
</span>
</span>E
</p>f
<p> G
<span class="rollover-people">
<a class="rollover-people-link">H</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>i</a>
<a>j</a>
</span>
</span>
</span>
</span>K
</p>l
<div>
</body>
</html>
xpath output:
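For this sample, the non-whitespace text nodes the expression selects are A, B, E, G, H and K. As a rough cross-check that needs no XPath engine, here is a minimal sketch using Python's standard html.parser. It applies the same rule (keep text under p unless some ancestor span has class rollover-people-block). The class name ParagraphText is hypothetical, the sample is a copy of the one above without the html and body wrappers, and unlike the XPath the sketch drops whitespace-only text nodes.

```python
from html.parser import HTMLParser

SAMPLE = """
<div>
<p> A
<span class="rollover-people">
<a class="rollover-people-link">B</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>c</a>
<a>d</a>
</span>
</span>
</span>
</span>E
</p>f
<p> G
<span class="rollover-people">
<a class="rollover-people-link">H</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>i</a>
<a>j</a>
</span>
</span>
</span>
</span>K
</p>l
</div>
"""


class ParagraphText(HTMLParser):
    """Collect text inside <p>, skipping span.rollover-people-block subtrees."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        # One flag per open <span>: True if that span is the excluded block.
        self.span_stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag == "span":
            is_block = dict(attrs).get("class") == "rollover-people-block"
            self.span_stack.append(is_block)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "span" and self.span_stack:
            self.span_stack.pop()

    def handle_data(self, data):
        # Keep text only when inside <p> and no open span is the excluded block.
        if self.in_p and not any(self.span_stack) and data.strip():
            self.chunks.append(data.strip())


parser = ParagraphText()
parser.feed(SAMPLE)
parser.close()
print(parser.chunks)  # ['A', 'B', 'E', 'G', 'H', 'K']
```

The letters c, d, i and j sit under the excluded span, and f and l sit outside any p, so none of them appear.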

How can I get all text data of a node with xpath in scrapy

I'm trying to scrape user review data from a website. I want to end up with two columns of data (ratings and reviews).
Here is a sample XML file that emulates my scraping problem. I tested it at https://www.freeformatter.com/xpath-tester.html#ad-output to get the outputs.
<root>
<div class="user-review">
<div class="rating"> 5,0 </div>
<p class="review-content"> Reiew text of item/movie.
<span class="details">
<span class="details-header">Detail: </span>
<span class="details-content">Some details to emphasis</span>
</span>
Continue to review
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
</root>
I can get the 3 rating values with the query below.
/root/div/div[@class="rating"]/text()
Output:
Text=' 5,0 '
Text=' 4,0 '
Text=' 4,0 '
When I try to get the review part, the first review's text is split into 2 sections. Because of that I end up with two lists of different sizes (3 ratings and 4 reviews) and cannot match reviews with ratings.
//p[@class="review-content"]/text()
Output:
Text=' Reiew text of item/movie.
'
Text='
Continue to review
'
Text='Reiew text of item/movie.
'
Text='Reiew text of item/movie.
Can anybody help me get one of my expected outputs?
Expected output1:
Text=' Reiew text of item/movie.
Continue to review
'
Text='Reiew text of item/movie.
'
Text='Reiew text of item/movie.
Expected output2:
Text=' Reiew text of item/movie. Some details to emphasis
Continue to review
'
Text='Reiew text of item/movie.
'
Text='Reiew text of item/movie.

Try this. Here sel is a selector; in your case it may be response:
tags = sel.xpath('//p[@class="review-content"]')
reviews = []
for tag in tags:
    # Join every text node under the <p>, including those in nested spans
    text = " ".join(tag.xpath('.//text()').extract())
    reviews.append(text)

You'll have to loop over div elements with user-review class and extract the review content from each of these. If you want a one-liner, look at this:
import scrapy
text = """
<root>
<div class="user-review">
<div class="rating"> 5,0 </div>
<p class="review-content"> Reiew text of item/movie.
<span class="details">
<span class="details-header">Detail: </span>
<span class="details-content">Some details to emphasis</span>
</span>
Continue to review
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
</root>
"""
selector = scrapy.Selector(text=text)
review_content = [review.xpath('normalize-space(.//p[@class="review-content"])').extract_first() for review in selector.xpath('//div[@class="user-review"]')]
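To show why looping over each user-review block keeps ratings and reviews aligned, here is a minimal standard-library sketch on a shortened copy of the sample (two reviews instead of three). It is not Scrapy code: ElementTree stands in for the selector, and the helper names squash, own_text and all_text are hypothetical. own_text roughly corresponds to the first expected output (direct text only) and all_text to the second (nested spans included, so the "Detail:" label also appears).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<root>
<div class="user-review">
<div class="rating"> 5,0 </div>
<p class="review-content"> Reiew text of item/movie.
<span class="details">
<span class="details-header">Detail: </span>
<span class="details-content">Some details to emphasis</span>
</span>
Continue to review
</p>
</div>
<div class="user-review">
<div class="rating"> 4,0 </div>
<p class="review-content">Reiew text of item/movie.
</p>
</div>
</root>"""


def squash(text):
    """Collapse runs of whitespace, similar in spirit to normalize-space()."""
    return " ".join(text.split())


def own_text(element):
    """Text directly under the element: its .text plus each child's .tail."""
    parts = [element.text or ""]
    parts += [child.tail or "" for child in element]
    return squash(" ".join(parts))


def all_text(element):
    """All descendant text, including text inside nested spans."""
    return squash("".join(element.itertext()))


rows = []
# One iteration per review block, so each rating stays with its own review.
for review in ET.fromstring(SAMPLE).findall("./div[@class='user-review']"):
    rating = squash(review.findtext("./div[@class='rating']", default=""))
    content = review.find("./p[@class='review-content']")
    rows.append((rating, own_text(content), all_text(content)))

for row in rows:
    print(row)
```

Because each row is built from a single user-review element, the list always has one entry per rating, however many text nodes the review paragraph contains.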

How to select the last li with XPath or a CSS selector

I want to select the text in the last li with XPath; I can use a CSS selector too.
Here the value is "3".
<div class="pg">
<input type="hidden" value="1" name="PaginationForm.CurrentPage">
<input id="PaginationForm_TotalPage" type="hidden" value="41" name="PaginationForm.TotalPage">
<span class="pgPrev">‹</span>
<ul>
<li class="">
<span class="current">1</span>
</li>
<li class="">
<a>2</a>
</li>
<li class="">
<a>3</a>
</li>
</ul>
<a class="jsNxtPage pgNext">›</a>
</div>
I tried this in Selenium:
driver.find_elements_by_xpath('(//*[@class="pg"]/ul/li/text())[last()]')

As the method name suggests, it can only return elements, not text nodes. You can find the target <a> element first:
a = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')
And then you can get the inner text from a.text.
