I'm trying to scrape the URLs contained in these:
<a href="tel:(352) 376-4090" rel="noopener noreferrer" target="_blank" class="SK-FuGf">
<a rel="noopener noreferrer" target="_blank" class="SK-FuGf" href="http://www.bodytechtattoo.com">
<a rel="noopener noreferrer" target="_blank" class="SK-FuGf" href="https://www.instagram.com/bodytech_piercings">
They all share the same class. Let's say I want to scrape only the website link (the second one).
Can you help me understand why the following XPath is not working for that?
//a[@class="SK-FuGf" and not(contains(@href, “instagram.com")) and not(contains(@href, "tel:"))]/@href
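One thing worth checking first: the quote before instagram.com in the expression above is a typographic quote (“), which XPath does not accept as a string delimiter, so that alone may be what breaks it. Below is a minimal sketch of the intended filter using only Python's standard library. ElementTree supports just a subset of XPath (no contains() or not()), so the class match is done in the path and the href checks are done in Python. The wrapper div, the closing tags, and the function name are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around the three anchors from the question;
# closing tags were added so the snippet parses as XML.
SNIPPET = """
<div>
  <a href="tel:(352) 376-4090" rel="noopener noreferrer" target="_blank" class="SK-FuGf"></a>
  <a rel="noopener noreferrer" target="_blank" class="SK-FuGf" href="http://www.bodytechtattoo.com"></a>
  <a rel="noopener noreferrer" target="_blank" class="SK-FuGf" href="https://www.instagram.com/bodytech_piercings"></a>
</div>
"""


def website_links(markup):
    """Return hrefs of SK-FuGf anchors that are neither tel: nor Instagram links."""
    root = ET.fromstring(markup.strip())
    # ElementTree's XPath subset can match on an exact attribute value.
    hrefs = [a.get("href", "") for a in root.findall(".//a[@class='SK-FuGf']")]
    # The two not(contains(...)) conditions, expressed in plain Python.
    return [h for h in hrefs if "tel:" not in h and "instagram.com" not in h]


links = website_links(SNIPPET)
print(links)
```

With a full XPath 1.0 engine, such as the one behind Scrapy selectors, the same filter can stay inside the expression once every quote is a plain " or '.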
I am trying to scrape a web page for NAME OF COMPANY and CITY AND STATE OF COMPANY shown below.
I have an xpath code snippet that identifies both text elements at the same time:
//span[starts-with(@class,"text-align")]/text()[2]
This xpath snippet pulls the first text value (COMPANY NAME). How do I get the second text element (CITY,STATE)?
A snip of the web page code looks like this:
<div>
<ul class="pv-top-card-v3--experience-list">
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="position_see_more" data-ember-action="" data-ember-action-172="172">
<img src="https://media.licdn.com/dms/image/C4E0BAQFhA8h46hvabA/company-logo_100_100/0?e=1582761600&v=beta&t=VAeZqaGu3Lu6Ol_n5kiiI74FSRuSOZA1ggAI5qTVRjE" id="ember173" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember174" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE NAME OF A COMPANY
<!----></span>
</a>
</li>
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="education_see_more" data-ember-action="" data-ember-action-176="176">
<img src="https://media.licdn.com/dms/image/C560BAQEr2uQX-x2EwQ/company-logo_100_100/0?e=1582761600&v=beta&t=aDbYLUDMvlS4DpwOLjOaQj3Dj60C_cYLC5UUvGoyld0" id="ember177" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember178" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE CITY AND STATE OF COMPANY
<!----></span>
</a>
</li>
</ul>
</div>
The xpath string is picking up the two span elements using class. I can't use the span id attributes because they are dynamic and change with each page (one page per company).
Can someone advise how I extract the desired text?
Thanks.
Point to the li level, so that the position selects which list item you want:
//ul/li[2]/a/span[starts-with(@class,"text-align")]
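A minimal sketch of the same idea with Python's standard library, selecting the span by the position of its li rather than by its dynamic id. ElementTree has no starts-with(), so the class prefix is checked in Python. The markup is a trimmed, hypothetical version of the snippet from the question, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical version of the question's markup (img tags and
# most attributes removed so it parses as XML).
SNIPPET = """
<div>
  <ul class="pv-top-card-v3--experience-list">
    <li><a href="#"><span id="ember174" class="text-align-left ml2"> THIS IS THE NAME OF A COMPANY </span></a></li>
    <li><a href="#"><span id="ember178" class="text-align-left ml2"> THIS IS THE CITY AND STATE OF COMPANY </span></a></li>
  </ul>
</div>
"""


def span_text(markup, position):
    """Return the stripped text of the span inside the Nth li (1-based), or None."""
    root = ET.fromstring(markup.strip())
    span = root.find(f".//ul/li[{position}]/a/span")
    # Emulate starts-with(@class, "text-align") in Python.
    if span is None or not span.get("class", "").startswith("text-align"):
        return None
    return "".join(span.itertext()).strip()


company = span_text(SNIPPET, 1)
city_state = span_text(SNIPPET, 2)
print(company, "|", city_state)
```

Because the position is applied to li, each call returns exactly one span, regardless of how the ids change from page to page.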
Using Thymeleaf, how can I make an image act as a hyperlink?
The Thymeleaf documentation says nothing about images. I tried standard HTML for this, but none of the attempts below made my image an active hyperlink.
<a th:href="@{/user/myUser}">
<img src="../../static/images/image.jpg" alt="logo"/>
</a>
<a href="https://www.w3schools.com">
<img src="../../static/images/image.jpg" alt="logo"/>
</a>
<a href="/oauth2/authorization/google">
<img alt="Google Login" title="Google login"
th:src="@{/images/login-with-google.png}" />
</a>
Magento ver. 1.7.0.2
I'm facing an issue in email templates.
When I use
<a style="" href="{{store url=''}}">
it gives me output like
<a style="" href="http://www.domain.com/index.php">
But I want the following:
<a style="" href="http://www.domain.com">
Now if I add
<a href="{{store direct_url='service'}}">
it gives me output like
<a href="http://www.domain.com/index.php/service">
But I want the following:
<a href="http://www.domain.com/service">
Now if I add
<a href="{{store direct_url='service/contact'}}">
it gives me output like this (/index/index is appended automatically):
<a href="http://www.domain.com/index.php/service/contact/index/index">
But I want the following:
<a href="http://www.domain.com/service/contact">
And when I click the link, it takes me to the /service page, not the service/contact page.
Any idea what's going on?
The following did the job:
{{config path="web/unsecure/base_url"}}
<a href="{{config path="web/unsecure/base_url"}}service">
I'm scraping some web pages to extract information, using Scrapy and XPath.
This is an example of a page I want to scrape. The page contains many li elements like this one:
<li ckIgnore="false" codmod="3857" ccar="A" area="NEW" versArea="NEW" shorturl="1" modurl="/auto">
<article>
<img width="210" height="158" src="" alt="" modello=>
<img src="" alt="logo" class="logo-listing" width="38">
<div class="hgroup">
<a href="">
<h5>ABARTH</h5>
<h3>500 cabrio</h3>
</a>
</div>
</article>
</li>
I'm using this expression to get all the divs which have the hgroup class. Unfortunately, when I print the models variable, it is empty.
def parse(self, response):
    sel = Selector(response)
    models = sel.xpath("//div[@class='hgroup']/a")
It's possible that what Scrapy "sees" is different from what you see in your browser. Try running scrapy shell "http://example.com" and check response.body to see whether what you're looking for is actually there.
I am using Scrapy to scrape images. I noticed that some image URLs are specified by @src, like the following:
<a href="http://www.wandoujia.com/apps/com.uu">
<img src="http://img.wdjimg.com/mms/icon/v1/5/09/14687d011083dc84036fc68dc3c80095_68_68.png" width="68" height="68" alt="UU电话" class="icon">
</a>
Some are different:
<a href="http://www.wandoujia.com/apps/com.hcsql.shengqiandianhua">
<img data-original="http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png" alt="省钱电话" class="icon lazy" width="68" height="68" src="http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png" style="display: block;">
</a>
I use the following code to extract them. The result is: 1) if only src occurs, @src is the real link to the image; 2) if data-original occurs, @data-original is the real link and @src is not. So my question is: what should I do if I want to extract the image URL in both cases?
sel.xpath('//a/img/@src').extract()
You can try:
sel.xpath('//a/img[not(@data-original)]/@src | //a/img/@data-original').extract()
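The same "prefer data-original, fall back to src" rule can also be sketched per element with Python's standard library, which makes the fallback explicit. ElementTree does not support the | union or attribute steps, so the choice is made in Python. The markup and image URLs below are hypothetical stand-ins for the two cases in the question, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup covering both cases: a plain src, and a lazy-loaded
# image whose real URL lives in data-original.
SNIPPET = """
<div>
  <a href="http://www.wandoujia.com/apps/com.uu">
    <img src="http://img.example.com/real-1.png" class="icon"/>
  </a>
  <a href="http://www.wandoujia.com/apps/com.hcsql.shengqiandianhua">
    <img data-original="http://img.example.com/real-2.png"
         src="http://img.example.com/placeholder.png" class="icon lazy"/>
  </a>
</div>
"""


def image_urls(markup):
    """Return one URL per img under an a, preferring data-original over src."""
    root = ET.fromstring(markup.strip())
    # Note: an empty data-original also falls back to src here, which differs
    # slightly from the XPath test not(@data-original).
    return [img.get("data-original") or img.get("src")
            for img in root.findall(".//a/img")]


urls = image_urls(SNIPPET)
print(urls)
```

Doing the choice per element keeps each URL tied to its own img, which helps if you later need the alt text or the parent link alongside it.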