Using Scrapy to extract the URL of an image

I am using Scrapy to scrape images. I notice that some image URLs are given by @src, like the following:
<a href="http://www.wandoujia.com/apps/com.uu">
<img src="http://img.wdjimg.com/mms/icon/v1/5/09/14687d011083dc84036fc68dc3c80095_68_68.png" width="68" height="68" alt="UU电话" class="icon">
</a>
Some are different:
<a href="http://www.wandoujia.com/apps/com.hcsql.shengqiandianhua">
<img data-original="http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png" alt="省钱电话" class="icon lazy" width="68" height="68" src="http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png" style="display: block;">
</a>
I use the following code to extract them. The result is: 1) if only src is present, @src is the real link to the image; 2) if data-original is present, @data-original is the real link and @src is not. So my question is: what should I do to extract the image URL in both cases?
sel.xpath('/a/img/@src').extract()

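You can try taking @src only from images that have no data-original attribute, and @data-original from those that do:
sel.xpath('//a/img[not(@data-original)]/@src | //a/img/@data-original').extract()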
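The selection rule behind that XPath (prefer data-original, fall back to src) can also be written out step by step. The following is a minimal sketch using only Python's standard-library html.parser rather than Scrapy's selectors; the class name, function name, and sample markup are hypothetical, and unlike the XPath it looks at every img tag, not only those directly inside an a element.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect one URL per <img>, preferring data-original over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Lazy-loaded images keep the real URL in data-original;
        # plain images only carry src.
        url = attr_map.get("data-original") or attr_map.get("src")
        if url:
            self.urls.append(url)


def extract_image_urls(html):
    """Return the preferred URL of every <img> in document order."""
    collector = ImageUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


# Hypothetical sample markup modelled on the two cases in the question.
SAMPLE = """
<a href="/apps/a"><img src="http://example.com/a.png" class="icon"></a>
<a href="/apps/b"><img data-original="http://example.com/b.png"
     src="http://example.com/placeholder.png" class="icon lazy"></a>
"""

print(extract_image_urls(SAMPLE))
```

In a spider the XPath above is the more direct route; the sketch is mainly useful for checking the fallback logic against saved HTML.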

Related

Thymeleaf - Image to act as Hyperlink

Using Thymeleaf, how can I make an image act as a hyperlink?
The Thymeleaf documentation says nothing about images. I tried using standard HTML, but neither of the attempts below made my image an active hyperlink.
<a th:href="@{/user/myUser}">
<img src="../../static/images/image.jpg" alt="logo"/>
</a>
<a href="https://www.w3schools.com">
<img src="../../static/images/image.jpg" alt="logo"/>
</a>

<a href="/oauth2/authorization/google">
<img alt="Google Login" title="Google login"
th:src="@{/images/login-with-google.png}" />
</a>

Scraping a web page by using XPath

I'm scraping some web pages in order to get some information. I'm using Scrapy and XPath language.
This is an example of the kind of page I want to scrape. The page contains many li elements like this one:
<li ckIgnore="false" codmod="3857" ccar="A" area="NEW" versArea="NEW" shorturl="1" modurl="/auto">
<article>
<img width="210" height="158" src="" alt="" modello=>
<img src="" alt="logo" class="logo-listing" width="38">
<div class="hgroup">
<a href="">
<h5>ABARTH</h5>
<h3>500 cabrio</h3>
</a>
</div>
</article>
</li>
I'm using this syntax to get all the divs that have the hgroup class. Unfortunately, when I try to print the models variable, it is empty.
def parse(self, response):
    sel = Selector(response)
    models = sel.xpath("//div[@class='hgroup']/a")

It's possible that what Scrapy "sees" is different from what you see in your browser. Try running scrapy shell "http://example.com" and check response.body to see whether what you're looking for is there.
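That check can be made concrete by counting how many elements in the raw HTML actually carry the class the XPath expects. This is a rough sketch using Python's standard-library html.parser; the class and function names and both sample strings are hypothetical, and the second sample assumes one possible cause (the list being built by JavaScript after the page loads), which the question does not confirm.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A class attribute may be missing or hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# In `scrapy shell`, one would pass response.text instead of these literals.
static_html = '<div class="hgroup"><a href=""><h3>500 cabrio</h3></a></div>'
js_only_html = '<div id="listing"></div><script>/* list built by JavaScript */</script>'

print(count_elements_with_class(static_html, "hgroup"))
print(count_elements_with_class(js_only_html, "hgroup"))
```

A count of zero on the downloaded body would suggest the selector is fine and the markup simply is not in the response Scrapy received.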

Count images in a page using Capybara

I want to count the images displayed on a page using Capybara. The HTML is shown below. I use the following code to get the total count, but it returns 0, even though the page has over 100 images.
c = page.all('.thumbnail_select').count
puts c  # returns 0
HTML
<a class="thumbnail thumbnail_img_wrap">
<img alt="" src="test.jpg">
<div class="thumbnail_select">
<div class="thumail_selet_backnd"></div>
<div class="thumbil_selt_text">Click to Select</div>
</div>
<p>ucks</p>
<span class="info_icon"><span class="info_icon_img"></span></span>
</a>
<a class="thumbnail thumbnail_img_wrap">
<img alt="" src="test1.jpg">
<div class="thumbnail_select">
<div class="thumail_selet_backnd"></div>
<div class="thumbil_selt_text">Click to Select</div>
</div>
<p>ucks</p>
<span class="info_icon"><span class="info_icon1_img"></span></span>
</a>
.........
.........
How can I count the total number of images?

You have a few options.
You can find all divs with the class thumbnail_select by using all("div[class='thumbnail_select']").count
But this is an awkward way of doing it, since it looks for the divs and not the images.
A better way would be to look for all images using all("img").count, as long as no other images are present on the page.
If neither of these works, the problem might be that your page has not finished loading when you start looking for the images. In that case, put a page.should have_content check before the image count to make sure the page is loaded.

Align images beside text

I want to align the images beside the text, and they also need to be clickable. How can I do this? Do I need to make an unordered list?
Here is the whole page: http://jsfiddle.net/dzadze/68WrB/
<div>
<a class="pic_link" href="#">
<img src="http://f.cl.ly/items/3Q1e0G1Y2b2Q2U0N1g1q/fb.png">
</a>
Следете не <br>на FACEBOOK
<a href="#" class="pic_link">
<img src="http://f.cl.ly/items/413J3G3e152p1g3W0t0l/ftp.png">
</a>
<a href="#">FTP Логин
</a>
<a class="pic_link" href="#">Што е Photobook</a>
Процес на изработка
</div>

This is not the only solution, but adding this to your CSS seems to work:
footer a {
    display: inline-block;
}

It sounds like using CSS floating would be a good place to start, based on what you've said.

How to stop auto-refresh onclick from thumbnails?

I have an image gallery on my site that uses thumbnails that enlarge above the thumbnail line when clicked on. I'm having an issue with the auto-refresh; every time I click one of the thumbnails, the page refreshes, which restores it to the "master image".
I'm not (and sort of refuse, on the grounds that I believe all this can be done with simple CSS and HTML) using anything fancy to write this code, despite my knowledge of HTML being amateur at best.
Here's a sample of the code. Let me know if you need to see a different piece of it.
<div id="rightcol">
<img name="ImageOnly. src='#' /><img src="#" />
</div>
<div id="leftcol"> <div>
<a href="" onclick="ImageOnly.src='#'"><img src="#" />
</div>
Edit: Somehow I seem to have fixed this issue by changing
<a href="" onclick="ImageOnly.src='#'">
to
<a href="#" onclick="ImageOnly.src='#'">
Not really sure why this worked but would love an explanation...?

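Why not just use some simple JavaScript with .innerHTML, instead of trying to stop the refresh that occurs when you click the hyperlink? An empty href resolves to the current page's URL, so clicking it reloads the page, whereas href="#" only jumps to a fragment of the same page. With .innerHTML you can update rightcol in place, without any reload.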
HTML
<div id="rightcol">
<img name="ImageOnly.src" src='#' />
</div>
<div id="leftcol">
<img src="#" />
</div>
AJAX Script
function ajaxMove(src)
{
    var image = '<img src="' + src + '" alt="my transferred image" />';
    document.getElementById('rightcol').innerHTML = image;
}
How is it used?
Pass the image source to the function from the onclick event.
Build an image tag from that source.
Put the new image tag into the element with the id 'rightcol'.
Other options
You could also remove the href="#" from the <a> tag, work directly from the onclick event, and apply style="cursor:pointer;". It will then behave like a regular hyperlink but without the refresh. The argument is the URL of the image to show (the placeholder '#' here, as in the question).
<a onclick="ajaxMove('#')" style="cursor:pointer;">Click Me</a>
