How to extract url and then download it with xpath and scrapy - xpath

I am trying to download images using XPath in Scrapy. How do I save the URLs, and how do I download the images they refer to?
This is what I did so far:
def parse_imgur(self, response):
    image = ImagecrawlerItem()
    image['images'] = response.xpath('//a[@href]').extract()
    return image
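One way to think about the two steps, sketched in plain Python rather than Scrapy so that it runs on its own: collect the attribute values themselves (in XPath terms, something like //img/@src rather than the whole element), resolve them against the page URL, then fetch each one. The class SrcCollector, the helper functions, and the example URLs are all hypothetical names made up for this sketch; in a real Scrapy project the built-in images pipeline would normally take care of the download step.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve


class SrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, page_url):
    """Return absolute image URLs found in html, resolved against page_url."""
    collector = SrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def download_images(urls, target_dir):
    """Save each URL under target_dir, named after the last path segment."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.basename(url.split("?")[0]) or "image"
        urlretrieve(url, os.path.join(target_dir, filename))


# The page content and URL below are made up for illustration.
sample = '<a href="/x"><img src="/img/a.jpg"></a><img src="b.png">'
image_urls = extract_image_urls(sample, "http://example.com/gallery/")
print(image_urls)
```

Calling download_images(image_urls, "images") would then fetch the files; it is left uncalled here because the example URLs do not point at real images.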

Related

Trying to scrape an image using Nokogiri but it returns a link that I was not expecting

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
title: title,
overview: overview,
poster_url: poster,
}
It has nothing to do with your Ruby code.
If you run something like this in your terminal:
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
You can see that the returned HTML does not contain the image you are looking for. You can see it in your browser because, after the initial load, some JavaScript runs and loads more resources.
The Ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.
Nokogiri does not execute JavaScript; however, the link has to be there, or at least there has to be a link to some API that returns it.
The first place I would look is the data attributes of the image element or its parent; in this case, however, it was hidden in an inline script along with some other interesting data about the movie.
First, download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Search for something you know about the file; I searched for the ce7ed2a83f part of the image URL and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n",'').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image']
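The same idea, pulling the JSON out of the inline ld+json script and reading its image key, can be sketched in Python with only the standard library. The function name extract_ld_json and the sample HTML are assumptions made for illustration; the real page's markup may differ, and a proper HTML parser would be more robust than a regular expression.

```python
import json
import re


def extract_ld_json(html):
    """Return the parsed JSON of the first ld+json script, or None."""
    match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if match is None:
        return None
    body = match.group(1)
    # Keep only the text from the first "{" to the last "}", which drops
    # wrappers such as CDATA comments around the JSON.
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end == -1:
        return None
    return json.loads(body[start:end + 1])


# Made-up page fragment standing in for the downloaded HTML.
sample_html = """
<html><head>
<script type="application/ld+json">
/* <![CDATA[ */
{"image": "https://example.com/poster.jpg", "name": "Example Film"}
/* ]]> */
</script>
</head></html>
"""

data = extract_ld_json(sample_html)
print(data["image"])
```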

google search url for search by image

I need to create a search URL such that, when an image is clicked, its URL is passed as the search parameter for Google's search by image.
E.g. on clicking this image:
<img src = 'http:abc.com/image1.jpg'>
I need to pass this image url for searching.
Found the answer:
https://www.google.com/searchbyimage?image_url=< http:abc.com/image1.jpg >
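Because the image URL itself contains characters such as : and /, it is safer to percent-encode it before putting it in the query string. A minimal sketch, assuming the endpoint and the image_url parameter given above; the function name is hypothetical, and whether Google still serves this endpoint is not something this sketch checks.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer above.
SEARCH_BY_IMAGE = "https://www.google.com/searchbyimage"


def search_by_image_url(image_url):
    """Build a search-by-image link with the image URL percent-encoded."""
    return SEARCH_BY_IMAGE + "?" + urlencode({"image_url": image_url})


link = search_by_image_url("http://abc.com/image1.jpg")
print(link)
```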

Selenium copy image to Base64 string?

How can I get an image on a webpage with Selenium and encode it to a Base64 string which gets added to a variable? I'm using Selenium C# but any language will probably work.
I am not quite sure what you are asking. What do you mean by "get an image on a webpage"? Do you mean:
1. grab a screenshot of your page and compare it with some given value? or
2. take a screenshot of a specific element on the webpage? or
3. download an image contained in (e.g.) an <img> tag and do something with it?
Taking screenshots has been widely discussed. Although those are mostly Java solutions, they could probably be ported to C# with ease. If what you need is number 3, get the URL (e.g. using the XPath //img[@id="yourId"]/@src), download it using something like WebClient, and convert the downloaded bytes to Base64:
var imageBytes = new System.Net.WebClient().DownloadData(imageUrl); // imageUrl: the src value of the <img> element
var baseString = System.Convert.ToBase64String(imageBytes);
This code will help you. I use it for my own reports: instead of storing the screenshot in a separate location, it is better to convert it to Base64 and embed it in the report.
String Base64StringofScreenshot="";
File src = ((TakesScreenshot) driverThread).getScreenshotAs(OutputType.FILE);
byte[] fileContent = FileUtils.readFileToByteArray(src);
Base64StringofScreenshot = "data:image/png;base64,"+Base64.getEncoder().encodeToString(fileContent);
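Since the question says any language will do, here is the encoding step on its own as a Python sketch: raw image bytes in, a Base64 data URI out. The function name to_data_uri is hypothetical, and the eight-byte PNG signature merely stands in for real image data obtained from a download or a screenshot.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a Base64 data URI string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:" + mime_type + ";base64," + encoded


# The first eight bytes of any PNG file, used here as stand-in image data.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)
```

The important point is that the bytes of the image file are encoded directly; decoding them as text first would corrupt the data.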

Parsing website with Hpricot

I'm trying to parse images from reddit using Ruby and the Hpricot gem.
Using Chrome I got the XPath to the div holding the link to the image then I use doc.search to find it but the results come up empty.
doc.search('//*[@id="siteTable"]/div[1]/a').each do |r|
  puts r.attributes['href']
end
Any ideas?

XPath for image in popup

I'm using Scrapy to crawl a webpage. I get the XPath selectors by using an xpath Chrome extension, which works fine. I'm getting everything I want on the product page like description, price etc.
If I click on a small image of an item, the big image of that item pops up, and I want to crawl this big image. But the XPath I'm using for the big image isn't fetching anything. Also, when I viewed the page source, I saw that it uses a JavaScript function to load these pop-up images. Is there a way to fetch these images?
start_urls = ['http://www.flipkart.com/nokia-lumia-620/p/itmdgkwywkmaa2w4?pid=MOBDGH6AKH9ERJAF']
description = hxs.select('/html/body/div[@class=" fkart fksk-body line "]/div[@id="fk-mainbody-id"]/div[@class="fk-content fksk-content enable-compare line"]/div[@class="fk-mproduct fk-mproduct-mobile "]/div[@class="mprod-section unit"]/div[@id="topsection"]/div[@class="mprod-summary lastUnit"]/div[@class="mprod-summary-title fksk-mprod-summary-title"]/h1/text()').extract()
price = hxs.select('/html/body/div/div/div/div/div/div/div/div/div/div/div/div/span/text()').extract()
image_urls = hxs.select('/html/body/div[@class="fk-ui-dialog fk-popup"]/div[@class="window alpha30 window-absolute"]/div[@class="content"]/div[@class="dialog-body"]/div[@id="pp-large-images-popup"]/div[@class="main-container"]/div[@class="pp-carousel-bd"]/div[@class="visible-image-large fk-text-center"]/img[@id="visible-image-large"]').extract()
Result :
{'description': [u'Nokia Lumia 620'],
'image_urls': [],
'price': u'14999'}
To get the list of image urls for the small thumbnails you can use this XPath:
//div[@class="thumbs thumbs-small"]/img/@src
You can derive the URLs of the big images from the URLs of the thumbnails: replace 40x40 with 275x275 in each thumbnail URL to get the URL of the corresponding big image.
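The substitution can be sketched as a one-line string replacement. The function name and the thumbnail URL below are made up for illustration; the real URLs come from the XPath above, and the sketch assumes the size token appears only once in each URL.

```python
def big_image_url(thumbnail_url, small="40x40", large="275x275"):
    """Swap the thumbnail size token in the URL for the large size."""
    return thumbnail_url.replace(small, large)


# Hypothetical thumbnail URL in the shape the answer describes.
thumb = "http://img.example.com/image/mobile/nokia-lumia-620-40x40-xyz.jpeg"
big = big_image_url(thumb)
print(big)
```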

Resources