Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data & images from webshop product pages
(with approval from the owner)
I have a working script that loops through a CSV file with 20042 product URLs and stores the data I need in a CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread)
URL = 'http://www.sample.com/page.html'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
def make_absolute( href, root )
  URI.parse(root).merge(URI.parse(href)).to_s
end

Nokogiri::HTML(open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src, URL)
  File.open(File.basename(uri), 'wb'){ |f| f.write(open(uri).read) }
end
That runs great for a separate URL, but I'm struggling to get it working and loop through the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new

urls = CSV.read("lotofurls.csv")

(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
It looks like all I need to figure out is how to feed the URLs to the code that saves the images, but any help would be much appreciated!

You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).
For RMagick, you could do something like this:
require 'rmagick'

images.each do |image|
  url = image.url # should be a string
  # Read the remote image, resize it, and write it out under whatever name you choose
  Magick::Image.read(url).first.resize_to_fill(200, 200).write(image.desired_filename)
end
That would write a 200x200px image for each URL you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the RailsCast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/
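Alternatively, sticking with the open-uri snippet already in the question, here is a rough sketch of how the image download could be folded into the CSV loop. It assumes the product URLs sit in the first column of lotofurls.csv and that the //*[@id="zoom"]/@href selector from above still matches the zoom images; the images are simply written next to the script under their File.basename:

require 'nokogiri'
require 'open-uri'
require 'csv'
require 'uri'

def make_absolute(href, root)
  URI.parse(root).merge(URI.parse(href)).to_s
end

CSV.read("lotofurls.csv").each do |row|
  page_url = row[0]                       # assumes the URL is in the first column
  doc = Nokogiri::HTML(open(page_url))

  # ... extract prices, titles, descriptions, etc. as before ...

  # Save every zoom image linked on this product page
  doc.xpath('//*[@id="zoom"]/@href').each do |src|
    image_url = make_absolute(src.to_s, page_url)
    File.open(File.basename(image_url), 'wb') { |f| f.write(open(image_url).read) }
  end
end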

Related

Nokogiri example not showing array (Ruby)

When I try to run this via the terminal I can parse and display the data, but when I type in pets_array = [] I am not seeing anything.
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
  post_name = a.text
  pets_array.push(post_name)
end

CSV.open('pets.csv', 'w') do |csv|
  csv << pets_array
end
Pry.start(binding)
To be more precise, you could access each anchor tag with the classes .result-title.hdrlnk inside .result-info, .result-row, .rows and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can use the pets_array variable to store the text of each iterated element directly; there is no need to push.
If you want to write the data stored in the array, you can write it straight to the CSV; there is no need to redefine pets_array as an empty array first (which is the reason you get a blank CSV file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }

Ruby crawl site, add URL parameter

I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
  spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
I tried this, but it's not working; it errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
spider.every_url do |url|
link= url+"?foo=bar"
response = open(link).read
end
end
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran through those. Got the job done.
require 'csv'
require 'open-uri'

CSV.foreach(File.path("the-links.csv")) do |row|
  link = "http://www.example.com" + row[0] + "?foo=bar"
  encoded_url = URI.encode(link)
  response = open(encoded_url).read
  puts encoded_url
  puts
end
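One caveat with that string concatenation: if a URL in the CSV already carries its own query string, appending "?foo=bar" would produce a second "?". A small sketch (using the same hypothetical foo=bar parameter and the-links.csv file) that merges the parameter via the URI class instead might look like this:

require 'csv'
require 'open-uri'
require 'uri'

CSV.foreach(File.path("the-links.csv")) do |row|
  uri = URI.join("http://www.example.com", row[0])

  # Keep any existing query parameters and add foo=bar on top
  params = uri.query ? URI.decode_www_form(uri.query) : []
  params << ["foo", "bar"]
  uri.query = URI.encode_www_form(params)

  response = open(uri.to_s).read
  puts uri
end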

How to use three URLs to make one URL array, and use the same array for Nokogiri

I might be crazy, but I have been trying to gather all my favorite news sites and scrape them in one Ruby file. I would like to use these sites to scrape headlines and hopefully create a custom page for my site. So far I have been able to scrape the headlines from all three sites individually. I am looking to put all three URLs into an array and use Nokogiri just once. Can anyone help me?
require 'nokogiri'
require 'open-uri'

url = 'http://www.engadget.com'
data = Nokogiri::HTML(open(url))

@feeds = data.css('.post')
@feeds.each do |feed|
  puts feed.css('.headline').text.strip
end

url2 = 'http://www.modmyi.com'
data2 = Nokogiri::HTML(open(url2))

@modmyi = data2.css('.title')
@modmyi.each do |mmi|
  puts mmi.css('span').text
end

url3 = 'http://www.cnn.com/specials/last-50-stories'
data3 = Nokogiri::HTML(open(url3))

@cnn = data3.css('.cd__content')
@cnn.each do |cn|
  puts cn.css('.cd__headline').text
end
You might want to extract the loading of the document and the extraction of the titles into their own class:
require 'nokogiri'
require 'open-uri'
class TitleLoader < Struct.new(:url, :outer_css, :inner_css)
  def titles
    load_posts.map { |post| extract_title(post) }
  end

  private

  def read_document
    Nokogiri::HTML(open(url))
  end

  def load_posts
    read_document.css(outer_css)
  end

  def extract_title(post)
    post.css(inner_css).text.strip
  end
end
And then use that class like this:
urls = [
  ['http://www.engadget.com', '.post', '.headline'],
  ['http://www.modmyi.com', '.title', 'span'],
  ['http://www.cnn.com/specials/last-50-stories', '.cd__content', '.cd__headline']
]

urls.map { |args| TitleLoader.new(*args).titles }.flatten
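If you would rather keep the headlines grouped per site instead of flattening everything into one array, one possible variation on that last line is:

urls.each do |url, outer_css, inner_css|
  puts url
  puts TitleLoader.new(url, outer_css, inner_css).titles
end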

Filename too long when trying to read and write from an array of absolute image URLs

I have a Capybara script that, among other things, downloads absolute image links.
When trying to write those images to disk I receive an error:
File name too long
The output also includes a long list of all the image URLs in the array. I think a gsub would solve this, but I'm not sure which one or exactly how to implement it.
Here are a few sample image URLs that are part of the link array. A suitable substitute name would be g0377p-xl-3-24c1.jpg or g0371b-m-4-6896.jpg in these examples:
http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0377p-xl-3-24c1.jpg
http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0371b-m-4-6896.jpg
This is the code:
require "capybara/dsl"
require "spreadsheet"
require 'fileutils'
require 'open-uri'
def initialize
  @excel = Spreadsheet::Workbook.new
  @work_list = @excel.create_worksheet
  @row = 0
end

imagelink = info.all("//*[@rel='lightbox[rotation]']")

@work_list[@row, 6] = imagelink.map { |link| link['href'] }.join(', ')

image = imagelink.map { |link| link['href'] }

File.basename("#{image}", "w") do |f|
  f.write(open(image).read)
end
You can use File.basename to get just the filename:
uri = 'http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0377p-xl-3-24c1.jpg'
File.basename uri #=> "g0377p-xl-3-24c1.jpg"
There is also a real problem with the creation of the filename.
imagelink = info.all("//*[@rel='lightbox[rotation]']")
will return an array of nodes.
From that you get the href values using map and save the resulting array in image.
Then you try to use that array as the name of the file.
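Putting those two points together, a rough sketch of the download loop (building on the question's script, which already requires open-uri, and assuming the hrefs are absolute like the sample URLs above) might be:

image_urls = imagelink.map { |link| link['href'] }

image_urls.each do |url|
  # File.basename trims the long cache path down to e.g. "g0377p-xl-3-24c1.jpg"
  File.open(File.basename(url), 'wb') do |f|
    f.write(open(url).read)
  end
end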

Nokogiri multiple domains

Is it possible to do multi-domain searches using Nokogiri? I am aware you can do multiple XPath/CSS searches for a single domain/page, but can you search multiple domains?
For example, I want to scrape http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications and http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
My Code
require 'nokogiri'
require 'open-uri'
require 'spreadsheet'
doc = Nokogiri::HTML(open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))
# Grab our product specifications
data = doc.css('div#specifications div#spec-area ul.product-spec li')

# Modify our data
lines = data.map(&:text)

# Create the Spreadsheet
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'

# Output our data to the Spreadsheet
lines.each.with_index do |line, i|
  sheet1[i, 0] = line
end
book.write 'C:/Users/Barry/Desktop/output.xls'
Nokogiri has no concept of URLs; it only knows about a String or IO stream of XML or HTML. You're confusing OpenURI's purpose with Nokogiri's.
If you want to read from multiple sites, simply loop over the URLs, and pass the current URL to OpenURI to open the page:
%w[
  http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications
  http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
].each do |url|
  doc = Nokogiri::HTML(open(url))
  # do something with the document...
end
OpenURI will read the page and pass its contents to Nokogiri for parsing. Nokogiri will still only see one page at a time, because that's all OpenURI passes it.
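Putting that together with the spreadsheet code from the question, a sketch that writes each page's specs to its own worksheet (selector and gems taken from the question; the output path and worksheet names here are placeholders) could look like this:

require 'nokogiri'
require 'open-uri'
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new

%w[
  http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications
  http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
].each_with_index do |url, n|
  doc = Nokogiri::HTML(open(url))
  lines = doc.css('div#specifications div#spec-area ul.product-spec li').map(&:text)

  # One worksheet per URL; worksheet names must be unique
  sheet = book.create_worksheet
  sheet.name = "Specs #{n + 1}"
  lines.each_with_index { |line, i| sheet[i, 0] = line }
end

book.write 'output.xls'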
