Filename too long when trying to read and write from an array of absolute image links - Ruby

I have a Capybara script that, among other things, downloads absolute image links.
When I try to write those images to disk I receive an error:
File name too long
The error output also includes a long list of all the image URLs in the array. I think a gsub would solve this, but I'm not sure which one or exactly how to implement it.
Here are a few sample image URLs that are part of the link array. A suitable substitute name would be g0377p-xl-3-24c1.jpg or g0371b-m-4-6896.jpg in these examples:
http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0377p-xl-3-24c1.jpg
http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0371b-m-4-6896.jpg
This is the code:
require "capybara/dsl"
require "spreadsheet"
require 'fileutils'
require 'open-uri'
def initialize
  @excel = Spreadsheet::Workbook.new
  @work_list = @excel.create_worksheet
  @row = 0
end

imagelink = info.all("//*[@rel='lightbox[rotation]']")
@work_list[@row, 6] = imagelink.map { |link| link['href'] }.join(', ')
image = imagelink.map { |link| link['href'] }
File.basename("#{image}", "w") do |f|
  f.write(open(image).read)
end

You can use File.basename to get just the filename:
uri = 'http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0377p-xl-3-24c1.jpg'
File.basename uri #=> "g0377p-xl-3-24c1.jpg"

There is a real problem with the creation of the filename.
imagelink = info.all("//*[@rel='lightbox[rotation]']")
will return an array of nodes.
From that you get the href values using map and save the resulting array in image.
Then you try to use that whole array as the name of the file.
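A minimal sketch of the intended loop, assuming imagelink holds the node set from above and open comes from open-uri:
imagelink.map { |link| link['href'] }.each do |url|
  # Reduce the long URL to just the final filename,
  # e.g. "g0377p-xl-3-24c1.jpg"
  filename = File.basename(url)
  # 'wb' writes the image bytes in binary mode
  File.open(filename, 'wb') do |f|
    f.write(open(url).read)
  end
end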

Related

scanning a webpage for urls with ruby and regex

I'm trying to create an array of all links found at the below URL. Using page.scan(URI.regexp) or URI.extract(page) returns more than just URLs.
How do I get just the URLs?
require 'net/http'
require 'uri'
uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)
If you are just trying to extract links (<a href="..."> elements) from the text file, then it seems better to parse it as real HTML with Nokogiri and then extract the links this way:
require 'nokogiri'
require 'open-uri'
# Parse the raw HTML text
doc = Nokogiri.parse(open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))
# Extract all a-elements (HTML links)
all_links = doc.css('a')
# Sort + weed out duplicates and empty links
links = all_links.map { |link| link.attribute('href').to_s }.
        uniq.sort.delete_if { |h| h.empty? }
# Print out some of them
puts links.grep(/store/)
http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...

Why doesn't my web-crawling method find all the links?

I'm trying to create a simple web-crawler, so I wrote this:
(The get_links method takes a parent link from which we will search)
require 'nokogiri'
require 'open-uri'
def get_links(link)
  link = "http://#{link}"
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  array = hrefs.select { |i| i[0] == "/" }
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
end
(The search_links method takes an array from the get_links method and searches that array)
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
end
This method finds most of the links on the website, but not all of them.
What did I do wrong? Which algorithm should I use?
Some comments about your code:
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  # You can write these two lines more compactly as
  # hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
  array = hrefs.select { |i| i[0] == "/" }
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
Explanation for
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
.xpath('//a/@href') uses the attribute syntax of XPath to directly get to the href attributes of a elements
.map(&:to_s) is an abbreviated notation for .map { |item| item.to_s }
.delete_if(&:empty?) uses the same abbreviated notation
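For example, both notations produce the same result:
[:foo, :bar].map(&:to_s)              #=> ["foo", "bar"]
[:foo, :bar].map { |item| item.to_s } #=> ["foo", "bar"]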
And comments about the second function:
def search_links(urls)
  urls = get_links(link)
  # Note: `link` is not defined in this scope (the parameter
  # is `urls`), so this line raises a NameError as written.
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive, it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second level combined. (Without the "zero'th"
  # level - the URL passed into `search_links`. Unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
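A rough sketch of that Set idea, reusing get_links from above; the max_pages cap is my addition to keep the crawl finite:
require 'set'

def crawl(start_url, max_pages = 100)
  visited = Set.new
  queue = [start_url]
  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next if visited.include?(url)
    visited << url
    begin
      # Enqueue newly discovered links; Set membership replaces
      # the manual Array intersection/flatten bookkeeping.
      get_links(url).each { |found| queue << found unless visited.include?(found) }
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  visited.to_a
end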
You should probably be using mechanize:
require 'mechanize'
agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map { |a| page.uri.merge(a[:href]).to_s }
# if you want to remove links with a different host (hyperlinks?)
links.reject! { |l| URI.parse(l).host != page.uri.host }
Otherwise you'll have trouble converting relative URLs to absolute ones properly.
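To see why that merging matters, compare how the standard library's URI#merge resolves the different kinds of relative references against a base URL:
require 'uri'

base = URI.parse('http://example.com/a/b.html')
base.merge('c.html').to_s                    #=> "http://example.com/a/c.html"
base.merge('/c.html').to_s                   #=> "http://example.com/c.html"
base.merge('//cdn.example.com/c.html').to_s  #=> "http://cdn.example.com/c.html"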

Download files from URL's in array naming them by items in another array

I have a CSV with two columns, and I am pushing each column's data into an array. Column 2 contains URLs of images that I would like to download. How do I name each file with its corresponding value from column 1?
require "open-uri"
require "csv"
members = []
photos = []
CSV.foreach('members.csv', :headers => true) do |csv_obj|
  members << csv_obj[0]
  photos << csv_obj[1]
end
photos.each { |x|
  File.open({value from members array}, 'wb') do |fo|
    fo.write open(x).read
  end
}
Try this:
require "open-uri"
require "csv"
members = []
photos = []
CSV.foreach('members.csv', :headers => true) do |csv_obj|
  members << csv_obj[0]
  photos << csv_obj[1]
end

photos.each_with_index do |photo, index|
  File.open(members[index], 'wb') do |fo|
    fo.write open(photo) { |file| file.read }
  end
end
Notes:
Try to submit a snippet of the CSV file too; it will help with testing the code.
The code assumes that the members array contains file names with extensions.
The reason for using a block with open while downloading a file is to ensure the file stream gets closed.
I suggest using long, descriptive variable names; they silently document your intent and make the code very readable.
The 'wb' argument to File.open ensures the file is written in binary mode.
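As a side note, because the CSV is already read with :headers => true, the two intermediate arrays could be dropped entirely. A sketch, assuming the header row names the columns name and url (adjust to the real headers):
require 'open-uri'
require 'csv'

CSV.foreach('members.csv', :headers => true) do |row|
  # 'name' and 'url' are assumed column headers
  File.open(row['name'], 'wb') do |fo|
    fo.write open(row['url']) { |file| file.read }
  end
end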

Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data & images from webshop product pages
(with approval from the owner).
I have a working script that loops through a CSV file with 20042 product URLs to get me the data I need, which is stored in a CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread)
URL = 'http://www.sample.com/page.html'

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'

def make_absolute( href, root )
  URI.parse(root).merge(URI.parse(href)).to_s
end

Nokogiri::HTML(open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src, URL)
  File.open(File.basename(uri), 'wb') { |f| f.write(open(uri).read) }
end
That runs great for a single URL, but I'm struggling to get it working in a loop through the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new

urls = CSV.read("lotofurls.csv")

(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
It looks like all I need to figure out is how to feed the URLs to the code that saves the images, but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc)
For RMagick, you could do something like this
require 'rmagick'

images.each do |image|
  url = image.url # should be a string
  Magick::Image.read(url).first.resize_to_fill(200, 200).write(image.desired_filename)
end
That would write a 200x200px image for each URL you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the railscast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/
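If RMagick is more than you need, the Nokogiri snippet from the question can also just be dropped into the existing CSV loop. A sketch, assuming the first column of lotofurls.csv holds the product-page URLs and make_absolute is defined as above:
urls = CSV.read("lotofurls.csv")

urls.each do |row|
  url = row[0]
  doc = Nokogiri::HTML(open(url))
  # Same image-saving logic as before, fed one page URL at a time
  doc.xpath('//*[@id="zoom"]/@href').each do |src|
    uri = make_absolute(src, url)
    File.open(File.basename(uri), 'wb') { |f| f.write(open(uri).read) }
  end
end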

Parsing multiple URLs with Nokogiri?

I am trying to get a list of hrefs from a list of ten URLs and running into trouble.
Each of these blocks works separately, but when I try to combine them, I get a list of pages 1-10 and an error. What is the proper way to go about this?
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
#/ This prints all 10 of the URLs to pull page hrefs from.
1.upto(10) do |pagenum|
  url = "http://www.mywebsite.com/page/#{pagenum}"
  puts url
end

#/ Prints out all of the hrefs.
doc = Nokogiri::HTML(open(url))
doc.xpath('//h2/a/@href').each do |node|
  puts node.text
end
Here's your code, annotated:
1.upto(10) do |pagenum|
  # Create a local variable named `url`
  url = "http://www.mywebsite.com/page/#{pagenum}"
  # Print it
  puts url
end

# Open...uhm...which URL?
doc = Nokogiri::HTML(open(url))
The problem is that the url variable is "scoped" locally to the upto block. It no longer exists once you exit that block. Perhaps you wanted this:
1.upto(10) do |pagenum|
  # Create a local variable named `url`
  url = "http://www.mywebsite.com/page/#{pagenum}"
  # Print it
  puts url
  # Open and parse this URL while it is still in scope
  doc = Nokogiri::HTML(open(url))
  doc.xpath('//h2/a/@href').each do |node|
    puts node.text
  end
end
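And if you want the hrefs collected into one array instead of printed as you go, flat_map works nicely across all ten pages:
hrefs = (1..10).flat_map do |pagenum|
  url = "http://www.mywebsite.com/page/#{pagenum}"
  # Collect this page's href attributes as strings
  Nokogiri::HTML(open(url)).xpath('//h2/a/@href').map(&:text)
end
puts hrefs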
