Downloading images with Mechanize gem - ruby

I'm trying to download all full-resolution images from a site by collecting the image links, visiting each one, and downloading the full image.
I have managed to make it mostly work: I can fetch all the links and download the images from i.imgur. However, I want it to work with more sites and with regular imgur albums, and also without wget (which I am using now, as shown below).
This is the code I'm currently playing around with (don't judge, it's only test code):
require 'mechanize'
require 'uri'

def get_images
  crawler = Mechanize.new
  page = crawler.get("http://www.reddit.com/r/climbing/new/?count=25&after=t3_39qccc")
  page.links_with(href: %r{i.imgur})
end

def download_images
  img_links = get_images
  clean_links = []

  img_links.each do |link|
    current_link = link.uri.to_s
    unless current_link.include?("domain") || clean_links.include?(current_link)
      clean_links << current_link
    end
  end

  p clean_links

  clean_links.each do |link|
    system("wget -P ./images -A jpeg,jpg,bmp,gif,png #{link}")
  end
end

download_images
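To drop wget entirely, Mechanize can fetch and save each file itself. A minimal sketch (assuming, as in the code above, that clean_links holds direct i.imgur file URLs rather than album pages):

require 'mechanize'
require 'fileutils'
require 'uri'

FileUtils.mkdir_p('./images')
crawler = Mechanize.new

clean_links.each do |link|
  filename = File.basename(URI(link).path)
  # Mechanize fetches the body and writes it to disk in one step.
  crawler.get(link).save(File.join('./images', filename))
end

Handling full imgur albums would additionally mean fetching each album page first and pulling the individual image links out of it before saving.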

Related

Ruby crawl site, add URL parameter

I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
I tried this, but it isn't working; it errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
spider.every_url do |url|
link= url+"?foo=bar"
response = open(link).read
end
end
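As an aside, plain string concatenation will mangle any URL that already carries a query string. A sketch of building the link with the URI API instead (still using the placeholder foo=bar param; this is not a fix for the 405 itself, which suggests some of the crawled URLs simply reject GET requests):

require "spidr"
require "open-uri"
require "uri"

Spidr.site('http://example.com') do |spider|
  spider.every_url do |url|
    uri = URI(url.to_s)
    # Merge the new param with any query string the page already has.
    params = URI.decode_www_form(uri.query || "") << ["foo", "bar"]
    uri.query = URI.encode_www_form(params)
    response = open(uri).read
  end
end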
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran through those. Got the job done.
require 'csv'
require 'open-uri'

CSV.foreach(File.path("the-links.csv")) do |row|
  link = "http://www.example.com" + row[0] + "?foo=bar"
  encoded_url = URI.encode(link)
  response = open(encoded_url).read
  puts encoded_url
  puts
end
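One caveat with that workaround: URI.encode was deprecated for years and removed outright in Ruby 3.0, and Kernel#open on URLs is likewise discouraged. A sketch of the same loop on a current Ruby (same hypothetical CSV and param):

require 'csv'
require 'open-uri'
require 'uri'

CSV.foreach(File.path("the-links.csv")) do |row|
  link = "http://www.example.com" + row[0] + "?foo=bar"
  # URI::DEFAULT_PARSER.escape is one stdlib replacement for URI.encode.
  encoded_url = URI::DEFAULT_PARSER.escape(link)
  response = URI.open(encoded_url).read
  puts encoded_url
end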

Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data and images from webshop product pages (with approval from the owner).
I have a working script that loops through a CSV file with 20,042 product URLs to get me the data I need, which is then stored in a CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread):
URL = 'http://www.sample.com/page.html'

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'

def make_absolute( href, root )
  URI.parse(root).merge(URI.parse(href)).to_s
end

Nokogiri::HTML(open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src.to_s, URL) # the XPath yields attribute nodes, so convert to string
  File.open(File.basename(uri), 'wb') { |f| f.write(open(uri).read) }
end
That runs great for a single URL, but I'm struggling to get it to loop through the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'

@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new

urls = CSV.read("lotofurls.csv")

(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
It looks like all I need to figure out is how to feed the URLs into the image-saving code, but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).
For RMagick, you could do something like this:
require 'rmagick'

images.each do |image|
  url = image.url # should be a string
  Magick::Image.read(url).first.resize_to_fill(200, 200).write(image.desired_filename)
end
That would write a 200x200px image for each URL you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the RailsCast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/
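If the images only need to be saved as-is (no resizing), the two snippets from the question can be wired together directly. A sketch, reusing the question's lotofurls.csv and the #zoom XPath:

require 'csv'
require 'nokogiri'
require 'open-uri'
require 'uri'

def make_absolute(href, root)
  URI.parse(root).merge(URI.parse(href)).to_s
end

CSV.read("lotofurls.csv").each do |row|
  url = row[0]
  doc = Nokogiri::HTML(open(url))
  # ... scrape prices, titles, etc. from doc here ...
  doc.xpath('//*[@id="zoom"]/@href').each do |src|
    image_uri = make_absolute(src.to_s, url)
    File.open(File.basename(image_uri), 'wb') { |f| f.write(open(image_uri).read) }
  end
end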

Iterate through pages nokogiri get link address

I'm trying to get images or image addresses from the website below. It works for the one page I put below: "http://www.1stsourceservall.com/Category/Accessories". However, once it finishes with that page, I want it to click the next-page link and cycle through all 20+ pages. How would I do that?
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.1stsourceservall.com/Category/Accessories"

while url
  doc = Nokogiri::HTML(open(url))
  puts doc.css(".productImageMed")

  link = doc.css('.pagination a')
  url = link[0] && link[0]['href'] #=> url is nil if no link is found on the page
end

Save image with Mechanize and Nokogiri?

I'm using Mechanize and Nokogiri to gather some data. I need to save a picture that's randomly generated on each request.
In my attempt, I'm forced to download all the pictures, but the only one I really want is the image located within div#specific.
In addition, is it possible to generate Base64 data from it without saving it or reloading its source?
require 'rubygems'
require 'mechanize'
require 'nokogiri'

a = Mechanize.new { |agent|
  agent.keep_alive = true
  agent.max_history = 0
}

urls = ['http://www.domain.com']

urls.each { |url|
  page = a.get(url)
  doc = Nokogiri::HTML(page.body)
  if doc.at_css('#specific')
    page.images.each do |img|
      img.fetch.save('picture.png')
    end
  end
}
To fetch the images from the specific location:
agent = Mechanize.new
page = agent.get('http://www.domain.com')
images = page.search("#specific img")
To save the image:
agent.get(images.first.attributes["src"]).save "path/to/folder/image_name.jpg"
To get the image encoded without saving it (Mechanize::File exposes the raw bytes as body):
require 'base64'
encoded_image = Base64.encode64 agent.get(images.first.attributes["src"]).body
I ran this just to make sure that the encoded image could be decoded back:
File.open("images/image_name.jpg", "wb") {|f| f.write(Base64.decode64(encoded_image))}

Ruby: Script won't grab URL from youtube correctly

I found this script on Pastebin; it's an IRC bot that will find YouTube videos for you. I haven't touched it at all (bar the channel settings), and it works well, except that it won't grab the URL of the video that was searched for. This code is not mine! I just would like to get it working, as it would be quite useful!
#!/usr/bin/env ruby
require 'rubygems'
require 'cinch'
require 'nokogiri'
require 'open-uri'
require 'cgi'

bot = Cinch::Bot.new do
  configure do |c|
    c.server = "irc.freenode.net"
    c.nick = "YouTubeBot"
    c.channels = ["#test"]
  end

  helpers do
    # Grabs the first result and returns the TITLE, LINK, DESCRIPTION
    def youtube(query)
      doc = Nokogiri::HTML(open("http://www.youtube.com/results?q=#{CGI.escape(query)}"))
      result = doc.css('div#search-results div.result-item-main-content')[0]
      title = result.at('h3').text
      link = "www.youtube.com" + "#{result.at('a')[:href]}"
      desc = result.at('p.description').text
    rescue
      "No results found"
    else
      CGI.unescape_html "#{title} - #{desc} - #{link}"
    end
  end

  on :channel, /^!youtube (.+)/ do |m, query|
    m.reply youtube(query)
  end

  on :channel, "polkabot quit" do |m|
    m.channel.part("bye")
  end
end

bot.start
Currently, if I use the command
!youtube asdf
I get this returned:
19:25 < YouTubeBot> asdfmovie - Worldwide Store www.cafepress.com ...
asdfmovie cakebomb tomska
epikkufeiru asdf movie ... tomska ... [Baby Giggling] Man: Got your nose! [Baby ...
- www.youtube.com#
As you can see, the URL is just www.youtube.com#, not the URL of the video.
Thanks a lot!
This is a selector issue. It looks like it's the third 'a' element that has the href you want, so try:
link = "www.youtube.com#{result.css('a')[2][:href]}"
