Watir image processing - Ruby

Is there a way to get an image's extension (based on the Content-Type header) and its body in Watir?
Here is an example:
require 'watir'
zz = Watir::IE.new
zz.goto('http://flickr.com')
image = zz.image(:src => %r/l.yimg.com\/g\/images\//)
puts image
I need to get the extension and the contents (Base64-encoded, or just the location of a temp file) of that image.

require 'watir-webdriver'
require 'open-uri'

b = Watir::Browser.new :firefox
b.goto "http://altentee.com"

b.images.each do |img|
  uri = URI.parse(img.src)
  # open-uri exposes the Content-Type header on the object it yields
  open(uri) do |file|
    puts file.content_type
    File.open('/tmp/file', 'wb') { |tmp| tmp.write(file.read) }
  end
end
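Building on that, to get both things the question asks for (an extension derived from the Content-Type header, plus the body as Base64) without keeping a temp file, here is a minimal sketch; the EXTENSIONS mapping is a hand-rolled assumption covering only a few common image types:

require 'watir-webdriver'
require 'open-uri'
require 'base64'

# Hypothetical Content-Type => extension mapping; extend as needed
EXTENSIONS = {
  'image/jpeg' => 'jpg',
  'image/png'  => 'png',
  'image/gif'  => 'gif'
}

b = Watir::Browser.new :firefox
b.goto 'http://flickr.com'

img = b.image(src: %r{l\.yimg\.com/g/images/})
open(URI.parse(img.src)) do |file|
  ext     = EXTENSIONS.fetch(file.content_type, 'bin')
  encoded = Base64.encode64(file.read)
  puts "extension: #{ext}, Base64 body of #{encoded.length} chars"
end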

Related

Web Crawling. Not able to crawl the data correctly

require 'open-uri'
require 'nokogiri'

url = "https://grofers.com/cn/grocery-staples/cid/16"
response = open(url).read
parsed_data = Nokogiri::HTML(response)

results = []
content = parsed_data.css('.section-right').css('.products products--grid').each do |row|
  title = row.css('a.product__wrapper').text
  puts title
  results << content
end
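No answer is shown for this one, but two bugs stand out: the selector 'products products--grid' is missing its leading class dots (so it matches element names, not classes), and the loop pushes content (the variable still being assigned) instead of title. A minimal corrected sketch, assuming the markup is served statically; if the site builds the product list with JavaScript, open-uri will only see an empty shell and a browser driver such as Watir is needed:

require 'open-uri'
require 'nokogiri'

url = "https://grofers.com/cn/grocery-staples/cid/16"
parsed_data = Nokogiri::HTML(open(url).read)

# Class selectors need leading dots; collect the titles directly
results = parsed_data.css('.section-right .products.products--grid a.product__wrapper').map(&:text)
p results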

Nokogiri example not showing array (Ruby)

When I try to run this via the terminal I can parse and display the data, but when I type in pets_array = [] I am not seeing anything.
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
  post_name = a.text
  pets_array.push(post_name)
end
CSV.open('pets.csv', 'w') do |csv|
  csv << pets_array
end
Pry.start(binding)
To be more precise, you could access each anchor tag with the classes .result-title.hdrlnk, nested inside .result-info, .result-row, .rows and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map(&:text)
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can assign its result straight to the pets_array variable to store the text of each iterated element; there's no need to push.
If you then want to write the data stored in the array, you can write it directly; there's no need to redefine pets_array as an empty array first (which is why you were getting a blank CSV file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map(&:text)
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }
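One thing to keep in mind: csv << pets_array writes the whole array as a single CSV row. If you'd rather have one title per line, push each element as its own one-column row:

CSV.open('pets.csv', 'w') do |csv|
  pets_array.each { |pet| csv << [pet] } # one title per row
end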

Ruby, CSV and PDFs

So I am converting URLs into images and downloading them into a directory. The file can be a .jpg or a .pdf. I can successfully download the PDF, and the downloaded file is not empty, but when I try to open it Adobe Reader does not recognize it and deems it broken.
Here is a link to one of the URLs - http://www.finfo.se/www.artdb.finfo.se/cgi-bin/lankkod.dll/lev?knr=7770566&art=001317514&typ=PI
And here is the code:
require 'open-uri'
require 'tempfile'
require 'uri'
require 'csv'

DOWNLOAD_DIR = "#{Dir.pwd}/PI/"
CSV_FILE = "#{Dir.pwd}/konvertera4.csv"

def downloadFile(id, url, format)
  begin
    open("#{DOWNLOAD_DIR}#{id}.#{format}", "w") do |file|
      file << open(url).read
      puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
    end
  rescue Exception => e
    puts "#{e} #{url}"
  end
end

CSV.foreach(CSV_FILE, headers: true, col_sep: ";") do |row|
  puts row
  next unless row[0] && row[1]
  id = row[0]
  format = row[1].match(/PI\.(.+)$/)&.captures.first
  puts format
  #format = "pdf"
  #format = row[1].match(/BD\.(.+)$/)&.captures.first
  url = row[1].gsub ".pdf", ""
  downloadFile(id, url, format)
end
Try using wb instead of w:
open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb")

Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data & images from webshop product pages
(with approval from the owner).
I have a working script that loops through a CSV file with 20042 product URLs to get the data I need, which is stored in a CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread)
URL = 'http://www.sample.com/page.html'

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'

def make_absolute( href, root )
  URI.parse(root).merge(URI.parse(href)).to_s
end

Nokogiri::HTML(open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src.to_s, URL)
  File.open(File.basename(uri), 'wb') { |f| f.write(open(uri).read) }
end
That runs great for a separate URL, but I'm struggling to get it working when looping through the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'

@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new

urls = CSV.read("lotofurls.csv")

(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
It looks like all I need to figure out is how to feed the URLs to the code that saves the images, but any help would be much appreciated!
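For the wiring question itself, a minimal sketch that drops the zoom-image loop into the CSV loop; it assumes each row of lotofurls.csv has a product-page URL in its first column and that the '//*[@id="zoom"]/@href' XPath from the single-page snippet still applies:

require 'nokogiri'
require 'open-uri'
require 'csv'

def make_absolute(href, root)
  URI.parse(root).merge(URI.parse(href)).to_s
end

CSV.read("lotofurls.csv").each do |row|
  url = row[0] or next
  Nokogiri::HTML(open(url)).xpath('//*[@id="zoom"]/@href').each do |src|
    uri = make_absolute(src.to_s, url)
    # Name the local file after the last path segment of the image URL
    File.open(File.basename(uri), 'wb') { |f| f.write(open(uri).read) }
  end
end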
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc).
For RMagick, you could do something like this:
require 'rmagick'

# `images` is a placeholder for your own collection of image records
images.each do |image|
  url = image.url # should be a string
  Magick::Image.read(url).first.resize_to_fill(200, 200).write(image.desired_filename)
end
That would write a 200x200px image for each URL you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the RailsCast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/

Save image with Mechanize and Nokogiri?

I'm using Mechanize and Nokogiri to gather some data. I need to save a picture that's randomly generated at each request.
In my attempt I'm forced to download all pictures, but the only one I really want is the image located within div#specific.
In addition, is it possible to generate Base64 data from it without saving it or reloading its source?
require 'rubygems'
require 'mechanize'
require 'nokogiri'

a = Mechanize.new { |agent|
  agent.keep_alive = true
  agent.max_history = 0
}

urls = Array.new
urls.push('http://www.domain.com')

urls.each { |url|
  page = a.get(url)
  doc = Nokogiri::HTML(page.body)
  if doc.at_css('#specific')
    page.images.each do |img|
      img.fetch.save('picture.png')
    end
  end
}
To fetch the images from the specific location:
agent = Mechanize.new
page = agent.get('http://www.domain.com')
images = page.search("#specific img")
To save the image:
agent.get(images.first.attributes["src"]).save "path/to/folder/image_name.jpg"
To get the image encoded without saving it:
encoded_image = Base64.encode64 agent.get(images.first.attributes["src"]).body_io.string
I ran this just to make sure that the image that was encoded can be decoded back:
File.open("images/image_name.jpg", "wb") {|f| f.write(Base64.decode64(encoded_image))}
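Putting the pieces together for the original constraints (only the image inside div#specific, Base64 without touching disk), a minimal sketch; the URL and markup are placeholders, and it assumes the img src resolves against the current page:

require 'mechanize'
require 'base64'

agent = Mechanize.new
page = agent.get('http://www.domain.com')

# Grab only the randomly generated image inside div#specific
if (img = page.search('div#specific img').first)
  body = agent.get_file(img['src']) # raw bytes, nothing written to disk
  encoded_image = Base64.strict_encode64(body)
  puts encoded_image[0, 60]
end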
