Trying to scrape images from the https://en.wikipedia.org/ website using the mechanize gem. When I try to calculate the image size, I get Mechanize::ResponseCodeError (404 => Net::HTTPNotFound for https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/FP2A3620_%252823497688248%2529.jpg/119px-FP2A3620_%252823497688248%2529.jpg -- unhandled response).
Here is my code:
def images
  agent = Mechanize.new
  page = agent.get("https://en.wikipedia.org/")
  page.images.each do |image|
    puts image.url
    size = agent.head(image)["content-length"].to_i / 1000
  end
end
Any help is appreciated.
I looked up that image on Wikipedia and it renders just fine. I opened it in a new tab and compared the URL from the browser to what Mechanize has.
Unescaping the URL did the trick:
image_url = CGI.unescape(image.url.to_s)
size = agent.head(image_url)["content-length"].to_i/1000
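Putting it all together, the fixed method might look like this (a sketch; the /1000 size division and the output are just carried over from the question):
require 'mechanize'
require 'cgi'

def images
  agent = Mechanize.new
  page = agent.get("https://en.wikipedia.org/")
  page.images.each do |image|
    # Wikipedia thumbnail URLs can come back double-escaped
    # (%2528 instead of %28), so unescape before the HEAD request.
    image_url = CGI.unescape(image.url.to_s)
    size = agent.head(image_url)["content-length"].to_i / 1000
    puts "#{image_url}: #{size} KB"
  end
end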
I need to retrieve all the images present under a specific div using Ruby and Mechanize. The relevant DOM structure is as follows:
<div id="item_img">
<a href="JavaScript:imageview('000000018693.jpg')">
<img src="/shop/doubleimages/0000000186932.jpg" border="0" width="500" height="500" alt="関係ないコメント z1808">
</a>
<img src="/shop/doubleimages/000000018693_1.jpg"><br><br>
<img src="/shop/doubleimages/000000018693_2.jpg"><br><br>
<img src="/shop/doubleimages/000000018693_3.jpg"><br><br>
</div>
So, I initially got all the images after spinning up a new agent by doing:
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://double14.com/shopdetail/000000018693/')
puts page.images
This was nice, but it returned every image on the page (as it should) and stripped out the div id above them, making it impossible to tell what comes from where. As a result, I had every image on the page (no bueno).
I got it down to this:
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://double14.com/shopdetail/000000018693/')
node = page.search "#item_img img"
node.each do |n|
  puts n.attributes['src']
end
Unfortunately, that outputs the following:
/shop/doubleimages/0000000186932.jpg
/shop/doubleimages/000000018693_1.jpg
/shop/doubleimages/000000018693_2.jpg
/shop/doubleimages/000000018693_3.jpg
Is there a way to take the full URL and use that instead? Ultimately, I would like to save these images to a database, but I need the full URL to serialize them to disk for later upload.
Yes. You can get the full URL for the images with the #resolve method:
require 'mechanize'
mechanize = Mechanize.new
mechanize.user_agent_alias = 'Mac Safari'
page = mechanize.get('http://double14.com/shopdetail/000000018693/')
page.search('#item_img img').each do |img|
  puts mechanize.resolve(img['src'])
end
Alternatively you can use the #download method to download them directly.
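For example, combining #resolve with #download might look like this (a sketch; the filename choice here is just one option, not part of the original answer):
page.search('#item_img img').each do |img|
  url = mechanize.resolve(img['src'])
  # Save each image under its original basename in the current directory.
  mechanize.download(url, File.basename(url.to_s))
end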
This is how I did it for a collection of images. In this case the base_uri is the URL that you are passing to get. Let me know if you have any questions.
require 'uri'

def self.qualify_images(base_uri, images)
  images.map do |image|
    next unless has_src?(image)
    qualify_image(base_uri, image)
  end.compact
end

def self.qualify_image(base_uri, image)
  src = image.attributes["src"].value
  if src =~ /^\/\//
    # Protocol-relative (//host/path): prepend the scheme only.
    result = "#{scheme(base_uri)}#{src}"
  elsif src =~ /^\//
    # Root-relative (/path): prepend the base URI.
    result = "#{base_uri}#{src}"
  else
    # Already absolute, or something unusable.
    result = src
  end
  http?(result) ? result : nil
end

def self.has_src?(image)
  image.attributes["src"].value
rescue NoMethodError
  false
end

def self.scheme(uri)
  uri = URI.parse(uri)
  "#{uri.scheme}:"
end

def self.http?(uri)
  uri = URI.parse(uri)
  uri.kind_of?(URI::HTTP)
rescue URI::InvalidURIError
  false
end
This will ensure a fully qualified URI for each image.
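Assuming these are defined as class methods on some helper (ImageHelper is a made-up name here), usage might look like:
urls = ImageHelper.qualify_images(page.uri.to_s, page.search('#item_img img'))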
It will look something like:
page.search("#item_img img").each do |img|
puts page.uri.merge(img[:src]).to_s
end
I'm trying to download all full-res images from a site by checking for image links, visiting them, and downloading the full image.
I have managed to make it kinda work. I can fetch all links and download the images from i.imgur. However, I want to make it work with more sites and normal imgur albums, and also without wget (which I am using now as shown below).
This is the code I'm currently playing around with (don't judge, it's only test code):
require 'mechanize'
require 'uri'
def get_images
  crawler = Mechanize.new
  crawler.get("http://www.reddit.com/r/climbing/new/?count=25&after=t3_39qccc")
         .links_with(href: %r{i.imgur})
end

def download_images
  img_links = get_images
  clean_links = []
  img_links.each do |link|
    current_link = link.uri.to_s
    unless current_link.include?("domain") || clean_links.include?(current_link)
      clean_links << current_link
    end
  end
  p clean_links
  clean_links.each do |link|
    system("wget -P ./images -A jpeg,jpg,bmp,gif,png #{link}")
  end
end

download_images
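To drop the wget dependency, Mechanize can save the files itself. A rough sketch (it assumes each link points directly at an image file; imgur album pages would still need extra handling):
require 'fileutils'

def save_images(links, dir = './images')
  FileUtils.mkdir_p(dir)
  crawler = Mechanize.new
  links.each do |link|
    # Only fetch URLs that look like image files.
    next unless link =~ /\.(jpe?g|bmp|gif|png)\z/i
    crawler.get(link).save(File.join(dir, File.basename(URI.parse(link).path)))
  end
end
Calling save_images(clean_links) inside download_images would then replace the wget loop.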
I am trying to scrape multiple pages from a website. I want to scrape a page, then click on next, get that page, and repeat until I hit the end.
I wrote this so far:
page = agent.submit(form, form.buttons.first) # submitting a form

while lien = page.link_with(:text => 'Next')
  # while I have a next link on page, keep scraping
  html_body = Nokogiri::HTML(body)
  links = html_body.css('.list').xpath("//table/tbody/tr/td[2]/a[1]")
  links.each do |link|
    purelink = link['href']
    puts purelink[/codeClub=([^&]*)/].gsub('codeClub=', '')
    lien.click
  end
end
Unfortunately, with this script I keep on scraping the same page in an infinite loop... How can I achieve what I want to do?
I would try this: replace lien.click with page = lien.click.
It should look more like this:
page = form.submit form.button
scrape page

while link = page.link_with :text => 'Next'
  page = link.click
  scrape page
end
Also, you don't need to parse the page body with Nokogiri; Mechanize already does that for you.
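Under the same assumptions as the question's selectors, the whole loop could be sketched like this (scrape is just the extraction logic moved into a method):
def scrape(page)
  page.search('.list').xpath('//table/tbody/tr/td[2]/a[1]').each do |link|
    # Pull the codeClub parameter straight out of the href.
    code = link['href'][/codeClub=([^&]*)/, 1]
    puts code if code
  end
end

page = agent.submit(form, form.buttons.first)
scrape page
while link = page.link_with(:text => 'Next')
  page = link.click
  scrape page
end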
I'm using Mechanize and Nokogiri to gather some data. I need to save a picture that's randomly generated at each request.
In my attempt I'm forced to download all pictures, but the only one I really want is the image located within div#specific.
In addition, is it possible to generate Base64 data from it without saving it or reloading its source?
require 'rubygems'
require 'mechanize'
require 'nokogiri'

a = Mechanize.new { |agent|
  agent.keep_alive = true
  agent.max_history = 0
}

urls = ['http://www.domain.com']

urls.each do |url|
  page = a.get(url)
  doc = Nokogiri::HTML(page.body)
  if doc.at_css('#specific')
    page.images.each do |img|
      img.fetch.save('picture.png')
    end
  end
end
To fetch the images from the specific location:
agent = Mechanize.new
page = agent.get('http://www.domain.com')
images = page.search("#specific img")
To save the image:
agent.get(images.first.attributes["src"]).save "path/to/folder/image_name.jpg"
To get the image encoded without saving it (require 'base64' first):
encoded_image = Base64.encode64 agent.get(images.first.attributes["src"]).body_io.string
I ran this just to make sure that the image that was encoded can be decoded back:
File.open("images/image_name.jpg", "wb") {|f| f.write(Base64.decode64(encoded_image))}
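Putting the pieces together, here is a sketch that grabs only the image inside div#specific and encodes it without touching disk (it uses Mechanize::File#body rather than body_io, since body_io may not exist on plain file responses in every Mechanize version):
require 'mechanize'
require 'base64'

agent = Mechanize.new
page = agent.get('http://www.domain.com')
if img = page.at('#specific img')
  # Fetch the image and Base64-encode the raw bytes in memory.
  encoded_image = Base64.encode64(agent.get(img['src']).body)
end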
I'm having trouble with the Mechanize gem: how do I convert a Mechanize::File into a Mechanize::Page?
Here's my piece of code:
link = page.link_with(:href => %r{/en/users}).click
When the users link is clicked it goes to the page with the list of users. Now I want to click the first user, but I can't, because click returns a Mechanize::File object.
Any help or suggestions would be great, thanks.
Mechanize uses Content-Type to determine how the resource should be handled. Occasionally websites will not set the mime-types for their resources. Mechanize::File is the default for unset Content-Type.
If you are only dealing with 'text/html' you can follow Jimm Stout's suggestion of using post_connect_hooks:
agent = Mechanize.new do |a|
  a.post_connect_hooks << ->(_, _, response, _) do
    if response.content_type.empty?
      response.content_type = 'text/html'
    end
  end
end
Just parse the body with Nokogiri:
link = page.link_with(:href => %r{/en/users}).click
doc = Nokogiri::HTML link.body
agent.get doc.at('a')[:href]
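Another option, if every response from a site should be treated as HTML, is Mechanize's pluggable parser registry (a global setting, so use with care):
agent = Mechanize.new
# Parse responses with an unknown Content-Type as regular pages.
agent.pluggable_parser.default = Mechanize::Page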