Retrieving images under a specific Div with Mechanize and Ruby

I need to retrieve all the images present under a specific div using Ruby and Mechanize. The relevant DOM structure is as follows:
<div id="item_img">
<a href="JavaScript:imageview('000000018693.jpg')">
<img src="/shop/doubleimages/0000000186932.jpg" border="0" width="500" height="500" alt="関係ないコメント z1808">
</a>
<img src="/shop/doubleimages/000000018693_1.jpg"><br><br>
<img src="/shop/doubleimages/000000018693_2.jpg"><br><br>
<img src="/shop/doubleimages/000000018693_3.jpg"><br><br>
</div>
So, I initially got all the images after spinning up a new agent by doing:
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get ('http://double14.com/shopdetail/000000018693/')
puts page.images
This was nice, but it returned every image on the page (as it should), and it strips out the enclosing div id, making it impossible to tell which image came from where. As a result, I had every image on the page (no bueno).
I got it down to this:
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get ('http://double14.com/shopdetail/000000018693/')
node = page.search "#item_img img"
node.each do |n|
puts n.attributes['src']
end
Unfortunately, that outputs the following -
/shop/doubleimages/0000000186932.jpg
/shop/doubleimages/000000018693_1.jpg
/shop/doubleimages/000000018693_2.jpg
/shop/doubleimages/000000018693_3.jpg
Is there a way to take the full URL and use that instead? Ultimately, I would like to save these images to a database, but I need the full URL to serialize them to disk for later upload.

Yes. You can get the full URL for the images with the #resolve method:
require 'mechanize'
mechanize = Mechanize.new
mechanize.user_agent_alias = 'Mac Safari'
page = mechanize.get('http://double14.com/shopdetail/000000018693/')
page.search('#item_img img').each do |img|
puts mechanize.resolve(img['src'])
end
Alternatively you can use the #download method to download them directly.
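For example, a minimal sketch (my own, not part of the answer) that resolves each image URL and saves each file next to the script; the filename handling here is just an assumption:
page.search('#item_img img').each do |img|
  url = mechanize.resolve(img['src'])              # absolute URI for the image
  mechanize.download(url, File.basename(url.path)) # fetch the image and write it to disk
end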

This is how I did it for a collection of images. In this case, base_uri is the URL that you are passing to get. Let me know if you have any questions.
require 'uri'

# Build a list of fully qualified image URLs, skipping images without a usable src.
def self.qualify_images(base_uri, images)
  images.map do |image|
    next unless has_src?(image)
    qualify_image(base_uri, image)
  end.compact
end

# Turn a single src attribute into an absolute URL.
def self.qualify_image(base_uri, image)
  src = image.attributes["src"].value
  if src =~ /^\/\//          # protocol-relative, e.g. //example.com/a.jpg
    result = "#{scheme(base_uri)}#{src}"
  elsif src =~ /^\//         # root-relative, e.g. /images/a.jpg
    result = "#{base_uri}#{src}"
  else                       # already absolute (or left as-is)
    result = src
  end
  http?(result) ? result : nil
end

# Returns the src value if present, false otherwise.
def self.has_src?(image)
  image.attributes["src"].value
rescue NoMethodError
  false
end

def self.scheme(uri)
  uri = URI.parse(uri)
  "#{uri.scheme}:"
end

def self.http?(uri)
  uri = URI.parse(uri)
  uri.kind_of?(URI::HTTP)
rescue URI::InvalidURIError
  false
end
This will ensure a fully qualified URI for each image.
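Usage would look something like this (my own sketch; ImageScraper is a hypothetical class wrapping the methods above):
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://double14.com/shopdetail/000000018693/')
images = page.search('#item_img img')
puts ImageScraper.qualify_images(page.uri.to_s, images)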

It will look something like:
page.search("#item_img img").each do |img|
puts page.uri.merge(img[:src]).to_s
end

Related

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function; url is a string with a web address. The line I previously mentioned works exactly as intended when used outside of the function, but not inside it. When it gets to that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

# Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore(START_URL)
Ruby has no ++ operator: `CRAWLED_PAGES_COUNTER++` followed by a newline is parsed as `CRAWLED_PAGES_COUNTER + +currentPage`, which is where the `+@` NoMethodError on the Nokogiri document comes from. You also can't assign to a constant inside a method, which is why the counter becomes a global variable incremented with `+= 1`:
require 'nokogiri'
require 'open-uri'

# Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map { |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial-strength spider is much more involved: robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages (either via HTTP timeouts or JavaScript redirects) is a fun task, and dealing with malformed pages is a challenge...
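As an aside (not part of the spider above): if you base the spider on Mechanize instead of open-uri, it can honor robots.txt for you:
require 'mechanize'

agent = Mechanize.new
agent.robots = true   # refuse to fetch URLs disallowed by the site's robots.txt
page = agent.get('http://example.com/')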

Watir or Selenium webdriver - Find for IMG SRC duplicated

Is there any possible way to index/list and compare <img src=""> values using Watir or Selenium WebDriver?
Update #1
I've successfully managed to progress on the general script for finding the right <div> that contains the pictures:
require 'watir-webdriver'
require 'selenium-webdriver'
b = Watir::Browser.new :firefox
$i = 1
(1..1000).each do |i|
b.goto 'http://example.com'
b.div(:id, 'pic_container').wait_until_present
puts 'div present'
begin
if
else
end
end
b.close
There will be more code; the only thing I can't resolve is how to enumerate all the available pictures, compare their sources, and output the results.
Update #2
Thanks to both JustinKo and Carldmitch for their answers. I've got to this now:
require 'watir-webdriver'
require 'selenium-webdriver'

b = Watir::Browser.new :firefox
b.goto 'https://trafficmonsoon.com'

begin
  Watir::Wait.until { b.url == "http://example.com" }
  b.a(:href, "http://example.com/img").wait_until_present
  b.a(:href, "http://example.com/img").click
  Watir::Wait.until { b.url == "http://example.com/img" }
  b.driver.manage.timeouts.implicit_wait = 10
  b.a(:class, "btn").click
end

$i = 1
(1..1000).each do |i|
  b.driver.manage.timeouts.implicit_wait = 30
  pics_set = b.div(:id, 'pics_container').images
  pics_array = []
  pics_set.each_with_index do |image|
    pics_array.push(image.current_src)
  end
  puts pics_array.find_all { |e| pics_array.rindex(e) != pics_array.index(e) }.uniq
end
b.close
The only problem here is that it is not showing which picture is duplicated; instead, it shows all the img src values except the duplicated one. Any hint on this?
Thanks in advance.
Update #3
I got it working; it prints out the duplicated img src, but I can't use the output data to do web browser interactions (clicks & drags).
Update #4
I've successfully managed to interact with the data. The only thing I want to know: is there any way to pick one or the other duplicated picture? Since both have the same img src, it's impossible to click or drag from that attribute.
Here is the code that I've got by now:
require 'sub'
require 'watir-webdriver'
require 'selenium-webdriver'

b = Watir::Browser.new :firefox
b.goto 'https://example'

begin
  Watir::Wait.until { b.url == "http://example.com/img" }
  b.a(:href, "http://example.com/imgs").wait_until_present
  b.a(:href, "http://example.com/imgs").click
  Watir::Wait.until { b.url == "http://example.com/imgs" }
  b.driver.manage.timeouts.implicit_wait = 10
  b.a(:class, "btn btn-xs btn-danger").click
end

b.driver.manage.timeouts.implicit_wait = 30
pics_set = b.div(:id, 'site_loader').images
pics_array = []
pics_set.each_with_index do |image|
  pics_array.push(image.current_src)
end
duplicated = pics_array.find_all { |e| pics_array.rindex(e) != pics_array.index(e) }.uniq
duplicated[0].sub!("http://example.com", ".")
b.img(:src, duplicated).click
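One way to pick a specific copy when two images share the same src (my own sketch, not from the thread; it assumes duplicated[0] still matches the src exactly as it appears in the DOM, and uses the pic_container div from the markup below):
dupes = b.div(:id, 'pic_container').imgs(src: duplicated[0])
dupes[0].click    # first occurrence
# dupes[1].click  # second occurrence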
Update #5
Here is an example of the div I'm digging into:
<div id="pic_container">
<img src="./images/test/3.png" style="cursor:pointer;width:64px" onclick="checkClick ("7hva9f")">
<img src="./images/test/5.png" style="cursor:pointer;width:59px" onclick="checkClick ("xt0nnc")">
<img src="./images/test/5.png" style="cursor:pointer;width:67px" onclick="checkClick ("1tyz9b")">
<img src="./images/test/1.png" style="cursor:pointer;width:67px" onclick="checkClick ("300yp7")">
<img src="./images/test/7.png" style="cursor:pointer;width:67px" onclick="checkClick ("pzxgyh")">
</div>
You can get all of the images in a browser or element by retrieving an ImageCollection. To get the collection you can either use the imgs or images method.
All of the images in the "pic_container" div can be retrieved by:
b.div(:id, 'pic_container').images
The ImageCollection is enumerable, which means you can get an array of the src attributes using:
b.div(:id, 'pic_container').images.map(&:src)
#=> ['src1', 'src2', 'etc']
Or if you need to do more custom logic per image, you can iterate through each one using each or each_with_index (if you also want an index). For example:
b.div(:id, 'pic_container').images.each_with_index do |image, i|
puts image.src
puts i
end
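To list only the src values that occur more than once (my own addition, tying this back to the original question), you could group them:
srcs = b.div(:id, 'pic_container').images.map(&:src)
duplicated = srcs.group_by { |s| s }.select { |_, occurrences| occurrences.size > 1 }.keys
puts duplicated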
When I'm doing things other than driving the browser, I like to just use Ruby.
require 'watir-webdriver'

browser = Watir::Browser.new :chrome
browser.goto 'http://example.com'

# collects all images on page
image_collection = browser.images

# creates array of the 'src' urls
image_array = []
image_collection.each do |image|
  image_array.push(image.current_src)
end

# outputs urls if any duplicates are found in the array
puts image_array.find_all { |e| image_array.rindex(e) != image_array.index(e) }.uniq

browser.close

Downloading images with Mechanize gem

I'm trying to download all full-res images from a site by checking for image links, visiting them, and downloading the full image.
I have managed to make it kind of work: I can fetch all the links and download the images from i.imgur. However, I want to make it work with more sites and normal imgur albums, and also without wget (which I am using now, as shown below).
This is the code I'm currently playing around with (don't judge, it's only test code):
require 'mechanize'
require 'uri'

def get_images()
  crawler = Mechanize.new
  img_links = crawler.get("http://www.reddit.com/r/climbing/new/?count=25&after=t3_39qccc").links_with(href: %r{i.imgur})
  return img_links
end

def download_images()
  img_links = get_images()
  crawler = Mechanize.new
  clean_links = []
  img_links.each do |link|
    current_link = link.uri.to_s
    unless current_link.include?("domain")
      unless clean_links.include?(current_link)
        clean_links << current_link
      end
    end
  end
  p clean_links
  clean_links.each do |link|
    system("wget -P ./images -A jpeg,jpg,bmp,gif,png #{link}")
  end
end

download_images()
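For what it's worth, one way to drop the wget call (no answer is quoted here, so this is only a sketch) is to let Mechanize write the files itself by replacing the last loop inside download_images; it assumes each link points directly at an image file:
require 'fileutils'

FileUtils.mkdir_p('./images')
clean_links.each do |link|
  filename = File.join('./images', File.basename(URI.parse(link).path))
  crawler.download(link, filename)   # Mechanize#download writes the response body to disk
end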

Ruby: How do I parse links with Nokogiri with content/text all the same?

What I am trying to do: parse links from a website (http://nytm.org/made-in-nyc) that all have the exact same text, "(hiring)". Then I will write a list of the links to a file, 'jobs.html'. (If it is a violation to publish these websites I will quickly take down the direct URL. I thought it might be useful as a reference to what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
  <ol>
    <li>#waywire</li>
    <li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
    <li>Adafruit Industries</li>
    <li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
    etc...
What I have tried:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.each { |link| puts link['href'] }
  begin
    file = File.open("./jobs.html", "w")
    file.write("#{results}")
  rescue IOError => e
  ensure
    file.close unless file == nil
  end
  puts hire_links
end

find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file but it is in XML format? Not sure how to target just the value and create a link from that. Not sure where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.map { |link| link['href'] }
  File.write('./jobs.html', results.join("\n"))
end

find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
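Since the output file is named jobs.html, you may want real anchor tags rather than bare URLs; a small variation on the above (my own, not from the original answer):
html = results.map { |href| "<a href=\"#{href}\">#{href}</a>" }.join("<br>\n")
File.write('./jobs.html', html)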
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /(hiring)/)
Then you will have an array of link objects that you can get whatever info you want. You can also use the link.click method that Mechanize provides.
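For instance (a sketch): once you have the link objects, you can read their URLs or follow one of them:
hiring_urls = links.map(&:href)                      # URL of each matching link
first_page  = links.first.click unless links.empty?  # Mechanize fetches the linked page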

How would I search for `/aems/dic/list` in a list of anchors?

I have the below code, which is part of an HTML page:
<td>YouTube</td>
<td><a data-category="news" href="http://kathack.com/party/aems/dic/list">Reddit</a></td>
<td>Kathack</td>
<td><a data-category="news" href="http://www.nytimes.com">New York Times</a></td>
Now, how would I search for /aems/dic/list and get the full URL stored?
So, with Nokogiri, something like this:
fragment = Nokogiri::HTML::DocumentFragment.parse text
fragment.css("a").each do |link|
  href = link['href']
  return href if href =~ /\/aems\/dic\/list/
end
Let's say you have a Mechanize::Page object page:
page.at('a[href*="/aems/dic/list"]')[:href]
#=> "http://kathack.com/party/aems/dic/list"
Update
For a longer example:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://www.example.com/'
page.at('a[href*="/aems/dic/list"]')[:href]
#=> "http://kathack.com/party/aems/dic/list"
