Is there any way to index/list and compare <img src=""> values using Watir or Selenium WebDriver?
Update #1
I've successfully made progress on the general script for finding the right <div> that contains the pictures:
require 'watir-webdriver'
require 'selenium-webdriver'
b = Watir::Browser.new :firefox
(1..1000).each do |i|
b.goto 'http://example.com'
b.div(:id, 'pic_container').wait_until_present
puts 'div present'
# TODO: enumerate the images in the div and compare their src values here
end
b.close
There will be more code; the only thing I can't work out is how to enumerate all the available pictures, compare their sources, and output the results.
Update #2
Thanks to both JustinKo and Carldmitch for their answers. This is what I have now:
require 'watir-webdriver'
require 'selenium-webdriver'
b = Watir::Browser.new :firefox
b.goto 'https://trafficmonsoon.com'
begin
Watir::Wait.until { b.url == "http://example.com" }
b.a(:href, "http://example.com/img").wait_until_present
b.a(:href, "http://example.com/img").click
Watir::Wait.until { b.url == "http://example.com/img" }
b.driver.manage.timeouts.implicit_wait = 10
b.a(:class, "btn").click
end
$i = 1
(1..1000).each do |i|
b.driver.manage.timeouts.implicit_wait = 30
pics_set = b.div(:id, 'pics_container').images
pics_array = []
pics_set.each do |image|
pics_array.push(image.current_src)
end
puts pics_array.find_all {|e| pics_array.rindex(e) != pics_array.index(e) }.uniq
end
b.close
The only problem is that it does not show which picture is duplicated; instead, it prints all the img src values except the duplicated one. Any hints on this?
Thanks in advance.
Update #3
I got it working: it prints out the duplicated img src, but I can't use the output data for browser interactions (clicks and drags).
Update #4
I've successfully managed to interact with the data. The only thing I want to know is whether there is any way to pick one or the other of the duplicated pictures. Since both have the same img src, it's impossible to click or drag based on that attribute.
Here is the code that I've got so far:
require 'sub'
require 'watir-webdriver'
require 'selenium-webdriver'
b = Watir::Browser.new :firefox
b.goto 'https://example'
begin
Watir::Wait.until { b.url == "http://example.com/img" }
b.a(:href, "http://example.com/imgs").wait_until_present
b.a(:href, "http://example.com/imgs").click
Watir::Wait.until { b.url == "http://example.com/imgs" }
b.driver.manage.timeouts.implicit_wait = 10
b.a(:class, "btn btn-xs btn-danger").click
end
b.driver.manage.timeouts.implicit_wait = 30
pics_set = b.div(:id, 'site_loader').images
pics_array = []
pics_set.each do |image|
pics_array.push(image.current_src)
end
duplicated = pics_array.find_all {|e| pics_array.rindex(e) != pics_array.index(e) }.uniq
duplicated[0].sub!("http://example.com", ".")
b.img(:src, duplicated[0]).click
Update #5
Here is an example of the div I'm digging into:
<div id="pic_container">
<img src="./images/test/3.png" style="cursor:pointer;width:64px" onclick="checkClick('7hva9f')">
<img src="./images/test/5.png" style="cursor:pointer;width:59px" onclick="checkClick('xt0nnc')">
<img src="./images/test/5.png" style="cursor:pointer;width:67px" onclick="checkClick('1tyz9b')">
<img src="./images/test/1.png" style="cursor:pointer;width:67px" onclick="checkClick('300yp7')">
<img src="./images/test/7.png" style="cursor:pointer;width:67px" onclick="checkClick('pzxgyh')">
</div>
You can get all of the images in a browser or element by retrieving an ImageCollection. To get the collection you can either use the imgs or images method.
All of the images in the "pic_container" div can be retrieved by:
b.div(:id, 'pic_container').images
The ImageCollection is enumerable, which means you can get an array of the src attributes using:
b.div(:id, 'pic_container').images.map(&:src)
#=> ['src1', 'src2', 'etc']
Or if you need to do more custom logic per image, you can iterate through each one using each or each_with_index (if you also want an index). For example:
b.div(:id, 'pic_container').images.each_with_index do |image, i|
puts image.src
puts i
end
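For the follow-up about telling two images with the same src apart, one option is to work with positions rather than src values. The sketch below is plain Ruby over a hard-coded array that stands in for images.map(&:src); the sample data mirrors the markup in the question, and duplicate_positions is a hypothetical helper name, not part of Watir. It maps each duplicated src to the indexes where it occurs:

```ruby
# Hypothetical helper (not a Watir API): maps each src that occurs more than
# once to the list of positions where it occurs.
def duplicate_positions(srcs)
  positions = Hash.new { |hash, key| hash[key] = [] }
  srcs.each_with_index { |src, index| positions[src] << index }
  positions.select { |_src, indexes| indexes.size > 1 }
end

# Sample data standing in for b.div(:id, 'pic_container').images.map(&:src)
srcs = [
  './images/test/3.png',
  './images/test/5.png',
  './images/test/5.png',
  './images/test/1.png',
  './images/test/7.png'
]

duplicates = duplicate_positions(srcs)
duplicates.each do |src, indexes|
  puts "#{src} appears at positions #{indexes.join(', ')}"
end
```

Because the collection can also be indexed, something like b.div(:id, 'pic_container').images[indexes.first].click should then target one specific copy. Treat that line as an untested assumption about your page rather than a verified recipe.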
When I'm doing things other than driving the browser, I like to just use Ruby.
require 'watir-webdriver'
browser = Watir::Browser.new :chrome
browser.goto 'http://example.com'
#collects all images on page
image_collection = browser.images
# creates array of the 'src' urls
image_array = []
image_collection.each do |image|
image_array.push(image.current_src)
end
# outputs urls if any duplicates are found in the array
puts image_array.find_all {|e| image_array.rindex(e) != image_array.index(e) }.uniq
browser.close
Related
I am trying to extract only the <p> elements that sit between Vigentes and Finalizados, without success.
require 'nokogiri'
require 'open-uri'
require 'time'
@url = "http://www.caru.org.uy/web/servicios/llamados-a-concurso-publico-para-contratar-personal/"
page = Nokogiri::HTML(open(@url))
div_content = page.css('.contenido')
div_content.each do |item|
puts item.text
break if item.css('h3').text == "Finalizados"
end
You should be able to do:
css = 'h3:contains(Vigentes) ~ p:has(~ h3:contains(Finalizados))'
But unfortunately, nokogiri doesn't behave properly for this one so we'll use xpath:
xpath = "//h3[contains(text(), 'Vigentes')]/following-sibling::p[./following-sibling::h3[contains(text(), 'Finalizados')]]"
page.search(xpath).each do |p|
# do something
end
I'm new to Ruby and am using Nokogiri to parse html webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function, url is a string with a webaddress. The line I previously mention works exactly as intended when used outside of the function, but not inside. When it gets to that line inside the function the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore(START_URL)
require 'nokogiri'
require 'open-uri'
#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
return
end
$CRAWLED_PAGES_COUNTER+=1
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore($START_URL)
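For context on why this version works: Ruby has no ++ operator. As far as I can tell, CRAWLED_PAGES_COUNTER++ followed by the next line is read as a binary + and then a unary plus applied to the result of that next line, which is why the error mentions the unary plus method (which Ruby names +@) on a Nokogiri document. A small sketch of both points, using a made-up Probe class purely for illustration:

```ruby
# Probe is a hypothetical class that exists only to show which method
# Ruby calls for a leading plus sign.
class Probe
  def +@
    :unary_plus_called
  end
end

probe = Probe.new
result = +probe          # a leading plus dispatches to the +@ method
puts result

# Incrementing in Ruby is spelled with += 1.
counter = 0
3.times { counter += 1 }
puts counter
```

The switch from constants to globals matters too: Ruby rejects assignment to a constant inside a method body as a dynamic constant assignment, so += 1 on the original constant would not have parsed there either.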
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
this_uri = URI.join(last_host, urls.shift)
next if visited_urls.include?(this_uri)
puts "Scanning: #{this_uri}"
doc = Nokogiri::HTML(this_uri.open)
visited_urls << this_uri
if visited_hosts.include?(this_uri.host)
puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
sleep SLEEP_TIME
end
visited_hosts << this_uri.host
urls += doc.search('[href]').map { |node|
node['href']
}.select { |url|
extension = File.extname(URI.parse(url).path)
extension[/\.html?$/] || extension.empty?
}
last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
Writing an industrial-strength spider is much more involved. Robots.txt files need to be honored; figuring out how to deal with pages that redirect to other pages, either via HTTP timeouts or JavaScript redirects, is a fun task; and dealing with malformed pages is a challenge...
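The naive visited-page check mentioned above can be tightened a little by normalizing URLs before storing them. This is a minimal sketch using only the standard library; normalize_url is a hypothetical helper, and dropping the fragment and sorting the query parameters are assumptions about what should count as "the same page":

```ruby
require 'uri'

# Hypothetical helper: returns a canonical string for a URL so that two
# spellings of the same page compare equal in a visited set.
def normalize_url(url)
  uri = URI.parse(url)
  uri.fragment = nil                                # ignore "#section" anchors
  if uri.query
    pairs = URI.decode_www_form(uri.query).sort     # order-insensitive query
    uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  end
  uri.to_s
end

first  = normalize_url('http://example.com/a?b=2&a=1')
second = normalize_url('http://example.com/a?a=1&b=2#top')
puts first
puts first == second
```

In the spider, the normalized string (rather than the URI object) would be what goes into visited_urls. Aliased hosts and redirects would still slip through.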
I need to retrieve all the images present under a specific div using Ruby and Mechanize. The relevant DOM structure is as follows:
<div id="item_img">
<a href="JavaScript:imageview('000000018693.jpg')">
<img src="/shop/doubleimages/0000000186932.jpg" border="0" width="500" height="500" alt="関係ないコメント z1808">
</a>
<img src="/shop/doubleimages/000000018693_1.jpg"><br><br>
<img src="/shop/doubleimages/000000018693_2.jpg"><br><br>
<img src="/shop/doubleimages/000000018693_3.jpg"><br><br>
</div>
So, I initially got all the images after spinning up a new agent by doing:
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://double14.com/shopdetail/000000018693/')
puts page.images
This was nice, but it returned every image on the page (as it should) and seems to strip out the div id above it, making it impossible to tell what comes from where. As a result, I had every image on the page (no bueno).
I got it down to this:
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://double14.com/shopdetail/000000018693/')
node = page.search "#item_img img"
node.each do |n|
puts n.attributes['src']
end
Unfortunately, that outputs the following -
/shop/doubleimages/0000000186932.jpg
/shop/doubleimages/000000018693_1.jpg
/shop/doubleimages/000000018693_2.jpg
/shop/doubleimages/000000018693_3.jpg
Is there a way to take the full URL and use that instead? Ultimately, I would like to save these images to a database, but I need the full URL to serialize them to disk for later upload.
Yes. You can get the full URL for the images with the #resolve method:
require 'mechanize'
mechanize = Mechanize.new
mechanize.user_agent_alias = 'Mac Safari'
page = mechanize.get('http://double14.com/shopdetail/000000018693/')
page.search('#item_img img').each do |img|
puts mechanize.resolve(img['src'])
end
Alternatively you can use the #download method to download them directly.
This is how I did it for a collection of images. In this case the base_uri is the url that you are passing to get. Let me know if you have any questions.
def self.qualify_images(base_uri, images)
images.map do |image|
next unless has_src?(image)
qualify_image(base_uri, image)
end.compact
end
def self.qualify_image(base_uri, image)
src = image.attributes["src"].value
if src =~ /^\/[\/]/
result = "#{scheme(base_uri)}#{src}"
elsif src =~ /^\//
result = "#{base_uri}#{src}"
else
result = src
end
http?(result) ? result : nil
end
def self.has_src?(image)
image.attributes["src"].value
rescue NoMethodError
false
end
def self.scheme(uri)
uri = URI.parse(uri)
"#{uri.scheme}:"
end
def self.http?(uri)
uri = URI.parse(uri)
uri.kind_of?(URI::HTTP)
rescue URI::InvalidURIError
false
end
This will ensure a fully qualified uri for each image.
It will look something like:
page.search("#item_img img").each do |img|
puts page.uri.merge(img[:src]).to_s
end
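For comparison, the same resolution can be sketched with only the standard library's URI class, without a Mechanize page object. The page URL and src values are copied from the question; everything else is illustrative:

```ruby
require 'uri'

# URL of the page the images were found on (taken from the question)
page_uri = URI.parse('http://double14.com/shopdetail/000000018693/')

# src attributes as they appear in the markup
srcs = [
  '/shop/doubleimages/0000000186932.jpg',
  '/shop/doubleimages/000000018693_1.jpg'
]

# URI#merge resolves each reference against the page URL
full_urls = srcs.map { |src| page_uri.merge(src).to_s }
puts full_urls
```

This avoids hand-written regular expressions for the leading-slash cases, since the URI class applies the standard resolution rules itself.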
What I am trying to do: parse the links from a website (http://nytm.org/made-in-nyc) that all have exactly the same text, "(hiring)", and then write the list of links to a file, 'jobs.html'. (If publishing these websites is a violation, I will quickly take down the direct URL; I thought it might be useful as a reference for what I am trying to do. This is my first time posting on Stack.)
DOM Structure:
<article>
<ol>
<li>#waywire</li>
<li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
<li>Adafruit Industries</li>
<li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
etc...
What I have tried:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
hire_links = doc.css("a").select{|link| link.text == "(hiring)"}
results = hire_links.each{|link| puts link['href']}
begin
file = File.open("./jobs.html", "w")
file.write("#{results}")
rescue IOError => e
ensure
file.close unless file == nil
end
puts hire_links
end
find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file, but they are in XML format? I'm not sure how to target just the value and create a link from it, or where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
hire_links = doc.css("a").select { |link| link.text == "(hiring)"}
results = hire_links.map { |link| link['href'] }
File.write('./jobs.html', results.join("\n"))
end
find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
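Since the target file is named jobs.html, you may prefer clickable links over bare URLs. A minimal sketch, assuming an unordered list is an acceptable layout; links_to_html is a hypothetical helper, and the sample URLs are taken from the output above:

```ruby
require 'erb'

# Hypothetical helper: wraps each href in an anchor inside an unordered list,
# escaping characters such as & so the markup stays valid.
def links_to_html(hrefs)
  items = hrefs.map do |href|
    escaped = ERB::Util.html_escape(href)
    %(<li><a href="#{escaped}">#{escaped}</a></li>)
  end
  "<ul>\n#{items.join("\n")}\n</ul>\n"
end

html = links_to_html(['http://www.20x200.com/jobs/', 'http://www.8coupons.com/home/jobs'])
puts html
```

In the modified script, File.write('./jobs.html', links_to_html(results)) would take the place of the join call.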
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/)
Then you will have an array of link objects from which you can get whatever info you want. You can also use the link.click method that Mechanize provides.
I would like to cycle through all the input elements on a web page and print the name attribute of each. I am having trouble creating the array of elements to cycle through. Here is my code, hitting the example page at bit.ly/watir-webdriver-demo:
require 'watir-webdriver'
b = Watir::Browser.new
b.goto("bit.ly/watir-webdriver-demo")
listOfInputs = b.form(:method => "post")
listOfInputs.input.each do |i|
puts i.Name
end
How can I print out the name of each input on the page?
Looks like I just needed to not use form.
I used the body instead, and this works!
require 'watir-webdriver'
browser = Watir::Browser.new
browser.goto("bit.ly/watir-webdriver-demo")
body = browser.body
body.inputs.each do |input|
puts input.name
end