I am trying to parse a website using Selenium/Capybara. Right now it looks like this:
session = Capybara::Session.new(:selenium)
session.visit "https://somesite.com/page1"
element = session.all( :css, '.table .row a' ).each do |el|
  el.click
  # get some element's data
  session.evaluate_script('window.history.back()')
end
# repeat
Problem is, when I try to get the data from the second page, Capybara tells me "Either the element is no longer attached to the DOM or the page has been refreshed", which absolutely makes sense. However, I'm struggling to find anything in the documentation about how to get a fresh handle on the new DOM and parse it.
The same thing happens if I navigate back and try to repeat the actions by clicking the second link in the row. I assume I need to re-create the session, or is there a better way?
To work the way you're trying to, you're going to need to keep a counter and re-find the elements each time through your loop - something along the lines of:
counter = 0
while (el = session.all( :css, '.table .row a', minimum: 1 )[counter]) do
  el.click
  # get some element's data
  counter += 1
  session.go_back
end
or, if the links are just standard links, you could gather the hrefs and then visit them directly:
session.all( :css, '.table .row a', minimum: 1 ).map { |a| a['href'] }.each do |url|
  session.visit(url)
  # get some element's data
end
Related
I've got an array of certain <a> elements like this:
array = browser.as(:class => 'foo')
I try to go to the links like this:
$i = 0
while $i < num do
  browser.goto array[$i].href
  .
  .
  .
  $i += 1
end
This works for the first loop iteration, but not for the second. Why is this happening? If I do
puts array[1].href
puts array[2].href
puts array[-1].href
before
browser.goto array[$i].href
it shows all the links in the terminal window on the first loop.
Someone who knows this better than I do may need to verify/clarify, but my understanding is that elements are given references when the page is loaded. Loading a new page builds all-new references, even if the link looks the same on each page.
You are storing a reference to an element and then asking it to pull the href attribute. That works the first time, while the element still exists. Once the new page has loaded, the element no longer exists, and the request for the attribute fails.
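To make that concrete, here is a rough illustration of the failure mode using the selectors from the question (not code from the original post):
array = browser.as(:class => 'foo')
browser.goto array[0].href # works: the reference is still valid on the original page
browser.goto array[1].href # fails: after navigating away, the stored reference is stale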
If all you care about are the hrefs there's a shorter, more Ruby way of doing it.
array = []
browser.as(:class => 'foo').each { |link|
  array << link.href
}
array.each { |link|
  browser.goto link
}
By cramming the links into the array we don't have to worry about stale references.
Yes - unlike an individual element, which stores its locator and gets re-looked-up after a page reload, an element collection stores only the located Selenium elements, which can't be re-looked-up.
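As a rough sketch of that difference (assuming a page with links of class 'foo'):
link = browser.link(class: 'foo', index: 1) # lazy: only the locator is stored
links = browser.as(class: 'foo').to_a       # eager: located Selenium elements are stored
browser.refresh
link.href     # still works - Watir re-runs the lookup
links[1].href # raises - the stored element reference cannot be re-looked up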
Another solution, for when you need this pattern and more than just an href value, is to avoid creating a collection and instead use individual elements with an index:
num.times do |i|
  link = browser.link(class: 'foo', index: i)
  browser.goto link.href
end
I'm using Ruby and Selenium to get some data from a page. I want to define a variable with driver.find_element, but the element is not currently visible on the page.
next_element = driver.find_element(:class, 'right')
It returns Selenium::WebDriver::Error::NoSuchElementError.
It works fine when the element is present.
Any solutions?
Thank you!
Selenium works by sending commands to the browser. When you call find_element, it searches the DOM for the element; if it cannot find it, you get the error you are seeing. After all, if an element is not in the DOM it cannot be found.
The real question is: why do you want to find an element that is not currently present in the DOM? You can't do anything with something that doesn't exist.
All I can think of is that the element only becomes present after the DOM has loaded, because JavaScript has not finished executing yet. If that is the case, you can use a Selenium::WebDriver::Wait to keep trying to find the element for a certain amount of time.
A small example:
wait = Selenium::WebDriver::Wait.new(:timeout => 10) # seconds
begin
  element = wait.until { driver.find_element(:id => "some-dynamic-element") }
ensure
  driver.quit
end
Edit to include a begin/rescue example:
begin
  next_element = driver.find_element(:class, 'right')
  # Code for when the element is found here
rescue Selenium::WebDriver::Error::NoSuchElementError
  # Code for when the element is not found here
end
I have a Ruby application using Selenium Webdriver and Nokogiri. I want to choose a class, and then for each div corresponding to that class, I want to perform an action based on the contents of the div.
For example, I'm parsing the following page:
https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies
It's a page of search results, and I'm looking for the first result with the word "Adoption" in the description. So the bot should look for divs with the class name "result"; for each one, check whether its .description div contains the word "adoption", and if it does, click its .link div. In other words, if the .description does not include that word, the bot moves on to the next .result.
This is what I have so far, which just clicks on the first result:
require "selenium-webdriver"
require "nokogiri"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies"
driver.find_element(:class, "link").click
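For reference, a minimal sketch of the logic described above, assuming the page really exposes .result, .description and .link containers as named in the question:
driver.find_elements(:class, "result").each do |result|
  description = result.find_element(:class, "description")
  if description.text.downcase.include?("adoption")
    result.find_element(:class, "link").click
    break # stop at the first matching result
  end
end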
You can get the list of elements that contain "adopt" or "Adopt" with an XPath using contains(), and then use the union operator (|) to combine the results for both spellings. See the code below:
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies"
sleep 5
items = driver.find_elements(:xpath, "//div[@class='g']/div[contains(.,'Adopt')]/h3/a|//div[@class='g']/div[contains(.,'adopt')]/h3/a")
for element in items
  linkText = element.text
  print linkText
  element.click
end
The pattern for handling each iteration is determined by the type of action executed on each item. If the action is a click, then you can't list all the links and click each of them, since the first click will load a new page and make the elements in the list obsolete.
So if you wish to click on each link, one way is to use an XPath containing the position of the link for each iteration:
# iteration 1
driver.find_element(:xpath, "(//h3[@class='r']/a)[1]").click # click first link
# iteration 2
driver.find_element(:xpath, "(//h3[@class='r']/a)[2]").click # click second link
Here is an example that clicks on each link from a result page:
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :chrome
wait = Selenium::WebDriver::Wait.new(timeout: 10) # seconds
driver.navigate.to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies"
# define the xpath
search_word = "Puppies"
xpath = ("(//h3[@class='r']/a[contains(.,'%s')]" % search_word) + ")[%s]"
# iterate each result by inserting the position in the XPath
i = 0
while true do
  # wait for the results to be loaded
  wait.until { driver.find_elements(:xpath, "(//h3[@class='r']/a)[1]").any? }
  # get the next link
  link = driver.find_elements(:xpath, xpath % [i += 1]).first
  break if !link
  # click the link
  link.click
  # wait for a new page
  wait.until { driver.find_elements(:xpath, "(//h3[@class='r']/a)[1]").empty? }
  # handle the new page
  puts "Page #{i}: " + driver.title
  # return to the main page
  driver.navigate.back
end
puts "The end!"
I don't code in Ruby, but one way you could do it in Python is with:
driver.find_elements
Notice how "elements" is plural. I would grab all the links and put them into an array, like:
hrefs = [a.get_attribute("href") for a in driver.find_elements_by_xpath("//div[@class='rc']/h3/a")]
Then get all of the descriptions the same way. Loop over every description; if a description contains the word "Adoption", navigate to the corresponding website.
For example:
if description[6] has the word "adoption", look up the string hrefs[6] and navigate to hrefs[6].
I hope that makes sense!
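In Ruby with selenium-webdriver, that idea might look roughly like this (the 'rc' and 'st' class names are assumptions about the result markup, as in the sketch above):
links = driver.find_elements(:xpath, "//div[@class='rc']/h3/a").map { |a| a['href'] }
descriptions = driver.find_elements(:xpath, "//div[@class='rc']//span[@class='st']").map(&:text)
# links and descriptions are plain strings, so navigating away does not invalidate them
descriptions.each_with_index do |text, i|
  driver.navigate.to(links[i]) if text.downcase.include?("adoption")
end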
I have a SlickGrid table that I am trying to read into memory using watir-webdriver. Because the full data often cannot be seen without scrolling down, I want to write a function that can scroll through the table, tally a count of all the rows, and access any row regardless of whether it is currently hidden. Here's what I have so far:
class SlickGridTable
  def initialize(element)
    @element = element
  end

  ...

  def scroll_down
    location_y = 23
    while true
      location_y += 1
      $browser.execute_script("arguments[0].scrollBy(0, #{location_y});", @element)
    end
  end
end
However I am regularly getting this error:
Selenium::WebDriver::Error::UnknownError: unknown error: undefined is not a function
I am also working with SlickGrid and considered a similar approach. Instead, I extended the Watir::Div class with a scroll_until_present method, so we can scroll until the element is present and then work with the data in the grid. I have not needed to collect all the data since implementing this. It does not solve your problem of tallying rows, but it does help find the records you are expecting to see.
# Extends the Watir::Div class to support slick grids
module Watir
  class Div
    # scrolls until it finds the item you are looking for
    # can be used like wait_until_present
    def scroll_until_present
      scroll_height = browser.execute_script('return document.getElementsByClassName("slick-viewport")[0].scrollHeight')
      (0..scroll_height).step(20).each { |item|
        browser.execute_script('document.getElementsByClassName("slick-viewport")[0].scrollTop = ' + item.to_s)
        if present?
          # scroll a little more once the record is found
          item += 30
          browser.execute_script('document.getElementsByClassName("slick-viewport")[0].scrollTop = ' + item.to_s)
          break
        end
      }
    end
  end
end
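With the extension in place, usage might look something like this (the slick-row class and the row text are assumptions for the example):
row = browser.div(class: 'slick-row', text: /Smith, Jane/)
row.scroll_until_present
puts row.text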
I am working on a JavaScript-capable screen scraper using capybara/dsl, selenium-webdriver, and the spreadsheet gem. I'm very close to the desired output, but two major problems arise:
I have not been able to figure out the exact XPath selector that picks out only the elements I'm looking for. To ensure none are missing, I am using a broad selector that I know will produce duplicate elements, and I was planning to just call .uniq on the result, but that throws an error: undefined method `uniq'. Maybe I'm not using it properly: results = all("//a[contains(@onclick, 'analyticsLog')]").uniq. What is the proper way to get the desired filtering? I know the XPath I chose for extracting hrefs, //a[contains(@onclick, 'analyticsLog')], matches more nodes than I intended, because inspecting the page elements shows 144 of them rather than the 72 results that make up the page. I have looked for a more specific selector, but I haven't found one that doesn't also filter out some desired links, due to the business logic used on the site.
My save_item method has two selectors that are not always found within the info results. I would like the script to skip the ones that aren't found and save only the ones that are, but my current iteration throws a Capybara::ElementNotFound and exits. How could I configure this to work in the intended way?
Code below:
require "capybara/dsl"
require "spreadsheet"
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
include Capybara::DSL
def initialize
#excel = Spreadsheet::Workbook.new
#work_list = #excel.create_worksheet
#row = 0
end
def go
visit_main_link
end
def visit_main_link
visit "http://www.some.com/clothing-accessories?dir=asc&limit=72&order=position"
results = all("//a[contains(#onclick, 'analyticsLog')]")# I would like to use .uniq here to filter out the duplicates that I know will be delivered by this selector
item = []
results.each do |a|
item << a[:href]
end
item.each do |link|
visit link
save_item
end
#excel.write "inventory.csv"
end
def save_item
data = all("//*[#id='content-wrapper']/div[2]/div/div")
data.each do |info|
#work_list[#row, 0] = info.find("//*[#id='productright']/div/div[1]/h1").text
#work_list[#row, 1] = info.find("//div[contains(#class, 'price font left')]").text
#work_list[#row, 2] = info.find("//*[#id='productright']/div/div[11]").text
#work_list[#row, 3] = info.find("//*[#id='tabcontent1']/div/div").text.strip
#work_list[#row, 4] = info.find("//select[contains(#name, 'options[747]')]//*[#price='0']").text #I'm aware that this will not always be found depending on the item in question but how do I ensure that it doesn't crash the program
#work_list[#row, 5] = info.find("//select[contains(#name, 'options[748]')]//*[#price='0']").text #I'm aware that this will not always be found depending on the item in question but how do I ensure that it doesn't crash the program
#row = #row + 1
end
end
end
tomtop = Tomtop.new
tomtop.go
For Question 1: Get unique elements
All of the elements returned by all are unique. Therefore, I assume by "unique" elements, you mean that the "onclick" attribute is unique.
The collection of elements returned by Capybara is an enumerable. Therefore, you can convert it to an array and then take the unique elements based on their onclick attribute:
results = all("//a[contains(#onclick, 'analyticsLog')]")
.to_a.uniq{ |e| e[:onclick] }
Note that the duplicate links appear to come from there being one link for the image and one for the text below it. You could scope your search to just one or the other and then you would not need the uniq check. To scope to just the text link, use the fact that the link is a child of an h5:
results = all("//h5/a[contains(#onclick, 'analyticsLog')]")
For Question 2: Capture text if element present
To solve your second problem, you could use first to locate the element. This will return the matching element if one exists and nil if one does not. You could then save the text if the element is found.
For example:
el = info.first(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = el.text if el
If you want the text of all matching elements, then use all:
options = info.all(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = options.collect(&:text).join(', ')
When there are multiple matching options, you will get something like "Green, Pink". If there are no matching options, you will get "".
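Putting the pieces together, a hedged sketch of save_item with the two optional selects handled safely (the other columns are unchanged from the question):
def save_item
  data = all("//*[@id='content-wrapper']/div[2]/div/div")
  data.each do |info|
    @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
    # optional fields: only saved when the element is actually present
    el747 = info.first(".//select[contains(@name, 'options[747]')]//*[@price='0']")
    @work_list[@row, 4] = el747.text if el747
    el748 = info.first(".//select[contains(@name, 'options[748]')]//*[@price='0']")
    @work_list[@row, 5] = el748.text if el748
    @row += 1
  end
end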