Ruby Watir - get all divs' inner contents from search

I am attempting to scrape a website (respectfully). I tried Nokogiri, then Mechanize, and then, because the website I am scraping loads a form dynamically, I was forced to use a webdriver. I am currently using Ruby's Watir.
What I am trying to do is fill out the dynamic form with a select, click submit, go to the results part of the page (the form renders results on the same page), and collect all the divs with the information (traversing through sub-divs looking for hrefs).
def scrape
  browser = Watir::Browser.new
  browser.goto 'http://www.website-link.com'
  browser.select_list(:id => 'city').select('cityName')
  browser.link(:id => 'btnSearch').click

  # This part: results from the search are in this div with this ID.
  # However, iterating through this list does not work the way I expected.
  browser.div(:id => 'resultsDiv').divs.each do |div|
    p div
  end

  browser.close
end
Right now this returns:
#<Watir::Div: located: true; {:id=>"resultsDiv", :tag_name=>"div"} --> {:tag_name=>"div", :index=>0}>
#<Watir::Div: located: true; {:id=>"resultsDiv", :tag_name=>"div"} --> {:tag_name=>"div", :index=>1}>
#<Watir::Div: located: true; {:id=>"resultsDiv", :tag_name=>"div"} --> {:tag_name=>"div", :index=>2}>
Looking at the page source, there are 3 divs inside resultsDiv, which is probably what those indexes refer to. I guess what I was expecting (coming from Nokogiri/Mechanize) was an object to manipulate.
Does anyone have experience doing this who could point me in the right direction?

If you know the order that you want, you can do:
browser.driver.find_elements(:id => 'resultsDiv')[n].click
or
browser.div(:id => 'resultsDiv').divs[n].click
or
browser.div(:id, 'resultsDiv').div(:id,'id_n').click
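If the goal is to collect the hrefs rather than click, each Watir element also exposes the anchors inside it via #links, and each anchor exposes #text and #href. A minimal sketch of the traversal (the helper itself is hypothetical, not part of Watir):

```ruby
# Hypothetical helper: given the resultsDiv element, drill down to every
# anchor inside it and collect its text and href as plain Ruby data.
def collect_result_links(results_div)
  results_div.links.map { |a| { text: a.text, href: a.href } }
end

# Against a live browser, usage would look like:
#   collect_result_links(browser.div(:id => 'resultsDiv'))
```

Once the data is in plain hashes and strings, you can manipulate it like any Nokogiri/Mechanize result, and closing the browser no longer matters.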

Related

Watir WebDriver select list without standard html

Just started using Watir-WebDriver, came across a website, and am facing a problem.
1) Goto "https://www.smiles.com.br/home"
2) How do I get/set values in drop-down selections, like selecting the number of adults in "Adulto"?
The selection list is initially hidden, but when the text field is clicked it becomes visible in the browser.
But when I tried to click it and check its existence, it returned false:
1) b.text_field(:id => "inputOrigin").click
2) irb(main):049:0> b.a(:id => 'yui_patched_v3_11_0_1_1467245841395_1777').exist?
=> false
Since it's not actually a selection list, how do I set values?
Thanks in advance.
The ID is dynamic, so it changes every time. It looks like it might be based on the time, but this should be sufficiently unique:
b.link(id: /^yui_patched_v3_11_0_1_/)
After clicking on <input id="inputOrigin">, you need to perform another click to trigger the "dropdown" list. Otherwise, you'll get an element not visible error. You could do something like this:
require 'watir-webdriver'

b = Watir::Browser.new :chrome
b.goto('https://www.smiles.com.br/home')
b.text_field(:id => "inputOrigin").when_present.click
b.div(:id => "dk_container_qtdAdult").when_present.click # trigger the "dropdown" menu
b.div(:class => "adult").link(:text => "02").when_present.click
# b.div(:class => "adult").link(:data_dk_dropdown_value => "02").click # an alternative
Once selected, you can get the option using the .text method:
puts b.div(:id => "dk_container_qtdAdult").link(:class => "dk_toggle dk_label").text
Lastly, you could use regex locators, but you might have more success by trying to identify more discrete (and, ideally, unique) page elements.
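The regex locator works because Watir compares the attribute's value against the pattern, so any ID sharing the stable prefix matches. The same idea in plain Ruby (the sample IDs below are made up for illustration):

```ruby
# Made-up IDs in the style the page generates; only the prefix is stable.
ids = [
  'yui_patched_v3_11_0_1_1467245841395_1777',
  'yui_patched_v3_11_0_1_1467299999999_1801',
  'inputOrigin'
]

# Select every ID that starts with the stable prefix, as the locator would.
matching = ids.grep(/^yui_patched_v3_11_0_1_/)
```

If two elements share the prefix, the locator returns the first in document order, which is why anchoring on a unique parent element first is the safer pattern.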

Ruby selenium gem: open hyperlinks in multiple tabs

Goal: In irb, open a series of hyperlinks in new tabs and save a screenshot of each.
Code:
require "rubygems"
require "selenium-webdriver"

browser = Selenium::WebDriver.for :firefox
browser.get 'https://company.com'
browser.find_element(:name, "username").send_keys("myUsername")
browser.find_element(:name, "password").send_keys("myPassword")
browser.find_element(:name, "ibm-submit").click

body = browser.find_element(:tag_name => 'body')
body.send_keys(:control, 't')

parent = browser.find_element(:xpath, "//div[@id='someid']")
children = parent.find_elements(:xpath, ".//a")

children.each do |i|
  body.send_keys(:control, 't')
  i.click
  browser.save_screenshot("{i}")
end
Problem:
Selenium::WebDriver::Error::StaleElementReferenceError: Element not found in the cache - perhaps the page has changed since it was looked up
Question: What am I doing wrong?
Basically you can't share WebElements across pages, yet you're trying to access body across multiple tabs. Try not to think of them as self-contained objects, but as proxies to something on a real page.
The solution is to only ever perform actions on the 'current' page. In your case that means sending the Ctrl-T event on the tab that you've created. Having done that the first time, you're switched to the new tab. You then need to re-perform the lookup:
newTabsBody = browser.find_element(:tag_name => 'body')
and then:
newTabsBody.send_keys(:control, 't')
to create the next one. Continue for each child.
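Another way to apply the same principle: extract the hrefs as plain strings up front (strings, unlike WebElements, cannot go stale), then drive the browser from that list. A sketch, with the helper name and screenshot naming my own (note it also supplies the #{} interpolation the original save_screenshot call was missing):

```ruby
# Sketch: collect plain-string hrefs first, then navigate to each one.
# Strings survive page changes, so no StaleElementReferenceError.
def screenshot_links(browser, anchors)
  hrefs = anchors.map { |a| a.attribute('href') }
  hrefs.each_with_index do |url, i|
    browser.get(url)                                # drives the current window
    browser.save_screenshot("screenshot_#{i}.png")  # note the #{} interpolation
  end
end

# In the original code this would be:
#   screenshot_links(browser, children)
```

This trades the multiple-tabs approach for sequential navigation, which is simpler to reason about since WebDriver only ever acts on one window at a time anyway.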

Posting data on a website using Mechanize, Nokogiri, and Selenium

I need to post data on a website through a program.
To achieve this, I am using Mechanize, Nokogiri, and Selenium.
Here's my code:
def aeiexport
  # First, Mechanize submits the form to identify yourself on the website
  agent = Mechanize.new
  agent.get("https://www.glou.com")
  form_login_AEI = agent.page.forms.first
  form_login_AEI.util_vlogin = "42"
  form_login_AEI.util_vpassword = "666"

  # This is supposed to submit the form, I think
  page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)

  # To be able to scrape the page you end up on after submitting the form
  body = page_compet_list.body
  html_body = Nokogiri::HTML(body)

  # tds gives back an array of td elements
  tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")

  # Checking my array of td elements against some condition
  tds.each do |td|
    link = td.children.first # Select the first child
    if link.html = "2015 32 92 0076 012"
      # Only consider the html part of the link; if matched, follow the previous link
      previous_td = td.previous
      previous_url = previous_td.children.first.href

      # Following the link contained in previous_url
      page_selected_compet = agent.get(previous_url)

      # To be able to scrape the page I end up on
      body = page_selected_compet.body
      html_body = Nokogiri::HTML(body)
      joueur_access = html_body.search('#tabs0head2 a')
      # Clicking on the link
      joueur_access.click

      rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
      pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']

      # Following pure_link_rechercher_par_numéro_de_licence
      page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
      body_submit_licence = page_submit_licence.body
      html_body = Nokogiri::HTML(body_submit_licence)

      # Posting my data in the right field
      form.field_with(:name => 'lic_cno[0]') == "9511681"
    end
  end
end
1) So far, what do you think about this code? Do you think there is an error in there?
2) This part is the one I am really not sure about: I have posted my data in the right field, but now I need to submit it. The problem is that the button I need to click looks like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
It triggers a JavaScript function on click. I am trying to use Selenium to trigger the click event. Then I end up on another page, where I need to click a few more times. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me if Selenium will enable me to do what I need to do, and if so, how?
At a glance, your code could use less indentation and more white space/empty lines to separate the internal logic of aeiexport (which should be renamed aei_export, since Ruby uses snake case for method names; you can find more recommendations on how to style Ruby code here).
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however, since it does use a real web browser, it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (e.g. in Firebug) and understand what the JavaScript on the page does, then use Mechanize to follow the link manually.
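If you do go the Mechanize route, the usual trick with an onclick="dispatchAndSubmit(form, 'rechercher')" button is to replicate what the JavaScript does: set whatever hidden field it writes, then submit the form directly. A sketch under the assumption that the function stores the action name in a hidden input; the 'dispatch' field name below is a guess, so check the real form in the page source:

```ruby
# Hypothetical: replicate dispatchAndSubmit(form, action) without JavaScript.
# 'dispatch' is an assumed hidden-field name; inspect the real form to confirm.
def dispatch_and_submit(agent, form, action)
  form['dispatch'] = action
  agent.submit(form)
end

# With the question's page this might be:
#   dispatch_and_submit(agent, page_submit_licence.forms.first, 'rechercher')
```

The key point is that Mechanize never runs the JavaScript; you reproduce its effect on the form fields yourself and let Mechanize do the HTTP POST.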

How to pass through "multi factor" authentication with mechanize?

I am attempting to log into a site via Mechanize that is proving to be a bit of a challenge. I can get past the first two form pages, but after submitting an ID and password, I get a third page before being able to reach the main content that I'm attempting to scrape.
In my case I have the following Ruby source that gets me to the point where I'm hitting a roadblock:
agent = Mechanize.new
start_url = 'https://sub.domain.tld/action'
# Get first page
page = agent.get(start_url)
# Fill in the first ID field and submit form
login_form = page.forms.first
id_field = login_form.field_with(:name => "ctl00$PageContent$Login1$IdTextBox")
id_field.value = "XXXXXXXXXXX"
# Get the next password page & submit form:
page = agent.submit(login_form, login_form.buttons.first)
login_form = page.forms.first
password_field = login_form.field_with(:name => "ctl00$PageContent$Login1$PasswordTextBox")
password_field.value = "XXXXXXXXXXX"
# Get the next page and...
page = agent.submit(login_form, login_form.buttons.first)
# Try to go into the main content, just to hit a wall. :s
page = agent.click(page.link_with(:text => /Skip to main/))
The contents of the third page as per the mechanize agent output is:
https://gist.github.com/5ed57292c8f6532352fd
As you may note from this, it seems one should be able to simply use agent.click on the first link to get into the main content. Unfortunately, this simply loops back to the same page. Each time it loads I can see that a new page is being fetched, but it ends up having precisely the same contents. Something is preventing me from getting through this multi-factor login to the main content, but I can't put my finger on what that might be.
Here is the page.content from that third request: http://f.cl.ly/items/252y261c303R0m2P1R0j/source.html
Any thoughts on what may be stopping me from going forward to content here?

How to count the number of images on a certain page using Mechanize?

I'm using Mechanize in a Rails 4 application. I created a new agent to scrape a page:
clienturl = @bid.mozs.where(is_main: true).first.attributes['url']
agent = Mechanize.new
@page = agent.get('http://' + clienturl)
@url = @page.uri
I can do things like get the uri, title, and meta description. I'd now like to get the count of images on the page, and how many of those images are missing alt attributes. Is this possible with Mechanize?
Do something like this:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.iana.org/domains/reserved')
doc = page.parser
img_count = doc.search('img').size # => 2
img_w_alt_count = doc.search('img[@alt]').size # => 1
img_count - img_w_alt_count # => 1
Nokogiri is the parser inside Mechanize; parser returns an instance of the parsed DOM. From that, we can ask Nokogiri to search for all nodes matching a selector. I used a CSS selector, but you can use XPath too; CSS tends to be more readable and less verbose.
search returns a NodeSet, so size tells us how many nodes matched.
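You can try the same counting logic without a network connection or any gems, by running it on an inline fragment with REXML, which ships with Ruby (Nokogiri's search is analogous, just with CSS selectors available as well):

```ruby
require 'rexml/document'

# An inline fragment standing in for a fetched page.
html = <<~HTML
  <div>
    <img src="a.png" alt="logo"/>
    <img src="b.png"/>
    <img src="c.png" alt="photo"/>
  </div>
HTML

doc = REXML::Document.new(html)

# All img nodes, then the subset missing an alt attribute.
imgs    = doc.get_elements('//img')
missing = imgs.reject { |img| img.attributes['alt'] }

puts "images: #{imgs.size}, missing alt: #{missing.size}"
# => images: 3, missing alt: 1
```

The subtraction in the answer above and the reject here are equivalent; reject also leaves you with the actual offending nodes if you want to report their src values.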
